Tasks

Task 1: Block Builder

磁盘上结构最基础的单位就是 block。一个 block 的大小通常是 4KB（可能会因存储介质不同发生变化），与操作系统的页面大小一致。一个 block 内存储了有序键值对，一个 SST（Sorted String Table）由多个 blocks 组成。当 memtable 的数量到达系统限制时，memtable 会刷新为 SST。

一个 block 的结构如下：数据区存储若干个 entry，偏移区存储这些 entry 的偏移量，最后是一个额外的元素数目。

Text

----------------------------------------------------------------------------------------------------
|             Data Section             |              Offset Section             |      Extra      |
----------------------------------------------------------------------------------------------------
| Entry #1 | Entry #2 | ... | Entry #N | Offset #1 | Offset #2 | ... | Offset #N | num_of_elements |
----------------------------------------------------------------------------------------------------

而每个 entry 的结构如下：key_len 和 value_len 是固定的 2B，也就意味着 key 和 value 的长度最大是 \(2^{16} - 1 = 65535\)。内部的存储可以用 u16 数据类型进行表示。

Text

-----------------------------------------------------------------------
|                           Entry #1                            | ... |
-----------------------------------------------------------------------
| key_len (2B) | key (keylen) | value_len (2B) | value (varlen) | ... |
-----------------------------------------------------------------------

了解了 block 的结构，我们就可以开始实现 block。

首先是 block 的构造 BlockBuilder。它含有四个成员变量：offset 是每个 entry 的偏移量， data 是存储 entry 的数据，block_size 是 block 的最大大小，first_key 指向第一个键。

/// Creates a new block builder.
pub fn new(block_size: usize) -> Self {
    Self {
        offsets: Vec::new(),
        data: Vec::new(),
        block_size,
        first_key: KeyVec::new(),
    }
}

然后是将一个键值对插入到 block 中的 add 方法。要求给定的键不为空。 Block 的预估大小为 entry data 和 offset 的长度，以及额外表示元素数量的开销。新的 entry 的大小为 \(2B + keylen + 2B + valuelen\)，其中 \(keylen\) 和 \(valuelen\) 分别是 key 和 value 的长度。如果大小超过了限制并且当前 block不为空，则返回 false，表示插入失败。否则，将偏移量、键长度、键、值长度、值依次写入 data，并更新 offsets。如果 first_key 是空，也就意味着这是第一个键，将其记录下来。插入完成返回 true。

fn estimate_size(&self) -> usize {
    super::SIZEOF_U16 + self.data.len() + self.offsets.len()
}

/// Adds a key-value pair to the block. Returns false when the block is full.
/// You may find the `bytes::BufMut` trait useful for manipulating binary data.
#[must_use]
pub fn add(&mut self, key: KeySlice, value: &[u8]) -> bool {
    assert!(!key.is_empty(), "key must not be empty");

    if self.estimate_size() + key.len() + value.len() + super::SIZEOF_U16 * 2 > self.block_size
        && !self.is_empty()
    {
        return false;
    }

    self.offsets.push(self.data.len() as u16);
    self.data.put_u16(key.len() as u16);
    self.data.put(key.raw_ref());
    self.data.put_u16(value.len() as u16);
    self.data.put(value);

    if self.first_key.is_empty() {
        self.first_key = key.to_key_vec();
    }

    true
}

判断是否为空可以通过检查 offsets 是否为空。

/// Check if there is no key-value pair in the block.
pub fn is_empty(&self) -> bool {
    self.offsets.is_empty()
}

最后是构建返回一个 block：

/// Finalize the block.
pub fn build(self) -> Block {
    if self.is_empty() {
        panic!("block is empty")
    }
    Block {
        data: self.data,
        offsets: self.offsets,
    }
}

Block 的编码，就是将内部的数据转换成上面展示的字节串。也就是依次将 data，offset 和 num_of_elements 以 u16 写入 buf 中，并转换成 Bytes 返回。

pub fn encode(&self) -> Bytes {
    let mut buf = self.data.clone();
    let offest_len = self.offsets.len();

    for offset in &self.offsets {
        buf.put_u16(*offset);
    }
    buf.put_u16(offest_len as u16);
    buf.into()
}

而解码就是相反的过程，将给定的数据转换成 Block。首先读取原始数据最后 2B 的数据，得到元素的个数。原始数据的长度减去 num_of_element 的长度，再减去元素个数乘单个偏移量的长度，就是 entry 数据的末端。从 entry 数据的末端开始，到 num_of_element 的开始，就是偏移量。将偏移量部分数据按照 u16 读取，得到偏移量数组。最后将偏移量之前的部分作为 entry 数据，转换为向量。

pub fn decode(data: &[u8]) -> Self {
    let entry_offset_len = (&data[data.len() - SIZEOF_U16..]).get_u16() as usize;
    let data_end = data.len() - SIZEOF_U16 - entry_offset_len * SIZEOF_U16;
    let offsets_raw = &data[data_end..data.len() - SIZEOF_U16];
    let offsets = offsets_raw
        .chunks(SIZEOF_U16)
        .map(|mut x| x.get_u16())
        .collect();
    let data = data[..data_end].to_vec();
    Self { data, offsets }
}

Task 2: Block Iterator

在实现了 block 之后，顺理成章地也要实现一个 iterator。

首先是比较简单的，获取当前的 key，value，以及判断是否有效。是否有效可以通过 key 是否为空来判断。

/// Returns the key of the current entry.
pub fn key(&self) -> KeySlice {
    self.key.as_key_slice()
}

/// Returns the value of the current entry.
pub fn value(&self) -> &[u8] {
    &self.block.data[self.value_range.0..self.value_range.1]
}

/// Returns true if the iterator is valid.
pub fn is_valid(&self) -> bool {
    !self.key.is_empty()
}

有两个函数 seek_to_first 和 next。seek_to_first 是找到 block 中的第一个 key，而 next 是指向下一个，也就是当前 idx 增加 1 的位置。两个可以共用一个 seek_to 函数，前往指定 idx 的位置。

/// Seeks to the first key in the block.
pub fn seek_to_first(&mut self) {
    self.seek_to(0)
}

/// Move to the next key in the block.
pub fn next(&mut self) {
    self.seek_to(self.idx + 1);
}

seek_to 函数给定需要前往的 idx。首先需要判断 idx 是否超过了偏移量的长度。如果超过，将 iterator 的 key 清空，value 范围设置为 \((0, 0)\)。否则，从偏移量数组中取出当前 idx 的偏移量，并通过偏移量取得对应的 entry。 entry 的前两个字节是 key 的长度，接着是 key，再接着是 value 的长度，最后是 value。 value 的范围就是 \((offset + 4 + keylen, offset + 4 + keylen + valuelen)\)。

fn seek_to(&mut self, idx: usize) {
    if idx >= self.block.offsets.len() {
        self.key.clear();
        self.value_range = (0, 0);
        return;
    }

    let offset = self.block.offsets[idx] as usize;

    let mut entry = &self.block.data[offset..];
    let key_len = entry.get_u16() as usize;
    let key = &entry[..key_len];
    self.key.clear();
    self.key.append(self.first_key.raw_ref());
    self.key.append(key);
    entry.advance(key_len);

    let value_len = entry.get_u16() as usize;
    let value_begin = offset + SIZEOF_U16 + SIZEOF_U16 + key_len;
    let value_end = value_begin + value_len;
    self.value_range = (value_begin, value_end);

    entry.advance(value_len);

    self.idx = idx;
}

而 seek_to_key 函数需要查找第一个不小于给定键的 key。最简单的做法就是顺序查找。这里使用了二分查找降低时间复杂度。

pub fn seek_to_key(&mut self, key: KeySlice) {
    let mut low = 0;
    let mut high = self.block.offsets.len();
    while low < high {
        let mid = low + (high - low) / 2;
        self.seek_to(mid);
        assert!(self.is_valid());

        match self.key().cmp(&key) {
            Ordering::Less => low = mid + 1,
            Ordering::Greater => high = mid,
            Ordering::Equal => return,
        }
    }
    self.seek_to(low);
}

至此，就可以实现从第一个 key 开始的 iterator，以及从不小于给定键的第一个 key 开始的 iterator。

/// Creates a block iterator and seek to the first entry.
pub fn create_and_seek_to_first(block: Arc<Block>) -> Self {
    let mut iter = Self::new(block);
    iter.seek_to_first();
    iter
}

/// Creates a block iterator and seek to the first key that >= `key`.
pub fn create_and_seek_to_key(block: Arc<Block>, key: KeySlice) -> Self {
    let mut iter = Self::new(block);
    iter.seek_to_key(key);
    iter
}

Result

Test Your Understanding

What is the time complexity of seeking a key in the block?
如果是顺序查找，时间复杂度就是 \(O(n)\)；我们实现的是二分查找，因此时间复杂度降低为 \(O(\log n)\)。
Where does the cursor stop when you seek a non-existent key in your implementation?
如果给定的 key 不存在，那么 cursor 会指向第一个大于给定 key 的 key。
So Block is simply a vector of raw data and a vector of offsets. Can we change them to Byte and Arc<[u16]>, and change all the iterator interfaces to return Byte instead of &[u8]? (Assume that we use Byte::slice to return a slice of the block without copying.) What are the pros/cons?
可以。
优点：将生命周期解耦；Arc 支持并发读取，线程安全；Arc<[u16]> 内存消耗比 Vec<u16> 低，Bytes 切片共享底层数据，避免大内存的拷贝。
缺点：只要存在一个 Bytes 切片，block 就无法释放，可能长期占用大内存；原子操作增减引用计数导致性能降低。
What is the endian of the numbers written into the blocks in your implementation?
大端序。
Is your implementation prune to a maliciously-built block? Will there be invalid memory access, or OOMs, if a user deliberately construct an invalid block?
不会，在访问的时候先行判断 key 是否为空，是否是 valid 的。
Can a block contain duplicated keys?
不可以，因为 key 是有序的，如果允许重复的 key，那么在查找时，无法确定应该返回哪一个 key。
What happens if the user adds a key larger than the target block size?
如果当前 block 不为空，会返回错误。
Consider the case that the LSM engine is built on object store services (S3). How would you optimize/change the block format and parameters to make it suitable for such services?
可以从以下方面考虑：
增加每个 block 的大小，减少请求次数；每个块可以增加最大/最小值信息，快速过滤不相关的块；将多个请求合并等。

Prince's Blog

Day3: Block

Tasks

Task 1: Block Builder

Task 2: Block Iterator

Result

Test Your Understanding