Week1 Day1 的任务： 1. 实现基于跳表的 memtable

实现冻结 memtable 的逻辑
实现 LSM 读 memtable 的 get 操作

Tasks

Task1: SkipList Memtable

虽然说是基于跳表的 Memtable，但是并不需要我们从头去实现跳表，直接使用的 crossbeam_skiplist。

这个任务的主要目标是实现简单的 put 和 get 操作。Memtable 没有 delete 方法，而是用 put 空值来代替 delete。put 的时候，如果 key 已经存在，则覆盖。

首先先补充 create 方法，给定 memtable 的 id，其他的字段都初始化为默认值。

/// 创建一个 memtable
pub fn create(_id: usize) -> Self {
    Self {
        map: Arc::new(SkipMap::new()),
        wal: None,
        id: _id,
        approximate_size: Arc::new(AtomicUsize::new(0)),
    }
}

然后是 get 方法，给定 key，返回对应的 value。需要注意的是不能直接将 self.map.get(_key) 的结果直接返回，因为 SkipMap::get 返回的是 Entry<'_, K, V>，而我们需要返回的是 Option<Bytes>。value() 方法返回值的引用，通过 map() 方法可以将其转换为 Option<Bytes>。

/// Get a value by key.
pub fn get(&self, _key: &[u8]) -> Option<Bytes> {
    self.map.get(_key).map(|e| e.value().clone())
}

简单的 put 方法，利用跳表的 insert 方法实现：

/// Put a key-value pair into the mem-table.
pub fn put(&self, _key: &[u8], _value: &[u8]) -> Result<()> {
    self.map
        .insert(Bytes::copy_from_slice(_key), Bytes::copy_from_slice(_value));

    Ok(())
}

Task 2: A Single Memtable in the Engine

这个任务需要将 memtable 加入到 LSM 状态 LsmStorageState 里面。初始化只有一个 memtable，并且 memtable 的 id 从 0 开始。在任何时候，只有一个可变的 memtable，其他的都是不可变（后文的「frozen」）。每个 memtable 都有一个大小的限制。达到这个限制后，可变的 memtable 会被冻结成不可变，并创建一个新的可变 memtable。

在任务 2 中，只需要对当前可变的 memtable 增加 put 和 get 和 delete 方法。

首先是 get 方法的实现。通过 memtable 的 get 方法获取值，如果存在则返回，否则返回 None。

/// Get a key from the storage. In day 7, this can be further optimized by using a bloom filter.
pub fn get(&self, _key: &[u8]) -> Result<Option<Bytes>> {
    let state = Arc::clone(&self.state.read());

    if let Some(val) = state.memtable.get(_key) {
        if val.is_empty() {
            return Ok(None);
        } else {
            return Ok(Some(val));
        }
    }

    Ok(None)
}

然后是 put 方法。同样是通过 memtable 的 put 方法实现插入。

/// Put a key-value pair into the storage by writing into the current memtable.
pub fn put(&self, _key: &[u8], _value: &[u8]) -> Result<()> {
    assert!(!_key.is_empty(), "key cannot be empty");
    assert!(!_value.is_empty(), "value cannot be empty");

    let state = Arc::clone(&self.state.read());
    state.memtable.put(_key, _value)?;

    Ok(())
}

最后是 delete 方法。通过 put 一个空值来实现删除。

/// Remove a key from the storage by writing an empty value.
pub fn delete(&self, _key: &[u8]) -> Result<()> {
    assert!(!_key.is_empty(), "key cannot be empty");

    let state = Arc::clone(&self.state.read());
    state.memtable.put(_key, b"")?;

    Ok(())
}

Task 3: Write Path - Freezing a Memtable

上文提到过，当 memtable 的大小达到限制时，需要将其冻结成不可变，并创建一个新的可变 memtable。这需要计算当前 memtable 的估计大小。一个简单的思路是，每次 put 的时候，计算 key 和 value 的大小并增加。这只是近似的估计，因为多次插入一个 key，只会保存最新的值，但是 key 的大小会被计算多次。（可以理解为，这样计算出来的大小，始终不会小于 memtable 真实的大小？）

Arc<LsmStorageStat> 保存了真实 LsmStorageStat 的快照，克隆这个快照开销相对较少（只需要原子引用计数增加）。LsmStorageState 的修改遵循 Copy-on-Write 的策略： - 读者：获取读锁，克隆快照，释放读锁 - 写者 - 创建新的 LsmStorageStat 实例。可以通过从现有快照复制数据并修改来实现 - 将实例包装为 Arc<...> 类型 - 获取写锁，替换原有快照，释放写锁

代码实现如下。下一个 memtable 的 id 直接从 next_sst_id 获取，然后根据 id 创建新的 memtable。

/// Force freeze the current memtable to an immutable memtable
pub fn force_freeze_memtable(&self, _state_lock_observer: &MutexGuard<'_, ()>) -> Result<()> {
    let next_memtable_id = self.next_sst_id();
    let next_memtable = Arc::new(MemTable::create(next_memtable_id));

    self.freeze_memtable(next_memtable)?;

    Ok(())
}

freeze_memtable 方法。

第 4~5 行代码与第 6 行代码是等价的，将新的 memtable 替换掉旧的 memtable 并返回旧的 memtable。第 8 行将旧的 memtable 加入到 imm_memtables 中。注意到这里使用的是 insert 方法插入到头部，而不是使用 push 方法插入到尾部。这样可以保证从头遍历 imm_memtables 的时候，总是先遍历最新的 memtable。

fn freeze_memtable(&self, memtable: Arc<MemTable>) -> Result<()> {
    let mut state = self.state.write();
    let mut snapshot = state.as_ref().clone();
    // let old_memtable = &mut snapshot.memtable.clone();
    // snapshot.memtable = memtable;
    let old_memtable = std::mem::replace(&mut snapshot.memtable, memtable);

    snapshot.imm_memtables.insert(0, old_memtable.clone());
    *state = Arc::new(snapshot);

    drop(state);

    Ok(())
}

Task 4: Read Path - Get

加入 imm_memtables 后，之前的方法就需要修改。

首先是 mem_table 的 put 方法。在 put 的时候要预估大小，并修改 approximate_size 的值：

pub fn put(&self, _key: &[u8], _value: &[u8]) -> Result<()> {
    let size = _key.len() + _value.len();

    self.map
        .insert(Bytes::copy_from_slice(_key), Bytes::copy_from_slice(_value));
    self.approximate_size
        .fetch_add(size, std::sync::atomic::Ordering::Relaxed);

    Ok(())
}

然后在 LsmStorageInner 的 put 和 delete 方法中，获取预估的大小：

/// Put a key-value pair into the storage by writing into the current memtable.
pub fn put(&self, _key: &[u8], _value: &[u8]) -> Result<()> {
    ...

    let size = state.memtable.approximate_size();
    self.try_freeze(size)?;

    Ok(())
}

/// Remove a key from the storage by writing an empty value.
pub fn delete(&self, _key: &[u8]) -> Result<()> {
    ...

    let size = state.memtable.approximate_size();
    self.try_freeze(size)?;

    Ok(())
}

try_freeze 方法判断预估大小是否大于限制大小。如果是，获取锁再次判断，然后调用 force_freeze_memtable。二次判断的原因是，如果两个线程同时插入数据，两个线程都计算预估大小大于限制大小，可能会导致创建一个新的 mem_table 后被立即冻结。所以第一次判断后加锁，这个时候再次判断是否需要冻结。

fn try_freeze(&self, esitimate_size: usize) -> Result<()> {
    if esitimate_size > self.options.target_sst_size {
        let state_lock = self.state_lock.lock();
        let state = self.state.read();
        if state.memtable.approximate_size() > self.options.target_sst_size {
            drop(state);
            self.force_freeze_memtable(&state_lock)?;
        }
    }

    Ok(())
}

Result

Test Your Understanding

Why doesn’t the memtable provide a delete API?
mem_table 的删除通过插入一个空值来实现。如果要真实地将键值对从 mem_table 中删除，需要遍历整个 mem_table，并删除对应的键值对，修改 mem_table 的大小。而且任何时候只有一个可变的 mem_table，其余都是不可变的。实现删除的 API 会导致性能降低。
Does it make sense for the memtable to store all write operations instead of only the latest version of a key? For example, the user puts a->1, a->2, and a->3 into the same memtable.
如果每次修改都只修改最后一个版本的值，而不是将新值插入进去，那么每次修改需要去遍历 mem_table，找到对应的 key，然后修改对应的 value。同时也和删除一样只有一个可变的 mem_table。这样会导致性能降低。相当于是用空间换时间。
Is it possible to use other data structures as the memtable in LSM? What are the pros/cons of using the skiplist?
可以使用其他数据结构作为 mem_table，比如 B+ 树、哈希表等。使用 skiplist 作为 mem_table 的优点是，插入、删除、查找的时间复杂度都是 O(log n)，而且 skiplist 的内存占用相对较小。
Why do we need a combination of state and state_lock? Can we only use state.read() and state.write()?
state.read() 和 state.write() 都是基于 RwLock 的读写锁，而 state_lock 是一个单独的互斥锁。使用 state_lock 可以避免多个线程同时调用 force_freeze_memtable，从而避免多次冻结同一个 mem_table。如果只使用 state.read() 和 state.write()，可能会导致多个线程同时冻结同一个 mem_table，从而浪费资源。
Why does the order to store and to probe the memtables matter? If a key appears in multiple memtables, which version should you return to the user?
如果 mem_table 的顺序不对，那么在查找 key 的时候，可能会找到旧的 mem_table，从而导致返回旧版本的值。为了保证查找的准确性，需要保证存储和查询 mem_table 的顺序。如果 key 出现在多个 mem_table 中，应该返回最新的版本。
Is the memory layout of the memtable efficient / does it have good data locality? (Think of how Byte is implemented and stored in the skiplist…) What are the possible optimizations to make the memtable more efficient?
在内存布局上，skiplist 的实现方式是每个节点都包含一个指针数组，指向下一层的节点。这样会导致内存布局不连续，从而影响数据局部性。为了优化内存布局，可以使用连续的内存块来存储 skiplist 的节点。
So we are using parking_lot locks in this course. Is its read-write lock a fair lock? What might happen to the readers trying to acquire the lock if there is one writer waiting for existing readers to stop?
这里的读写锁应该是公平的。当新的读进程到来时，如果存在写进程在等待，那么读进程会等待写进程完成。这样才能保证读到的数据是最新的数据。
After freezing the memtable, is it possible that some threads still hold the old LSM state and wrote into these immutable memtables? How does your solution prevent it from happening?
是可能的。如果线程在冻结 mem_table 之前已经获取了读锁，那么在冻结 mem_table 之后，这些线程仍然可以读取旧的 mem_table。为了防止这种情况，可以在冻结 mem_table 之前，获取写锁。这样能保证在冻结前所有的线程都已经释放了读锁。
There are several places that you might first acquire a read lock on state, then drop it and acquire a write lock (these two operations might be in different functions but they happened sequentially due to one function calls the other). How does it differ from directly upgrading the read lock to a write lock? Is it necessary to upgrade instead of acquiring and dropping and what is the cost of doing the upgrade?
释放读锁再获取写锁，可以避免写锁阻塞读锁。如果直接将读锁升级为写锁，那么读锁会被阻塞，直到写锁被释放。这样会导致读锁的等待时间变长。

Prince's Blog

Day1: Memtables