Write Path
in DBImpl::Write
- create a Writer object that holds the write batch
- lock the mutex (RAII scoped MutexLock) and wait for this write to be at the front of the writer queue
- how do multiple writers lock the same lock at the same time? or do they all contend on the MutexLock l(&mutex_)?
- if we need to throttle, then we wait in DBImpl::MakeRoomForWrite
- get the last sequence number
- the Writer now holds responsibility for locking the memtable and WAL, so the main writer thread can release the lock
- WAL commit: https://github.com/google/leveldb/blob/4a0c572440c7df2f56a6f5fb5aec9e366d522edb/db/db_impl.cc#L1236
- if WAL write was successful, write to memtable https://github.com/google/leveldb/blob/4a0c572440c7df2f56a6f5fb5aec9e366d522edb/db/db_impl.cc#L1245
- Lock the mutex again, and update the last sequence number
- remove the writers from the queue (these are the writers whose batches were folded into the grouped write; each is marked done and signaled so it doesn't redo its own write)
- signal the waiting head of the write queue
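The grouping behavior in the steps above can be sketched as a toy model. This is a simplified, hypothetical illustration of the batch-grouping idea, not LevelDB's actual code (the real DBImpl::BuildBatchGroup also respects sync flags and small-batch size caps, and each Writer carries a condition variable):

```cpp
#include <cstddef>
#include <deque>
#include <string>

// Hypothetical stand-ins for LevelDB's Writer / WriteBatch; the names mirror
// the real ones but the types are simplified for illustration.
struct Writer {
  std::string batch;  // pretend this is a serialized WriteBatch
  bool done = false;  // set when another writer commits this batch for us
};

// Sketch of group commit: the writer at the front of the queue absorbs the
// batches queued behind it (up to a size cap), so a single WAL write and
// memtable insertion covers many callers.
std::string BuildBatchGroup(std::deque<Writer*>& writers, std::size_t max_size) {
  std::string group = writers.front()->batch;
  writers.front()->done = true;
  for (std::size_t i = 1; i < writers.size(); ++i) {
    if (group.size() + writers[i]->batch.size() > max_size) break;
    group += writers[i]->batch;
    writers[i]->done = true;  // this writer will be signaled, not re-run
  }
  return group;
}
```

Once the group is durable, the front thread pops each done writer off the queue and signals it, which is why writers are removed from the queue at the end of the write.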
Read Path
in DBImpl::Get
- Create a scoped RAII Lock for the mutex
- Get the snapshot (as determined by a SequenceNumber) for the read (either via the ReadOptions or by getting the latest SequenceNumber)
- Create a LookupKey
- refcount the latest memtable, immutable memtable, and Version
    - what is Version? Version appears to be how LevelDB represents an on-disk snapshot. Conversely, VersionSet is what appears to represent a "snapshot manager"
    - Each Version is a node in a linked list; it links to the next and previous version as well
    - It has a list of files in each level
- what is the manifest_file_number_? Looks to be the snapshot ID, so that we have clear versioning amongst the snapshots.
- Unlock the mutex and lookup the key in the memtable
- If the key is not present in the memtable, look up the immutable memtable
- if the key is not present in the immutable memtable, look up the current snapshot (see Version::Get)
    - create a tracker struct for lookup information
    - call into Version::ForEachOverlapping, which takes a callback to run on every file whose key range matches the lookup key
        - the callback does the actual work of opening the file and reading from it
        - when the callback returns false, the search stops and returns
    - first search level 0 in order of newest to oldest - process all level 0 files that overlap and run the callback on them
- if the callback doesn't return false, proceed to search the deeper levels (1, 2, 3, ...) in turn
- important to note: the guarantee we have now is that only one file will contain the information for any key - so first we search for the first overlapping file within the level, then we search within the file
- for each level, binary search the files until we find the first file whose high key is greater than the lookup key
- if the lookup key is smaller than the low key of the file, then we skip the file
- if the lookup key is at least the low key of the file, we can look within the file
- Within the callback:
    - update internal stats tracking for what the last file read in the read was
    - look up the TableCache within the VersionSet
        - TableCache is an LRU cache (exposed as the interface Cache) that maps keys to values, and maintains some other metadata for a key in a Handle. The TableCache exists because opening an SSTable (lots of work decoding the index blocks, filters, etc.) is an expensive process, so we can cache the file metadata and avoid having to re-read it on every lookup for the key.
    - this opens the file (if not already open) and searches the contents of the Table in a block-wise fashion (Table::InternalGet)
        - if the search is rejected by the bloom filter, return early
    - once the block is found, seek to the key
        - if the key is present (i.e. the key we seeked to is equal to the search key, not greater/lesser), and not corrupt, and not a deletion, then capture the value and return false
- notes: seeing a lot of "charge against/for" - this appears to be the amount an entry counts against the cache's capacity (Cache::Insert takes an explicit charge per entry), rather than a usage count
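The file-selection order described above can be sketched in isolation. FileMeta and FilesToSearch below are hypothetical simplifications of FileMetaData and Version::ForEachOverlapping, but they capture the two regimes: level-0 files may overlap each other and are visited newest-first, while each deeper level is sorted and disjoint, so a single binary search finds the one candidate file:

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Illustrative file metadata; the field names are simplified stand-ins for
// LevelDB's FileMetaData (which uses InternalKeys for the bounds).
struct FileMeta {
  uint64_t number;       // newer files have larger numbers
  std::string smallest;  // low user key in the file
  std::string largest;   // high user key in the file
};

// Return file numbers in the order a lookup would visit them.
std::vector<uint64_t> FilesToSearch(
    const std::vector<std::vector<FileMeta>>& levels, const std::string& key) {
  std::vector<uint64_t> order;
  // Level 0: collect every overlapping file, newest (largest number) first.
  std::vector<FileMeta> l0;
  for (const FileMeta& f : levels[0])
    if (key >= f.smallest && key <= f.largest) l0.push_back(f);
  std::sort(l0.begin(), l0.end(),
            [](const FileMeta& a, const FileMeta& b) { return a.number > b.number; });
  for (const FileMeta& f : l0) order.push_back(f.number);
  // Deeper levels: files are sorted and disjoint, so binary search for the
  // first file whose high key is >= the lookup key, then check the low bound.
  for (std::size_t lvl = 1; lvl < levels.size(); ++lvl) {
    const auto& files = levels[lvl];
    auto it = std::lower_bound(
        files.begin(), files.end(), key,
        [](const FileMeta& f, const std::string& k) { return f.largest < k; });
    if (it != files.end() && key >= it->smallest) order.push_back(it->number);
  }
  return order;
}
```

With this ordering, the first file whose callback finds the key wins, which is how newer data shadows older data.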
LookupKey
This is a special class used to make lookup operations easier in LevelDB. When the user performs a lookup with DB::Get(), the key as the end user sees it - a Slice - is converted into a LookupKey. This serves multiple purposes: you have only one type of key to use for every operation from here on out, whether it's the memtable lookup or the lookup from the files on disk.
An interesting thing to note is the small key optimization, which stores the key in a buffer inside the LookupKey object itself (effectively on the stack) whenever the encoded key - the user key plus 13 bytes of overhead - fits in 200 bytes. This avoids a trip to the allocator and the memory subsystem, which generally protects performance at the cost of slightly higher memory usage - but given that the LookupKey is relatively ephemeral and the allocation is essentially free (stack frame allocation speed), this temporary wastage is an acceptable tradeoff. No doubt this exact number is based on a combination of heuristics and statistics from running LevelDB in production across many workloads.
We also see that the SequenceNumber (that we now know to denote a snapshot boundary) is encoded within the byte stream of the key. This is a helpful way of tracking the snapshot we're using for lookup, and, potentially, filtering out results that exist past that snapshot boundary.
https://github.com/google/leveldb/blob/863f185970eff21e826e5fe1164a6215a515c23b/db/dbformat.h#L183
// A helper class useful for DBImpl::Get()
class LookupKey {
public:
// Initialize *this for looking up user_key at a snapshot with
// the specified sequence number.
LookupKey(const Slice& user_key, SequenceNumber sequence);
LookupKey(const LookupKey&) = delete;
LookupKey& operator=(const LookupKey&) = delete;
~LookupKey();
// Return a key suitable for lookup in a MemTable.
Slice memtable_key() const { return Slice(start_, end_ - start_); }
// Return an internal key (suitable for passing to an internal iterator)
Slice internal_key() const { return Slice(kstart_, end_ - kstart_); }
// Return the user key
Slice user_key() const { return Slice(kstart_, end_ - kstart_ - 8); }
private:
// We construct a char array of the form:
// klength varint32 <-- start_
// userkey char[klength] <-- kstart_
// tag uint64
// <-- end_
// The array is a suitable MemTable key.
// The suffix starting with "userkey" can be used as an InternalKey.
const char* start_;
const char* kstart_;
const char* end_;
char space_[200]; // Avoid allocation for short keys
};
Looking at the constructor gives us a little more of a hint as to what's going on, and how exactly the sequence number is encoded.
LookupKey::LookupKey(const Slice& user_key, SequenceNumber s) {
size_t usize = user_key.size();
size_t needed = usize + 13; // A conservative estimate
char* dst;
if (needed <= sizeof(space_)) {
dst = space_;
} else {
dst = new char[needed];
}
start_ = dst;
dst = EncodeVarint32(dst, usize + 8);
kstart_ = dst;
std::memcpy(dst, user_key.data(), usize);
dst += usize;
EncodeFixed64(dst, PackSequenceAndType(s, kValueTypeForSeek));
dst += 8;
end_ = dst;
}
Here we see that when LevelDB decides whether it needs to heap-allocate space for the key, it computes size_t needed = usize + 13 - presumably 5 bytes for the Varint32 that stores the key length and 8 bytes for the Fixed64 that stores the sequence number. I was confused about where the extra 1 comes from (i.e. why 5 bytes are used to store a 4-byte integer). Looking at other functions for Varint32, we see that even PutVarint32() uses 5 bytes as the default storage length, confirming that this isn't an approximation of some sort. Let's see why this number is 5.
void PutVarint32(std::string* dst, uint32_t v) {
char buf[5];
char* ptr = EncodeVarint32(buf, v);
dst->append(buf, ptr - buf);
}
A small note on varints (see https://samliu.github.io/2016/10/15/varints.html): the most significant bit of every byte indicates whether another byte follows. While this bit is part of the encoding format, it is not part of the number itself, so it doesn't count toward the value being stored. The contract a Varint32 maintains is that it can store any 32-bit unsigned integer, i.e. 0 to 2^32 - 1. The effective capacity of 4 varint bytes - counting only the payload bits - is 7 + 7 + 7 + 7 = 28 bits, which covers only 0 to 2^28 - 1 and leaves out the range 2^28 to 2^32 - 1. This is why we need the extra byte, illustrating that there's no such thing as a free lunch - adopting varints as the sole representation sometimes means using 5 bytes to store a 4-byte unsigned integer :D
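The byte-count argument can be verified with a few lines. VarintLength32 here is a hypothetical helper written for this note, mirroring the loop in LevelDB's VarintLength:

```cpp
#include <cstdint>

// Count how many bytes a varint encoding of v needs: each byte carries
// 7 payload bits, with the top bit reserved as a continuation flag.
int VarintLength32(uint32_t v) {
  int len = 1;
  while (v >= 128) {  // more than 7 bits remain, so another byte follows
    v >>= 7;
    ++len;
  }
  return len;
}
```

Values up to 2^28 - 1 fit in 4 bytes; anything from 2^28 through 2^32 - 1 needs the fifth byte.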
My personal preference is to minimize the use of magic numbers as much as possible - for someone who's reading the code at a glance, it's much harder to understand why there's an 8 or a 13 in the places that it exists. I would create an inline const variable with a comment on top of it explaining why it exists.
Another interesting thing to note is that the encoded length of the key has 8 added to it (EncodeVarint32(dst, usize + 8)). This is because the length prefix describes the internal key, which is the user key plus the 8-byte tag (sequence number and value type) appended after it - exactly the format the memtable stores.
Now that we know how the encoding works, we can safely summarize that the internal key is simply the key the user passed in for the lookup, with the tag (sequence number and value type) appended to it.
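To close the loop, here is a standalone sketch that re-creates the memtable-key byte layout for a short key. MakeMemtableKey is a hypothetical helper written for this note; the tag packing mirrors PackSequenceAndType ((sequence << 8) | type, with kValueTypeForSeek = 0x1), but treat it as an illustration rather than the library's code:

```cpp
#include <cstdint>
#include <string>

// Re-create the klength / userkey / tag layout from the LookupKey comment.
std::string MakeMemtableKey(const std::string& user_key, uint64_t seq) {
  const uint8_t kTypeForSeek = 0x1;  // mirrors kValueTypeForSeek
  std::string out;
  uint32_t klength = user_key.size() + 8;  // user key + 8-byte tag
  // Varint32-encode klength (a single byte suffices for short keys).
  while (klength >= 128) {
    out.push_back(static_cast<char>(klength | 128));
    klength >>= 7;
  }
  out.push_back(static_cast<char>(klength));
  out += user_key;
  uint64_t tag = (seq << 8) | kTypeForSeek;  // mirrors PackSequenceAndType
  for (int i = 0; i < 8; ++i)  // fixed 64-bit little-endian, like EncodeFixed64
    out.push_back(static_cast<char>((tag >> (8 * i)) & 0xff));
  return out;
}
```

For the key "foo" at sequence 7 this produces 12 bytes: a one-byte length prefix of 11 (3 for the user key plus 8 for the tag), the user key itself, and then the little-endian tag - matching the klength/userkey/tag diagram in the class comment.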