Write Path
in DBImpl::Write
- create a Writer object that holds the write batch
- lock the mutex (RAII scoped MutexLock) and wait for this write to be at the front of the writer queue
- how do multiple writers lock the same lock at the same time? or do they all contend on the MutexLock l(&mutex_)?
- if we need to throttle, then we wait in DBImpl::MakeRoomForWrite
- get the last sequence number
- the Writer now holds responsibility for locking the memtable and WAL, so the main writer thread can release the lock
- WAL commit: https://github.com/google/leveldb/blob/4a0c572440c7df2f56a6f5fb5aec9e366d522edb/db/db_impl.cc#L1236
- if WAL write was successful, write to memtable https://github.com/google/leveldb/blob/4a0c572440c7df2f56a6f5fb5aec9e366d522edb/db/db_impl.cc#L1245
- Lock the mutex again, and update the last sequence number
- remove the writers from the queue (these are the writers whose batches were folded into the grouped write; each is marked done and signaled so it doesn't redo its own write)
- signal the waiting head of the write queue
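The grouping behavior in the steps above can be sketched as a toy model. This is a simplified, hypothetical illustration of the batch-grouping idea, not LevelDB's actual code (the real DBImpl::BuildBatchGroup also respects sync flags and small-batch size caps, and each Writer carries a condition variable):

```cpp
#include <cstddef>
#include <deque>
#include <string>

// Hypothetical stand-ins for LevelDB's Writer / WriteBatch; the names mirror
// the real ones but the types are simplified for illustration.
struct Writer {
  std::string batch;  // pretend this is a serialized WriteBatch
  bool done = false;  // set when another writer commits this batch for us
};

// Sketch of group commit: the writer at the front of the queue absorbs the
// batches queued behind it (up to a size cap), so a single WAL write and
// memtable insertion covers many callers.
std::string BuildBatchGroup(std::deque<Writer*>& writers, std::size_t max_size) {
  std::string group = writers.front()->batch;
  writers.front()->done = true;
  for (std::size_t i = 1; i < writers.size(); ++i) {
    if (group.size() + writers[i]->batch.size() > max_size) break;
    group += writers[i]->batch;
    writers[i]->done = true;  // this writer will be signaled, not re-run
  }
  return group;
}
```

Once the group is durable, the front thread pops each done writer off the queue and signals it, which is why writers are removed from the queue at the end of the write.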
Read Path
in DBImpl::Get
- Create a scoped RAII Lock for the mutex
- Get the snapshot (as determined by a SequenceNumber) for the read (either via the ReadOptions or by getting the latest SequenceNumber)
- Create a LookupKey
- refcount the latest memtable, immutable memtable, and Version
    - what is Version? Version appears to be how LevelDB represents an on-disk snapshot. Conversely, VersionSet is what appears to represent a "snapshot manager"
    - Each Version is a node in a linked list; it links to the next and previous version as well
    - It has a list of files in each level
- what is the manifest_file_number_? Looks to be the snapshot ID, so that we have clear versioning amongst the snapshots.
- Unlock the mutex and lookup the key in the memtable
- If the key is not present in the memtable, look up the immutable memtable
- if the key is not present in the immutable memtable, look up the current snapshot (see Version::Get)
    - create a tracker struct for lookup information
    - call into Version::ForEachOverlapping, which takes a callback to run on every file whose key range matches the lookup key
        - the callback does the actual work of opening the file and reading from it
        - when the callback returns false, the search stops and returns
    - first search level 0 in order of newest to oldest - process all level 0 files that overlap and run the callback on them
- if the callback doesn't return false, proceed to search the deeper levels (1, 2, 3, ...) in turn
- important to note: the guarantee we have now is that only one file will contain the information for any key - so first we search for the first overlapping file within the level, then we search within the file
- for each level, binary search the files until we find the first file whose high key is greater than the lookup key
- if the lookup key is smaller than the low key of the file, then we skip the file
- if the lookup key is at least the low key of the file, we can look within the file
- Within the callback:
    - update internal stats tracking for what the last file read in the read was
    - look up the TableCache within the VersionSet
        - TableCache is an LRU cache (exposed as the interface Cache) that maps keys to values, and maintains some other metadata for a key in a Handle. The TableCache exists because opening an SSTable (lots of work decoding the index blocks, filters, etc.) is an expensive process, so we can cache the file metadata and avoid having to re-read it on every lookup for the key.
    - this opens the file (if not already open) and searches the contents of the Table in a block-wise fashion (Table::InternalGet)
        - if the search is rejected by the bloom filter, return early
    - once the block is found, seek to the key
        - if the key is present (i.e. the key we seeked to is equal to the search key, not greater/lesser), and not corrupt, and not a deletion, then capture the value and return false
- notes: seeing a lot of "charge against/for" - this appears to be the amount an entry counts against the cache's capacity (Cache::Insert takes an explicit charge per entry), rather than a usage count
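The file-selection order described above can be sketched in isolation. FileMeta and FilesToSearch below are hypothetical simplifications of FileMetaData and Version::ForEachOverlapping, but they capture the two regimes: level-0 files may overlap each other and are visited newest-first, while each deeper level is sorted and disjoint, so a single binary search finds the one candidate file:

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Illustrative file metadata; the field names are simplified stand-ins for
// LevelDB's FileMetaData (which uses InternalKeys for the bounds).
struct FileMeta {
  uint64_t number;       // newer files have larger numbers
  std::string smallest;  // low user key in the file
  std::string largest;   // high user key in the file
};

// Return file numbers in the order a lookup would visit them.
std::vector<uint64_t> FilesToSearch(
    const std::vector<std::vector<FileMeta>>& levels, const std::string& key) {
  std::vector<uint64_t> order;
  // Level 0: collect every overlapping file, newest (largest number) first.
  std::vector<FileMeta> l0;
  for (const FileMeta& f : levels[0])
    if (key >= f.smallest && key <= f.largest) l0.push_back(f);
  std::sort(l0.begin(), l0.end(),
            [](const FileMeta& a, const FileMeta& b) { return a.number > b.number; });
  for (const FileMeta& f : l0) order.push_back(f.number);
  // Deeper levels: files are sorted and disjoint, so binary search for the
  // first file whose high key is >= the lookup key, then check the low bound.
  for (std::size_t lvl = 1; lvl < levels.size(); ++lvl) {
    const auto& files = levels[lvl];
    auto it = std::lower_bound(
        files.begin(), files.end(), key,
        [](const FileMeta& f, const std::string& k) { return f.largest < k; });
    if (it != files.end() && key >= it->smallest) order.push_back(it->number);
  }
  return order;
}
```

With this ordering, the first file whose callback finds the key wins, which is how newer data shadows older data.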
LookupKey
This is a special class used to make lookup operations easier in LevelDB. When the user performs a lookup with DB::Get(), the key as the end user sees it - a Slice - is converted into a LookupKey. This serves multiple purposes: you have only one type of key to use for every operation from here on out, whether it's the memtable lookup or the lookup from the files on disk.
An interesting thing to note is the small key optimization, which stores the key in a buffer inside the LookupKey object itself (effectively on the stack) whenever the encoded key - the user key plus 13 bytes of overhead - fits in 200 bytes. This avoids a trip to the allocator and the memory subsystem, which generally protects performance at the cost of slightly higher memory usage - but given that the LookupKey is relatively ephemeral and the allocation is essentially free (stack frame allocation speed), this temporary wastage is an acceptable tradeoff. No doubt this exact number is based on a combination of heuristics and statistics from running LevelDB in production across many workloads.
We also see that the SequenceNumber (that we now know to denote a snapshot boundary) is encoded within the byte stream of the key. This is a helpful way of tracking the snapshot we're using for lookup, and, potentially, filtering out results that exist past that snapshot boundary.
https://github.com/google/leveldb/blob/863f185970eff21e826e5fe1164a6215a515c23b/db/dbformat.h#L183
// A helper class useful for DBImpl::Get()
class LookupKey {
public:
// Initialize *this for looking up user_key at a snapshot with
// the specified sequence number.
LookupKey(const Slice& user_key, SequenceNumber sequence);
LookupKey(const LookupKey&) = delete;
LookupKey& operator=(const LookupKey&) = delete;
~LookupKey();
// Return a key suitable for lookup in a MemTable.
Slice memtable_key() const { return Slice(start_, end_ - start_); }
// Return an internal key (suitable for passing to an internal iterator)
Slice internal_key() const { return Slice(kstart_, end_ - kstart_); }
// Return the user key
Slice user_key() const { return Slice(kstart_, end_ - kstart_ - 8); }
private:
// We construct a char array of the form:
// klength varint32 <-- start_
// userkey char[klength] <-- kstart_
// tag uint64
// <-- end_
// The array is a suitable MemTable key.
// The suffix starting with "userkey" can be used as an InternalKey.
const char* start_;
const char* kstart_;
const char* end_;
char space_[200]; // Avoid allocation for short keys
};
Looking at the constructor gives us a little more of a hint as to what's going on, and how exactly the sequence number is encoded.
LookupKey::LookupKey(const Slice& user_key, SequenceNumber s) {
size_t usize = user_key.size();
size_t needed = usize + 13; // A conservative estimate
char* dst;
if (needed <= sizeof(space_)) {
dst = space_;
} else {
dst = new char[needed];
}
start_ = dst;
dst = EncodeVarint32(dst, usize + 8);
kstart_ = dst;
std::memcpy(dst, user_key.data(), usize);
dst += usize;
EncodeFixed64(dst, PackSequenceAndType(s, kValueTypeForSeek));
dst += 8;
end_ = dst;
}
Here we see that when LevelDB decides whether it needs to heap-allocate space for the key, it computes size_t needed = usize + 13 - presumably 5 bytes for the Varint32 that stores the key length and 8 bytes for the Fixed64 that stores the sequence number. I was confused about where the extra 1 comes from (i.e. why 5 bytes are used to store a 4-byte integer). Looking at other functions for Varint32, we see that even PutVarint32() uses 5 bytes as the default storage length, confirming that this isn't an approximation of some sort. Let's see why this number is 5.
void PutVarint32(std::string* dst, uint32_t v) {
char buf[5];
char* ptr = EncodeVarint32(buf, v);
dst->append(buf, ptr - buf);
}
A small note on varints (see https://samliu.github.io/2016/10/15/varints.html): the most significant bit of every byte indicates whether another byte follows. While this bit is part of the encoding format, it is not part of the number itself, so it doesn't count toward the value being stored. The contract a Varint32 maintains is that it can store any 32-bit unsigned integer, i.e. 0 to 2^32 - 1. The effective capacity of 4 varint bytes - counting only the payload bits - is 7 + 7 + 7 + 7 = 28 bits, which covers only 0 to 2^28 - 1 and leaves out the range 2^28 to 2^32 - 1. This is why we need the extra byte, illustrating that there's no such thing as a free lunch - adopting varints as the sole representation sometimes means using 5 bytes to store a 4-byte unsigned integer :D
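The byte-count argument can be verified with a few lines. VarintLength32 here is a hypothetical helper written for this note, mirroring the loop in LevelDB's VarintLength:

```cpp
#include <cstdint>

// Count how many bytes a varint encoding of v needs: each byte carries
// 7 payload bits, with the top bit reserved as a continuation flag.
int VarintLength32(uint32_t v) {
  int len = 1;
  while (v >= 128) {  // more than 7 bits remain, so another byte follows
    v >>= 7;
    ++len;
  }
  return len;
}
```

Values up to 2^28 - 1 fit in 4 bytes; anything from 2^28 through 2^32 - 1 needs the fifth byte.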
My personal preference is to minimize the use of magic numbers as much as possible - for someone who's reading the code at a glance, it's much harder to understand why there's an 8 or a 13 in the places that it exists. I would create an inline const variable with a comment on top of it explaining why it exists.
Another interesting thing to note is that the encoded length of the key has 8 added to it (EncodeVarint32(dst, usize + 8)). This is because the length prefix describes the internal key, which is the user key plus the 8-byte tag (sequence number and value type) appended after it - exactly the format the memtable stores.
Now that we know how the encoding works, we can safely summarize that the internal key is simply the key the user passed in for the lookup, with the tag (sequence number and value type) appended to it.
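To close the loop, here is a standalone sketch that re-creates the memtable-key byte layout for a short key. MakeMemtableKey is a hypothetical helper written for this note; the tag packing mirrors PackSequenceAndType ((sequence << 8) | type, with kValueTypeForSeek = 0x1), but treat it as an illustration rather than the library's code:

```cpp
#include <cstdint>
#include <string>

// Re-create the klength / userkey / tag layout from the LookupKey comment.
std::string MakeMemtableKey(const std::string& user_key, uint64_t seq) {
  const uint8_t kTypeForSeek = 0x1;  // mirrors kValueTypeForSeek
  std::string out;
  uint32_t klength = user_key.size() + 8;  // user key + 8-byte tag
  // Varint32-encode klength (a single byte suffices for short keys).
  while (klength >= 128) {
    out.push_back(static_cast<char>(klength | 128));
    klength >>= 7;
  }
  out.push_back(static_cast<char>(klength));
  out += user_key;
  uint64_t tag = (seq << 8) | kTypeForSeek;  // mirrors PackSequenceAndType
  for (int i = 0; i < 8; ++i)  // fixed 64-bit little-endian, like EncodeFixed64
    out.push_back(static_cast<char>((tag >> (8 * i)) & 0xff));
  return out;
}
```

For the key "foo" at sequence 7 this produces 12 bytes: a one-byte length prefix of 11 (3 for the user key plus 8 for the tag), the user key itself, and then the little-endian tag - matching the klength/userkey/tag diagram in the class comment.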