This is a brain dump of my thoughts and things I found interesting about the LevelDB codebase.
LevelDB doesn't require any external synchronization primitives - it's interesting to note this as a design choice, where the write path handles queueing/waiting for access to the internal store. I found it fascinating that your current write simply sits in a queue waiting for access, and this happens without the user being aware of any queueing.
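The real logic lives in DBImpl::Write, which keeps a deque of waiting writers guarded by the DB mutex. The sketch below is my own simplified model of that pattern using standard-library primitives (the class name QueuedStore and the string log are made up for illustration), not the actual implementation:

```cpp
#include <condition_variable>
#include <deque>
#include <mutex>
#include <string>
#include <vector>

// Hypothetical sketch of LevelDB's writer-queue pattern: each caller
// enqueues itself, then waits until it reaches the front of the queue.
// The front writer performs the mutation while holding the lock, pops
// itself, and wakes the next writer - no synchronization by the user.
class QueuedStore {
 public:
  void Put(const std::string& value) {
    std::unique_lock<std::mutex> lock(mu_);
    Writer w;
    writers_.push_back(&w);
    // Block until this writer is at the front of the queue.
    cv_.wait(lock, [&] { return writers_.front() == &w; });
    // We now own the store; apply the write.
    log_.push_back(value);
    writers_.pop_front();
    cv_.notify_all();  // wake the next queued writer
  }

  std::vector<std::string> Contents() {
    std::lock_guard<std::mutex> lock(mu_);
    return log_;
  }

 private:
  struct Writer {};  // the real Writer carries the batch, sync flag, status
  std::mutex mu_;
  std::condition_variable cv_;
  std::deque<Writer*> writers_;
  std::vector<std::string> log_;
};
```

LevelDB goes one step further than this sketch: the writer at the front may absorb the batches of writers queued behind it (group commit), so later writers can return without ever touching the store themselves.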
fsync is optional - the write options have a way to let you skip fsyncing the write, leaving it potentially stuck in the page cache. I don't think this is a good thing, and I'm struggling to see a use case for when I might want to skip fsync.
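A toy model of what that flag controls, assuming POSIX. The struct name mirrors LevelDB's WriteOptions, but AppendRecord and the file handling are my own illustration, not LevelDB code:

```cpp
#include <cstdio>
#include <string>
#include <unistd.h>  // fsync, fileno (POSIX)

// Sketch of the WriteOptions::sync idea: with sync == false the record
// only reaches the OS page cache (lost on power failure); with sync ==
// true we force it down to durable storage before returning.
struct WriteOptions {
  bool sync = false;
};

bool AppendRecord(const char* path, const std::string& rec,
                  const WriteOptions& opts) {
  std::FILE* f = std::fopen(path, "ab");
  if (f == nullptr) return false;
  bool ok = std::fwrite(rec.data(), 1, rec.size(), f) == rec.size();
  if (ok) ok = (std::fflush(f) == 0);                   // push to the page cache
  if (ok && opts.sync) ok = (fsync(fileno(f)) == 0);    // force to disk
  std::fclose(f);
  return ok;
}
```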
there are two memtables - this is common, and back when it was designed I'm sure IO slowness would have warranted it: you usually keep two memtables when flushing one to disk takes a long time, so new writes can land in a fresh memtable instead of stalling behind the flush. In the grand scheme of things the gap still holds - an in-memory write is vastly cheaper than a disk flush.
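The shape of this is roughly the following - a toy model, with names loosely mirroring DBImpl's mem_/imm_ fields (the class, the std::map "memtable", and the size threshold are all made up for illustration):

```cpp
#include <map>
#include <memory>
#include <string>

using MemTable = std::map<std::string, std::string>;

// Toy two-memtable scheme: writes always go to the active memtable
// (mem_); when it fills up it is frozen as the immutable memtable
// (imm_) and a fresh memtable absorbs new writes. In LevelDB a
// background thread flushes imm_ to an SSTable and then drops it.
class TwoMemtableDB {
 public:
  void Put(const std::string& k, const std::string& v) {
    (*mem_)[k] = v;
    if (mem_->size() >= kMemtableLimit && imm_ == nullptr) {
      imm_ = std::move(mem_);               // freeze the full memtable
      mem_ = std::make_unique<MemTable>();  // new writes go here, no stall
    }
  }

  bool Get(const std::string& k, std::string* v) const {
    // Newest data first: mem_, then imm_ (then SSTables on disk).
    auto it = mem_->find(k);
    if (it != mem_->end()) { *v = it->second; return true; }
    if (imm_ != nullptr) {
      it = imm_->find(k);
      if (it != imm_->end()) { *v = it->second; return true; }
    }
    return false;
  }

 private:
  static constexpr std::size_t kMemtableLimit = 4;  // arbitrary toy threshold
  std::unique_ptr<MemTable> mem_ = std::make_unique<MemTable>();
  std::unique_ptr<MemTable> imm_;  // non-null while a flush is pending
};
```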
the WAL is represented as a log, which has its own internal buffering.
Having an overloaded Status for returning errors is actually good. Saves you a lot of guessing.
the Env functionality lets the end-user override a lot of codepaths (especially with regard to filesystem operations) - this allows for some really cool customizations and feature injections. For example, one might wish to print something/log something every time a file is deleted. This is also great for wrangling cross-platform file management quirks and implementations.
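The delete-logging example might look something like the following. This is a self-contained sketch, not the real leveldb::Env (which has many more methods, and ships an EnvWrapper helper for exactly this kind of delegation); the class names here beyond Env are my own:

```cpp
#include <cstdio>
#include <string>

// Minimal sketch of the Env idea: the database touches the filesystem
// only through this interface, so a user-supplied subclass can
// intercept any operation.
class Env {
 public:
  virtual ~Env() = default;
  virtual bool RemoveFile(const std::string& fname) {
    return std::remove(fname.c_str()) == 0;
  }
};

// Example customization: log every file deletion, then delegate to the
// wrapped Env.
class LoggingEnv : public Env {
 public:
  explicit LoggingEnv(Env* base) : base_(base) {}
  bool RemoveFile(const std::string& fname) override {
    std::fprintf(stderr, "deleting %s\n", fname.c_str());
    ++deletions;
    return base_->RemoveFile(fname);
  }
  int deletions = 0;  // exposed so the example is easy to observe

 private:
  Env* base_;
};
```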
LevelDB also implements writer throttling - in LSM trees, the number of files present in L0 is a dominating factor in read latency. In the leveled compaction strategy, SSTable key ranges within a level don't overlap, so for every level except level 0 we can guarantee that at most one SSTable file contains the key we're concerned about - the IO load scales proportionally with the number of levels. This doesn't apply to L0, where memtables are flushed directly into potentially overlapping ranges, so we may need to read more than one file on every read.
When the number of files in L0 goes above a certain threshold, we throttle writes inside the writer, either by delaying each write or by stalling entirely until a compaction catches up.
see: https://github.com/google/leveldb/blob/4a0c572440c7df2f56a6f5fb5aec9e366d522edb/db/db_impl.cc#L1331
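The decision boils down to something like this. The trigger constants mirror kL0_SlowdownWritesTrigger and kL0_StopWritesTrigger from db/dbformat.h (the values here are as I recall them - treat them as illustrative), and ClassifyWrite is my own distillation of the branch in DBImpl::MakeRoomForWrite:

```cpp
// As L0 files pile up, first slow each write down by ~1ms of sleep,
// then stop writes entirely until background compaction drains L0.
enum class WriteAction { kProceed, kDelay1ms, kStallUntilCompaction };

constexpr int kL0_SlowdownWritesTrigger = 8;
constexpr int kL0_StopWritesTrigger = 12;

WriteAction ClassifyWrite(int num_level0_files) {
  if (num_level0_files >= kL0_StopWritesTrigger) {
    return WriteAction::kStallUntilCompaction;  // hard backpressure
  }
  if (num_level0_files >= kL0_SlowdownWritesTrigger) {
    return WriteAction::kDelay1ms;  // gentle backpressure on each writer
  }
  return WriteAction::kProceed;
}
```

The delay case is the clever part: spreading a 1ms sleep across many individual writes smooths latency instead of handing a few unlucky writes a multi-second stall.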
Manual reference counting - the memtable (at least; I've seen it in other places too, namely the Version) in LevelDB re-implements std::shared_ptr<T> semantics with manual reference counting. This is very curious and interesting to see - the git blame (possibly from a refactoring/sync commit) dates to 2011, which is also exactly when std::shared_ptr<T> was standardized as part of C++11. Internal methods manually increase the reference count of the memtable.
The only scenario I can think of is that if some iterator is reading from the memtable and for whatever reason the memtable is flushed, we don't want the memtable to disappear out from under it.
class MemTable {
 public:
  // MemTables are reference counted. The initial reference count
  // is zero and the caller must call Ref() at least once.
  explicit MemTable(const InternalKeyComparator& comparator);
  ...
  // Increase reference count.
  void Ref() { ++refs_; }

  // Drop reference count. Delete if no more references exist.
  void Unref() {
    --refs_;
    assert(refs_ >= 0);
    if (refs_ <= 0) {
      delete this;
    }
  }
  ...
};
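Here's a self-contained, runnable version of that pattern showing the scenario above - a reader pinning the table so the owner dropping its reference doesn't destroy it mid-read. The std::map body stands in for the real skiplist; only the Ref/Unref shape comes from LevelDB:

```cpp
#include <cassert>
#include <map>
#include <string>

class MemTable {
 public:
  void Ref() { ++refs_; }
  void Unref() {
    --refs_;
    assert(refs_ >= 0);
    if (refs_ <= 0) delete this;  // last reference destroys the table
  }
  std::map<std::string, std::string> data;  // stand-in for the skiplist

 private:
  ~MemTable() = default;  // private: only Unref() may destroy the table
  int refs_ = 0;
};
```

Usage: the DB takes one reference at creation; a reader takes another before iterating. When the DB drops its reference after a flush, the table survives until the reader calls Unref(). A std::shared_ptr<MemTable> would express the same ownership today, which is what makes the hand-rolled version such a period piece.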
the Read codepath can also cause a compaction - it looks like whenever the current version is edited due to a compaction, it computes a recommendation for the next level to compact. The read codepath then tries to schedule a compaction (if recommended - measured via Version::compaction_score_) in the DBImpl::MaybeScheduleCompaction codepath.
https://github.com/google/leveldb/blob/4a0c572440c7df2f56a6f5fb5aec9e366d522edb/db/db_impl.cc#L668