Databases Beyond The Data: Part 2. How NoSQL databases work and when to use them

In the previous article (Databases Beyond The Data: Part 1) we discussed an example of an in-memory index used in Bitcask and how it works. We covered the HashMap implementation, the compaction and merging process, disaster recovery, and use cases. Today we continue the discussion and review the internal organization of more advanced NoSQL databases, including their indexes, performance optimization techniques, the pros and cons of these database types, and their application areas. Before we get started, make sure to read the first part of this series, because we will assume you are familiar with those details.

SSTables and LSM-tree

The index we discussed in the previous article was a simple in-memory HashMap with keys corresponding to DB keys and values containing the offset of the corresponding record in the log-structured storage file. This approach is surprisingly useful, but it is not perfect: records are appended in the order they arrive, and we never know which key is going to come next. We can make the compaction process much simpler and faster if we guarantee that all records inside a segment file are sorted by key. As you may remember, the compaction and merging process reads all segments in order to find records with the same key and eliminate all duplicates except the most recent one. With sorted segments, we can read the input files side by side and look only at the first key in each file, copy the most recent occurrence of that key to the new segment, advance past that key in all the old segments, and repeat for the next key until all keys are merged into the new segment, which is essentially the merge step of mergesort.
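The merging step described above can be sketched in a few lines. This is a minimal illustration, not a real storage engine: Python lists of (key, value) pairs stand in for sorted segment files, and the segment's position in the list stands in for its age.

```python
import heapq

def merge_segments(segments):
    """Compact sorted segments into one, keeping the newest value per key.

    `segments` is a list ordered oldest -> newest; each segment is a list
    of (key, value) pairs already sorted by key, as in an SSTable.
    """
    # Tag each record with its segment's age so that, for equal keys,
    # records come out of the merge in oldest-to-newest order.
    tagged = [
        [(key, age, value) for key, value in segment]
        for age, segment in enumerate(segments)
    ]
    result = []
    # heapq.merge walks all pre-sorted inputs in global key order.
    for key, age, value in heapq.merge(*tagged):
        if result and result[-1][0] == key:
            # Duplicate key: a newer segment wins, so overwrite.
            result[-1] = (key, value)
        else:
            result.append((key, value))
    return result
```

Because every input is already sorted, the merge is a single sequential pass over the segments, no matter how large they are.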

Another benefit of sorting is a "relative location guarantee". For instance, if you know the offsets of the records with keys "backboard" and "backlist", then thanks to sorting you can be sure that "backfile" is located somewhere between the two. This guarantee also makes sparse in-memory indexes possible, i.e. the ability to load into memory only one key for every few kilobytes of segment file instead of the whole keyset, which in turn lets us optimize storage and reduce I/O bandwidth by grouping records into blocks and compressing them.
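A sparse index lookup can be sketched with a simple binary search. The index entries and offsets below are made-up illustrative values; the point is that the sorted order lets us jump to the one block that can possibly contain the key.

```python
import bisect

# A sparse index holds only one (key, byte_offset) pair per block of the
# segment file, sorted by key. These entries are hypothetical examples.
sparse_index = [("apple", 0), ("backboard", 4096), ("backlist", 8192), ("cat", 12288)]

def block_offset_for(key):
    """Return the offset of the only block that may contain `key`.

    Because the segment is sorted, the key must live in the block that
    starts at the greatest indexed key <= the lookup key; from that
    offset we scan (or decompress) one block sequentially.
    """
    keys = [k for k, _ in sparse_index]
    i = bisect.bisect_right(keys, key) - 1
    if i < 0:
        return None  # key sorts before everything in this segment
    return sparse_index[i][1]
```

For the article's example, a lookup of "backfile" lands in the block starting at "backboard", because "backboard" <= "backfile" < "backlist".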

Such sorted log-structured storage format is called Sorted String Table, or SSTable in short.

By this point you might be asking a very reasonable question: how are we going to sort records inside an append-only log file on disk? Well, we are not going to do that. Instead, we can use in-memory data structures such as red-black or AVL trees, which allow inserting data in any order and retrieving it in sorted order. Each new write goes simultaneously into an in-memory balanced tree and into an unsorted append-only log file, which serves as a backup to ensure the record is not lost in case of a crash.
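The write path can be sketched as follows. This is a toy assuming a tab-separated log format of my own invention; a sorted list kept with `bisect` stands in for the balanced tree, since Python's standard library has no red-black tree.

```python
import bisect

class Memtable:
    """Toy memtable: a bisect-maintained sorted list stands in for a
    red-black/AVL tree, and every write is also appended to an on-disk
    log so it can be replayed after a crash."""

    def __init__(self, wal_path):
        self.keys = []    # keys kept in sorted order
        self.values = {}  # key -> latest value
        self.wal = open(wal_path, "a")

    def put(self, key, value):
        # 1. Append to the write-ahead log first, so the record
        #    survives a crash before it reaches any SSTable.
        self.wal.write(f"{key}\t{value}\n")
        self.wal.flush()
        # 2. Insert into the sorted in-memory structure.
        if key not in self.values:
            bisect.insort(self.keys, key)
        self.values[key] = value

    def items(self):
        # Yields records in key order -- ready to become an SSTable.
        return [(k, self.values[k]) for k in self.keys]
```

Note that the log stays unsorted (writes are appended in arrival order), while the in-memory structure is always sorted.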

This in-memory balanced tree is sometimes called a memtable.

We cannot keep writing into the same memtable indefinitely as it constantly grows, so once a certain size threshold is reached we write it out to disk as an SSTable file. This is extremely fast and easy because the memtable already keeps the data sorted. While the new SSTable is being written, the database can continue serving writes into a fresh memtable that will replace the one being flushed.
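The flush itself is a single sequential pass, sketched below with an assumed tab-separated file format. `sorted_records` stands for an in-order iterator over the memtable, which a balanced tree provides for free.

```python
def flush_memtable(sorted_records, sstable_path):
    """Write the memtable's contents out as an SSTable file.

    `sorted_records` is an iterable of (key, value) pairs already in key
    order (an in-order walk of the balanced tree), so the flush is one
    cheap sequential write -- no sorting ever happens on disk.
    """
    with open(sstable_path, "w") as f:
        for key, value in sorted_records:
            f.write(f"{key}\t{value}\n")
```

After the flush completes, the corresponding write-ahead log file can be discarded, since every record it protected is now safely on disk in the SSTable.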

Things get a bit more complicated when it comes to reads, which are an iterative, sequential process. First we check the current memtable for the requested key, then the most recent segment, then the second-youngest segment, and so on.
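The read path can be sketched as follows. For simplicity, plain dicts stand in for the memtable and the on-disk segments; a real engine would search SSTable files through their sparse indexes instead.

```python
def lookup(key, memtable, segments):
    """Read path: check the current memtable first, then segments from
    newest to oldest; the first hit is the most recent value.

    `memtable` is a dict-like mapping; `segments` is a list of dict-like
    segments ordered oldest -> newest.
    """
    if key in memtable:
        return memtable[key]
    for segment in reversed(segments):  # newest segment first
        if key in segment:
            return segment[key]
    return None  # key is not present anywhere
```

The worst case is a miss: a key that exists nowhere forces a check of the memtable and every segment, which is exactly the cost that Bloom filters (discussed below under pros and cons) are designed to avoid.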

LSM Trees

The SSTable and memtable terms were originally introduced by Google in the "Bigtable: A Distributed Storage System for Structured Data" paper; however, Google was not the first to come up with the idea. Back in 1996, Patrick O'Neil and his co-authors described this indexing structure as the Log-Structured Merge-Tree, or LSM-tree for short, so these terms are somewhat interchangeable.

Use cases

Many databases use the indexing algorithm described above. We have already mentioned Google's famous Bigtable, but there are more. For instance, LevelDB, an open-source on-disk key-value store written by Google fellows Jeffrey Dean and Sanjay Ghemawat, or RocksDB, a fork of Google's LevelDB optimized to exploit many CPU cores and make efficient use of fast storage, such as solid-state drives, for I/O-bound workloads. Interestingly, both of these databases can be used as embedded application storage or serve as an application caching solution. LevelDB can also be used as an alternative storage backend for Riak instead of Bitcask. Other well-known databases that use SSTables include Cassandra, Amazon DynamoDB, ScyllaDB, HBase, and even Apache Lucene, the free and open-source search library used by Elasticsearch and Solr. Lucene uses a full-text index instead of a key-value one, which is much more complex but based on a similar idea: given a word in a search query, find all the documents that contain that word.

Bigtable, HBase, Cassandra, DynamoDB, ScyllaDB

These databases are quite similar in nature: HBase, Cassandra, and ScyllaDB borrow heavily from Bigtable's design (with Cassandra and DynamoDB also drawing on Amazon's Dynamo), so their use cases overlap significantly as well.

Google recommends using Cloud Bigtable to store and query all of the following types of data:

  • Time-series data, such as CPU and memory usage over time for multiple servers.
  • Marketing data, such as purchase histories and customer preferences.
  • Financial data, such as transaction histories, stock prices, and currency exchange rates.
  • Internet of Things data, such as usage reports from energy meters and home appliances.
  • Graph data, such as information about how users are connected to one another.

Typical application scenarios for Cassandra and ScyllaDB include:

  • Internet of Things (IoT).
  • Fraud detection applications.
  • Recommendation engines.
  • Product catalogs and playlists.
  • Messaging applications.

HBase use cases include the following:

  • Internet of Things (IoT).
  • Fraud detection applications.
  • Recommendation engines.
  • Product catalogs and playlists.
  • Messaging applications.
  • Customer 360 applications.
  • Web applications.
  • Machine learning model serving.

Check this DB-Engines comparison table to learn more about the differences and similarities between these databases.

Amazon DynamoDB stands a little apart but also builds on the underlying LSM-tree structure. The original Dynamo paper was published in 2007 and inspired the design of Cassandra, so the two have very similar use cases and application areas. DynamoDB provides almost unlimited scalability and sub-millisecond response times; however, it also comes with a few limitations that may impact your application design.

Elasticsearch & Solr

Both Elasticsearch and Solr are document-oriented search engines that provide near real-time full-text search and take advantage of all of Lucene's features. To learn more about their similarities and differences, check this great article published by Asaf Yigal.

Pros & cons

Log-structured storage engines are generally known for better write performance and slower reads, due to the iterative read algorithm. Storage engines use multiple techniques to optimize performance, such as Bloom filters to speed up reads, and size-tiered or leveled compaction strategies.
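The Bloom filter idea can be sketched as follows. This is a deliberately tiny teaching version (real engines use tuned sizes and faster hash functions): before searching a segment, the engine asks the filter, and a negative answer means the key is definitely absent, so that segment can be skipped entirely.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter sketch: k hash positions per key over a
    fixed-size bit array. May yield false positives, never false
    negatives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        # Derive `num_hashes` positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False -> definitely absent; True -> possibly present.
        return all(self.bits[pos] for pos in self._positions(key))
```

An engine keeps one such filter per segment; a lookup for a missing key then touches only the segments whose filters report a (possibly false) positive, instead of every segment on disk.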

Log-structured storage engines need to rewrite data multiple times due to the repeated compaction and merging process. This effect is known as write amplification, and it can be a problem for certain storage types, particularly SSDs, whose blocks can be overwritten only a limited number of times. Even so, SSTable-based storage engines perform fewer write operations than many other storage engines, so they are considered a better choice for write-heavy workloads.

Also, compaction, merging, and sparse indexing with data compression tend to optimize storage and reduce the data volume on disk, which allows storing more data on the same hardware. A compact data representation also allows serving more read and write requests within the available I/O bandwidth.

The compaction and merging process is generally considered both one of the biggest advantages and one of the biggest disadvantages of SSTable storage engines. Despite all the benefits discussed above, compaction and merging consume limited hardware resources. For instance, when the database is empty, disk write bandwidth can be used exclusively for incoming writes, but once many segments accumulate on disk, that bandwidth has to be shared with the compaction process.

SSTable-based storage engines typically do not throttle the rate of incoming writes even if compaction cannot keep up, so this situation needs explicit monitoring.

Another disadvantage of SSTable-based storage is that multiple copies of the same key can exist in different segments, which makes it difficult to implement strong transactional semantics.

Conclusion

Log-structured datastores are becoming increasingly popular as they generally provide better write performance, which matters in the modern era of big data. It is no surprise that the number of database vendors adopting these techniques is growing every day. For instance, ScyllaDB, which we reviewed in this article, was released on September 22, 2015, only five years ago. These databases provide tremendous performance and can handle large amounts of data, which makes them an obvious choice for many of the use cases described above.

In the next article of this series we will look at the internals of SQL databases and compare them with what we have reviewed so far, so stay tuned.