Oftentimes developers have troubles to select the right database engine for their applications. To choose the right database for the applications, it is critical to understand the data access patterns.
A database system is nothing but an indexed structure with the actual data storage. The actual data storage is almost the same for modern database systems, either SSD drives or hosts with large memory size. Retrieving data from disks is slow, especially without index. To read one block of data, at worst case, the entire disk needs to be scanned. So modern database systems will all have index implemented for efficient read and write operations.
Database systems with different implementations of the index will have different performances. The next section will cover three types of database engine indexes and the appropriate use-cases.
HASH
Hash index is very straightforward in implementation, it behaves similar to the hash table in software programs, just a key-value storage. Note that it is not widely adopted in production, one of the main reason is that Hash index is not durable because most of them lives in memory. So if the database server restarts, the data might not be recovered. But, PostgreSQL 10 introduced durable Hash Index implementation so the above issue has been resolved.
Write data
The Hash function will calculate a hash key based on the write value, and then store the memory location as the value. This operation is usually done in constant time, compared with other index implementations, it is a lot faster.
Read data
Similarly to the write data, given the data key, the system will find the hash key of this data, and the value look up can be done in constant time. This data access pattern is great for equality check, such as where someAttribute = someValue.
But for range query, for example if the application is reading a column value or conditions like where someAttribute > someValue
or Group By someAttribute
, the read operation could be very costly since it needs to look up individual data and handle those range query internally before returning back to the applications.
When choose a Hash Index database, the developers must design their applications carefully since Hash index has restrictions that limit its use-cases despite its great performance on read and write data.
LSM / SSTable
LSM(Log Structured Merge Tree) is a data structure that is designed to handle write heavily data accessing programs. Known databases that uses LSM index are: Apache Cassandra, HBase, and Google Cloud BigTable.
Write data
LSM writes to disk in format of SST(Sorted String Table), a key-value storage table with sorted keys. The basic unit of SST is called Segments, as illustrated below. Each SST consists of many segments.
How does LSM generate such data format and when does it flush the data to disk? LSM maintains a fix sized red-black tree in memory. Whenever there is a write request, LSM will insert the node to red-black tree without accessing the disks. Once the red-black tree has reached to the defined size, usually the size is the one page or multiple page sizes of the external drive, LSM will convert the data into SST format and flush it all to the disks. The batch write and delay insertion is the key reason why LSM indexed database system could handle write-heavy loads.
Additionally, data compaction can be performed on the SST table to optimize the disk usage. In the example shown above, once the compaction is done, only one row will be stored and the row value is apple: 10, banana: 15, dog: 20.
Read data
When reading data, first it will check if the data is stored in memory by traversing the red-black tree. What do we do if the data is not stored in memory? A brutal force solution is to scan the SST to find the value but it would be very costly. To improve the efficiency on read operation, the sparse index is introduced, illustrated as below.
A sparse index stores some key in the SST as its key, and the data offset location as its value to enable fast look up. For instance, if we want to retrieve key avocado, by looking at the sparse index table, we know it is stored somewhere between apple and banana. Instead of scanning the entire table, we only need to look and the block stored between offset 152 and 200. This design also enables efficient range query.
B/B+ Tree
B tree stands for self-balanced tree. It supports search, insertion and deletion in logarithmic time. It can be effectively used to support read and range query operations. Here is the list of databases uses B tree/B+ tree as index: MySQL, PostgreSQL, and MongoDB.
Write data
The basic unit of B tree is node. Each node will have n keys, some metadata information storing reference to the storage, and (n+1) child nodes.
For a write request, the B tree will search for the insert location, just like any binary search tree would do, and then insert the node. Additionally, since each node has a fixed number of child nodes, if the child nodes reach to the limit, it will do a split, a process of electing a new root node from the child nodes and reform the tree.
As the graph shows the 1 key node, but in practice how big should the size of node be? To optimize the performance on both write and read, the node size is often equal to the page size of the external drive. For example, if the page size of the external drive is 16KB and each key has a size of 400B, then each node would contains 16000/400 = 40 keys and 41 child nodes possibly. In this way data can be operated in the minimum size of block to avoid the duplicated random access to external drive.
Read data
Since B tree is self-balanced, the data retrieve can be done in logarithmic time and directed accessed to page location of the external drive. This combination allows high volume reads to be finished with very low latency.
B+ Tree
B+Tree is an advanced version of B tree. There are two key differences:
- Node in B+ Tree does not store reference, it only stores the index, which requires fewer spaces. This enables each node stores more keys.
- B+ tree will elect search node during split, and store the search nodes as leaf node. More importantly, all the leaf nodes are linked. Thus, for a full scan of the data, it only needs to go to the beginning of the leaf node and do a linear traversal.
In summary, B+ tree has better performance on read and range query and that is one of the primary reasons why most relational databases use it as search index so often times the data are queried in multiple rows together.
Conclusion
- If the data storage can be stored in memory and the applications do many equality read, then you should consider choose database with Hash Index.
- If your applications are write intensive, and read pattern is range query, then you should consider choose database with LSM tree index.
- Lastly, if your applications are read intensive and do many range query, such as
LIKE "%s"
orGROUP BY
, then you should definitely consider using database system with B/B+ tree index.