Tkrzw: Difference between revisions
Line 78: | Line 78: | ||
* Improves robustness : database file is not corrupted even under catastrophic situation. |
* Improves robustness : database file is not corrupted even under catastrophic situation. |
||
However, at least in single thread operations, the performance of Tokyo Cabinet is better than Kyoto Cabinet. |
|||
===Effective implementation of hash database=== |
===Effective implementation of hash database=== |
Revision as of 18:07, 10 July 2013
This article needs more links to other articles to help integrate it into the encyclopedia. (June 2013) |
Developer(s) | FAL Labs |
---|---|
Stable release | 1.2.76
|
Repository | |
Written in | C++ |
Type | Database engine, library |
License | GNU General Public License |
Website | fallabs |
Kyoto Cabinet is a library of routines for managing a database. The database is a simple data file containing records, each is a pair of a key and a value. Every key and value is serial bytes with variable length. Both binary data and character string can be used as a key and a value. Each key must be unique within a database. There is neither concept of data tables nor data types. Records are organized in hash table or B+ tree.
The following access methods are provided to the database:
- storing a record with a key and a value
- deleting a record by a key
- retrieving a record by a key
Moreover, traversal access to every key are provided. These access methods are similar to ones of the original dbm (and its followers: NDBM and GDBM) library defined in the UNIX standard. Kyoto Cabinet is an alternative for the DBM because of its higher performance.
Each operation of the hash database has the time complexity of "O(1)". Therefore, in theory, the performance is constant regardless of the scale of the database. In practice, the performance is determined by the speed of the main memory or the storage device. If the size of the database is less than the capacity of the main memory, the performance will seem on-memory speed, which is faster than std::map
of STL. Of course, the database size can be greater than the capacity of the main memory and the upper limit is 8 exabytes. Even in that case, each operation needs only one or two seeking of the storage device.
Each operation of the B+ tree database has the time complexity of "O(log N)". Therefore, in theory, the performance is logarithmic to the scale of the database. Although the performance of random access of the B+ tree database is slower than that of the hash database, the B+ tree database supports sequential access in order of the keys, which realizes forward matching search for strings and range search for integers. The performance of sequential access is much faster than that of random access.
As the API is based on object-oriented design, the hash database and the B+ tree database have same methods which inherited from the upper abstract class. Beside them, seven kinds of databases are provided under the same base class. The prototype hash database is powered by the standard container of std::unordered_map
. The prototype tree database is powered by the standard container of std::map
. The stash database is powered by the original implementation of naive hash map saving memory. The cache hash database is powered by the original implementation of doubly-linked hash map with LRU deletion algorithm. The cache tree database is powered by the cache hash database and provides B+ tree mechanism. The directory hash database is powered by the directory mechanism of the file system and stores records as respective files in a directory. The directory tree database is powered by the directory hash database and provides B+ tree mechanism. All databases have practical utility methods related to transaction and cursor. Programs for command-line interface are also included in the package.
Kyoto Cabinet runs very fast. For example, elapsed time to store one million records is 0.9 seconds for the hash database, and 1.1 seconds for the B+ tree database. Moreover, the size of database is very small. For example, overhead for a record is 16 bytes for the hash database, and 4 bytes for the B+ tree database. Furthermore, scalability of Kyoto Cabinet is great. The database size can be up to 8 EB (9.22e18 bytes).
Kyoto Cabinet is written in the C++ language, and provided as API of C++, C, Java, Python, Ruby, Perl, and Lua. Kyoto Cabinet is available on platforms which have API conforming to C++03 with the TR1 library extensions. Kyoto Cabinet is a free software licensed under the GNU General Public License (GPL). The FOSS License Exception is also provided in order to accommodate products under other free and open source licenses. The Specific FOSS Library Linking Exception is also provided in order to be available in some specific FOSS libraries. On the other hand, a commercial license is also provided. If you use Kyoto Cabinet within a proprietary software, the commercial license is required.
Features
Genealogy
The original DBM was developed by Ken Thompson as a part of the original AT&T UNIX. After that, a lot of followers developed such DBM-like products as NDBM, SDBM, GDBM, TDB, and Berkeley DB. In 2003, Mikio Hirabayashi developed QDBM to replace GDBM for performance reason.
In 2007, Tokyo Cabinet was developed as the successor to QDBM on the following purposes:
- Improves space efficiency: smaller size of database file.
- Improves time efficiency: faster processing speed.
- Improves parallelism: higher performance in multi-thread environment.
- Improves usability: simplified API.
- Improves robustness: database file is not corrupted even under catastrophic situation.
- Supports 64-bit architecture: enormous memory space and database file are available.
These purposes were achieved, making Tokyo Cabinet suitable as a replacement to conventional DBM software.
In 2009, Kyoto Cabinet was developed as another successor to QDBM. Compared with the sibling product (Tokyo Cabinet), the following advantages were pursued:
- Improves space efficiency : smaller size of database file.
- Improves parallelism : higher performance in multi-thread environment.
- Improves portability : abstraction of the lower layer to support non-POSIX systems.
- Improves usability : simplified API, object-oriented design.
- Improves robustness : database file is not corrupted even under catastrophic situation.
However, at least in single thread operations, the performance of Tokyo Cabinet is better than Kyoto Cabinet.
Effective implementation of hash database
Kyoto Cabinet uses hash algorithm to retrieve records. If a bucket array has sufficient number of elements, the time complexity of retrieval is "O(1)". That is, the time required for retrieving a record is constant, regardless of the scale of a database. It is also the same about storing and deleting. Collision of hash values is managed by separate chaining. Data structure of the chains is binary search tree. Even if a bucket array has unusually scarce elements, the time complexity of retrieval is "O(log n)".
Kyoto Cabinet attains performance improvement in retrieval by loading the whole of the bucket array onto the RAM. If the bucket array is on RAM, it is possible to access a region of a target record by about one set of file operations such as "lseek", "read", and "write". The bucket array saved in a file is not read into RAM with the "read" call but directly mapped to RAM with the "mmap" call. Therefore, preparation time on connecting to a database is very short, and two or more processes can share the same memory map.
The hash function used for hash table is MurmurHash 2.0. If the number of elements of the bucket array is about a half of records stored within a database, although it depends on characteristic of the input, the probability of collision of hash values is about 55.3% (35.5% if the same, 20.4% if twice, 11.0% if four times, 5.7% if eight times). In that case, it is possible to retrieve a record by two or less sets of file operations. If it is made into a performance index, in order to handle a database containing one million of records, a bucket array with half a million of elements is required. The size of each element is 6 bytes. That is, if 3M bytes of RAM is available, a database containing one million records can be handled.
When overwriting a record with a value whose size is greater than the existing one, it is necessary to move the region to another position of the file. Because the time complexity of the operation depends on the size of the region of a record, extending values successively is inefficient. However, Kyoto Cabinet deals with this problem by alignment. If the incremental data can be placed in the padding region trailing the records, it is not necessary to move the region of the record.
Generally speaking, while succession of updating, fragmentation of available regions occurs, and the size of a database grows rapidly. Kyoto Cabinet deals with this problem by the free block pool and the automatic defragmentation mechanism. If a record is removed or shifted to another position, the region will be treated as a free block. The free block pool manages free blocks and reuses the best fit region for a new record. The automatic defragmentation is to shift records and free blocks separately. Successive free blocks coalesce into one.
Useful implementation of B+ tree database
Although the B+ tree database is slower than the hash database, it features ordering access to each record. The order can be assigned by users. Records in the B+ tree database are sorted and arranged in logical pages. Sparse index organized in B tree that is multiway balanced tree are maintained for each page. Thus, the time complexity of retrieval and so on is "O(log n)". Cursor is provided to access each record in order. The cursor can jump to a position specified by a key and can step forward or backward from the current position. Because each page is arranged as double linked list, the time complexity of stepping cursor is "O(1)".
The B+ tree database is implemented based on the above the hash database. Because each page of the B+ tree database is stored as each record in the hash database, the B+ tree database inherits efficiency of storage management of the hash database. Because the header of each record is smaller and alignment of each page is adjusted according to the page size, in most cases, the size of database file is cut by half compared to one of the hash database.
Although operations of many pages are required to update the B+ tree database, Kyoto Cabinet expedites the process by the page cache and reducing file operations. The page cache is implemented with double layered LRU list, which realizes that frequently accessed pages are cached in the "hot" list and recently accessed pages are cached in the "warm" LRU list. If the page cache works efficiently and the whole of the sparse index is cached on memory, it is possible to retrieve a record by one or less set of file operations.
Practical functionality
Kyoto Cabinet features transaction mechanisms. It is possible to commit a series of operations between the beginning and the end of the transaction in a lump, or to abort the transaction and perform rollback to the state before the transaction. Two isolation levels are supported: "serializable" and "read uncommitted". Durability is secured by write ahead logging and shadow paging.
Automatic transaction and automatic recovery mechanisms are also supported. If the automatic transaction option is specified when opening the database, every updating operation is guarded by transaction which is committed implicitly. Therefore, durability can be assured without explicit transaction operations. The automatic recovery mechanism works after the database is crashed outside transaction. If inconsistency of the database is detected when opening the database, all regions are scanned as with "fsck" and the database is reconstructed with surviving records implicitly.
Kyoto Cabinet provides two modes to connect to a database: "reader" and "writer". A reader can perform retrieving but neither storing nor deleting. A writer can perform all access methods. Exclusion control between processes is performed when connecting to a database by file locking. While a writer is connected to a database, neither readers nor writers can be connected. While a reader is connected to a database, other readers can be connect, but writers can not. According to this mechanism, data consistency is guaranteed with simultaneous connections in multitasking environment.
Functions of API are reentrant and available in multi-thread environment. Different database objects can be operated in parallel entirely. For simultaneous operations against the same database object, rwlock (reader-writer lock) is used for exclusion control. That is, while a writing thread is operating an object, other reading threads and writing threads are blocked. However, while a reading thread is operating an object, reading threads are not blocked. Locking granularity depends on data structures. The hash database uses record locking. The B+ tree database uses page locking.
In order to improve performance and concurrency, Kyoto Cabinet uses such atomic operations built in popular CPUs as atomic-increment and CAS (compare-and-swap). Lock primitives provided by the native environment such as the POSIX thread package are alternated by own primitives using CAS.
Simple but flexible interfaces
Kyoto Cabinet provides simple APIs based on object-oriented design. Every operation for database is encapsulated and published as lucid methods as "open", "close", "set", "remove", "get", and so on. The classes of the hash database and the B+ tree database are derived class of the common abstract class which defines the interface. Porting an application from one database to another is easy. Moreover, the polymorphic database API is provided to assign a database in run-time.
Kyoto Cabinet supports the visitor pattern. You can define arbitrary database operations with call back functions. The visitor class encapsulates that call back functions and their state data. The database class has the "accept" method, which accepts an instance of the visitor class and calls its functions with a record data. The return value of the call back function is reflected as the new state of the record.
In addition, a lot of useful utilities are provided such as "prefix search", "regex search", "logging", "hot backup", "pseudo-snapshot", and "merging". A framework for MapReduce is also provided. Although it is not distributed, it is useful for aggregate calculation with less CPU loading and less memory usage. The plain text database is an interface to treat a plain text file as a database file. It is useful to use a text file as input or output data for the MapReduce framework. The index database is a wrapper of a polymorphic database in order to improve the efficiency of the "append" operation. It is useful to construct inverted indices.
While the core API is provided for C++, bindings for other languages such as C, Java, Python, Ruby, Perl, and Lua are also provided. Command-line interfaces are also provided corresponding to each API. They are useful for prototyping, test, and debugging.