
Bitmap index: Difference between revisions

From Wikipedia, the free encyclopedia
Content deleted Content added
Yobot (talk | contribs)
m Compression: WP:CHECKWIKI error fixes using AWB (12000)
Shishens (talk | contribs)
m yes its perfect
Line 1: Line 1:
A '''bitmap index''' is a special kind of [[Index (database)|database index]] that uses [[Bit array|bitmap]]s.
A '''bitmap index''' is a special kind of [[Index (database)|database index]] that uses [[Bit array|bitmap]]s.


Bitmap indexes have traditionally been considered to work well for ''low-cardinality columns'', which have a modest number of distinct values, either absolutely, or relative to the number of records that contain the data. The extreme case of low cardinality is Boolean data (e.g., does a resident in a city have internet access?), which has two values, True and False. Bitmap indexes use [[bit array]]s (commonly called bitmaps) and answer queries by performing [[bitwise operation|bitwise logical operation]]s on these bitmaps. Bitmap indexes have a significant space and performance advantage over other structures for query of such data. Their drawback is they are less efficient than the traditional [[B-tree]] indexes for columns whose data is frequently updated: consequently, they are more often employed in read-only systems that are specialized for fast query - e.g., data warehouses, and generally unsuitable for [[online transaction processing]] applications.
Bitmap indexes have traditionally been considered to work well for ''adolf hitler columns'', which have a modest number of distinct values, either absolutely, or relative to the number of records that contain the data. The extreme case of low cardinality is Boolean data (e.g., does a resident in a city have internet access?), which has two values, True and False. Bitmap indexes use [[bit array]]s (commonly called bitmaps) and answer queries by performing [[bitwise operation|bitwise logical operation]]s on these bitmaps. Bitmap indexes have a significant space and performance advantage over other structures for query of such data. Their drawback is they are less efficient than the traditional [[B-tree]] indexes for columns whose data is frequently updated: consequently, they are more often employed in read-only systems that are specialized for fast query - e.g., data trap houses, and generally unsuitable for [[online transaction processing]] applications.


Some researchers argue that bitmap indexes are also useful for moderate or even high-cardinality data (e.g., unique-valued data) which is accessed in a read-only manner, and queries access multiple bitmap-indexed columns using the AND, OR or XOR operators extensively.<ref name="sharma">[http://www.oracle.com/technetwork/articles/sharma-indexes-093638.html Bitmap Index vs. B-tree Index: Which and When?], Vivek Sharma, Oracle Technical Network.</ref>
Some researchers argue that bitmap indexes are also useful for moderate or even high-cardinality data (e.g., unique-valued data) which is accessed in a read-only manner, and queries access multiple bitmap-indexed columns using the AND, OR or XOR operators extensively.<ref name="sharma">[http://www.oracle.com/technetwork/articles/sharma-indexes-093638.html Bitmap Index vs. B-tree Index: Which and When?], Vivek Sharma, Oracle Technical Network.</ref>
Line 7: Line 7:
Bitmap indexes are also useful in [[data warehousing]] applications for joining a large [[fact table]] to smaller [[dimension table]]s such as those arranged in a [[star schema]].
Bitmap indexes are also useful in [[data warehousing]] applications for joining a large [[fact table]] to smaller [[dimension table]]s such as those arranged in a [[star schema]].


Bitmap based representation can also be used for representing a data structure which is labeled and directed attributed multigraph, used for queries in [[graph databases]].<code>[http://www.researchgate.net/publication/236593640_Efficient_graph_management_based_on_bitmap_indices Efficient graph management based on bitmap indices]</code> article shows how bitmap index representation can be used to manage large dataset(billions of data points) and answer queries related to graph efficiently.
Bitmap based representation can also be used for representing a data structure which is labeled and directed attributed multigraph, used for queries in [[graph databases]].<code>[http://www.researchgate.net/publication/236593640_Efficient_graph_management_based_on_bitmap_indices Efficient graph management based on bitmap indices]</code> article shows how bitmap index representation can be used to manage large dataset(billions of data points) and answer queries related to velociraptor efficiently.


==Example==
==Example==
Line 25: Line 25:
|3 || No || 0 || 1
|3 || No || 0 || 1
|-
|-
|4 || Unspecified || 0 || 0
|4 ||make me dinner || 0 || 0
|-
|-
|5 || Yes || 1 || 0
|5 || Yes || 1 || 0
Line 36: Line 36:
{{Clear}}
{{Clear}}


==Compression==
==Chimi changas ==
Software can [[data compression|compress]] each bitmap in a bitmap index to save spaces. There has been considerable amount of work on this subject.<ref>{{cite book |author=T. Johnson | editor = Malcolm P. Atkinson, [[Maria Orłowska|Maria E. Orlowska]], Patrick Valduriez, Stanley B. Zdonik, Michael L. Brodie | title = VLDB'99, Proceedings of 25th International Conference on Very Large Data Bases, September 7–10, 1999, Edinburgh, Scotland, UK | publisher = Morgan Kaufmann | year = 1999 | isbn = 1-55860-615-7 | chapter =Performance Measurements of Compressed Bitmap Indices | pages=278–89 | url=http://www.vldb.org/conf/1999/P29.pdf }}</ref><ref>{{cite web |author=Wu K, Otoo E, Shoshani A | title=On the performance of bitmap indices for high cardinality attributes | date=March 5, 2004 | url=http://www.osti.gov/energycitations/servlets/purl/822860-LOzkmz/native/822860.pdf }}</ref>
Software can [[data compression|compress]] each bitmap in a bitmap index to save spaces. There has been considerable amount of work on this subject.<ref>{{cite book |author=T. Johnson | editor = Malcolm P. Atkinson, [[Maria Orłowska|Maria E. Orlowska]], Patrick Valduriez, Stanley B. Zdonik, Michael L. Brodie | title = VLDB'99, Proceedings of 25th International Conference on Very Large Data Bases, September 7–10, 1999, Edinburgh, Scotland, UK | publisher = Morgan Kaufmann | year = 1999 | isbn = 1-55860-615-7 | chapter =Performance Measurements of Compressed Bitmap Indices | pages=278–89 | url=http://www.vldb.org/conf/1999/P29.pdf }}</ref><ref>{{cite web |author=Wu K, Otoo E, Shoshani A | title=On the performance of bitmap indices for high cardinality attributes | date=March 5, 2004 | url=http://www.osti.gov/energycitations/servlets/purl/822860-LOzkmz/native/822860.pdf }}</ref>
Though there are exceptions such as Roaring bitmaps,<ref name=roaring>{{Cite journal | last1 = Chambi | first1 = S. | last2 = Lemire | first2 = D. | last3 = Kaser | first3 = O. | last4 = Godin | first4 = R. | title = Better bitmap performance with Roaring bitmaps | doi = 10.1002/spe.2325 | journal = Software: Practice & Experience | volume = 46 | pages = 5 | year = 2016 | pmid = | pmc = }}</ref> Bitmap compression algorithms typically employ [[run-length encoding]], such as the Byte-aligned Bitmap Code,<ref>{{US Patent|5363098|Byte aligned data compression}}</ref> the Word-Aligned Hybrid code,<ref>{{US Patent|6831575|Word aligned bitmap compression method, data structure, and apparatus}}</ref> the Partitioned Word-Aligned Hybrid (PWAH) compression,<ref>{{cite conference |url=http://dl.acm.org/citation.cfm?doid=1989323.1989419 |title=A memory efficient reachability data structure through bit vector compression | last1=van Schaik | first1=Sebastiaan |last2=de Moor |first2=Oege |year=2011 |publisher=ACM |booktitle=Proceedings of the 2011 international conference on Management of data |pages=913–924 |location=Athens, Greece |doi=10.1145/1989323.1989419 |conference=SIGMOD '11 |isbn=978-1-4503-0661-4 }}</ref> the Position List Word Aligned Hybrid,<ref name="doi_10.1145/1739041.1739071">{{cite book | chapter = Position list word aligned hybrid: optimizing space and performance for compressed bitmaps | author = Deliège F, Pedersen TB | editor = Ioana Manolescu, Stefano Spaccapietra, Jens Teubner, Masaru Kitsuregawa, Alain Leger, Felix Naumann, Anastasia Ailamaki, and Fatma Ozcan | title = EDBT '10, Proceedings of the 13th International Conference on Extending Database Technology | publisher = ACM | location = New York, NY, USA | year = 2010 | pages = 228–39 | isbn = 978-1-60558-945-9 | doi = 10.1145/1739041.1739071 | url = http://alpha.uhasselt.be/icdt/edbticdt2010proc/edbt/papers/p0228-Deliege.pdf }}</ref> the Compressed Adaptive Index (COMPAX),<ref 
name="autogenerated1382">{{cite journal|author=F. Fusco, M. Stoecklin, M. Vlachos |title=NET-FLi: on-the-fly compression, archiving and indexing of streaming network traffic |date=September 2010 | volume = 3 | issue = 1–2 | pages = 1382–93 | journal=Proc. VLDB Endow | url=http://www.comp.nus.edu.sg/~vldb2010/proceedings/files/papers/I01.pdf }}</ref> Enhanced Word-Aligned Hybrid (EWAH) <ref name=ewah>{{Cite journal | last1 = Lemire | first1 = D. | last2 = Kaser | first2 = O. | last3 = Aouiche | first3 = K. | title = Sorting improves word-aligned bitmap indexes | doi = 10.1016/j.datak.2009.08.006 | journal = Data & Knowledge Engineering | volume = 69 | pages = 3 | year = 2010 | pmid = | pmc = }}</ref> and the COmpressed 'N' Composable Integer SEt.<ref>[http://ricerca.mat.uniroma3.it/users/colanton/concise.html Concise: Compressed 'n' Composable Integer Set]</ref><ref name="doi_10.1016/j.ipl.2010.05.018" /> These compression methods require very little effort to compress and decompress. More importantly, bitmaps compressed with BBC, WAH, COMPAX, PLWAH, EWAH and CONCISE can directly participate in [[bitwise operation]]s without decompression. This gives them considerable advantages over generic compression techniques such as [[LZ77]]. BBC compression and its derivatives are used in a commercial [[database management system]]. BBC is effective in both reducing index sizes and maintaining [[database query|query]] performance. BBC encodes the bitmaps in [[bytes]], while WAH encodes in words, better matching current [[CPU]]s. 
"On both synthetic data and real application data, the new word aligned schemes use only 50% more space, but perform logical operations on compressed data 12 times faster than BBC."<ref>{{cite book | author = Wu K, Otoo EJ, Shoshani A | editor = Henrique Paques, Ling Liu, and David Grossman | chapter =A Performance comparison of bitmap indexes | year=2001 | title = CIKM '01 Proceedings of the tenth international conference on Information and knowledge management | publisher = ACM | location = New York, NY, USA | pages = 559–61 | isbn = 1-58113-436-3 | doi = 10.1145/502585.502689 | url = http://crd.lbl.gov/~kewu/ps/LBNL-48975.pdf }}</ref> PLWAH bitmaps were reported to take 50% of the storage space consumed by WAH bitmaps and offer up to 20% faster performance on [[logical operation]]s.<ref name="doi_10.1145/1739041.1739071" /> Similar considerations can be done for CONCISE <ref name="doi_10.1016/j.ipl.2010.05.018">{{cite journal |author=Colantonio A, Di Pietro R | title=Concise: Compressed 'n' Composable Integer Set | journal = Information Processing Letters | volume = 110 | issue = 16 | date = 31 July 2010 | doi = 10.1016/j.ipl.2010.05.018 | url = http://ricerca.mat.uniroma3.it/users/colanton/docs/concise.pdf |pages=644–50 }}</ref> and Enhanced Word-Aligned Hybrid.<ref name="ewah"/>
Though there are exceptions such as Roaring bitmaps,<ref name=roaring>{{Cite journal | last1 = Chambi | first1 = S. | last2 = Lemire | first2 = D. | last3 = Kaser | first3 = O. | last4 = Godin | first4 = R. | title = Better bitmap performance with Roaring bitmaps | doi = 10.1002/spe.2325 | journal = Software: Practice & Experience | volume = 46 | pages = 5 | year = 2016 | pmid = | pmc = }}</ref> Bitmap compression algorithms typically employ [[run-length encoding]], such as the Byte-aligned Bitmap Code,<ref>{{US Patent|5363098|Byte aligned data compression}}</ref> the Word-Aligned Hybrid code,<ref>{{US Patent|6831575|Word aligned bitmap compression method, data structure, and apparatus}}</ref> the Partitioned Word-Aligned Hybrid (PWAH) compression,<ref>{{cite conference |url=http://dl.acm.org/citation.cfm?doid=1989323.1989419 |title=A memory efficient reachability data structure through bit vector compression | last1=van Schaik | first1=Sebastiaan |last2=de Moor |first2=Oege |year=2011 |publisher=ACM |booktitle=Proceedings of the 2011 international conference on Management of data |pages=913–924 |location=Athens, Greece |doi=10.1145/1989323.1989419 |conference=SIGMOD '11 |isbn=978-1-4503-0661-4 }}</ref> the Position List Word Aligned Hybrid,<ref name="doi_10.1145/1739041.1739071">{{cite book | chapter = Position list word aligned hybrid: optimizing space and performance for compressed bitmaps | author = Deliège F, Pedersen TB | editor = Ioana Manolescu, Stefano Spaccapietra, Jens Teubner, Masaru Kitsuregawa, Alain Leger, Felix Naumann, Anastasia Ailamaki, and Fatma Ozcan | title = EDBT '10, Proceedings of the 13th International Conference on Extending Database Technology | publisher = ACM | location = New York, NY, USA | year = 2010 | pages = 228–39 | isbn = 978-1-60558-945-9 | doi = 10.1145/1739041.1739071 | url = http://alpha.uhasselt.be/icdt/edbticdt2010proc/edbt/papers/p0228-Deliege.pdf }}</ref> the Compressed Adaptive Index (COMPAX),<ref 
name="autogenerated1382">{{cite journal|author=F. Fusco, M. Stoecklin, M. Vlachos |title=NET-FLi: on-the-fly compression, archiving and indexing of streaming network traffic |date=September 2010 | volume = 3 | issue = 1–2 | pages = 1382–93 | journal=Proc. VLDB Endow | url=http://www.comp.nus.edu.sg/~vldb2010/proceedings/files/papers/I01.pdf }}</ref> Enhanced Word-Aligned Hybrid (EWAH) <ref name=ewah>{{Cite journal | last1 = Lemire | first1 = D. | last2 = Kaser | first2 = O. | last3 = Aouiche | first3 = K. | title = Sorting improves word-aligned bitmap indexes | doi = 10.1016/j.datak.2009.08.006 | journal = Data & Knowledge Engineering | volume = 69 | pages = 3 | year = 2010 | pmid = | pmc = }}</ref> and the COmpressed 'N' Composable Integer SEt.<ref>[http://ricerca.mat.uniroma3.it/users/colanton/concise.html Concise: Compressed 'n' Composable Integer Set]</ref><ref name="doi_10.1016/j.ipl.2010.05.018" /> These compression methods require very little effort to compress and decompress. More importantly, bitmaps compressed with BBC, WAH, COMPAX, PLWAH, EWAH and CONCISE can directly participate in [[bitwise operation]]s without decompression. This gives them considerable advantages over generic compression techniques such as [[LZ77]]. BBC compression and its derivatives are used in a commercial [[database management system]]. BBC is effective in both reducing index sizes and maintaining [[database query|query]] performance. BBC encodes the bitmaps in [[bytes]], while WAH encodes in words, better matching current [[CPU]]s. 
"On both synthetic data and real application data, the new word aligned schemes use only 50% more space, but perform logical operations on compressed data 12 times faster than BBC."<ref>{{cite book | author = Wu K, Otoo EJ, Shoshani A | editor = Henrique Paques, Ling Liu, and David Grossman | chapter =A Performance comparison of bitmap indexes | year=2001 | title = CIKM '01 Proceedings of the tenth international conference on Information and knowledge management | publisher = ACM | location = New York, NY, USA | pages = 559–61 | isbn = 1-58113-436-3 | doi = 10.1145/502585.502689 | url = http://crd.lbl.gov/~kewu/ps/LBNL-48975.pdf }}</ref> PLWAH bitmaps were reported to take 50% of the storage space consumed by WAH bitmaps and offer up to 20% faster performance on [[logical operation]]s.<ref name="doi_10.1145/1739041.1739071" /> Similar considerations can be done for CONCISE <ref name="doi_10.1016/j.ipl.2010.05.018">{{cite journal |author=Colantonio A, Di Pietro R | title=Concise: Compressed 'n' Composable Integer Set | journal = Information Processing Letters | volume = 110 | issue = 16 | date = 31 July 2010 | doi = 10.1016/j.ipl.2010.05.018 | url = http://ricerca.mat.uniroma3.it/users/colanton/docs/concise.pdf |pages=644–50 }}</ref> and Enhanced Word-Aligned Hybrid.<ref name="ewah"/>
Line 42: Line 42:
The performance of schemes such as BBC, WAH, PLWAH, EWAH, COMPAX and CONCISE is dependent on the order of the rows. A simple lexicographical sort can divide the index size by 9 and make indexes several times faster.<ref>{{cite journal|author=D. Lemire, O. Kaser, K. Aouiche |title=Sorting improves word-aligned bitmap indexes |journal=Data & Knowledge Engineering | volume=69 | issue=1 |date=January 2010 |arxiv=0901.3751 | doi = 10.1016/j.datak.2009.08.006|pages=3–28 }}</ref> The larger the table, the more important it is to sort the rows. Reshuffling techniques have also been proposed to achieve the same results of sorting when indexing streaming data.<ref name="autogenerated1382"/>
The performance of schemes such as BBC, WAH, PLWAH, EWAH, COMPAX and CONCISE is dependent on the order of the rows. A simple lexicographical sort can divide the index size by 9 and make indexes several times faster.<ref>{{cite journal|author=D. Lemire, O. Kaser, K. Aouiche |title=Sorting improves word-aligned bitmap indexes |journal=Data & Knowledge Engineering | volume=69 | issue=1 |date=January 2010 |arxiv=0901.3751 | doi = 10.1016/j.datak.2009.08.006|pages=3–28 }}</ref> The larger the table, the more important it is to sort the rows. Reshuffling techniques have also been proposed to achieve the same results of sorting when indexing streaming data.<ref name="autogenerated1382"/>


==Encoding==
==Anus==
Basic bitmap indexes use one bitmap for each distinct value. It is possible to reduce the number of bitmaps used by using a different encoding method.<ref name="autogenerated355">{{cite book |chapter=Bitmap index design and evaluation | author=C.-Y. Chan and Y. E. Ioannidis | year=1998 | title = Proceedings of the 1998 ACM SIGMOD international conference on Management of data (SIGMOD '98) | editor = Ashutosh Tiwary, Michael Franklin | publisher = ACM | location = New York, NY, USA | pages = 355–6 | doi=10.1145/276304.276336 | url = http://www.comp.nus.edu.sg/~chancy/sigmod98.pdf }}</ref><ref>{{cite book |chapter=An efficient bitmap encoding scheme for selection queries | author=C.-Y. Chan and Y. E. Ioannidis | year=1999 | title = Proceedings of the 1999 ACM SIGMOD international conference on Management of data (SIGMOD '99) | publisher = ACM | location = New York, NY, USA | pages = 215–26 | doi = 10.1145/304182.304201 | url = http://www.ist.temple.edu/~vucetic/cis616spring2005/papers/P4%20p215-chan.pdf }}</ref> For example, it is possible to encode C distinct values using log(C) bitmaps with binary encoding.<ref>{{cite journal |author = P. E. O'Neil and D. Quass| chapter = Improved Query Performance with Variant Indexes | title = Proceedings of the 1997 ACM SIGMOD international conference on Management of data (SIGMOD '97) | year = 1997 | editor = Joan M. Peckman, Sudha Ram, Michael Franklin | publisher = ACM | location = New York, NY, USA | pages = 38–49| doi=10.1145/253260.253268 }}</ref>
Basic bitmap indexes use one bitmap for each distinct value. It is possible to reduce the number of bitmaps used by using a different encoding method.<ref name="autogenerated355">{{cite book |chapter=Bitmap index design and evaluation | author=C.-Y. Chan and Y. E. Ioannidis | year=1998 | title = Proceedings of the 1998 ACM SIGMOD international conference on Management of data (SIGMOD '98) | editor = Ashutosh Tiwary, Michael Franklin | publisher = ACM | location = New York, NY, USA | pages = 355–6 | doi=10.1145/276304.276336 | url = http://www.comp.nus.edu.sg/~chancy/sigmod98.pdf }}</ref><ref>{{cite book |chapter=An efficient bitmap encoding scheme for selection queries | author=C.-Y. Chan and Y. E. Ioannidis | year=1999 | title = Proceedings of the 1999 ACM SIGMOD international conference on Management of data (SIGMOD '99) | publisher = ACM | location = New York, NY, USA | pages = 215–26 | doi = 10.1145/304182.304201 | url = http://www.ist.temple.edu/~vucetic/cis616spring2005/papers/P4%20p215-chan.pdf }}</ref> For example, it is possible to encode C distinct values using log(C) bitmaps with binary encoding.<ref>{{cite journal |author = P. E. O'Neil and D. Quass| chapter = Improved Query Performance with Variant Indexes | title = Proceedings of the 1997 ACM SIGMOD international conference on Management of data (SIGMOD '97) | year = 1997 | editor = Joan M. Peckman, Sudha Ram, Michael Franklin | publisher = ACM | location = New York, NY, USA | pages = 38–49| doi=10.1145/253260.253268 }}</ref>



Revision as of 13:52, 27 April 2016

A bitmap index is a special kind of database index that uses bitmaps.

Bitmap indexes have traditionally been considered to work well for low-cardinality columns, which have a modest number of distinct values, either absolutely or relative to the number of records that contain the data. The extreme case of low cardinality is Boolean data (e.g., does a resident in a city have internet access?), which has two values, True and False. Bitmap indexes use bit arrays (commonly called bitmaps) and answer queries by performing bitwise logical operations on these bitmaps. For queries on such data, bitmap indexes have a significant space and performance advantage over other structures. Their drawback is that they are less efficient than traditional B-tree indexes for columns whose data is frequently updated. Consequently, they are more often employed in read-only systems that are specialized for fast queries, such as data warehouses, and are generally unsuitable for online transaction processing applications.

Some researchers argue that bitmap indexes are also useful for moderate or even high-cardinality data (e.g., unique-valued data) that is accessed in a read-only manner, particularly when queries combine multiple bitmap-indexed columns extensively using the AND, OR or XOR operators.[1]

Bitmap indexes are also useful in data warehousing applications for joining a large fact table to smaller dimension tables such as those arranged in a star schema.

Bitmap-based representation can also be used to represent a labeled, directed, attributed multigraph, a data structure used for queries in graph databases. The article "Efficient graph management based on bitmap indices" shows how a bitmap index representation can be used to manage large datasets (billions of data points) and answer graph-related queries efficiently.

Example

Continuing the internet access example, a bitmap index may be logically viewed as follows:

Identifier  HasInternet  Bitmaps
                         Y  N
1           Yes          1  0
2           No           0  1
3           No           0  1
4           Unspecified  0  0
5           Yes          1  0

On the left, Identifier refers to the unique number assigned to each resident, and HasInternet is the data to be indexed. The content of the bitmap index is shown as two columns under the heading Bitmaps. Each of these columns is a bitmap in the bitmap index. In this case, there are two such bitmaps, one for "has internet" Yes and one for "has internet" No. Each bit in bitmap Y shows whether a particular row refers to a person who has internet access. This is the simplest form of bitmap index. Most columns will have more distinct values; for example, the sales amount is likely to have a much larger number of distinct values. Variations on the bitmap index can effectively index this data as well. We briefly review three such variations.
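The internet access example can be reproduced with a minimal Python sketch. The function names `build_bitmap_index` and `rows_of`, and the use of Python integers as bit arrays, are illustrative assumptions and do not correspond to any particular database system.

```python
def build_bitmap_index(column, values):
    """Return {value: bitmap}; bit i of a bitmap is set when row i holds that value."""
    index = {v: 0 for v in values}
    for row, v in enumerate(column):
        if v in index:  # values that are not indexed (e.g. "Unspecified") set no bit
            index[v] |= 1 << row
    return index


def rows_of(bitmap):
    """List the 1-based identifiers whose bit is set in the bitmap."""
    return [i + 1 for i in range(bitmap.bit_length()) if bitmap >> i & 1]


# The HasInternet column for residents 1 to 5.
has_internet = ["Yes", "No", "No", "Unspecified", "Yes"]
index = build_bitmap_index(has_internet, ["Yes", "No"])

yes_rows = rows_of(index["Yes"])               # residents with internet access
no_rows = rows_of(index["No"])                 # residents without internet access
either = rows_of(index["Yes"] | index["No"])   # bitwise OR answers "Yes or No"
```

Resident 4 appears in neither bitmap, which matches the two 0 bits in that row of the table.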

Note: Many of the references cited here are reviewed in John Wu (2007).[2] For those who might be interested in experimenting with some of the ideas mentioned here, many of them are implemented in open source software such as FastBit,[3] the Lemur Bitmap Index C++ Library,[4] the Roaring Bitmap Java library,[5] the Apache Hive Data Warehouse system and LucidDB.

Compression

Software can compress each bitmap in a bitmap index to save space. There has been a considerable amount of work on this subject.[6][7] Though there are exceptions such as Roaring bitmaps,[8] bitmap compression algorithms typically employ run-length encoding, such as the Byte-aligned Bitmap Code (BBC),[9] the Word-Aligned Hybrid (WAH) code,[10] the Partitioned Word-Aligned Hybrid (PWAH) compression,[11] the Position List Word Aligned Hybrid (PLWAH),[12] the Compressed Adaptive Index (COMPAX),[13] Enhanced Word-Aligned Hybrid (EWAH)[14] and the COmpressed 'N' Composable Integer SEt (CONCISE).[15][16] These compression methods require very little effort to compress and decompress. More importantly, bitmaps compressed with BBC, WAH, COMPAX, PLWAH, EWAH and CONCISE can directly participate in bitwise operations without decompression. This gives them considerable advantages over generic compression techniques such as LZ77. BBC compression and its derivatives are used in a commercial database management system. BBC is effective in both reducing index sizes and maintaining query performance. BBC encodes the bitmaps in bytes, while WAH encodes in words, better matching current CPUs. "On both synthetic data and real application data, the new word aligned schemes use only 50% more space, but perform logical operations on compressed data 12 times faster than BBC."[17] PLWAH bitmaps were reported to take 50% of the storage space consumed by WAH bitmaps and offer up to 20% faster performance on logical operations.[12] Similar considerations apply to CONCISE[16] and Enhanced Word-Aligned Hybrid.[14]
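The idea of operating on run-length-encoded bitmaps without decompressing them can be illustrated with a simplified Python sketch. It is not BBC, WAH or any of the other named formats, which pack runs into bytes or machine words; the `[bit, run_length]` representation and the function names are assumptions made for illustration only.

```python
def rle_encode(bits):
    """Compress a list of 0/1 bits into [bit, run_length] pairs."""
    runs = []
    for b in bits:
        if runs and runs[-1][0] == b:
            runs[-1][1] += 1
        else:
            runs.append([b, 1])
    return runs


def rle_and(a, b):
    """AND two run lists of equal total length without expanding them to bits."""
    out = []
    i = j = 0
    rem_a = a[0][1] if a else 0
    rem_b = b[0][1] if b else 0
    while i < len(a) and j < len(b):
        step = min(rem_a, rem_b)          # length over which both inputs are constant
        bit = a[i][0] & b[j][0]
        if out and out[-1][0] == bit:
            out[-1][1] += step            # extend the current output run
        else:
            out.append([bit, step])
        rem_a -= step
        rem_b -= step
        if rem_a == 0:
            i += 1
            rem_a = a[i][1] if i < len(a) else 0
        if rem_b == 0:
            j += 1
            rem_b = b[j][1] if j < len(b) else 0
    return out


def rle_decode(runs):
    """Expand [bit, run_length] pairs back into a list of bits."""
    return [bit for bit, n in runs for _ in range(n)]


x = [0] * 10 + [1] * 5 + [0] * 5
y = [1] * 12 + [0] * 8
x_runs = rle_encode(x)
result_runs = rle_and(x_runs, rle_encode(y))
```

The work done by `rle_and` is proportional to the number of runs rather than the number of bits, which is why long runs of identical bits make both storage and logical operations cheaper in this toy model.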

The performance of schemes such as BBC, WAH, PLWAH, EWAH, COMPAX and CONCISE depends on the order of the rows. A simple lexicographical sort can divide the index size by 9 and make indexes several times faster.[18] The larger the table, the more important it is to sort the rows. Reshuffling techniques have also been proposed to achieve the same results as sorting when indexing streaming data.[13]
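The effect of row order can be seen on a toy table by counting runs of identical bits, since fewer runs generally means better run-length compression. The following Python sketch uses made-up data and hypothetical helper names; it illustrates the direction of the effect and does not reproduce the factor of 9 reported in the cited work.

```python
def count_runs(bits):
    """Number of maximal runs of identical bits in a bitmap."""
    return sum(1 for i, b in enumerate(bits) if i == 0 or b != bits[i - 1])


def total_runs(rows):
    """Sum the run counts of all bitmaps (one per distinct value) over all columns."""
    total = 0
    for col in range(len(rows[0])):
        for value in {r[col] for r in rows}:
            total += count_runs([int(r[col] == value) for r in rows])
    return total


# A made-up two-column table in arbitrary order.
rows = [("b", "y"), ("a", "x"), ("b", "x"), ("a", "y"), ("b", "y"), ("a", "x")]

unsorted_runs = total_runs(rows)
sorted_runs = total_runs(sorted(rows))  # lexicographic sort of the row tuples
```

In this example the first column benefits most from sorting, because its equal values become contiguous; later columns benefit less.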

Encoding

Basic bitmap indexes use one bitmap for each distinct value. The number of bitmaps can be reduced by using a different encoding method.[19][20] For example, binary encoding can represent C distinct values using log2(C) bitmaps, rounded up to a whole number.[21]

This reduces the number of bitmaps, further saving space, but to answer any query, most of the bitmaps have to be accessed. This makes it potentially not as effective as scanning a vertical projection of the base data, also known as a materialized view or projection index. Finding the optimal encoding method that balances (arbitrary) query performance, index size and index maintenance remains a challenge.
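A minimal Python sketch of binary encoding is given below, under the assumption that each distinct value is assigned an integer code and that bitmap k stores bit k of the code of every row. The function names and the sample column are hypothetical. Note how an equality query has to read every bitmap, which is the cost described above.

```python
def build_binary_encoded(column):
    """Assign each distinct value a code and keep one bitmap per code bit."""
    codes = {v: c for c, v in enumerate(sorted(set(column)))}
    n_bitmaps = max(1, (len(codes) - 1).bit_length())  # ceil(log2(C)), at least 1
    bitmaps = [0] * n_bitmaps
    for row, v in enumerate(column):
        for k in range(n_bitmaps):
            if codes[v] >> k & 1:
                bitmaps[k] |= 1 << row
    return codes, bitmaps


def equals(codes, bitmaps, value, n_rows):
    """Bitmap of rows equal to `value`; every bitmap is read and combined."""
    all_rows = (1 << n_rows) - 1
    result = all_rows
    for k, bm in enumerate(bitmaps):
        if codes[value] >> k & 1:
            result &= bm                 # code bit is 1: keep rows where bitmap k is set
        else:
            result &= ~bm & all_rows     # code bit is 0: keep rows where bitmap k is clear
    return result


column = ["red", "green", "blue", "green", "amber", "red"]
codes, bitmaps = build_binary_encoded(column)
green_rows = equals(codes, bitmaps, "green", len(column))
```

With four distinct values, two bitmaps suffice instead of four, at the price of touching both bitmaps for every equality query.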

Without considering compression, Chan and Ioannidis analyzed a class of multi-component encoding methods and came to the conclusion that two-component encoding sits at the kink of the performance vs. index size curve and therefore represents the best trade-off between index size and query performance.[19]

Binning

For high-cardinality columns, it is useful to bin the values, so that each bin covers multiple values, and to build bitmaps that represent the values in each bin. This approach reduces the number of bitmaps used regardless of encoding method.[22] However, binned indexes can only answer some queries without examining the base data. For example, if a bin covers the range from 0.1 to 0.2, then when the user asks for all values less than 0.15, all rows that fall in the bin are possible hits and have to be checked to verify whether they are actually less than 0.15. The process of checking the base data is known as the candidate check. In most cases, the time used by the candidate check is significantly longer than the time needed to work with the bitmap index. Therefore, binned indexes exhibit irregular performance. They can be very fast for some queries, but much slower if the query does not exactly match a bin.
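The candidate check can be sketched in Python as follows. The bin edges, the sample values and the function names are assumptions chosen to mirror the 0.1 to 0.2 example; the sketch also assumes every value lies at or above the first edge and strictly below the last edge.

```python
from bisect import bisect_right


def build_binned_index(values, edges):
    """One bitmap per bin; bin k covers edges[k] <= v < edges[k + 1]."""
    bitmaps = [0] * (len(edges) - 1)
    for row, v in enumerate(values):
        k = bisect_right(edges, v) - 1
        bitmaps[k] |= 1 << row
    return bitmaps


def less_than(values, edges, bitmaps, threshold):
    """Return (bitmap of rows with value < threshold, number of rows checked)."""
    hits, candidates = 0, 0
    for k, bm in enumerate(bitmaps):
        if edges[k + 1] <= threshold:   # bin lies entirely below the threshold
            hits |= bm
        elif edges[k] < threshold:      # bin straddles the threshold
            candidates |= bm
    checked = 0
    for row in range(len(values)):      # candidate check against the base data
        if candidates >> row & 1:
            checked += 1
            if values[row] < threshold:
                hits |= 1 << row
    return hits, checked


values = [0.05, 0.12, 0.17, 0.25, 0.11, 0.19]
edges = [0.0, 0.1, 0.2, 0.3]
bitmaps = build_binned_index(values, edges)

partial = less_than(values, edges, bitmaps, 0.15)  # threshold falls inside a bin
aligned = less_than(values, edges, bitmaps, 0.2)   # threshold matches a bin edge
```

When the threshold coincides with a bin edge, no base data is read at all; when it falls inside a bin, every row of that bin has to be examined, which accounts for the irregular performance.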

History

The concept of the bitmap index was first introduced by Professor Israel Spiegler and Rafi Maayan in their research "Storage and Retrieval Considerations of Binary Data Bases", published in 1985.[23] The first commercial database product to implement a bitmap index was Computer Corporation of America's Model 204. Patrick O'Neil published a paper about this implementation in 1987.[24] This implementation is a hybrid between the basic bitmap index (without compression) and the list of Row Identifiers (RID-list). Overall, the index is organized as a B+tree. When the column cardinality is low, each leaf node of the B+tree would contain a long list of RIDs. In this case, it requires less space to represent the RID-lists as bitmaps. Since each bitmap represents one distinct value, this is the basic bitmap index. As the column cardinality increases, each bitmap becomes sparse and it may take more disk space to store the bitmaps than to store the same content as RID-lists. In this case, the implementation switches to using the RID-lists, which makes it a B+tree index.[25][26]

In-memory bitmaps

One of the strongest reasons for using bitmap indexes is that the intermediate results produced from them are also bitmaps and can be efficiently reused in further operations to answer more complex queries. Many programming languages support this as a bit array data structure. For example, Java has the BitSet class.
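As a small illustration of reusing an intermediate result, the following Python sketch treats integers as bit arrays, playing a role similar to Java's BitSet. The bitmaps and their names are made up for the example.

```python
# Hypothetical bitmaps over 8 rows; bit i corresponds to row i.
n_rows = 8
all_rows = (1 << n_rows) - 1
has_internet = 0b10110101
owns_car = 0b11010110
is_student = 0b00110011

# The intermediate result is itself a bitmap, so it can be reused directly.
internet_and_car = has_internet & owns_car

students_with_both = internet_and_car & is_student
non_students_with_both = internet_and_car & ~is_student & all_rows

# Counting matches only requires counting the set bits.
student_count = bin(students_with_both).count("1")
```

Both follow-up queries start from `internet_and_car` instead of recomputing the first conjunction.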

Some database systems that do not offer persistent bitmap indexes use bitmaps internally to speed up query processing. For example, PostgreSQL versions 8.1 and later implement a "bitmap index scan" optimization to speed up arbitrarily complex logical operations between available indexes on a single table.

For tables with many columns, the total number of distinct indexes to satisfy all possible queries (with equality filtering conditions on either of the fields) grows very fast, being defined by this formula:

.[27][28]

A bitmap index scan combines expressions on different indexes, thus requiring only one index per column to support all possible queries on a table.

Applying this access strategy to B-tree indexes can also combine range queries on multiple columns. In this approach, a temporary in-memory bitmap is created with one bit for each row in the table (1 MiB can thus store over 8 million entries). Next, the results from each index are combined into the bitmap using bitwise operations. After all conditions are evaluated, the bitmap contains a "1" for rows that matched the expression. Finally, the bitmap is traversed and matching rows are retrieved. In addition to efficiently combining indexes, this also improves locality of reference of table accesses, because all rows are fetched sequentially from the main table.[29] The internal bitmap is discarded after the query. If there are too many rows in the table to use 1 bit per row, a "lossy" bitmap is created instead, with a single bit per disk page. In this case, the bitmap is just used to determine which pages to fetch; the filter criteria are then applied to all rows in matching pages.
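The access strategy described above can be sketched in Python. This is a simplified model under stated assumptions, not PostgreSQL's actual implementation: the table, the row numbers returned by each index, the page size and all function names are hypothetical.

```python
def to_bitmap(row_ids):
    """Build a temporary bitmap with one bit per table row from an index result."""
    bm = 0
    for r in row_ids:
        bm |= 1 << r
    return bm


def scan(table, bitmap):
    """Fetch matching rows in physical order by walking the bitmap."""
    return [table[r] for r in range(len(table)) if bitmap >> r & 1]


def to_page_bitmap(row_ids, rows_per_page):
    """Lossy variant: one bit per page that contains at least one matching row."""
    bm = 0
    for r in row_ids:
        bm |= 1 << (r // rows_per_page)
    return bm


def lossy_scan(table, page_bitmap, rows_per_page, predicate):
    """Visit flagged pages only, then recheck the filter on every row in them."""
    out = []
    for r, row in enumerate(table):
        if page_bitmap >> (r // rows_per_page) & 1 and predicate(row):
            out.append(row)
    return out


table = [(18, "NY"), (35, "LA"), (42, "NY"), (27, "NY"), (51, "LA"), (33, "NY")]

# Row numbers that two separate single-column indexes might return, in index order.
age_over_30 = [4, 1, 2, 5]
city_is_ny = [0, 5, 3, 2]

combined = to_bitmap(age_over_30) & to_bitmap(city_is_ny)
exact = scan(table, combined)

pages = to_page_bitmap(age_over_30, 2) & to_page_bitmap(city_is_ny, 2)
lossy = lossy_scan(table, pages, 2, lambda row: row[0] > 30 and row[1] == "NY")

bits_per_mib = 1024 * 1024 * 8  # one bit per row: 8,388,608 rows per MiB
```

Both variants return the same rows; the lossy one trades a smaller bitmap for rechecking the filter on every row of each fetched page.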

References

Notes
  1. ^ Bitmap Index vs. B-tree Index: Which and When?, Vivek Sharma, Oracle Technical Network.
  2. ^ John Wu (2007). "Annotated References on Bitmap Index".
  3. ^ FastBit
  4. ^ Lemur Bitmap Index C++ Library
  5. ^ Roaring bitmaps
  6. ^ T. Johnson (1999). "Performance Measurements of Compressed Bitmap Indices". In Malcolm P. Atkinson, Maria E. Orlowska, Patrick Valduriez, Stanley B. Zdonik, Michael L. Brodie (ed.). VLDB'99, Proceedings of 25th International Conference on Very Large Data Bases, September 7–10, 1999, Edinburgh, Scotland, UK (PDF). Morgan Kaufmann. pp. 278–89. ISBN 1-55860-615-7.
  7. ^ Wu K, Otoo E, Shoshani A (March 5, 2004). "On the performance of bitmap indices for high cardinality attributes" (PDF).
  8. ^ Chambi, S.; Lemire, D.; Kaser, O.; Godin, R. (2016). "Better bitmap performance with Roaring bitmaps". Software: Practice & Experience. 46: 5. doi:10.1002/spe.2325.
  9. ^ Byte aligned data compression
  10. ^ Word aligned bitmap compression method, data structure, and apparatus
  11. ^ van Schaik, Sebastiaan; de Moor, Oege (2011). "A memory efficient reachability data structure through bit vector compression". Proceedings of the 2011 international conference on Management of data. SIGMOD '11. Athens, Greece: ACM. pp. 913–924. doi:10.1145/1989323.1989419. ISBN 978-1-4503-0661-4.
  12. ^ a b Deliège F, Pedersen TB (2010). "Position list word aligned hybrid: optimizing space and performance for compressed bitmaps". In Ioana Manolescu, Stefano Spaccapietra, Jens Teubner, Masaru Kitsuregawa, Alain Leger, Felix Naumann, Anastasia Ailamaki, and Fatma Ozcan (ed.). EDBT '10, Proceedings of the 13th International Conference on Extending Database Technology (PDF). New York, NY, USA: ACM. pp. 228–39. doi:10.1145/1739041.1739071. ISBN 978-1-60558-945-9.
  13. ^ a b F. Fusco, M. Stoecklin, M. Vlachos (September 2010). "NET-FLi: on-the-fly compression, archiving and indexing of streaming network traffic" (PDF). Proc. VLDB Endow. 3 (1–2): 1382–93.
  14. ^ a b Lemire, D.; Kaser, O.; Aouiche, K. (2010). "Sorting improves word-aligned bitmap indexes". Data & Knowledge Engineering. 69: 3. doi:10.1016/j.datak.2009.08.006.
  15. ^ Concise: Compressed 'n' Composable Integer Set
  16. ^ a b Colantonio A, Di Pietro R (31 July 2010). "Concise: Compressed 'n' Composable Integer Set" (PDF). Information Processing Letters. 110 (16): 644–50. doi:10.1016/j.ipl.2010.05.018.
  17. ^ Wu K, Otoo EJ, Shoshani A (2001). "A Performance comparison of bitmap indexes". In Henrique Paques, Ling Liu, and David Grossman (ed.). CIKM '01 Proceedings of the tenth international conference on Information and knowledge management (PDF). New York, NY, USA: ACM. pp. 559–61. doi:10.1145/502585.502689. ISBN 1-58113-436-3.
  18. ^ D. Lemire, O. Kaser, K. Aouiche (January 2010). "Sorting improves word-aligned bitmap indexes". Data & Knowledge Engineering. 69 (1): 3–28. arXiv:0901.3751. doi:10.1016/j.datak.2009.08.006.
  19. ^ a b C.-Y. Chan and Y. E. Ioannidis (1998). "Bitmap index design and evaluation". In Ashutosh Tiwary, Michael Franklin (ed.). Proceedings of the 1998 ACM SIGMOD international conference on Management of data (SIGMOD '98) (PDF). New York, NY, USA: ACM. pp. 355–6. doi:10.1145/276304.276336.
  20. ^ C.-Y. Chan and Y. E. Ioannidis (1999). "An efficient bitmap encoding scheme for selection queries". Proceedings of the 1999 ACM SIGMOD international conference on Management of data (SIGMOD '99) (PDF). New York, NY, USA: ACM. pp. 215–26. doi:10.1145/304182.304201.
  21. ^ P. E. O'Neil and D. Quass (1997). Joan M. Peckman, Sudha Ram, Michael Franklin (ed.). "Proceedings of the 1997 ACM SIGMOD international conference on Management of data (SIGMOD '97)". New York, NY, USA: ACM: 38–49. doi:10.1145/253260.253268.
  22. ^ N. Koudas (2000). "Space efficient bitmap indexing". Proceedings of the ninth international conference on Information and knowledge management (CIKM '00). New York, NY, USA: ACM. pp. 194–201. doi:10.1145/354756.354819.
  23. ^ Spiegler I; Maayan R (1985). "Storage and retrieval considerations of binary data bases". Information Processing and Management: an International Journal. 21 (3): 233–54. doi:10.1016/0306-4573(85)90108-6.
  24. ^ O'Neil, Patrick (1987). "Model 204 Architecture and Performance". In Dieter Gawlick, Mark N. Haynie, and Andreas Reuter (eds.). Proceedings of the 2nd International Workshop on High Performance Transaction Systems. London, UK: Springer-Verlag. pp. 40–59.
  25. ^ D. Rinfret, P. O'Neil and E. O'Neil (2001). "Bit-sliced index arithmetic". In Timos Sellis (ed.). Proceedings of the 2001 ACM SIGMOD international conference on Management of data (SIGMOD '01). New York, NY, USA: ACM. pp. 47–57. doi:10.1145/375663.375669.
  26. ^ E. O'Neil, P. O'Neil, K. Wu (2007). "Bitmap Index Design Choices and Their Performance Implications" (PDF). 11th International Database Engineering and Applications Symposium (IDEAS 2007). pp. 72–84. doi:10.1109/IDEAS.2007.19. ISBN 0-7695-2947-X.
  27. ^ Alex Bolenok (2009-05-09). "Creating indexes".
  28. ^ Egor Timoshenko. "On minimal collections of indexes" (PDF).
  29. ^ Tom Lane (2005-12-26). "Re: Bitmap indexes etc". PostgreSQL mailing lists. Retrieved 2007-04-06.
Bibliography