jump to navigation

Bitmap Indexes With Many Distinct Column Values (Wots…uh the deal) February 1, 2008

Posted by Richard Foote in Bitmap Indexes, Oracle General, Oracle Indexes, Oracle Myths.
trackback

In the seemingly never ending list of 8 things one may not have known about indexes, Number 3 stated:

“Bitmap Indexes can be extremely useful and beneficial even if the column contains thousands of distinct values”.

On the surface, this may seem like a somewhat strange thing to suggest to many folk. It seems to be stated in numerous places that Bitmap indexes are only really beneficial with columns that have low numbers of distinct values.  For example, a column called GENDER (I was going to use another word but I have to deal with enough spam messages as it is :) ) has only a few distinct values, so it would be perfect for a Bitmap Index.

Columns that have say 5 or 10 or maybe 20 distinct values should all be OK. But what if a column has 100 distinct values, that might just be pushing it. A column with 1000 distinct values would obviously be totally inappropriate. I would have to be some kind of deranged fool for even contemplating and suggesting a column with 10,000 distinct values, right !!

A Bitmap Index is actually a B-Tree index in it’s basic structure and shape. It has index branch blocks that point to index leaf blocks that have all the necessary index information stored in an ordered and sorted manner. However, in the leaf blocks, a conventional B-Tree index basically stores the indexed column values followed by its corresponding rowid. A bitmap index differs considerably and basically stores in the leaf blocks for each distinct column, the column value followed by a starting and ending rowid that specifies a range of possible rowids within the table followed by a series of bits that denotes for each possible rowid within the range whether the row contains the column value (1) or not (0). If the index entry is larger than roughly half the index block size, another bitmap “piece” is created for the index entry, specifying a different range of rowids with corresponding bitmaps.

The “biggest” component of the index entry is thus this series of bits. But most of the values will be zeros (as a specific row can only have at most the one value of the column) and all these zeros can be compressed quite efficiently by Oracle within the index entry.

So having lots of distinct column values means having lots of index entries with lots of rowid ranges with lots of bitmaps. So a column with anything approaching 10,000 values would be totally inappropriate, right ?

Well take a look at this demo comparing a B-Tree Index vs. a Bitmap Index for a column that has 10,000 distinct values and you might just be surprised.

The table contains 1 Million rows and one of the columns is a NUMBER field that has 10,000 distinct values (hence a Density value of 1%).

The B-Tree Index required 2,090 leaf blocks to store the index and an equality query returning the expected 100 rows requires 19 consistent reads. Not too bad.

However, the equivalent Bitmap Index required just 56 leaf blocks and an equality query returning the same 100 rows does so with just 11 consistent reads.

Ummm, perhaps bitmaps indexes don’t require such low numbers of distinct column values to be useful after all …

A few points to ponder on from this specific example.

The B-Tree index had to store 1,000,000 index values, once for each and every not null row in the parent table. The Bitmap Index only had to store the index values 10,000 times, once for each unique occurrence of the column value (although there may be times when this ratio may be higher)

The B-Tree index had to stored 1,000,000 rowids, once for each and every index row entry. The Bitmap Index only had to store a pair of rowid values for each unique occurrence of the column value (although there may be times when Bitmap Indexes need to store more than one pair of rowids per index value).

If the rows in the table are clustered together based on the index column value, it means the zeros in the bitmap index can be nice and continuous within the bitmap string and so will compress nicely. Therefore, the manner in which the rows are ordered in the table will have a direct impact in how efficient the Bitmap Index can be compressed.

There’s lots and lots of interesting things to discuss about Bitmap Indexes. For now, just note the next time you read Bitmap Indexes should only be used for columns with few distinct values, you may want to get some clarification on what is meant exactly by “few” …

Comments»

1. Fernando J. Andrade - February 1, 2008

What about concurrency in your example?. I don’t see a big deal using a BMI with a medium-high cardinality, if what you`re looking for is improving logical queries ( and OR maybe ). You`re making a sort of matrix. But what a about concurrency?, it`s a big point.

2. Richard Foote - February 1, 2008

Hi Fernando

Concurrency is not just a big point, it’s everything !!

Bitmap indexes have massive blocking implications and are totally and utterly unsuitable in anything approaching a OLTP system with concurrent DML.

There’s been much improvement in how Bitmap Indexes structurally handles DML, but the locks, my God, the locks the locks …

This is one of those “interesting things” that I intend to cover at some point.

3. Richard Foote - February 1, 2008

Hi Fernando

I just checked out your blog and noticed you’re from Madrid.

I lived in Madrid as a small child and my mother and all her side of the family are all from Madrid !!

It’s been way too long since I visited :(

4. Ben in Boston - February 1, 2008

How well does this concept of density scale? Say I had a data warehouse with a table that had 50 billion rows and a 5 million distinct values in a column for a density of 1/10,000. Should I just expect it to fall down dead, or could it still work?

5. Ben in Boston - February 1, 2008

Oh, by the way, that table would be range-hash subpartitioned, the index would be the hash partitioning key. That would probably mean that I’d want to keep the index local, rather than global.

6. Richard Foote - February 1, 2008

Hi Ben

It would likely still scale.

That would be 50 billion rowids that have to be stored in a B-tree index. With partitioned tables, these rowids could suddenly become more expensive too btw if the index were to be global as we now have to store 10 bytes for the rowid. That would be 50 billion index values that would need to stored as well (assuming no nulls although nulls are stored in the bitmap index).

With the Bitmap index, you may only have a (relatively few) 5 million rowid pairs although this would depend on how the data is ordered and the efficiency of the compression. With a table that large, you would have multiple bitmap pieces per index entry but still there’s a 1/10000 ratio to catch up.

In a Warehouse, you also have the advantages of the potentially efficient execution plans open to you with multiple Bitmap indexes and Star Transformations without expensive “on-the-fly” bitmap transformations.

The answer is yes, it could very much still work.

7. Neil Johnson - February 1, 2008

“Therefore, the manner in which the rows are ordered in the table will have a direct impact in how efficient the Bitmap Index can be compressed.”

I couldn’t that go without trying it (hope this formats okay):

CREATE TABLE big_dwh_table2 as select * from big_dwh_table order by dbms_random.value;
CREATE BITMAP INDEX big_dwh2_album_id_i ON big_dwh_table2(album_id);
exec dbms_stats.gather_table_stats(ownname=>null,tabname=> ‘BIG_DWH_TABLE2′, estimate_percent=> null, cascade=> true, method_opt=> ‘FOR ALL COLUMNS SIZE 1′);

INDEX_NAME DISTINCT_KEYS NUM_ROWS LEAF_BLOCKS BLOCKS
—————————— ————- ———- ———– ———-
BIG_DWH2_ALBUM_ID_I 10000 10000 500 512

SELECT * FROM big_dwh_table2 WHERE album_id = 42;

Statistics
———————————————————-
103 consistent gets

Right you are then :)

8. Richard Foote - February 1, 2008

Hi Neil

Precisely !!

I’m planning on discussing the importance of column ordering and the usefulness of MINIMIZE RECORDS_PER_BLOCK to the efficiency of bitmap indexes “sometime”.

BTW, what you’ve done is a perfect example of why I like to show these little demos and illustrations. I show why and how I might suggest something might work. You can then run the same thing in your own environments and then experiment and play around and see for yourself whether something is applicable or not :)

9. Fernando J. Andrade - February 1, 2008

Well I`m not from Madrid (as I`m not a spaniard, I was born at Ecuador!). Actually I live in Madrid LOL. Greetings from here!.

10. Richard Foote - February 1, 2008

Hi Neil

BTW, thought it worth pointing out that by re-ordering the rows in the “random” manner that you did, you of course stuffed up the Clustering Factor of the index big time (you bully ;) )

This of course also impacts the efficiency of the B-Tree index as well as it now has to visit all these different table blocks to retrieve the data as well.

Just ran your example and while the Bitmap has now gone to 102 CR, the B-Tree has gone to 110 CR

11. Peter Scott - February 2, 2008

moving slightly away from the point…
In isolation an index on GENDER may not be a good bitmap index key choice – probably (?) just two key values and depending on data source potentially similar row counts for each key value (OK, men in convents will have a low man count) So the index could be potentially selecting say a third to two thirds of the table (sweeping generalisation alert) and the CBO might decided that visiting the table without using the index to be more efficient.
But in association with other bitmap indexes on the same table it could be useful – we can AND or OR bitmaps together; so a GENDER bitmap could be useful to find the woman that lives in Sydney who shops at a certain supermarket and uses a certain cellphone provider and is aged between 40 and 49, which is how I was alleged to have met my wife :-)

12. Richard Foote - February 2, 2008

Hi Peter

Absolutely.

I start the Bitmap Index section in my seminar introducing two related myths. One, as discussed, that Bitmap Indexes can only be used for columns with very small numbers of distinct values. And two, that Bitmap Indexes can generally be used in isolation with such columns.

Just one little comment, beware the possible ramifications of OR conditions. In your example, if you had said someone who “lives in Sydney who shops at a certain supermarket and uses a certain cellphone provider and is aged between 40 and 49″ OR is a woman, then “OK Computer” it ain’t :)

13. Peter Scott - February 2, 2008

:-)

OR tends to be best used with multiple values from the same index and not between indexes – or indeed I would have a larger choice of wife than I did.

I think OR works well with “bought product A OR product B” queries where the product is stored in a single bitmap indexed column. I certainly would not look at “is RED or FEMALE” unless I had a very strong business case… but stranger business cases do exist…

14. Richard Foote - February 2, 2008

Hi Peter

I’m tempted to make a witty remark but I’m worried your wife might read it !! ;)

15. John - February 13, 2008

We have an implementation of bitmap index that work very well with columns that have millions of distinct values and the software is being distributed under LGPL. If you can use a C++ program for your application, you can try it at .

16. Richard Foote - February 13, 2008

Hi John

Indeed. If the tables are big enough, the number of distinct values that could be effectively indexed through a bitmap index is open ended.

17. K. Wu - March 7, 2008

Hi, Richard,

The key to keep bitmap indexes in ORACLE is compression — ORACLE owns a large number of compression pattens if one evidence of this. With some compression methods, one can actually prove with theoretical analysis about the sizes of compressed bitmap indexes, e.g., http://doi.acm.org/10.1145/1132863.1132864 . Some Sybase and IBM products have take a different way of dealing with this size issue caused by high number of distinct values, http://doi.acm.org/10.1145/253260.253268 . The Wikipedia article on bitmap index, http://en.wikipedia.org/wiki/Bitmap_index, captures some of the key points. If you have to stay with ORACLE, my observation is that its compression method is actually pretty effective. Of course, personally, I would like to see them adopt a better one.

18. praveen - June 18, 2008

SELECT 1
FROM plug_in_inter_stage pl
WHERE pl.conv_parent_slot_order = :b1
AND (pl.conv_validated = ‘N’ OR pl.conv_hier_valid = ‘N’
AND pl.conv_company_id = sl.conv_company_id
AND pl.conv_parent_slot_eid = sl.conv_legacy_system_key

i want to tune above query pls help me out

19. Richard Foote - June 18, 2008

Hi Praveen

Sorry, this is not the right place for such questions.

Tips though when you do ask it somewhere where it may get answered:

Provide some details as to why it requires tuning, such as the execution plan
Provide details on the size of the table
Provide details of the cardinalities of the columns
Provide details of available indexes

But most important of all, provide a SQL statement that actually makes sense and will compile. The code you supplied makes references to a table alias sl which is not in the FROM list.

You can’t tune something that won’t run …