jump to navigation

Bitmap Indexes With Many Distinct Column Values (Wots…uh the deal) February 1, 2008

Posted by Richard Foote in Bitmap Indexes, Oracle General, Oracle Indexes, Oracle Myths.
trackback

In the seemingly never ending list of 8 things one may not have known about indexes, Number 3 stated:

“Bitmap Indexes can be extremely useful and beneficial even if the column contains thousands of distinct values”.

On the surface, this may seem like a somewhat strange thing to suggest to many folk. It seems to be stated in numerous places that Bitmap indexes are only really beneficial with columns that have low numbers of distinct values.  For example, a column called GENDER (I was going to use another word but I have to deal with enough spam messages as it is :) ) has only a few distinct values, so it would be perfect for a Bitmap Index.

Columns that have say 5 or 10 or maybe 20 distinct values should all be OK. But what if a column has 100 distinct values, that might just be pushing it. A column with 1000 distinct values would obviously be totally inappropriate. I would have to be some kind of deranged fool for even contemplating and suggesting a column with 10,000 distinct values, right !!

A Bitmap Index is actually a B-Tree index in it’s basic structure and shape. It has index branch blocks that point to index leaf blocks that have all the necessary index information stored in an ordered and sorted manner. However, in the leaf blocks, a conventional B-Tree index basically stores the indexed column values followed by its corresponding rowid. A bitmap index differs considerably and basically stores in the leaf blocks for each distinct column, the column value followed by a starting and ending rowid that specifies a range of possible rowids within the table followed by a series of bits that denotes for each possible rowid within the range whether the row contains the column value (1) or not (0). If the index entry is larger than roughly half the index block size, another bitmap “piece” is created for the index entry, specifying a different range of rowids with corresponding bitmaps.

The “biggest” component of the index entry is thus this series of bits. But most of the values will be zeros (as a specific row can only have at most the one value of the column) and all these zeros can be compressed quite efficiently by Oracle within the index entry.

So having lots of distinct column values means having lots of index entries with lots of rowid ranges with lots of bitmaps. So a column with anything approaching 10,000 values would be totally inappropriate, right ?

Well take a look at this demo comparing a B-Tree Index vs. a Bitmap Index for a column that has 10,000 distinct values and you might just be surprised.

The table contains 1 Million rows and one of the columns is a NUMBER field that has 10,000 distinct values (hence a Density value of 1%).

The B-Tree Index required 2,090 leaf blocks to store the index and an equality query returning the expected 100 rows requires 19 consistent reads. Not too bad.

However, the equivalent Bitmap Index required just 56 leaf blocks and an equality query returning the same 100 rows does so with just 11 consistent reads.

Ummm, perhaps bitmaps indexes don’t require such low numbers of distinct column values to be useful after all …

A few points to ponder on from this specific example.

The B-Tree index had to store 1,000,000 index values, once for each and every not null row in the parent table. The Bitmap Index only had to store the index values 10,000 times, once for each unique occurrence of the column value (although there may be times when this ratio may be higher)

The B-Tree index had to stored 1,000,000 rowids, once for each and every index row entry. The Bitmap Index only had to store a pair of rowid values for each unique occurrence of the column value (although there may be times when Bitmap Indexes need to store more than one pair of rowids per index value).

If the rows in the table are clustered together based on the index column value, it means the zeros in the bitmap index can be nice and continuous within the bitmap string and so will compress nicely. Therefore, the manner in which the rows are ordered in the table will have a direct impact in how efficient the Bitmap Index can be compressed.

There’s lots and lots of interesting things to discuss about Bitmap Indexes. For now, just note the next time you read Bitmap Indexes should only be used for columns with few distinct values, you may want to get some clarification on what is meant exactly by “few” …

About these ads

Comments»

1. Fernando J. Andrade - February 1, 2008

What about concurrency in your example?. I don’t see a big deal using a BMI with a medium-high cardinality, if what you`re looking for is improving logical queries ( and OR maybe ). You`re making a sort of matrix. But what a about concurrency?, it`s a big point.

2. Richard Foote - February 1, 2008

Hi Fernando

Concurrency is not just a big point, it’s everything !!

Bitmap indexes have massive blocking implications and are totally and utterly unsuitable in anything approaching a OLTP system with concurrent DML.

There’s been much improvement in how Bitmap Indexes structurally handles DML, but the locks, my God, the locks the locks …

This is one of those “interesting things” that I intend to cover at some point.

3. Richard Foote - February 1, 2008

Hi Fernando

I just checked out your blog and noticed you’re from Madrid.

I lived in Madrid as a small child and my mother and all her side of the family are all from Madrid !!

It’s been way too long since I visited :(

4. Ben in Boston - February 1, 2008

How well does this concept of density scale? Say I had a data warehouse with a table that had 50 billion rows and a 5 million distinct values in a column for a density of 1/10,000. Should I just expect it to fall down dead, or could it still work?

5. Ben in Boston - February 1, 2008

Oh, by the way, that table would be range-hash subpartitioned, the index would be the hash partitioning key. That would probably mean that I’d want to keep the index local, rather than global.

6. Richard Foote - February 1, 2008

Hi Ben

It would likely still scale.

That would be 50 billion rowids that have to be stored in a B-tree index. With partitioned tables, these rowids could suddenly become more expensive too btw if the index were to be global as we now have to store 10 bytes for the rowid. That would be 50 billion index values that would need to stored as well (assuming no nulls although nulls are stored in the bitmap index).

With the Bitmap index, you may only have a (relatively few) 5 million rowid pairs although this would depend on how the data is ordered and the efficiency of the compression. With a table that large, you would have multiple bitmap pieces per index entry but still there’s a 1/10000 ratio to catch up.

In a Warehouse, you also have the advantages of the potentially efficient execution plans open to you with multiple Bitmap indexes and Star Transformations without expensive “on-the-fly” bitmap transformations.

The answer is yes, it could very much still work.

7. Neil Johnson - February 1, 2008

“Therefore, the manner in which the rows are ordered in the table will have a direct impact in how efficient the Bitmap Index can be compressed.”

I couldn’t that go without trying it (hope this formats okay):

CREATE TABLE big_dwh_table2 as select * from big_dwh_table order by dbms_random.value;
CREATE BITMAP INDEX big_dwh2_album_id_i ON big_dwh_table2(album_id);
exec dbms_stats.gather_table_stats(ownname=>null,tabname=> ‘BIG_DWH_TABLE2′, estimate_percent=> null, cascade=> true, method_opt=> ‘FOR ALL COLUMNS SIZE 1′);

INDEX_NAME DISTINCT_KEYS NUM_ROWS LEAF_BLOCKS BLOCKS
—————————— ————- ———- ———– ———-
BIG_DWH2_ALBUM_ID_I 10000 10000 500 512

SELECT * FROM big_dwh_table2 WHERE album_id = 42;

Statistics
———————————————————-
103 consistent gets

Right you are then :)

8. Richard Foote - February 1, 2008

Hi Neil

Precisely !!

I’m planning on discussing the importance of column ordering and the usefulness of MINIMIZE RECORDS_PER_BLOCK to the efficiency of bitmap indexes “sometime”.

BTW, what you’ve done is a perfect example of why I like to show these little demos and illustrations. I show why and how I might suggest something might work. You can then run the same thing in your own environments and then experiment and play around and see for yourself whether something is applicable or not :)

9. Fernando J. Andrade - February 1, 2008

Well I`m not from Madrid (as I`m not a spaniard, I was born at Ecuador!). Actually I live in Madrid LOL. Greetings from here!.

10. Richard Foote - February 1, 2008

Hi Neil

BTW, thought it worth pointing out that by re-ordering the rows in the “random” manner that you did, you of course stuffed up the Clustering Factor of the index big time (you bully ;) )

This of course also impacts the efficiency of the B-Tree index as well as it now has to visit all these different table blocks to retrieve the data as well.

Just ran your example and while the Bitmap has now gone to 102 CR, the B-Tree has gone to 110 CR

11. Peter Scott - February 2, 2008

moving slightly away from the point…
In isolation an index on GENDER may not be a good bitmap index key choice – probably (?) just two key values and depending on data source potentially similar row counts for each key value (OK, men in convents will have a low man count) So the index could be potentially selecting say a third to two thirds of the table (sweeping generalisation alert) and the CBO might decided that visiting the table without using the index to be more efficient.
But in association with other bitmap indexes on the same table it could be useful – we can AND or OR bitmaps together; so a GENDER bitmap could be useful to find the woman that lives in Sydney who shops at a certain supermarket and uses a certain cellphone provider and is aged between 40 and 49, which is how I was alleged to have met my wife :-)

12. Richard Foote - February 2, 2008

Hi Peter

Absolutely.

I start the Bitmap Index section in my seminar introducing two related myths. One, as discussed, that Bitmap Indexes can only be used for columns with very small numbers of distinct values. And two, that Bitmap Indexes can generally be used in isolation with such columns.

Just one little comment, beware the possible ramifications of OR conditions. In your example, if you had said someone who “lives in Sydney who shops at a certain supermarket and uses a certain cellphone provider and is aged between 40 and 49″ OR is a woman, then “OK Computer” it ain’t :)

13. Peter Scott - February 2, 2008

:-)

OR tends to be best used with multiple values from the same index and not between indexes – or indeed I would have a larger choice of wife than I did.

I think OR works well with “bought product A OR product B” queries where the product is stored in a single bitmap indexed column. I certainly would not look at “is RED or FEMALE” unless I had a very strong business case… but stranger business cases do exist…

14. Richard Foote - February 2, 2008

Hi Peter

I’m tempted to make a witty remark but I’m worried your wife might read it !! ;)

15. John - February 13, 2008

We have an implementation of bitmap index that work very well with columns that have millions of distinct values and the software is being distributed under LGPL. If you can use a C++ program for your application, you can try it at .

16. Richard Foote - February 13, 2008

Hi John

Indeed. If the tables are big enough, the number of distinct values that could be effectively indexed through a bitmap index is open ended.

17. K. Wu - March 7, 2008

Hi, Richard,

The key to keep bitmap indexes in ORACLE is compression — ORACLE owns a large number of compression pattens if one evidence of this. With some compression methods, one can actually prove with theoretical analysis about the sizes of compressed bitmap indexes, e.g., http://doi.acm.org/10.1145/1132863.1132864 . Some Sybase and IBM products have take a different way of dealing with this size issue caused by high number of distinct values, http://doi.acm.org/10.1145/253260.253268 . The Wikipedia article on bitmap index, http://en.wikipedia.org/wiki/Bitmap_index, captures some of the key points. If you have to stay with ORACLE, my observation is that its compression method is actually pretty effective. Of course, personally, I would like to see them adopt a better one.

18. praveen - June 18, 2008

SELECT 1
FROM plug_in_inter_stage pl
WHERE pl.conv_parent_slot_order = :b1
AND (pl.conv_validated = ‘N’ OR pl.conv_hier_valid = ‘N’
AND pl.conv_company_id = sl.conv_company_id
AND pl.conv_parent_slot_eid = sl.conv_legacy_system_key

i want to tune above query pls help me out

19. Richard Foote - June 18, 2008

Hi Praveen

Sorry, this is not the right place for such questions.

Tips though when you do ask it somewhere where it may get answered:

Provide some details as to why it requires tuning, such as the execution plan
Provide details on the size of the table
Provide details of the cardinalities of the columns
Provide details of available indexes

But most important of all, provide a SQL statement that actually makes sense and will compile. The code you supplied makes references to a table alias sl which is not in the FROM list.

You can’t tune something that won’t run …

20. Narendra - August 12, 2009

Hello Richard,

Recently, I was looking into (but not working on, officially) rationale behind using B*Tree indexes (many of them on different combinations of columns) on one of the tables in our application. Being a follower of your blog, I knew you would have something helpful and I found this. Great explanation. Now, coming back to ours table, although this is an OLTP-type of system, this particular table gets populated/updated at regular intervals using a batch job (which translates and derives data before loading/merging it into this table). Once populated, it is queried extensively based on various (but fixed set of) combinations of columns. Recently, the application started facing huge performance issues, either during data upload or data querying. Basically, many (composite) indexes (normal and function-based) was identified as one of the biggest reasons for upload process being slow (i.e. missing SLAs). Reading your article (again) made me wonder if Bitmap indexes on individual columns instead of composite B*Tree indexes, might help. As you have proved, the size of the Bitmap indexes is much less than corresponding B*Tree indexes. I am though not sure if upload process would benifit or have detrimental effect due to Bitmap indexes. I am told that the upload process has been designed/tuned for efficiency. Would you have any pointers?

Richard Foote - August 20, 2009

Hi Narendra

Indexes slow table changes as the indexes need to be updated as well. Depending on your database version, bitmap indexes are likely to be even more problematic as they can cause contention issues and mucho redo/undo and can bloat out in size considerably as well (especially for pre 10g bitmap indexes, less so since 10g).

Depending on usage and requirements, it might be worth investigating making non-critical indexes unsable and rebuild them again after the load. You might end up with a faster overall completion times and more efficient index structures as well.

If your application is OLPT and if you have any additional concurrent DML after the data loads, then I would consider bitmap indexes. The locking issues especially can likely cripple your application.

Narendra - September 11, 2009

Hello Richard,

Thanks for your reply and apologies for my late comment.
The database version is 10.2.0.4 so I believe a bitmap index exploding in size due to many DML operations (and corresponding redo/undo generation) is not much to worry about (although we will definitely have to test and prove it).
Funny thing is, one of the proposed (and implemented) solutions to address severe performance issues is making many existing B*Tree indexes unusable before data load and rebuilding them later. The main driving factors behind suggesting bitmap indexes were
a) As said earlier, this table gets populated/updated using batch job that run at regular intervals. So there is no concurrent DML activity to worry about. In the sense of how data is maintained in this table, it is more similar to the way a DataWarehouse table is maintained than a Transactional table is maintained in OLTP system.
b) Once loaded, this table is extensively accessed using predicates on combination of some (or many) columns in this table. In the current setup, the designer seems to have proposed (and implemented) multiple composite B*Tree indexes that match with WHERE clauses of the queries against this table. This has ensured that all queries use indexes and get executed in sub-second. But it has also resulted in one column being included in multiple indexes and that too at different locations. This effectively means when such a column value is updated, 3-4 indexes will need to be updated.
I was hoping that single column bitmap indexes (on selected columns) instead of multiple composite B*Tree indexes will have following benifits:
a) By definition and due to being on one column, these indexes will be much smaller in size and hence more efficient to rebuild, if required.
b) Since most of the queries against the table have multiple columns in WHERE clause, bitmap indexes will be more effective for AND/OR operations
c) During data load, updating a column (that has a bitmap index) will require updating only a single index.
Hope my thinking is not completely off-the-track here.

21. Narendra - August 12, 2009

Richard,

Apologies for another message. But I was also wondering why does Oracle decide to use B*Tree index instead of Bitmap index, even when cost of B*Tree access is higher than the Bitmap access? I observed this on the table in question.

Richard Foote - August 20, 2009

Hi Narendra

I’m not sure, I would need to see details of the specific queries and associated stats to determine why this might be the case ?

Narendra - September 11, 2009

Richard,

Again apologies for late reply (blame it on my vacation…)
I will see if I can produce a working example. The main issue in producing such example is generating the test data. At present, I am experiencing this behavior on a CHARACTER data that is meaningful and hence the indexes on the same have a pattern (used space, clustering factor etc.). I am not sure how I can produce large volume of meaningful CHARACTER test data so that corresponding indexes follow the same pattern.

22. Richard Foote - September 16, 2009

Hi Narendra

No, I think you’re probably going down the right track, IF IF IF, you indeed don’t have concurrent DML on this table.

With all these things, the best thing to do is to test it all out and see if indeed you do get the expected benefits. Certainly having fewer indexes will speed up data loads, having bitmap indexes will potentially increase flexibilty.

Your own tests and timings will confirm whether you’re going down the right path.

With your other question, a bit more detail would certainly be helpful.

Narendra - September 16, 2009

Richard,

Thanks for your inputs. I am bit confused as what more details are needed and for which other question. Am I correct in assuming that your above (latest) reply is intended to be for my explanation of reasons for possibly using bitmap indexes above?

Richard Foote - September 23, 2009

Hi Narendra

Sorry for being so vague. I was trying to refer to your question on why Oracle appears to be favouring the B-tree over the bitmap index :)

23. Donatello Settembrino - October 9, 2009

Hi Richard,
Recently, my curiosity (even my work) has prompted me to investigate the CBO.

Some obstacles I’ve left behind, with other instead, I’m still fighting.
Studying the bitmap index and trying to understand how to calculate the cost, I found some threads where was explained, that a good approximation it’s given whereas the 80% of clustered data and 20% of scattered data, like so:

dsettembrino>create table t1
2 as
3 select mod((rownum),20) n1,
4 lpad(‘x’, mod((rownum),20)) c1,
5 sysdate – rownum d1
6 from all_objects
7 where rownum create bitmap index t1_i1 on t1(n1);

dsettembrino>begin
2 DBMS_STATS.GATHER_TABLE_STATS (
3 ownname => ‘SETTEMBRINO’,
4 tabname => ‘T1′,
5 estimate_percent => 100
6 );
7 end;
8 /

dsettembrino>SET AUTOTRACE TRACEONLY;

dsettembrino>select /*+ index(t1) */ * from t1 where n1 = 5;

Selezionate 6942 righe.
—————————————————————————
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)|
—————————————————————————
| 0 | SELECT STATEMENT | | 6942 | 142K| 93 (0)|
| 1 | TABLE ACCESS BY INDEX ROWID | T1 | 6942 | 142K| 93 (0)|
| 2 | BITMAP CONVERSION TO ROWIDS| | | | |
|* 3 | BITMAP INDEX SINGLE VALUE | T1_I1 | | | |
—————————————————————————

dsettembrino>set autotrace off;

Whereas:

dsettembrino>select num_rows, blocks , TRUNC(NUM_ROWS/BLOCKS) AS ROWS_FOR_BLOCKS
2 from user_tables
3 where table_name = ‘T1′;

NUM_ROWS BLOCKS ROWS_FOR_BLOCKS
———- ———- —————
138830 262 529

dsettembrino>SELECT ROUND(((0.8 * 6942/529) + (0.2 * 262)), 2) AS COST FROM DUAL;

how
6492 is the row retrieved, 529 rows for block e 262 the blocks of the table

COST
———-
62,9

The cost is about 63 vs. 93, that of plan table.

Is my interpretation correct?
What escapes me?

Thank you a lot

Bye

Donatello


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 1,715 other followers

%d bloggers like this: