Myth: Bitmap Indexes With High Distinct Columns (Supermassive Black Hole) March 3, 2010
Posted by Richard Foote in Bitmap Indexes, CBO, Clustering Factor, Oracle Indexes, Oracle Myths.trackback
As discussed in my previous post, it’s a silly myth to suggest a bitmap index should only be used for so-called “low cardinality” columns else the resultant index would be “huge”. I thought it might be worth a quick refresh to see a simple little demonstration why such claims are a nonsense. There is in fact no limit to the number of distinct values by which a column could be considered a candidate for a bitmap index.
I’ve already shown how a bitmap index on a column with 10,000 distinct values could potentially be smaller than an index with just 4 distinct values, even with the same data type and size. The number of distinct values in a column is but one small consideration, the number of rows in the table and the average ratio of repeated values are just as important. The other important consideration that can significant impact the size of a bitmap index is the distribution and clustering of the data within the table as we shall see.
Using the same demo as the previous post, a reminder of the size of our bitmap index on a column with 10,000 distinct values in a 1 million row table.
SQL> select index_name, index_type, distinct_keys, blevel, leaf_blocks from dba_indexes where index_name = 'BIG_BOWIE_CODE_BITMAP_I'; INDEX_NAME INDEX_TYPE DISTINCT_KEYS BLEVEL LEAF_BLOCKS ------------------------- ---------- ------------- ------- ----------- BIG_BOWIE_CODE_BITMAP_I BITMAP 10000 1 52
Let’s now see if the CBO will actually use this bitmap index on its own:
SQL> select * from big_bowie where code = 42; 100 rows selected. Execution Plan ---------------------------------------------------------- Plan hash value: 4280385727 -------------------------------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | -------------------------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 100 | 7300 | 22 (0)| 00:00:01 | | 1 | TABLE ACCESS BY INDEX ROWID | BIG_BOWIE | 100 | 7300 | 22 (0)| 00:00:01 | | 2 | BITMAP CONVERSION TO ROWIDS| | | | | | |* 3 | BITMAP INDEX SINGLE VALUE | BIG_BOWIE_CODE_BITMAP_I | | | | | -------------------------------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 3 - access("CODE"=42) Statistics ---------------------------------------------------------- 0 recursive calls 0 db block gets 6 consistent gets 0 physical reads 0 redo size 1854 bytes sent via SQL*Net to client 396 bytes received via SQL*Net from client 2 SQL*Net roundtrips to/from client 0 sorts (memory) 0 sorts (disk) 100 rows processed
Absolutely it does. As there are on average just 100 rows per distinct value, that’s a small enough selectivity for the CBO to use the bitmap index on its own. Note the query has used just 6 consistent gets to return the 100 rows of data.
Let’s now drop the bitmap index and see how a regular B-Tree index would compare and perform:
SQL> drop index big_bowie_code_bitmap_i; Index dropped. SQL> create index big_bowie_code_i on big_bowie(code); Index created. SQL> select index_name, index_type, distinct_keys, blevel, leaf_blocks, clustering_factor as cf from dba_indexes where index_name = 'BIG_BOWIE_CODE_I'; INDEX_NAME INDEX_TYPE DISTINCT_KEYS BLEVEL LEAF_BLOCKS CF -------------------- ---------- ------------- ------- ----------- ---------- BIG_BOWIE_CODE_I NORMAL 10000 2 2090 10895
The first thing we notice is that the B-Tree index is significantly larger than the equivalent bitmap index. 1090 leafs blocks whereas the bitmap index was only 56 leaf blocks. So it’s not the bitmap index that’s so-called “huge” here on a column with 10,000 distinct values but the corresponding B-Tree index. Notice also that the Clustering Factor of the index is quite good at 10,895 in a 1 million row table.
If we now run the same query as before:
SQL> select * from big_bowie where code = 42; 100 rows selected. Execution Plan ---------------------------------------------------------- Plan hash value: 1856845120 ------------------------------------------------------------------------------------------------ | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | ------------------------------------------------------------------------------------------------ | 0 | SELECT STATEMENT | | 100 | 7300 | 5 (0)| 00:00:01 | | 1 | TABLE ACCESS BY INDEX ROWID| BIG_BOWIE | 100 | 7300 | 5 (0)| 00:00:01 | |* 2 | INDEX RANGE SCAN | BIG_BOWIE_CODE_I | 100 | | 3 (0)| 00:00:01 | ------------------------------------------------------------------------------------------------ Predicate Information (identified by operation id): --------------------------------------------------- 2 - access("CODE"=42) Statistics ---------------------------------------------------------- 0 recursive calls 0 db block gets 8 consistent gets 0 physical reads 0 redo size 1854 bytes sent via SQL*Net to client 396 bytes received via SQL*Net from client 2 SQL*Net roundtrips to/from client 0 sorts (memory) 0 sorts (disk) 100 rows processed
The query also uses the index but as the B-Tree index is somewhat larger, the overall number of consistent reads has increased from 6 up to 8. So not only is the bitmap index substantially smaller despite having 10,000 distinct values, but it’s also more efficient to use as a result.A key reason why the bitmap index is so small and compact is because the effective Clustering Factor of the indexed column is excellent. The data was inserted into the table in CODE column order and so all the values with the same CODE value are grouped, ordered and clustered together within the table. Within the bitmap index, this means there are large and few grouping of zeros (0) that can be compressed extremely efficiently.
For example, for the first CODE value of 1, the bitmap sequence would look something like:
111111 …. 1110000000000000000000000000000000……..000000
with the 100 values of 1 (true) grouped together followed only by a series of effectively 999,900 zeros (o representing false). The 999,900 zeros can be compressed back almost nothing at all. Note there could actually be somewhat more false (0) values than actual rows in the table but that’s a discussion for another day.
The next value of 2 would have a bitmap sequence something like:
00000000….0001111111…11111100000000000000000000…0000
with 100 zeros followed by 100 1s followed by effectively 999,800 zeros, with again the two grouping of zeros compressed down to almost nothing at all.
And so on …
Let’s now create a different table that contains the identical data but this time with the CODE values effectively randomized throughout the table. The Clustering Factor of the CODE column in this case will be terrible:
SQL> create table big_bowie_2 as select * from big_bowie order by mod(id,100); Table created. SQL> create bitmap index big_bowie_2_code_bm_i on big_bowie_2(code); Index created. SQL> exec dbms_stats.gather_table_stats(ownname=>null, tabname=>'BIG_BOWIE_2', estimate_percent=>null, method_opt=> 'FOR ALL COLUMNS SIZE 1'); PL/SQL procedure successfully completed. SQL> select index_name, index_type, distinct_keys, blevel, leaf_blocks from dba_indexes where index_name = 'BIG_BOWIE_2_CODE_BM_I'; INDEX_NAME INDEX_TYPE DISTINCT_KEYS BLEVEL LEAF_BLOCKS ---------------------- ---------- ------------- ------- ----------- BIG_BOWIE_2_CODE_BM_I BITMAP 10000 1 520
The first thing we notice is that the bitmap index is now substantially larger than it was previously. It’s gone from just 52 leaf blocks all the up to 520 blocks, a whole 10 times larger than before. The reason is all to do with the clustering of the data as now values for CODE are littered all over the table and are no longer grouped together.
The bitmap sequence for the first CODE value of 1 would now look something like:
00000000000100000000000000000..0000010000000000000000…0000010000000000001000000000…00000100000….
with the 1s now littered all over the place. This means it’s far less efficient to compress the groups of zeros as there are now substantially more such groupings than before. The index is now 10 times larger as a result.
If we now run the same query as before:
SQL> select * from big_bowie_2 where code = 42; 100 rows selected. Execution Plan ---------------------------------------------------------- Plan hash value: 1324437154 ------------------------------------------------------------------------------------------------------ | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | ------------------------------------------------------------------------------------------------------ | 0 | SELECT STATEMENT | | 100 | 7300 | 22 (0)| 00:00:01 | | 1 | TABLE ACCESS BY INDEX ROWID | BIG_BOWIE_2 | 100 | 7300 | 22 (0)| 00:00:01 | | 2 | BITMAP CONVERSION TO ROWIDS| | | | | | |* 3 | BITMAP INDEX SINGLE VALUE | BIG_BOWIE_2_CODE_BM_I | | | | | ------------------------------------------------------------------------------------------------------ Predicate Information (identified by operation id): --------------------------------------------------- 3 - access("CODE"=42) Statistics ---------------------------------------------------------- 0 recursive calls 0 db block gets 103 consistent gets 0 physical reads 0 redo size 1854 bytes sent via SQL*Net to client 396 bytes received via SQL*Net from client 2 SQL*Net roundtrips to/from client 0 sorts (memory) 0 sorts (disk) 100 rows processed
Although the CBO is still using the bitmap index on its own, the performance of the query has deteriorated substantially with the number of consistent gets increasing from 6 all the way up to 103.
So the bitmap index is now nowhere near as efficient. A bitmap index isn’t that great with large numbers of distinct values if the associated clustering is poor and so the B-Tree index is the way to go after all, right ?
Well just wait a minute. If the clustering is poor for a bitmap index, the clustering will likewise be poor for a corresponding B-Tree index as well. Most of these consistent reads are due to reading the data out of the table, not from reading the index directly. So the performance of using an equivalent B-Tree index is also likely to be impacted as well.
Let’s compare the difference with a B-Tree index:
SQL> drop index big_bowie_2_code_bm_i; Index dropped. SQL> create index big_bowie_2_code_i on big_bowie_2(code); Index created. SQL> select index_name, index_type, distinct_keys, blevel, leaf_blocks, clustering_factor as cf from dba_indexes where index_name = 'BIG_BOWIE_2_CODE_I'; INDEX_NAME INDEX_TYPE DISTINCT_KEYS BLEVEL LEAF_BLOCKS CF ---------------------- ---------- ------------- ------- ----------- ---------- BIG_BOWIE_2_CODE_I NORMAL 10000 2 2090 999922
The first thing to note here is that the B-Tree index is still 2090 leaf blocks in size. Even compared with the now far less efficient Bitmap index, at 520 leaf blocks it’s still approximately just 25% the size of the B-Tree index. So the 10,000 distinct value bitmap index, even when it’s as inefficient as possible, still can’t be described as “huge” here as it’s still only a 1/4 the size of the corresponding B-Tree index. With a Clustering Factor of 999,992 on a 1 million rows table, it doesn’t really get much worse than that and yet the Bitmap index on a 10,000 distinct column is still way ahead of the B-Tree index.
Let’s see how the query performs now:
SQL> select * from big_bowie_2 where code = 42; 100 rows selected. Execution Plan ---------------------------------------------------------- Plan hash value: 2550134594 -------------------------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost(%CPU)| Time | -------------------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 100 | 7300 | 103 (0)| 00:00:01 | | 1 | TABLE ACCESS BY INDEX ROWID| BIG_BOWIE_2 | 100 | 7300 | 103 (0)| 00:00:01 | |* 2 | INDEX RANGE SCAN | BIG_BOWIE_2_CODE_I | 100 | | 3 (0)| 00:00:01 | -------------------------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 2 - access("CODE"=42) Statistics ---------------------------------------------------------- 0 recursive calls 0 db block gets 105 consistent gets 0 physical reads 0 redo size 1854 bytes sent via SQL*Net to client 396 bytes received via SQL*Net from client 2 SQL*Net roundtrips to/from client 0 sorts (memory) 0 sorts (disk) 100 rows processed
As we can see, the performance of the B-Tree index has likewise deteriorated with such a bad Clustering Factor. At 105 consistent gets, it’s still worse than the corresponding Bitmap index which only needed 103 consistent gets.
With a Bitmap index that is as inefficient as it can be, on a column that has 10,000 distinct values in a table of only 1 million rows, the Bitmap index still outperforms the corresponding B-Tree index.
It’s a complete myth and utter nonsense that a bitmap index is only suitable for low cardinality columns and would become “HUGE” and be 100s of times slower if created on so-called high cardinality column
Let me repeat: There is no limit to the number of distinct values in a column for it to be considered for a Bitmap index.
Excellent demonstration, I am currently considering testing bitmap indexes on some of our tables. I have noticed some execution plans of current SQL queries contain the following operations:
TABLE ACCESS BY INDEX ROWID | T1 |
BITMAP CONVERSION TO ROWIDS | |
BITMAP OR | |
BITMAP CONVERSION FROM ROWIDS | |
INDEX RANGE SCAN | T2 |
BITMAP CONVERSION FROM ROWIDS | |
INDEX RANGE SCAN | T3 |
Do these operations indicate an area where we might consider using bitmap indexes?
LikeLike
Hi Tony
Yes it does. Oracle is basically creating the bitmaps on the fly from a couple of B-Tree indexes and using the bitmap OR operation to determine the matching set of rowids.
Note as always though, that if the table has concurrent DML requirements, then the locking issues likely introduced by the bitmap indexes are more trouble than they’re worth.
LikeLiked by 1 person
Richard,
thank you for this Blog and all the information on the behaviour of indexes you provide. I have one question converning the last sentence: “There is no limit to the number of distinct values in a column for it to be considered for a Bitmap index.” So what about the most extreme case: I guess a bitmap index on a unique column would be bigger than a corresponding b*tree index (and contain a lot of 0-values and a single 1-value for every bitmap, so the order of the table would have no impact on the size of this index); indeed I did not only guess but also build a test case to verify the assumption. So my question is: could you imagine a reason for defining a bitmap index on a unique column?
Regards
Martin Preiss
LikeLike
Hi Martin
No, I can’t imagine such a reason.
To start, Oracle doesn’t support an actual Unique Bitmap index. That’s because a Unique index must be more efficient as a B-Tree than a bitmap because an B-Tree index (on one column), only has to store:
The indexed value + 1 byte
The rowid
2 Bytes of overhead
However, a corresponding bitmap needs to store:
The index value + 1 byte
2xRowids
The bitmaps for the index entry
So a bitmap index has to be larger and less efficient than the B-Tree by definition.
You need the column to have on average a number of repeated values for it to be considered as a bitmap index, if the column is unique or very close to being unique then you’ve gone too far the other way.
That’s why a column that has say 10 million distinct values needs to be in a table with (say) 100 million rows for it to be considered. If the table only has 10.0001 million rows, then it would be too unique to be considered as a bitmap.
LikeLike
Richard,
thank you for the detailed explanation – I was quite sure there was no good reason, but wanted to get additional confirmation… (so it was more of academical interest than of practical relevance)
Regards
Martin Preiss
LikeLike
Allow me a caution message: the reluctance to use bitmap indexes for high-cardinality columns is not so much to do with the size of the index.
It’s mostly to do with what happens to the index when it is subjected to changes – update/insert/delete.
Fact is: data changes. Few of us have the luxury of static data to deal with, even in DW environments.
That’s where the problems usually show up with bitmap. I agree entirely that the size argument for high-cardinality is not as relevant *if* care is taken with data distribution of said column.
But on the field very few tables lend themselves to that sort of arrangement.
What I often observe is what I call pseudo-organized distribution: someone loads a table with a given initial distribution and then things go pear-shaped as updates to the table change that distribution.
The result is a horrible salad where part of the table’s data matches the physical arrangement that favours one type of index whereas the rest is a worst-case example.
Then it all becomes a maintenance nightmare with frequent “re-orgs” thrown in.
I think there might be a place here to investigate the effect of methods of storage that enforce a given physical organization: IOTs and clustering. Then the dba may be guaranteed that future data volatility will not stuff up a carefully arranged distribution to favour one or other type of indexing.
Thanks for this great series on bitmap indexing, very thought provoking.
LikeLike
Hi Noons
It depends somewhat on the distribution of values as I’ve shown, but it also depends very much on the size of the table. In my example above, it doesn’t really matter what the distribution might be, even in the worst case scenario, the bitmap index is considerably better than a corresponding B-Tree.
If a table is large (say 100 Million rows) then likely any column with 1 million distinct values would outperform a B-tree, regardless of the distribution.
If a table is subjected to too much DML and the DML is concurrent, then the locking implications alone rule out the bitmap index, no matter how tiny it might be in comparison.
Certainly if the index is borderline and dependant on data clustering to be effective, then the bitmap index is likely more trouble than it’s worth if you need to reorg all the time, except of course for DW tables which are purged and repopulated, then it might not be such an issue ensuring the data is populated in a specific order that benefits a particularly important index.
As always, it depends.
LikeLike
How true! Thanks heaps for the clarification, it is much appreciated.
LikeLike
Noons
Structuring the data order is not the complete answer – in a DW with typically many bitmaps on (say) the dimensional keys of the fact table it is unlikely that the optimal ordering for one column is the same ordering for another column. True, correlation between columns is often a help – customers shop at nearby stores and certain products are sold in certain countries for examples but in the main we are going to get the odd big bitmap along with the small.
There are things in the DW mange process we can do to help the problem of updates overtime – especially if you have paid for the luxury of partitioning where we can devise low user impact ways of keeping the bitmaps as optimal as possuble
LikeLike
Hi Pete
That’s of course correct, you order one way and benefit one index but you potentially stuff up the clustering of another index in the process.
Partitioning can be useful as indeed can the minimize records_per_block in reducing the index size.
LikeLike
Hi,
when using partitioning + min records per block in order to save on bitmap index size, just be aware of the infamous ORA-14643 – when trying to do an exchange partition data load.
essentially, the swapped-in table/partition have to match in their rows-per-block ratio, if you are using minimize records per block. I consider that hardly feasible.
Karsten
LikeLike
Thanks, Pete. I’m finding partitioning more and more useful in our DW, since we finally managed to grab the licence from management!
With partitions and sub-partitions, we now got a few of our mid-size tables to not need a lot of the previous indexing at all!
Turns out with a little bit of query predicate juggling, we can make 90% of the monthly processing hit a single or couple of sub-partition(s), which we are happy to full scan in most cases as it is sufficiently small.
Before, we were indexing all over the place to get to the same working set size!
The improvement in disk space, I/O overhead and maintenance is absolutely overwhelming. And the management is over the moon with the execution time reductions!
We still need indexing, of course. But it is much reduced and a lot more efficient, now. I’m now investigating the use of bitmaps for further improvements, hence the interest Richard’s series has created!
LikeLike
Nice example Sir !!!
Thanks for all your great work!!
One question:
why CBO choose index access after you redistribute the data randomly, even though the clustering factor is almost equal to the no of records?
Regard’s
LikeLike
Hi Noons
No worries. Good to hear from you again.
LikeLike
Hi Henish
Because the number of records to be retrieved is very low, just 100 out of 1 million. So the selectivity is just a small 0.01% and it’s far cheaper to use the index to retrieve 100 rows than perform a FTS.
The other advantage of using a bitmap index on high cardinality columns is that you increase the chance of the CBO actually using the index on its own.
LikeLike
Richard,
That’s more of a statistical side effect than anything else.
The optimizer has an indication of the data scatter for a b-tree index in the clustering_factor; but the clustering_factor is meaningless in a bitmap index – it’s just a copy of the num_rows figure. So for data scatter when doing the bitmap arithmetic the optimizer assumes (approximately) that 20% of the data required will be well clustered and 80% will be widely scattered. (That’s mentioned somewhere in the Oracle Wait Interface book – and I think I demonstrated it in CBO-F).
So if you’ve got data which is very well clustered (better than 20/80) the optimizer may use a b-tree index when it ignores a bitmap index, and if you’ve got data that is badly clustered (worse than 80/20) then the optimizer may use a bitmap when it ignores a b-tree index.
It seems that most people have data with very poor clustering when comparing b-tree indexes to bitmap indexes.
LikeLike
Hi Martin
No worries. A bit of accademic interest is always useful 🙂
LikeLike
[…] dat bitmap indexen alleen geschikt zijn voor kolommen met een beperkte set aan waarden. In deze blog toont Richard Foote aan dat 10.000 verschillende waarden in een tabel met 1.000.000.000 rijen […]
LikeLike
I don’t know German, here’s hoping the reference is positive 😉
LikeLike
I guess it is positive – but it’s Dutch…
LikeLike
It is Dutch & Google knows it 😛
http://bit.ly/c5b1iu
LikeLike
Hi Jonathan
As alway, thanks for the input.
Just to clarify for those peeking in, the num_rows referred by Jonathan is the index num_rows, not the table num_rows. So indeed, the actual CF of a bitmap index is somewhat meaningless and that’s why I only listed it in the above demo when referencing the b-tree index. Therefore Oracle has to make assumptions regarding the actual CF of a bitmap index that might be entirely wrong.
However, getting back to the question raised by Henish, if the selectivity of an index is really low, the CF has reduced impact anyways. For 100 rows, whether the CF is actually high or low, it doesn’t really matter either way, as it’s likely cheaper than performing a FTS on a say 10,000 block table.
BTW, I’m currently in Qantas lounge in Sydney, about to embark on a 24 hour total journey to the US and so on my third glass of champange, as I need a little artifical courage due to my dislike of flying. If nothing of what I’ve said makes any sense, I hope this goes close to explaining why 😉
LikeLike
Richard,
I should have made it clear that my comments were addressed at the comment about turning b-tree indexes into bitmap indexes.
It’s a comment I’ve seen often enough as the wrong answer to a problem that I’d like to put it into the “index myth” list – except that it often works.
The point, which I appreciate is not one that you need to be told, is that:
“because it’s not a bitmap” is NEVER the correct answer to the question “why isn’t Oracle using my index”, but some people have been persuaded that it is.
LikeLike
Hi Jonathan
Understood, and yep I totally agree with you and is actually one of the points I make in my indexing myths presentation.
LikeLike
@ Martin Preiss
LOL !!
Confusing Dutch for German has got to be one of the funniest things I done in a long time 🙂
Thank you for the clarification.
LikeLike
[…] far away — with my own slides. Though, I did manage to focus on bitmap indexes part and the myth of bitmap indexes not working well for columns with high cardinality. Very interesting conclusions. I’m still […]
LikeLike
Hi Karsten
Good point. Certainly something to be aware of.
Thank you.
LikeLike
“Confusing Dutch for German has got to be one of the funniest things I done in a long time :)”
You must be really boring! 🙂
Google translation of the Dutch shows the final paragraph as:
For dessert have a tip for the hard-core Oracle performance tuners, a tip that many bakers actually query should know. It is a myth that bitmap indexes are only suitable for columns with a limited set of values. Richard Foote in this blog shows that 10,000 different values in a table with rows 1,000,000,000 efficiently to consult with a bitmap index with a ‘normal’ b-tree index. Even if the conditions for a bitmap index rather unfavorable. Tie it in the ears!
LikeLike
Hi Brian
LOL !! Thank-you for your efforts.
LikeLike
Too excited to try translate.google.com, i didn’t even scroll till the end of the page. Apologies for inserting duplicate rows 😉
LikeLike
[…] Myth: Bitmap Indexes with high distinct columns (Supermassive Black Hole) […]
LikeLike
Hi,
thanks for this article.
I have a question, why in the first case the cost is higher for the bitmap index than for the B-tree index. Is it correct or wrong?
I can’t figure it out.
LikeLike
Hi Zdenek
That’s a whole new area of discussion (and one I probably should go into one day) !!
Basically, the manner in which the use of bitmap indexes are costed are somewhat different from b-tree indexes, primarily because the clustering factor is meaningless in the manner it’s captured for bitmap indexes. As such, costings of bitmap index accesses are not listed within the execution plan.
A topic worthy of a blog entry on it’s own (see Jonathan Lewis’s comments for a summary of the issue).
LikeLike
[…] hamper the effective compression capabilities of the index (I’ve discuss the impact of the Clustering Factor on the effectiveness of Bitmap Index compression previously). Therefore, it might well be beneficial to more accurately determine the number of […]
LikeLike
[…] not a “low cardinality” column – but, as Richard Foote (among others) has often had to point out, it’s wrong to think that bitmap indexes are only suitable for columns with a very small […]
LikeLike