jump to navigation

So What Is A Good Cardinality Estimate For A Bitmap Index Column ? (Song 2) April 13, 2010

Posted by Richard Foote in Bitmap Indexes, Non-Unique Indexes, Oracle Indexes, Oracle Myths.
trackback

As I’ve discussed previously, using a Bitmap index on a unique column makes little sense as the underling index must be larger than a corresponding B-tree index due to the implicit additional overheads associated with Bitmap indexes. As such, Oracle doesn’t permit the use of a Bitmap Index on a declared unique column or to police a unique constraint.  Therefore, some amount of index entry duplication is necessary for a Bitmap index to be considered.

However, an interesting question is how much duplication is actually necessary ? At what point does a Bitmap index have the potential to be equivalent or better than a corresponding B-Tree index ?  The answer will perhaps surprise many, especially those that only consider Bitmap Indexes to be viable for so-called “low cardinality”columns on large tables where there could be many millions of occurrences of each distinct value in a column.
 
If one actually looks at what comprises an index entry in each type of index and understands somewhat how the bitmap column is comprised and effectively compressed, the rough ballpark answer becomes quite easy to determine.
 
Remember, for a non-compressed, non-unique B-Tree index, an index entry comprises:
 
The indexed column or columns (however long the index column values might be)
6 bytes for the rowid (assuming a local index or index on non-partitioned tables)
2 bytes for flag and lock bytes
1 byte for each index column (a minimum of 2 bytes)
 
So that’s 10 bytes plus the size of the actual indexed column for a single column B-Tree index. The key point here however is that there’s an index entry for each and every not null index value.
 
For a Bitmap index, an index entry comprises:
 
The index column(s)
2 x 6 byte rowid
2 bytes for flag and lock bytes
1 byte for each index column (a minimum of 4 bytes)
? bytes for the actual bitmap sequence
 
So the additional overheads comes down to the additional 6 byte rowid and the length of the bitmap column. The key point here though is that there may only need be one index entry (or Bitmap index “piece”) for each distinct indexed value. However, if the number of occurrences of each index value is very low (eg: say single figures), then it’s almost certain only one bitmap index entry (piece) would be necessary for each indexed value.
 
The number of bytes required for the actual bitmap column depends on many factors, including the number of occurrences of each indexed value and the clustering of the data in the table. However again, if the number of occurrences of each index value is very low (eg: say single figures), it means the vast majority of bits are 0 (false) within each bitmap sequence and so can be compressed extremely efficiently. If there are only a handful of 1 (true) bits within a bitmap index entry, the bitmap column is going to be tiny and effectively compressed to only a few bytes.
 
Therefore, the actual additional overheads for a bitmap index with few repeated values is only the 7 byte overhead for the additional rowid and its length and a few bytes for the actual bitmap column. But remember, this single bitmap index entry can cater for all occurrences of the indexed column, whereas the B-Tree index requires an index entry for each and every occurrence of the index column.
 
It doesn’t take much for these additional overheads for each Bitmap index entry to start to cancel out …
 
If we look at an indexed column (say length 4 bytes) that has on average just the one duplicate value:
 
Total for a B-Tree Index would be:
 
4 bytes index column
6 bytes rowid
2 bytes for each index column length byte(remembering the rowid is an index column in a non-unique index)
2 bytes for flag and lock bytes
 
= 14 bytes x 2 (for each index entry as there’s a duplicate value) = 28 bytes in total
 
For a corresponding Bitmap index:
 
4 bytes index column
12 bytes for the two rowids
4 bytes for each index column length byte (remembering the rowids and bitmap sequence are effectively additional indexed columns)
2 bytes for flag and lock bytes
2 bytes is all it takes for the bitmap sequence column (if there’s only 2 actual true bits per index entry)
 
= 24 bytes in total.
 
So we’re already in a position for a Bitmap index to potentially be the smaller and more efficient index type, even when there’s only just one duplicate on average per index column value …
 
If we have another duplicate value (so there are on average 3 occurrences of each index value), then the overheads for such a B-Tree becomes:
 
3 x 14 bytes = 42 bytes
 
but the overheads for the bitmap index only increases by a byte or so for the necessary increase in the bitmap column. So the difference in space between the two index types starts to widen significantly.
 
Obviously, the size of the index column becomes a factor in the potential savings with a Bitmap index as it only has to potentially store the index column once whereas the (non-compressed) B-Tree index needs to store all occurrences of the index value. To illustrate the comparative differences between a B-Tree and a Bitmap Index, I’m going to create various tables with columns that have different levels of cardinality and different clustering attributes for their indexed columns and compare the size differences between B-Tree and Bitmap indexes. The indexed column will simply be a small NUMBER type column to make it just that bit harder for the Bitmap index to be the more efficient.

In the first example, 1/3 of all values are unique while the remaining 2/3 of values have just 2 occurrences of each value. The column is certainly not unique but is arguably “approaching” uniqueness.
Initially, the indexed column is very well clustered within the table (although with a bitmap index, the clustering factor in the index statistics is useless as it simply denotes the number of index entries within the index).
 

SQL> create table bowie as select rownum id, ceil(rownum/1.5) key, 'David Bowie' name from dual connect by level <= 1000000;
 
Table created.
 
SQL> select count(distinct key) from bowie;
 
COUNT(DISTINCTKEY)
------------------
            666667
 
SQL> create bitmap index bowie_bitmap_i on bowie(key) pctfree 0;
 
Index created.
 
SQL> select index_name, leaf_blocks, clustering_factor from dba_indexes where index_name = 'BOWIE_BITMAP_I';
 
INDEX_NAME        LEAF_BLOCKS CLUSTERING_FACTOR
----------------- ----------- -----------------
BOWIE_BITMAP_I           2129            666667
 
SQL> drop index bowie_bitmap_i;
 
Index dropped.
 
SQL> create index bowie_btree_i on bowie(key) pctfree 0;
 
Index created.
 
SQL> select index_name, leaf_blocks, clustering_factor from dba_indexes where index_name = 'BOWIE_BTREE_I';
 
INDEX_NAME        LEAF_BLOCKS CLUSTERING_FACTOR
----------------- ----------- -----------------
BOWIE_BTREE_I            1999              3738

 

Although some may claim such a Bitmap index, one with 666,667 distinct values in a 1 million row table would be thousands of times larger and less efficient than an equivalent B-Tree index, it’s actually quite a close call. The bitmap index is only 124 leaf blocks different or approximately 6.5% larger in size than the B-Tree index.
 
If we create an equivalent table but this time with the clustering of the data all over the place:
 
 


SQL> drop table bowie2;
 
Table dropped.
 
SQL> create table bowie2 as select * from bowie order by mod(id, 100);
 
Table created.
 
SQL> create bitmap index bowie2_bitmap_i on bowie2(key) pctfree 0;
 
Index created.
 
SQL> select index_name, leaf_blocks, clustering_factor from dba_indexes where index_name = 'BOWIE2_BITMAP_I';
 
INDEX_NAME        LEAF_BLOCKS CLUSTERING_FACTOR
----------------- ----------- -----------------
BOWIE2_BITMAP_I          2246            666667
 
SQL> drop index bowie2_bitmap_i;
 
Index dropped.
 
SQL> create index bowie2_btree_i on bowie2(key) pctfree 0;
 
Index created.
 
SQL> select index_name, leaf_blocks, clustering_factor from dba_indexes where index_name = 'BOWIE2_BTREE_I';
 
INDEX_NAME        LEAF_BLOCKS CLUSTERING_FACTOR
----------------- ----------- -----------------
BOWIE2_BTREE_I           1999            999722

 

We notice that the B-Tree index remains the same size but the Bitmap index is now a little larger by 117 additional leaf blocks. So even with an awful clustering, the Bitmap index is only 247 leaf blocks or approximately 12.3% larger than the B-Tree index. So the B-Tree just wins out in this case, the column is still just that bit too unique for the Bitmap index where we have 666,667 distinct values in a 1 million row table.
 
OK, let’s see how the indexes compare when the index column has on average 1 duplicate for each and every indexed value. There are just 2 values for each and every indexed value, 500,000 distinct values in a 1 million row table: 
 

 
SQL> drop table bowie;

Table dropped.

SQL> create table bowie as select rownum id, ceil(rownum/2) key, 'David Bowie' name from dual connect by level <= 1000000;

Table created.

SQL> create bitmap index bowie_bitmap_i on bowie(key) pctfree 0;

Index created.

SQL> select index_name, leaf_blocks, clustering_factor from dba_indexes where index_name = 'BOWIE_BITMAP_I';

INDEX_NAME        LEAF_BLOCKS CLUSTERING_FACTOR
----------------- ----------- -----------------
BOWIE_BITMAP_I           1628            500000

SQL> drop index bowie_bitmap_i;

Index dropped.

SQL> create index bowie_btree_i on bowie(key) pctfree 0;

Index created.

SQL> select index_name, leaf_blocks, clustering_factor from dba_indexes where index_name = 'BOWIE_BTREE_I';

INDEX_NAME        LEAF_BLOCKS CLUSTERING_FACTOR
----------------- ----------- -----------------
BOWIE_BTREE_I            1998              3728

 

OK, now the Bitmap index is well ahead. On a well clustered column that has 500,000 distinct values in a 1 million row table, the B-Tree index is now larger by 370 leaf blocks or by 22.7%. What if the data is poorly clustered:


SQL> drop table bowie2;
 
Table dropped.
 
SQL> create table bowie2 as select * from bowie order by mod(id, 100);
 
Table created.
 
SQL> create bitmap index bowie2_bitmap_i on bowie2(key) pctfree 0;
 
Index created.
 
SQL> select index_name, leaf_blocks, clustering_factor from dba_indexes where index_name = 'BOWIE2_BITMAP_I';
 
INDEX_NAME        LEAF_BLOCKS CLUSTERING_FACTOR
----------------- ----------- -----------------
BOWIE2_BITMAP_I          1806            500000
 
SQL> drop index bowie2_bitmap_i;
 
Index dropped.
 
SQL> create index bowie2_btree_i on bowie2(key) pctfree 0;
 
Index created.
 
SQL> select index_name, leaf_blocks, clustering_factor from dba_indexes where index_name = 'BOWIE2_BTREE_I';
 
INDEX_NAME        LEAF_BLOCKS CLUSTERING_FACTOR
----------------- ----------- -----------------
BOWIE2_BTREE_I           1998            999803

 

OK, the Bitmap index is now larger by 178 leaf blocks than it was before, but the equivalent B-Tree index is still larger by 10.6%.
 
Again, this is on a relatively small, extremely poorly clustered indexed column that has 500,000 distinct values in just a 1 million row table. The Bitmap index is smaller and more efficient than the equivalent B-Tree index.
 
If we now use a column that has on average 4 occurrences for each distinct column value (with 250,000 distinct values in a 1 million row table), the differences between a Bitmap and a B-Tree index begin to widen significantly.
 


SQL> drop table bowie;
 
Table dropped.
 
SQL> drop table bowie2;
 
Table dropped.
 
SQL> create table bowie as select rownum id, ceil(rownum/4) key, 'David Bowie' name from dual connect by level <= 1000000;
 
Table created.
 
SQL> create bitmap index bowie_bitmap_i on bowie(key) pctfree 0;
 
Index created.
 
SQL> select index_name, leaf_blocks, clustering_factor from dba_indexes where index_name = 'BOWIE_BITMAP_I';
 
INDEX_NAME        LEAF_BLOCKS CLUSTERING_FACTOR
----------------- ----------- -----------------
BOWIE_BITMAP_I            829            250000
 
SQL> drop index bowie_bitmap_i;
 
Index dropped.
 
SQL> create index bowie_btree_i on bowie(key) pctfree 0;
 
Index created.
 
SQL> select index_name, leaf_blocks, clustering_factor from dba_indexes where index_name = 'BOWIE_BTREE_I';
 
INDEX_NAME        LEAF_BLOCKS CLUSTERING_FACTOR
----------------- ----------- -----------------
BOWIE_BTREE_I            1995              3753

 

On a well clustered column with 250,000 distinct values in a 1 million row table, the Bitmap index is less that 1/2 the size of that of an equivalent B-Tree index. If the data were less so clustered:


SQL> create table bowie2 as select * from bowie order by mod(id, 100);
 
Table created.
 
SQL> create bitmap index bowie2_bitmap_i on bowie2(key) pctfree 0;
 
Index created.
 
SQL> select index_name, leaf_blocks, clustering_factor from dba_indexes where index_name = 'BOWIE2_BITMAP_I';
 
INDEX_NAME      LEAF_BLOCKS CLUSTERING_FACTOR
--------------- ----------- -----------------
BOWIE2_BITMAP_I        1145            250000
 
SQL> drop index bowie2_bitmap_i;
 
Index dropped.
 
SQL> create index bowie2_btree_i on bowie2(key) pctfree 0;
 
Index created.
 
SQL> select index_name, leaf_blocks, clustering_factor from dba_indexes where index_name = 'BOWIE2_BTREE_I';
 
INDEX_NAME      LEAF_BLOCKS CLUSTERING_FACTOR
--------------- ----------- -----------------
BOWIE2_BTREE_I         1995            999881

 

Even with a really poor clustered index column, the B-Tree index is still some 74% larger than the Bitmap index with some 250,000 distinct values.
 
Despite many claims to the contrary, including the rewrite of the awful Burleson article  that started this series on Bitmap Indexes, where it still claims that “there are rare cases where a bitmap on a column with 10,000 key values might be appropriate” and “in most cases, there will not be a lot of adjacent index values, so it quite rare to see extensive compression“, in reality it’s not that rare at all for a column with “large” numbers of distinct values to be indexed effectively via a Bitmap Index. But, you actually need to understand Bitmap Indexes to appreciated this fact and have at least some understanding on how Oracle stores and compresses the bitmap column. The Burleson description of Bitmap index compression in the above article is totally wrong and so hence are its overall conclusions. 

Here’s an actual, “real life” example. On a 2.2 million row table with a column on people last names, (name columns are often considered way too distinct for Bitmap indexes), there were approximately 6.5 occurrences of each last name on average over the whole table. The most compact B-Tree index on the Last Name column required 5286 leaf blocks but the equivalent Bitmap index only required 2151 leaf blocks, way less than 1/2 the size, even though the clustering factor was terrible at nearly 2 million.
 
As a rough rule, any column that has an on average of just 1 duplicate per distinct column value (just 2 occurrences per distinct value) is a potential candidate for a bitmap index.
 
Bitmap indexes should only be considered in Data Warehouse, low concurrency DML type environments due to their locking implications and certainly pre 10g, Bitmap indexes had growth issues after significant DML changes. However it’s a complete nonsense to suggest that Bitmap indexes should only be considered with columns with “few” distinct values, else things will run 100s of times slower.
 
500,000 distinct values in a 1 million row table is not really that “few” at all is it …

About these ads

Comments»

1. Martin Preiss - April 13, 2010

Richard,
so I guess it’s a result of this smaller size that the creation of a bitmap index is in almost every case faster than the creation of a corresponding b*tree index?

Regards
Martin Preiss

Richard Foote - April 14, 2010

Hi Martin

In large part yes. If you only have to do (say) 1/10 of the block changes, if overall sorts can be reduced (as duplicates can be ignored), then obviously that’s less work for Oracle.

2. Andy - April 13, 2010

This is a real eye opener! Never thought of investigating that …

Hats off to you!

Richard Foote - April 14, 2010

Hi Andy

Ta muchly, glad it was of interest.

3. Henish - April 14, 2010

Dear Sir,

Once again nice presntation about bitmap indexe :)

Though One question.

You specify

For a Bitmap index, an index entry comprises:

The index column(s)
2 x 6 byte rowid
2 bytes for flag and lock bytes
*1 byte for each index column (a minimum of 2 bytes)
? bytes for the actual bitmap sequence

And than below

For a corresponding Bitmap index:

4 bytes index column
12 bytes for the two rowids
*4 bytes for each index column length byte (remembering the rowids and bitmap sequence are effectively additional indexed columns)
2 bytes for flag and lock bytes
2 bytes is all it takes for the bitmap sequence column (if there’s only 2 actual true bits per index entry)

= 24 bytes in total.

Please see * lines in the above quote

Don’t you think there is a typo in the first quote i.e. it should be “(a minimum of 4 bytes) ”
instead of “(a minimum of 2 bytes) ”

Total Length byte needed = 1 byte per column + 2 byte for 2 rowid + 1 byte for bitmap seq.

OR may i missing something

Thanks

4. Richard Foote - April 14, 2010

Hi Henish

Oh, the dangers of cut ‘n’ paste !!

Yes, it’s of course a minimum of 4 column lengths for a bitmap index (1 for the index column, 2 for each of the rowids and 1 for the bitmap column).

Thank you and well spotted :)

5. Gary - April 20, 2010

So in a ‘bulk-load’ only environment (not worried about deletes/update or concurrency issues) what sort of situations would point towards a btree index ? Is it just a unique (or very nearly unique) index ?

Richard Foote - April 21, 2010

Gary

Correct.

If the index is going to be both smaller and easily used in a bitmap combine type operation, then you get ll the benefits of bitmap indexes in such environment.

Only if the index is going to be somewhat larger and more costly to access would a btree be considered and that’s only the case with unique to very near to unique type columns.

Oracle Corp must go mental when they read articles like the Burleson one because it limits so very much the true potential and usage of bitmap indexes.

This low cardinality myth is far less common nowadays but people out there still believe it …

Martin Preiss - April 21, 2010

Richard,
what about multi-column indexes? I think, Jonathan Lewis wrote in his cbo-book that a bitmap index on two columns could be smaller than the sum of the sizes of two single column bitmap indexes, but i guess this depends on the number of dependent values as this index would include every given combination of the two values – is this correct?

So I would assume, that for multi-column indexes a b*tree could be smaller if the columns are not correlated, while the bitmap index would tend to be smaller for correlated values – or is this nonsense?

Of course a multi-column bitmap index is quite unusual, but i quess it could be beneficial as a fat index to avoid table access.

Regards
MP

6. Bitmap indexes and cardinality « Venzi's Weblog - April 23, 2010

[...] So what is a good cardinality estimate for a bitmap index column? Leave a Comment [...]

7. Richard Foote - April 26, 2010

Hi Martin

With multi-column indexes, it’s a balance act. Yes, they can be smaller in total and more efficient to access. But you also potentially lose some flexibility in their usefulness.

With regard to bitmap indexes, it’s probably worth a post on it’s own to discuss.

8. Martin Preiss - April 26, 2010

“it’s probably worth a post on it’s own to discuss.”

I have to confess: that was the main intention of my question…

9. Rudolf van der Heide - September 28, 2010

Very important to realize is that the Oracle is very powerful to combine bitmap indexes, so you need not worry about composite indexes. That makes bitmaps outperform even if the bitmaps one by one were bigger than the b-trees. In data warehouses in big tables we always use bitmap indexes for all foreign key columns, regardless the cardinality. This is because the power of combination.

Richard Foote - October 7, 2010

Hi Rudolf

Indeed, combining bitmap indexes is the big advantage with them. That said, if 2 or more columns are always used in combination, then a composite bitmap index has some additional advantages.

10. katkota - December 10, 2010

dear sir
as i understand thats the combination of indexes take more advantage than use single column index???
and what the different between the composite index and combination index?? Is there the same meaning?

Richard Foote - December 21, 2010

Hi Katkota

Composite indexes have the advantage of being potentially more efficient as they reduce row throw-a-ways. Multiple indexes have more flexibilty (as a composite index requires the leading column to be known in most cases) but are not going to be as efficient as a corresponding composite index, even if combined.


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 1,715 other followers

%d bloggers like this: