jump to navigation

Myth: Bitmap Indexes With High Distinct Columns (Supermassive Black Hole) March 3, 2010

Posted by Richard Foote in Bitmap Indexes, CBO, Clustering Factor, Oracle Indexes, Oracle Myths.
34 comments

As discussed in my previous post, it’s a silly myth to suggest a bitmap index should only be used for so-called “low cardinality” columns else the resultant index would be “huge”. I thought it might be worth a quick refresh to see a simple little demonstration why such claims are a nonsense.  There is in fact no limit to the number of distinct values by which a column could be considered a candidate for a bitmap index.  

I’ve already shown how a bitmap index on a column with 10,000 distinct values could potentially be smaller than an index with just 4 distinct values, even with the same data type and size. The number of distinct values in a column is but one small consideration, the number of rows in the table and the average ratio of repeated values are just as important. The other important consideration that can significant impact the size of a bitmap index is the distribution and clustering of the data within the table as we shall see.    

Using the same demo as the previous post, a reminder of the size of our bitmap index on a column with 10,000 distinct values in a 1 million row table.
 

SQL> select index_name, index_type, distinct_keys, blevel, leaf_blocks from dba_indexes where index_name = 'BIG_BOWIE_CODE_BITMAP_I';
 
INDEX_NAME                INDEX_TYPE DISTINCT_KEYS  BLEVEL LEAF_BLOCKS
------------------------- ---------- ------------- ------- -----------
BIG_BOWIE_CODE_BITMAP_I   BITMAP             10000       1          52
 

  

 
Let’s now see if the CBO will actually use this bitmap index on its own: 

  

SQL> select * from big_bowie where code = 42;

100 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 4280385727

--------------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name                    | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |                         |   100 |  7300 |    22   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID | BIG_BOWIE               |   100 |  7300 |    22   (0)| 00:00:01 |
|   2 |   BITMAP CONVERSION TO ROWIDS|                         |       |       |            |          |
|*  3 |    BITMAP INDEX SINGLE VALUE | BIG_BOWIE_CODE_BITMAP_I |       |       |            |          |
--------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   3 - access("CODE"=42)

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
          6  consistent gets
          0  physical reads
          0  redo size
       1854  bytes sent via SQL*Net to client
        396  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
        100  rows processed

 
 

Absolutely it does. As there are on average just 100 rows per distinct value, that's a small enough selectivity for the CBO to use the bitmap index on its own. Note the query has used just 6 consistent gets to return the 100 rows of data.
 
Let's now drop the bitmap index and see how a regular B-Tree index would compare and perform:
 

  
SQL> drop index big_bowie_code_bitmap_i;
 
Index dropped.
 
SQL> create index big_bowie_code_i on big_bowie(code);
 
Index created.
 
SQL> select index_name, index_type, distinct_keys, blevel, leaf_blocks, clustering_factor as cf from dba_indexes where index_name = 'BIG_BOWIE_CODE_I';
 
INDEX_NAME           INDEX_TYPE DISTINCT_KEYS  BLEVEL LEAF_BLOCKS         CF
-------------------- ---------- ------------- ------- ----------- ----------
BIG_BOWIE_CODE_I     NORMAL             10000       2        2090      10895

  

The first thing we notice is that the B-Tree index is significantly larger than the equivalent bitmap index. 1090 leafs blocks whereas the bitmap index was only 56 leaf blocks. So it's not the bitmap index that's so-called "huge" here on a column with 10,000 distinct values but the corresponding B-Tree index. Notice also that the Clustering Factor of the index is quite good at 10,895 in a 1 million row table.   

If we now run the same query as before: 

 
SQL> select * from big_bowie where code = 42;
 
100 rows selected.
 
Execution Plan
----------------------------------------------------------
Plan hash value: 1856845120
 
------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name             | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |                  |   100 |  7300 |     5 (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| BIG_BOWIE        |   100 |  7300 |     5 (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN          | BIG_BOWIE_CODE_I |   100 |       |     3 (0)| 00:00:01 |
------------------------------------------------------------------------------------------------
 
Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - access("CODE"=42)
 
Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
          8  consistent gets
          0  physical reads
          0  redo size
       1854  bytes sent via SQL*Net to client
        396  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
        100  rows processed
 
 

 

The query also uses the index but as the B-Tree index is somewhat larger, the overall number of consistent reads has increased from 6 up to 8. So not only is the bitmap index substantially smaller despite having 10,000 distinct values, but it's also more efficient to use as a result.A key reason why the bitmap index is so small and compact is because the effective Clustering Factor of the indexed column is excellent. The data was inserted into the table in CODE column order and so all the values with the same CODE value are grouped, ordered and clustered together within the table. Within the bitmap index, this means there are large and few grouping of zeros (0) that can be compressed extremely efficiently. 

For example, for the first CODE value of 1, the bitmap sequence would look something like: 

111111 .... 1110000000000000000000000000000000........000000 

with the 100 values of 1 (true) grouped together followed only by a series of effectively 999,900 zeros (o representing false). The 999,900 zeros can be compressed back almost nothing at all. Note there could actually be somewhat more false (0) values than actual rows in the table but that's a discussion for another day. 

The next value of 2 would have a bitmap sequence something like: 

00000000....0001111111...11111100000000000000000000...0000 

with 100 zeros followed by 100 1s followed by effectively 999,800 zeros, with again the two grouping of zeros compressed down to almost nothing at all. 

And so on ... 
Let's now create a different table that contains the identical data but  this time with the CODE values effectively randomized throughout the table. The Clustering Factor of the CODE column in this case will be terrible: 

 
 
SQL> create table big_bowie_2 as select * from big_bowie order by mod(id,100);
 
Table created.
 
SQL> create bitmap index big_bowie_2_code_bm_i on big_bowie_2(code);
 
Index created.
 
SQL> exec dbms_stats.gather_table_stats(ownname=>null, tabname=>'BIG_BOWIE_2', estimate_percent=>null, method_opt=> 'FOR ALL COLUMNS SIZE 1');
 
PL/SQL procedure successfully completed.
 
SQL> select index_name, index_type, distinct_keys, blevel, leaf_blocks from dba_indexes where index_name = 'BIG_BOWIE_2_CODE_BM_I';
 
INDEX_NAME             INDEX_TYPE DISTINCT_KEYS  BLEVEL LEAF_BLOCKS
---------------------- ---------- ------------- ------- -----------
BIG_BOWIE_2_CODE_BM_I  BITMAP             10000       1         520
 

 

The first thing we notice is that the bitmap index is now substantially larger than it was previously. It's gone from just 52 leaf blocks all the up to 520 blocks, a whole 10 times larger than before. The reason is all to do with the clustering of the data as now values for CODE are littered all over the table and are no longer grouped together. 

The bitmap sequence for the first CODE value of 1 would now look something like: 

00000000000100000000000000000..0000010000000000000000...0000010000000000001000000000...00000100000.... 

with the 1s now littered all over the place. This means it's far less efficient to compress the groups of zeros as there are now substantially more such groupings than before. The index is now 10 times larger as a result. 

If we now run the same query as before: 

  

SQL> select * from big_bowie_2 where code = 42;
 
100 rows selected.
 

Execution Plan
----------------------------------------------------------
Plan hash value: 1324437154
 
------------------------------------------------------------------------------------------------------
| Id  | Operation                    | Name                  | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |                       |   100 |  7300 |  22     (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID | BIG_BOWIE_2           |   100 |  7300 |  22     (0)| 00:00:01 |
|   2 |   BITMAP CONVERSION TO ROWIDS|                       |       |       |            |          |
|*  3 |    BITMAP INDEX SINGLE VALUE | BIG_BOWIE_2_CODE_BM_I |       |       |            |          |
------------------------------------------------------------------------------------------------------
 

Predicate Information (identified by operation id):
---------------------------------------------------
 
   3 - access("CODE"=42)
 

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
        103  consistent gets
          0  physical reads
          0  redo size
       1854  bytes sent via SQL*Net to client
        396  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
        100  rows processed
 

 

 Although the CBO is still using the bitmap index on its own, the performance of the query has deteriorated substantially with the number of consistent gets increasing from 6 all the way up to 103.  

So the bitmap index is now nowhere near as efficient. A bitmap index isn't that great with large numbers of distinct values if the associated clustering is poor and so the B-Tree index is the way to go after all, right ? 

Well just wait a minute. If the clustering is poor for a bitmap index, the clustering will likewise be poor for a corresponding B-Tree index as well.  Most of these consistent reads are due to reading the data out of the table, not from reading the index directly. So the performance of using an equivalent B-Tree index is also likely to be impacted as well. 

Let's compare the difference with a B-Tree index: 

   
SQL> drop index big_bowie_2_code_bm_i;
 
Index dropped.
 
SQL> create index big_bowie_2_code_i on big_bowie_2(code);
 
Index created.
 
SQL> select index_name, index_type, distinct_keys, blevel, leaf_blocks, clustering_factor as cf from dba_indexes where index_name = 'BIG_BOWIE_2_CODE_I';
 
INDEX_NAME             INDEX_TYPE DISTINCT_KEYS  BLEVEL LEAF_BLOCKS         CF
---------------------- ---------- ------------- ------- ----------- ----------
BIG_BOWIE_2_CODE_I     NORMAL             10000       2        2090     999922
 
 

 

The first thing to note here is that the B-Tree index is still 2090 leaf blocks in size. Even compared with the now far less efficient Bitmap index, at 520 leaf blocks it's still approximately just 25% the size of the B-Tree index. So the 10,000 distinct value bitmap index, even when it's as inefficient as possible, still can't be described as "huge"  here as it's still only a 1/4 the size of the corresponding B-Tree index. With a Clustering Factor of 999,992 on a 1 million rows table, it doesn't really get much worse than that and yet the Bitmap index on a 10,000 distinct column is still way ahead of the B-Tree index.  

Let's see how the query performs now: 

 
SQL> select * from big_bowie_2 where code = 42;
 
100 rows selected.
 

Execution Plan
----------------------------------------------------------
Plan hash value: 2550134594
 
--------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name               | Rows  | Bytes |  Cost(%CPU)| Time     |
--------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |                    |   100 |  7300 |   103   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| BIG_BOWIE_2        |   100 |  7300 |   103   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN          | BIG_BOWIE_2_CODE_I |   100 |       |     3   (0)| 00:00:01 |
--------------------------------------------------------------------------------------------------
 

Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - access("CODE"=42)
 

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
        105  consistent gets
          0  physical reads
          0  redo size
       1854  bytes sent via SQL*Net to client
        396  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
        100  rows processed

As we can see, the performance of the B-Tree index has likewise deteriorated with such a bad Clustering Factor. At 105 consistent gets, it's still worse than the corresponding Bitmap index which only needed 103 consistent gets. 

With a Bitmap index that is as inefficient as it can be, on a column that has 10,000 distinct values in a table of only 1 million rows, the Bitmap index still outperforms the corresponding B-Tree index. 

It's a complete myth and utter nonsense that a bitmap index is only suitable for low cardinality columns and would become "HUGE" and be 100s of times slower if created on so-called high cardinality column 

Let me repeat: There is no limit to the number of distinct values in a column for it to be considered for a Bitmap index.

Follow

Get every new post delivered to your Inbox.

Join 1,894 other followers