
Bitmap Index On Column With 212552698 Distinct Values, What Gives? (I’d Rather Be High) April 23, 2019

Posted by Richard Foote in Advanced Index Compression, Autonomous Data Warehouse, Autonomous Database, Index Compression, Oracle Indexes.


In my previous post on Indexing The Oracle Autonomous Data Warehouse, I highlighted how it might be necessary to create indexes to improve the performance and scalability of highly selective queries, as it might be on any Data Warehouse running on an Exadata platform. In that post, I created a Bitmap Index and showed how improved SQL performance ensued.

I’ve had two subsequent correspondences asking why I decided to use a Bitmap Index in my example, when the column in question had a (relatively) massive number of distinct values, some 212552698. What gives, wouldn’t it have been far more efficient to have used a standard B-Tree index instead, as Bitmap Indexes are only suitable for columns that have low-medium numbers of distinct values? There’s nothing particularly low-medium cardinality about 212552698 distinct values.

It’s a good question, and as two people asked basically the same question, I thought it worthy of a separate post to address it.

Firstly, I must admit the only calculation I performed mentally when deciding which type of index to use was to look at the number of distinct values (212552698) and the number of rows in the table (1 billion) and determine that there are approximately 5 rows per distinct value. With 5 repeated values on average, a Bitmap Index was likely to be the more efficient type of index, and so I created a Bitmap Index in my demo, considering this was a Data Warehouse environment where potential locking issues relating to concurrent DML were not a concern.
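For the record, this sort of back-of-the-envelope calculation can be taken straight from the data dictionary once statistics are in place (a minimal sketch only, using the standard USER_TABLES and USER_TAB_COL_STATISTICS views):

SQL> select t.num_rows, c.num_distinct,
            round(t.num_rows/c.num_distinct, 1) avg_rows_per_value
     from user_tables t, user_tab_col_statistics c
     where t.table_name = c.table_name
     and t.table_name = 'BIG_ZIGGY'
     and c.column_name = 'LO_ORDERKEY';

With 1000000000 rows and 212552698 distinct LO_ORDERKEY values, this works out at roughly 4.7 rows per value.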

As I’ve discussed previously, such as in this classic post from way back in 2010, “So What Is A Good Cardinality Estimate For A Bitmap Index Column”, it’s not the raw number of distinct values that’s important; it’s the column cardinality relative to the number of rows in the table, and hence the average number of repeated occurrences per column value, that matters when deciding if a Bitmap Index might be appropriate in a Data Warehouse environment.

As the above post discusses, a Bitmap Index might only need the one index entry per column value, and as the resultant bitmap string can be very highly compressed if there are few actual occurrences per value, a column with as many as (say) 5 repeated values on average would likely still be more efficient than a corresponding B-Tree Index. A B-Tree Index with 5 repeated values has to store each value 5 times, along with the 5 corresponding rowids, whereas the Bitmap Index might only need to store the indexed value once, with its 2 corresponding rowids (marking the start and end of the rowid range covered) and a trivial amount of space for the highly compressed bitmap string.
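If you want to see this difference in the number of index entries for yourself, one option is to validate each index in turn and compare the internal statistics (a rough sketch only; note that ANALYZE ... VALIDATE STRUCTURE blocks DML against the table while it runs and is expensive on indexes of this size, so treat it as a test-environment check):

SQL> analyze index big_ziggy_lo_orderkey_i validate structure;

SQL> -- lf_rows is the number of index entries: roughly one per rowid
SQL> -- for a B-Tree Index vs. one per bitmap piece for a Bitmap Index
SQL> select name, lf_rows, lf_blks, distinct_keys, used_space
     from index_stats;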

If we create an equivalent B-Tree Index to the previously created Bitmap Index:

SQL> create index big_ziggy_lo_orderkey_i2 on big_ziggy(lo_orderkey) invisible;

SQL> select index_name, index_type, leaf_blocks, compression from user_indexes
     where table_name = 'BIG_ZIGGY';

INDEX_NAME                     INDEX_TYPE LEAF_BLOCKS COMPRESSION
------------------------------ ---------- ----------- -------------
BIG_ZIGGY_LO_ORDERKEY_I        BITMAP          843144 DISABLED
BIG_ZIGGY_LO_ORDERKEY_I2       NORMAL         2506552 DISABLED

We notice the Bitmap Index at 843144 leaf blocks is indeed substantially smaller than the equivalent B-Tree Index at 2506552 leaf blocks.

However, this is the Oracle Autonomous Data Warehouse environment, where I have access to the Advanced Compression option and the capability to potentially create highly compressed B-Tree index structures. See my various previous posts on Index Advanced Compression.

Perhaps I can create a structure smaller than the Bitmap Index by using Index Advanced Compression. Let’s try initially with Advanced Compression Low:

SQL> drop index big_ziggy_lo_orderkey_i2;

Index dropped.

SQL> create index big_ziggy_lo_orderkey_i3 on big_ziggy(lo_orderkey) compress advanced low invisible;
...

SQL> select index_name, index_type, leaf_blocks, compression from user_indexes
     where table_name = 'BIG_ZIGGY';

INDEX_NAME                     INDEX_TYPE LEAF_BLOCKS COMPRESSION
------------------------------ ---------- ----------- -------------
BIG_ZIGGY_LO_ORDERKEY_I        BITMAP          843144 DISABLED
BIG_ZIGGY_LO_ORDERKEY_I3       NORMAL         1921296 ADVANCED LOW

We notice the B-Tree Index is indeed now smaller at just 1921296 leaf blocks, down from 2506552 leaf blocks, but still not as small as the 843144 leaf blocks of the Bitmap Index.

However, this is a Data Warehouse environment where the DML overheads associated with the maintenance of Advanced Compression High indexes may not be of such concern. Additionally, I might have ample CPU resources to cater for any additional CPU requirements of accessing such Advanced Compression High indexes. Therefore, let’s create a B-Tree Index with Advanced Compression High:

SQL> drop index big_ziggy_lo_orderkey_i3;

Index dropped.

SQL> create index big_ziggy_lo_orderkey_i4 on big_ziggy(lo_orderkey) compress advanced high invisible;
...

SQL> select index_name, index_type, leaf_blocks, compression from user_indexes
     where table_name = 'BIG_ZIGGY';

INDEX_NAME                     INDEX_TYPE LEAF_BLOCKS COMPRESSION
------------------------------ ---------- ----------- -------------
BIG_ZIGGY_LO_ORDERKEY_I        BITMAP          843144 DISABLED
BIG_ZIGGY_LO_ORDERKEY_I4       NORMAL          786537 ADVANCED HIGH

We notice the index has further reduced significantly in size and, at just 786537 leaf blocks, is now indeed smaller than the corresponding Bitmap Index.

Note that although it might be beneficial to use Advanced Compression High on Bitmap Indexes, such an option is not currently supported.
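So an attempt such as the following (a hypothetical fifth index in this demo) simply errors out, as Advanced Index Compression is not supported for Bitmap Indexes:

SQL> -- not supported: a bitmap index can't be created with advanced compression
SQL> create bitmap index big_ziggy_lo_orderkey_i5 on big_ziggy(lo_orderkey)
     compress advanced high invisible;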

Of course, your mileage will vary on which index type and compression option is most appropriate depending on the characteristics of your data; however, these options are worth considering in environments where you have access to them.

But yes, a Bitmap Index might indeed be the better option for the column to be indexed, even if it has billions of distinct values (provided there are, say, tens of billions of rows)…

Indexing The Oracle Autonomous Data Warehouse (Autobahn) April 2, 2019

Posted by Richard Foote in Autonomous Data Warehouse, Autonomous Database, Oracle Indexes.


When the Autonomous Data Warehouse was initially released in March 2018, one of the notable restrictions was that it did not support typical Data Warehouse performance related structures such as Indexes, Partitioning or Materialized Views. The platform was deemed efficient enough to not require these features, which would unnecessarily complicate the autonomous environment.

In September 2018, these restrictions were finally lifted, with the (for now) manual creation of Indexes, Partitioning and Materialized Views now all supported. Oracle however states that “Oracle recommends that you do not manually create these structures, and leave performance optimizations to the Autonomous Data Warehouse. If you’re a highly skilled Oracle Database tuning expert and decide to manually create these access structures, please be advised to test the impact of your manual tuning efforts on your full workload.”

If you’re not a “highly skilled Oracle Database tuning expert”, let me give you some advice on when to create indexes in the Autonomous Data Warehouse (on Oracle Database 18c where these features are NOT automatically created and maintained by autonomous processes).

I’ll initially focus on two of the key capabilities available in the Autonomous Data Warehouse environment that might indeed make some index structures unnecessary, although neither is unique to autonomous databases.

Let’s begin by creating a relatively large table based on the huge LINEORDER table found in the provided SSB schema. The following BIG_ZIGGY table is based on 1 billion rows from the LINEORDER table:

SQL> create table big_ziggy as select * from ssb.lineorder where rownum<=1000000000;

Frustratingly, if you have a session that performs a database operation for an extended period, the session loses its connection and hangs indefinitely. You need to start a separate session to confirm when the task actually completes and to track progress by, say, looking at v$session_longops or querying dba_segments.
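To illustrate, the sort of progress checks I have in mind from that second session look something like the following (a minimal sketch only, using the standard v$session_longops and dba_segments views; the BIG_ZIGGY segment name is of course specific to this demo):

SQL> -- long-running operations still in flight, with a rough % complete
SQL> select opname, target, sofar, totalwork,
            round(sofar/totalwork*100, 1) pct_done, time_remaining
     from v$session_longops
     where totalwork > 0
     and sofar <> totalwork;

SQL> -- space allocated so far to the new table
SQL> select round(sum(bytes)/1024/1024) mb
     from dba_segments
     where segment_name = 'BIG_ZIGGY';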

So we have a decently sized BIG_ZIGGY table at around 27GB:

SQL> select table_name, num_rows, blocks, round(blocks/128) MB
from user_tables where table_name='BIG_ZIGGY';

TABLE_NAME     NUM_ROWS     BLOCKS         MB
------------ ---------- ---------- ----------
BIG_ZIGGY    1000000000    3534714      27615

Here are a couple of key points.

Firstly, if you run a query on a table using filtering columns for the first time, then none of the smarts in the autonomous database environment are going to help you here. Things will initially run slowly, because Oracle will not yet have detected that there is anything that needs to be tuned. The tuning process is “reactive”, in that Oracle needs to find an issue first before it can potentially address it.

The second key point is that if only a small fraction of the overall data is required or needs to be accessed, that is, if the data being accessed is very highly filtered, then a database index is likely to be highly effective, even within the Autonomous Data Warehouse. This is precisely why indexes are typically required in databases.

Although large volumes of data are very often accessed in typical Data Warehouse workloads, this is not always the case. It’s certainly not uncommon for ad-hoc queries and the like to access only a small fraction of an available data set.

The following query on the large BIG_ZIGGY table only returns 9 rows:

SQL> select * from big_ziggy
where lo_orderkey = 2294059393;

9 rows selected.

Elapsed: 00:06:23.99

Execution Plan
----------------------------------------------------------
Plan hash value: 2097953955

---------------------------------------------------------------------------------------------------------
| Id | Operation                   | Name                       | Rows | Bytes | Cost (%CPU)| Time      |
---------------------------------------------------------------------------------------------------------
|  0 | SELECT STATEMENT            |                            |    5 |   505 |  67250 (12)|  00:00:03 |
|  1 |  RESULT CACHE               | 2u5qwsq68ghc5bqn5spqhyqjyj |      |       |            |           |
|* 2 |   TABLE ACCESS STORAGE FULL | BIG_ZIGGY                  |    5 |   505 |  67250 (12)|  00:00:03 |
---------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------

2 - storage("LO_ORDERKEY"=2294059393)
     filter("LO_ORDERKEY"=2294059393)

Result Cache Information (identified by operation id):
------------------------------------------------------

1 - column-count=17; dependencies=(BOWIE.BIG_ZIGGY); name="select * from big_
ziggy where lo_orderkey = 2294059393"

Note
-----

- automatic DOP: Computed Degree of Parallelism is 1

Statistics
----------------------------------------------------------
      1 recursive calls
      0 db block gets
3530785 consistent gets
3530770 physical reads
      0 redo size
   2116 bytes sent via SQL*Net to client
    552 bytes received via SQL*Net from client
      2 SQL*Net roundtrips to/from client
      0 sorts (memory)
      0 sorts (disk)
      9 rows processed

The query takes nearly 6.5 minutes to complete, which is clearly a long time to return just a few rows.

Note that the Autonomous Data Warehouse platform is Exadata, and Exadata has a number of key capabilities, such as Hybrid Columnar Compression (HCC), Smart Scans and Storage Indexes, that can be extremely useful in Data Warehouse environments.

For example, the table is smaller and can be accessed more efficiently thanks to HCC:

SQL> select table_name, compression, compress_for
from user_tables where table_name='BIG_ZIGGY';

TABLE_NAME   COMPRESS COMPRESS_FOR
------------ -------- ------------------------------
BIG_ZIGGY     ENABLED QUERY HIGH ROW LEVEL LOCKING

Exadata Smart Scans and associated Storage Indexes can result in extremely efficient Full Table Scans, by potentially skipping large sections of the physical data that would otherwise have to be accessed and by returning just the required data set back to the database. I’ve blogged previously a number of times on the power of Storage Indexes and Exadata related smarts.

However, if a query is executed using specific filtering columns for the first time, then Storage Indexes will not yet have been created. If we look at the effectiveness of Storage Indexes when running the previous query:

SQL> select name , value/1024/1024 MB from v$statname n, v$mystat s
where n.statistic# = s.statistic# and n.name in ('cell physical IO interconnect bytes returned by smart scan', 'cell physical IO bytes saved by storage index');

NAME                                                                     MB
---------------------------------------------------------------- ----------
cell physical IO bytes saved by storage index                             0
cell physical IO interconnect bytes returned by smart scan       7.79956818

Note that initially, no bytes are saved by Storage Indexes, as they don’t yet exist. Although only a relatively small amount of data is actually returned to the database server, having to read the entire table is expensive, which is why the query takes so long to complete.
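(As a rough sanity check, those 3530770 physical reads at the 8K block size in use here equate to some 27GB, effectively the entire table.)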

If we run a number of different queries with a filter predicate on the same LO_ORDERKEY column and eventually run the following query:

SQL> select lo_orderkey, lo_linenumber, lo_custkey, lo_partkey, lo_quantity
from big_ziggy where lo_orderkey = 2294059491;

6 rows selected.

Elapsed: 00:01:41.95

Execution Plan
----------------------------------------------------------
Plan hash value: 2097953955

---------------------------------------------------------------------------------------------------------
| Id | Operation                   | Name                       | Rows | Bytes | Cost (%CPU)| Time      |
---------------------------------------------------------------------------------------------------------
|  0 | SELECT STATEMENT            |                            |    5 |   125 |  67250 (12)|  00:00:03 |
|  1 |  RESULT CACHE               | 9ycjfft9fx7gy506tr9d98djdj |      |       |            |           |
|* 2 |   TABLE ACCESS STORAGE FULL | BIG_ZIGGY                  |    5 |   125 |  67250 (12)|  00:00:03 |
---------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------

2 - storage("LO_ORDERKEY"=2294059491)
     filter("LO_ORDERKEY"=2294059491)

Result Cache Information (identified by operation id):
------------------------------------------------------

1 - column-count=5; dependencies=(BOWIE.BIG_ZIGGY); name="select lo_orderkey,
lo_linenumber, lo_custkey, lo_partkey, lo_quantity from big_ziggy where lo_orde
rkey = 2294059491"

Note
-----

- automatic DOP: Computed Degree of Parallelism is 1

Statistics
----------------------------------------------------------
      1 recursive calls
      0 db block gets
3530785 consistent gets
3530770 physical reads
      0 redo size
    760 bytes sent via SQL*Net to client
    552 bytes received via SQL*Net from client
      2 SQL*Net roundtrips to/from client
      0 sorts (memory)
      0 sorts (disk)
      6 rows processed

We note that the overall elapsed time has now decreased, with the query taking around 1 minute and 42 seconds to complete.

If we now look at the effectiveness of storage indexes:

SQL> select name , value/1024/1024 MB from v$statname n, v$mystat s
where n.statistic# = s.statistic# and n.name in ('cell physical IO interconnect bytes returned by smart scan', 'cell physical IO bytes saved by storage index');

NAME                                                                     MB
---------------------------------------------------------------- ----------
cell physical IO bytes saved by storage index                     731.78125
cell physical IO interconnect bytes returned by smart scan       7.59265137

The Storage Index is now providing some benefit, but as I’ve discussed previously, there are limitations to their effectiveness depending on the data distribution and the physical clustering of the data. So your mileage will vary on the effectiveness of Smart Scans, and no, Oracle does not (currently) automatically re-cluster the data to improve Smart Scan performance where that would be most effective.

If you re-run the same query multiple times, another performance feature kicks in:

SQL> select lo_orderkey, lo_linenumber, lo_custkey, lo_partkey, lo_quantity from big_ziggy where lo_orderkey = 2294059491;

6 rows selected.

Elapsed: 00:00:01.74

Execution Plan
----------------------------------------------------------
Plan hash value: 2097953955

---------------------------------------------------------------------------------------------------------
|  Id | Operation                   | Name                       | Rows | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |                            |    5 |   125 |  67250 (12)| 00:00:03 |
|   1 |  RESULT CACHE               | 9ycjfft9fx7gy506tr9d98djdj |      |       |            |          |
| * 2 |   TABLE ACCESS STORAGE FULL | BIG_ZIGGY                  |    5 |   125 |  67250 (12)| 00:00:03 |
---------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------

2 - storage("LO_ORDERKEY"=2294059491)
     filter("LO_ORDERKEY"=2294059491)

Result Cache Information (identified by operation id):
------------------------------------------------------

1 - column-count=5; dependencies=(BOWIE.BIG_ZIGGY); name="select lo_orderkey,
lo_linenumber, lo_custkey, lo_partkey, lo_quantity from big_ziggy where lo_orderkey = 2294059491"

Note
-----

- automatic DOP: Computed Degree of Parallelism is 1

Statistics
----------------------------------------------------------
  0 recursive calls
  0 db block gets
  0 consistent gets
  0 physical reads
  0 redo size
760 bytes sent via SQL*Net to client
552 bytes received via SQL*Net from client
  2 SQL*Net roundtrips to/from client
  0 sorts (memory)
  0 sorts (disk)
  6 rows processed

Response times are now under 2 seconds (Australia is a long way from Ashburn, USA), with Oracle automatically using the Result Cache to directly access the necessary result and hence not requiring any table-related consistent reads. This is as efficient as it gets, but it’s dependent on similar result sets being returned and on there being no data changes that would invalidate the result cache, such as a new data load into the data warehouse:

SQL> insert into big_ziggy select * from big_ziggy where rownum <= ...;

SQL> commit;

Commit complete.

SQL> select lo_orderkey, lo_linenumber, lo_custkey, lo_partkey, lo_quantity from big_ziggy where lo_orderkey = 2294059491;

6 rows selected.

Elapsed: 00:01:44.79

Execution Plan
----------------------------------------------------------
Plan hash value: 2097953955

---------------------------------------------------------------------------------------------------------
|  Id | Operation                   | Name                       | Rows | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |                            |    5 |   125 |  67250 (12)| 00:00:03 |
|   1 |  RESULT CACHE               | 9ycjfft9fx7gy506tr9d98djdj |      |       |            |          |
| * 2 |   TABLE ACCESS STORAGE FULL | BIG_ZIGGY                  |    5 |   125 |  67250 (12)| 00:00:03 |
---------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - storage("LO_ORDERKEY"=2294059491)
     filter("LO_ORDERKEY"=2294059491)

Result Cache Information (identified by operation id):
------------------------------------------------------

1 - column-count=5; dependencies=(BOWIE.BIG_ZIGGY); name="select lo_orderkey,
lo_linenumber, lo_custkey, lo_partkey, lo_quantity from big_ziggy where lo_orde
rkey = 2294059491"

Note
-----

- automatic DOP: Computed Degree of Parallelism is 1

Statistics
----------------------------------------------------------
      1 recursive calls
      0 db block gets
3530808 consistent gets
3530786 physical reads
      0 redo size
    760 bytes sent via SQL*Net to client
    552 bytes received via SQL*Net from client
      2 SQL*Net roundtrips to/from client
      0 sorts (memory)
      0 sorts (disk)
      6 rows processed

After new data gets loaded into the table, we’re back to being reliant on the effectiveness of the Exadata smart scan.
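If you want to confirm that it is indeed a Result Cache invalidation at play, the state of the cached results can be checked directly (a sketch only, using the standard v$result_cache_objects view; the LIKE predicate is simply one convenient way to narrow things down):

SQL> select status, creation_timestamp, name
     from v$result_cache_objects
     where type = 'Result'
     and name like '%big_ziggy%';

Previously published results dependent on BIG_ZIGGY should now be flagged as invalid rather than published.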

So we have a scenario here where we either get excellent performance via the Result Cache, or just average performance if we want data for specific values not already cached, relying on a moderately effective Storage Index and resultant Smart Scan (in a different session, the following query is run):

SQL> select lo_orderkey, lo_linenumber, lo_custkey, lo_partkey, lo_quantity from big_ziggy where lo_orderkey = 2294059492;

Elapsed: 00:02:03.99

Execution Plan
----------------------------------------------------------
Plan hash value: 2097953955

---------------------------------------------------------------------------------------------------------
|  Id | Operation                   | Name                       | Rows | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |                            |    5 |   125 |  67250 (12)| 00:00:03 |
|   1 |  RESULT CACHE               | 4s0z6mkqpufc33mxb9ddvuc69r |      |       |            |          |
| * 2 |   TABLE ACCESS STORAGE FULL | BIG_ZIGGY                  |    5 |   125 |  67250 (12)| 00:00:03 |
---------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------

2 - storage("LO_ORDERKEY"=2294059492)
     filter("LO_ORDERKEY"=2294059492)

Result Cache Information (identified by operation id):
------------------------------------------------------

1 - column-count=5; dependencies=(BOWIE.BIG_ZIGGY); name="select lo_orderkey,
lo_linenumber, lo_custkey, lo_partkey, lo_quantity from big_ziggy where lo_orde
rkey = 2294059492"
Note
-----
- automatic DOP: Computed Degree of Parallelism is 1
Statistics
----------------------------------------------------------
      1 recursive calls
      0 db block gets
3530785 consistent gets
3530770 physical reads
      0 redo size
    695 bytes sent via SQL*Net to client
    552 bytes received via SQL*Net from client
      2 SQL*Net roundtrips to/from client
      0 sorts (memory)
      0 sorts (disk)
      2 rows processed

SQL> select name , value/1024/1024 MB from v$statname n, v$mystat s where n.statistic# = s.statistic# and n.name in ('cell physical IO interconnect bytes returned by smart scan', 'cell physical IO bytes saved by storage index');

NAME                                                                     MB
---------------------------------------------------------------- ----------
cell physical IO bytes saved by storage index                     731.78125
cell physical IO interconnect bytes returned by smart scan       7.59265137

To get both consistent and excellent performance when running a query that performs a high level of filtering, we need to create an appropriate index as we would in any other regular data warehouse database environment:

SQL> create bitmap index big_ziggy_lo_orderkey_i on big_ziggy(lo_orderkey);
SQL> select index_name, leaf_blocks, blevel, distinct_keys, status from user_indexes where table_name='BIG_ZIGGY';

INDEX_NAME                LEAF_BLOCKS     BLEVEL DISTINCT_KEYS STATUS
------------------------- ----------- ---------- ------------- --------
BIG_ZIGGY_LO_ORDERKEY_I        843144          3     212552698 VALID

If we now run random queries based on the highly selective LO_ORDERKEY column:

SQL> select lo_orderkey, lo_linenumber, lo_custkey, lo_partkey, lo_quantity from big_ziggy where lo_orderkey = 2294059489;

Elapsed: 00:00:01.67

Execution Plan
----------------------------------------------------------
Plan hash value: 1677294303

------------------------------------------------------------------------------------------------------------
| Id | Operation                      | Name                       | Rows | Bytes | Cost (%CPU)| Time      |
------------------------------------------------------------------------------------------------------------
|  0 | SELECT STATEMENT               |                            |    5 |   125 |       5 (0)|  00:00:01 |
|  1 |  RESULT CACHE                  | 7tn121mujfw3kfg39z5pg4duwy |      |       |            |           |
|  2 |   TABLE ACCESS BY INDEX ROWID  | BIG_ZIGGY                  |    5 |   125 |       5 (0)|  00:00:01 |
|  3 |    BITMAP CONVERSION TO ROWIDS |                            |      |       |            |           |
|* 4 |     BITMAP INDEX SINGLE VALUE  | BIG_ZIGGY_LO_ORDERKEY_I    |      |       |            |           |
------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------

4 - access("LO_ORDERKEY"=2294059489)

Result Cache Information (identified by operation id):
------------------------------------------------------

1 - column-count=5; dependencies=(BOWIE.BIG_ZIGGY); attributes=(ordered); nam
e="select lo_orderkey, lo_linenumber, lo_custkey, lo_partkey, lo_quantity from b
ig_ziggy where lo_orderkey = 2294059489"
Note
-----
- automatic DOP: Computed Degree of Parallelism is 1 because of no expensive
parallel operation

Statistics
----------------------------------------------------------
  1 recursive calls
  0 db block gets
  8 consistent gets
  0 physical reads
  0 redo size
647 bytes sent via SQL*Net to client
553 bytes received via SQL*Net from client
  2 SQL*Net roundtrips to/from client
  0 sorts (memory)
  0 sorts (disk)
  1 rows processed

We now get the consistent, high performance that we require, without having to resort to consuming additional CPU resources by, say, increasing the number of available CPUs and the degree of parallelism to reduce response times.

In short, you index the Autonomous Data Warehouse as you would any database running on an Exadata platform…

Remember, this is all effectively version one of the Autonomous Data Warehouse. The capability of automatically creating necessary indexes is coming very soon…