jump to navigation

Indexing The Oracle Autonomous Data Warehouse (Autobahn) April 2, 2019

Posted by Richard Foote in Autonomous Data Warehouse, Autonomous Database, Oracle Indexes.
2 comments

autobahn

When the Autonomous Data Warehouse was initially released in March 2018, one of the notable restrictions was that it did not support typical Data Warehouse performance related structures such as Indexes, Partitioning or Materialized Views. The platform was deemed efficient enough to not require these features, which would unnecessarily complicate the autonomous environment.

In September 2018, these restrictions were finally lifted with the (for now) manual creation of Indexes, Partitioning and Materialized Views now all supported. Oracle however states that “Oracle recommends that you do not manually create these structures, and leave performance optimizations to the Autonomous Data Warehouse. If you’re a highly skilled Oracle Database tuning expert and decide to manually create these access structures, please be advised to test the impact of your manual tuning efforts on your full workload.

If you’re not a “highly skilled Oracle Database tuning expert”, let me give you some advice on when to create indexes in the Autonomous Data Warehouse (on Oracle Database 18c where these features are NOT automatically created and maintained by autonomous processes).

I’ll initially focus on two of the key capabilities available in the Autonomous Data Warehouse environment that might indeed make some index structures unnecessary, although neither is unique to autonomous databases.

Let’s begin by creating a relatively large table based on the huge LINEORDER table found in the provided SSB schema. The following BIG_ZIGGY table is based on 1 billion rows from the LINEORDER table:

SQL> create table big_ziggy select * from ssb.lineorder where rownum<=1000000000;

Frustratingly, if you have a session that performs a database operation for an extended period, the session looses its connection and hangs indefinitely. You need to start a separate session to confirm when the task actually completes and track progress by say looking a v$session_longops or querying dba_segments.

So we have a decently sized BIG_ZIGGY table at around 27GB:

SQL> select table_name, num_rows, blocks, round(blocks/128) MB
from user_tables where table_name='BIG_ZIGGY';

TABLE_NAME     NUM_ROWS     BLOCKS         MB
------------ ---------- ---------- ----------
BIG_ZIGGY    1000000000    3534714      27615

Here are a couple of key points.

Firstly, if you run a query on a table using filtering columns for the first time, then none of the smarts in the autonomous database environments are going to help you here. Things will initially run slowly, because Oracle would not have detected yet there is anything that needs to be tuned. The tuning process is “reactive”, in that Oracle needs to find an issue first before it can potentially address it.

The second key point is that if only a small fraction of the overall data is required or need be accessed, if the data being accessed is very highly filtered, then a database index is likely going to be highly effective, even within the Autonomous Data Warehouse. This is precisely why indexes are typically required in databases.

Although very often, large volumes of data are indeed accessed in typical Data Warehouse workloads, this is not always the case. It’s certainly not uncommon for ad-hoc queries and the such to only access a small fraction of an available data set.

The following query on the large BIG_ZIGGY table only returns 9 rows:

SQL> select * from big_ziggy
where lo_orderkey = 2294059393;

9 rows selected.

Elapsed: 00:06:23.99

Execution Plan
----------------------------------------------------------
Plan hash value: 2097953955

---------------------------------------------------------------------------------------------------------
| Id | Operation                   | Name                       | Rows | Bytes | Cost (%CPU)| Time      |
---------------------------------------------------------------------------------------------------------
|  0 | SELECT STATEMENT            |                            |    5 |   505 |  67250 (12)|  00:00:03 |
|  1 |  RESULT CACHE               | 2u5qwsq68ghc5bqn5spqhyqjyj |      |       |            |           |
|* 2 |   TABLE ACCESS STORAGE FULL | BIG_ZIGGY                  |    5 |   505 |  67250 (12)|  00:00:03 |
---------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------

2 - storage("LO_ORDERKEY"=2294059393)
     filter("LO_ORDERKEY"=2294059393)

Result Cache Information (identified by operation id):
------------------------------------------------------

1 - column-count=17; dependencies=(BOWIE.BIG_ZIGGY); name="select * from big_
ziggy where lo_orderkey = 2294059393"

Note
-----

- automatic DOP: Computed Degree of Parallelism is 1

Statistics
----------------------------------------------------------
      1 recursive calls
      0 db block gets
3530785 consistent gets
3530770 physical reads
      0 redo size
   2116 bytes sent via SQL*Net to client
    552 bytes received via SQL*Net from client
      2 SQL*Net roundtrips to/from client
      0 sorts (memory)
      0 sorts (disk)
      9 rows processed

The query takes nearly 6.5 minutes to complete which is clearly a long time to return a few rows.

Note that the Autonomous Data Warehouse platform is Exadata and Exadata has a number of key capabilities, such a Hybrid Columnar Compression (HCC), Smarts Scans, Storage Indexes and the such that can be extremely useful in Data Warehouse environments.

For example, the table is smaller and can be accessed more efficiently thanks to HCC:

SQL> select table_name, compression, compress_for
from user_tables where table_name='BIG_ZIGGY';

TABLE_NAME   COMPRESS COMPRESS_FOR
------------ -------- ------------------------------
BIG_ZIGGY     ENABLED QUERY HIGH ROW LEVEL LOCKING

Exadata Smart Scans and associated Storage Indexes can result in extremely efficient Full Table Scans by potentially skipping large sections of the physical data from having to be accessed and by returning just the required data set back to the database. I’ve blogged previously a number of times on the power of Storage Indexes and Exadata related smarts.

However, if a query is executed using specific filtering columns for the first time, then Storage Indexes will not have yet been created. If we look at the effectiveness of Storage Indexes when running the previous query:

SQL> select name , value/1024/1024 MB from v$statname n, v$mystat s
where n.statistic# = s.statistic# and n.name in ('cell physical IO interconnect bytes returned by smart scan', 'cell physical IO bytes saved by storage index');

NAME                                                                     MB
---------------------------------------------------------------- ----------
cell physical IO bytes saved by storage index                             0
cell physical IO interconnect bytes returned by smart scan       7.79956818

Note that initially, no bytes are saved by storage indexes as they don’t initially exist. Although only a relatively small amount of data is actually returned to the database server, having to read the entire table is expensive and why the query is taking so long to complete.

If we run a number of different queries with a filter predicate on the same LO_CUSTKEY column and eventually run the following query:

SQL> select lo_orderkey, lo_linenumber, lo_custkey, lo_partkey, lo_quantity
from big_ziggy where lo_orderkey = 2294059491;

6 rows selected.

Elapsed: 00:01:41.95

Execution Plan
----------------------------------------------------------
Plan hash value: 2097953955

---------------------------------------------------------------------------------------------------------
| Id | Operation                   | Name                       | Rows | Bytes | Cost (%CPU)| Time      |
---------------------------------------------------------------------------------------------------------
|  0 | SELECT STATEMENT            |                            |    5 |   125 |  67250 (12)|  00:00:03 |
|  1 |  RESULT CACHE               | 9ycjfft9fx7gy506tr9d98djdj |      |       |            |           |
|* 2 |   TABLE ACCESS STORAGE FULL | BIG_ZIGGY                  |    5 |   125 |  67250 (12)|  00:00:03 |
---------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------

2 - storage("LO_ORDERKEY"=2294059491)
     filter("LO_ORDERKEY"=2294059491)

Result Cache Information (identified by operation id):
------------------------------------------------------

1 - column-count=5; dependencies=(BOWIE.BIG_ZIGGY); name="select lo_orderkey,
lo_linenumber, lo_custkey, lo_partkey, lo_quantity from big_ziggy where lo_orde
rkey = 2294059491"

Note
-----

- automatic DOP: Computed Degree of Parallelism is 1

Statistics
----------------------------------------------------------
      1 recursive calls
      0 db block gets
3530785 consistent gets
3530770 physical reads
      0 redo size
    760 bytes sent via SQL*Net to client
    552 bytes received via SQL*Net from client
      2 SQL*Net roundtrips to/from client
      0 sorts (memory)
      0 sorts (disk)
      6 rows processed

We note that the overall elapsed times have now decreased, taking only 1 minute, 45 seconds to complete.

If we now look at the effectiveness of storage indexes:

SQL> select name , value/1024/1024 MB from v$statname n, v$mystat s
where n.statistic# = s.statistic# and n.name in ('cell physical IO interconnect bytes returned by smart scan', 'cell physical IO bytes saved by storage index');

NAME                                                                     MB
---------------------------------------------------------------- ----------
cell physical IO bytes saved by storage index                     731.78125
cell physical IO interconnect bytes returned by smart scan       7.59265137

The storage index is now providing some benefit but as I’ve discussed previously, there are limitations to their effectiveness depending on data distribution and the physical clustering of the data.  So your mileage will vary on the effectiveness of smart scans and no, Oracle does not (currently) automatically re-cluster data to improve smart scan performance when most effective.

If you re-run the same query multiple times, another performance feature kicks in:

SQL> select lo_orderkey, lo_linenumber, lo_custkey, lo_partkey, lo_quantity from big_ziggy where lo_orderkey = 2294059491;

6 rows selected.

Elapsed: 00:00:01.74

Execution Plan
----------------------------------------------------------
Plan hash value: 2097953955

---------------------------------------------------------------------------------------------------------
|  Id | Operation                   | Name                       | Rows | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |                            |    5 |   125 |  67250 (12)| 00:00:03 |
|   1 |  RESULT CACHE               | 9ycjfft9fx7gy506tr9d98djdj |      |       |            |          |
| * 2 |   TABLE ACCESS STORAGE FULL | BIG_ZIGGY                  |    5 |   125 |  67250 (12)| 00:00:03 |
---------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------

2 - storage("LO_ORDERKEY"=2294059491)
     filter("LO_ORDERKEY"=2294059491)

Result Cache Information (identified by operation id):
------------------------------------------------------

1 - column-count=5; dependencies=(BOWIE.BIG_ZIGGY); name="select lo_orderkey,
lo_linenumber, lo_custkey, lo_partkey, lo_quantity from big_ziggy where lo_orderkey = 2294059491"

Note
-----

- automatic DOP: Computed Degree of Parallelism is 1

Statistics
----------------------------------------------------------
  0 recursive calls
  0 db block gets
  0 consistent gets
  0 physical reads
  0 redo size
760 bytes sent via SQL*Net to client
552 bytes received via SQL*Net from client
  2 SQL*Net roundtrips to/from client
  0 sorts (memory)
  0 sorts (disk)
  6 rows processed

Response times are now under 2 seconds (Australia is a long way from Ashburn, USA), with Oracle automatically using the Result Cache to directly access the necessary result and hence not requiring any table related consistent reads. This is as efficient as it gets, but it’s dependant on similar result sets being returned and no data changes that would invalidate the results cache such as a new data load into the data warehouse:

SQL> insert into big_ziggy select * from big_ziggy where rownum  commit;

Commit complete.

SQL> select lo_orderkey, lo_linenumber, lo_custkey, lo_partkey, lo_quantity from big_ziggy where lo_orderkey = 2294059491;

6 rows selected.

Elapsed: 00:01:44.79

Execution Plan
----------------------------------------------------------
Plan hash value: 2097953955

---------------------------------------------------------------------------------------------------------
|  Id | Operation                   | Name                       | Rows | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |                            |    5 |   125 |  67250 (12)| 00:00:03 |
|   1 |  RESULT CACHE               | 9ycjfft9fx7gy506tr9d98djdj |      |       |            |          |
| * 2 |   TABLE ACCESS STORAGE FULL | BIG_ZIGGY                  |    5 |   125 |  67250 (12)| 00:00:03 |
---------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - storage("LO_ORDERKEY"=2294059491)
     filter("LO_ORDERKEY"=2294059491)

Result Cache Information (identified by operation id):
------------------------------------------------------

1 - column-count=5; dependencies=(BOWIE.BIG_ZIGGY); name="select lo_orderkey,
lo_linenumber, lo_custkey, lo_partkey, lo_quantity from big_ziggy where lo_orde
rkey = 2294059491"

Note
-----

- automatic DOP: Computed Degree of Parallelism is 1

Statistics
----------------------------------------------------------
      1 recursive calls
      0 db block gets
3530808 consistent gets
3530786 physical reads
      0 redo size
    760 bytes sent via SQL*Net to client
    552 bytes received via SQL*Net from client
      2 SQL*Net roundtrips to/from client
      0 sorts (memory)
      0 sorts (disk)
      6 rows processed

After new data gets loaded into the table, we’re back to being reliant on the effectiveness of the Exadata smart scan.

So we have a scenario here where we either have excellent performance via the result cache or just average performance is we want data for specific data based on a moderately effective storage index and resultant smart scan (in a different session, the following query is run):

SQL> select lo_orderkey, lo_linenumber, lo_custkey, lo_partkey, lo_quantity from big_ziggy where lo_orderkey = 2294059492;

Elapsed: 00:02:03.99

Execution Plan
----------------------------------------------------------
Plan hash value: 2097953955

---------------------------------------------------------------------------------------------------------
|  Id | Operation                   | Name                       | Rows | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |                            |    5 |   125 |  67250 (12)| 00:00:03 |
|   1 |  RESULT CACHE               | 4s0z6mkqpufc33mxb9ddvuc69r |      |       |            |          |
| * 2 |   TABLE ACCESS STORAGE FULL | BIG_ZIGGY                  |    5 |   125 |  67250 (12)| 00:00:03 |
---------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------

2 - storage("LO_ORDERKEY"=2294059492)
     filter("LO_ORDERKEY"=2294059492)

Result Cache Information (identified by operation id):
------------------------------------------------------

1 - column-count=5; dependencies=(BOWIE.BIG_ZIGGY); name="select lo_orderkey,
lo_linenumber, lo_custkey, lo_partkey, lo_quantity from big_ziggy where lo_orde
rkey = 2294059492"
Note
-----
- automatic DOP: Computed Degree of Parallelism is 1
Statistics
----------------------------------------------------------
      1 recursive calls
      0 db block gets
3530785 consistent gets
3530770 physical reads
      0 redo size
    695 bytes sent via SQL*Net to client
    552 bytes received via SQL*Net from client
      2 SQL*Net roundtrips to/from client
      0 sorts (memory)
      0 sorts (disk)
      2 rows processed

SQL> select name , value/1024/1024 MB from v$statname n, v$mystat s where n.statistic# = s.statistic# and n.name in ('cell physical IO interconnect bytes returned by smart scan', 'cell physical IO bytes saved by storage index');

NAME                                                                     MB
---------------------------------------------------------------- ----------
cell physical IO bytes saved by storage index                     731.78125
cell physical IO interconnect bytes returned by smart scan       7.59265137

To get both consistent and excellent performance when running a query that performs a high level of filtering, we need to create an appropriate index as we would in any other regular data warehouse database environment:

SQL> create bitmap index big_ziggy_lo_orderkey_i on big_ziggy(lo_orderkey);
SQL> select index_name, leaf_blocks, blevel, distinct_keys, status from user_indexes where table_name='BIG_ZIGGY';

INDEX_NAME                LEAF_BLOCKS     BLEVEL DISTINCT_KEYS STATUS
------------------------- ----------- ---------- ------------- --------
BIG_ZIGGY_LO_ORDERKEY_I        843144          3     212552698 VALID

If we now run random queries based on the highly selective LO_ORDERKEY column:

SQL> select lo_orderkey, lo_linenumber, lo_custkey, lo_partkey, lo_quantity from big_ziggy where lo_orderkey = 2294059489;

Elapsed: 00:00:01.67

Execution Plan
----------------------------------------------------------
Plan hash value: 1677294303

------------------------------------------------------------------------------------------------------------
| Id | Operation                      | Name                       | Rows | Bytes | Cost (%CPU)| Time      |
------------------------------------------------------------------------------------------------------------
|  0 | SELECT STATEMENT               |                            |    5 |   125 |       5 (0)|  00:00:01 |
|  1 |  RESULT CACHE                  | 7tn121mujfw3kfg39z5pg4duwy |      |       |            |           |
|  2 |   TABLE ACCESS BY INDEX ROWID  | BIG_ZIGGY                  |    5 |   125 |       5 (0)|  00:00:01 |
|  3 |    BITMAP CONVERSION TO ROWIDS |                            |      |       |            |           |
|* 4 |     BITMAP INDEX SINGLE VALUE  | BIG_ZIGGY_LO_ORDERKEY_I    |      |       |            |           |
------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------

4 - access("LO_ORDERKEY"=2294059489)

Result Cache Information (identified by operation id):
------------------------------------------------------

1 - column-count=5; dependencies=(BOWIE.BIG_ZIGGY); attributes=(ordered); nam
e="select lo_orderkey, lo_linenumber, lo_custkey, lo_partkey, lo_quantity from b
ig_ziggy where lo_orderkey = 2294059489"
Note
-----
- automatic DOP: Computed Degree of Parallelism is 1 because of no expensive
parallel operation

Statistics
----------------------------------------------------------
  1 recursive calls
  0 db block gets
  8 consistent gets
  0 physical reads
  0 redo size
647 bytes sent via SQL*Net to client
553 bytes received via SQL*Net from client
  2 SQL*Net roundtrips to/from client
  0 sorts (memory)
  0 sorts (disk)
  1 rows processed

We now get the required, consistent, high performance that we require, without having to resort to consuming additional CPU resources by increasing say the number of available CPUs and degree of parallelism to reduce response times.

In short, you index the Autonomous Data Warehouse as you would any database running on an Exadata platform…

Remember, this is all effectively version one of the Autonomous Data Warehouse. The capability of automatically creating necessary indexes is coming very soon…