Indexing The Oracle Autonomous Data Warehouse (Autobahn) April 2, 2019
Posted by Richard Foote in Autonomous Data Warehouse, Autonomous Database, Oracle Indexes.
When the Autonomous Data Warehouse was initially released in March 2018, one of the notable restrictions was that it did not support typical Data Warehouse performance-related structures such as Indexes, Partitioning or Materialized Views. The platform was deemed efficient enough to not require these features, which would unnecessarily complicate the autonomous environment.
In September 2018, these restrictions were finally lifted with the (for now) manual creation of Indexes, Partitioning and Materialized Views now all supported. Oracle however states that “Oracle recommends that you do not manually create these structures, and leave performance optimizations to the Autonomous Data Warehouse. If you’re a highly skilled Oracle Database tuning expert and decide to manually create these access structures, please be advised to test the impact of your manual tuning efforts on your full workload.”
If you’re not a “highly skilled Oracle Database tuning expert”, let me give you some advice on when to create indexes in the Autonomous Data Warehouse (on Oracle Database 18c where these features are NOT automatically created and maintained by autonomous processes).
I’ll initially focus on two of the key capabilities available in the Autonomous Data Warehouse environment that might indeed make some index structures unnecessary, although neither is unique to autonomous databases.
Let’s begin by creating a relatively large table based on the huge LINEORDER table found in the provided SSB schema. The following BIG_ZIGGY table is based on 1 billion rows from the LINEORDER table:
SQL> create table big_ziggy as select * from ssb.lineorder where rownum<=1000000000;
Frustratingly, if you have a session that performs a database operation for an extended period, the session loses its connection and hangs indefinitely. You need to start a separate session to confirm when the task actually completes and track progress by, say, looking at v$session_longops or querying dba_segments.
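For what it's worth, a minimal sketch of the sort of monitoring queries I mean (not taken from the original session, just standard dictionary views you can run from the second connection):

SQL> select sid, opname, target, sofar, totalwork,
            round(sofar/totalwork*100,1) pct_done
     from   v$session_longops
     where  totalwork > 0
     and    sofar <> totalwork;

SQL> select segment_name, round(bytes/1024/1024) mb
     from   dba_segments
     where  segment_name = 'BIG_ZIGGY';

The first shows how far any long-running operation has progressed, the second how large the new segment has grown so far.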
So we have a decently sized BIG_ZIGGY table at around 27GB:
SQL> select table_name, num_rows, blocks, round(blocks/128) MB
     from user_tables where table_name='BIG_ZIGGY';

TABLE_NAME     NUM_ROWS     BLOCKS         MB
------------ ---------- ---------- ----------
BIG_ZIGGY    1000000000    3534714      27615
Here are a couple of key points.
Firstly, if you run a query on a table using filtering columns for the first time, then none of the smarts in the autonomous database environment are going to help you here. Things will initially run slowly, because Oracle will not yet have detected that there is anything that needs tuning. The tuning process is “reactive”, in that Oracle needs to find an issue first before it can potentially address it.
The second key point is that if only a small fraction of the overall data needs to be accessed, that is, if the data being accessed is very highly filtered, then a database index is likely going to be highly effective, even within the Autonomous Data Warehouse. This is precisely why indexes are typically required in databases.
Although large volumes of data are very often accessed in typical Data Warehouse workloads, this is not always the case. It’s certainly not uncommon for ad-hoc queries and the like to access only a small fraction of an available data set.
The following query on the large BIG_ZIGGY table only returns 9 rows:
SQL> select * from big_ziggy where lo_orderkey = 2294059393;

9 rows selected.

Elapsed: 00:06:23.99

Execution Plan
----------------------------------------------------------
Plan hash value: 2097953955

----------------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name                       | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |                            |     5 |   505 | 67250  (12)| 00:00:03 |
|   1 |  RESULT CACHE               | 2u5qwsq68ghc5bqn5spqhyqjyj |       |       |            |          |
|*  2 |   TABLE ACCESS STORAGE FULL | BIG_ZIGGY                  |     5 |   505 | 67250  (12)| 00:00:03 |
----------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - storage("LO_ORDERKEY"=2294059393)
       filter("LO_ORDERKEY"=2294059393)

Result Cache Information (identified by operation id):
------------------------------------------------------

   1 - column-count=17; dependencies=(BOWIE.BIG_ZIGGY); name="select * from big_ziggy where lo_orderkey = 2294059393"

Note
-----
   - automatic DOP: Computed Degree of Parallelism is 1

Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
    3530785  consistent gets
    3530770  physical reads
          0  redo size
       2116  bytes sent via SQL*Net to client
        552  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          9  rows processed
The query takes nearly 6.5 minutes to complete which is clearly a long time to return a few rows.
Note that the Autonomous Data Warehouse platform is Exadata, and Exadata has a number of key capabilities, such as Hybrid Columnar Compression (HCC), Smart Scans, Storage Indexes and the like, that can be extremely useful in Data Warehouse environments.
For example, the table is smaller and can be accessed more efficiently thanks to HCC:
SQL> select table_name, compression, compress_for
     from user_tables where table_name='BIG_ZIGGY';

TABLE_NAME   COMPRESS COMPRESS_FOR
------------ -------- ------------------------------
BIG_ZIGGY    ENABLED  QUERY HIGH ROW LEVEL LOCKING
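In the Autonomous Data Warehouse these compression attributes were applied automatically at creation time. For reference only, this is roughly how you would request the same HCC level explicitly on a regular Exadata platform (a sketch with a hypothetical table name, not something from the autonomous environment itself, where you generally leave such settings alone):

SQL> create table big_ziggy_hcc
     column store compress for query high row level locking
     as select * from ssb.lineorder where rownum<=1000000000;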
Exadata Smart Scans and associated Storage Indexes can result in extremely efficient Full Table Scans, by potentially skipping large sections of the physical data and by returning just the required data set back to the database. I’ve blogged previously a number of times on the power of Storage Indexes and Exadata related smarts.
However, if a query is executed using specific filtering columns for the first time, then Storage Indexes will not have yet been created. If we look at the effectiveness of Storage Indexes when running the previous query:
SQL> select name, value/1024/1024 MB
     from v$statname n, v$mystat s
     where n.statistic# = s.statistic#
     and n.name in ('cell physical IO interconnect bytes returned by smart scan',
                    'cell physical IO bytes saved by storage index');

NAME                                                                     MB
---------------------------------------------------------------- ----------
cell physical IO bytes saved by storage index                             0
cell physical IO interconnect bytes returned by smart scan       7.79956818
Note that initially, no bytes are saved by storage indexes as they don’t yet exist. Although only a relatively small amount of data is actually returned to the database server, having to read the entire table is expensive, which is why the query takes so long to complete.
If we run a number of different queries with a filter predicate on the same LO_ORDERKEY column and eventually run the following query:
SQL> select lo_orderkey, lo_linenumber, lo_custkey, lo_partkey, lo_quantity
     from big_ziggy where lo_orderkey = 2294059491;

6 rows selected.

Elapsed: 00:01:41.95

Execution Plan
----------------------------------------------------------
Plan hash value: 2097953955

----------------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name                       | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |                            |     5 |   125 | 67250  (12)| 00:00:03 |
|   1 |  RESULT CACHE               | 9ycjfft9fx7gy506tr9d98djdj |       |       |            |          |
|*  2 |   TABLE ACCESS STORAGE FULL | BIG_ZIGGY                  |     5 |   125 | 67250  (12)| 00:00:03 |
----------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - storage("LO_ORDERKEY"=2294059491)
       filter("LO_ORDERKEY"=2294059491)

Result Cache Information (identified by operation id):
------------------------------------------------------

   1 - column-count=5; dependencies=(BOWIE.BIG_ZIGGY); name="select lo_orderkey, lo_linenumber, lo_custkey, lo_partkey, lo_quantity from big_ziggy where lo_orderkey = 2294059491"

Note
-----
   - automatic DOP: Computed Degree of Parallelism is 1

Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
    3530785  consistent gets
    3530770  physical reads
          0  redo size
        760  bytes sent via SQL*Net to client
        552  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          6  rows processed
We note that the overall elapsed times have now decreased, with the query taking around 1 minute, 42 seconds to complete.
If we now look at the effectiveness of storage indexes:
SQL> select name, value/1024/1024 MB
     from v$statname n, v$mystat s
     where n.statistic# = s.statistic#
     and n.name in ('cell physical IO interconnect bytes returned by smart scan',
                    'cell physical IO bytes saved by storage index');

NAME                                                                     MB
---------------------------------------------------------------- ----------
cell physical IO bytes saved by storage index                     731.78125
cell physical IO interconnect bytes returned by smart scan       7.59265137
The storage index is now providing some benefit, but as I’ve discussed previously, there are limitations to its effectiveness depending on data distribution and the physical clustering of the data. So your mileage will vary on the effectiveness of smart scans, and no, Oracle does not (currently) automatically re-cluster the data to make smart scans more effective.
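If you did want to improve the physical clustering on the filtering column yourself, one option on a platform where you control such things is simply to rebuild the table sorted on that column. A minimal sketch only, with a hypothetical table name, and certainly not something to do on a 1 billion row table without testing:

SQL> create table big_ziggy_sorted
     column store compress for query high row level locking
     as select * from big_ziggy order by lo_orderkey;

With the data physically ordered on LO_ORDERKEY, the min/max ranges tracked by the storage indexes become far more discriminating and far more of the table can potentially be skipped.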
If you re-run the same query multiple times, another performance feature kicks in:
SQL> select lo_orderkey, lo_linenumber, lo_custkey, lo_partkey, lo_quantity
     from big_ziggy where lo_orderkey = 2294059491;

6 rows selected.

Elapsed: 00:00:01.74

Execution Plan
----------------------------------------------------------
Plan hash value: 2097953955

----------------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name                       | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |                            |     5 |   125 | 67250  (12)| 00:00:03 |
|   1 |  RESULT CACHE               | 9ycjfft9fx7gy506tr9d98djdj |       |       |            |          |
|*  2 |   TABLE ACCESS STORAGE FULL | BIG_ZIGGY                  |     5 |   125 | 67250  (12)| 00:00:03 |
----------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - storage("LO_ORDERKEY"=2294059491)
       filter("LO_ORDERKEY"=2294059491)

Result Cache Information (identified by operation id):
------------------------------------------------------

   1 - column-count=5; dependencies=(BOWIE.BIG_ZIGGY); name="select lo_orderkey, lo_linenumber, lo_custkey, lo_partkey, lo_quantity from big_ziggy where lo_orderkey = 2294059491"

Note
-----
   - automatic DOP: Computed Degree of Parallelism is 1

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
          0  consistent gets
          0  physical reads
          0  redo size
        760  bytes sent via SQL*Net to client
        552  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          6  rows processed
Response times are now under 2 seconds (Australia is a long way from Ashburn, USA), with Oracle automatically using the Result Cache to directly access the necessary result and hence not requiring any table-related consistent reads. This is as efficient as it gets, but it’s dependent on similar result sets being returned and on no data changes that would invalidate the result cache, such as a new data load into the data warehouse:
SQL> insert into big_ziggy select * from big_ziggy where rownum

SQL> commit;

Commit complete.

SQL> select lo_orderkey, lo_linenumber, lo_custkey, lo_partkey, lo_quantity
     from big_ziggy where lo_orderkey = 2294059491;

6 rows selected.

Elapsed: 00:01:44.79

Execution Plan
----------------------------------------------------------
Plan hash value: 2097953955

----------------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name                       | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |                            |     5 |   125 | 67250  (12)| 00:00:03 |
|   1 |  RESULT CACHE               | 9ycjfft9fx7gy506tr9d98djdj |       |       |            |          |
|*  2 |   TABLE ACCESS STORAGE FULL | BIG_ZIGGY                  |     5 |   125 | 67250  (12)| 00:00:03 |
----------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - storage("LO_ORDERKEY"=2294059491)
       filter("LO_ORDERKEY"=2294059491)

Result Cache Information (identified by operation id):
------------------------------------------------------

   1 - column-count=5; dependencies=(BOWIE.BIG_ZIGGY); name="select lo_orderkey, lo_linenumber, lo_custkey, lo_partkey, lo_quantity from big_ziggy where lo_orderkey = 2294059491"

Note
-----
   - automatic DOP: Computed Degree of Parallelism is 1

Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
    3530808  consistent gets
    3530786  physical reads
          0  redo size
        760  bytes sent via SQL*Net to client
        552  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          6  rows processed
After new data gets loaded into the table, we’re back to being reliant on the effectiveness of the Exadata smart scan.
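If you want to confirm that it was indeed the data load that invalidated the previously cached result, you can look at the result cache metadata. A hedged sketch only, assuming you have access to the V$RESULT_CACHE_OBJECTS view in your service:

SQL> select type, status, name
     from   v$result_cache_objects
     where  name like '%big_ziggy%';

A previously published result that has been invalidated by DML on a dependent table shows a STATUS of Invalid, and the next execution has to repopulate the cache via another full scan.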
So we have a scenario here where we either get excellent performance via the result cache, or just average performance if we want specific data and have to rely on a moderately effective storage index and resultant smart scan (in a different session, the following query is run):
SQL> select lo_orderkey, lo_linenumber, lo_custkey, lo_partkey, lo_quantity
     from big_ziggy where lo_orderkey = 2294059492;

Elapsed: 00:02:03.99

Execution Plan
----------------------------------------------------------
Plan hash value: 2097953955

----------------------------------------------------------------------------------------------------------
| Id  | Operation                   | Name                       | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |                            |     5 |   125 | 67250  (12)| 00:00:03 |
|   1 |  RESULT CACHE               | 4s0z6mkqpufc33mxb9ddvuc69r |       |       |            |          |
|*  2 |   TABLE ACCESS STORAGE FULL | BIG_ZIGGY                  |     5 |   125 | 67250  (12)| 00:00:03 |
----------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - storage("LO_ORDERKEY"=2294059492)
       filter("LO_ORDERKEY"=2294059492)

Result Cache Information (identified by operation id):
------------------------------------------------------

   1 - column-count=5; dependencies=(BOWIE.BIG_ZIGGY); name="select lo_orderkey, lo_linenumber, lo_custkey, lo_partkey, lo_quantity from big_ziggy where lo_orderkey = 2294059492"

Note
-----
   - automatic DOP: Computed Degree of Parallelism is 1

Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
    3530785  consistent gets
    3530770  physical reads
          0  redo size
        695  bytes sent via SQL*Net to client
        552  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          2  rows processed

SQL> select name, value/1024/1024 MB
     from v$statname n, v$mystat s
     where n.statistic# = s.statistic#
     and n.name in ('cell physical IO interconnect bytes returned by smart scan',
                    'cell physical IO bytes saved by storage index');

NAME                                                                     MB
---------------------------------------------------------------- ----------
cell physical IO bytes saved by storage index                     731.78125
cell physical IO interconnect bytes returned by smart scan       7.59265137
To get both consistent and excellent performance when running a query that performs a high level of filtering, we need to create an appropriate index as we would in any other regular data warehouse database environment:
SQL> create bitmap index big_ziggy_lo_orderkey_i on big_ziggy(lo_orderkey);

SQL> select index_name, leaf_blocks, blevel, distinct_keys, status
     from user_indexes where table_name='BIG_ZIGGY';

INDEX_NAME                LEAF_BLOCKS     BLEVEL DISTINCT_KEYS STATUS
------------------------- ----------- ---------- ------------- --------
BIG_ZIGGY_LO_ORDERKEY_I        843144          3     212552698 VALID
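A bitmap index works nicely here because the table is effectively read-only between bulk loads; if the table were instead subject to concurrent single-row DML, a conventional B-tree index would generally be the safer choice to avoid bitmap index locking issues. A sketch only, not from the original demo (you would create one or the other, not both):

SQL> create index big_ziggy_lo_orderkey_i on big_ziggy(lo_orderkey);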
If we now run random queries based on the highly selective LO_ORDERKEY column:
SQL> select lo_orderkey, lo_linenumber, lo_custkey, lo_partkey, lo_quantity
     from big_ziggy where lo_orderkey = 2294059489;

Elapsed: 00:00:01.67

Execution Plan
----------------------------------------------------------
Plan hash value: 1677294303

--------------------------------------------------------------------------------------------------------------
| Id  | Operation                     | Name                       | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT              |                            |     5 |   125 |     5   (0)| 00:00:01 |
|   1 |  RESULT CACHE                 | 7tn121mujfw3kfg39z5pg4duwy |       |       |            |          |
|   2 |   TABLE ACCESS BY INDEX ROWID | BIG_ZIGGY                  |     5 |   125 |     5   (0)| 00:00:01 |
|   3 |    BITMAP CONVERSION TO ROWIDS|                            |       |       |            |          |
|*  4 |     BITMAP INDEX SINGLE VALUE | BIG_ZIGGY_LO_ORDERKEY_I    |       |       |            |          |
--------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   4 - access("LO_ORDERKEY"=2294059489)

Result Cache Information (identified by operation id):
------------------------------------------------------

   1 - column-count=5; dependencies=(BOWIE.BIG_ZIGGY); attributes=(ordered); name="select lo_orderkey, lo_linenumber, lo_custkey, lo_partkey, lo_quantity from big_ziggy where lo_orderkey = 2294059489"

Note
-----
   - automatic DOP: Computed Degree of Parallelism is 1 because of no expensive parallel operation

Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
          8  consistent gets
          0  physical reads
          0  redo size
        647  bytes sent via SQL*Net to client
        553  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          1  rows processed
We now get the consistent, high performance we require, without having to resort to consuming additional CPU resources by, say, increasing the number of available CPUs and the degree of parallelism to reduce response times.
In short, you index the Autonomous Data Warehouse as you would any database running on an Exadata platform…
Remember, this is all effectively version one of the Autonomous Data Warehouse. The capability of automatically creating necessary indexes is coming very soon…