Oracle 19c Automatic Indexing: Data Skew Part III (The Good Son) September 16, 2020
Posted by Richard Foote in 19c, 19c New Features, Autonomous Data Warehouse, Autonomous Database, Autonomous Transaction Processing, CBO, Data Skew, Index Access Path, Oracle, Oracle Cost Based Optimizer, Oracle General, Oracle Indexes, Oracle Statistics, Oracle19c, Unusable Indexes.trackback
I’m going to expand just a tad on my previous posts on data skew and run a simple query that returns a few rows based on a column predicate AND another query on the same column that returns many rows.
The following table has a CODE column as with previous posts with the data heavily skewed:
SQL> create table bowie_skew (id number, code number, name varchar2(42)); Table created. SQL> insert into bowie_skew select rownum, 10, 'DAVID BOWIE' from dual connect by level <=1000000; 1000000 rows created. SQL> update bowie_skew set code = 9 where mod(id,3) = 0; 333333 rows updated. SQL> update bowie_skew set code = 1 where mod(id,2) = 0 and id between 1 and 20000; 10000 rows updated. SQL> update bowie_skew set code = 2 where mod(id,2) = 0 and id between 30001 and 40000; 5000 rows updated. SQL> update bowie_skew set code = 3 where mod(id,100) = 0 and id between 300001 and 400000; 1000 rows updated. SQL> update bowie_skew set code = 4 where mod(id,100) = 0 and id between 400001 and 500000; 1000 rows updated. SQL> update bowie_skew set code = 5 where mod(id,100) = 0 and id between 600001 and 700000; 1000 rows updated. SQL> update bowie_skew set code = 6 where mod(id,1000) = 0 and id between 700001 and 800000; 100 rows updated. SQL> update bowie_skew set code = 7 where mod(id,1000) = 0 and id between 800001 and 900000; 100 rows updated. SQL> update bowie_skew set code = 8 where mod(id,1000) = 0 and id between 900001 and 1000000; 100 rows updated. SQL> commit; Commit complete.
I’ll next collect statistics with NO histogram, as I don’t think they’re required at this point:
SQL> exec dbms_stats.gather_table_stats(ownname=>null, tabname=>'bowie_skew', estimate_percent=>100, method_opt=>'FOR ALL COLUMNS SIZE 1'); PL/SQL procedure successfully completed.
If we look at the table data:
SQL> select code, count(*) from bowie_skew group by code order by code; CODE COUNT(*) ---------- ---------- 1 10000 2 5000 3 1000 4 1000 5 1000 6 100 7 100 8 100 9 327235 10 654465
The value “7” only has 100 associated rows, while the value “10” is very common with 654,465 rows.
But I currently have no histograms:
SQL> select column_name, num_buckets, histogram from user_tab_cols where table_name='BOWIE_SKEW'; COLUMN_NAME NUM_BUCKETS HISTOGRAM --------------- ----------- --------------- ID 1 NONE CODE 1 NONE NAME 1 NONE
If I run the following query with a CODE=7 predicate just once:
SQL> select * from bowie_skew where code=7; 100 rows selected. Execution Plan -------------------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | -------------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 100K| 1953K| 570 (7)| 00:00:01 | | 1 | PX COORDINATOR | | | | | | | 2 | PX SEND QC (RANDOM) | :TQ10000 | 100K| 1953K| 570 (7)| 00:00:01 | | 3 | PX BLOCK ITERATOR | | 100K| 1953K| 570 (7)| 00:00:01 | |* 4 | TABLE ACCESS STORAGE FULL| bowie_skew | 100K| 1953K| 570 (7)| 00:00:01 | --------------------------------------------------------------------------------------------
It uses a Full Table Scan (the CBO has no choice without an index) AND hopelessly gets the cardinality estimate wrong, thinking 100K are going to be returned (and not the 100 actual rows). So the CBO is unlikely to use an index anyways as it would be deemed too expensive to return so many rows.
I’ll now run the following query many times on the CODE=10 predicate that returns many rows:
SQL> select * from bowie_skew where code=10; 654465 rows selected. Execution Plan -------------------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | -------------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 100K| 1953K| 570 (7)| 00:00:01 | | 1 | PX COORDINATOR | | | | | | | 2 | PX SEND QC (RANDOM) | :TQ10000 | 100K| 1953K| 570 (7)| 00:00:01 | | 3 | PX BLOCK ITERATOR | | 100K| 1953K| 570 (7)| 00:00:01 | |* 4 | TABLE ACCESS STORAGE FULL| bowie_skew | 100K| 1953K| 570 (7)| 00:00:01 | --------------------------------------------------------------------------------------------
So again, no choice here with a FTS and we likely wouldn’t want to use an index anyways as it would be just too expensive.
If we check out what the Automatic Indexing process has done with such a workload:
SQL> select dbms_auto_index.report_last_activity() report from dual; REPORT INDEX DETAILS ------------------------------------------------------------------------------- The following indexes were created: *: invisible ------------------------------------------------------------------------------- -------------------------------------------------------------------------- | Owner | Table | Index | Key | Type | Properties | -------------------------------------------------------------------------- | BOWIE | BOWIE_SKEW | SYS_AI_7psvzc164vbng | CODE | B-TREE | NONE | -------------------------------------------------------------------------- ------------------------------------------------------------------------------- VERIFICATION DETAILS ------------------------------------------------------------------------------- The performance of the following statements improved: ------------------------------------------------------------------------------- ------------------------------------------------------------------------------- Parsing Schema Name : BOWIE SQL ID : 6fm3m8cg2jnun SQL Text : select * from bowie_skew where code=7 Improvement Factor : 46.6x Execution Statistics: ----------------------------- Original Plan Auto Index Plan ---------------------------- ---------------------------- Elapsed Time (s): 36653 1992 CPU Time (s): 33899 967 Buffer Gets: 4291 103 Optimizer Cost: 52 4 Disk Reads: 0 2 Direct Writes: 0 0 Rows Processed: 100 100 Executions: 1 1
An Automatic Index on the CODE column is created (SYS_AI_7psvzc164vbng), with ONLY the SQL based on the CODE=7 predicate listed in the report. The other query is indeed too expensive for a new index to be viable and so isn’t listed.
If we look at the Plans Section of the Automatic Indexing report:
PLANS SECTION --------------------------------------------------------------------------------------------- - Original ----------------------------- Plan Hash Value : 410492785 -------------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost | Time | -------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | | | 52 | | | 1 | TABLE ACCESS STORAGE FULL | BOWIE_SKEW | 100000 | 2000000 | 52 | 00:00:01 | -------------------------------------------------------------------------------------- Notes ----- - dop_reason = no expensive parallel operation - dop = 1 - px_in_memory_imc = no - px_in_memory = no - With Auto Indexes ----------------------------- Plan Hash Value : 140816325 ------------------------------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost | Time | ------------------------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 119 | 2380 | 4 | 00:00:01 | | 1 | TABLE ACCESS BY INDEX ROWID BATCHED | BOWIE_SKEW | 119 | 2380 | 4 | 00:00:01 | | * 2 | INDEX RANGE SCAN | SYS_AI_7psvzc164vbng | 100 | | 3 | 00:00:01 | ------------------------------------------------------------------------------------------------------- Predicate Information (identified by operation id): ------------------------------------------ * 2 - access("CODE"=7) Notes ----- - Dynamic sampling used for this statement ( level = 11 )
The important point to note here is that the cardinality estimates are relatively accurate despite there being no histograms at this stage because the Automatic Indexing session uses Dynamic Sampling Level=11. Missing/inaccurate statistics are calculated on fly and this enables the session to accurately determine the size of the returned data set and that an index is indeed the more efficient access path.
So with mixed workloads, all it takes is one SQL executed once that demonstrably improves thanks to an index for the associated Automatic Index to be created as a VISIBLE/VALID index:
SQL> select index_name, auto, visibility, status, num_rows, leaf_blocks, clustering_factor from user_indexes where table_name='BOWIE_SKEW'; INDEX_NAME AUT VISIBILIT STATUS NUM_ROWS LEAF_BLOCKS CLUSTERING_FACTOR ------------------------------ --- --------- -------- ---------- ----------- ----------------- SYS_AI_7psvzc164vbng YES VISIBLE VALID 1000000 1537 8534
If we now run the query AFTER the histograms are subsequently created thanks to the High-Frequency Automatic Statistics Collection (see previous post), the new Automatic Index is now used:
SQL> select * from bowie_skew where code=7; 100 rows selected. Execution Plan ---------------------------------------------------------- Plan hash value: 140816325 ------------------------------------------------------------------------------------------------------------ | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | ------------------------------------------------------------------------------------------------------------ | 0 | SELECT STATEMENT | | 100 | 2000 | 4 (0)| 00:00:01 | | 1 | TABLE ACCESS BY INDEX ROWID BATCHED| BOWIE_SKEW | 100 | 2000 | 4 (0)| 00:00:01 | |* 2 | INDEX RANGE SCAN | SYS_AI_7psvzc164vbng | 100 | | 3 (0)| 00:00:01 | ------------------------------------------------------------------------------------------------------------ Predicate Information (identified by operation id): --------------------------------------------------- 2 - access("CODE"=7) Note ----- - automatic DOP: Computed Degree of Parallelism is 1 because of no expensive parallel operation Statistics ---------------------------------------------------------- 0 recursive calls 0 db block gets 104 consistent gets 0 physical reads 0 redo size 2871 bytes sent via SQL*Net to client 359 bytes received via SQL*Net from client 2 SQL*Net roundtrips to/from client 0 sorts (memory) 0 sorts (disk) 100 rows processed
Note if the histogram is NOT yet collected, the CBO will not determine the correct cardinality estimate and will ignore the new Automatic Index (as previously discussed).
If we run again the query that returns many rows:
SQL> select * from bowie_skew where code=10; 654465 rows selected. Execution Plan ---------------------------------------------------------- Plan hash value: 410492785 ---------------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | ---------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 654K| 12M| 52 (16)| 00:00:01 | |* 1 | TABLE ACCESS STORAGE FULL| BOWIE_SKEW | 654K| 12M| 52 (16)| 00:00:01 | ---------------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 1 - storage("CODE"=10) filter("CODE"=10) Note ----- - automatic DOP: Computed Degree of Parallelism is 1 because of no expensive parallel operation Statistics ---------------------------------------------------------- 0 recursive calls 0 db block gets 3725 consistent gets 0 physical reads 0 redo size 6549708 bytes sent via SQL*Net to client 1790 bytes received via SQL*Net from client 132 SQL*Net roundtrips to/from client 0 sorts (memory) 0 sorts (disk) 654465 rows processed
The new Automatic Index is correctly ignored by the CBO, as the query returns too many rows for the index to be viable.
So in this example, Automatic Indexing works exactly as it should. It creates a new Automatic Index for a query where it will indeed improve the performance, while other queries on the same column in which many more rows are returned are also run. For these other queries, the new Automatic Index is correctly not used as such an index would degrade the performance of the query.
In my next post, I’ll look at the first example with data skew where Automatic Indexing can be problematic…
Comments»
No comments yet — be the first.