Oracle 19c Automatic Indexing: Data Skew Part I (A Saucerful of Secrets) September 10, 2020
Posted by Richard Foote in 19c, 19c New Features, Automatic Indexing, Autonomous Data Warehouse, Autonomous Database, Autonomous Transaction Processing, Data Skew, Full Table Scans, Histograms, Index Access Path, Index statistics, Low Cardinality, Oracle Blog, Oracle Indexes, Oracle19c, Performance Tuning.1 comment so far
When it comes to Automatic Indexes, things can become particularly interesting when dealing with data skew (meaning that some columns values are much less common than other column values). The next series of blog posts will look at a number of different scenarios in relation to how Automatic Indexing works with data that is skewed and not uniformly distributed.
I’ll start with a simple little example, that has an interesting little twist at the end.
The following table has a CODE column, which has 10 distinct values that a widely skewed, with some values much less common than others:
SQL> create table bowie_skew (id number, code number, name varchar2(42)); Table created. SQL> insert into bowie_skew select rownum, 10, 'DAVID BOWIE' from dual connect by level <=1000000; 1000000 rows created. SQL> update bowie_skew set code = 9 where mod(id,3) = 0; 333333 rows updated. SQL> update bowie_skew set code = 1 where mod(id,2) = 0 and id between 1 and 20000; 10000 rows updated. SQL> update bowie_skew set code = 2 where mod(id,2) = 0 and id between 30001 and 40000; 5000 rows updated. SQL> update bowie_skew set code = 3 where mod(id,100) = 0 and id between 300001 and 400000; 1000 rows updated. SQL> update bowie_skew set code = 4 where mod(id,100) = 0 and id between 400001 and 500000; 1000 rows updated. SQL> update bowie_skew set code = 5 where mod(id,100) = 0 and id between 600001 and 700000; 1000 rows updated. SQL> update bowie_skew set code = 6 where mod(id,1000) = 0 and id between 700001 and 800000; 100 rows updated. SQL> update bowie_skew set code = 7 where mod(id,1000) = 0 and id between 800001 and 900000; 100 rows updated. SQL> update bowie_skew set code = 8 where mod(id,1000) = 0 and id between 900001 and 1000000; 100 rows updated. SQL> commit; Commit complete.
I’ll collect statistics on this table, but explicitly NOT collect histograms, so that the CBO will have no idea that the data is actually skewed. Note if I collected data with the default size, there would still be no histograms, as the column has yet to be used within an SQL predicate and so has no column usage recorded.
SQL> exec dbms_stats.gather_table_stats(ownname=>null, tabname=>'BOWIE_SKEW', estimate_percent=>100, method_opt=>'FOR ALL COLUMNS SIZE 1'); PL/SQL procedure successfully completed.
We can clearly see that some CODE values (such as “6”) have relatively few values, with only 100 occurrences:
SQL> select code, count(*) from bowie_skew group by code order by code; CODE COUNT(*) ---------- ---------- 1 10000 2 5000 3 1000 4 1000 5 1000 6 100 7 100 8 100 9 327235 10 654465
As I explicitly collected statistics with SIZE 1, we currently have NO histograms in the table:
SQL> select column_name, num_buckets, histogram from user_tab_cols where table_name='BOWIE_SKEW'; COLUMN_NAME NUM_BUCKETS HISTOGRAM --------------- ----------- --------------- ID 1 NONE CODE 1 NONE NAME 1 NONE
Let’s now run the following query with a predicate on CODE=6, returning just 100 rows:
SQL> select * from bowie_skew where code=6; 100 rows selected. Execution Plan ------------------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | ------------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 100K| 1953K| 570 (7)| 00:00:01 | | 1 | PX COORDINATOR | | | | | | | 2 | PX SEND QC (RANDOM) | :TQ10000 | 100K| 1953K| 570 (7)| 00:00:01 | | 3 | PX BLOCK ITERATOR | | 100K| 1953K| 570 (7)| 00:00:01 | |* 4 | TABLE ACCESS STORAGE FULL| BOWIE_SKEW | 100K| 1953K| 570 (7)| 00:00:01 | ------------------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 4 - storage("CODE"=6) filter("CODE"=6) Statistics ---------------------------------------------------------- 6 recursive calls 0 db block gets 3781 consistent gets 0 physical reads 0 redo size 2796 bytes sent via SQL*Net to client 654 bytes received via SQL*Net from client 8 SQL*Net roundtrips to/from client 0 sorts (memory) 0 sorts (disk) 100 rows processed
The CBO has no choice but to use a FTS as I currently have no indexes on the CODE column. Note also that the CBO has got its cardinality estimates way wrong, expecting 100,000 rows and not the actual 100 rows, as I have no histograms on the CODE column.
So let’s now wait 15 minutes or so and see what the Automatic Indexing process decides to do. Following are portions of the next Auto Indexing report:
INDEX DETAILS ------------------------------------------------------------------------------- The following indexes were created: -------------------------------------------------------------------------- | Owner | Table | Index | Key | Type | Properties | -------------------------------------------------------------------------- | BOWIE | BOWIE_SKEW | SYS_AI_7psvzc164vbng | CODE | B-TREE | NONE | -------------------------------------------------------------------------- VERIFICATION DETAILS ------------------------------------------------------------------------------- The performance of the following statements improved: ------------------------------------------------------------------------------- Parsing Schema Name : BOWIE SQL ID : fn4shnphu4bvj SQL Text : select * from bowie_skew where code=6 Improvement Factor : 41.1x Execution Statistics: ----------------------------- Original Plan Auto Index Plan ---------------------------- ---------------------------- Elapsed Time (s): 119596 322 CPU Time (s): 100781 322 Buffer Gets: 11347 103 Optimizer Cost: 570 4 Disk Reads: 0 0 Direct Writes: 0 0 Rows Processed: 100 100 Executions: 1 1
So we can see that yes, Auto Indexing has decided to create a new index here on the CODE column (“SYS_AI_7psvzc164vbng“) as it improves the performance of the query by a factor of 41.1x.
If we look further down the Auto Indexing report and compare the execution plans:
PLANS SECTION --------------------------------------------------------------------------------------------- - Original ----------------------------- Plan Hash Value : 3374004665 ----------------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost | Time | ----------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | | | 570 | | | 1 | PX COORDINATOR | | | | | | | 2 | PX SEND QC (RANDOM) | :TQ10000 | 100000 | 2000000 | 570 | 00:00:01 | | 3 | PX BLOCK ITERATOR | | 100000 | 2000000 | 570 | 00:00:01 | | 4 | TABLE ACCESS STORAGE FULL | BOWIE_SKEW | 100000 | 2000000 | 570 | 00:00:01 | ----------------------------------------------------------------------------------------- - With Auto Indexes ----------------------------- Plan Hash Value : 140816325 ------------------------------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost | Time | ------------------------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 100 | 2000 | 4 | 00:00:01 | | 1 | TABLE ACCESS BY INDEX ROWID BATCHED | BOWIE_SKEW | 100 | 2000 | 4 | 00:00:01 | | * 2 | INDEX RANGE SCAN | SYS_AI_7psvzc164vbng | 100 | | 3 | 00:00:01 | ------------------------------------------------------------------------------------------------------- Predicate Information (identified by operation id): ------------------------------------------ * 2 - access("CODE"=6) Notes ----- - Dynamic sampling used for this statement ( level = 11 )
We can see that new execution plan indeed uses the index BUT interestingly, it has a correct cardinality estimate of 100 and not 100,000 as per the original plan.
Now this can be explained in that the Automatic Indexing process uses a Dynamic Sampling level of 11, meaning it can calculate the correct cardinality on the fly and can cause difficulties between what the Automatic Indexing process thinks the CBO costs will be vs. the CBO costs in a default database session that uses the (usually default) Dynamic Sampling level of 2 (as I’ve discussed previously).
BUT when I now rerun the SQL query again:
SQL> select * from bowie_skew where code=6; 100 rows selected. Execution Plan --------------------------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| --------------------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 100 | 2000 | 4 (0)| | 1 | PX COORDINATOR | | | | | | 2 | PX SEND QC (RANDOM) | :TQ10001 | 100 | 2000 | 4 (0)| | 3 | TABLE ACCESS BY INDEX ROWID BATCHED| BOWIE_SKEW | 100 | 2000 | 4 (0)| | 4 | BUFFER SORT | | | | | | 5 | PX RECEIVE | | 100 | | 3 (0)| | 6 | PX SEND HASH (BLOCK ADDRESS) | :TQ10000 | 100 | | 3 (0)| | 7 | PX SELECTOR | | | | | |* 8 | INDEX RANGE SCAN | SYS_AI_7psvzc164vbng | 100 | | 3 (0)| --------------------------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 8 - access("CODE"=6) Statistics ---------------------------------------------------------- 12 recursive calls 0 db block gets 103 consistent gets 0 physical reads 0 redo size 2796 bytes sent via SQL*Net to client 654 bytes received via SQL*Net from client 8 SQL*Net roundtrips to/from client 2 sorts (memory) 0 sorts (disk) 100 rows processed
We notice the new Automatic Index is now used BUT also that the CBO has now determined the correct cardinality estimate of 100. But how is this possible when I haven’t recalculated the table statistics?
I’ll explain in my next post.