Automatic Indexing: JSON Expressions Part II (“Without You I’m Nothing)” April 14, 2022
Posted by Richard Foote in 21c New Features, Automatic Indexing, Autonomous Database, Autonomous Transaction Processing, CBO, Column Statistics, High Frequency Statistics Collection.2 comments
In my previous post on Automatic Indexing (AI) and JSON expressions, I mentioned two additional issues with the execution plan generated after the automatic indexes were created:
SQL> select * from bowie_json where json_value(bowie_order, '$.PONumber')='42'; Execution Plan ---------------------------------------------------------- Plan hash value: 832017402 ------------------------------------------------------------------------------------------------------------ | Id | Operation | Name | Rows | Bytes | Cost (%CPU) | Time | ------------------------------------------------------------------------------------------------------------ | 0 | SELECT STATEMENT | | 20000 | 12M | 1524 (1) | 00:00:01 | | 1 | TABLE ACCESS BY INDEX ROWID BATCHED | BOWIE_JSON | 20000 | 12M | 1524 (1) | 00:00:01 | |* 2 | INDEX RANGE SCAN | SYS_AI_ayvj257jd93cv | 8000 | | 3 (0) | 00:00:01 | ------------------------------------------------------------------------------------------------------------ Predicate Information (identified by operation id): --------------------------------------------------- 2 - access(JSON_VALUE("BOWIE_ORDER" /*+ LOB_BY_VALUE */ FORMAT OSON , '$.PONumber' RETURNING VARCHAR2(4000) ERROR ON ERROR NULL ON EMPTY)='42') Statistics ---------------------------------------------------------- 0 recursive calls 0 db block gets 234168 consistent gets 200279 physical reads 0 redo size 1595 bytes sent via SQL*Net to client 526 bytes received via SQL*Net from client 3 SQL*Net roundtrips to/from client 0 sorts (memory) 0 sorts (disk) 1 rows processed
The first issue is in relation to the Cost and the estimated cardinality (Rows) of the generated plan.
The execution plan has a Cost of 1524 but most importantly, an estimate cardinality of 20,000 rows. As only 1 row is actually returned, this row estimate is way way off and so therefore are the associated costs. This could potentially result in an inefficient plan and with the index not being used by the CBO.
The 20,000 row estimate comes from it being 1% of the number of rows (2 million) in the table. As I’ve discussed previously (as in this rather humourous post), the issue here is that the CBO has no idea what the expected cardinality might be, as the output from the JSON expression is effectively a black box.
Oracle generates virtual columns for this purpose, to capture column statistics that gives the CBO an accurate idea on the selectivity of expression based predicates.
But, if we look at the column statistics after the generation of the automatic indexes:
SQL> select table_name, column_name, num_distinct, virtual_column from user_tab_cols where table_name='BOWIE_JSON'; TABLE_NAME COLUMN_NAME NUM_DISTINCT VIR ---------- -------------------------------------------------- ------------ --- BOWIE_JSON ID 1996800 NO BOWIE_JSON BOWIE_DATE 1 NO BOWIE_JSON BOWIE_ORDER 0 NO BOWIE_JSON SYS_IME_OSON_B33B6E113C404F2EBF6648C01F605C8D 1 YES BOWIE_JSON SYS_NC00005$ YES BOWIE_JSON SYS_NC00006$ YES
We notice that the two virtual columns generated for the JSON based expressions have no statistics. This is because we have yet to collect new statistics (or statistics specifically on hidden columns) since the creation of the automatic indexes. As a result, the CBO still has no idea on how many rows are estimated to come from the JSON based expressions and so has to resort to the (entirely wrong) 1% estimate.
The good news with Exadata environments (which of course includes the Autonomous Database environments), is that Oracle has the High Frequency Statistics Collection capability, which will automatically collect these missing statistics after a small (configurable) period of time. I’ve previously discussed High Frequency Statistics Collection here.
So if I just wait approximately 15 minutes in my “Exadata” environment and check out the columns statistics again:
SQL> select table_name, column_name, num_distinct, virtual_column from user_tab_cols where table_name='BOWIE_JSON'; TABLE_NAME COLUMN_NAME NUM_DISTINCT VIR ---------- -------------------------------------------------- ------------ --- BOWIE_JSON ID 1996800 NO BOWIE_JSON BOWIE_DATE 1 NO BOWIE_JSON BOWIE_ORDER 0 NO BOWIE_JSON SYS_IME_OSON_B33B6E113C404F2EBF6648C01F605C8D 1 YES BOWIE_JSON SYS_NC00005$ 2000000 YES BOWIE_JSON SYS_NC00006$ 1996800 YES
We can see that the missing statistics have now been populated and the CBO can now accurately determine that these virtual columns are effectively unique.
If we now re-run the queries again, e.g.:
SQL> select * from bowie_json z where z.bowie_order.PONumber.number()=4242; Execution Plan ---------------------------------------------------------- Plan hash value: 1921179906 ------------------------------------------------------------------------------------------------------------ | Id | Operation | Name | Rows | Bytes | Cost (%CPU) | Time | ------------------------------------------------------------------------------------------------------------ | 0 | SELECT STATEMENT | | 1 | 669 | 4 (0) | 00:00:01 | | 1 | TABLE ACCESS BY INDEX ROWID BATCHED | BOWIE_JSON | 1 | 669 | 4 (0) | 00:00:01 | |* 2 | INDEX RANGE SCAN | SYS_AI_gpdkwzugdn055 | 1 | | 3 (0) | 00:00:01 | ------------------------------------------------------------------------------------------------------------ Predicate Information (identified by operation id): --------------------------------------------------- 2 - access(JSON_VALUE("BOWIE_ORDER" /*+ LOB_BY_VALUE */ FORMAT OSON , '$.PONumber' RETURNING NUMBER ERROR ON ERROR NULL ON EMPTY)=4242) Statistics ---------------------------------------------------------- 0 recursive calls 0 db block gets 233150 consistent gets 200279 physical reads 0 redo size 1599 bytes sent via SQL*Net to client 526 bytes received via SQL*Net from client 3 SQL*Net roundtrips to/from client 0 sorts (memory) 0 sorts (disk) 1 rows processed
We can see that the CBO has now correctly estimated that just 1 row is to be returned and the associated CBO cost has reduced to just 4 (down from 1524) as a result.
So if you create a function-based index, make sure the generated virtual column (whether created automatically or if manually generated before the associated index) has the necessary statistics.
In the upcoming days, I’ll discuss the remaining issue associated with this plan…
Oracle 19c Automatic Indexing: Indexing With Stale Statistics Part I (Dead Against It) October 6, 2020
Posted by Richard Foote in 19c, 19c New Features, Automatic Indexing, Autonomous Data Warehouse, Autonomous Database, Autonomous Transaction Processing, CBO, Exadata, Exadata X8, Full Table Scans, High Frequency Statistics Collection, Index Access Path, Index statistics, Oracle, Oracle Cloud, Oracle Cost Based Optimizer, Oracle General, Oracle Indexes, Performance Tuning, Stale Statistics, Unusable Indexes.5 comments
A “golden rule” when working with Automatic Indexing is that things don’t work properly if there are stale statistics on the dependant objects. Stale statistics can of course be problematic but they can be particularly troublesome when dealing with Automatic Indexing.
In the Oracle Autonomous Database environments, this issue is addressed somewhat by the new High Frequency Statistics Collection capability, which helps to automatically collect stale statistics on a regular basis. However, in on-prem Exadata environments where this can more easily be turned off or collected less frequently, it’s a potential issue worth consideration.
I’ll start with a simple little table, with a CODE column that has lots of distinct values:
SQL> create table bowie_stale (id number constraint bowie_stale_pk primary key, code number, name varchar2(42)); Table created. SQL> insert into bowie_stale select rownum, mod(rownum, 1000000)+1, 'David Bowie' from dual connect by level <= 10000000; 10000000 rows created. SQL> commit; Commit complete.
Importantly, I don’t collect statistics on this newly populated table…
SQL> select table_name, num_rows, blocks, last_analyzed from user_tables where table_name='BOWIE_STALE'; TABLE_NAME NUM_ROWS BLOCKS LAST_ANAL --------------- ---------- ---------- --------- BOWIE_STALE SQL> select column_name, num_distinct, density, histogram, last_analyzed from user_tab_cols where table_name='BOWIE_STALE'; COLUMN_NAME NUM_DISTINCT DENSITY HISTOGRAM LAST_ANAL -------------------- ------------ ---------- --------------- --------- ID NONE CODE NONE NAME NONE
If we now run the following query a number of times while there are no statistics on the table:
SQL> select * from bowie_stale where code=42; 10 rows selected. Execution Plan ----------------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | ----------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 437 | 21413 | 553 (16)| 00:00:01 | |* 1 | TABLE ACCESS STORAGE FULL| BOWIE_STALE | 437 | 21413 | 553 (16)| 00:00:01 | ----------------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 1 - storage("CODE"=42) filter("CODE"=42) Note ----- - dynamic statistics used: dynamic sampling (level=2) - automatic DOP: Computed Degree of Parallelism is 1 Statistics ---------------------------------------------------------- 6 recursive calls 0 db block gets 39026 consistent gets 0 physical reads 0 redo size 867 bytes sent via SQL*Net to client 588 bytes received via SQL*Net from client 2 SQL*Net roundtrips to/from client 0 sorts (memory) 0 sorts (disk) 10 rows processed
The CBO has no choice but to use a FTS as I don’t yet have an index on the CODE column.
If I now wait for the next Automatic Indexing task to kick in AND if there are still NO statistics on the table:
SQL> select dbms_auto_index.report_last_activity('text', 'ALL', 'ALL' ) report from dual; REPORT -------------------------------------------------------------------------------- GENERAL INFORMATION ------------------------------------------------------------------------------- Activity start : 05-JUL-2020 06:36:31 Activity end : 05-JUL-2020 06:37:07 Executions completed : 1 Executions interrupted : 0 Executions with fatal error : 0 ------------------------------------------------------------------------------- SUMMARY (AUTO INDEXES) ------------------------------------------------------------------------------- Index candidates : 1 Indexes created : 0 Space used : 0 B Indexes dropped : 0 SQL statements verified : 0 SQL statements improved : 0 SQL plan baselines created : 0 Overall improvement factor : 0x ------------------------------------------------------------------------------- SUMMARY (MANUAL INDEXES) ------------------------------------------------------------------------------- Unused indexes : 0 Space used : 0 B Unusable indexes : 0
You can see that there was the one index candidate BUT no Automatic Index appears to have been created.
Assuming there are still no statistics:
SQL> select table_name, num_rows, blocks, last_analyzed from user_tables where table_name='BOWIE_STALE'; TABLE_NAME NUM_ROWS BLOCKS LAST_ANAL ------------------------------ ---------- ---------- --------- BOWIE_STALE SQL> select column_name, num_distinct, density, histogram, last_analyzed from user_tab_cols where table_name='BOWIE_STALE2'; COLUMN_NAME NUM_DISTINCT DENSITY HISTOGRAM LAST_ANAL -------------------- ------------ ---------- --------------- --------- ID NONE CODE NONE NAME NONE
If we look now at what indexes exist on the table:
SQL> select index_name, auto, constraint_index, visibility, compression, status, num_rows, leaf_blocks, clustering_factor from user_indexes where table_name='BOWIE_STALE'; INDEX_NAME AUT CON VISIBILIT COMPRESSION STATUS NUM_ROWS LEAF_BLOCKS CLUSTERING_FACTOR ------------------------------ --- --- --------- ------------- -------- ---------- ----------- ----------------- BOWIE_STALE_PK NO YES VISIBLE DISABLED VALID SYS_AI_300kk2unp8tr0 YES NO INVISIBLE DISABLED UNUSABLE 0 0 0 SQL> select index_name, column_name, column_position from user_ind_columns where table_name='BOWIE_STALE2' order by index_name, column_position; INDEX_NAME COLUMN_NAME COLUMN_POSITION ------------------------------ -------------------- --------------- BOWIE_STALE_PK ID 1 SYS_AI_300kk2unp8tr0 CODE 1
We notice there is now an Automatic Index BUT it remains in an UNUSABLE/INVISIBLE state. This means the index can’t be used by the CBO.
So if we now re-run the SQL query again:
SQL> select * from bowie_stale where code=42; 10 rows selected. Execution Plan ----------------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | ----------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 437 | 21413 | 553 (16)| 00:00:01 | |* 1 | TABLE ACCESS STORAGE FULL| BOWIE_STALE | 437 | 21413 | 553 (16)| 00:00:01 | ----------------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 1 - storage("CODE"=42) filter("CODE"=42) Note ----- - dynamic statistics used: dynamic sampling (level=2) - automatic DOP: Computed Degree of Parallelism is 1 Statistics ---------------------------------------------------------- 6 recursive calls 0 db block gets 39026 consistent gets 0 physical reads 0 redo size 867 bytes sent via SQL*Net to client 588 bytes received via SQL*Net from client 2 SQL*Net roundtrips to/from client 0 sorts (memory) 0 sorts (disk) 10 rows processed
The CBO has no choice still but to use the FTS.
In Part II, we’ll see that once we get into this scenario, it can be a tad problematic to get ourselves out of it and get the Automatic Index created as we would like…
Oracle 19c Automatic Indexing: Data Skew Part II (Everything’s Alright) September 14, 2020
Posted by Richard Foote in 19c, 19c New Features, Automatic Indexing, Automatic Table Statistics, Autonomous Transaction Processing, Data Skew, Exadata, High Frequency Statistics Collection, Histograms, Oracle, Oracle Cost Based Optimizer, Oracle General, Oracle Indexes, Oracle Statistics, Performance Tuning.5 comments
In my previous post, I discussed an example with data skew, in which the Automatic Indexing process created a new index, but somehow the CBO when using the index estimated the correct cardinality estimate even though no histograms were explicitly calculated.
In this post I’ll answer HOW this achieved by the CBO.
Get some idea on the answer by now looking at the column details:
SQL> select column_name, num_buckets, histogram from user_tab_cols where table_name='BOWIE_SKEW'; COLUMN_NAME NUM_BUCKETS HISTOGRAM --------------- ----------- --------------- ID 1 NONE CODE 10 FREQUENCY NAME 1 NONE
We can see that there is now indeed an histogram on the column. When and how were these histograms collected?
The answer lies with a new Oracle Database 19c feature called “High-Frequency Automatic Statistics Collection“, which is available on Exadata environments. As I’m running all these demos on the Oracle Autonomous Transaction Processing Cloud environment which runs on an Exadata platform, this feature is enabled by default.
To highlight the capabilities of this features more fully, I’m going to setup a slightly different demo with three tables:
SQL> create table bowie1 (id number, code number, name varchar2(42)); <= Stale with no stats Table created. SQL> insert into bowie1 select rownum, mod(rownum, 100)+1, 'David Bowie' from dual connect by level <= 100000; 100000 rows created. SQL> commit; Commit complete.
Table BOWIE1 has no statistics collected on it.
SQL> create table bowie2 (id number, code number, name varchar2(42)); Table created. SQL> insert into bowie2 select rownum, mod(rownum, 100)+1, 'David Bowie' from dual connect by level <= 100000; 100000 rows created. SQL> commit; Commit complete. SQL> exec dbms_stats.gather_table_stats(ownname=>null, tabname=>'BOWIE2'); PL/SQL procedure successfully completed. SQL> insert into bowie2 select rownum+100000, mod(rownum, 100)+1, 'Ziggy Stardust' from dual connect by level <= 50000; 50000 rows created. SQL> commit; Commit complete.
BOWIE2 table has new rows added after statistics have been collected and so has “stale” outdated stats.
SQL> create table bowie3 (id number, code number, name varchar2(42)); Table created. SQL> insert into bowie3 select rownum, 10, 'DAVID BOWIE' from dual connect by level <=1000000; 1000000 rows created. SQL> update bowie3 set code = 9 where mod(id,3) = 0; 333333 rows updated. SQL> update bowie3 set code = 1 where mod(id,2) = 0 and id between 1 and 20000; 10000 rows updated. SQL> update bowie3 set code = 2 where mod(id,2) = 0 and id between 30001 and 40000; 5000 rows updated. SQL> update bowie3 set code = 3 where mod(id,100) = 0 and id between 300001 and 400000; 1000 rows updated. SQL> update bowie3 set code = 4 where mod(id,100) = 0 and id between 400001 and 500000; 1000 rows updated. SQL> update bowie3 set code = 5 where mod(id,100) = 0 and id between 600001 and 700000; 1000 rows updated. SQL> update bowie3 set code = 6 where mod(id,1000) = 0 and id between 700001 and 800000; 100 rows updated. SQL> update bowie3 set code = 7 where mod(id,1000) = 0 and id between 800001 and 900000; 100 rows updated. SQL> update bowie3 set code = 8 where mod(id,1000) = 0 and id between 900001 and 1000000; 100 rows updated. SQL> commit; Commit complete. SQL> exec dbms_stats.gather_table_stats(ownname=>null, tabname=>'bowie3', estimate_percent=>100, method_opt=>'FOR ALL COLUMNS SIZE 1'); PL/SQL procedure successfully completed. SQL> select code, count(*) from bowie3 group by code order by code; CODE COUNT(*) ---------- ---------- 1 10000 2 5000 3 1000 4 1000 5 1000 6 100 7 100 8 100 9 327235 10 654465
The BOWIE3 table is as my previous example, with data skew but with NO histograms collected. I’m now going to run a query on BOWIE3 where the CBO gets the cardinality estimate hopelessly wrong because of the missing histogram on the CODE column:
SQL> select * from bowie3 where code=7; 100 rows selected. Execution Plan ---------------------------------------------------------- Plan hash value: 2517725203 ---------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | ---------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 100K| 1953K| 974 (2)| 00:00:01 | |* 1 | TABLE ACCESS FULL| BOWIE3 | 100K| 1953K| 974 (2)| 00:00:01 | ---------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 1 - filter("CODE"=7)
If we look at the current statistics on these tables:
SQL> select table_name, num_rows, stale_stats, notes from user_tab_statistics where table_name in ('BOWIE1', 'BOWIE2', 'BOWIE3'); TABLE_NAME NUM_ROWS STALE_S NOTES --------------- ---------- ------- ------------------------------ BOWIE1 BOWIE2 100000 YES BOWIE3 1000000 NO BOWIE2 150000 STATS_ON_CONVENTIONAL_DML
We can see that BOWIE1 has indeed no statistics.
BOWIE2 is marked as having state statistics, although thanks to another Oracle Database 19c feature called “Real-Time Statistics Collection“, does have some additional statistics captured (such as NUM_ROWS) when the additional rows were inserted. I’ll discuss this feature more fully in a later blog article.
BOWIE3 is considered fine in that it does have statistics which are NOT stale, BUT…
SQL> select column_name, num_buckets, histogram from user_tab_col_statistics where table_name='BOWIE3'; COLUMN_NAME NUM_BUCKETS HISTOGRAM --------------- ----------- --------------- ID 1 NONE CODE 1 NONE NAME 1 NONE
We don’t currently have any histograms even though a simple single table query was previously run based on a CODE predicate which had hopelessly inaccurate cardinality estimates.
If we wait approximately 15 minutes (default) for the High-Frequency Automatic Statistics Collection process to run and look at these column statistics again:
SQL> select table_name, num_rows, stale_stats from user_tab_statistics where table_name in ('BOWIE1', 'BOWIE2', 'BOWIE3'); TABLE_NAME NUM_ROWS STALE_S --------------- ---------- ------- BOWIE1 100000 NO BOWIE2 150000 NO BOWIE3 1000000 NO SQL> select column_name, num_buckets, histogram from user_tab_col_statistics where table_name='BOWIE3'; COLUMN_NAME NUM_BUCKETS HISTOGRAM --------------- ----------- --------------- ID 1 NONE CODE 10 FREQUENCY NAME 1 NONE
We now notice that:
BOWIE1 now has statistics captured, as the High-Frequency Automatic Statistics Collection process looks for tables with missing statistics.
BOWIE2 now has fully up to date statistics, as the High-Frequency Automatic Statistics Collection process looks for tables with stale statistics.
BOWIE3 now has histograms on the CODE columns, as the High-Frequency Automatic Statistics Collection process looks out for missing histograms if queries have been subsequently run with poor cardinality estimates.
Having more accurate, appropriate and up to date statistics all supports the CBO in making much better decisions in relation to the use of any newly created Automatic Indexes.
You can configure High-Frequency Automatic Statistics Collection in the following manner:
SQL> EXEC DBMS_STATS.SET_GLOBAL_PREFS('AUTO_TASK_STATUS','ON'); PL/SQL procedure successfully completed.
This turns the feature ON/OFF. It’s OFF by default on standard Exadata environments but ON by default in Autonomous Database environment.
SQL> EXEC DBMS_STATS.SET_GLOBAL_PREFS('AUTO_TASK_MAX_RUN_TIME','900'); PL/SQL procedure successfully completed.
This configures how long to allow the process to run (default is 3600 seconds/60 minutes).
SQL> EXEC DBMS_STATS.SET_GLOBAL_PREFS('AUTO_TASK_INTERVAL','900'); PL/SQL procedure successfully completed.
This configures the interval between the process running (default is every 900 seconds/15 minutes).
In my next post, I’ll look at a slightly more complex data skew example with Automatic Indexing, where both selective and unselective SQL predicates are invoked…