jump to navigation

Automatic Indexing: JSON Expressions Part II (“Without You I’m Nothing)” April 14, 2022

Posted by Richard Foote in 21c New Features, Automatic Indexing, Autonomous Database, Autonomous Transaction Processing, CBO, Column Statistics, High Frequency Statistics Collection.
trackback

In my previous post on Automatic Indexing (AI) and JSON expressions, I mentioned two additional issues with the execution plan generated after the automatic indexes were created:

SQL> select * from bowie_json where json_value(bowie_order, '$.PONumber')='42';

Execution Plan
----------------------------------------------------------
Plan hash value: 832017402

------------------------------------------------------------------------------------------------------------
| Id | Operation                           | Name                 | Rows | Bytes | Cost (%CPU) | Time      |
------------------------------------------------------------------------------------------------------------
|  0 | SELECT STATEMENT                    |                      | 20000 |  12M |    1524 (1) | 00:00:01  |
|  1 | TABLE ACCESS BY INDEX ROWID BATCHED | BOWIE_JSON           | 20000 |  12M |    1524 (1) | 00:00:01  |
|* 2 | INDEX RANGE SCAN                    | SYS_AI_ayvj257jd93cv |  8000 |      |       3 (0) | 00:00:01  |
------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - access(JSON_VALUE("BOWIE_ORDER" /*+ LOB_BY_VALUE */ FORMAT OSON , '$.PONumber' RETURNING
           VARCHAR2(4000) ERROR ON ERROR NULL ON EMPTY)='42')

Statistics
----------------------------------------------------------
          0 recursive calls
          0 db block gets
     234168 consistent gets
     200279 physical reads
          0 redo size
       1595 bytes sent via SQL*Net to client
        526 bytes received via SQL*Net from client
          3 SQL*Net roundtrips to/from client
          0 sorts (memory)
          0 sorts (disk)
          1 rows processed

The first issue is in relation to the Cost and the estimated cardinality (Rows) of the generated plan.

The execution plan has a Cost of 1524 but most importantly, an estimate cardinality of 20,000 rows. As only 1 row is actually returned, this row estimate is way way off and so therefore are the associated costs. This could potentially result in an inefficient plan and with the index not being used by the CBO.

The 20,000 row estimate comes from it being 1% of the number of rows (2 million) in the table. As I’ve discussed previously (as in this rather humourous post), the issue here is that the CBO has no idea what the expected cardinality might be, as the output from the JSON expression is effectively a black box.

Oracle generates virtual columns for this purpose, to capture column statistics that gives the CBO an accurate idea on the selectivity of expression based predicates.

But, if we look at the column statistics after the generation of the automatic indexes:

SQL> select table_name, column_name, num_distinct, virtual_column from user_tab_cols where table_name='BOWIE_JSON';

TABLE_NAME COLUMN_NAME                                        NUM_DISTINCT VIR
---------- -------------------------------------------------- ------------ ---
BOWIE_JSON ID                                                      1996800 NO
BOWIE_JSON BOWIE_DATE                                                    1 NO
BOWIE_JSON BOWIE_ORDER                                                   0 NO
BOWIE_JSON SYS_IME_OSON_B33B6E113C404F2EBF6648C01F605C8D                 1 YES
BOWIE_JSON SYS_NC00005$                                                    YES
BOWIE_JSON SYS_NC00006$                                                    YES

We notice that the two virtual columns generated for the JSON based expressions have no statistics. This is because we have yet to collect new statistics (or statistics specifically on hidden columns) since the creation of the automatic indexes. As a result, the CBO still has no idea on how many rows are estimated to come from the JSON based expressions and so has to resort to the (entirely wrong) 1% estimate.

The good news with Exadata environments (which of course includes the Autonomous Database environments), is that Oracle has the High Frequency Statistics Collection capability, which will automatically collect these missing statistics after a small (configurable) period of time. I’ve previously discussed High Frequency Statistics Collection here.

So if I just wait approximately 15 minutes in my “Exadata” environment and check out the columns statistics again:

SQL> select table_name, column_name, num_distinct, virtual_column from user_tab_cols where table_name='BOWIE_JSON';

TABLE_NAME COLUMN_NAME                                        NUM_DISTINCT VIR
---------- -------------------------------------------------- ------------ ---
BOWIE_JSON ID                                                      1996800 NO
BOWIE_JSON BOWIE_DATE                                                    1 NO
BOWIE_JSON BOWIE_ORDER                                                   0 NO
BOWIE_JSON SYS_IME_OSON_B33B6E113C404F2EBF6648C01F605C8D                 1 YES
BOWIE_JSON SYS_NC00005$                                            2000000 YES
BOWIE_JSON SYS_NC00006$                                            1996800 YES

We can see that the missing statistics have now been populated and the CBO can now accurately determine that these virtual columns are effectively unique.

If we now re-run the queries again, e.g.:

SQL> select * from bowie_json z where z.bowie_order.PONumber.number()=4242;

Execution Plan
----------------------------------------------------------
Plan hash value: 1921179906

------------------------------------------------------------------------------------------------------------
| Id | Operation                           | Name                 | Rows | Bytes | Cost (%CPU) |  Time     |
------------------------------------------------------------------------------------------------------------
|  0 | SELECT STATEMENT                    |                      |    1 |   669 |       4 (0) |  00:00:01 |
|  1 | TABLE ACCESS BY INDEX ROWID BATCHED | BOWIE_JSON           |    1 |   669 |       4 (0) |  00:00:01 |
|* 2 | INDEX RANGE SCAN                    | SYS_AI_gpdkwzugdn055 |    1 |       |       3 (0) |  00:00:01 |
------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - access(JSON_VALUE("BOWIE_ORDER" /*+ LOB_BY_VALUE */ FORMAT OSON , '$.PONumber' RETURNING
           NUMBER ERROR ON ERROR NULL ON EMPTY)=4242)

Statistics
----------------------------------------------------------
          0 recursive calls
          0 db block gets
     233150 consistent gets
     200279 physical reads
          0 redo size
       1599 bytes sent via SQL*Net to client
        526 bytes received via SQL*Net from client
          3 SQL*Net roundtrips to/from client
          0 sorts (memory)
          0 sorts (disk)
          1 rows processed

We can see that the CBO has now correctly estimated that just 1 row is to be returned and the associated CBO cost has reduced to just 4 (down from 1524) as a result.

So if you create a function-based index, make sure the generated virtual column (whether created automatically or if manually generated before the associated index) has the necessary statistics.

In the upcoming days, I’ll discuss the remaining issue associated with this plan…

Comments»

1. Automatic Indexing: Deferred Invalidations (“The Post War Dream”) | Richard Foote's Oracle Blog - April 19, 2022

[…] my previous post on how JSON expressions can now be automatically indexed, I mentioned there was an outstanding […]

Like

2. Automatic Indexes: Scenarios Where Automatic Indexes NOT Created Part I (“Always Crashing In The Same Car”) | Richard Foote's Oracle Blog - April 26, 2022

[…] As I’ve discussed previously, Oracle has increased the number of scenarios in which it will now create automatic indexes, such as with non-equality predicates and JSON expressions. […]

Like


Leave a comment