jump to navigation

Costing Concatenated Indexes With Range Scan Predicates Part II (Coming Back To Life) July 27, 2022

Posted by Richard Foote in Automatic Indexing, CBO, Column Statistics, Concatenated Indexes, Explain Plan For Index, Full Table Scans, Index Access Path, Index Column Order, Index Column Reorder, Index Internals, Index statistics, Leaf Blocks, Non-Equality Predicates, Oracle, Oracle Blog, Oracle Cost Based Optimizer, Oracle General, Oracle Index Seminar, Oracle Indexes, Oracle Statistics, Performance Tuning, Richard Foote Training.
add a comment

In my previous Part I post, I discussed how the CBO basically stops the index leaf block access calculations after a non-equality predicate. This means that for an index with the leading indexed column being accessed via an unselective non-equality predicate, a large percentage of the index’s leaf blocks might need to be scanned, making the index access path unviable.

In the example in Part I, an index on the ID, CODE columns was too expensive due to the unselective range-scan predicate based on the leading ID column.

To provide the CBO a potentially much more efficient access path, we need an index with the more selective CODE predicate to be the leading column:

SQL> CREATE INDEX radiohead_code_id_i ON radiohead(code, id);

Index created.

SQL> SELECT index_name, blevel, leaf_blocks, clustering_factor

FROM user_indexes WHERE index_name = 'RADIOHEAD_CODE_ID_I';

INDEX_NAME                        BLEVEL LEAF_BLOCKS CLUSTERING_FACTOR
----------------------------- ---------- ----------- -----------------
RADIOHEAD_CODE_ID_I                    1         265             98619

If we now re-run the previous query:

SQL> SELECT * FROM radiohead WHERE id BETWEEN 1000 AND 5000 AND CODE = 140;

Execution Plan

-----------------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name                | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |                     |     4 |    72 |     6   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| RADIOHEAD           |     4 |    72 |     6   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN                  | RADIOHEAD_CODE_ID_I |     4 |       |     2   (0)| 00:00:01 |
-----------------------------------------------------------------------------------------------------------

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
          7  consistent gets
          0  physical reads
          0  redo size
        806  bytes sent via SQL*Net to client
        608  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          4  rows processed

We notice the CBO is now using this new index, as the costs for this index-based plan have dropped significantly, down to just 6 (from the previous 116). This overall cost of 6 is lower than the cost of 105 for the Full Table Scan and hence the reason why this index-based plan is now chosen by the CBO.

This is all due entirely to the significant drop in costs in accessing the index itself, now just 2 (from the previous 112).

Importantly, these much lower costs are accurate as we can tell via the reduced number of consistent reads, now just 7 (from 114 from the previous index-based plan).

If we now look at the associated costings:

Effective Index Selectivity = CODE selectivity x ID selectivity

= (1/10000) x ((5000-1000)/(10000-1) + 2 x (1/10000))

= 0.0001 x ((4000/9999) + 0.0002)

= 0.0001 x 0.40024)

= 0.000040024

Effective Table Selectivity = same as Index Selectivity

= 0.000040024

 

The effective index selectivity of 0.000040024 is now much lower than the previous (0.40024), as the CBO can now consider the product of the selectivities of both columns).

If we now plug this improved effective index selectivity into the index path costing calculations:

Index IO Cost = blevel +

ceil(effective index selectivity x leaf_blocks) +

ceil(effective table selectivity x clustering_factor)

 

Index IO Cost = 1  +  ceil(0.000040024 x 265) + ceil(0.000040024 x 99034)

= 1 + 1  + 4

= 2 + 4

= 6

Index Access Cost  = IO Costs + CPU Costs (in this plan, 0% of total costs and so unchanged from the IO costs)

= 2 + 4

= 6

We can see how the respective 2 and 6 improved CBO index costings are derived.

So again, it’s important to note that Automatic Indexing is doing entirely the correct thing with these examples, when it creates an index with the equality based predicate columns as the leading columns of the index…

Automatic Indexing: JSON Expressions Part II (“Without You I’m Nothing)” April 14, 2022

Posted by Richard Foote in 21c New Features, Automatic Indexing, Autonomous Database, Autonomous Transaction Processing, CBO, Column Statistics, High Frequency Statistics Collection.
2 comments

In my previous post on Automatic Indexing (AI) and JSON expressions, I mentioned two additional issues with the execution plan generated after the automatic indexes were created:

SQL> select * from bowie_json where json_value(bowie_order, '$.PONumber')='42';

Execution Plan
----------------------------------------------------------
Plan hash value: 832017402

------------------------------------------------------------------------------------------------------------
| Id | Operation                           | Name                 | Rows | Bytes | Cost (%CPU) | Time      |
------------------------------------------------------------------------------------------------------------
|  0 | SELECT STATEMENT                    |                      | 20000 |  12M |    1524 (1) | 00:00:01  |
|  1 | TABLE ACCESS BY INDEX ROWID BATCHED | BOWIE_JSON           | 20000 |  12M |    1524 (1) | 00:00:01  |
|* 2 | INDEX RANGE SCAN                    | SYS_AI_ayvj257jd93cv |  8000 |      |       3 (0) | 00:00:01  |
------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - access(JSON_VALUE("BOWIE_ORDER" /*+ LOB_BY_VALUE */ FORMAT OSON , '$.PONumber' RETURNING
           VARCHAR2(4000) ERROR ON ERROR NULL ON EMPTY)='42')

Statistics
----------------------------------------------------------
          0 recursive calls
          0 db block gets
     234168 consistent gets
     200279 physical reads
          0 redo size
       1595 bytes sent via SQL*Net to client
        526 bytes received via SQL*Net from client
          3 SQL*Net roundtrips to/from client
          0 sorts (memory)
          0 sorts (disk)
          1 rows processed

The first issue is in relation to the Cost and the estimated cardinality (Rows) of the generated plan.

The execution plan has a Cost of 1524 but most importantly, an estimate cardinality of 20,000 rows. As only 1 row is actually returned, this row estimate is way way off and so therefore are the associated costs. This could potentially result in an inefficient plan and with the index not being used by the CBO.

The 20,000 row estimate comes from it being 1% of the number of rows (2 million) in the table. As I’ve discussed previously (as in this rather humourous post), the issue here is that the CBO has no idea what the expected cardinality might be, as the output from the JSON expression is effectively a black box.

Oracle generates virtual columns for this purpose, to capture column statistics that gives the CBO an accurate idea on the selectivity of expression based predicates.

But, if we look at the column statistics after the generation of the automatic indexes:

SQL> select table_name, column_name, num_distinct, virtual_column from user_tab_cols where table_name='BOWIE_JSON';

TABLE_NAME COLUMN_NAME                                        NUM_DISTINCT VIR
---------- -------------------------------------------------- ------------ ---
BOWIE_JSON ID                                                      1996800 NO
BOWIE_JSON BOWIE_DATE                                                    1 NO
BOWIE_JSON BOWIE_ORDER                                                   0 NO
BOWIE_JSON SYS_IME_OSON_B33B6E113C404F2EBF6648C01F605C8D                 1 YES
BOWIE_JSON SYS_NC00005$                                                    YES
BOWIE_JSON SYS_NC00006$                                                    YES

We notice that the two virtual columns generated for the JSON based expressions have no statistics. This is because we have yet to collect new statistics (or statistics specifically on hidden columns) since the creation of the automatic indexes. As a result, the CBO still has no idea on how many rows are estimated to come from the JSON based expressions and so has to resort to the (entirely wrong) 1% estimate.

The good news with Exadata environments (which of course includes the Autonomous Database environments), is that Oracle has the High Frequency Statistics Collection capability, which will automatically collect these missing statistics after a small (configurable) period of time. I’ve previously discussed High Frequency Statistics Collection here.

So if I just wait approximately 15 minutes in my “Exadata” environment and check out the columns statistics again:

SQL> select table_name, column_name, num_distinct, virtual_column from user_tab_cols where table_name='BOWIE_JSON';

TABLE_NAME COLUMN_NAME                                        NUM_DISTINCT VIR
---------- -------------------------------------------------- ------------ ---
BOWIE_JSON ID                                                      1996800 NO
BOWIE_JSON BOWIE_DATE                                                    1 NO
BOWIE_JSON BOWIE_ORDER                                                   0 NO
BOWIE_JSON SYS_IME_OSON_B33B6E113C404F2EBF6648C01F605C8D                 1 YES
BOWIE_JSON SYS_NC00005$                                            2000000 YES
BOWIE_JSON SYS_NC00006$                                            1996800 YES

We can see that the missing statistics have now been populated and the CBO can now accurately determine that these virtual columns are effectively unique.

If we now re-run the queries again, e.g.:

SQL> select * from bowie_json z where z.bowie_order.PONumber.number()=4242;

Execution Plan
----------------------------------------------------------
Plan hash value: 1921179906

------------------------------------------------------------------------------------------------------------
| Id | Operation                           | Name                 | Rows | Bytes | Cost (%CPU) |  Time     |
------------------------------------------------------------------------------------------------------------
|  0 | SELECT STATEMENT                    |                      |    1 |   669 |       4 (0) |  00:00:01 |
|  1 | TABLE ACCESS BY INDEX ROWID BATCHED | BOWIE_JSON           |    1 |   669 |       4 (0) |  00:00:01 |
|* 2 | INDEX RANGE SCAN                    | SYS_AI_gpdkwzugdn055 |    1 |       |       3 (0) |  00:00:01 |
------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - access(JSON_VALUE("BOWIE_ORDER" /*+ LOB_BY_VALUE */ FORMAT OSON , '$.PONumber' RETURNING
           NUMBER ERROR ON ERROR NULL ON EMPTY)=4242)

Statistics
----------------------------------------------------------
          0 recursive calls
          0 db block gets
     233150 consistent gets
     200279 physical reads
          0 redo size
       1599 bytes sent via SQL*Net to client
        526 bytes received via SQL*Net from client
          3 SQL*Net roundtrips to/from client
          0 sorts (memory)
          0 sorts (disk)
          1 rows processed

We can see that the CBO has now correctly estimated that just 1 row is to be returned and the associated CBO cost has reduced to just 4 (down from 1524) as a result.

So if you create a function-based index, make sure the generated virtual column (whether created automatically or if manually generated before the associated index) has the necessary statistics.

In the upcoming days, I’ll discuss the remaining issue associated with this plan…