Automatic Indexing: Deferred Invalidations (“The Post War Dream”) April 19, 2022

Posted by Richard Foote in 21c New Features, Automatic Indexing, Autonomous Database, Autonomous Transaction Processing, CBO, Deferred Invalidation, Exadata, Function Based Indexes, Index Access Path, Index Internals, JSON, Oracle, Oracle Blog, Oracle Cloud, Oracle Cost Based Optimizer, Oracle Indexes, Richard's Blog.

In my previous post on how JSON expressions can now be automatically indexed, I mentioned there was an outstanding issue with the associated CBO execution plan, immediately post the creation of the automatic index:

SQL> select * from bowie_json where json_value(bowie_order, '$.PONumber')='42';

Execution Plan
----------------------------------------------------------
Plan hash value: 832017402

------------------------------------------------------------------------------------------------------------
| Id | Operation                           | Name                 | Rows  | Bytes | Cost (%CPU) | Time     |
------------------------------------------------------------------------------------------------------------
|  0 | SELECT STATEMENT                    |                      | 20000 |   12M |    1524 (1) | 00:00:01 |
|  1 | TABLE ACCESS BY INDEX ROWID BATCHED | BOWIE_JSON           | 20000 |   12M |    1524 (1) | 00:00:01 |
|* 2 | INDEX RANGE SCAN                    | SYS_AI_ayvj257jd93cv | 8000  |       |       3 (0) | 00:00:01 |
------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - access(JSON_VALUE("BOWIE_ORDER" /*+ LOB_BY_VALUE */ FORMAT OSON , '$.PONumber' RETURNING
           VARCHAR2(4000) ERROR ON ERROR NULL ON EMPTY)='42')

Statistics
----------------------------------------------------------
          0 recursive calls
          0 db block gets
     234168 consistent gets
     200279 physical reads
          0 redo size
       1595 bytes sent via SQL*Net to client
        526 bytes received via SQL*Net from client
          3 SQL*Net roundtrips to/from client
          0 sorts (memory)
          0 sorts (disk)
          1 rows processed

If we look at the number of recursive calls, we notice that it remains at 0. If we look at both the number of consistent gets (234168) and physical reads (200279), they both remain very high and identical to that of the previous Full Table Scan plan.

Basically, although autotrace suggests the newly created automatic index is being used, in fact the previous Full Table Scan plan is still being invoked.  (Note: this of course is one of the dangers of the autotrace plan, in that it might not display the actual plan being invoked).
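One way to see the plan that was actually executed, rather than the one autotrace predicts, is to pull it from the cursor cache with DBMS_XPLAN.DISPLAY_CURSOR. The following is only a minimal sketch, assuming SQL*Plus (or SQLcl) and a session with the privileges needed to read the underlying V$ views:

```sql
-- Autotrace and serveroutput can issue extra statements behind the scenes,
-- so turn both off; otherwise DISPLAY_CURSOR may report on the wrong
-- "last" statement.
set autotrace off
set serveroutput off

select * from bowie_json where json_value(bowie_order, '$.PONumber')='42';

-- With no SQL_ID supplied, DISPLAY_CURSOR shows the plan of the last
-- statement executed in this session, as stored in the cursor cache.
select * from table(dbms_xplan.display_cursor(format => 'TYPICAL'));
```

If the cached cursor is still the old one, this output should show the Full Table Scan, regardless of what autotrace displays.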

So what’s going on here?

The Oracle Database 21c New Features Guide makes the following point: “an enhancement has been introduced to reduce the overhead of cursor invalidations when a new automatic index is created”.

Oracle 12.2 introduced a new feature in which one can now defer the invalidation of dependent SQL cursors when an index is created or modified. I’ve of course discussed this previously in this 12.2 Index Deferred Invalidation post.
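As a reminder of what the 12.2 feature looks like when used manually, here is a minimal sketch. The CURSOR_INVALIDATION parameter and the DEFERRED INVALIDATION clause are the 12.2 syntax; the index name is hypothetical and the index on ID is purely for illustration, as it is not part of this demo:

```sql
-- Session-level default: DDL issued from this session defers the
-- invalidation of dependent cursors where possible.
alter session set cursor_invalidation = deferred;

-- Or request it for a single statement
-- (BOWIE_JSON_ID_I is a hypothetical index name).
create index bowie_json_id_i on bowie_json(id) deferred invalidation;
```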

When an automatic index is created in 21c, the current SQL cursors are NOT invalidated (to reduce the overhead of having to potentially reparse a large number of current SQL cursors). However, this means that currently inefficient SQL statements will keep their existing sub-optimal execution plans after the creation of new automatic indexes, until the existing SQL cursors are aged out.

At which point, the new CBO plan using the automatic index will actually be invoked:

SQL> select * from bowie_json where json_value(bowie_order, '$.PONumber')='42';

Execution Plan
----------------------------------------------------------
Plan hash value: 832017402

------------------------------------------------------------------------------------------------------------
| Id | Operation                           | Name                 | Rows | Bytes | Cost (%CPU) |  Time     |
------------------------------------------------------------------------------------------------------------
|  0 | SELECT STATEMENT                    |                      |    1 |   671 |       4 (0) |  00:00:01 |
|  1 | TABLE ACCESS BY INDEX ROWID BATCHED | BOWIE_JSON           |    1 |   671 |       4 (0) |  00:00:01 |
|* 2 | INDEX RANGE SCAN                    | SYS_AI_ayvj257jd93cv |    1 |       |       3 (0) |  00:00:01 |
------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - access(JSON_VALUE("BOWIE_ORDER" /*+ LOB_BY_VALUE */ FORMAT OSON , '$.PONumber' RETURNING
           VARCHAR2(4000) ERROR ON ERROR NULL ON EMPTY)='42')

Statistics
----------------------------------------------------------
          30 recursive calls
           0 db block gets
          46 consistent gets
          11 physical reads
           0 redo size
        1595 bytes sent via SQL*Net to client
         526 bytes received via SQL*Net from client
           3 SQL*Net roundtrips to/from client
           0 sorts (memory)
           0 sorts (disk)
           1 rows processed

So just be aware in Oracle Database 21c that your beautifully created automatic indexes may not actually get used as desired for a period of time…
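To check whether a given statement is still running with its old cursor, the shared pool can be queried directly. This is a rough sketch only, assuming access to V$SQL; the LIKE pattern simply matches the demo query above:

```sql
-- One row per cached child cursor for the demo query.
-- PLAN_HASH_VALUE shows which plan each child cursor really uses;
-- IS_ROLLING_INVALID (available from 12.2) flags cursors that have
-- been marked for deferred invalidation.
select sql_id, child_number, plan_hash_value, executions,
       is_rolling_invalid, last_active_time
from   v$sql
where  sql_text like 'select * from bowie_json where json_value(bowie_order%';
```

If the PLAN_HASH_VALUE reported here differs from the one autotrace displays (832017402 for the index plan in this demo), the statement is still using its previous plan.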

Automatic Indexing: JSON Expressions Part I (Making Plans For Nigel) April 13, 2022

Posted by Richard Foote in Automatic Indexing, Autonomous Database, CBO, Exadata, Function Based Indexes, Index statistics, JSON, Oracle, Oracle Cloud, Oracle Cost Based Optimizer, Oracle General, Oracle Indexes, Oracle Statistics, Virtual Columns.

When Automatic Indexing was first released, one of the restrictions was that automatic indexes on JSON expressions were NOT supported.

However, the Oracle Database 21c doco mentions:

“Automatic indexes can be single or multi-column. They are considered for the following: Selected expressions (for example, JSON expressions)”.

So on my (admittedly dodgy) “Exadata” VM, I thought I’d check out how AI now deals with JSON expressions.

I start by creating a simple little table that uses the new 21c JSON datatype and populate it with some JSON documents (note the PONumber key has effectively unique numeric values assigned):

SQL> CREATE TABLE bowie_json
       (id number,
        bowie_date date,
        bowie_order JSON);

SQL> insert into bowie_json
     select
     rownum,
     sysdate,
     '{"PONumber" : ' || rownum || ',
       "Reference" : "2022' || rownum || 'DBOWIE",
       "Requestor" : "David Bowie",
       "User" : "DBOWIE",
       "CostCenter" : "A42",
       "ShippingInstructions" : {"name" : "David Bowie",
                                 "Address": {"street" : "42 Ziggy Street",
                                             "city" : "Canberra",
                                              "state" : "ACT",
                                              "zipCode" : 2601,
                                              "country" : "Australia"},
                                 "Phone" : [{"type" : "Office", "number" : "417-555-7777"},
                                            {"type" : "Mobile", "number" : "417-555-1234"}]},
       "Special Instructions" : null,
       "AllowPartialShipment" : true,
       "LineItems" : [{"ItemNumber" : 1,
                       "Part" : {"Description" : "Hunky Dory",
                                 "UnitPrice" : 10.95},
                                  "Quantity" : 5.0},
                      {"ItemNumber" : 2,
                       "Part" : {"Description" : "Pin-Ups",
                                 "UnitPrice" : 10.95},
                                 "Quantity" : 3.0}]}'
from dual connect by level <= 2000000;

2000000 rows created.

SQL> commit;

Commit complete.

SQL> exec dbms_stats.gather_table_stats(ownname=>null, tabname=>'BOWIE_JSON');

PL/SQL procedure successfully completed.

As always, it’s important to ensure the table has statistics, as AI does not work properly without them.
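A quick way to confirm the statistics are in place is to check the data dictionary. A minimal sketch, assuming the table lives in the current schema:

```sql
-- NUM_ROWS and LAST_ANALYZED are NULL if the table has never been analyzed.
select table_name, num_rows, blocks, last_analyzed
from   user_tables
where  table_name = 'BOWIE_JSON';
```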

I then run a number of SQL statements, with different JSON expression based predicates, including:

SQL> select * from bowie_json where json_value(bowie_order, '$.PONumber')='42';

SQL> select * from bowie_json z where z.bowie_order.PONumber.number()=4242;

SQL> select * from bowie_json where json_value(bowie_order, '$.PONumber' returning number)=42;

Execution Plan
----------------------------------------------------------
Plan hash value: 1196930810

--------------------------------------------------------------------------------
| Id | Operation         | Name       | Rows  | Bytes | Cost (%CPU)| Time      |
--------------------------------------------------------------------------------
|  0 | SELECT STATEMENT  |            | 20000 |   12M |  34476 (1) | 00:00:02  |
|* 1 | TABLE ACCESS FULL | BOWIE_JSON | 20000 |   12M |  34476 (1) | 00:00:02  |
--------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter(JSON_VALUE("BOWIE_ORDER" /*+ LOB_BY_VALUE */ FORMAT OSON
           , '$.PONumber' RETURNING NUMBER NULL ON ERROR)=42)

Statistics
----------------------------------------------------------
          0 recursive calls
          0 db block gets
     259127 consistent gets
     200279 physical reads
          0 redo size
       1595 bytes sent via SQL*Net to client
        526 bytes received via SQL*Net from client
          3 SQL*Net roundtrips to/from client
          0 sorts (memory)
          0 sorts (disk)
          1 rows processed

They all return just the one row, but must currently use a Full Table Scan with no indexes present.

So what does AI make of things?

The first thing to note is that running the AI last activity report generates the following error:

SQL> select dbms_auto_index.report_last_activity() report from dual;
ERROR:
ORA-30954: char 0 is invalid in json_value(BOWIE_ORDER, '$.PONumber' returning VA
ORA-06512: at "SYS.DBMS_AUTO_INDEX", line 177
ORA-06512: at "SYS.DBMS_AUTO_INDEX", line 107
ORA-06512: at "SYS.DBMS_AUTO_INDEX_INTERNAL", line 8676
ORA-06512: at "SYS.DBMS_AUTO_INDEX_INTERNAL", line 8676
ORA-06512: at "SYS.DBMS_AUTO_INDEX_INTERNAL", line 9226
ORA-06512: at "SYS.DBMS_AUTO_INDEX", line 89
ORA-06512: at "SYS.DBMS_AUTO_INDEX", line 167
ORA-06512: at line 1

no rows selected

If we look at the indexes now present with the table:

SQL> select index_name, index_type, auto, visibility, status, num_rows, leaf_blocks, clustering_factor from user_indexes
where table_name='BOWIE_JSON';

INDEX_NAME                INDEX_TYPE                AUT VISIBILIT STATUS     NUM_ROWS LEAF_BLOCKS CLUSTERING_FACTOR
------------------------- ------------------------- --- --------- -------- ---------- ----------- -----------------
SYS_IL0000081096C00003$$  LOB                       NO  VISIBLE   VALID
SYS_AI_ayvj257jd93cv      FUNCTION-BASED NORMAL     YES VISIBLE   VALID       2000000        5141            380000
SYS_AI_gpdkwzugdn055      FUNCTION-BASED NORMAL     YES VISIBLE   VALID       2000000        4596            200000

SQL> select index_name, column_expression from user_ind_expressions where table_name='BOWIE_JSON';

INDEX_NAME                COLUMN_EXPRESSION
------------------------- ----------------------------------------------------------------------------------------------------------------
SYS_AI_ayvj257jd93cv      JSON_VALUE("BOWIE_ORDER" FORMAT OSON , '$.PONumber' RETURNING VARCHAR2(4000) ERROR ON ERROR NULL ON EMPTY)
SYS_AI_gpdkwzugdn055      JSON_VALUE("BOWIE_ORDER" FORMAT OSON , '$.PONumber' RETURNING NUMBER ERROR ON ERROR NULL ON EMPTY)

We can see that AI has indeed created two new automatic indexes, one on the VARCHAR2 JSON expression and one on the NUMBER JSON expression.

If we re-run the SQLs, we notice 3 very important points. Note the following example was run soon after the automatic indexes were created:

SQL> select * from bowie_json where json_value(bowie_order, '$.PONumber')='42';

Execution Plan
----------------------------------------------------------
Plan hash value: 832017402

------------------------------------------------------------------------------------------------------------
| Id | Operation                           | Name                 | Rows  | Bytes | Cost (%CPU) | Time     |
------------------------------------------------------------------------------------------------------------
|  0 | SELECT STATEMENT                    |                      | 20000 |   12M |    1524 (1) | 00:00:01 |
|  1 | TABLE ACCESS BY INDEX ROWID BATCHED | BOWIE_JSON           | 20000 |   12M |    1524 (1) | 00:00:01 |
|* 2 | INDEX RANGE SCAN                    | SYS_AI_ayvj257jd93cv |  8000 |       |       3 (0) | 00:00:01 |
------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - access(JSON_VALUE("BOWIE_ORDER" /*+ LOB_BY_VALUE */ FORMAT OSON , '$.PONumber' RETURNING
           VARCHAR2(4000) ERROR ON ERROR NULL ON EMPTY)='42')

Statistics
----------------------------------------------------------
          0 recursive calls
          0 db block gets
     234168 consistent gets
     200279 physical reads
          0 redo size
       1595 bytes sent via SQL*Net to client
        526 bytes received via SQL*Net from client
          3 SQL*Net roundtrips to/from client
          0 sorts (memory)
          0 sorts (disk)
          1 rows processed

The first point to note is that the CBO now chooses to use the newly created automatic index. As only one row is returned, this is as one would hope.

But there are two other very important points/issues worth making about the above execution plan and associated costs and statistics. One is associated with new AI behaviour introduced in 21c and the other is associated with an old trap in relation to function-based indexes.

I’ll leave it to the discerning reader to spot these issues, before I cover them in Part II in the coming days…

Oracle 19c Automatic Indexing: Function-Based Indexes? Part II (If You Can See Me) February 5, 2021

Posted by Richard Foote in 19c, 19c New Features, Automatic Indexing, Autonomous Database, Autonomous Transaction Processing, CBO, Exadata, Function Based Indexes, Oracle, Oracle Blog, Oracle Cloud, Oracle Cost Based Optimizer, Oracle General, Oracle Indexes, Oracle19c, Virtual Columns.

In my previous post, I discussed how Automatic Indexing does not currently support creating an index based on a function or expression predicate, even if it’s an equality predicate. You must manually create the associated function-based index.

However, if you have access to the application, there’s a better strategy when frequently searching on a function-based predicate. That’s to create a Virtual Column and use this column in your searching criteria (as mentioned by Connor McDonald in this comment).

To illustrate, I’m going to drop the previously manually created function-based index and hence the associated hidden virtual column, as Oracle quite rightly doesn’t allow you to have two virtual columns based on the same expression in the same table.

SQL> drop index david_upper_name_i;

Index dropped.
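The hidden virtual column that backs a function-based index can be seen in the data dictionary. As a rough sketch, a query such as the following, run before and after the DROP INDEX, should show the system-generated column appearing and then disappearing:

```sql
-- Function-based indexes are implemented via a hidden virtual column;
-- DATA_DEFAULT holds the indexed expression.
select column_name, hidden_column, virtual_column, data_default
from   user_tab_cols
where  table_name = 'DAVID'
and    virtual_column = 'YES';
```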

Since Oracle 11g, Oracle has supported the use of Visible Virtual Columns: columns that don’t physically exist, but are defined by a function/expression whose value is derived when the column is queried.

I’ll next create a Virtual Column called UPPER_NAME that is defined not by a Data Type, but by the result of the UPPER function on the previously defined NAME column:

SQL> alter table david add (upper_name as (upper(name)));

Table altered.

Regardless of the size of the table, this column is added virtually instantly (pun fully intended), as no data is physically stored in the table itself. I view it (yep, another pun) as a “mini-view” that can be used to hide complexity from the developer, with the actual data derived at run-time when the column is accessed in a SQL statement.
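To make the “mini-view” idea concrete, the virtual column can be selected like any other column. A small sketch, using arbitrary ID values from the demo data:

```sql
-- UPPER_NAME is derived from NAME at query time; nothing is stored for it.
select id, name, upper_name
from   david
where  id in (42, 4242);
```

Being derived, the column cannot be written to directly; it always reflects the current value of NAME.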

After I generate fresh statistics:

SQL> exec dbms_stats.gather_table_stats(ownname=>null, tabname=>'DAVID', estimate_percent=>null);

PL/SQL procedure successfully completed.

SQL> select column_name, hidden_column, virtual_column, num_distinct, density, histogram from dba_tab_cols where table_name='DAVID';

COLUMN_NAME          HID VIR NUM_DISTINCT    DENSITY HISTOGRAM
-------------------- --- --- ------------ ---------- ---------------
NAME                 NO  NO      10000000          0 HYBRID
MORE_STUFF9          NO  NO             1  .00000005 FREQUENCY
MORE_STUFF8          NO  NO             1  .00000005 FREQUENCY
MORE_STUFF7          NO  NO             1  .00000005 FREQUENCY
MORE_STUFF6          NO  NO             1  .00000005 FREQUENCY
MORE_STUFF5          NO  NO             1  .00000005 FREQUENCY
MORE_STUFF4          NO  NO             1  .00000005 FREQUENCY
MORE_STUFF3          NO  NO             1  .00000005 FREQUENCY
MORE_STUFF2          NO  NO             1  .00000005 FREQUENCY
MORE_STUFF10         NO  NO             1  .00000005 FREQUENCY
MORE_STUFF1          NO  NO             1  .00000005 FREQUENCY
ID                   NO  NO      10000000          0 HYBRID
CODE                 NO  NO         10000      .0001 HYBRID
UPPER_NAME           NO YES      10000000          0 HYBRID

Note how the UPPER_NAME virtual column is NOT hidden and now has up to date statistics.

We can now run this simplified query based on the new UPPER_NAME column, which does not need to include the potentially complex function expression:

SQL> select * from david where upper_name='DAVID BOWIE 42';

1 row selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 2426813604

-----------------------------------------------------------------------------------
| Id | Operation                 | Name  | Rows | Bytes | Cost (%CPU) | Time      |
-----------------------------------------------------------------------------------
|  0 | SELECT STATEMENT          |       |    1 |   200 |    3349 (6) | 00:00:01  | 
|* 1 | TABLE ACCESS STORAGE FULL | DAVID |    1 |   200 |    3349 (6) | 00:00:01  |
-----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - storage("UPPER_NAME"='DAVID BOWIE 42')
    filter("UPPER_NAME"='DAVID BOWIE 42')

Note
-----
- automatic DOP: Computed Degree of Parallelism is 1

Statistics
----------------------------------------------------------
          0 recursive calls
          0 db block gets
     263469 consistent gets
     263452 physical reads
          0 redo size
       1328 bytes sent via SQL*Net to client
        375 bytes received via SQL*Net from client
          2 SQL*Net roundtrips to/from client
          0 sorts (memory)
          0 sorts (disk)
          1 rows processed

If we look at portions of the subsequent Automatic Indexing report:

 

SUMMARY (AUTO INDEXES)
-------------------------------------------------------------------------------
Index candidates                             : 1
Indexes created (visible / invisible)        : 1 (1 / 0)
Space used (visible / invisible)             : 360.71 MB (360.71 MB / 0 B)
Indexes dropped                              : 0
SQL statements verified                      : 2
SQL statements improved (improvement factor) : 2 (263476.8x)
SQL plan baselines created                   : 0
Overall improvement factor                   : 263476.8x
-------------------------------------------------------------------------------

SUMMARY (MANUAL INDEXES)
-------------------------------------------------------------------------------
Unused indexes   : 0
Space used       : 0 B
Unusable indexes : 0
-------------------------------------------------------------------------------

INDEX DETAILS
-------------------------------------------------------------------------------
The following indexes were created:
-------------------------------------------------------------------------------
---------------------------------------------------------------------------
| Owner | Table | Index                | Key        | Type   | Properties |
---------------------------------------------------------------------------
| BOWIE | DAVID | SYS_AI_4k4mkgkw049ht | UPPER_NAME | B-TREE | NONE       |
---------------------------------------------------------------------------
-------------------------------------------------------------------------------

VERIFICATION DETAILS
-------------------------------------------------------------------------------
The performance of the following statements improved:
-------------------------------------------------------------------------------
Parsing Schema Name : BOWIE
SQL ID              : 7tfqh3pu526mt
SQL Text            : select * from david where upper_name='DAVID BOWIE 42'
Improvement Factor  : 263484.7x

Execution Statistics:
-----------------------------
                        Original Plan                Auto Index Plan
                        ---------------------------- ----------------------------
Elapsed Time (s):       1471249                      1414
CPU Time (s):           300584                       986
Buffer Gets:            3161816                      4
Optimizer Cost:         3349                         4
Disk Reads:             3161432                      3
Direct Writes:          0                            0
Rows Processed:         12                           1
Executions:             12                           1

PLANS SECTION
--------------------------------------------------------------------------------
- Original
-----------------------------
Plan Hash Value : 2426813604

-----------------------------------------------------------------------------
| Id | Operation                 | Name  | Rows | Bytes | Cost | Time       |
-----------------------------------------------------------------------------
|  0 | SELECT STATEMENT          |       |      |       | 3349 |            |
|  1 | TABLE ACCESS STORAGE FULL | DAVID |    1 |   200 | 3349 | 00:00:01   |
-----------------------------------------------------------------------------

Notes
-----
- dop = 1
- px_in_memory_imc = no
- px_in_memory = no
- cardinality_feedback = yes

- With Auto Indexes
-----------------------------
Plan Hash Value : 1447691372

-------------------------------------------------------------------------------------------------------
| Id  | Operation                            | Name                 | Rows | Bytes | Cost | Time      |
-------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                     |                      |    1 |   200 |    4 | 00:00:01  |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED | DAVID                |    1 |   200 |    4 | 00:00:01  |
| * 2 |  INDEX RANGE SCAN                    | SYS_AI_4k4mkgkw049ht |    1 |       |    3 | 00:00:01  |
-------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
------------------------------------------
* 2 - access("UPPER_NAME"='DAVID BOWIE 42')

Notes
-----
- Dynamic sampling used for this statement ( level = 11 )

 

We see from the report that Automatic Indexing has now created an index (SYS_AI_4k4mkgkw049ht) on the virtual UPPER_NAME column, which is implicitly a function-based index:

SQL> select index_name, index_type, auto, constraint_index, visibility, status, num_rows, leaf_blocks, clustering_factor
from user_indexes where table_name='DAVID';

INDEX_NAME           INDEX_TYPE                  AUT CON VISIBILIT STATUS     NUM_ROWS LEAF_BLOCKS CLUSTERING_FACTOR
-------------------- --------------------------- --- --- --------- -------- ---------- ----------- -----------------
SYS_AI_4k4mkgkw049ht FUNCTION-BASED NORMAL       YES NO  VISIBLE   VALID      10000000       43104           2136839

SQL> select index_name, column_name, column_position
from user_ind_columns where table_name='DAVID' order by index_name, column_position;

INDEX_NAME           COLUMN_NAME          COLUMN_POSITION
-------------------- -------------------- ---------------
SYS_AI_4k4mkgkw049ht UPPER_NAME                         1

 

If we now re-run the SQL query:

SQL> select * from david where upper_name='DAVID BOWIE 4242';

1 row selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 1447691372

------------------------------------------------------------------------------------------------------------
| Id | Operation                            | Name                 | Rows | Bytes | Cost (%CPU) | Time     |
------------------------------------------------------------------------------------------------------------
|  0 | SELECT STATEMENT                     |                      |    1 |   200 |       4 (0) | 00:00:01 |
|  1 |  TABLE ACCESS BY INDEX ROWID BATCHED | DAVID                |    1 |   200 |       4 (0) | 00:00:01 |
|* 2 |   INDEX RANGE SCAN                   | SYS_AI_4k4mkgkw049ht |    1 |       |       3 (0) | 00:00:01 |
------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - access("UPPER_NAME"='DAVID BOWIE 4242')

Note
-----
- automatic DOP: Computed Degree of Parallelism is 1

Statistics
----------------------------------------------------------
          0 recursive calls
          0 db block gets
          5 consistent gets
          0 physical reads
          0 redo size
       1334 bytes sent via SQL*Net to client
        377 bytes received via SQL*Net from client
          2 SQL*Net roundtrips to/from client
          0 sorts (memory)
          0 sorts (disk)
          1 rows processed

The CBO now uses the new Automatic Index to significantly improve the performance of the query.

So not only is a user-defined Virtual Column a cleaner solution when a function-based expression is used frequently, it also has the added advantage of being supported by Automatic Indexing.
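It may also be worth checking whether SQL that still uses the original expression benefits from the new index. The optimizer can generally match an expression to a virtual column defined on that same expression, but treat that as an assumption to verify on your own system rather than something this demo has shown. A minimal sketch:

```sql
set autotrace off
set serveroutput off

-- The original form of the predicate, using the expression rather than
-- the UPPER_NAME virtual column.
select * from david where upper(name) = 'DAVID BOWIE 4242';

-- Show the plan actually used for the statement above.
select * from table(dbms_xplan.display_cursor(format => 'TYPICAL'));
```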

Oracle 19c Automatic Indexing: Function-Based Indexes? (No Plan) February 4, 2021

Posted by Richard Foote in 19c, 19c New Features, Autonomous Database, Autonomous Transaction Processing, CBO, Exadata, Function Based Indexes, Oracle, Oracle Cloud, Oracle General, Oracle Indexes, Oracle19c, Virtual Columns.

I previously discussed how Automatic Indexing only currently supports Equality based predicates.

The question I have today is does Automatic Indexing support function-based indexes? Let’s take a look.

The below DAVID table has the key column NAME which is an effectively unique VARCHAR2 column:

SQL> create table david (id number, code number, name varchar2(42), more_stuff1 varchar2(42), more_stuff2 varchar2(42), more_stuff3 varchar2(42), more_stuff4 varchar2(42), more_stuff5 varchar2(42), more_stuff6 varchar2(42), more_stuff7 varchar2(42), more_stuff8 varchar2(42), more_stuff9 varchar2(42), more_stuff10 varchar2(42));

Table created.

SQL> insert into david select rownum, mod(rownum, 10000)+1, 'David Bowie '|| rownum, 'Ziggy Stardust', 'Ziggy Stardust', 'Ziggy Stardust', 'Ziggy Stardust', 'Ziggy Stardust', 'Ziggy Stardust', 'Ziggy Stardust', 'Ziggy Stardust', 'Ziggy Stardust', 'Ziggy Stardust' from dual connect by level <=10000000;

10000000 rows created.

SQL> commit;

Commit complete.

SQL> exec dbms_stats.gather_table_stats(ownname=>null, tabname=>'DAVID', estimate_percent=>null);

PL/SQL procedure successfully completed.

If we look at the current details of the table columns:

SQL> select column_name, num_distinct, density, histogram from dba_tab_cols where table_name='DAVID';

COLUMN_NAME          NUM_DISTINCT    DENSITY HISTOGRAM
-------------------- ------------ ---------- ---------------
NAME                     10000000          0 HYBRID
MORE_STUFF9                     1  .00000005 FREQUENCY
MORE_STUFF8                     1  .00000005 FREQUENCY
MORE_STUFF7                     1  .00000005 FREQUENCY
MORE_STUFF6                     1  .00000005 FREQUENCY
MORE_STUFF5                     1  .00000005 FREQUENCY
MORE_STUFF4                     1  .00000005 FREQUENCY
MORE_STUFF3                     1  .00000005 FREQUENCY
MORE_STUFF2                     1  .00000005 FREQUENCY
MORE_STUFF10                    1  .00000005 FREQUENCY
MORE_STUFF1                     1  .00000005 FREQUENCY
ID                       10000000          0 HYBRID
CODE                        10000      .0001 HYBRID

We notice the same oddity as in my previous post, in that all columns have histograms…

Let’s run the following query with an UPPER function-based predicate that returns only the one row:

SQL> select * from david where upper(name) = 'DAVID BOWIE 4242';

Execution Plan
----------------------------------------------------------
Plan hash value: 2426813604

-----------------------------------------------------------------------------------
| Id | Operation                 | Name  | Rows | Bytes | Cost (%CPU) | Time      |
-----------------------------------------------------------------------------------
|  0 | SELECT STATEMENT          |       | 100K |   17M |    3350 (6) | 00:00:01  |
|* 1 | TABLE ACCESS STORAGE FULL | DAVID | 100K |   17M |    3350 (6) | 00:00:01  |
-----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - storage(UPPER("NAME")='DAVID BOWIE 4242')
    filter(UPPER("NAME")='DAVID BOWIE 4242')

Note
-----
- automatic DOP: Computed Degree of Parallelism is 1

Statistics
----------------------------------------------------------
          0 recursive calls
          0 db block gets
     263469 consistent gets
     263452 physical reads
          0 redo size
       1256 bytes sent via SQL*Net to client
        381 bytes received via SQL*Net from client
          2 SQL*Net roundtrips to/from client
          0 sorts (memory)
          0 sorts (disk)
          1 rows processed

What does Automatic Indexing make of this scenario?

Basically, it does nothing. Currently, Automatic Indexing does NOT support such function-based indexes, even with equality based predicates (as of at least version 19.5.0.0.0). If we look at the next Automatic Indexing report:

SUMMARY (AUTO INDEXES)
-------------------------------------------------------------------------------
Index candidates           : 0
Indexes created            : 0
Space used                 : 0 B
Indexes dropped            : 0
SQL statements verified    : 2
SQL statements improved    : 0
SQL plan baselines created : 0
Overall improvement factor : 0x
-------------------------------------------------------------------------------

SUMMARY (MANUAL INDEXES)
-------------------------------------------------------------------------------
Unused indexes   : 0
Space used       : 0 B
Unusable indexes : 0
-------------------------------------------------------------------------------

No such function-based index is ever created by Automatic Indexing:

SQL> select index_name, auto, constraint_index, visibility, compression, status, num_rows, leaf_blocks, clustering_factor from user_indexes where table_name='DAVID';

no rows selected
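To rule out the possibility that Automatic Indexing is simply disabled or excluded for the schema, the current configuration can be checked. A minimal sketch, assuming access to the DBA_AUTO_INDEX_CONFIG view:

```sql
-- AUTO_INDEX_MODE needs to be IMPLEMENT for automatic indexes to be
-- created and made visible (REPORT ONLY leaves them invisible).
select parameter_name, parameter_value
from   dba_auto_index_config
order  by parameter_name;
```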

To improve the performance of this query, one has to manually create the necessary function-based index:

SQL> create index david_upper_name_i on david(upper(name));

Index created.

If we now re-run the query:

SQL> select name from david where upper(name) = 'DAVID BOWIE 4242';

Execution Plan
----------------------------------------------------------
Plan hash value: 2675555529

----------------------------------------------------------------------------------------------------------
| Id | Operation                           | Name               | Rows  | Bytes | Cost (%CPU) | Time     |
----------------------------------------------------------------------------------------------------------
|  0 | SELECT STATEMENT                    |                    |  100K | 4199K |    3175 (1) | 00:00:01 |
|  1 | TABLE ACCESS BY INDEX ROWID BATCHED | DAVID              |  100K | 4199K |    3175 (1) | 00:00:01 |
|* 2 | INDEX RANGE SCAN                    | DAVID_UPPER_NAME_I | 40000 |       |       3 (0) | 00:00:01 |
----------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - access(UPPER("NAME")='DAVID BOWIE 4242')

Note
-----
- automatic DOP: Computed Degree of Parallelism is 1

Statistics
----------------------------------------------------------
          0 recursive calls
          0 db block gets
          5 consistent gets
          0 physical reads
          0 redo size
        369 bytes sent via SQL*Net to client
        384 bytes received via SQL*Net from client
          2 SQL*Net roundtrips to/from client
          0 sorts (memory)
          0 sorts (disk)
          1 rows processed

The query now uses the function-based index to significantly improve the performance of this query, with just 5 consistent gets.

Note however that, as with all function-based indexes, the default cardinality estimate and associated CBO costs are way off (100K rows are estimated, not the 1 row that is actually returned). This is due to the CBO having no real idea of the number and distribution of values coming out of the “black box” function-based predicate.

This is why Oracle automatically creates a hidden virtual column in which to store the necessary statistics associated with the function (in this case the SYS_NC00014$ column):

SQL> select column_name, num_distinct, density, histogram from dba_tab_cols where table_name='DAVID';

COLUMN_NAME          NUM_DISTINCT    DENSITY HISTOGRAM
-------------------- ------------ ---------- ---------------
NAME                     10000000          0 HYBRID
MORE_STUFF9                     1  .00000005 FREQUENCY
MORE_STUFF8                     1  .00000005 FREQUENCY
MORE_STUFF7                     1  .00000005 FREQUENCY
MORE_STUFF6                     1  .00000005 FREQUENCY
MORE_STUFF5                     1  .00000005 FREQUENCY
MORE_STUFF4                     1  .00000005 FREQUENCY
MORE_STUFF3                     1  .00000005 FREQUENCY
MORE_STUFF2                     1  .00000005 FREQUENCY
MORE_STUFF10                    1  .00000005 FREQUENCY
MORE_STUFF1                     1  .00000005 FREQUENCY
ID                       10000000          0 HYBRID
CODE                        10000      .0001 HYBRID
SYS_NC00014$                                 NONE
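
If it’s not obvious which expression a system-generated column such as SYS_NC00014$ belongs to, the data dictionary can be used to tie things together. A minimal sketch, assuming the table and index live in the current schema:

```sql
-- Expression behind each function-based index on the table
select index_name, column_position, column_expression
from   user_ind_expressions
where  table_name = 'DAVID';
```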

But we need to first collect statistics on this hidden virtual column for the statistics to be populated:

SQL> exec dbms_stats.gather_table_stats(ownname=>null, tabname=>'DAVID', no_invalidate=> false, method_opt=> 'FOR ALL HIDDEN COLUMNS SIZE 1');

SQL> select column_name, num_distinct, density, histogram from dba_tab_cols where table_name='DAVID';

COLUMN_NAME          NUM_DISTINCT    DENSITY HISTOGRAM
-------------------- ------------ ---------- ---------------
NAME                     10000000          0 HYBRID
MORE_STUFF9                     1  .00000005 FREQUENCY
MORE_STUFF8                     1  .00000005 FREQUENCY
MORE_STUFF7                     1  .00000005 FREQUENCY
MORE_STUFF6                     1  .00000005 FREQUENCY
MORE_STUFF5                     1  .00000005 FREQUENCY
MORE_STUFF4                     1  .00000005 FREQUENCY
MORE_STUFF3                     1  .00000005 FREQUENCY
MORE_STUFF2                     1  .00000005 FREQUENCY
MORE_STUFF10                    1  .00000005 FREQUENCY
MORE_STUFF1                     1  .00000005 FREQUENCY
ID                       10000000          0 HYBRID
CODE                        10000      .0001 HYBRID
SYS_NC00014$              9947366          0 HYBRID

Now the CBO has the necessary statistics by which to determine a much more accurate cardinality estimate for the function-based predicate and so potentially a more efficient execution plan:

SQL> select * from david where upper(name) = 'DAVID BOWIE 4242';

Execution Plan
----------------------------------------------------------
Plan hash value: 2675555529

----------------------------------------------------------------------------------------------------------
| Id | Operation                           | Name               | Rows | Bytes | Cost (%CPU) | Time      |
----------------------------------------------------------------------------------------------------------
|  0 | SELECT STATEMENT                    |                    |    1 |   200 |       4 (0) | 00:00:01  |
|  1 | TABLE ACCESS BY INDEX ROWID BATCHED | DAVID              |    1 |   200 |       4 (0) | 00:00:01  |
|* 2 | INDEX RANGE SCAN                    | DAVID_UPPER_NAME_I |    1 |       |       3 (0) | 00:00:01  |
----------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - access(UPPER("NAME")='DAVID BOWIE 4242')

Note
-----
- automatic DOP: Computed Degree of Parallelism is 1

Statistics
----------------------------------------------------------
          1 recursive calls
          0 db block gets
          5 consistent gets
          0 physical reads
          0 redo size
       1256 bytes sent via SQL*Net to client
        381 bytes received via SQL*Net from client
          2 SQL*Net roundtrips to/from client
          0 sorts (memory)
          0 sorts (disk)
          1 rows processed

With the virtual column statistics in place, the CBO now has the cardinality estimate of 1 and associated costs spot on, which is always a good thing.

This requirement to collect statistics on the virtual column associated with a function-based index, so that the index is costed and used effectively, is perhaps but one reason why function-based indexes are currently not supported by Automatic Indexing.
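
As a practical safeguard after manually creating any function-based index, it’s worth checking for hidden virtual columns that have yet to be analyzed. A sketch, assuming the current schema and that a null LAST_ANALYZED is a good enough indicator of missing column statistics:

```sql
-- Hidden virtual columns (such as those behind function-based indexes) with no statistics yet
select table_name, column_name
from   user_tab_cols
where  hidden_column  = 'YES'
and    virtual_column = 'YES'
and    last_analyzed is null
order  by table_name, column_name;
```

Any table listed is a candidate for the hidden column statistics gather shown earlier.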

As always, this may change in the future…

Storing Date Values As Characters Part II (A Better Future) May 30, 2016

Posted by Richard Foote in 12c, CBO, Function Based Indexes, Oracle Indexes, Storing Dates as Characters.

In the previous post, I discussed how storing date values within a character data type is a really really bad idea and illustrated how the CBO can easily get its costings totally wrong as a result. A function-based date index helped the CBO get the correct costings and protect the integrity of the date data.

During the demo, I re-collected statistics on the table because the hidden virtual column associated with the function-based index doesn’t have statistics immediately after the index is created.

Before re-collecting statistics:

SQL> select column_name, num_distinct, density, histogram, hidden_column, virtual_column from dba_tab_cols where table_name='BOWIE';

COLUMN_NAME  NUM_DISTINCT    DENSITY HISTOGRAM       HID VIR
------------ ------------ ---------- --------------- --- ---
SYS_NC00004$                         NONE            YES YES
BOWIE_DATE          10000      .0001 NONE            NO  NO
CODE                 1000       .001 NONE            NO  NO
ID                1000000    .000001 NONE            NO  NO

And afterwards:

SQL> exec dbms_stats.gather_table_stats(ownname=>null, tabname=>'BOWIE');

PL/SQL procedure successfully completed. 

SQL> select column_name, num_distinct, density, histogram, hidden_column, virtual_column 
from dba_tab_cols where table_name='BOWIE'; 

COLUMN_NAME  NUM_DISTINCT DENSITY    HISTOGRAM       HID VIR 
------------ ------------ ---------- --------------- --- --- 
SYS_NC00004$        10000      .0001 NONE            YES YES 
BOWIE_DATE          10000      .0001 HYBRID          NO  NO 
CODE                 1000       .001 NONE            NO  NO 
ID                1000000    .000001 NONE            NO  NO

 

We can see that the hidden virtual column now has statistics.

But we also notice another difference, that being the BOWIE_DATE column now has a histogram (of type Hybrid).

As discussed in the previous post, the issue here is that the date data within the character column covers only a very specific subset of all the potential character values that could reside within the column. Therefore the CBO is getting the range scan selectivity hopelessly incorrect.

Now that we’ve run a few queries featuring the BOWIE_DATE column in the predicates and as there’s effectively data skew within the column, the column becomes a candidate for a histogram with the default SIZE AUTO collection method.

The histogram now provides the CBO with a much more accurate picture of the distribution of the data within the BOWIE_DATE column: between discrete “date” values, there are only so many rows that qualify.

As a result of the histogram, the CBO can now make much more accurate cardinality estimates.
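
To see why a column has become a histogram candidate, the recorded column usage can be inspected. This is a sketch assuming the table is in the current schema. Note that column usage is only flushed to the dictionary periodically, so very recent queries may not show up immediately:

```sql
-- SQL*Plus display settings so the CLOB report isn't cut short
set long 100000 pagesize 0

-- Reports how each column of BOWIE has been used in predicates (equality, range, like, etc.)
select dbms_stats.report_col_usage(user, 'BOWIE') from dual;
```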

If we now re-run the query that actually returns 8300 rows but for which the CBO previously estimated only 100 rows:

SQL> select * from bowie where bowie_date between '2015 10 10' and '2015 12 31';

8300 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 1845943507

---------------------------------------------------------------------------
| Id | Operation         | Name  | Rows | Bytes | Cost (%CPU) | Time     |
---------------------------------------------------------------------------
| 0  | SELECT STATEMENT  |       | 4152 | 83040 |   1000 (12) | 00:00:01 |
|* 1 | TABLE ACCESS FULL | BOWIE | 4152 | 83040 |   1000 (12) | 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter("BOWIE_DATE">='2015 10 10' AND "BOWIE_DATE"<='2015 12 31')

Statistics
----------------------------------------------------------
14 recursive calls
0 db block gets
4063 consistent gets
0 physical reads
0 redo size
282075 bytes sent via SQL*Net to client
6635 bytes received via SQL*Net from client
555 SQL*Net roundtrips to/from client
1 sorts (memory)
0 sorts (disk)
8300 rows processed

We see that at an estimated 4,152 rows, it’s a much better estimate. Not perfect, but maybe good enough to now get the more efficient Full Table Scan execution plan.

If we re-run the query that returned over 1/2 the table at some 570,000 rows but with the CBO previously estimating only 116 rows:

SQL> select * from bowie where bowie_date between '2000 10 10' and '2016 12 31';

570800 rows selected.
    
Execution Plan
----------------------------------------------------------
Plan hash value: 1845943507

---------------------------------------------------------------------------
| Id | Operation         | Name  | Rows | Bytes | Cost (%CPU) | Time     |
---------------------------------------------------------------------------
| 0  | SELECT STATEMENT  |       | 572K |   10M |   1012 (13) | 00:00:01 |
|* 1 | TABLE ACCESS FULL | BOWIE | 572K |   10M |   1012 (13) | 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter("BOWIE_DATE">='2000 10 10' AND "BOWIE_DATE"<='2016 12 31')
   
Statistics
----------------------------------------------------------
14 recursive calls
0 db block gets
41456 consistent gets
4 physical reads
0 redo size
19292352 bytes sent via SQL*Net to client
419135 bytes received via SQL*Net from client
38055 SQL*Net roundtrips to/from client
1 sorts (memory)
0 sorts (disk)
570800 rows processed

 

We see that at an estimate of 572K rows, it’s now got this just about right and again has made the right decision with the Full Table Scan execution plan.

Storing date data in character based columns is still a really really bad idea and limits the manner in which date data can be analysed, protected and accessed, but with appropriate histograms in place, at least the CBO has some chance of making a reasonable fist of things with some range based queries.
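
Rather than relying on SIZE AUTO and prior column usage, a histogram can also be requested explicitly. A minimal sketch; the bucket count of 254 is just an illustrative choice and not something taken from the demo above:

```sql
-- Gather table statistics, restricting column statistics to BOWIE_DATE
-- and explicitly requesting a histogram with up to 254 buckets
exec dbms_stats.gather_table_stats(ownname=>null, tabname=>'BOWIE', method_opt=>'FOR COLUMNS BOWIE_DATE SIZE 254');

-- Check which type of histogram (if any) is now in place on the column
select column_name, num_distinct, histogram
from   user_tab_col_statistics
where  table_name = 'BOWIE'
and    column_name = 'BOWIE_DATE';
```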

As a follow-up, I was asked if storing dates in NUMBER format is a better option than as a string. I’ll discuss that next.

Storing Date Values As Characters (What’s Really Happening) May 26, 2016

Posted by Richard Foote in 12c, Function Based Indexes, Oracle Indexes, Storing Dates as Characters.

For something that’s generally considered an extremely bad idea, I’ve lost count of the number of times I’ve come across applications that insist on storing date values as characters within the database. We’ve all seen them …

I recently got called in to assist a customer who was having issues with a POC in which the database didn’t appear to want to use the Database In-Memory option as expected. In various key scenarios, the CBO kept coming up with execution plans that used indexes (they were hesitant to drop these particular indexes), when the queries would have run so much faster had it just used the Database In-Memory store. So I was called in to find out what’s really happening and it turned out that the main culprit was indeed queries against columns where dates were stored as characters within the database. In the process, we found another issue with some “invalid” date values. Go figure.

Interestingly, both issues could be addressed by creating a new index on the date column …

I’ve kinda replicated the scenario here with the following little test case. I’ll begin by creating a table with a varchar2 field (bowie_date) that actually stores a whole bunch of “dates”:

SQL> create table bowie (id number, code number, bowie_date varchar2(42));

Table created.

SQL> insert into bowie select rownum, mod(rownum,1000), to_char(sysdate-mod(rownum,10000), 'YYYY MM DD') from dual connect by level <=1000000;

1000000 rows created.

SQL> commit;

Commit complete.

So the bowie_date column basically has 10000 different dates, evenly distributed with 100 occurrences of each date.

I’ll now collect statistics on the table:

SQL> exec dbms_stats.gather_table_stats(ownname=>null, tabname=>'BOWIE');

PL/SQL procedure successfully completed.

SQL> select column_name, num_distinct, density, histogram, hidden_column, virtual_column from dba_tab_cols where table_name='BOWIE';

COLUMN_NAME  NUM_DISTINCT    DENSITY HISTOGRAM       HID VIR
------------ ------------ ---------- --------------- --- ---
BOWIE_DATE          10000      .0001 NONE            NO  NO
CODE                 1000       .001 NONE            NO  NO
ID                1000000    .000001 NONE            NO  NO

 

So the bowie_date column indeed has 10000 distinct dates.

I’ll now create a standard b-tree index on this column:

SQL> create index bowie_date_i on bowie(bowie_date);

Index created.

So the data looks as follows, with bowie_date being a varchar2 column that actually contains date data, along with its min/max values:

SQL> select * from bowie where rownum <11;

        ID       CODE BOWIE_DATE
---------- ---------- ----------
       916        916 2013 11 22
       917        917 2013 11 21
       918        918 2013 11 20
       919        919 2013 11 19
       920        920 2013 11 18
       921        921 2013 11 17
       922        922 2013 11 16
       923        923 2013 11 15
       924        924 2013 11 14
       925        925 2013 11 13

10 rows selected.

SQL> select min(bowie_date) min, max(bowie_date) max from bowie;

MIN        MAX
---------- ----------
1989 01 09 2016 05 26

If we run a query that uses an equality predicate as follows:

SQL> select * from bowie where bowie_date = '2015 06 13';

100 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 1525056162

----------------------------------------------------------------------------------------------------
| Id | Operation                             | Name         | Rows | Bytes | Cost (%CPU)|     Time |
----------------------------------------------------------------------------------------------------
| 0  | SELECT STATEMENT                      |              |  100 |  2000 |     103 (0)| 00:00:01 |
| 1  | TABLE ACCESS BY INDEX ROWID BATCHED   | BOWIE        |  100 |  2000 |     103 (0)| 00:00:01 |
|* 2 | INDEX RANGE SCAN                      | BOWIE_DATE_I |  100 |       |       3 (0)| 00:00:01 |
----------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - access("BOWIE_DATE"='2015 06 13')

Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
110 consistent gets
16 physical reads
0 redo size
4383 bytes sent via SQL*Net to client
618 bytes received via SQL*Net from client
8 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
100 rows processed 

The CBO gets things just about right. 100 rows are estimated and indeed 100 rows are retrieved. So we have confidence the CBO has made the right decision in using the index here as so few rows are actually retrieved.

However, if we run a range scan predicate such as the following:

SQL> select * from bowie where bowie_date between '2015 10 10' and '2015 12 31';

8300 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 1525056162

----------------------------------------------------------------------------------------------------
| Id | Operation                          | Name         | Rows | Bytes | Cost (%CPU)|     Time |
----------------------------------------------------------------------------------------------------
| 0  | SELECT STATEMENT                   |              |  100 |  2000 |     104 (0)| 00:00:01 |
| 1  | TABLE ACCESS BY INDEX ROWID BATCHED| BOWIE        |  100 |  2000 |     104 (0)| 00:00:01 |
|* 2 | INDEX RANGE SCAN                   | BOWIE_DATE_I |  100 |       |       3 (0)| 00:00:01 |
----------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - access("BOWIE_DATE">='2015 10 10' AND "BOWIE_DATE"<='2015 12 31')

Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
8881 consistent gets
38 physical reads
0 redo size
315219 bytes sent via SQL*Net to client
6635 bytes received via SQL*Net from client
555 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
8300 rows processed

The CBO has got the costings wrong here. It still estimates only 100 rows are to be returned when in actual fact 8300 rows come back.

If we select an even larger “date” range:

SQL> select * from bowie where bowie_date between '2000 10 10' and '2016 12 31';

570800 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 1525056162

----------------------------------------------------------------------------------------------------
| Id | Operation                          | Name         | Rows | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------------
|  0 | SELECT STATEMENT                   |              |  116 |  2320 |     120 (0)| 00:00:01 |
|  1 | TABLE ACCESS BY INDEX ROWID BATCHED| BOWIE        |  116 |  2320 |     120 (0)| 00:00:01 |
|* 2 | INDEX RANGE SCAN                   | BOWIE_DATE_I |  116 |       |       3 (0)| 00:00:01 |
----------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - access("BOWIE_DATE">='2000 10 10' AND "BOWIE_DATE"<='2016 12 31')

Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
610491 consistent gets
1704 physical reads
0 redo size
21575496 bytes sent via SQL*Net to client
419135 bytes received via SQL*Net from client
38055 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
570800 rows processed

The CBO has got things seriously wrong here. We’re actually returning over 1/2 the table, some 570,800 rows but the CBO thinks only 116 rows will be returned. Why ?

The problem comes back to storing date values as characters. The CBO has absolutely no idea that these “characters” are actually meant to be dates and has no idea that the only valid ranges of values are date values.

With a possible range between “1989 01 09” and “2016 05 26”, any combination of characters (up to the column size of 42) could potentially exist in this column (the value ‘1zxgs$.jKN6tglhasgdlhlhd23bkbk?k’ for example).

So the required range between ‘2000 10 10’ and ‘2016 12 31’ actually represents a relatively narrow range of possible values within the range of all possible values (especially as the leading character differs between the min/max).

This is why the CBO is estimating such a low number of rows to be returned and why it incorrectly decides to use the index. “Hiding” the meaning and distribution of values from the CBO in this manner can be problematic to say the least.
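
To illustrate the point, strings that are clearly not dates still compare as being “between” the two bounds. A small sketch, assuming the default binary character comparison (NLS_COMP=BINARY); the two test values are made up purely for illustration:

```sql
-- Neither value is a valid date, yet both sort between the two bounds as plain strings
select val,
       case when val between '2000 10 10' and '2016 12 31'
            then 'in range' else 'out of range' end as status
from  (select '2005 zzz' as val from dual
       union all
       select '2010 hello' from dual);
```

As far as the CBO is concerned, values such as these are just as plausible as genuine dates when it estimates how much of the column’s range a predicate covers.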

Worse of course is the possibility of “invalid” dates being entered, as the database has no implicit way to police the integrity of the data:

SQL> insert into bowie values (1000001, 42, '2016 6 31');

1 row created.

SQL> rollback;

Rollback complete.

We know there’s no such date as 31st June but Oracle has no idea that this is logically invalid data. Or the value ‘lh;ghsgdsd7gdGLH96bb’ for that matter …

Did I mention that storing dates in a character column is a really really bad idea …
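
If such a table has been around for a while, it’s worth checking whether any invalid “dates” have already crept in. A sketch, assuming Oracle Database 12.2 or later, where the VALIDATE_CONVERSION function is available:

```sql
-- Rows whose BOWIE_DATE value cannot be converted to a DATE with the expected format
-- (VALIDATE_CONVERSION returns 1 if the conversion succeeds, 0 if it fails)
select id, bowie_date
from   bowie
where  validate_conversion(bowie_date as date, 'YYYY MM DD') = 0;
```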

Now there are a couple of ways to help address these issues if changing the column and its datatype is not possible due to the application. One way is to create a function-based index as follows on a date version of the column:

SQL> create index bowie_date_fn_i on bowie(to_date(bowie_date,'YYYY MM DD'));

Index created.

If we run the equality query but this time referencing the to_date function:

SQL> select * from bowie where to_date(bowie_date, 'YYYY MM DD') = '13-JUN-2015';

100 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 960797537

-------------------------------------------------------------------------------------------------------
| Id | Operation                           | Name            | Rows  | Bytes  | Cost (%CPU)|     Time |
-------------------------------------------------------------------------------------------------------
| 0  | SELECT STATEMENT                    |                 | 10000 |   273K |     893 (2)| 00:00:01 |
| 1  | TABLE ACCESS BY INDEX ROWID BATCHED | BOWIE           | 10000 |   273K |     893 (2)| 00:00:01 |
|* 2 | INDEX RANGE SCAN                    | BOWIE_DATE_FN_I |  4000 |        |       3 (0)| 00:00:01 |
-------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - access(TO_DATE("BOWIE_DATE",'YYYY MM DD')=TO_DATE(' 2015-06-13 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))

Statistics
----------------------------------------------------------
12 recursive calls
0 db block gets
113 consistent gets
16 physical reads
0 redo size
3268 bytes sent via SQL*Net to client
618 bytes received via SQL*Net from client
8 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
100 rows processed

We notice the CBO has got the row estimate way wrong here, thinking that 10000 rows, not 100 rows are to be returned. Why ?

Remember, when we create a function-based index, Oracle creates a hidden virtual column on the table as Oracle needs a way to store the statistics associated with the result set from the function. But these statistics aren’t populated until we next collect statistics on the table (or explicitly for just the hidden columns) and without the column statistics, the CBO can make poor assumptions:

SQL> select column_name, data_type, num_distinct, density, histogram, hidden_column, virtual_column from dba_tab_cols where table_name='BOWIE';

COLUMN_NAME  DATA_TYPE NUM_DISTINCT    DENSITY HISTOGRAM       HID VIR
------------ --------- ------------ ---------- --------------- --- ---
SYS_NC00004$ DATE                              NONE            YES YES
BOWIE_DATE   VARCHAR2         10000      .0001 NONE            NO  NO
CODE         NUMBER            1000       .001 NONE            NO  NO
ID           NUMBER         1000000    .000001 NONE            NO  NO

There are no column statistics for the virtual column but we note the data type of the virtual column is DATE. So let’s collect new statistics on the table:

SQL> exec dbms_stats.gather_table_stats(ownname=>null, tabname=>'BOWIE');

PL/SQL procedure successfully completed.

SQL> select column_name, data_type, num_distinct, density, histogram, hidden_column, virtual_column from dba_tab_cols where table_name='BOWIE';

COLUMN_NAME  DATA_TYPE NUM_DISTINCT    DENSITY HISTOGRAM       HID VIR
------------ --------- ------------ ---------- --------------- --- ---
SYS_NC00004$ DATE             10000      .0001 NONE            YES YES
BOWIE_DATE   VARCHAR2         10000      .0001 HYBRID          NO  NO
CODE         NUMBER            1000       .001 NONE            NO  NO
ID           NUMBER         1000000    .000001 NONE            NO  NO

There are actually two key differences in the above statistics, but in this post I’ll just focus on the fact that the hidden virtual column now has associated statistics (I’ll discuss the other key difference in my next blog post).
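
If we only wanted to populate the statistics on the hidden virtual column, the gather could be restricted to just the hidden columns, which should leave the existing statistics on the other columns as they were. A minimal sketch:

```sql
-- Gather column statistics for hidden columns only, with no histograms (SIZE 1)
exec dbms_stats.gather_table_stats(ownname=>null, tabname=>'BOWIE', method_opt=>'FOR ALL HIDDEN COLUMNS SIZE 1');
```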

When we re-run the query:

SQL> select * from bowie where to_date(bowie_date, 'YYYY MM DD') = '13-JUN-2015';

100 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 960797537

-------------------------------------------------------------------------------------------------------
| Id | Operation                           | Name            | Rows | Bytes | Cost (%CPU) | Time     |
-------------------------------------------------------------------------------------------------------
| 0  | SELECT STATEMENT                    |                 |  100 |  2800 |     102 (0) | 00:00:01 |
| 1  | TABLE ACCESS BY INDEX ROWID BATCHED | BOWIE           |  100 |  2800 |     102 (0) | 00:00:01 |
|* 2 | INDEX RANGE SCAN                    | BOWIE_DATE_FN_I |  100 |       |       3 (0) | 00:00:01 |
-------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - access(TO_DATE("BOWIE_DATE",'YYYY MM DD')=TO_DATE(' 2015-06-13 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))

Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
108 consistent gets
0 physical reads
0 redo size
3268 bytes sent via SQL*Net to client
618 bytes received via SQL*Net from client
8 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
100 rows processed

The CBO has now got the estimate spot on.
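
One thing to be mindful of: comparing the to_date expression to the string '13-JUN-2015' relies on an implicit conversion that depends on the session’s NLS_DATE_FORMAT (and language) settings. A more robust sketch of the same query uses an ANSI date literal or an explicit format mask instead:

```sql
-- ANSI date literal, independent of the session's NLS_DATE_FORMAT
select * from bowie where to_date(bowie_date, 'YYYY MM DD') = date '2015-06-13';

-- Equivalent, with an explicit format mask on the comparison value
select * from bowie where to_date(bowie_date, 'YYYY MM DD') = to_date('13-06-2015', 'DD-MM-YYYY');
```

Both forms compare a DATE to a DATE and keep the indexed expression unchanged on the left-hand side, so the function-based index remains usable.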

If we now run a range scan query similar to the previous one that accessed over 1/2 the table, but referencing the to_date function:

SQL> select * from bowie where to_date(bowie_date, 'YYYY MM DD') between '10-OCT-2000' and '31-DEC-2015';

556100 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 1845943507

---------------------------------------------------------------------------
| Id | Operation         | Name  | Rows  | Bytes  | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
| 0  | SELECT STATEMENT  |       |  556K |    14M |   1328 (34)| 00:00:01 |
|* 1 | TABLE ACCESS FULL | BOWIE |  556K |    14M |   1328 (34)| 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter(TO_DATE("BOWIE_DATE",'YYYY MM DD')>=TO_DATE(' 2000-10-10
00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND TO_DATE("BOWIE_DATE",'YYYY MM
DD')<=TO_DATE(' 2015-12-31 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))

Statistics
----------------------------------------------------------
8 recursive calls
0 db block gets
40444 consistent gets
0 physical reads
0 redo size
18804277 bytes sent via SQL*Net to client
408355 bytes received via SQL*Net from client
37075 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
556100 rows processed

The CBO has got the estimates pretty well spot on and is now performing the far more efficient Full Table Scan. The CBO knows that the virtual column is of type DATE and therefore can much more accurately determine the actual cardinality estimates for the range scan on the “date” column.

If we now run the other corresponding range scan that returned a moderate number of rows:

SQL> select * from bowie where to_date(bowie_date, 'YYYY MM DD') between '10-OCT-2015' and '31-DEC-2015';

8300 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 1845943507

---------------------------------------------------------------------------
| Id | Operation         | Name  | Rows | Bytes  | Cost (%CPU) | Time     |
---------------------------------------------------------------------------
| 0  | SELECT STATEMENT  |       | 8401 |   229K |   1205 (27) | 00:00:01 |
|* 1 | TABLE ACCESS FULL | BOWIE | 8401 |   229K |   1205 (27) | 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter(TO_DATE("BOWIE_DATE",'YYYY MM DD')>=TO_DATE(' 2015-10-10
00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND TO_DATE("BOWIE_DATE",'YYYY MM
DD')<=TO_DATE(' 2015-12-31 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))

Statistics
----------------------------------------------------------
14 recursive calls
0 db block gets
4058 consistent gets
0 physical reads
0 redo size
282075 bytes sent via SQL*Net to client
6635 bytes received via SQL*Net from client
555 SQL*Net roundtrips to/from client
1 sorts (memory)
0 sorts (disk)
8300 rows processed

We notice that again the CBO has got the row estimate just about right and because the index has a poor clustering factor, the CBO still decided to go down the Full Table Scan path.

Even if we don’t use the index all that much (perhaps the Database In-Memory store is a better option for most queries), it can still play an important role in policing the integrity of the data. An attempt to insert an invalid date will now automatically be captured by the database and fail:

SQL> insert into bowie values (1000001, 42, '2016 6 31');
insert into bowie values (1000001, 42, '2016 6 31')
*
ERROR at line 1:
ORA-01839: date not valid for month specified

As the function-based index can only be populated or updated if a valid date is inserted into the table, any attempt to insert an invalid date will fail, thus protecting the integrity of the data.

The best “fix” here is to store these dates in a date field within the database, where the above issues are automatically addressed. If this is not possible, then the introduction and usage of an associated function-based index can certainly assist the CBO in making the correct decision.
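If moving to a genuine DATE column is an option, the migration might look something like the following minimal sketch. It assumes Oracle 12.2 or later (for VALIDATE_CONVERSION), and the column name BOWIE_DATE_D is purely hypothetical and not part of the demo.

```sql
-- Step 1: list any existing values that would not convert with this format.
-- VALIDATE_CONVERSION returns 1 if the conversion would succeed, otherwise 0.
select bowie_date
from   bowie
where  validate_conversion(bowie_date as date, 'YYYY MM DD') = 0;

-- Step 2: once the query above returns no rows, add a real DATE column
-- (BOWIE_DATE_D is a hypothetical name) and populate it from the string column.
alter table bowie add (bowie_date_d date);

update bowie
set    bowie_date_d = to_date(bowie_date, 'YYYY MM DD');

commit;
```

Checking first means the UPDATE is less likely to fail part-way through on a single bad value.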

However, if data integrity is not a concern (the application does a fantastic job of it and no one ever has to perform manual data fixes directly in the database), then there’s another option to help make the CBO do the right thing.

The clue is back in how I collected the table statistics, which I’ll discuss in my next post.

12c Indexing Extended Data Types Part II (15 Steps) November 14, 2013

Posted by Richard Foote in 12c, Extended Data Types, Function Based Indexes, Oracle Indexes.
2 comments

Finally, at long long last, I have a spare 30 minutes in my life to complete this blog entry !!

As discussed previously, Oracle has extended the maximum length of varchar2, nvarchar2 and raw columns to 32K, but this brings some challenges when indexing such columns due to restrictions on the maximum length of an index entry.

A function-based index on the hash of the column value, as previously demonstrated, can be used for equality-based predicates but not for range-based requirements.

If index accesses are required for range-based predicates, then a simple sub-string function-based index can be considered. Using the same set-up and demo as in Part I, let’s create a function-based index that stores the first (say) 1000 characters of an extended data type column. This should provide more than enough detail of the column contents to be sufficiently selective in most practical scenarios.

SQL> create index bowie_substr_text_i on bowie(substr(text,1,1000));

Index created.

SQL> select index_name, num_rows, leaf_blocks from dba_indexes where index_name = 'BOWIE_SUBSTR_TEXT_I';

INDEX_NAME             NUM_ROWS LEAF_BLOCKS
-------------------- ---------- -----------
BOWIE_SUBSTR_TEXT_I      100000         306
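Before settling on a prefix length, it may be worth checking how selective that prefix actually is for the data in question. A minimal sketch, assuming the same BOWIE table and the 1000-character prefix used above:

```sql
-- Compare the number of rows with the number of distinct 1000-character prefixes.
-- The closer the two figures, the more selective the substr index is likely to be.
select count(*)                            as total_rows,
       count(distinct substr(text,1,1000)) as distinct_prefixes
from   bowie;
```

If many rows share the same prefix, a longer prefix (within the maximum key length) might be considered instead.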

Such a substr function-based index is viable not only with equality based predicates:

SQL> select * from bowie where text = '42BOWIE';

Execution Plan
----------------------------------------------------------

Plan hash value: 4016785672
-----------------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name                | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |                     |     1 |    16 |    80   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS BY INDEX ROWID BATCHED| BOWIE               |     1 |    16 |    80   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN                  | BOWIE_SUBSTR_TEXT_I |   400 |       |     1   (0)| 00:00:01 |
-----------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter(INTERNAL_FUNCTION("TEXT") AND "TEXT"='42BOWIE')
2 - access(SUBSTR("TEXT",1,1000)='42BOWIE')

Statistics
----------------------------------------------------------

0  recursive calls
0  db block gets
4  consistent gets
0  physical reads
0  redo size
610  bytes sent via SQL*Net to client
544  bytes received via SQL*Net from client
2  SQL*Net roundtrips to/from client
0  sorts (memory)
0  sorts (disk)
1  rows processed

But unlike the hash function-based index in my previous post, it can also be considered in a range (pun fully intended) of range-based predicates as well, for example:

SQL> select * from bowie where text between '4299BOWIE' and '42BOWIE';

Execution Plan
----------------------------------------------------------

Plan hash value: 4016785672
-----------------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name                | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |                     |     2 |    32 |    92   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS BY INDEX ROWID BATCHED| BOWIE               |     2 |    32 |    92   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN                  | BOWIE_SUBSTR_TEXT_I |   450 |       |     3   (0)| 00:00:01 |
-----------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter(INTERNAL_FUNCTION("TEXT") AND "TEXT"<='42BOWIE' AND "TEXT">='4299BOWIE')
2 - access(SUBSTR("TEXT",1,1000)>='4299BOWIE' AND SUBSTR("TEXT",1,1000)<='42BOWIE')

Statistics
----------------------------------------------------------

0  recursive calls
0  db block gets
6  consistent gets
0  physical reads
0  redo size
693  bytes sent via SQL*Net to client
544  bytes received via SQL*Net from client
2  SQL*Net roundtrips to/from client
0  sorts (memory)
0  sorts (disk)
3  rows processed

SQL> select * from bowie where text > 'C';

Execution Plan
----------------------------------------------------------

Plan hash value: 4016785672
-----------------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name                | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |                     | 63040 |  2277K|   181   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS BY INDEX ROWID BATCHED| BOWIE               | 63040 |  2277K|   181   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN                  | BOWIE_SUBSTR_TEXT_I |   900 |       |     4   (0)| 00:00:01 |
-----------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter(INTERNAL_FUNCTION("TEXT") AND "TEXT">'C')
2 - access(SUBSTR("TEXT",1,1000)>='C')

Statistics
----------------------------------------------------------

0  recursive calls
0  db block gets
10  consistent gets
0  physical reads
0  redo size
12884  bytes sent via SQL*Net to client
544  bytes received via SQL*Net from client
2  SQL*Net roundtrips to/from client
0  sorts (memory)
0  sorts (disk)
2  rows processed

12c Indexing Extended Data Types Part I (A Big Hurt) September 12, 2013

Posted by Richard Foote in 12c, Extended Data Types, Function Based Indexes, Oracle Indexes, Unique Indexes.
12 comments

The maximum size for VARCHAR2, NVARCHAR2 and RAW columns has been extended to 32767 bytes with the Oracle 12c Database. However, indexing such columns with standard indexes comes with some challenges.

These extended data types are not enabled by default within the database but can easily be enabled by following these steps:

  1. Restart the database in UPGRADE mode
  2. Change the setting of MAX_STRING_SIZE to EXTENDED
  3. Run the rdbms/admin/utl32k.sql script as sysdba
  4. Restart the database
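The steps above might be sketched as the following SQL*Plus session. This assumes a single-instance, non-CDB database and a connection AS SYSDBA (container and RAC databases need additional steps). Note also that the change is one-way: once MAX_STRING_SIZE is set to EXTENDED it can't be set back to STANDARD, so it's worth trying on a test database first.

```sql
-- Run from SQL*Plus, connected AS SYSDBA.

-- 1. Restart the database in UPGRADE mode
shutdown immediate
startup upgrade

-- 2. Enable the extended data types
alter system set max_string_size = extended;

-- 3. Run the supplied script (? is SQL*Plus shorthand for ORACLE_HOME)
@?/rdbms/admin/utl32k.sql

-- 4. Restart the database normally
shutdown immediate
startup
```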

We can now create a table with a larger than 4000 byte VARCHAR2 column (note that such larger column values are actually stored out of line from the rest of the table; I might discuss this in another post):

SQL> create table bowie (id number, text varchar2(32000));

Table created.

However, if we try now to create an index on such a column:

SQL> create index bowie_text_i on bowie(text);
create index bowie_text_i on bowie(text)
 *
ERROR at line 1:
ORA-01450: maximum key length (6398) exceeded

We find Oracle complains that the possible index length is going to be too large for my (8K) block sized index. So, is it possible to index such extended columns?

Let’s populate this table with some data:

SQL> insert into bowie (id, text) values (1, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa');
1 row created.

SQL> commit;

Commit complete.

SQL> select length(text) from bowie;

LENGTH(TEXT)
------------
        1110

SQL> insert into bowie (id, text)
     select 2, text||text||text||text||text||text||text||text||text||text
     from bowie;

1 row created.

SQL> commit;

Commit complete.

SQL> select length(text) from bowie;

LENGTH(TEXT)
------------
        1110
       11100

SQL> insert into bowie (id, text)
     select rownum+2, to_char(rownum)||'BOWIE'
     from dual connect by level<=99998;

99998 rows created.

SQL> commit;

Commit complete.

SQL> exec dbms_stats.gather_table_stats(ownname=>user, tabname=>'BOWIE', method_opt=>'FOR ALL COLUMNS SIZE 1');

PL/SQL procedure successfully completed.

So yes, we definitely have at least one very large Text value (some 11100 bytes) in our table. How cool. One method of creating a valid index on this extended column is to use a function-based index based on a hash value of this column. For example:

SQL> create index bowie_hash_text_i on bowie(standard_hash(text));

Index created.

SQL> select index_name, num_rows, leaf_blocks from dba_indexes where index_name = 'BOWIE_HASH_TEXT_I';

INDEX_NAME             NUM_ROWS LEAF_BLOCKS
-------------------- ---------- -----------
BOWIE_HASH_TEXT_I        100000         447

This index can now be used effectively for subsequent equality based predicates, for example:

SQL> select * from bowie where text = '42BOWIE';

Execution Plan
----------------------------------------------------------

Plan hash value: 1900956348
---------------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name              | Rows  | Bytes| Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |                   |     1 |    16|   203   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS BY INDEX ROWID BATCHED| BOWIE             |     1 |    16|   203   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN                  | BOWIE_HASH_TEXT_I |   400 |      |     1   (0)| 00:00:01 |
---------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter(INTERNAL_FUNCTION("TEXT") AND "TEXT"='42BOWIE')
2 - access(STANDARD_HASH("TEXT")=HEXTORAW('A2C98939EDB479BC3EB0CDC560DDCD1575D47F62'))

Statistics
----------------------------------------------------------
0  recursive calls
0  db block gets
4  consistent gets
0  physical reads
0  redo size
610  bytes sent via SQL*Net to client
544  bytes received via SQL*Net from client
2  SQL*Net roundtrips to/from client
0  sorts (memory)
0  sorts (disk)
1  rows processed

So the index has been used to very efficiently retrieve data based on an equality predicate on the extended TEXT column.

However, range based predicates are problematic as Oracle has no easy way to find and retrieve all such data via the index when the data in the index is effectively randomised hashed values. For example:

SQL> select * from bowie where text like 'aaaaaaaaaaaaaaaaaaaaaa%';

Execution Plan
----------------------------------------------------------

Plan hash value: 1845943507

---------------------------------------------------------------------------
| Id  | Operation         | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |       |     1 |    16 |   208   (2)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| BOWIE |     1 |    16 |   208   (2)| 00:00:01 |
---------------------------------------------------------------------------

SQL> select * from bowie where text between '4299BOWIE' and '42BOWIE';

Execution Plan
----------------------------------------------------------

Plan hash value: 1845943507

---------------------------------------------------------------------------
| Id  | Operation         | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |       |     2 |    32 |   208   (2)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| BOWIE |     2 |    32 |   208   (2)| 00:00:01 |
---------------------------------------------------------------------------

SQL> select * from bowie where text > 'zzz';

Execution Plan
----------------------------------------------------------

Plan hash value: 1845943507

---------------------------------------------------------------------------
| Id  | Operation         | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |       |     1 |    17 |   219   (2)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| BOWIE |     1 |    17 |   219   (2)| 00:00:01 |
---------------------------------------------------------------------------

The above are all examples of predicates that can’t use our hash based function-based index, even though the CBO is estimating very few rows to be returned.
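To see why, it can help to look at the hash values of a few adjacent text values. The following is just a sketch against the same BOWIE table (HASH_VALUE is simply an alias I've picked for illustration):

```sql
-- Values that sort next to each other as text end up with unrelated hash values,
-- so a range of TEXT values does not map to a range of entries in the hash index.
select text, standard_hash(text) as hash_value
from   bowie
where  text in ('41BOWIE', '42BOWIE', '43BOWIE')
order  by text;
```

Because the index is ordered by the hash rather than by the text, there is no contiguous section of the index that corresponds to a range of TEXT values.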

If we try now to make this extended column unique via a constraint:

SQL> alter table bowie add constraint bowie_text_unq unique (text);
alter table bowie add constraint bowie_text_unq unique (text)
*
ERROR at line 1:
ORA-01450: maximum key length (6398) exceeded

We hit our problem again. Oracle tries to make a unique index on the Text column, but it can’t because the extended column definition could potentially exceed the maximum allowable key length.

We can get around this in a similar fashion, but by adding a virtual hash column to the table and basing the Unique constraint on this column instead:

SQL> drop index bowie_hash_text_i;

Index dropped.

SQL> alter table bowie add (text_hash as (standard_hash(text)));

Table altered.

SQL> alter table bowie add constraint bowie_text_unq unique (text_hash);

Table altered.
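Strictly speaking, the constraint polices the uniqueness of the hash value rather than of the text itself, so it relies on different text values not hashing to the same value. STANDARD_HASH uses SHA1 by default; if that's a concern, a stronger algorithm can be specified. A sketch only, with hypothetical column and constraint names that are not part of this demo:

```sql
-- Hypothetical alternative: base the virtual column on SHA-256
-- rather than the default SHA-1 (the resulting hash is 32 bytes, not 20).
alter table bowie add (text_hash256 as (standard_hash(text, 'SHA256')));

alter table bowie add constraint bowie_text256_unq unique (text_hash256);
```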

This can now be used to effectively protect the uniqueness of the original Text column:

SQL> insert into bowie (id, text) values (1000001, '42BOWIE');
insert into bowie (id, text) values (1000001, '42BOWIE')
*
ERROR at line 1:
ORA-00001: unique constraint (BOWIE.BOWIE_TEXT_UNQ) violated

This index can now be used in a similar manner as before for equality based predicates:

SQL> select * from bowie where text = '42BOWIE';

ID  TEXT       TEXT_HASH
--- ---------- ----------------------------------------
44     42BOWIE A2C98939EDB479BC3EB0CDC560DDCD1575D47F62

Execution Plan
----------------------------------------------------------

Plan hash value: 2691947611
----------------------------------------------------------------------------------------------
| Id  | Operation                   | Name           | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |                |     1 |    16 |     2   (0)| 00:00:01 |
|*  1 |  TABLE ACCESS BY INDEX ROWID| BOWIE          |     1 |    16 |     2   (0)| 00:00:01 |
|*  2 |   INDEX UNIQUE SCAN         | BOWIE_TEXT_UNQ |     1 |       |     1   (0)| 00:00:01 |
----------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter(INTERNAL_FUNCTION("TEXT") AND "TEXT"='42BOWIE')
2 - access("BOWIE"."TEXT_HASH"=HEXTORAW('A2C98939EDB479BC3EB0CDC560DDCD1575D47F62'))

Statistics
----------------------------------------------------------
0  recursive calls
0  db block gets
3  consistent gets
0  physical reads
0  redo size
702  bytes sent via SQL*Net to client
544  bytes received via SQL*Net from client
2  SQL*Net roundtrips to/from client
0  sorts (memory)
0  sorts (disk)
1  rows processed

But with the same restrictions with range based predicates:

SQL> select * from bowie where text between '429BOWIE' and '42BOWIE';

ID  TEXT       TEXT_HASH
--- ---------- ----------------------------------------
44     42BOWIE A2C98939EDB479BC3EB0CDC560DDCD1575D47F62
431   429BOWIE A7E2B59E1429DB4964225E7A98A19998BC3D2AFD

Execution Plan
----------------------------------------------------------

Plan hash value: 1845943507
---------------------------------------------------------------------------
| Id  | Operation         | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |       |     2 |    32 |   208   (2)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| BOWIE |     2 |    32 |   208   (2)| 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
1 - filter(INTERNAL_FUNCTION("TEXT") AND "TEXT"<='42BOWIE' AND "TEXT">='429BOWIE')

Statistics
----------------------------------------------------------
0  recursive calls
0  db block gets
758  consistent gets
0  physical reads
0  redo size
787  bytes sent via SQL*Net to client
544  bytes received via SQL*Net from client
2  SQL*Net roundtrips to/from client
0  sorts (memory)
0  sorts (disk)
2  rows processed

I’ll look at other indexing options with these new extended columns in Part II.

11g Virtual Columns and Fast Refreshable Materialized Views (What In The World) November 24, 2010

Posted by Richard Foote in 11g, 11g New features, Function Based Indexes, Oracle Bugs, Virtual Columns.
15 comments

Prior to Oracle 11g Rel 2, two very common and useful features worked well together: fast refreshable materialized views and the virtual columns introduced by the creation of function-based indexes.
 
To illustrate, we create and populate a little demo table: 

 
SQL> create table bowie (a number, b number, c number);
 
Table created.
 
SQL> insert into bowie select rownum, mod(rownum,10), mod(rownum,100) from dual connect by level <= 100000;
 
100000 rows created.
 
SQL> commit;
 
Commit complete.
 
SQL> alter table bowie add primary key (a);
 
Table altered.

 
We now create a simple little function-based index:
 

 
SQL> create index bowie_func_i on bowie(b+c);
 
Index created.

 
 
If we look at the columns in the table via DBA_TAB_COLS:
 

 
SQL> select column_name, data_default, virtual_column, hidden_column from dba_tab_cols where table_name = 'BOWIE';
 
COLUMN_NAME  DATA_DEFAULT VIR HID
------------ ------------ --- ---
SYS_NC00004$ "B"+"C"      YES YES
C                         NO  NO
B                         NO  NO
A                         NO  NO

 

We notice Oracle has introduced a new, hidden virtual column (SYS_NC00004$), required to store statistics for use by the Cost Based Optimizer.
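If it's not obvious which index a hidden column belongs to, the index expressions can be listed separately. A minimal sketch, assuming we're connected as the owner of the table (otherwise DBA_IND_EXPRESSIONS with an owner filter would be needed):

```sql
-- List the expression behind each function-based index on the table.
select index_name, column_position, column_expression
from   user_ind_expressions
where  table_name = 'BOWIE';
```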
 
Next we create a materialized view log on this table and a fast refreshable materialized view:
 

 
SQL> create materialized view log on bowie WITH PRIMARY KEY,SEQUENCE, ROWID (b,c) INCLUDING NEW VALUES;
 
Materialized view log created.
 
SQL> create materialized view bowie_mv
  2  build immediate
  3  refresh fast
  4  with primary key
  5  enable query rewrite
  6  as
  7  select b, count(*) from bowie group by b;
 
Materialized view created.

 
 
Collect a few statistics and we note the Materialized View does indeed get used during a query rewrite scenario:
 

 
SQL> exec dbms_stats.gather_table_stats(ownname=>null, tabname=>'BOWIE', estimate_percent=>null, cascade=> true, method_opt=> 'FOR ALL COLUMNS SIZE 1')
 
PL/SQL procedure successfully completed.
 
SQL> exec dbms_stats.gather_table_stats(ownname=>null, tabname=>'BOWIE_MV', estimate_percent=>null, cascade=> true, method_opt=> 'FOR ALL COLUMNS SIZE 1')
 
PL/SQL procedure successfully completed. 

SQL> select b, count(*) from bowie having b > 3 group by b;
 
         B   COUNT(*)
---------- ----------
         6      10000
         4      10000
         5      10000
         8      10000
         7      10000
         9      10000
 
6 rows selected.
 

Execution Plan
----------------------------------------------------------
Plan hash value: 593592962
 
-----------------------------------------------------------------------------------------
| Id  | Operation                    | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |          |     7 |    42 |     2   (0)| 00:00:01 |
|*  1 |  MAT_VIEW REWRITE ACCESS FULL| BOWIE_MV |     7 |    42 |     2   (0)| 00:00:01 |
-----------------------------------------------------------------------------------------
 

Predicate Information (identified by operation id):
---------------------------------------------------
 
   1 - filter("BOWIE_MV"."B">3)
 

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
          4  consistent gets
          0  physical reads
          0  redo size
        538  bytes sent via SQL*Net to client
        395  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          6  rows processed

 
 

And indeed the materialized view is fast refreshable:
 

 
SQL> insert into bowie values (100001, 5, 42);
 
1 row created.
 
SQL> commit;
 
Commit complete.
 
SQL> exec dbms_mview.refresh('BOWIE_MV', 'F');
 
PL/SQL procedure successfully completed.
 
 
SQL> select b, count(*) from bowie having b > 3 group by b;
 
         B   COUNT(*)
---------- ----------
         6      10000
         4      10000
         5      10001
         8      10000
         7      10000
         9      10000
 
6 rows selected.
 

Execution Plan
----------------------------------------------------------
Plan hash value: 593592962
 
-----------------------------------------------------------------------------------------
| Id  | Operation                    | Name     | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT             |          |     7 |    42 |     2   (0)| 00:00:01 |
|*  1 |  MAT_VIEW REWRITE ACCESS FULL| BOWIE_MV |     7 |    42 |     2   (0)| 00:00:01 |
-----------------------------------------------------------------------------------------
 

Predicate Information (identified by operation id):
---------------------------------------------------
 
   1 - filter("BOWIE_MV"."B">3)
 

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
          4  consistent gets
          0  physical reads
          0  redo size
        546  bytes sent via SQL*Net to client
        395  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
          6  rows processed

 
Notice how the materialized view does indeed display the correct updated information via the query rewrite operation. So the materialized view behaved and worked as expected even though the underlying master table has a virtual column due to the creation of the function-based index (note that QUERY_REWRITE_INTEGRITY is set to STALE_TOLERATED).
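For anyone following along, the rewrite settings can be checked and changed at the session level along these lines. This is a sketch of one way to do it, not necessarily how the demo session was configured:

```sql
-- SQL*Plus command: display the current query rewrite related parameters
show parameter query_rewrite

-- Set the integrity level for the current session only
alter session set query_rewrite_integrity = stale_tolerated;
```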
 
Unfortunately, things go off the rails somewhat as of Oracle 11g Rel 2 when a virtual column is introduced due to one of the 11g new features. For example, I now collect some Extended Statistics on this table:
 

 
SQL> exec dbms_stats.gather_table_stats(ownname=>null, tabname=>'BOWIE', method_opt=> 'FOR COLUMNS (A,B,C) SIZE 254');
 
PL/SQL procedure successfully completed.
 
SQL> select column_name, data_default, virtual_column, hidden_column from dba_tab_cols where table_name = 'BOWIE';
 
COLUMN_NAME                    DATA_DEFAULT                      VIR HID
------------------------------ --------------------------------- --- ---
SYS_STUM4KJU$CCICS9C1UJ6UWC4YP SYS_OP_COMBINED_HASH("A","B","C") YES YES
SYS_NC00004$                   "B"+"C"                           YES YES
C                                                                NO  NO
B                                                                NO  NO
A                                                                NO  NO

 
 
Notice how extended statistics has resulted in another hidden virtual column (SYS_STUM4KJU$CCICS9C1UJ6UWC4YP) being created to store the resultant statistics.
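The extension behind such a system-generated column name can also be listed directly. A minimal sketch, again assuming we're connected as the table owner:

```sql
-- List the statistics extensions defined on the table.
select extension_name, extension, creator, droppable
from   user_stat_extensions
where  table_name = 'BOWIE';

-- To remove the column group at some later point (not part of this demo):
-- exec dbms_stats.drop_extended_stats(ownname=>user, tabname=>'BOWIE', extension=>'(A,B,C)')
```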
 
However, if we now attempt to perform a fast refresh on the Materialized View:

 
SQL> insert into bowie values (100002, 5, 42);
 
1 row created.
 
SQL> commit;
 
Commit complete.
 
SQL> exec dbms_mview.refresh('BOWIE_MV', 'F');
BEGIN dbms_mview.refresh('BOWIE_MV', 'F'); END;
 
*
ERROR at line 1:
ORA-12008: error in materialized view refresh path
ORA-00904: "MAS$"."SYS_STUM4KJU$CCICS9C1UJ6UWC4YP": invalid identifier
ORA-06512: at "SYS.DBMS_SNAPSHOT", line 2558
ORA-06512: at "SYS.DBMS_SNAPSHOT", line 2771
ORA-06512: at "SYS.DBMS_SNAPSHOT", line 2740
ORA-06512: at line 1

 

We get an error, complaining about the existence of this new virtual column.
 
If we attempt to drop and re-create the materialized view:

 
SQL> drop materialized view bowie_mv;
 
Materialized view dropped.
 
SQL> create materialized view bowie_mv
  2  build immediate
  3  refresh fast
  4  with primary key
  5  enable query rewrite
  6  as
  7  select b, count(*) from bowie group by b;
select b, count(*) from bowie group by b
                        *
ERROR at line 7:
ORA-12033: cannot use filter columns from materialized view log on "BOWIE"."BOWIE"

 

It fails, complaining that the materialized view log is somehow missing a filter column (which it isn’t).  We get exactly the same set of issues if we add a visible virtual column via this new 11g capability: 
 
 

 
SQL> create table bowie2 (a number, b number, c number, d as (a+b+c));
 
Table created.
 
SQL> select column_name, data_default, virtual_column, hidden_column from dba_tab_cols where table_name = 'BOWIE2';
 
COLUMN_NAME  DATA_DEFAULT VIR HID
------------ ------------ --- ---
D            "A"+"B"+"C"  YES NO
C                         NO  NO
B                         NO  NO
A                         NO  NO
 
SQL> insert into bowie2 (a,b,c) select rownum, mod(rownum,10), mod(rownum,100) from dual connect by level <= 100000;
 
100000 rows created.
 
SQL> commit;
 
Commit complete.
 
SQL> alter table bowie2 add primary key (a);
 
Table altered.
 
SQL> create materialized view log on bowie2 WITH PRIMARY KEY,SEQUENCE, ROWID (b,c) INCLUDING NEW VALUES;
 
Materialized view log created.
 
SQL> create materialized view bowie2_mv
  2  build immediate
  3  refresh fast
  4  with primary key
  5  enable query rewrite
  6  as
  7  select b, count(*) from bowie2 group by b;
select b, count(*) from bowie2 group by b
                        *
ERROR at line 7:
ORA-12033: cannot use filter columns from materialized view log on "BOWIE"."BOWIE2"

 
Extended statistics and visible virtual columns are both potentially extremely useful new features introduced in 11g but unfortunately neither can be implemented on any table that needs to be fast refreshable within a complex materialized view.
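When a materialized view refuses to be created as fast refreshable, DBMS_MVIEW.EXPLAIN_MVIEW is one way to ask Oracle what it thinks is possible and why. The following is a sketch only; it assumes MV_CAPABILITIES_TABLE has been created in the current schema via the supplied utlxmv.sql script, and I make no claim about what it reports for this particular bug:

```sql
-- One-off setup: create MV_CAPABILITIES_TABLE in the current schema
@?/rdbms/admin/utlxmv.sql

-- Analyse the defining query of the materialized view that failed to create
exec dbms_mview.explain_mview('select b, count(*) from bowie2 group by b')

-- Review the fast refresh related capabilities and any associated messages
select capability_name, possible, msgtxt
from   mv_capabilities_table
where  capability_name like 'REFRESH_FAST%'
order  by seq;
```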

I raised this issue with Oracle Support, who have raised bug 10281402 as a result, as it occurs in both 11.2.0.1 and 11.2.0.2 on the various platforms I’ve tested.

How To Really Create A Function-Based Index (A Day In The Life) July 12, 2009

Posted by Richard Foote in Function Based Indexes, Humour, Oracle Indexes.
21 comments

“I heard the news today, oh boy” (John Lennon 1967).

It was with some amusement that someone pointed out Don Burleson’s latest 11 July 2009 “Oracle News” piece: “How to index on a SQL CASE Statement” (the article it links to is dated 29 December 2008).

It must obviously be a very very slow news day as the ability to create a function-based index has been around for a long time, the SQL CASE “statement” (expression actually) being around since at least Oracle8i days. So it’s not exactly “new” news.

However, what’s particularly amusing in this little “news” piece, is the SQL statement used to demonstrate how to create an index on a SQL CASE statement:

create index
case_index as
(case SOURCE_TRAN
when ‘PO’ then PO_ID
when ‘VOUCHER’ then voucher_id
ELSE journal_id
end = ‘0000000001’
END);

No, this is not some new weird piece of SQL. The problem of course is that this most definitely is NOT how to create a function-based index with a CASE statement. It’s fundamentally syntactically totally wrong, not just in one place, but in several places.

If you were to just TEST this SQL, you would get the following error:

SQL> create table case_tab (source_tran varchar2(5), po_id varchar2(5), voucher_id varchar2(5), journal_id varchar2(5));

Table created.

SQL> create index case_index as (case source_tran when 'PO' then po_id when 'VOUCHER' then voucher_id else journal_id end = '00001' end);
create index case_index as (case source_tran when 'PO' then po_id when 'VOUCHER' then voucher_id else journal_id end = '00001' end)
                        *
ERROR at line 1:
ORA-00969: missing ON keyword

 

You see, to create an index, you must at the very least specify the table that is being indexed …

Additionally, the AS keyword is incorrect and the CASE expression itself is wrong, as it incorrectly specifies an additional END. There’s a lot wrong here.

If you really want to create such an index, then you need the SQL syntax to at least look something like this:

SQL> create index case_tab_idx on case_tab (case source_tran when 'PO' then po_id when 'VOUCHER' then voucher_id else journal_id end);

Index created.
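For the CBO to consider such an index, a query generally needs to use the same expression in its predicate. A minimal sketch against the CASE_TAB table above (the '00001' value is just the one used in the earlier test):

```sql
-- The CASE expression in the predicate matches the indexed expression,
-- so the function-based index is a candidate access path.
select *
from   case_tab
where  case source_tran
         when 'PO'      then po_id
         when 'VOUCHER' then voucher_id
         else journal_id
       end = '00001';
```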

 

The “News” article then suggests you need to collect statistics on the index:

EXEC DBMS_STATS.gather_index_stats(‘OWNER’, ‘CASE_INDEX’);

However, this advice is again not quite correct. Remember, this “News” piece is only a day or two old and Oracle has been automatically collecting statistics on indexes as they’re created (or rebuilt) since Oracle 10g. Providing the table already has statistics, Oracle will automatically collect statistics on the index as it’s being created. There is therefore no need to collect index statistics again after the index has been created.

But but but, what does not have statistics and what really should be collected after you create any new function-based index are the statistics on the virtual column that is created on the parent table as part of creating any function-based index. This is vitally important, else the function-based index may not be used by the CBO as expected. For example:

SQL> exec dbms_stats.gather_table_stats(ownname=>null, tabname=> ‘CASE_TAB’, estimate_percent=>null, cascade=>true, method_opt=> ‘FOR ALL HIDDEN COLUMNS SIZE 1’);

I discuss this issue in some detail the post “Function-Based Indexes And Missing Statistics“.

So some advice in summary:

1) If you’re going to create a function-based index, get the syntax correct. It helps, it really does.

2) Ensure you collect the associated statistics on the table virtual column that is created automatically behind the scenes.

3) Don’t believe everything you read or hear, even if it’s on the “news” …  🙂

 

UPDATE: 13 July 2009: The “News” article has now been amended to expand the discussion on statistics and reference the dbms_stats.gather_table_stats procedure for collecting hidden column statistics on the function-based index virtual columns as I discussed. The article is still dated 29 December 2008.

I also notice that the newly added example of gathering hidden column statistics, used in both the news article and in the referenced “important notes”, is simply a cut ‘n’ paste from this very blog post. I can easily tell because:

  1. It uses the same table_name as I made up in my example (CASE_TAB)
  2. It has the same error as in my original version with the missing quote when defining the table_name 😉

Function-Based Indexes and Missing Statistics (No Surprises) December 4, 2008

Posted by Richard Foote in Function Based Indexes, Oracle Cost Based Optimizer, Oracle Indexes.
9 comments

Thought I might mention a common trap associated with Function-Based Indexes following a recent question on the OTN Database Forum.

First of all, consider this “simple” little scenario. You have a column in a table that’s a standard VARCHAR2 field and it has 100 distinct values. Therefore, if you select just one value, assuming even distribution of data, you’ll select approximately 1% of the data.

However, this table can potentially allow mixed case fields, with values such as ‘Bowie’, ‘bowie’, ‘BowiE’, etc.

Therefore, you decide to write a query that first converts all the fields to UPPER case such that you can now compare all possible values of a specific name to say ‘BOWIE’.

Remembering you previously had a selectivity of 1%, what’s the selectivity of the column after you’ve converted everything to upper case ?

Ummmm, actually, there’s no easy answer to that. It all depends on how many actual different names you have in your data. In theory, you may now have just 1 distinct value or you may have 100 distinct values, you can’t really tell.

However, what if the function did something more complex and converted the name to some numerical value based on the sum of all the characters via some weird formula ? In theory, you could have no distinct values in that everything could get converted to a NULL or you could possibly have as many distinct values as there are rows in the table.

Again, Oracle can’t really tell how many rows may get selected based on the determination of such a function.

When a predicate is used with a function call, Oracle has great difficulty in determining the correct selectivity. Therefore when you create a Function-Based Index, which Oracle assumes you may wish to access, Oracle behind the scenes creates a hidden virtual column on the parent table in order to capture the data characteristics of the function so that the CBO can make an accurate determination of the selectivity associated with using the function.

If Oracle knows the low value, the high value and the number of distinct values of the data associated with the function on a column, it can then accurately determine the associated selectivity and cardinality when the function is used on the column (assuming even distribution of data) and hence calculate an accurate cost and determine whether the use of the function-based index is appropriate.
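As a rough illustration of that arithmetic: with even distribution and no NULLs, the estimated cardinality of an equality predicate is simply the number of rows divided by the number of distinct values. The following is only a sketch, assuming you are connected as the owner of the PINK_FLOYD demo table used below and that statistics already exist on both the table and its hidden column. The EST_ROWS alias is mine, and the real CBO calculation also accounts for NULLs and histograms.

```sql
-- Simplified uniform-distribution estimate: rows expected per distinct value.
-- Ignores NULLs and histograms, so treat it as an approximation only.
select c.column_name,
       t.num_rows,
       c.num_distinct,
       round(t.num_rows / c.num_distinct) as est_rows
from   user_tables   t
join   user_tab_cols c on c.table_name = t.table_name
where  t.table_name    = 'PINK_FLOYD'
and    c.hidden_column = 'YES'
and    c.num_distinct  > 0;   -- skips hidden columns that have no statistics yet
```

If the hidden column has no statistics, the query returns no rows for it, which is exactly the situation in which the CBO has to fall back on a guess.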

However, and here comes the trap, when a function-based index is created, Oracle will now (since 10g) automatically calculate the statistics associated with the index (such as the blevel, number of leaf blocks, clustering factor, etc.) but it will NOT calculate the statistics associated with the hidden virtual column as these statistics are associated with the parent table, not directly with the index itself.

Here I create a new function-based index on a table:

SQL> create index pink_floyd_upp_tab_name_i on pink_floyd(upper(table_name));

Index created.

But there are still no statistics on the hidden virtual column created by Oracle behind the scenes.

SQL> select column_name, num_distinct, hidden_column, virtual_column from dba_tab_cols where table_name='PINK_FLOYD';

COLUMN_NAME     NUM_DISTINCT HID VIR
--------------- ------------ --- ---
OWNER                     74 NO  NO
TABLE_NAME              5739 NO  NO
NUM_ROWS                 886 NO  NO
BLOCKS                   152 NO  NO
SYS_NC00005$                 YES YES

Therefore, even after you’ve created the function-based index, Oracle still has no idea of the selectivity associated with the function because the necessary virtual column statistics are still missing. Oracle simply takes a “guess” and as with most guesses, it’s quite likely to be wrong, which means the selectivity is likely to be wrong, which means the costings are likely to be wrong, which means the execution plan could very well be wrong as well.

What should you do after you create a function-based index ? You should collect the statistics on the hidden virtual column so that the CBO can accurately determine the cardinality associated with using the function and so make correct decisions regarding the most appropriate execution plan.

The ‘FOR ALL HIDDEN COLUMNS’ method_opt option with DBMS_STATS is one way of collecting just the necessary table statistics on these hidden virtual columns:

SQL> exec dbms_stats.gather_table_stats(ownname=>null, tabname=> 'PINK_FLOYD', estimate_percent=>null, cascade=>true, method_opt=> 'FOR ALL HIDDEN COLUMNS SIZE 1');

PL/SQL procedure successfully completed.

SQL> select column_name, num_distinct, hidden_column, virtual_column from dba_tab_cols where table_name='PINK_FLOYD';

COLUMN_NAME     NUM_DISTINCT HID VIR
--------------- ------------ --- ---
OWNER                     74 NO  NO
TABLE_NAME              5739 NO  NO
NUM_ROWS                 886 NO  NO
BLOCKS                   152 NO  NO
SYS_NC00005$            5739 YES YES
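A system-generated name such as SYS_NC00005$ says nothing about which function it represents. One way to see the expression behind each function-based index on the table is the index expressions dictionary view. This is a small sketch, assuming you are connected as the table owner (otherwise the DBA_ or ALL_ equivalent with an owner filter would be needed):

```sql
-- List the expression stored for each function-based index on the table.
select index_name,
       column_position,
       column_expression
from   user_ind_expressions
where  table_name = 'PINK_FLOYD';
```

For the index created above, the COLUMN_EXPRESSION should show the UPPER("TABLE_NAME") expression.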

This demo on missing virtual column statistics associated with function-based indexes goes through this whole issue.

My early Christmas wish to Oracle would be for the CBO to at least consider the DISTINCT_KEYS index statistic when determining the correct selectivity when hidden virtual column statistics are missing on function-based indexes, rather than take its 0.4% / 1% guess …

More on virtual columns, hidden and otherwise, in the future …

Index Only Values Of Interest: (Little Wonder) January 28, 2008

Posted by Richard Foote in Function Based Indexes, Indexing Tricks, Oracle General, Oracle Indexes.
19 comments

Thought I might expand a little on the discussion and comments on how NULLs can be indexed and address point #6 on my list of those things you may not have known about indexes:

“It’s possible and potentially very useful to just index some column values and not all column values within a table”.

as well as touching on point #4 that “B-Tree Indexes can be extremely useful and beneficial even if the column contains very few distinct values (as low as 1)”.

As previously discussed, index entries which are fully NULL are not indexed by Oracle. We can however use this fact to our advantage.

There are many scenarios whereby we may only search for a rowset based on a subset of the possible values in a column or group of columns. The classic scenario is where we may have a flag or status field denoting “current”, “live”, “not yet processed”, etc. rows and our main transactional queries are only interested in these relatively few rows.

Most rows are “historical”, “processed”, etc. rows and are not generally of interest, and when they are of interest they represent such a large proportion of the overall table that an index would be inappropriate for the batch jobs or long running reports that access them anyway. Often (but not always) we might need a histogram to let the CBO know that those column values of interest actually represent a small, non-uniform proportion of the overall rowset.

Because we need to efficiently access those few rows of interest, we generally index the column but in the process also index all the other column values that aren’t of interest as well. It’s all or nothing, right ?

Not necessarily. A possible solution is to use an appropriate function-based index in combination with our understanding that fully null index entries are not actually indexed. For example, let’s assume we have a very large table that has a STATUS code column. The only column values of interest are those with a status value of ‘BOWIE’; all other values are simply not of direct interest to our OLTP queries. By creating an index such as:

CREATE INDEX index_some_stuff_i ON
index_some_stuff(DECODE(status, 'BOWIE', 'BOWIE', NULL)) COMPUTE STATISTICS;

the decode function only returns a non-null value for the specific status of “BOWIE”. All other values are converted to nulls and so are not indexed.

We now have an index that consists of nothing but “BOWIE” values. As a result, the index is tiny because the vast majority of column values are simply not indexed. But because the percentage of rows that actually have a status of “BOWIE” is very small, the CBO looks at this index very favourably. By now writing our queries in a manner such as this:

SELECT * FROM index_some_stuff
WHERE DECODE(status, 'BOWIE', 'BOWIE', NULL) = 'BOWIE';

the CBO will hopefully use our nice, small, efficient function-based index.
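One way to check whether the index is actually chosen is to look at the execution plan. The following is only a sketch using EXPLAIN PLAN and DBMS_XPLAN against the example table above. Note the predicate has to match the indexed expression for the function-based index to be considered.

```sql
-- Generate the plan for the query without executing it.
explain plan for
select *
from   index_some_stuff
where  decode(status, 'BOWIE', 'BOWIE', null) = 'BOWIE';

-- Display the plan just generated; look for an INDEX RANGE SCAN
-- on INDEX_SOME_STUFF_I rather than a full table scan.
select * from table(dbms_xplan.display);
```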

Not only will this index save us potentially large storage overheads, but it may also be small enough to reduce the height of the index on a permanent basis, thus making the index access more efficient.
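To see just how small such an index is, the index statistics can be compared with the table statistics. This is a minimal sketch, assuming you are connected as the owner and that statistics are current (BLEVEL is the index height minus 1):

```sql
-- Total number of rows in the table.
select table_name, num_rows
from   user_tables
where  table_name = 'INDEX_SOME_STUFF';

-- Number of entries actually stored in the index, plus its size and depth.
-- NUM_ROWS here counts only the non-null (i.e. 'BOWIE') entries.
select index_name, num_rows, leaf_blocks, blevel
from   user_indexes
where  index_name = 'INDEX_SOME_STUFF_I';
```

The difference between the two NUM_ROWS values is the number of rows that never made it into the index.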

See this demo for an example of how we reduced an index with 2924 leaf blocks and a height of 3 down to a height of 1 and just the 1 leaf block.