
Answer: Anything Wrong With Query Performance? (Red Right Hand) April 11, 2018

Posted by Richard Foote in 12c, Attribute Clustering, Clustering Factor, Oracle Indexes.


I of course attract a highly clever readership :). As some have commented, for a single table to require 1000+ consistent gets to retrieve 1000 rows implies that each row needs to be accessed from a different block. This in turn implies the Clustering Factor for this index to be relatively bad and the associated index relatively inefficient.
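To make the link between consistent gets and the Clustering Factor concrete, here is a minimal Python sketch of how a Clustering Factor is commonly described as being calculated: walk the index entries in key order and count how often the table block changes. The function name, the tiny data set and the rows-per-block figure are all made up for illustration; this is a simplified model, not Oracle's actual code.

```python
def clustering_factor(values, rows_per_block):
    """Simplified Clustering Factor for an index on one column.

    values[i] is the column value of the row stored at physical position i;
    position i is assumed to live in table block i // rows_per_block.
    """
    # Index entries are ordered by key value, then by physical position (rowid).
    index_order = sorted(range(len(values)), key=lambda pos: (values[pos], pos))

    factor = 0
    previous_block = None
    for pos in index_order:
        block = pos // rows_per_block
        if block != previous_block:  # a different table block must be visited
            factor += 1
        previous_block = block
    return factor


# Eight rows, two rows per block, so four table blocks in total.
scattered = [1, 2, 1, 2, 1, 2, 1, 2]   # each value spread over every block
clustered = [1, 1, 1, 1, 2, 2, 2, 2]   # rows with the same value stored together

print(clustering_factor(scattered, rows_per_block=2))
print(clustering_factor(clustered, rows_per_block=2))
```

In the scattered case the result equals the number of rows (every row needs a visit to a different block than the previous one), which is the situation behind 1000+ consistent gets for 1000 rows. In the clustered case it equals the number of table blocks.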

If this query is very infrequently executed, then no real damage done and the index is likely a better alternative than a Full Table Scan.

However, if this query was executed very frequently (maybe hundreds of times per second), or if it featured as one of the top CPU-consuming queries in an AWR report, then you could be burning more CPU than necessary. Maybe a lot, lot more CPU…

Improving database performance is of course desirable, but significantly reducing CPU usage is always a good thing in its own right. For a start, you usually pay for database licenses and cloud subscriptions based on CPU consumption. The less CPU your systems use, the more headroom you have in case anything goes wrong, as running out of CPU usually means performance hell for your database systems. Less CPU means more time until you need to upgrade your infrastructure, more database systems you can run in your current environment, more time until you need to pay for more database licenses, more time until you have to increase your cloud subscriptions, etc.

I have assisted many customers in significantly improving performance and in delaying IT investment costs by significantly reducing CPU wastage. Often this comes from improving queries that individually perform adequately, and often where the ratio of rows returned to consistent gets/logical reads appears OK.

So in this particular example, although things are currently deemed hunky dory, this query can potentially be significantly improved. The root issue here is an index that has a terrible Clustering Factor being used to retrieve a significant number of rows, while being executed a significant number of times.

If we look at the current Clustering Factor:

SQL> select index_name, clustering_factor from user_indexes
where table_name='MAJOR_TOM';

INDEX_NAME           CLUSTERING_FACTOR
-------------------- -----------------
MAJOR_TOM_CODE_I               2000000

At 2000000, it’s about as bad as it can get.

As I’ve discussed previously, Oracle now has a nice way of changing the clustering of a table: add a Clustering Attribute to the table (available since 12.1) and then reorganise the table online (available since 12.2):

SQL> alter table major_tom add clustering by linear order(code);

Table altered.

SQL> alter table major_tom move online;

Table altered.

If we look at the Clustering Factor of the index now:

SQL> select index_name, clustering_factor from user_indexes where table_name='MAJOR_TOM';

INDEX_NAME           CLUSTERING_FACTOR
-------------------- -----------------
MAJOR_TOM_CODE_I                  7322

It’s now about as good as it can get at just 7322.

If we now re-run the “problematic” query:

SQL> select * from major_tom where code=42;

1000 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 4132562429

------------------------------------------------------------------------------------------------
| Id | Operation                   | Name             | Rows | Bytes | Cost (%CPU) | Time     |
------------------------------------------------------------------------------------------------
|  0 | SELECT STATEMENT            |                  | 1000 | 21000 |       9 (0) | 00:00:01 |
|  1 | TABLE ACCESS BY INDEX ROWID | MAJOR_TOM        | 1000 | 21000 |       9 (0) | 00:00:01 |
|* 2 | INDEX RANGE SCAN            | MAJOR_TOM_CODE_I | 1000 |       |       5 (0) | 00:00:01 |
------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - access("CODE"=42)

Statistics
----------------------------------------------------------
    0 recursive calls
    0 db block gets
   12 consistent gets
    0 physical reads
    0 redo size
26208 bytes sent via SQL*Net to client
  608 bytes received via SQL*Net from client
    2 SQL*Net roundtrips to/from client
    0 sorts (memory)
    0 sorts (disk)
 1000 rows processed

The number of consistent gets has plummeted from 1006 to just 12, which is about as good as it gets when retrieving 1000 rows.

Of course the impact this change has on other queries on the table based on other columns needs to be carefully considered. But we have now potentially significantly reduced the overall CPU consumption of our database (especially if we tackle other problem queries in a similar manner).

If you have attended my “Oracle Indexing Internals and Best Practices” seminar, you already know all this as this is one of many key messages from the seminar 🙂

Improve Data Clustering on Multiple Columns Concurrently (Two Suns in the Sunset) March 12, 2018

Posted by Richard Foote in 12c, Attribute Clustering, Clustering Factor, Online DDL, Oracle Indexes.

I’ve had a couple of recent discussions around clustering and how if you attempt to improve the clustering of a table based on a column, you thereby ruin the current clustering that might exist for a different column. The common wisdom being you can only order the data one way and if you change the order, you might improve things for one column but totally stuff things up for another.

However, that’s not strictly correct. Depending on the characteristics of your data, you can potentially order (or interleave) data based on multiple columns concurrently. It’s quite possible to have good or good enough clustering on multiple columns and this is extremely important for indexes, as the efficiency of an index can be directly impacted by the clustering of data in the underlying table.

So to illustrate, I’m going to create a table that initially has terrible clustering on two unrelated columns (code and grade) :

SQL> create table ziggy (id number, code number, grade number, name varchar2(42));

Table created.

SQL> insert into ziggy select rownum, mod(rownum, 100)+1, ceil(dbms_random.value(0,100)), 'ZIGGY STARDUST'
from dual connect by level<=4000000;

4000000 rows created.

SQL> commit;

Commit complete.

SQL> exec dbms_stats.gather_table_stats(ownname=>null, tabname=> 'ZIGGY', method_opt=>'FOR ALL COLUMNS SIZE 1');

PL/SQL procedure successfully completed.

SQL> create index ziggy_code_i on ziggy(code);

Index created.

SQL> create index ziggy_grade_i on ziggy(grade);

Index created.

SQL> select index_name, clustering_factor, num_rows from user_indexes
where table_name='ZIGGY';

INDEX_NAME           CLUSTERING_FACTOR   NUM_ROWS
-------------------- ----------------- ----------
ZIGGY_CODE_I                   1748800    4000000
ZIGGY_GRADE_I                  1572829    4000000

So with values for both columns distributed throughout the table, the Clustering Factors of both the CODE and GRADE indexes are quite poor (1748800 and 1572829 respectively). Even though both columns have 100 distinct values (and so a selectivity of 1%), the CBO will likely consider the indexes too inefficient to use:

SQL> select * from ziggy where code=42;

40000 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 2421001569

----------------------------------------------------------------------------
| Id  | Operation         | Name  | Rows  | Bytes | Cost (%CPU) | Time     |
----------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |       | 40000 | 1054K |   4985 (10) | 00:00:01 |
| * 1 | TABLE ACCESS FULL | ZIGGY | 40000 | 1054K |   4985 (10) | 00:00:01 |
----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter("CODE"=42)

Statistics
----------------------------------------------------------
       0 recursive calls
       0 db block gets
   20292 consistent gets
       0 physical reads
       0 redo size
 1058750 bytes sent via SQL*Net to client
   29934 bytes received via SQL*Net from client
    2668 SQL*Net roundtrips to/from client
       0 sorts (memory)
       0 sorts (disk)
  40000 rows processed

SQL> select * from ziggy where grade=42;

40257 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 2421001569

---------------------------------------------------------------------------
| Id  | Operation         | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |       | 40000 | 1054K |  5021 (10) | 00:00:01 |
| * 1 | TABLE ACCESS FULL | ZIGGY | 40000 | 1054K |  5021 (10) | 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter("GRADE"=42)

Statistics
----------------------------------------------------------
       0 recursive calls
       0 db block gets
   20307 consistent gets
       0 physical reads
       0 redo size
 1065641 bytes sent via SQL*Net to client
   30121 bytes received via SQL*Net from client
    2685 SQL*Net roundtrips to/from client
       0 sorts (memory)
       0 sorts (disk)
   40257 rows processed

So even though the CBO has got the row estimates just about spot on, in both cases a Full Table Scan was chosen.
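A rough way to see why the CBO rejects both indexes is the widely quoted approximation that the table-access part of an index range scan costs about Clustering Factor × selectivity block visits. The sketch below applies that approximation to the figures above. It ignores the index blocks themselves and CPU costing, so treat it as a back-of-the-envelope model rather than the CBO's real arithmetic; the function name is hypothetical.

```python
def estimated_table_block_visits(clustering_factor, num_distinct):
    """Approximate table blocks visited for an equality predicate.

    Assumes uniformly distributed data, so selectivity = 1 / num_distinct,
    and uses integer ceiling division to avoid floating-point rounding.
    """
    return -(-clustering_factor // num_distinct)


full_scan_cost = 4985  # cost of the Full Table Scan reported in the CODE plan above

code_visits = estimated_table_block_visits(1748800, 100)
grade_visits = estimated_table_block_visits(1572829, 100)

print(code_visits, grade_visits)
print(code_visits > full_scan_cost and grade_visits > full_scan_cost)
```

Under these assumptions both estimates come out several times larger than the cost the CBO assigned to the Full Table Scan, which is consistent with the indexes being ignored.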

Let’s create another table based on the table above but this time order the data in CODE column order:

SQL> create table ziggy2 as select * from ziggy order by code;

Table created.

SQL> exec dbms_stats.gather_table_stats(ownname=>null, tabname=> 'ZIGGY2', method_opt=> 'FOR ALL COLUMNS SIZE 1');

PL/SQL procedure successfully completed.

SQL> create index ziggy2_code_i on ziggy2(code);

Index created.

SQL> create index ziggy2_grade_i on ziggy2(grade);

Index created.

SQL> select index_name, clustering_factor, num_rows from user_indexes
where table_name='ZIGGY2';

INDEX_NAME           CLUSTERING_FACTOR   NUM_ROWS
-------------------- ----------------- ----------
ZIGGY2_CODE_I                    17561    4000000
ZIGGY2_GRADE_I                 1577809    4000000

We can see that by doing so, we have significantly reduced the Clustering Factor of the CODE index (down from 1748800 to just 17561). The Clustering Factor of the GRADE index, though, has changed little as there’s little correlation between the CODE and GRADE columns.

If we now run the same query with the CODE based predicate:

SQL> select * from ziggy2 where code=42;

40000 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 16801974

-----------------------------------------------------------------------------------------------------
| Id | Operation                           | Name          | Rows  | Bytes | Cost (%CPU) | Time     |
-----------------------------------------------------------------------------------------------------
|  0 | SELECT STATEMENT                    |               | 40000 | 1054K |     264 (4) | 00:00:01 |
|  1 | TABLE ACCESS BY INDEX ROWID BATCHED | ZIGGY2        | 40000 | 1054K |     264 (4) | 00:00:01 |
|* 2 | INDEX RANGE SCAN                    | ZIGGY2_CODE_I | 40000 |       |      84 (5) | 00:00:01 |
-----------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - access("CODE"=42)

Statistics
----------------------------------------------------------
       0 recursive calls
       0 db block gets
     273 consistent gets
       0 physical reads
       0 redo size
 1272038 bytes sent via SQL*Net to client
     685 bytes received via SQL*Net from client
       9 SQL*Net roundtrips to/from client
       0 sorts (memory)
       0 sorts (disk)
   40000 rows processed

The CBO has not only used the index, but the query is much more efficient as a result, with just 273 consistent gets required to retrieve 40000 rows.

However the query based on the GRADE predicate still uses a FTS:

SQL> select * from ziggy2 where grade=42;

40257 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 1810052534

----------------------------------------------------------------------------
| Id | Operation         | Name   | Rows  | Bytes | Cost (%CPU) | Time     |
----------------------------------------------------------------------------
|  0 | SELECT STATEMENT  |        | 40000 | 1054K |   4920 (10) | 00:00:01 |
|* 1 | TABLE ACCESS FULL | ZIGGY2 | 40000 | 1054K |   4920 (10) | 00:00:01 |
----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter("GRADE"=42)

Statistics
----------------------------------------------------------
      0 recursive calls
     11 db block gets
  17602 consistent gets
      0 physical reads
      0 redo size
 434947 bytes sent via SQL*Net to client
    696 bytes received via SQL*Net from client
     10 SQL*Net roundtrips to/from client
      0 sorts (memory)
      0 sorts (disk)
  40257 rows processed

Now if we decide that actually the query based on GRADE is far more important to the business, we could of course reorder the data again. The following is yet another table, this time based on the CODE sorted ZIGGY2 table, but inserted in GRADE column order:

SQL> create table ziggy3 as select * from ziggy2 order by grade;

Table created.

SQL> exec dbms_stats.gather_table_stats(ownname=>null, tabname=> 'ZIGGY3', method_opt=> 'FOR ALL COLUMNS SIZE 1');

PL/SQL procedure successfully completed.

SQL> create index ziggy3_code_i on ziggy3(code);

Index created.

SQL> create index ziggy3_grade_i on ziggy3(grade);

Index created.

SQL> select index_name, clustering_factor, num_rows from user_indexes
where table_name='ZIGGY3';

INDEX_NAME           CLUSTERING_FACTOR   NUM_ROWS
-------------------- ----------------- ----------
ZIGGY3_CODE_I                    30231    4000000
ZIGGY3_GRADE_I                   17582    4000000

We notice we now have an excellent, very low Clustering Factor for the GRADE index (down to just 17582). But notice also the Clustering Factor for CODE. Although it has increased from 17561 to 30231, it’s nowhere near as bad as it was initially, when it was a massive 1748800.

The point being that with the data already ordered on CODE, Oracle inserting the data in GRADE order effectively had the data already sub-ordered on CODE. So we end up with perfect clustering on the GRADE column and “good enough” clustering on CODE as well.
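The sub-ordering effect can be reproduced with a small Python simulation. It relies on Python's sort being stable (rows with equal GRADE keep their previous relative order), which stands in for the way the data happened to be loaded here. The table size, the number of distinct values and the rows-per-block figure are arbitrary assumptions chosen to keep the run short, and the Clustering Factor function is a simplified model rather than Oracle's code.

```python
import random


def clustering_factor(rows, column, rows_per_block):
    """Count table block changes while walking an index on `column` in key order."""
    order = sorted(range(len(rows)), key=lambda pos: (rows[pos][column], pos))
    factor, previous_block = 0, None
    for pos in order:
        block = pos // rows_per_block
        if block != previous_block:
            factor += 1
        previous_block = block
    return factor


ROWS, DISTINCT, ROWS_PER_BLOCK = 80_000, 20, 100
rng = random.Random(42)  # fixed seed so the run is repeatable

# Like ZIGGY: CODE cycles through its values, GRADE is random.
ziggy = [{"code": i % DISTINCT + 1, "grade": rng.randint(1, DISTINCT)} for i in range(ROWS)]

ziggy2 = sorted(ziggy, key=lambda r: r["code"])    # ordered by CODE
ziggy3 = sorted(ziggy2, key=lambda r: r["grade"])  # stable re-sort by GRADE
# Explicit two-column sort, for comparison with the stable re-sort.
both_columns = sorted(ziggy, key=lambda r: (r["grade"], r["code"]))

for name, table in [("ziggy", ziggy), ("ziggy2", ziggy2), ("ziggy3", ziggy3)]:
    print(name,
          clustering_factor(table, "code", ROWS_PER_BLOCK),
          clustering_factor(table, "grade", ROWS_PER_BLOCK))
print("stable re-sort equals explicit (grade, code) sort:", ziggy3 == both_columns)
```

In this model the GRADE index on the re-sorted table ends up with one block change per table block, while the CODE index stays far below its original value, because each (GRADE, CODE) combination still sits in a small contiguous run of blocks. A real database sort is not guaranteed to be stable, so this illustrates the effect rather than promising it.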

If we now run the same queries again:

SQL> select * from ziggy3 where code=42;

40000 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 1004048030

-----------------------------------------------------------------------------------------------------
| Id | Operation                           | Name          | Rows  | Bytes | Cost (%CPU) | Time     |
-----------------------------------------------------------------------------------------------------
|  0 | SELECT STATEMENT                    |               | 40000 | 1054K |     392 (3) | 00:00:01 |
|  1 | TABLE ACCESS BY INDEX ROWID BATCHED | ZIGGY3        | 40000 | 1054K |     392 (3) | 00:00:01 |
|* 2 | INDEX RANGE SCAN                    | ZIGGY3_CODE_I | 40000 |       |      84 (5) | 00:00:01 |
-----------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - access("CODE"=42)

Statistics
----------------------------------------------------------
       0 recursive calls
       0 db block gets
     401 consistent gets
       0 physical reads
       0 redo size
 1272038 bytes sent via SQL*Net to client
     685 bytes received via SQL*Net from client
       9 SQL*Net roundtrips to/from client
       0 sorts (memory)
       0 sorts (disk)
   40000 rows processed

With the CODE based query, the CBO still uses the index and performance is still quite good, with consistent gets having gone up a tad (401, up from 273). However, we now have the scenario where the GRADE based query is also efficient, with the index access also selected by the CBO:

SQL> select * from ziggy3 where grade=42;

40257 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 844233985

------------------------------------------------------------------------------------------------------
| Id | Operation                           | Name           | Rows  | Bytes | Cost (%CPU) | Time     |
------------------------------------------------------------------------------------------------------
|  0 | SELECT STATEMENT                    |                | 40000 | 1054K |     264 (4) | 00:00:01 |
|  1 | TABLE ACCESS BY INDEX ROWID BATCHED | ZIGGY3         | 40000 | 1054K |     264 (4) | 00:00:01 |
|* 2 | INDEX RANGE SCAN                    | ZIGGY3_GRADE_I | 40000 |       |      84 (5) | 00:00:01 |
------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - access("GRADE"=42)

Statistics
----------------------------------------------------------
       0 recursive calls
       0 db block gets
     278 consistent gets
       0 physical reads
       0 redo size
 1280037 bytes sent via SQL*Net to client
     696 bytes received via SQL*Net from client
      10 SQL*Net roundtrips to/from client
       0 sorts (memory)
       0 sorts (disk)
   40257 rows processed

However, we are relying here on how Oracle happens to load the data with respect to the non-sorted column. We can instead guarantee good clustering on both columns by simply ordering the data on both columns explicitly. Here’s table number 4 with data explicitly sorted on both columns (the values of CODE sub-sorted within the ordering of GRADE):

SQL> create table ziggy4 as select * from ziggy3 order by grade, code;

Table created.

SQL> exec dbms_stats.gather_table_stats(ownname=>null, tabname=> 'ZIGGY4', method_opt=> 'FOR ALL COLUMNS SIZE 1');

PL/SQL procedure successfully completed.

SQL> create index ziggy4_code_i on ziggy4(code);

Index created.

SQL> create index ziggy4_grade_i on ziggy4(grade);

Index created.

SQL> select index_name, clustering_factor, num_rows from user_indexes
where table_name='ZIGGY4';

INDEX_NAME           CLUSTERING_FACTOR   NUM_ROWS
-------------------- ----------------- ----------
ZIGGY4_CODE_I                    27540    4000000
ZIGGY4_GRADE_I                   17583    4000000

We notice we have a near perfect Clustering Factor on the GRADE column (just 17583) and a “good enough” Clustering Factor on the CODE column (27540).

With 12c Rel 2, we can effectively “fix” the original poorly clustered table online on both columns by adding an appropriate Clustering Attribute to the table (new in 12.1) and performing a subsequent Online table reorg (new in 12.2):

SQL> alter table ziggy add clustering by linear order (grade, code);

Table altered.

SQL> alter table ziggy move online;

Table altered.

SQL> select index_name, clustering_factor, num_rows from user_indexes
where table_name='ZIGGY';

INDEX_NAME           CLUSTERING_FACTOR   NUM_ROWS
-------------------- ----------------- ----------
ZIGGY_CODE_I                     27525    4000000
ZIGGY_GRADE_I                    17578    4000000

We now have the same excellent Clustering Factor values as we had in the previous example.

Depending on data characteristics, you could potentially use the Interleave Clustering Attribute for good enough Clustering Factor values on your multiple columns, rather than perfect clustering on specific columns.
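For readers unfamiliar with interleaving, Oracle's documentation describes interleaved ordering as being based on a Z-order curve. The Python sketch below shows the general idea of a Z-order (Morton) key for two small non-negative integers by interleaving their bits. It is only an illustration of the technique, not Oracle's internal function, and the function name and bit width are assumptions.

```python
def z_order_key(grade, code, bits=8):
    """Interleave the low `bits` bits of two non-negative integers.

    Bits of `grade` land in the even positions and bits of `code` in the odd
    positions, so neither column dominates the resulting sort order.
    """
    key = 0
    for i in range(bits):
        key |= ((grade >> i) & 1) << (2 * i)
        key |= ((code >> i) & 1) << (2 * i + 1)
    return key


pairs = [(g, c) for g in range(4) for c in range(4)]
linear = sorted(pairs)                                      # like a linear (grade, code) order
interleaved = sorted(pairs, key=lambda p: z_order_key(*p))  # Z-order on both columns

print(linear[:6])
print(interleaved[:6])
```

In the linear order the first rows exhaust the CODE values for a single GRADE, whereas in the interleaved order the first four rows cover a small square of nearby values in both columns, which is why neither column gets perfect clustering but both can get reasonable clustering.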

So it is entirely possible to have the necessary data ordering you need for effective data accesses on multiple columns concurrently.

12.2 Introduction to Real-Time Materialized Views (The View) July 10, 2017

Posted by Richard Foote in 12c, 12c Rel 2, 12c Release 2 New Features, Oracle Indexes, Real-Time Materialized Views.


Although I usually focus on index related topics, I’ve always kinda considered Materialized Views (MVs) as an index like structure, which Oracle can automatically update and from which Oracle can efficiently retrieve data. The cost of maintaining a Materialized View Log is not unlike the cost of maintaining an index structure, the benefits of which can potentially far outweigh the overheads.

I just want to introduce a really cool new feature introduced in Oracle Database 12c Release 2 called Real-Time Materialized Views.

To best illustrate, a simple little demo. I first create a table and populate it with 1M rows.

SQL> create table bowie (id number primary key, name varchar2(42), sales number, text varchar2(42));

Table created.

SQL> insert into bowie select rownum, 'BOWIE' || to_char(mod(rownum,100)+1),
trunc(dbms_random.value(0,10000)), 'ZIGGY STARDUST' from dual connect by level<=1000000;

1000000 rows created.

SQL> commit;

Commit complete.

SQL> exec dbms_stats.gather_table_stats(ownname=>null, tabname=>'BOWIE');

PL/SQL procedure successfully completed.

I then run the following query which returns only those summary records where the total SALES exceeds some limit:

SQL> select name, sum(sales) from bowie group by name having sum(sales) > 50500000;

NAME                                       SUM(SALES)
------------------------------------------ ----------
BOWIE7                                       50570391
BOWIE55                                      50586083
BOWIE15                                      50636084

Execution Plan
----------------------------------------------------------
Plan hash value: 298288086

-----------------------------------------------------------------------------
| Id | Operation         | Name  | Rows  | Bytes | Cost (%CPU) | Time       |
-----------------------------------------------------------------------------
|  0 | SELECT STATEMENT  |       |    50 |   600 |    1454 (4) | 00:00:01   |
|* 1 | FILTER            |       |       |       |             |            |
|  2 | HASH GROUP BY     |       |    50 |   600 |    1454 (4) | 00:00:01   |
|  3 | TABLE ACCESS FULL | BOWIE | 1000K |   11M |    1410 (1) | 00:00:01   |
-----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter(SUM("SALES")>50500000)

Statistics
----------------------------------------------------------
   0 recursive calls
   0 db block gets
5138 consistent gets
   0 physical reads
   0 redo size
 704 bytes sent via SQL*Net to client
 608 bytes received via SQL*Net from client
   2 SQL*Net roundtrips to/from client
   0 sorts (memory)
   0 sorts (disk)
   3 rows processed

I don’t have any filtering predicates before the summarisation of the table, meaning I’m currently forced to read the entire table first, before I can filter any summarisations that aren’t of interest.

As such, a full table scan is an expensive operation here (5138 consistent gets).

One method to reduce these FTS overheads is to create a Materialized View which has all the summary details pre-defined. Depending on the QUERY_REWRITE_INTEGRITY parameter, I can potentially use Query Rewrite to automatically use the MV to access the pre-summarised data rather than perform the FTS on the base BOWIE table.

The MV could be kept fully up to date by performing a FAST REFRESH ON COMMIT but this adds additional overheads on the DMLs on the base table as they have to apply the actual changes to the MVs as part of the transaction. I could reduce these overheads by performing a FAST REFRESH ON DEMAND, but this means the MV may be stale and not fully up to date.

In Oracle Database 12.2, we get the best of both worlds with Real-Time Materialized Views: we avoid the additional overheads of an ON COMMIT refresh, yet are still guaranteed fully up to date data while (hopefully) still accessing the MV rather than performing the expensive FTS.

We first create the Materialized View Log (necessary for MV fast refreshes):

SQL> create materialized view log on bowie with sequence, rowid (id, name, sales) including new values;

Materialized view log created.

We then create the MV with the required summary SQL definition, but with the new ENABLE ON QUERY COMPUTATION clause:

SQL> create materialized view bowie_mv
refresh fast on demand
enable query rewrite
enable on query computation
as
select name, sum(sales) from bowie group by name;

Materialized view created.

SQL> exec dbms_stats.gather_table_stats(ownname=>null, tabname=>'BOWIE_MV');

PL/SQL procedure successfully completed.

If we re-run the summary query again:

SQL> select name, sum(sales) from bowie group by name having sum(sales) > 50500000;

NAME                                       SUM(SALES)
------------------------------------------ ----------
BOWIE7                                       50570391
BOWIE55                                      50586083
BOWIE15                                      50636084

Execution Plan
----------------------------------------------------------
Plan hash value: 593592962

-----------------------------------------------------------------------------------------
| Id | Operation                    | Name     | Rows | Bytes | Cost (%CPU) | Time      |
-----------------------------------------------------------------------------------------
|  0 | SELECT STATEMENT             |          |   10 |   140 |       3 (0) | 00:00:01  |
|* 1 | MAT_VIEW REWRITE ACCESS FULL | BOWIE_MV |   10 |   140 |       3 (0) | 00:00:01  |
-----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter("BOWIE_MV"."SUM(SALES)">50500000)

Statistics
----------------------------------------------------------
  0 recursive calls
  2 db block gets
  8 consistent gets
  0 physical reads
  0 redo size
704 bytes sent via SQL*Net to client
608 bytes received via SQL*Net from client
  2 SQL*Net roundtrips to/from client
  0 sorts (memory)
  0 sorts (disk)
  3 rows processed

We notice that Query Rewrite has taken place and the CBO has automatically used the MV (consisting of just 100 rows) to very efficiently access the required summary data (8 consistent gets).

If we look at the current Query Rewrite parameters:

SQL> show parameter query

NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
inmemory_query                       string      ENABLE
query_rewrite_enabled                string      TRUE
query_rewrite_integrity              string      enforced

We notice that QUERY_REWRITE_INTEGRITY is set to ENFORCED meaning that Oracle enforces and guarantees consistency and integrity. So no stale accesses to the MV will be tolerated here.

If we now add and commit a new row (that effectively adds 1000 to the BOWIE7 summary):

SQL> insert into bowie values (1000001, 'BOWIE7', 1000, 'HUNKY DORY');

1 row created.

SQL> commit;

Commit complete.

And now again re-run the summary query:

SQL> select name, sum(sales) from bowie group by name having sum(sales) > 50500000;

NAME                                       SUM(SALES)
------------------------------------------ ----------
BOWIE55                                      50586083
BOWIE15                                      50636084
BOWIE7                                       50571391

We notice the returned data is fully up to date (the total for BOWIE7 has indeed increased by the 1000 added).
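Conceptually, on-query computation can be pictured as combining the stale MV contents with the changes recorded in the Materialized View Log at query time. The Python sketch below models that idea with plain dictionaries, using the totals shown above. The data structures and function name are invented for illustration and say nothing about how Oracle actually implements the rewrite.

```python
def on_query_computation(stale_mv, mv_log):
    """Return up-to-date SUM(sales) per name without touching the base table.

    stale_mv: {name: sum_of_sales} as of the last refresh.
    mv_log:   list of (name, sales_delta) changes committed since then
              (positive for inserts, negative for deletes).
    """
    fresh = dict(stale_mv)  # leave the stored MV contents untouched
    for name, delta in mv_log:
        fresh[name] = fresh.get(name, 0) + delta
    return fresh


stale_mv = {"BOWIE7": 50570391, "BOWIE55": 50586083, "BOWIE15": 50636084}
mv_log = [("BOWIE7", 1000)]  # the single row inserted after the MV was built

fresh = on_query_computation(stale_mv, mv_log)
print({name: total for name, total in fresh.items() if total > 50500000})
```

The work in this model is proportional to the size of the MV plus the log, not to the 1M rows in the base table, which is the intuition behind the feature.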

And it did so efficiently without having to perform a massive FTS on the base table. A look at the execution plan reveals how:

Execution Plan
----------------------------------------------------------
Plan hash value: 3454774452

-----------------------------------------------------------------------------------------------------------------------
| Id | Operation                                  | Name                       | Rows | Bytes | Cost (%CPU)|     Time |
-----------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                          |                            |  203 |  7308 |     22 (28)| 00:00:01 |
|   1 |  VIEW                                     |                            |  203 |  7308 |     22 (28)| 00:00:01 |
|   2 |   UNION-ALL                               |                            |      |       |            |          |
| * 3 |    VIEW                                   | VW_FOJ_0                   |  100 |  3900 |      8 (25)| 00:00:01 |
| * 4 |     HASH JOIN OUTER                       |                            |  100 |  2500 |      8 (25)| 00:00:01 |
|   5 |      VIEW                                 |                            |  100 |  1700 |       3 (0)| 00:00:01 |
|   6 |       MAT_VIEW ACCESS FULL                | BOWIE_MV                   |  100 |  1400 |       3 (0)| 00:00:01 |
|   7 |      VIEW                                 |                            |    1 |     8 |      5 (40)| 00:00:01 |
|   8 |       HASH GROUP BY                       |                            |    1 |    36 |      5 (40)| 00:00:01 |
|   9 |        VIEW                               |                            |    1 |    36 |      4 (25)| 00:00:01 |
|  10 |         RESULT CACHE                      | csfyggq82gxrn757upr194x2g2 |      |       |            |          |
|* 11 |          VIEW                             |                            |    1 |   100 |      4 (25)| 00:00:01 |
|  12 |           WINDOW SORT                     |                            |    1 |   191 |      4 (25)| 00:00:01 |
|* 13 |            TABLE ACCESS FULL              | MLOG$_BOWIE                |    1 |   191 |       3 (0)| 00:00:01 |
|* 14 |   VIEW                                    | VW_FOJ_1                   |  102 |  5304 |      8 (25)| 00:00:01 |
|* 15 |    HASH JOIN FULL OUTER                   |                            |  102 |  3774 |      8 (25)| 00:00:01 |
|  16 |     VIEW                                  |                            |    1 |    30 |      5 (40)| 00:00:01 |
|  17 |      HASH GROUP BY                        |                            |    1 |    36 |      5 (40)| 00:00:01 |
|  18 |       VIEW                                |                            |    1 |    36 |      4 (25)| 00:00:01 |
|  19 |        RESULT CACHE                       | csfyggq82gxrn757upr194x2g2 |      |       |            |          |
|* 20 |         VIEW                              |                            |    1 |   100 |      4 (25)| 00:00:01 |
|  21 |          WINDOW SORT                      |                            |    1 |   191 |      4 (25)| 00:00:01 |
|* 22 |           TABLE ACCESS FULL               | MLOG$_BOWIE                |    1 |   191 |       3 (0)| 00:00:01 |
|  23 |   VIEW                                    |                            |  100 |   700 |       3 (0)| 00:00:01 |
|  24 |    MAT_VIEW ACCESS FULL                   | BOWIE_MV                   |  100 |  1400 |       3 (0)| 00:00:01 |
|  25 | NESTED LOOPS                              |                            |    1 |    75 |      6 (34)| 00:00:01 |
|  26 |  VIEW                                     |                            |    1 |    52 |      5 (40)| 00:00:01 |
|  27 |   HASH GROUP BY                           |                            |    1 |    36 |      5 (40)| 00:00:01 |
|  28 |    VIEW                                   |                            |    1 |    36 |      4 (25)| 00:00:01 |
|  29 |     RESULT CACHE                          | csfyggq82gxrn757upr194x2g2 |      |       |            |          |
|* 30 |      VIEW                                 |                            |    1 |   100 |      4 (25)| 00:00:01 |
|  31 |       WINDOW SORT                         |                            |    1 |   191 |      4 (25)| 00:00:01 |
|* 32 |        TABLE ACCESS FULL                  | MLOG$_BOWIE                |    1 |   191 |       3 (0)| 00:00:01 |
|* 33 |  MAT_VIEW ACCESS BY INDEX ROWID           | BOWIE_MV                   |    1 |    23 |       1 (0)| 00:00:01 |
|* 34 |   INDEX UNIQUE SCAN                       | I_SNAP$_BOWIE_MV           |    1 |       |       0 (0)| 00:00:01 |
-----------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

3 - filter("AV$0"."OJ_MARK" IS NULL AND "SNA$0"."SUM(SALES)">50500000)
4 - access(SYS_OP_MAP_NONNULL("SNA$0"."NAME")=SYS_OP_MAP_NONNULL("AV$0"."GB0"(+)))
11 - filter(SYS_OP_CSEE(NLSSORT("MAS$"."OLD_NEW$$",'nls_sort=''BINARY_CI'''))=HEXTORAW('6E00') AND
"MAS$"."SEQ$$"="MAS$"."MAXSEQ$$" OR (SYS_OP_CSEE(NLSSORT("MAS$"."OLD_NEW$$",'nls_sort=''BINARY_CI'''))=H
EXTORAW('6F00') OR SYS_OP_CSEE(NLSSORT("MAS$"."OLD_NEW$$",'nls_sort=''BINARY_CI'''))=HEXTORAW('7500'))
AND "MAS$"."SEQ$$"="MAS$"."MINSEQ$$")
13 - filter("MAS$"."SNAPTIME$$">TO_DATE(' 2017-03-31 15:47:28', 'syyyy-mm-dd hh24:mi:ss'))
14 - filter("SNA$0"."SNA_OJ_MARK" IS NULL AND DECODE("AV$0"."H0",0,TO_NUMBER(NULL),"AV$0"."D0")>50500000)
15 - access(SYS_OP_MAP_NONNULL("SNA$0"."NAME")=SYS_OP_MAP_NONNULL("AV$0"."GB0"))
20 - filter(SYS_OP_CSEE(NLSSORT("MAS$"."OLD_NEW$$",'nls_sort=''BINARY_CI'''))=HEXTORAW('6E00') AND
"MAS$"."SEQ$$"="MAS$"."MAXSEQ$$" OR (SYS_OP_CSEE(NLSSORT("MAS$"."OLD_NEW$$",'nls_sort=''BINARY_CI'''))=H
EXTORAW('6F00') OR SYS_OP_CSEE(NLSSORT("MAS$"."OLD_NEW$$",'nls_sort=''BINARY_CI'''))=HEXTORAW('7500'))
AND "MAS$"."SEQ$$"="MAS$"."MINSEQ$$")
22 - filter("MAS$"."SNAPTIME$$">TO_DATE(' 2017-03-31 15:47:28', 'syyyy-mm-dd hh24:mi:ss'))
30 - filter(SYS_OP_CSEE(NLSSORT("MAS$"."OLD_NEW$$",'nls_sort=''BINARY_CI'''))=HEXTORAW('6E00') AND
"MAS$"."SEQ$$"="MAS$"."MAXSEQ$$" OR (SYS_OP_CSEE(NLSSORT("MAS$"."OLD_NEW$$",'nls_sort=''BINARY_CI'''))=H
EXTORAW('6F00') OR SYS_OP_CSEE(NLSSORT("MAS$"."OLD_NEW$$",'nls_sort=''BINARY_CI'''))=HEXTORAW('7500'))
AND "MAS$"."SEQ$$"="MAS$"."MINSEQ$$")
32 - filter("MAS$"."SNAPTIME$$">TO_DATE(' 2017-03-31 15:47:28', 'syyyy-mm-dd hh24:mi:ss'))
33 - filter(DECODE(TO_CHAR("BOWIE_MV"."SUM(SALES)"),NULL,DECODE("AV$0"."H0",0,TO_NUMBER(NULL),"AV$0"."
D0"),"BOWIE_MV"."SUM(SALES)"+"AV$0"."D0")>50500000)
34 - access(SYS_OP_MAP_NONNULL("NAME")=SYS_OP_MAP_NONNULL("AV$0"."GB0"))

Result Cache Information (identified by operation id):
------------------------------------------------------

10 - column-count=7; dependencies=(BOWIE.MLOG$_BOWIE); attributes=(ordered, session-lifetime); parameters=(nls); name="DMLTYPES:MLOG$_BOWIE"
19 - column-count=7; dependencies=(BOWIE.MLOG$_BOWIE); attributes=(ordered, session-lifetime); parameters=(nls); name="DMLTYPES:MLOG$_BOWIE"
29 - column-count=7; dependencies=(BOWIE.MLOG$_BOWIE); attributes=(ordered, session-lifetime); parameters=(nls); name="DMLTYPES:MLOG$_BOWIE"

Note
-----
- dynamic statistics used: dynamic sampling (level=2)
- this is an adaptive plan

Statistics
----------------------------------------------------------
  0 recursive calls
  4 db block gets
 16 consistent gets
  0 physical reads
  0 redo size
704 bytes sent via SQL*Net to client
608 bytes received via SQL*Net from client
  2 SQL*Net roundtrips to/from client
  0 sorts (memory)
  0 sorts (disk)
  3 rows processed

Oracle can now merge the data from the MV with the data still within the MV Log to generate the final, fully up to date result. At just 16 consistent gets, this is more expensive than the fully refreshed MV (8 consistent gets) but much less than the 5138 consistent gets when accessing the base BOWIE table via a FTS.

And provided the CBO calculates the cost of doing so as less than that of performing the FTS (or otherwise) on the base table, Oracle will perform this new smart merge when accessing data from MVs created in this manner.

Very nice 🙂

Oracle 12c: Indexing JSON in the Database Part III (Paperback Writer) September 2, 2016

Posted by Richard Foote in 12c, JSON, JSON Text Index, Oracle Indexes.

In Part I and Part II, we looked at how to index specific attributes within a JSON document store within an Oracle 12c database.

But what if we’re not sure which specific attributes might benefit from an index? Or indeed, as JSON is by its nature a schema-less way to store data, what if we’re not entirely sure which attributes might be present currently or in the future?

On a JSON document store within the Oracle Database, you can create a special JSON-aware Text index that automatically indexes any field/attribute within a JSON document, and then use a Text-based function to search efficiently for data from any attribute.

Using the same table created in Part I, you can create a JSON Text index as follows:

SQL> CREATE INDEX ziggy_search_idx ON ziggy_json (ziggy_order)
2 INDEXTYPE IS CTXSYS.CONTEXT
3 PARAMETERS ('section group CTXSYS.JSON_SECTION_GROUP SYNC (ON COMMIT)');
Index created.

Note this Text index is (optionally) defined to be automatically synchronised when data in the ZIGGY_JSON table is committed.
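If the SYNC (ON COMMIT) clause is left out, an Oracle Text index is by default only synchronised manually, so newly committed rows can't be found via the index until a sync is performed. A minimal sketch of what that might look like, assuming the ZIGGY_SEARCH_IDX index above and a user with execute privileges on the CTX_DDL package:

```sql
-- Rows changed since the last sync and still waiting to be indexed
SELECT pnd_index_name, COUNT(*) AS pending_rows
FROM   ctx_user_pending
GROUP  BY pnd_index_name;

-- Manually push the pending changes into the Text index
BEGIN
  ctx_ddl.sync_index(idx_name => 'ZIGGY_SEARCH_IDX');
END;
/
```

With SYNC (ON COMMIT) in place as in this demo, the pending list should generally be empty after each commit, at the price of some extra work at commit time.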

We can use the JSON_TEXTCONTAINS Oracle Text function to efficiently search for any data within the JSON column. For example:

SQL> SELECT * FROM ziggy_json WHERE json_textcontains(ziggy_order, '$.Reference', 'DBOWIE-201642');

Elapsed: 00:00:00.00

Execution Plan
----------------------------------------------------------
Plan hash value: 3069169778

------------------------------------------------------------------------------------------------
| Id | Operation                   | Name             | Rows | Bytes | Cost (%CPU) | Time      |
------------------------------------------------------------------------------------------------
|  0 | SELECT STATEMENT            |                  |    1 |  1534 |    2550 (0) |  00:00:01 |
|  1 | TABLE ACCESS BY INDEX ROWID | ZIGGY_JSON       |    1 |  1534 |    2550 (0) |  00:00:01 |
|* 2 | DOMAIN INDEX                | ZIGGY_SEARCH_IDX |      |       |    2549 (0) |  00:00:01 |
------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - access("CTXSYS"."CONTAINS"("ZIGGY_JSON"."ZIGGY_ORDER",'{DBOWIE-201642}
 INPATH(/Reference)')>0)

Statistics
----------------------------------------------------------
  65 recursive calls
   0 db block gets
 118 consistent gets
   0 physical reads
   0 redo size
1863 bytes sent via SQL*Net to client
1088 bytes received via SQL*Net from client
   6 SQL*Net roundtrips to/from client
   0 sorts (memory)
   0 sorts (disk)
   2 rows processed

Note the JSON-based Text index is used to retrieve data efficiently.

The Text index can also be used to search data efficiently from within an array set:

SQL> SELECT * FROM ziggy_json WHERE json_textcontains(ziggy_order, '$.LineItems.Part.Description', 'Low');

no rows selected

Elapsed: 00:00:00.01

Execution Plan
----------------------------------------------------------
Plan hash value: 3069169778

------------------------------------------------------------------------------------------------
| Id | Operation                   | Name             | Rows | Bytes | Cost (%CPU) | Time     |
------------------------------------------------------------------------------------------------
|  0 | SELECT STATEMENT            |                  |    1 |  1534 |    5927 (0) | 00:00:01 |
|  1 | TABLE ACCESS BY INDEX ROWID | ZIGGY_JSON       |    1 |  1534 |    5927 (0) | 00:00:01 |
|* 2 | DOMAIN INDEX                | ZIGGY_SEARCH_IDX |      |       |    5927 (0) | 00:00:01 |
------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - access("CTXSYS"."CONTAINS"("ZIGGY_JSON"."ZIGGY_ORDER",'{Low}
 INPATH(/LineItems/Part/Description)')>0)

Statistics
----------------------------------------------------------
132 recursive calls
  0 db block gets
182 consistent gets
  0 physical reads
  0 redo size
627 bytes sent via SQL*Net to client
552 bytes received via SQL*Net from client
  2 SQL*Net roundtrips to/from client
  0 sorts (memory)
  0 sorts (disk)
  0 rows processed
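No rows are returned here simply because none of the Part descriptions loaded in Part I ("Hunky Dory" and "Pin-Ups") contain the word Low. If you want to see which values actually live within the LineItems array of a given document, one option is to unnest the array with JSON_TABLE. A minimal sketch, assuming the ZIGGY_JSON table from Part I (the jt alias and the column names item_number and description are just ones I've picked for illustration):

```sql
-- Unnest the LineItems array so each array element becomes its own row
SELECT z.id, jt.item_number, jt.description
FROM   ziggy_json z,
       JSON_TABLE(z.ziggy_order, '$.LineItems[*]'
         COLUMNS (item_number NUMBER       PATH '$.ItemNumber',
                  description VARCHAR2(30) PATH '$.Part.Description')) jt
WHERE  z.id = 42;
```

Note this is only a way to eyeball the array contents and is not expected to make use of the Text index.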

We can also search for a specific data value across any attribute within the JSON document store:

SQL> SELECT * FROM ziggy_json WHERE json_textcontains(ziggy_order, '$', '4242');

Elapsed: 00:00:00.00

Execution Plan
----------------------------------------------------------
Plan hash value: 3069169778

------------------------------------------------------------------------------------------------
| Id | Operation                   | Name             | Rows | Bytes | Cost (%CPU) | Time     |
------------------------------------------------------------------------------------------------
|  0 | SELECT STATEMENT            |                  |    1 |  1534 |       2 (0) | 00:00:01 |
|  1 | TABLE ACCESS BY INDEX ROWID | ZIGGY_JSON       |    1 |  1534 |       2 (0) | 00:00:01 |
|* 2 | DOMAIN INDEX                | ZIGGY_SEARCH_IDX |      |       |       1 (0) | 00:00:01 |
------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

 2 - access("CTXSYS"."CONTAINS"("ZIGGY_JSON"."ZIGGY_ORDER",'{4242}')>0)
     
Statistics
----------------------------------------------------------
  10 recursive calls
   0 db block gets
  32 consistent gets
   0 physical reads
   0 redo size
1865 bytes sent via SQL*Net to client
1088 bytes received via SQL*Net from client
   6 SQL*Net roundtrips to/from client
   0 sorts (memory)
   0 sorts (disk)
   2 rows processed

Let’s now add more data to the JSON column, but this time introducing a few new attributes (such as AlbumName):

SQL> insert into ziggy_json
 2 select
 3 rownum,
 4 SYSdate,
 5 '{"AlbumId" : ' || rownum || ',
 6 "AlbumName" : "HUNKY DORY",
 7 "ArtistName" : "David Bowie"}'
 8 from dual connect by level <= 10;
10 rows created.

SQL> commit;

Commit complete.

As the JSON-based Text index was defined to be automatically synchronised when we commit data in the table, these new attributes can be immediately searched and accessed via the index:

SQL> SELECT * FROM ziggy_json WHERE json_textcontains(ziggy_order, '$.AlbumName', 'HUNKY DORY');

10 rows selected.

Elapsed: 00:00:00.02

Execution Plan
----------------------------------------------------------
Plan hash value: 3069169778

------------------------------------------------------------------------------------------------
| Id | Operation                   | Name             | Rows | Bytes | Cost (%CPU) | Time     |
------------------------------------------------------------------------------------------------
|  0 | SELECT STATEMENT            |                  |  198 |  296K |    1948 (0) | 00:00:01 |
|  1 | TABLE ACCESS BY INDEX ROWID | ZIGGY_JSON       |  198 |  296K |    1948 (0) | 00:00:01 |
|* 2 | DOMAIN INDEX                | ZIGGY_SEARCH_IDX |      |       |    1780 (0) | 00:00:01 |
------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - access("CTXSYS"."CONTAINS"("ZIGGY_JSON"."ZIGGY_ORDER",'{HUNKY DORY}
 INPATH(/AlbumName)')>0)
     
Statistics
----------------------------------------------------------
  48 recursive calls
   0 db block gets
 103 consistent gets
   0 physical reads
   0 redo size
6751 bytes sent via SQL*Net to client
3232 bytes received via SQL*Net from client
  22 SQL*Net roundtrips to/from client
   0 sorts (memory)
   0 sorts (disk)
  10 rows processed

Not only can JSON data be stored within the Oracle 12c database, but we have a number of index strategies available to search such data efficiently.

Oracle 12c: Indexing JSON in the Database Part II (Find A Little Wood) August 5, 2016

Posted by Richard Foote in 12c, JSON, JSON Text Index, Oracle Indexes.

In Part I, we looked at how you can now store JSON documents within the Oracle 12c Database. For efficient access to JSON documents stored in the Oracle database, we can create a function-based index based on either the JSON_VALUE function or JSON .dot notation.

These indexes are useful for indexing specific JSON attributes, but what if we want to index multiple JSON attributes within a single index structure?

To start, I’m just going to add an extra row to increase the selectivity of other columns.

SQL> insert into ziggy_json
2 select
3 100001,
4 SYSdate,
5 '{"PONumber" : 1000001,
6 "Reference" : "MTOM-20161",
7 "Requestor" : "Major Tom",
8 "User" : "MTOM",
9 "CostCenter" : "B42",
10 "ShippingInstructions" : {"name" : "Major Tom",
11 "Address": {"street" : "42 Ziggy Street",
12 "city" : "Canberra",
13 "state" : "ACT",
14 "zipCode" : 2601,
15 "country" : "Australia"},
16 "Phone" : [{"type" : "Office", "number" : "417-555-7777"},
17 {"type" : "Mobile", "number" : "417-555-1234"}]},
18 "Special Instructions" : null,
19 "AllowPartialShipment" : true,
20 "LineItems" : [{"ItemNumber" : 1,
21 "Part" : {"Description" : "Hunky Dory",
22 "UnitPrice" : 10.95},
23 "Quantity" : 5.0},
24 {"ItemNumber" : 2,
25 "Part" : {"Description" : "Pin-Ups",
26 "UnitPrice" : 10.95},
27 "Quantity" : 3.0}]}'
28 from dual;

1 row created.

SQL> commit;

Commit complete.

SQL> exec dbms_stats.gather_table_stats(ownname=>null, tabname=>'ZIGGY_JSON');

PL/SQL procedure successfully completed.

We can still create composite indexes based on the JSON_VALUE function as we can with conventional columns:

SQL> create index ziggy_json_idx3 on 
     ziggy_json(json_value(ziggy_order, '$.User' RETURNING VARCHAR2(20)), 
                json_value(ziggy_order, '$.CostCenter' RETURNING VARCHAR2(6)));

Index created.

SQL> exec dbms_stats.gather_table_stats(ownname=>null, tabname=>'ZIGGY_JSON');

PL/SQL procedure successfully completed.

If we now run a query with a couple of JSON_VALUE based predicates:

SQL> select * from ziggy_json
     where json_value(ziggy_order, '$.User' RETURNING VARCHAR2(20))='MTOM' and 
           json_value(ziggy_order, '$.CostCenter' RETURNING VARCHAR2(6)) = 'B42';

Execution Plan
----------------------------------------------------------
Plan hash value: 3402615542

-------------------------------------------------------------------------------------------------------
| Id | Operation                           | Name            | Rows | Bytes | Cost (%CPU) | Time     |
-------------------------------------------------------------------------------------------------------
| 0  | SELECT STATEMENT                    |                 |    1 |  1533 |       4 (0) | 00:00:01 |
| 1  | TABLE ACCESS BY INDEX ROWID BATCHED | ZIGGY_JSON      |    1 |  1533 |       4 (0) | 00:00:01 |
|*2  | INDEX RANGE SCAN                    | ZIGGY_JSON_IDX3 |    1 |       |       3 (0) | 00:00:01 |
-------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------

2 - access(JSON_VALUE("ZIGGY_ORDER" FORMAT JSON , '$.User' RETURNING VARCHAR2(20) NULL ON
ERROR)='MTOM' AND JSON_VALUE("ZIGGY_ORDER" FORMAT JSON , '$.CostCenter' RETURNING VARCHAR2(6) NULL ON ERROR)='B42')
 
Statistics
----------------------------------------------------------
   0 recursive calls
   0 db block gets
   5 consistent gets
   0 physical reads
   0 redo size
1248 bytes sent via SQL*Net to client
 820 bytes received via SQL*Net from client
   4 SQL*Net roundtrips to/from client
   0 sorts (memory)
   0 sorts (disk)
   1 rows processed

The composite index is effectively used by the CBO as expected.

It does though make our SQL a little cumbersome to write. To simplify things a tad, we could create a couple of virtual columns based on these functions, create the JSON function-based indexes on these virtual columns and simplify the SQL accordingly.

First, we create the virtual columns (note they’re virtual columns and so consume no storage):

 

SQL> ALTER TABLE ziggy_json ADD (userid VARCHAR2(20)
2 GENERATED ALWAYS AS (json_value(ziggy_order, '$.User' RETURNING VARCHAR2(20))));

Table altered.

SQL> ALTER TABLE ziggy_json ADD (costcenter VARCHAR2(6)
2 GENERATED ALWAYS AS (json_value(ziggy_order, '$.CostCenter' RETURNING VARCHAR2(6))));

Table altered.

 

Next, create the index based on these newly created virtual columns:

SQL> CREATE INDEX ziggy_user_costctr_idx on ziggy_json(userid, costcenter);

Index created.

SQL> exec dbms_stats.gather_table_stats(ownname=>null, tabname=>'ZIGGY_JSON');

PL/SQL procedure successfully completed.

And then write a simplified version of the SQL to reference the virtual columns:

SQL> select * from ziggy_json where userid='MTOM' and costcenter='B42';

ID ZIGGY_DAT
---------- ---------
ZIGGY_ORDER
--------------------------------------------------------------------------------

USERID COSTCE
-------------------- ------
100001 24-JUN-16
{"PONumber" : 1000001,
"Reference" : "MTOM-20161",
MTOM B42
  
Elapsed: 00:00:00.01

Execution Plan
----------------------------------------------------------
Plan hash value: 5717455

--------------------------------------------------------------------------------------------------------------
| Id | Operation                           | Name                   | Rows | Bytes | Cost (%CPU) | Time     |
--------------------------------------------------------------------------------------------------------------
| 0  | SELECT STATEMENT                    |                        |    1 |  1535 |       4 (0) | 00:00:01 |
| 1  | TABLE ACCESS BY INDEX ROWID BATCHED | ZIGGY_JSON             |    1 |  1535 |       4 (0) | 00:00:01 |
|* 2 | INDEX RANGE SCAN                    | ZIGGY_USER_COSTCTR_IDX |    1 |       |       3 (0) | 00:00:01 |
--------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------

2 - access("USERID"='MTOM' AND "COSTCENTER"='B42')
   
Statistics
----------------------------------------------------------
   0 recursive calls
   0 db block gets
   5 consistent gets
   0 physical reads
   0 redo size
1396 bytes sent via SQL*Net to client
 820 bytes received via SQL*Net from client
   4 SQL*Net roundtrips to/from client
   0 sorts (memory)
   0 sorts (disk)
   1 rows processed

The index is again used as expected.

Of course, if we still want to reference the JSON functions directly within the SQL, the query can still be written as previously:

SQL> select * from ziggy_json 
     where json_value(ziggy_order, '$.User' returning varchar2(20))='MTOM' and
           json_value(ziggy_order, '$.CostCenter' returning varchar2(6))='B42';

ID ZIGGY_DAT
---------- ---------
ZIGGY_ORDER
--------------------------------------------------------------------------------

USERID COSTCE
-------------------- ------
100001 24-JUN-16
{"PONumber" : 1000001,
"Reference" : "MTOM-20161",
MTOM B42
   
Elapsed: 00:00:00.00

Execution Plan
----------------------------------------------------------
Plan hash value: 5717455

--------------------------------------------------------------------------------------------------------------
| Id | Operation                           | Name                   | Rows | Bytes | Cost (%CPU) | Time     |
--------------------------------------------------------------------------------------------------------------
| 0  | SELECT STATEMENT                    |                        |    1 |  1535 |       4 (0) | 00:00:01 |
| 1  | TABLE ACCESS BY INDEX ROWID BATCHED | ZIGGY_JSON             |    1 |  1535 |       4 (0) | 00:00:01 |
|*2  | INDEX RANGE SCAN                    | ZIGGY_USER_COSTCTR_IDX |    1 |       |       3 (0) | 00:00:01 |
--------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------

2 - access("ZIGGY_JSON"."USERID"='MTOM' AND "ZIGGY_JSON"."COSTCENTER"='B42')

Statistics
----------------------------------------------------------
   0 recursive calls
   0 db block gets
   5 consistent gets
   0 physical reads
   0 redo size
1396 bytes sent via SQL*Net to client
 820 bytes received via SQL*Net from client
   4 SQL*Net roundtrips to/from client
   0 sorts (memory)
   0 sorts (disk)
   1 rows processed

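The CBO matches the JSON_VALUE expressions to the equivalent virtual columns (note the predicates now reference USERID and COSTCENTER) and so the index is again used as expected.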

I’ll next look at using a JSON Text based index to effectively index the entire JSON document.

Oracle 12c: Indexing JSON In The Database Part I (Lazarus) August 2, 2016

Posted by Richard Foote in 12c, JSON, Oracle Indexes.

One of the very cool new features introduced in Oracle Database 12c Rel 1 is the ability to store JavaScript Object Notation (JSON) documents within the database. Unlike XML which has its own data type, JSON data can be stored as VARCHAR2, CLOB or BLOB data types, but with a JSON check constraint to ensure the stored parsed document meets JSON document standards.

This enables ‘No-SQL’ schema-less type development within the Oracle Database for “next generation” applications. Although developers can work with the JSON store without using SQL (via say RESTful APIs directly within the database), you can still leverage the power of SQL for reporting and analytics type purposes. Of course, reading JSON data efficiently then becomes important, and that’s where indexing the JSON document store kicks in.

Let’s look at a simple example.

Firstly, let’s create a table with a column called ZIGGY_ORDER which stores JSON documents.

SQL> CREATE TABLE ziggy_json (id number, ziggy_date date, ziggy_order CLOB CONSTRAINT ensure_ziggy_json CHECK (ziggy_order IS JSON));
Table created.

The ZIGGY_ORDER column has a JSON check constraint which ensures only valid JSON documents can be stored. An attempt to insert a row with invalid JSON data will fail:

SQL> insert into ziggy_json values (1, sysdate, '{"This is not legal JSON"}');
insert into ziggy_json values (1, sysdate, '{"This is not legal JSON"}')
*
ERROR at line 1:
ORA-02290: check constraint (BOWIE.ENSURE_ZIGGY_JSON) violated
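The insert fails because a member of a JSON object must be a name/value pair, whereas "This is not legal JSON" is a name without any value. If you'd rather test a document before hitting the constraint, the same IS JSON condition can be used directly within a query. A minimal sketch (the docs inline view and its doc column are made up purely for this illustration):

```sql
-- Check candidate documents with the same condition the constraint uses
WITH docs AS (
  SELECT '{"This is not legal JSON"}'      AS doc FROM dual
  UNION ALL
  SELECT '{"Note" : "This is legal JSON"}' AS doc FROM dual
)
SELECT doc,
       CASE WHEN doc IS JSON THEN 'valid' ELSE 'invalid' END AS json_check
FROM   docs;
```

The first document is the one rejected above, while the second supplies a value for the name and so is well formed.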

Let’s insert some JSON data into the table. Note I’m using ROWNUM to insert some almost unique JSON data and then re-inserting the data again to get data worth accessing via an index:

SQL> insert into ziggy_json
2 select
3 rownum,
4 SYSdate,
5 '{"PONumber" : ' || rownum || ',
6 "Reference" : "DBOWIE-2016' || rownum || '",
7 "Requestor" : "David Bowie",
8 "User" : "DBOWIE",
9 "CostCenter" : "A42",
10 "ShippingInstructions" : {"name" : "David Bowie",
11 "Address": {"street" : "42 Ziggy Street",
12 "city" : "Canberra",
13 "state" : "ACT",
14 "zipCode" : 2601,
15 "country" : "Australia"},
16 "Phone" : [{"type" : "Office", "number" : "417-555-7777"},
17 {"type" : "Mobile", "number" : "417-555-1234"}]},
18 "Special Instructions" : null,
19 "AllowPartialShipment" : true,
20 "LineItems" : [{"ItemNumber" : 1,
21 "Part" : {"Description" : "Hunky Dory",
22 "UnitPrice" : 10.95},
23 "Quantity" : 5.0},
24 {"ItemNumber" : 2,
25 "Part" : {"Description" : "Pin-Ups",
26 "UnitPrice" : 10.95},
27 "Quantity" : 3.0}]}'
28 from dual connect by level <= 1000000;

1000000 rows created.

SQL> insert into ziggy_json select * from ziggy_json;

1000000 rows created.

SQL> commit;

Commit complete.

SQL> exec dbms_stats.gather_table_stats(ownname=>null, tabname=>'ZIGGY_JSON');

PL/SQL procedure successfully completed.

There are a number of ways we can reference and access data from within JSON. One method is to use .dot notation to access specific JSON data elements of interest:

SQL> SELECT z.ziggy_order.PONumber FROM ziggy_json z where z.ziggy_order.PONumber=42;

PONUMBER
--------------------------------------------------------------------------------

42
42

We can also use the JSON_VALUE function to access specific JSON data of interest:

SQL> select json_value(ziggy_order, '$.Reference') from ziggy_json  
     where json_value(ziggy_order, '$.PONumber' returning number)=42;

JSON_VALUE(ZIGGY_ORDER,'$.REFERENCE')
--------------------------------------------------------------------------------

DBOWIE-201642
DBOWIE-201642

Without an index, the CBO has no choice but to use an expensive Full Table Scan:

SQL> SELECT z.ziggy_order.PONumber FROM ziggy_json z where z.ziggy_order.PONumber=42;

Elapsed: 00:00:34.04

Execution Plan
----------------------------------------------------------
Plan hash value: 1413303849

--------------------------------------------------------------------------------
| Id | Operation         | Name       | Rows  | Bytes | Cost (%CPU) | Time     |
--------------------------------------------------------------------------------
| 0  | SELECT STATEMENT  |            | 20000 |   28M |    129K (3) | 00:00:06 |
|* 1 | TABLE ACCESS FULL | ZIGGY_JSON | 20000 |   28M |    129K (3) | 00:00:06 |
--------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter(TO_NUMBER(JSON_QUERY("Z"."ZIGGY_ORDER" FORMAT JSON ,
'$.PONumber' RETURNING VARCHAR2(4000) ASIS WITHOUT ARRAY WRAPPER NULL
ON ERROR))=42)

Statistics
----------------------------------------------------------
     0 recursive calls
     0 db block gets
500057 consistent gets
299745 physical reads
     0 redo size
   596 bytes sent via SQL*Net to client
   552 bytes received via SQL*Net from client
     2 SQL*Net roundtrips to/from client
     0 sorts (memory)
     0 sorts (disk)
     2 rows processed

 

At some 500,057 consistent gets and an elapsed time of 34.04 seconds, the above query is “slow” and expensive.

So one option to speed things up is to create a function-based index using the JSON_VALUE function. This can then be used to quickly access data that matches specific JSON name values of interest:

SQL> CREATE INDEX ziggy_po_num_idx ON 
     ziggy_json (json_value(ziggy_order, '$.PONumber' RETURNING NUMBER ERROR ON ERROR));

Index created.

SQL> exec dbms_stats.gather_table_stats(ownname=>null, tabname=>'ZIGGY_JSON');

PL/SQL procedure successfully completed.

As it’s a function-based index, collecting statistics on the implicitly created virtual column is advisable. If we now re-run the query:

SQL> select * from ziggy_json 
     where json_value(ziggy_order, '$.PONumber' returning number)=42;

ID ZIGGY_DAT
---------- ---------
ZIGGY_ORDER
------------------------------------------------------------
42 24-JUN-16
{"PONumber" : 42,
"Reference" : "DBOWIE-201642",
42 24-JUN-16
{"PONumber" : 42,
"Reference" : "DBOWIE-201642",

Elapsed: 00:00:00.01

Execution Plan
----------------------------------------------------------
Plan hash value: 1939019025

--------------------------------------------------------------------------------------------------------
| Id | Operation                           | Name             | Rows | Bytes | Cost (%CPU) | Time     |
--------------------------------------------------------------------------------------------------------
| 0  | SELECT STATEMENT                    |                  |    2 |  3058 |       5 (0) | 00:00:01 |
| 1  | TABLE ACCESS BY INDEX ROWID BATCHED | ZIGGY_JSON       |    2 |  3058 |       5 (0) | 00:00:01 |
|* 2 | INDEX RANGE SCAN                    | ZIGGY_PO_NUM_IDX |    2 |       |       3 (0) | 00:00:01 |
--------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------

2 - access(JSON_VALUE("ZIGGY_ORDER" FORMAT JSON , '$.PONumber' RETURNING NUMBER ERROR ON
ERROR)=42)
Statistics
----------------------------------------------------------
   0 recursive calls
   0 db block gets
   9 consistent gets
   0 physical reads
   0 redo size
1863 bytes sent via SQL*Net to client
1088 bytes received via SQL*Net from client
   6 SQL*Net roundtrips to/from client
   0 sorts (memory)
   0 sorts (disk)
   2 rows processed

The query now uses the JSON function-based index, performs just 9 consistent gets and completes in 0.01 seconds.
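As with any function-based index, the JSON_VALUE expression is implemented behind the scenes as a hidden virtual column on the table, which is why gathering the table statistics after creating the index matters. If you're curious, the hidden column can be seen in USER_TAB_COLS (unlike USER_TAB_COLUMNS, this view lists hidden columns). A minimal sketch, assuming you're connected as the owner of the ZIGGY_JSON table:

```sql
-- List all columns, including the system-generated hidden virtual
-- column that backs the function-based index
SELECT column_name, data_type, hidden_column, virtual_column,
       num_distinct, last_analyzed
FROM   user_tab_cols
WHERE  table_name = 'ZIGGY_JSON'
ORDER  BY internal_column_id;
```

The hidden column carries a system-generated name, and a populated LAST_ANALYZED value suggests the CBO has statistics on it to work with.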

The index can be used for any data accesses in which the CBO considers the index the cheaper alternative:

SQL> select * from ziggy_json where json_value(ziggy_order, '$.PONumber' returning number)<42;

82 rows selected.

Elapsed: 00:00:00.01

Execution Plan
----------------------------------------------------------
Plan hash value: 1939019025

--------------------------------------------------------------------------------------------------------
| Id | Operation                           | Name             | Rows | Bytes | Cost (%CPU) | Time     |
--------------------------------------------------------------------------------------------------------
| 0  | SELECT STATEMENT                    |                  |   82 |  122K |      86 (0) | 00:00:01 |
| 1  | TABLE ACCESS BY INDEX ROWID BATCHED | ZIGGY_JSON       |   82 |  122K |      86 (0) | 00:00:01 |
|* 2 | INDEX RANGE SCAN                    | ZIGGY_PO_NUM_IDX |   82 |       |       3 (0) | 00:00:01 |
--------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------

2 - access(JSON_VALUE("ZIGGY_ORDER" FORMAT JSON , '$.PONumber' RETURNING NUMBER ERROR ON
ERROR)<42)

Statistics
----------------------------------------------------------
    0 recursive calls
    0 db block gets
  249 consistent gets
    0 physical reads
    0 redo size
50623 bytes sent via SQL*Net to client
22528 bytes received via SQL*Net from client
  166 SQL*Net roundtrips to/from client
    0 sorts (memory)
    0 sorts (disk)
   82 rows processed

Indexes can also be created based on the .dot JSON notation:

SQL> CREATE INDEX ziggy_po_num_idx2 ON ziggy_json z (z.ziggy_order.PONumber);

Index created.

SQL> exec dbms_stats.gather_table_stats(ownname=>null, tabname=>'ZIGGY_JSON');

PL/SQL procedure successfully completed.

A query based on the associated JSON .dot notation can now run efficiently via the index:

SQL> SELECT * FROM ziggy_json z where z.ziggy_order.PONumber='42';

Elapsed: 00:00:00.01

Execution Plan
----------------------------------------------------------
Plan hash value: 4224387816

---------------------------------------------------------------------------------------------------------
| Id | Operation                           | Name              | Rows | Bytes| Cost (%CPU) | Time     |
---------------------------------------------------------------------------------------------------------
| 0  | SELECT STATEMENT                    |                   |    2 | 3062 |       6 (0) | 00:00:01 |
| 1  | TABLE ACCESS BY INDEX ROWID BATCHED | ZIGGY_JSON        |    2 | 3062 |       6 (0) | 00:00:01 |
|* 2 | INDEX RANGE SCAN                    | ZIGGY_PO_NUM_IDX2 |    2 |      |       3 (0) | 00:00:01 |
---------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------

2 - access(JSON_QUERY("ZIGGY_ORDER" FORMAT JSON , '$.PONumber' RETURNING VARCHAR2(4000) ASIS
WITHOUT ARRAY WRAPPER NULL ON ERROR)='42')

Statistics
----------------------------------------------------------
   0 recursive calls
   0 db block gets
   9 consistent gets
   0 physical reads
   0 redo size
1863 bytes sent via SQL*Net to client
1088 bytes received via SQL*Net from client
   6 SQL*Net roundtrips to/from client
   0 sorts (memory)
   0 sorts (disk)
   2 rows processed

The query again uses the index and is just as efficient, with almost immediate response times at just 9 consistent gets.

JSON indexes can also be used to police and ensure data constraints and integrity (which can be problematic with JSON documents). The following numeric index example also implicitly adds data constraint capabilities:

SQL> CREATE INDEX ziggy_po_num_idx3 ON ziggy_json z (to_number(z.ziggy_order.PONumber));

Index created.

SQL> exec dbms_stats.gather_table_stats(ownname=>null, tabname=>'ZIGGY_JSON');

PL/SQL procedure successfully completed.

SQL> insert into ziggy_json
2 select
3 rownum,
4 SYSdate,
5 '{"PONumber" : "200000A",
6 "Reference" : "DBOWIE-2016' || rownum || '",
7 "Requestor" : "David Bowie",
8 "User" : "DBOWIE",
9 "CostCenter" : "A42",
10 "ShippingInstructions" : {"name" : "David Bowie",
11 "Address": {"street" : "42 Ziggy Street",
12 "city" : "Canberra",
13 "state" : "ACT",
14 "zipCode" : 2601,
15 "country" : "Australia"},
16 "Phone" : [{"type" : "Office", "number" :"417-555-7777"},
17 {"type" : "Mobile", "number" :"417-555-1234"}]},
18 "Special Instructions" : null,
19 "AllowPartialShipment" : true,
20 "LineItems" : [{"ItemNumber" : 1,
21 "Part" : {"Description" : "Hunky Dory",
22 "UnitPrice" : 10.95},
23 "Quantity" : 5.0},
24 {"ItemNumber" : 2,
25 "Part" : {"Description" : "Pin-Ups",
26 "UnitPrice" : 10.95},
27 "Quantity" : 3.0}]}'
28 from dual;
insert into ziggy_json
*
ERROR at line 1:
ORA-01722: invalid number

The PONumber value now has to be numeric for it to be successfully added to the JSON document store. The index of course can also be used for efficient data access:

SQL> SELECT * FROM ziggy_json z where to_number(z.ziggy_order.PONumber)=42;

Elapsed: 00:00:00.01

Execution Plan
----------------------------------------------------------
Plan hash value: 692052820

---------------------------------------------------------------------------------------------------------
| Id | Operation                           | Name              | Rows | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------------------
| 0  | SELECT STATEMENT                    |                   |    2 |  3048 |      5 (0) | 00:00:01 |
| 1  | TABLE ACCESS BY INDEX ROWID BATCHED | ZIGGY_JSON        |    2 |  3048 |      5 (0) | 00:00:01 |
|* 2 | INDEX RANGE SCAN                    | ZIGGY_PO_NUM_IDX3 |    2 |       |      3 (0) | 00:00:01 |
---------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------

2 - access(TO_NUMBER(JSON_QUERY("ZIGGY_ORDER" FORMAT JSON , '$.PONumber' RETURNING
VARCHAR2(4000) ASIS WITHOUT ARRAY WRAPPER NULL ON ERROR))=42)

Statistics
----------------------------------------------------------
   0 recursive calls
   0 db block gets
   9 consistent gets
   0 physical reads
   0 redo size
1863 bytes sent via SQL*Net to client
1088 bytes received via SQL*Net from client
   6 SQL*Net roundtrips to/from client
   0 sorts (memory)
   0 sorts (disk)
   2 rows processed

In Part II, I’ll look at how to create composite JSON indexes and how to use a text index to automatically index all name fields within a JSON document.

Storing Date Values As Numbers (The Numbers) June 1, 2016

Posted by Richard Foote in 12c, CBO, Histograms, Oracle Indexes, Storing Dates As Numbers.

In my last couple of posts, I’ve been discussing how storing date data in a character based column is a really really bad idea.

In a follow-up question, I was asked if storing dates in NUMBER format was a better option. The answer is that it’s probably an improvement on storing dates as strings but it’s still a really really bad idea. Storing dates in DATE format is easily the best option, as is storing any data in its native data type.

In this post, I’ll highlight a few of the classic issues with storing dates in basic number format as well as showing you some of the calculations on the CBO cardinality estimates.

As usual, the demo starts with a basic little table that I’ll populate with date data stored in a NUMBER column (ZIGGY_DATE):

SQL> create table ziggy (id number, code number, ziggy_date number);
    
Table created.

SQL> insert into ziggy select rownum, mod(rownum,1000), 
to_number(to_char(sysdate-mod(rownum,10000), 'YYYYMMDD')) 
from dual connect by level <=1000000;

1000000 rows created.

SQL> commit;

Commit complete.

We’ll now collect statistics on the table:

SQL> exec dbms_stats.gather_table_stats(ownname=>null, tabname=>'ZIGGY');

PL/SQL procedure successfully completed.

SQL> select column_name, num_distinct, density, histogram, hidden_column, virtual_column from dba_tab_cols where table_name='ZIGGY';

COLUMN_NAME NUM_DISTINCT    DENSITY HISTOGRAM       HID VIR
----------- ------------ ---------- --------------- --- ---
ZIGGY_DATE         10000      .0001 NONE            NO  NO
CODE                1000       .001 NONE            NO  NO
ID               1000000    .000001 NONE            NO  NO

So the ZIGGY_DATE column has 10,000 distinct dates (with 100 rows per distinct date), with a column density of 1/10000 = 0.0001.

Let’s now create a standard B-Tree index on the ZIGGY_DATE column:

SQL> create index ziggy_date_i on ziggy(ziggy_date);
                  
Index created.

If we look at a sample of the data in the column and the min/max date ranges:

SQL> select * from ziggy where rownum <11;
        
        ID       CODE ZIGGY_DATE
---------- ---------- ----------
       776        776   20140412
       777        777   20140411
       778        778   20140410
       779        779   20140409
       780        780   20140408
       781        781   20140407
       782        782   20140406
       783        783   20140405
       784        784   20140404
       785        785   20140403

SQL> select min(ziggy_date) min, max(ziggy_date) max from ziggy;

       MIN        MAX
---------- ----------
  19890110   20160527

We see that all the data in the ZIGGY_DATE column are just number representations of dates, with a range between 10 Jan 1989 and 27 May 2016.

Note there are actually only 10,000 distinct days between these dates, but the CBO would estimate a range of 270,417 possible values (20160527 – 19890110 = 270,417). The CBO has no idea that the “numbers” within the column are all dates and that there are ranges of values in which data is relatively popular (e.g. between 20160101 and 20160131) and ranges of values in which data is relatively unpopular (e.g. between 20154242 and 20159999).
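
The gap between the two counts is easy to check outside the database. The following is just a little Python sketch (not part of the original demo, and the variable names are my own) that compares the number of real calendar days in the range with the span the values appear to cover when read as plain numbers:

```python
from datetime import date

# Min and max values of ZIGGY_DATE as reported by the demo query.
low_num, high_num = 19890110, 20160527

# The same two values interpreted as real calendar dates.
low_date, high_date = date(1989, 1, 10), date(2016, 5, 27)

# Distinct calendar days in the range, counting both end points.
actual_days = (high_date - low_date).days + 1

# The span seen when the values are treated as plain numbers.
numeric_span = high_num - low_num

print(actual_days)    # 10000
print(numeric_span)   # 270417
```

So roughly 27 out of every 28 values in the numeric range can never hold a row.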

Although not as bad as the range of possible unpopular values found within a character data type as I discussed previously when storing date data as a string, there is still enough data skew when storing dates as numbers to be problematic to the CBO.

If we select just one date with an equality predicate:

SQL> select * from ziggy where ziggy_date = 20150613;
                 
100 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 2700236208

----------------------------------------------------------------------------------------------------
| Id | Operation                           | Name         | Rows | Bytes | Cost (%CPU) | Time     |
----------------------------------------------------------------------------------------------------
| 0  | SELECT STATEMENT                    |              |  100 |  1500 |     103 (0) | 00:00:01 |
| 1  | TABLE ACCESS BY INDEX ROWID BATCHED | ZIGGY        |  100 |  1500 |     103 (0) | 00:00:01 |
|* 2 | INDEX RANGE SCAN                    | ZIGGY_DATE_I |  100 |       |       3 (0) | 00:00:01 |
----------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - access("ZIGGY_DATE"=20150613)

Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
110 consistent gets
0 physical reads
0 redo size
3883 bytes sent via SQL*Net to client
618 bytes received via SQL*Net from client
8 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
100 rows processed

The CBO gets things spot on, correctly estimating 100 rows to be returned, as the CBO knows there are only 10,000 distinct values of which only one of those values is being selected.

Selectivity is basically the density of the column = 1/10000 = 0.0001, so the estimated cardinality is 0.0001 x 1M rows = 100 rows. Perfect.

However, if we perform a range based query as follows:

SQL> select * from ziggy where ziggy_date between 20151010 and 20151111;
     
3300 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 2700236208

----------------------------------------------------------------------------------------------------
| Id | Operation                           | Name         | Rows | Bytes | Cost (%CPU) | Time     |
----------------------------------------------------------------------------------------------------
| 0  | SELECT STATEMENT                    |              |  573 |  8595 |     580 (1) | 00:00:01 |
| 1  | TABLE ACCESS BY INDEX ROWID BATCHED | ZIGGY        |  573 |  8595 |     580 (1) | 00:00:01 |
|* 2 | INDEX RANGE SCAN                    | ZIGGY_DATE_I |  573 |       |       4 (0) | 00:00:01 |
----------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - access("ZIGGY_DATE">=20151010 AND "ZIGGY_DATE"<=20151111)

Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
3531 consistent gets
0 physical reads
0 redo size
108973 bytes sent via SQL*Net to client
2961 bytes received via SQL*Net from client
221 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
3300 rows processed

The CBO has got things somewhat incorrect in this example and has underestimated the expected number of rows (573 rows vs. the 3,300 rows actually returned).

The actual number of days between these dates is 33 so the actual ratio of data returned is 33/10000 x 1M rows = 3,300 rows. This is a range of “numbers” that overall covers a relatively “popular” range of  date values.

However, Oracle is estimating a range of some 20151111 – 20151010 = 101 days between these dates. As the total range of possible days is 20160527 – 19890110 = 270,417, the estimated ratio of returned rows is 101/270417 plus 2 x the selectivity of a single day for the 2 implicit equality conditions (as a between is effectively >= and <=). The selectivity of one day is just the density of the column, 0.0001, as illustrated in the previous query.

Therefore, the query selectivity is derived as being (101/270417) + (2 x 0.0001) = 0.000573 when multiplied by 1M rows = 573 rows as estimated by the CBO.
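
The arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the formula as described in this post, not Oracle's actual code: the function name `range_cardinality` and its parameters are hypothetical, and rounding to the nearest row is my assumption.

```python
def range_cardinality(low, high, col_min, col_max, density, num_rows):
    """Estimated rows for 'col BETWEEN low AND high' without a histogram."""
    # Fraction of the column's min..max span covered by the requested range.
    range_fraction = (high - low) / (col_max - col_min)
    # BETWEEN is >= and <=, so add one density per closed end point.
    selectivity = range_fraction + 2 * density
    return round(selectivity * num_rows)

estimate = range_cardinality(
    low=20151010, high=20151111,
    col_min=19890110, col_max=20160527,
    density=0.0001, num_rows=1_000_000,
)
print(estimate)  # 573
```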

So the CBO is rather significantly *under* estimating the rows to be returned which could result in a sub-optimal execution plan (such as the inappropriate use of an index range scan as in this example, noting the poor clustering of the data).

If we now look at another range scan below:

SQL> select * from ziggy where ziggy_date between 20151225 and 20160101;
    
800 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 2421001569

---------------------------------------------------------------------------
| Id | Operation         | Name  | Rows  | Bytes | Cost (%CPU) | Time     |
---------------------------------------------------------------------------
| 0  | SELECT STATEMENT  |       | 33023 |  483K |    810 (15) | 00:00:01 |
|* 1 | TABLE ACCESS FULL | ZIGGY | 33023 |  483K |    810 (15) | 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter("ZIGGY_DATE">=20151225 AND "ZIGGY_DATE"<=20160101)

Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
2824 consistent gets
0 physical reads
0 redo size
23850 bytes sent via SQL*Net to client
1135 bytes received via SQL*Net from client
55 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
800 rows processed

The actual number of days between these dates is only 8 so the actual ratio of data returned is 8/10000 x 1M rows = 800 rows. This is a range of “numbers” that overall covers a relatively “unpopular” range of date values.

However Oracle is estimating a range of some 20160101 – 20151225 = 8876 days between these dates. As the total range of possible days is 20160527-19890110 = 270,417, the estimated ratio of returned rows is 8876/270417 plus 2 x the selectivity of a single day again for the 2 implicit equality conditions.

Therefore, the query selectivity is derived as being (8876/270417) + (2 x 0.0001) = 0.033023 when multiplied by 1M rows = 33,023 rows as estimated by the CBO.

So the CBO is rather significantly *over* estimating the rows to be returned which could again result in a sub-optimal execution plan (such as the inappropriate use of a Full Table Scan in this example). The CBO is simply not picking up the fact that most of the possible values between the “number” ranges aren’t valid dates and can’t possibly exist.
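
To see just how sparse the valid dates are within these “number” ranges, here is a small Python sketch (again my own illustration, with hypothetical helper names) that counts how many integers in a range are real YYYYMMDD dates:

```python
from datetime import date

def is_valid_yyyymmdd(n):
    """True if the integer n, read as YYYYMMDD, is a real calendar date."""
    year, month, day = n // 10000, n // 100 % 100, n % 100
    try:
        date(year, month, day)
        return True
    except ValueError:
        return False

def count_range(low, high):
    """Return (integers in low..high, how many of them are valid dates)."""
    total = high - low + 1
    valid = sum(is_valid_yyyymmdd(n) for n in range(low, high + 1))
    return total, valid

print(count_range(20151225, 20160101))  # (8877, 8)
print(count_range(20151010, 20151111))  # (102, 33)
```

Only 8 of the 8,877 numbers in the first range are dates, whereas 33 of the 102 numbers in the second range are, which is why one estimate is far too high and the other too low.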

Of course, having dates stored as simple numbers means Oracle has no way of ensuring data integrity and can allow “invalid” dates to be inserted:

SQL> insert into ziggy values (1000001, 42, 20160599);
            
1 row created.

SQL> rollback;

Rollback complete.

As with dates stored as strings, we can again address these issues by either collecting histograms for such columns and/or by creating a function-based date index on the column:

SQL> create index ziggy_date_fn_i on ziggy(to_date(ziggy_date,'YYYYMMDD'));

Index created.

SQL> exec dbms_stats.gather_table_stats(ownname=>null, tabname=>'ZIGGY');

PL/SQL procedure successfully completed.

SQL> select column_name, num_distinct, density, histogram, hidden_column, virtual_column from dba_tab_cols where table_name='ZIGGY';

COLUMN_NAME  NUM_DISTINCT    DENSITY HISTOGRAM       HID VIR
------------ ------------ ---------- --------------- --- ---
SYS_NC00004$        10000      .0001 NONE            YES YES
ZIGGY_DATE          10000      .0001 HYBRID          NO  NO
CODE                 1000       .001 NONE            NO  NO
ID                1000000    .000001 NONE            NO  NO

The associated query with the equality predicate has accurate estimates as it did previously:

SQL> select * from ziggy where to_date(ziggy_date, 'YYYYMMDD') = '13-JUN-2015';

100 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 945728471

-------------------------------------------------------------------------------------------------------
| Id | Operation                           | Name            | Rows | Bytes  | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------------------
| 0  | SELECT STATEMENT                    |                 |  100 |   2300 |     103 (0)| 00:00:01 |
| 1  | TABLE ACCESS BY INDEX ROWID BATCHED | ZIGGY           |  100 |   2300 |     103 (0)| 00:00:01 |
|* 2 | INDEX RANGE SCAN                    | ZIGGY_DATE_FN_I |  100 |        |       3 (0)| 00:00:01 |
-------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - access(TO_DATE(TO_CHAR("ZIGGY_DATE"),'YYYYMMDD')=TO_DATE(' 2015-06-13 00:00:00',
'syyyy-mm-dd hh24:mi:ss'))

Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
111 consistent gets
0 physical reads
0 redo size
2877 bytes sent via SQL*Net to client
618 bytes received via SQL*Net from client
8 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
100 rows processed

As the virtual column created for the function-based index also has 10,000 distinct values and a corresponding density of 0.0001, the CBO is getting the cardinality estimate of 100 rows spot on.

But importantly, both associated range based queries are now also being accurately costed by the CBO as it now knows the data being searched is date based and hence can more accurately determine the actual expected dates to be returned within the specified “date” ranges.

SQL> select * from ziggy where to_date(ziggy_date, 'YYYYMMDD') between '10-OCT-2015' and '11-NOV-2015';

3300 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 2421001569

---------------------------------------------------------------------------
| Id | Operation         | Name  | Rows | Bytes | Cost (%CPU) | Time     |
---------------------------------------------------------------------------
| 0  | SELECT STATEMENT  |       | 3400 | 78200 |   1061 (35) | 00:00:01 |
|* 1 | TABLE ACCESS FULL | ZIGGY | 3400 | 78200 |   1061 (35) | 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter(TO_DATE(TO_CHAR("ZIGGY_DATE"),'YYYYMMDD')>=TO_DATE('
2015-10-10 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
TO_DATE(TO_CHAR("ZIGGY_DATE"),'YYYYMMDD')<=TO_DATE(' 2015-11-11
00:00:00', 'syyyy-mm-dd hh24:mi:ss'))

Statistics
----------------------------------------------------------
8 recursive calls
0 db block gets
2991 consistent gets
0 physical reads
0 redo size
95829 bytes sent via SQL*Net to client
2961 bytes received via SQL*Net from client
221 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
3300 rows processed

The CBO is now estimating not 573 rows, but 3,400 rows which is much closer to the actual 3,300 rows being returned. As a result, the CBO is now performing a more efficient Full Table Scan (due to the poor Clustering Factor of the index) than the Index Range Scan performed previously.

If we look at the other range scan query:

SQL> select * from ziggy where to_date(ziggy_date, 'YYYYMMDD') between '25-DEC-2015' and '01-JAN-2016';

800 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 945728471

-------------------------------------------------------------------------------------------------------
| Id | Operation                           | Name            | Rows | Bytes | Cost (%CPU) | Time     |
-------------------------------------------------------------------------------------------------------
| 0  | SELECT STATEMENT                    |                 |  900 | 20700 |      909 (1)| 00:00:01 |
| 1  | TABLE ACCESS BY INDEX ROWID BATCHED | ZIGGY           |  900 | 20700 |      909 (1)| 00:00:01 |
|* 2 | INDEX RANGE SCAN                    | ZIGGY_DATE_FN_I |  900 |       |        5 (0)| 00:00:01 |
-------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - access(TO_DATE(TO_CHAR("ZIGGY_DATE"),'YYYYMMDD')>=TO_DATE(' 2015-12-25 00:00:00',
'syyyy-mm-dd hh24:mi:ss') AND TO_DATE(TO_CHAR("ZIGGY_DATE"),'YYYYMMDD')<=TO_DATE(' 2016-01-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))

Statistics
----------------------------------------------------------
8 recursive calls
0 db block gets
861 consistent gets
7 physical reads
0 redo size
18917 bytes sent via SQL*Net to client
1135 bytes received via SQL*Net from client
55 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
800 rows processed

The CBO is now estimating not 33,023 rows but 900 rows, which is again much closer to the actual 800 rows being returned. As a result, the CBO is now performing a more efficient Index Range Scan than the Full Table Scan it was performing previously.
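
Both new estimates are consistent with the range formula used earlier, provided the spans are measured in calendar days rather than as differences between numbers. The Python sketch below assumes that the CBO applies the same formula to the virtual date column and rounds to the nearest row. The function name is hypothetical, and the min/max dates are simply the numeric min/max from earlier read as dates.

```python
from datetime import date

def date_range_cardinality(low, high, col_min, col_max, density, num_rows):
    """Range estimate with every span measured in calendar days."""
    range_fraction = (high - low).days / (col_max - col_min).days
    # Two closed end points (>= and <=), each adding one density.
    selectivity = range_fraction + 2 * density
    return round(selectivity * num_rows)

col_min, col_max = date(1989, 1, 10), date(2016, 5, 27)

oct_nov = date_range_cardinality(date(2015, 10, 10), date(2015, 11, 11),
                                 col_min, col_max, 0.0001, 1_000_000)
xmas = date_range_cardinality(date(2015, 12, 25), date(2016, 1, 1),
                              col_min, col_max, 0.0001, 1_000_000)
print(oct_nov, xmas)  # 3400 900
```

The slight overestimates (3,400 vs. 3,300 and 900 vs. 800) are what you would expect if both end-point densities are added on top of a range fraction that already covers the whole range.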

And of course, the database via the function-based date index now has a manner in which to protect the integrity of the date data:

SQL> insert into ziggy values (1000001, 42, 20160599);
insert into ziggy values (1000001, 42, 20160599)
*
ERROR at line 1:
ORA-01847: day of month must be between 1 and last day of month

However, the best way in Oracle to store “Date” data is within a Date data type column …

Storing Date Values As Characters Part II (A Better Future) May 30, 2016

Posted by Richard Foote in 12c, CBO, Function Based Indexes, Oracle Indexes, Storing Dates as Characters.

In the previous post, I discussed how storing date values within a character data type is a really really bad idea and illustrated how the CBO can easily get its costings totally wrong as a result. A function-based date index helped the CBO get the correct costings and protect the integrity of the date data.

During the demo, I re-collected statistics on the table, as the hidden virtual column associated with the function-based index doesn’t have statistics immediately after the index is created.

Before re-collecting statistics:

SQL> select column_name, num_distinct, density, histogram, hidden_column, virtual_column from dba_tab_cols where table_name='BOWIE';

COLUMN_NAME  NUM_DISTINCT    DENSITY HISTOGRAM       HID VIR
------------ ------------ ---------- --------------- --- ---
SYS_NC00004$                         NONE            YES YES
BOWIE_DATE          10000      .0001 NONE            NO  NO
CODE                 1000       .001 NONE            NO  NO
ID                1000000    .000001 NONE            NO  NO

And afterwards:

SQL> exec dbms_stats.gather_table_stats(ownname=>null, tabname=>'BOWIE');

PL/SQL procedure successfully completed. 

SQL> select column_name, num_distinct, density, histogram, hidden_column, virtual_column 
from dba_tab_cols where table_name='BOWIE'; 

COLUMN_NAME  NUM_DISTINCT    DENSITY HISTOGRAM       HID VIR
------------ ------------ ---------- --------------- --- --- 
SYS_NC00004$        10000      .0001 NONE            YES YES 
BOWIE_DATE          10000      .0001 HYBRID          NO  NO 
CODE                 1000       .001 NONE            NO  NO 
ID                1000000    .000001 NONE            NO  NO

 

We can see that the hidden virtual column now has statistics.

But we also notice another difference, that being the BOWIE_DATE column now has a histogram (of type Hybrid).

As discussed in the previous post, the issue here is that the date data within the character column covers only a very specific subset of all the potential character values that could reside within the column. Therefore the CBO is getting the range scan selectivity hopelessly incorrect.

Now that we’ve run a few queries featuring the BOWIE_DATE column in the predicates and as there’s effectively data skew within the column, the column becomes a candidate for a histogram with the default SIZE AUTO collection method.

The histogram now provides the CBO with a much more accurate picture of the distribution of the data within the BOWIE_DATE column, showing that between discrete “date” column values, there are only so many rows that qualify.

As a result of the histogram, the CBO can now make much more accurate cardinality estimates.

If we now re-run the query that actually returns 8,300 rows, but for which the CBO previously estimated only 100 rows would be returned:

SQL> select * from bowie where bowie_date between '2015 10 10' and '2015 12 31';
                                             
8300 rows selected.
                                                            
Execution Plan
----------------------------------------------------------
Plan hash value: 1845943507

---------------------------------------------------------------------------
| Id | Operation         | Name  | Rows | Bytes | Cost (%CPU) | Time     |
---------------------------------------------------------------------------
| 0  | SELECT STATEMENT  |       | 4152 | 83040 |   1000 (12) | 00:00:01 |
|* 1 | TABLE ACCESS FULL | BOWIE | 4152 | 83040 |   1000 (12) | 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter("BOWIE_DATE">='2015 10 10' AND "BOWIE_DATE"<='2015 12 31')

Statistics
----------------------------------------------------------
14 recursive calls
0 db block gets
4063 consistent gets
0 physical reads
0 redo size
282075 bytes sent via SQL*Net to client
6635 bytes received via SQL*Net from client
555 SQL*Net roundtrips to/from client
1 sorts (memory)
0 sorts (disk)
8300 rows processed

We see that at an estimated 4,152 rows, it’s a much better estimate. Not perfect, but maybe good enough to now get the more efficient Full Table Scan execution plan.

If we re-run the query that returned over 1/2 the table at some 570,000 rows but with the CBO previously estimating only 116 rows:

SQL> select * from bowie where bowie_date between '2000 10 10' and '2016 12 31';
  
570800 rows selected.
    
Execution Plan
----------------------------------------------------------
Plan hash value: 1845943507

---------------------------------------------------------------------------
| Id | Operation         | Name  | Rows | Bytes | Cost (%CPU) | Time     |
---------------------------------------------------------------------------
| 0  | SELECT STATEMENT  |       | 572K |   10M |   1012 (13) | 00:00:01 |
|* 1 | TABLE ACCESS FULL | BOWIE | 572K |   10M |   1012 (13) | 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter("BOWIE_DATE">='2000 10 10' AND "BOWIE_DATE"<='2016 12 31')
   
Statistics
----------------------------------------------------------
14 recursive calls
0 db block gets
41456 consistent gets
4 physical reads
0 redo size
19292352 bytes sent via SQL*Net to client
419135 bytes received via SQL*Net from client
38055 SQL*Net roundtrips to/from client
1 sorts (memory)
0 sorts (disk)
570800 rows processed

 

We see that at an estimate of 572K rows, it’s now got this just about right and again has made the right decision with the Full Table Scan execution plan.

Storing date data in character based columns is still a really really bad idea and limits the manner in which date data can be analysed, protected and accessed, but with appropriate histograms in place, at least the CBO has some chance of making a reasonable fist of things with some range based queries.

As a follow-up, I was asked if storing dates in NUMBER format is a better option than as a string. I’ll discuss that next.

Storing Date Values As Characters (What’s Really Happening) May 26, 2016

Posted by Richard Foote in 12c, Function Based Indexes, Oracle Indexes, Storing Dates as Characters.

For something that’s generally considered an extremely bad idea, I’ve lost count of the number of times I’ve come across applications that insist on storing date values as characters within the database. We’ve all seen them …

I recently got called in to assist a customer who was having issues with a POC in which the database didn’t appear to want to use the Database In-Memory option as expected. In various key scenarios, the CBO kept coming up with execution plans that used indexes (they were hesitant to drop these particular indexes), when the queries ran so much faster if only the Database In-Memory store was used. So I was called in to find out what’s really happening and it turned out that the main culprit was indeed queries against columns where dates were stored as characters within the database. In the process, we found another issue with some “invalid” date values. Go figure.

Interestingly, both issues could be addressed by creating a new index on the date column …

I’ve kinda replicated the scenario here with the following little test case. I’ll begin by creating a table with a varchar2 field (bowie_date) that actually stores a whole bunch of “dates”:

SQL> create table bowie (id number, code number, bowie_date varchar2(42));
Table created

SQL> insert into bowie select rownum, mod(rownum,1000), to_char(sysdate-mod(rownum,10000), 'YYYY MM DD') from dual connect by level <=1000000;

1000000 rows created.

SQL> commit;

Commit complete.

So the bowie_date column basically has 10000 different dates, with each date evenly distributed with 100 occurrences for each date.
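
This even distribution follows directly from the mod(rownum,10000) expression in the insert. As a quick sanity check outside the database, the following Python sketch (my own illustration, not part of the demo) counts how often each day offset occurs:

```python
from collections import Counter

# Mirror of the insert: rownum runs 1..1,000,000 and each row is dated
# sysdate - mod(rownum, 10000), so the day offset identifies the date.
offsets = Counter(rownum % 10000 for rownum in range(1, 1_000_001))

print(len(offsets))            # 10000 distinct dates
print(set(offsets.values()))   # {100}: every date occurs 100 times
```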

I’ll now collect statistics on the table:

SQL> exec dbms_stats.gather_table_stats(ownname=>null, tabname=>'BOWIE');
PL/SQL procedure successfully completed.

SQL> select column_name, num_distinct, density, histogram, hidden_column, virtual_column from dba_tab_cols where table_name='BOWIE';

COLUMN_NAME  NUM_DISTINCT    DENSITY HISTOGRAM       HID VIR
------------ ------------ ---------- --------------- --- ---
BOWIE_DATE          10000      .0001 NONE            NO  NO
CODE                 1000       .001 NONE            NO  NO
ID                1000000    .000001 NONE            NO  NO

 

So the bowie_date column indeed has 10000 distinct dates.

I’ll now create a standard b-tree index on this column:

SQL> create index bowie_date_i on bowie(bowie_date);
Index created.

So the data looks as follows with the bowie_date a varchar2 column that actually contains date data, with the following min/max ranges:

SQL> select * from bowie where rownum <11;

        ID       CODE BOWIE_DATE
---------- ---------- ----------
       916        916 2013 11 22
       917        917 2013 11 21
       918        918 2013 11 20
       919        919 2013 11 19
       920        920 2013 11 18
       921        921 2013 11 17
       922        922 2013 11 16
       923        923 2013 11 15
       924        924 2013 11 14
       925        925 2013 11 13

10 rows selected.

SQL> select min(bowie_date) min, max(bowie_date) max from bowie;

MIN        MAX
---------- ----------
1989 01 09 2016 05 26

If we run a query that uses an equality predicate as follows:

SQL> select * from bowie where bowie_date = '2015 06 13';
100 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 1525056162

----------------------------------------------------------------------------------------------------
| Id | Operation                           | Name         | Rows | Bytes | Cost (%CPU) | Time     |
----------------------------------------------------------------------------------------------------
| 0  | SELECT STATEMENT                    |              |  100 |  2000 |     103 (0) | 00:00:01 |
| 1  | TABLE ACCESS BY INDEX ROWID BATCHED | BOWIE        |  100 |  2000 |     103 (0) | 00:00:01 |
|* 2 | INDEX RANGE SCAN                    | BOWIE_DATE_I |  100 |       |       3 (0) | 00:00:01 |
----------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - access("BOWIE_DATE"='2015 06 13')

Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
110 consistent gets
16 physical reads
0 redo size
4383 bytes sent via SQL*Net to client
618 bytes received via SQL*Net from client
8 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
100 rows processed 

The CBO gets things just about right. 100 rows are estimated and indeed 100 rows are retrieved. So we have confidence the CBO has made the right decision in using the index here as so few rows are actually retrieved.

However, if we run a range scan predicate such as the following:

SQL> select * from bowie where bowie_date between '2015 10 10' and '2015 12 31';

8300 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 1525056162

----------------------------------------------------------------------------------------------------
| Id | Operation                          | Name         | Rows | Bytes | Cost (%CPU)|     Time |
----------------------------------------------------------------------------------------------------
| 0  | SELECT STATEMENT                   |              |  100 | 2000  |     104 (0)| 00:00:01 |
| 1  | TABLE ACCESS BY INDEX ROWID BATCHED| BOWIE        |  100 | 2000  |     104 (0)| 00:00:01 |
|*2  | INDEX RANGE SCAN                   | BOWIE_DATE_I |  100 |       |       3 (0)| 00:00:01 |
----------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - access("BOWIE_DATE">='2015 10 10' AND "BOWIE_DATE"<='2015 12 31')

Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
8881 consistent gets
38 physical reads
0 redo size
315219 bytes sent via SQL*Net to client
6635 bytes received via SQL*Net from client
555 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
8300 rows processed

The CBO has got the costings wrong here. It still estimates that only 100 rows will be returned when in actual fact 8300 rows come back.

If we select an even larger “date” range:

SQL> select * from bowie where bowie_date between '2000 10 10' and '2016 12 31';

570800 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 1525056162

----------------------------------------------------------------------------------------------------
| Id | Operation                          | Name         | Rows | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------------
|  0 | SELECT STATEMENT                   |              |  116 |  2320 |     120 (0)| 00:00:01 |
|  1 | TABLE ACCESS BY INDEX ROWID BATCHED| BOWIE        |  116 |  2320 |     120 (0)| 00:00:01 |
|* 2 | INDEX RANGE SCAN                   | BOWIE_DATE_I |  116 |       |       3 (0)| 00:00:01 |
----------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - access("BOWIE_DATE">='2000 10 10' AND "BOWIE_DATE"<='2016 12 31')

Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
610491 consistent gets
1704 physical reads
0 redo size
21575496 bytes sent via SQL*Net to client
419135 bytes received via SQL*Net from client
38055 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
570800 rows processed

The CBO has got things seriously wrong here. We’re actually returning over 1/2 the table, some 570,800 rows but the CBO thinks only 116 rows will be returned. Why ?

The problem comes back to storing date values as characters. The CBO has absolutely no idea that these “characters” are actually meant to be dates and has no idea that the only valid ranges of values are date values.

With a possible range between “1989 01 09” and “2016 05 26”, any combination of characters (up to the column size of 42) could potentially exist in this column (the value ‘1zxgs$.jKN6tglhasgdlhlhd23bkbk?k’ for example).

So the required range between ‘2000 10 10’ and ‘2016 12 31’ actually represents a relatively narrow range of possible values within the range of all possible values (especially as the leading character differs between the min/max).

Hence why the CBO is estimating such a low number of rows to be returned and hence why the CBO is deciding to incorrectly use the index. “Hiding” the meaning and distribution of values from the CBO in this manner can be problematic to say the least.
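To get a feel for just how narrow that range looks once the values are treated as plain characters, here is a rough Python sketch. It is only a toy model: the idea of mapping the first 15 bytes of each string to one big base-256 number is an assumption made for illustration, and the helper names as_number and range_fraction are hypothetical. It is not a description of the CBO's actual internal arithmetic.

```python
# Toy model (an assumption, not Oracle's documented algorithm): treat the
# first 15 bytes of each string as one big base-256 number, then measure how
# much of the min..max interval the requested range covers.

def as_number(text: str, width: int = 15) -> int:
    """Map a string to an integer using its first `width` bytes."""
    raw = text.encode("ascii")[:width].ljust(width, b"\x00")
    return int.from_bytes(raw, "big")

def range_fraction(low: str, high: str, col_min: str, col_max: str) -> float:
    """Fraction of the column's numeric span covered by [low, high]."""
    lo = max(as_number(low), as_number(col_min))   # clip to known minimum
    hi = min(as_number(high), as_number(col_max))  # clip to known maximum
    span = as_number(col_max) - as_number(col_min)
    return max(hi - lo, 0) / span

NUM_ROWS = 1_000_000
fraction = range_fraction("2000 10 10", "2016 12 31", "1989 01 09", "2016 05 26")
rows_from_range = fraction * NUM_ROWS
print(f"range covers {fraction:.6%} of the span, about {rows_from_range:.0f} rows")
```

Under these assumptions the requested range covers only a tiny sliver of the character "span", accounting for a handful of rows out of 1,000,000, whereas the query really returns 570,800. Whatever small per-value allowance gets added on top, the estimate stays in the low hundreds.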

Worse of course is also the possibility of “invalid” dates being entered as the database has no implicit way to police the integrity of the data:

SQL> insert into bowie values (1000001, 42, '2016 6 31');

1 row created.

SQL> rollback;

Rollback complete.

We know there’s no such date as 31st June but Oracle has no idea that this is logically invalid data. Or the value ‘lh;ghsgdsd7gdGLH96bb’ for that matter …

Did I mention that storing dates in a character column is a really really bad idea …
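If the column really must stay as a character string, the application is left to do the validation that a DATE column would otherwise do for free. A minimal Python sketch of such a check follows; the function name is_valid_date is hypothetical and the format string simply mirrors the 'YYYY MM DD' layout used in this table.

```python
from datetime import datetime

def is_valid_date(text: str, fmt: str = "%Y %m %d") -> bool:
    """Return True only if `text` parses as a real calendar date."""
    try:
        datetime.strptime(text, fmt)
        return True
    except ValueError:
        # Raised for non-date text and for impossible dates such as 31 June.
        return False

for candidate in ("2015 06 13", "2016 6 31", "lh;ghsgdsd7gdGLH96bb"):
    print(candidate, is_valid_date(candidate))
```

The catch, of course, is that every code path that writes to the column has to remember to call something like this, which is exactly the sort of thing the database should be policing.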

Now there are a couple of ways to help address these issues if changing the column and its datatype is not possible due to the application. One way is to create a function-based index as follows on a date version of the column:

SQL> create index bowie_date_fn_i on bowie(to_date(bowie_date,'YYYY MM DD'));

Index created.

If we run the equality query but this time referencing the to_date function:

SQL> select * from bowie where to_date(bowie_date, 'YYYY MM DD') = '13-JUN-2015';

100 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 960797537

-------------------------------------------------------------------------------------------------------
| Id | Operation                           | Name            | Rows  | Bytes  | Cost (%CPU)|     Time |
-------------------------------------------------------------------------------------------------------
| 0  | SELECT STATEMENT                    |                 | 10000 |   273K |     893 (2)| 00:00:01 |
| 1  | TABLE ACCESS BY INDEX ROWID BATCHED | BOWIE           | 10000 |   273K |     893 (2)| 00:00:01 |
|* 2 | INDEX RANGE SCAN                    | BOWIE_DATE_FN_I |  4000 |        |       3 (0)| 00:00:01 |
-------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - access(TO_DATE("BOWIE_DATE",'YYYY MM DD')=TO_DATE(' 2015-06-13 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))

Statistics
----------------------------------------------------------
12 recursive calls
0 db block gets
113 consistent gets
16 physical reads
0 redo size
3268 bytes sent via SQL*Net to client
618 bytes received via SQL*Net from client
8 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
100 rows processed

We notice the CBO has got the row estimate way wrong here, thinking that 10000 rows, not 100 rows are to be returned. Why ?

Remember, when we create a function-based index, Oracle creates a hidden virtual column on the table as Oracle needs a way to store the statistics associated with the result set from the function. But these statistics aren’t populated until we next collect statistics on the table (or explicitly for just the hidden columns) and without the column statistics, the CBO can make poor assumptions:

SQL> select column_name, data_type, num_distinct, density, histogram, hidden_column, virtual_column from dba_tab_cols where table_name='BOWIE';

COLUMN_NAME  DATA_TYPE NUM_DISTINCT    DENSITY HISTOGRAM       HID VIR
------------ --------- ------------ ---------- --------------- --- ---
SYS_NC00004$ DATE                              NONE            YES YES
BOWIE_DATE   VARCHAR2         10000      .0001 NONE            NO  NO
CODE         NUMBER            1000       .001 NONE            NO  NO
ID           NUMBER         1000000    .000001 NONE            NO  NO

There are no column statistics for the virtual column but we note the data type of the virtual column is DATE. So let’s collect new statistics on the table:

SQL> exec dbms_stats.gather_table_stats(ownname=>null, tabname=>'BOWIE');

PL/SQL procedure successfully completed.

SQL> select column_name, data_type, num_distinct, density, histogram, hidden_column, virtual_column from dba_tab_cols where table_name='BOWIE';

COLUMN_NAME  DATA_TYPE NUM_DISTINCT    DENSITY HISTOGRAM       HID VIR
------------ --------- ------------ ---------- --------------- --- ---
SYS_NC00004$ DATE             10000      .0001 NONE            YES YES
BOWIE_DATE   VARCHAR2         10000      .0001 HYBRID          NO  NO
CODE         NUMBER            1000       .001 NONE            NO  NO
ID           NUMBER         1000000    .000001 NONE            NO  NO

There are actually two key differences in the above statistics, but in this post I’ll just focus on the fact that the hidden virtual column now has associated statistics (I’ll discuss the other key difference in my next blog post).

When we re-run the query:

SQL> select * from bowie where to_date(bowie_date, 'YYYY MM DD') = '13-JUN-2015';

100 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 960797537

-------------------------------------------------------------------------------------------------------
| Id | Operation                           | Name            | Rows | Bytes | Cost (%CPU) | Time     |
-------------------------------------------------------------------------------------------------------
| 0  | SELECT STATEMENT                    |                 |  100 |  2800 |     102 (0) | 00:00:01 |
| 1  | TABLE ACCESS BY INDEX ROWID BATCHED | BOWIE           |  100 |  2800 |     102 (0) | 00:00:01 |
|* 2 | INDEX RANGE SCAN                    | BOWIE_DATE_FN_I |  100 |       |       3 (0) | 00:00:01 |
-------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------

2 - access(TO_DATE("BOWIE_DATE",'YYYY MM DD')=TO_DATE(' 2015-06-13 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))

Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
108 consistent gets
0 physical reads
0 redo size
3268 bytes sent via SQL*Net to client
618 bytes received via SQL*Net from client
8 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
100 rows processed

The CBO has now got the estimate spot on.

If we now run the previous range scan query that accessed 1/2 the table, referencing the to_date function:

SQL> select * from bowie where to_date(bowie_date, 'YYYY MM DD') between '10-OCT-2000' and '31-DEC-2015';

556100 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 1845943507

---------------------------------------------------------------------------
| Id | Operation         | Name  | Rows  | Bytes  | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
| 0  | SELECT STATEMENT  |       |  556K |    14M |   1328 (34)| 00:00:01 |
|* 1 | TABLE ACCESS FULL | BOWIE |  556K |    14M |   1328 (34)| 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter(TO_DATE("BOWIE_DATE",'YYYY MM DD')>=TO_DATE(' 2000-10-10
00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND TO_DATE("BOWIE_DATE",'YYYY MM
DD')<=TO_DATE(' 2015-12-31 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))

Statistics
----------------------------------------------------------
8 recursive calls
0 db block gets
40444 consistent gets
0 physical reads
0 redo size
18804277 bytes sent via SQL*Net to client
408355 bytes received via SQL*Net from client
37075 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
556100 rows processed

The CBO has got the estimates pretty well spot on and is now performing the far more efficient Full Table Scan. The CBO knows that the virtual column is of type DATE and therefore can much more accurately determine the actual cardinality estimates for the range scan on the “date” column.

If we now run the other corresponding range scan that returned a moderate number of rows:

SQL> select * from bowie where to_date(bowie_date, 'YYYY MM DD') between '10-OCT-2015' and '31-DEC-2015';

8300 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 1845943507

---------------------------------------------------------------------------
| Id | Operation         | Name  | Rows | Bytes  | Cost (%CPU) | Time     |
---------------------------------------------------------------------------
| 0  | SELECT STATEMENT  |       | 8401 |   229K |   1205 (27) | 00:00:01 |
|* 1 | TABLE ACCESS FULL | BOWIE | 8401 |   229K |   1205 (27) | 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter(TO_DATE("BOWIE_DATE",'YYYY MM DD')>=TO_DATE(' 2015-10-10
00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND TO_DATE("BOWIE_DATE",'YYYY MM
DD')<=TO_DATE(' 2015-12-31 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))

Statistics
----------------------------------------------------------
14 recursive calls
0 db block gets
4058 consistent gets
0 physical reads
0 redo size
282075 bytes sent via SQL*Net to client
6635 bytes received via SQL*Net from client
555 SQL*Net roundtrips to/from client
1 sorts (memory)
0 sorts (disk)
8300 rows processed

We notice that again the CBO has got the row estimate just about right and because the index has a poor clustering factor, the CBO still decided to go down the Full Table Scan path.
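Once the values are real dates, the arithmetic behind a range estimate becomes easy to reason about. The Python sketch below uses the commonly quoted textbook formula for a closed range (covered span divided by total span, plus 2/NDV for the two end points). Treat the formula as an assumption for illustration rather than a statement of exactly what the CBO does; the function name range_rows is hypothetical, while the min/max dates, the 10000 distinct values and the 1,000,000 rows come from the statistics shown earlier.

```python
from datetime import date

def range_rows(low, high, col_min, col_max, num_distinct, num_rows):
    """Textbook closed-range estimate: covered span plus both end points."""
    selectivity = (high - low).days / (col_max - col_min).days + 2 / num_distinct
    return selectivity * num_rows

COL_MIN, COL_MAX = date(1989, 1, 9), date(2016, 5, 26)
NDV, ROWS = 10_000, 1_000_000

wide = range_rows(date(2000, 10, 10), date(2015, 12, 31), COL_MIN, COL_MAX, NDV, ROWS)
narrow = range_rows(date(2015, 10, 10), date(2015, 12, 31), COL_MIN, COL_MAX, NDV, ROWS)
print(round(wide), round(narrow))
```

Both figures land very close to the Rows estimates in the two Full Table Scan plans above (556K and 8401), which is about as good as a cardinality estimate gets.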

Even if we don’t use the index all that much (perhaps the Database In-Memory store is a better option for most queries), it can still play an important role in policing the integrity of the data. An attempt to insert an invalid date will now automatically be captured by the database and fail:

SQL> insert into bowie values (1000001, 42, '2016 6 31');
insert into bowie values (1000001, 42, '2016 6 31')
*
ERROR at line 1:
ORA-01839: date not valid for month specified

As the function-based index can only be populated or updated if a valid date is inserted into the table, any attempt to insert an invalid date will fail, thus protecting the integrity of the data.

The best “fix” here is to store these dates in a date field within the database, where the above issues are automatically addressed. If this is not possible, then the introduction and usage of an associated function-based index can certainly assist the CBO in making the correct decision.

However, if data integrity is not a concern (the application does a fantastic job of it and no one ever has to perform manual data fixes directly in the database), then there’s another option to help make the CBO do the right thing.

The clue is back in how I collected the table statistics, which I’ll discuss in my next post.

Index Advanced Compression: Multi-Column Index Part I (There There) September 17, 2015

Posted by Richard Foote in 12c, Advanced Index Compression, Concatenated Indexes, Index Rebuild, Oracle Indexes.

I’ve discussed Index Advanced Compression here a number of times previously. It’s the really cool additional capability introduced to the Advanced Compression Option with 12.1.0.2, that not only makes compressing indexes a much easier exercise but also enables indexes to be compressed more effectively than previously possible.

Thought I might look at a multi-column index to highlight just how truly cool this new feature is in automatically managing the compression of indexes.

First, let’s create a little table and multi-column index:

SQL> create table bowie (id number, code number, name varchar2(42));

Table created.

SQL> insert into bowie select rownum, mod(rownum,10), 'DAVID BOWIE' from dual connect by level <= 1000000;

1000000 rows created.

SQL> commit;

Commit complete.

SQL> create index bowie_idx on bowie(code, id) pctfree 0;

Index created.

OK, the key thing to note here is that the leading CODE column in the index only has 10 distinct values and so is repeated very frequently. However, the second ID column is effectively unique such that the index entry overall is also likewise effectively unique. I’ve created this index initially with no compression, but with a PCTFREE 0 to make the non-compressed index as small as possible.

If we look at the size of the index:

SQL> select index_name, leaf_blocks, blevel from dba_indexes where table_name='BOWIE';

INDEX_NAME LEAF_BLOCKS     BLEVEL
---------- ----------- ----------
BOWIE_IDX         2361          2

We notice the index currently has 2361 leaf blocks.

I’ve previously discussed how index compression basically de-duplicates the indexed values by storing them in a prefix table within the index leaf block. These prefix entries are then referenced in the actual index entries, meaning it’s only necessary to store repeated values once within a leaf block. Only repeated index values within an index leaf block can therefore be effectively compressed.

In this example, it would be pointless compressing both indexed columns as this would only result in a unique prefix entry for each and every index entry, given that the ID column is unique. In fact, the overhead of having a prefix table entry for each and every index entry would actually result in a larger, not smaller, overall index structure.
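The trade-off can be sketched with a toy byte-count model in Python. Everything here is hypothetical: the overhead constants, the string lengths and the leaf_bytes helper are made up purely to show the shape of the problem, and real Oracle leaf block formats differ.

```python
# Toy byte-count model of index prefix compression within one leaf block.
# All sizes are hypothetical; real Oracle block formats differ.
ROW_OVERHEAD = 8      # assumed per-entry cost (rowid, flags, ...)
PREFIX_OVERHEAD = 4   # assumed per-prefix-table-entry cost

def leaf_bytes(entries, prefix_cols):
    """Bytes needed for `entries` when the first `prefix_cols` columns
    are moved into a de-duplicated prefix table."""
    if prefix_cols == 0:
        return sum(len(c) for e in entries for c in e) + ROW_OVERHEAD * len(entries)
    prefixes = {e[:prefix_cols] for e in entries}
    prefix_bytes = sum(sum(len(c) for c in p) + PREFIX_OVERHEAD for p in prefixes)
    suffix_bytes = sum(sum(len(c) for c in e[prefix_cols:]) + ROW_OVERHEAD for e in entries)
    return prefix_bytes + suffix_bytes

# One leaf block's worth of (code, id) entries: code repeats, id is unique.
block = [("07", f"{i:07d}") for i in range(400)]
sizes = {n: leaf_bytes(block, n) for n in (0, 1, 2)}
print(sizes)
```

In this model, putting only the repeated leading column in the prefix table shrinks the block, while putting both columns there creates one prefix entry per index entry and makes the block bigger than having no compression at all.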

To show how compressing the whole index would be a really dumb idea for this particular index:

SQL> alter index bowie_idx rebuild compress;

Index altered.

SQL> select index_name, leaf_blocks, blevel from dba_indexes where table_name='BOWIE';

INDEX_NAME LEAF_BLOCKS     BLEVEL
---------- ----------- ----------
BOWIE_IDX         3120          2

The COMPRESS option basically compresses the whole index and we note that rather than creating a smaller, compressed index structure, the index is in fact bigger at 3120 leaf blocks.

However, as the leading CODE column in the index only has 10 distinct values and so is heavily repeated, it would make sense to just compress this first CODE column only in the index. This of course requires us to fully understand the data associated with the index.

We can do this by specifying just how many leading columns to compress (in this case just 1):

SQL> alter index bowie_idx rebuild compress 1;

Index altered.

SQL> select index_name, leaf_blocks, blevel from dba_indexes where table_name='BOWIE';

INDEX_NAME LEAF_BLOCKS     BLEVEL
---------- ----------- ----------
BOWIE_IDX         2002          2

We note the index is indeed smaller than it was originally, now at just 2002 leaf blocks.

So this requires us to make the correct decision in how many columns in the index to compress. Getting this wrong can result in a worse, not better overall index structure.

Now with Advanced Index Compression, we don’t have to make this decision, we can simply let Oracle do it for us. As discussed previously, Oracle can go through each leaf block and decide how to best compress each leaf block. In this case, it can automatically determine that it’s only beneficial to compress the CODE column throughout the index.

If we compress this index with the new COMPRESS ADVANCED LOW clause:

SQL> alter index bowie_idx rebuild compress advanced low;

Index altered.

SQL> select index_name, leaf_blocks, blevel from dba_indexes where table_name='BOWIE';

INDEX_NAME LEAF_BLOCKS     BLEVEL
---------- ----------- ----------
BOWIE_IDX         2002          2

We note we get the index at the nice, small 2002 leaf blocks, as if we used the correct COMPRESS 1 decision.

However, the story gets a little better than this …

Let’s now modify the contents of the table so that we create some duplicates also for the second ID column:

SQL> update bowie set id=42 where id between 442000 and 542000;

100001 rows updated.

SQL> commit;

Commit complete.

OK, so for about 10% of rows, the ID column value is indeed repeated with the value 42. However, for the remaining 90% of rows (and hence index entries), the ID column remains effectively unique. So we have this 10% section of the index where ID is indeed heavily repeated with the value 42, but everywhere else within the index the ID remains unique.

If we rebuild this index again with no compression:

SQL> alter index bowie_idx rebuild nocompress pctfree 0;

Index altered.

SQL> select index_name, leaf_blocks, blevel from dba_indexes where table_name='BOWIE';

INDEX_NAME LEAF_BLOCKS     BLEVEL
---------- ----------- ----------
BOWIE_IDX         2336          2

We now end up with 2336 leaf blocks (a little smaller than before the update as we’re replacing 10% of the IDs with a smaller value of just 42).

However, the vast majority (90%) of the index entries are still unique, so attempting to compress the entire index is again unlikely to be beneficial:

SQL> alter index bowie_idx rebuild compress;

Index altered.

SQL> select index_name, leaf_blocks, blevel from dba_indexes where table_name='BOWIE';

INDEX_NAME LEAF_BLOCKS     BLEVEL
---------- ----------- ----------
BOWIE_IDX         2946          2

Indeed, the index is again bigger, at 2946 leaf blocks, than it was when it wasn’t compressed.

We can again effectively compress just the CODE column in the index:

SQL> alter index bowie_idx rebuild compress 1;

Index altered.

SQL> select index_name, leaf_blocks, blevel from dba_indexes where table_name='BOWIE';

INDEX_NAME LEAF_BLOCKS     BLEVEL
---------- ----------- ----------
BOWIE_IDX         1977          2

OK, just compressing the CODE column has indeed resulted in a smaller index structure (just 1977 leaf blocks) as it did before.

Without Advanced Index Compression we have the option to not compress the index (the result is average), compress both columns (the result is worse) or compress just the leading column (the result is better). It’s an all or nothing approach to index compression with the best method decided at the overall index level.

We don’t have the option to compress just the leading column when it makes sense to do so, but to also compress both columns in just the 10% portion of the index where it also makes sense to do so (when we have lots of repeating 42 values for ID).

We do have this option though with Advanced Index Compression and indeed this is performed automatically by Oracle in just those leaf blocks where it’s beneficial because the decision on how to compress an index is not performed at the overall index level but at the leaf block level. As such, Advanced Index Compression has the potential to compress an index in a manner that was simply not possible previously:

SQL> alter index bowie_idx rebuild compress advanced low;

Index altered.

SQL> select index_name, leaf_blocks, blevel from dba_indexes where table_name='BOWIE';

INDEX_NAME LEAF_BLOCKS     BLEVEL
---------- ----------- ----------
BOWIE_IDX         1941          2

We notice the index is now even smaller at just 1941 leaf blocks than it was when just compressing the leading column, as we now also compress the ID column in just that 10% of the index where we also had repeating ID values.
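The difference between deciding once for the whole index and deciding per leaf block can be sketched in Python with a toy byte-count model. The overhead constants, the block contents and the leaf_bytes helper are all hypothetical and only meant to illustrate the idea, not to reproduce Oracle's actual block layout or the leaf block counts above.

```python
ROW_OVERHEAD = 8      # assumed per-entry cost
PREFIX_OVERHEAD = 4   # assumed per-prefix-table-entry cost

def leaf_bytes(entries, prefix_cols):
    """Toy size of one leaf block with `prefix_cols` leading columns compressed."""
    prefixes = {e[:prefix_cols] for e in entries} if prefix_cols else set()
    prefix_bytes = sum(sum(len(c) for c in p) + PREFIX_OVERHEAD for p in prefixes)
    suffix_bytes = sum(sum(len(c) for c in e[prefix_cols:]) + ROW_OVERHEAD for e in entries)
    return prefix_bytes + suffix_bytes

# Nine blocks where id is unique, one block where id is always 42.
blocks = [[("07", f"{b * 400 + i:07d}") for i in range(400)] for b in range(9)]
blocks.append([("07", "0000042")] * 400)

# Index-level decision: one prefix length applied to every block.
index_level = {p: sum(leaf_bytes(b, p) for b in blocks) for p in (0, 1, 2)}
best_fixed = min(index_level, key=index_level.get)

# Block-level decision: each block picks whichever prefix length is smallest.
block_level = sum(min(leaf_bytes(b, p) for p in (0, 1, 2)) for b in blocks)

print(index_level, best_fixed, block_level)
```

In this model the best single setting is to compress one column everywhere, but letting the one heavily duplicated block compress both columns gives a smaller total again, which is the same pattern as the 1977 versus 1941 leaf blocks above.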

I can’t emphasise enough just how cool this feature is !!

In fact, I would recommend something I don’t usually recommend and that is rebuilding all your indexes at least once (where you know the leading column has some repeated values) with the Advanced Index Compression option, so that all indexes can be compressed to their optimal manner.

Note though that this does require the Advanced Compression Option !!

More later 🙂

Index Tree Dumps in Oracle 12c Database (New Age) June 22, 2015

Posted by Richard Foote in 12c, TreeDumps.

I’ve previously discussed Index Tree Dumps but I’ve recently found a nice little improvement that’s been introduced in Oracle Database 12c.

Let’s begin by creating a little table and index:

SQL> create table bowie (id number, name varchar2(42));

Table created.

SQL> insert into bowie select rownum, 'DAVID BOWIE' from dual connect by level <=10000;

10000 rows created.

SQL> commit;

Commit complete.

SQL> create index bowie_id_i on bowie(id);

Index created.

To generate an Index Tree Dump, we first need the OBJECT_ID of the index:

SQL> select object_id from dba_objects where object_name='BOWIE_ID_I';

OBJECT_ID
----------
98829

And then use it to generate the Index Tree Dump:

SQL> alter session set events 'immediate trace name treedump level 98829';

Session altered.

Previously, an Index Tree Dump looked like the following:

----- begin tree dump
branch: 0x100023b 16777787 (0: nrow: 21, level: 1)
leaf: 0x100023c 16777788 (-1: nrow: 485 rrow: 485)
leaf: 0x100023d 16777789 (0: nrow: 479 rrow: 479)
leaf: 0x100023e 16777790 (1: nrow: 479 rrow: 479)
leaf: 0x100023f 16777791 (2: nrow: 479 rrow: 479)
leaf: 0x1000240 16777792 (3: nrow: 479 rrow: 479)
leaf: 0x1000241 16777793 (4: nrow: 479 rrow: 479)
leaf: 0x1000242 16777794 (5: nrow: 479 rrow: 479)
leaf: 0x1000243 16777795 (6: nrow: 479 rrow: 479)
leaf: 0x1000244 16777796 (7: nrow: 479 rrow: 479)
leaf: 0x1000245 16777797 (8: nrow: 479 rrow: 479)
leaf: 0x1000246 16777798 (9: nrow: 479 rrow: 479)
leaf: 0x1000247 16777799 (10: nrow: 479 rrow: 479)
leaf: 0x1000249 16777801 (11: nrow: 479 rrow: 479)
leaf: 0x100024a 16777802 (12: nrow: 479 rrow: 479)
leaf: 0x100024b 16777803 (13: nrow: 479 rrow: 479)
leaf: 0x100024c 16777804 (14: nrow: 479 rrow: 479)
leaf: 0x100024d 16777805 (15: nrow: 479 rrow: 479)
leaf: 0x100024e 16777806 (16: nrow: 479 rrow: 479)
leaf: 0x100024f 16777807 (17: nrow: 479 rrow: 479)
leaf: 0x1000250 16777808 (18: nrow: 479 rrow: 479)
leaf: 0x1000251 16777809 (19: nrow: 414 rrow: 414)
----- end tree dump

So this index is a Level 1 Index with a root block and 21 Leaf Blocks. The first entry always corresponds to the index root block and is followed by the 21 leaf blocks. Each leaf block entry details the relative block address, the sequence number, the number of index entries (nrow) and the number of non-deleted index entries (rrow).

If we look at the same Index Tree Dump in 12c (12.1.0.2):

branch: 0x180017b 25166203 (0: nrow: 21, level: 1)
leaf: 0x180017c 25166204 (-1: row:485.485 avs:828)
leaf: 0x180017d 25166205 (0: row:479.479 avs:820)
leaf: 0x180017e 25166206 (1: row:479.479 avs:820)
leaf: 0x180017f 25166207 (2: row:479.479 avs:820)
leaf: 0x18004c8 25167048 (3: row:479.479 avs:820)
leaf: 0x18004c9 25167049 (4: row:479.479 avs:819)
leaf: 0x18004ca 25167050 (5: row:479.479 avs:820)
leaf: 0x18004cb 25167051 (6: row:479.479 avs:820)
leaf: 0x18004cc 25167052 (7: row:479.479 avs:820)
leaf: 0x18004cd 25167053 (8: row:479.479 avs:819)
leaf: 0x18004ce 25167054 (9: row:479.479 avs:820)
leaf: 0x18004cf 25167055 (10: row:479.479 avs:820)
leaf: 0x18004d1 25167057 (11: row:479.479 avs:820)
leaf: 0x18004d2 25167058 (12: row:479.479 avs:820)
leaf: 0x18004d3 25167059 (13: row:479.479 avs:819)
leaf: 0x18004d4 25167060 (14: row:479.479 avs:820)
leaf: 0x18004d5 25167061 (15: row:479.479 avs:820)
leaf: 0x18004d6 25167062 (16: row:479.479 avs:820)
leaf: 0x18004d7 25167063 (17: row:479.479 avs:820)
leaf: 0x18004d8 25167064 (18: row:479.479 avs:819)
leaf: 0x18004d9 25167065 (19: row:414.414 avs:1795)
----- end tree dump

We notice the format is a little different in that it also now includes the avs (free space) within the leaf block as well.

If we now delete a few rows (and hence index entries) and look at the updated tree dump:

SQL> delete bowie where id between 1 and 400;

400 rows deleted.

SQL> commit;

Commit complete.

SQL> alter session set events 'immediate trace name treedump level 98829';

Session altered.

branch: 0x180017b 25166203 (0: nrow: 21, level: 1)
leaf: 0x180017c 25166204 (-1: row:485.85 avs:828)
leaf: 0x180017d 25166205 (0: row:479.479 avs:820)
leaf: 0x180017e 25166206 (1: row:479.479 avs:820)
leaf: 0x180017f 25166207 (2: row:479.479 avs:820)
leaf: 0x18004c8 25167048 (3: row:479.479 avs:820)
leaf: 0x18004c9 25167049 (4: row:479.479 avs:819)
leaf: 0x18004ca 25167050 (5: row:479.479 avs:820)
leaf: 0x18004cb 25167051 (6: row:479.479 avs:820)
leaf: 0x18004cc 25167052 (7: row:479.479 avs:820)
leaf: 0x18004cd 25167053 (8: row:479.479 avs:819)
leaf: 0x18004ce 25167054 (9: row:479.479 avs:820)
leaf: 0x18004cf 25167055 (10: row:479.479 avs:820)
leaf: 0x18004d1 25167057 (11: row:479.479 avs:820)
leaf: 0x18004d2 25167058 (12: row:479.479 avs:820)
leaf: 0x18004d3 25167059 (13: row:479.479 avs:819)
leaf: 0x18004d4 25167060 (14: row:479.479 avs:820)
leaf: 0x18004d5 25167061 (15: row:479.479 avs:820)
leaf: 0x18004d6 25167062 (16: row:479.479 avs:820)
leaf: 0x18004d7 25167063 (17: row:479.479 avs:820)
leaf: 0x18004d8 25167064 (18: row:479.479 avs:819)
leaf: 0x18004d9 25167065 (19: row:414.414 avs:1795)
----- end tree dump

We notice that it now correctly details how many non-deleted index entries we now have in the first leaf block (85). Unfortunately, the free space remains the same and doesn’t take into account the deleted index entries (still recorded as 828 bytes).

Of course, if we perform any additional DML that impacts this leaf block such as another delete:

SQL> delete bowie where id=401;

1 row deleted.

SQL> commit;

Commit complete.

SQL> alter session set events 'immediate trace name treedump level 98829';

Session altered.

----- begin tree dump
branch: 0x180017b 25166203 (0: nrow: 21, level: 1)
leaf: 0x180017c 25166204 (-1: row:85.84 avs:6725)
leaf: 0x180017d 25166205 (0: row:479.479 avs:820)
leaf: 0x180017e 25166206 (1: row:479.479 avs:820)
leaf: 0x180017f 25166207 (2: row:479.479 avs:820)
leaf: 0x18004c8 25167048 (3: row:479.479 avs:820)
leaf: 0x18004c9 25167049 (4: row:479.479 avs:819)
leaf: 0x18004ca 25167050 (5: row:479.479 avs:820)
leaf: 0x18004cb 25167051 (6: row:479.479 avs:820)
leaf: 0x18004cc 25167052 (7: row:479.479 avs:820)
leaf: 0x18004cd 25167053 (8: row:479.479 avs:819)
leaf: 0x18004ce 25167054 (9: row:479.479 avs:820)
leaf: 0x18004cf 25167055 (10: row:479.479 avs:820)
leaf: 0x18004d1 25167057 (11: row:479.479 avs:820)
leaf: 0x18004d2 25167058 (12: row:479.479 avs:820)
leaf: 0x18004d3 25167059 (13: row:479.479 avs:819)
leaf: 0x18004d4 25167060 (14: row:479.479 avs:820)
leaf: 0x18004d5 25167061 (15: row:479.479 avs:820)
leaf: 0x18004d6 25167062 (16: row:479.479 avs:820)
leaf: 0x18004d7 25167063 (17: row:479.479 avs:820)
leaf: 0x18004d8 25167064 (18: row:479.479 avs:819)
leaf: 0x18004d9 25167065 (19: row:414.414 avs:1795)
----- end tree dump

We notice the leaf block has now cleaned out the previously deleted index entries and the free space has been updated accordingly (now 6725 bytes).

Showing the amount of free space within a block is a nice little improvement to the format of the index tree dump.
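If you want to summarise a large tree dump rather than eyeball it, the 12c leaf lines are easy enough to pick apart. Here is a small Python sketch; the regular expression is based only on the line layout shown in the dumps above, and the names LEAF_12C and parse_leaf are hypothetical.

```python
import re

# Matches lines such as: leaf: 0x180017c 25166204 (-1: row:485.485 avs:828)
LEAF_12C = re.compile(
    r"leaf: (?P<rdba>0x[0-9a-f]+) (?P<dba>\d+) "
    r"\((?P<seq>-?\d+): row:(?P<nrow>\d+)\.(?P<rrow>\d+) avs:(?P<avs>\d+)\)"
)

def parse_leaf(line):
    """Return the fields of one 12c-format leaf line, or None if no match."""
    m = LEAF_12C.search(line)
    if m is None:
        return None
    return {
        "rdba": m["rdba"],
        "seq": int(m["seq"]),
        "nrow": int(m["nrow"]),   # index entries, including deleted ones
        "rrow": int(m["rrow"]),   # non-deleted index entries
        "avs": int(m["avs"]),     # free space reported for the block
    }

sample = [
    "leaf: 0x180017c 25166204 (-1: row:485.85 avs:828)",
    "leaf: 0x180017c 25166204 (-1: row:85.84 avs:6725)",
]
for line in sample:
    leaf = parse_leaf(line)
    print(leaf["rdba"], "deleted:", leaf["nrow"] - leaf["rrow"], "free:", leaf["avs"])
```

The two sample lines are the first leaf block before and after the clean out, so the deleted count drops from 400 to 1 while the reported free space jumps.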

Oracle Database In-Memory Test Drive Workshop: Canberra 28 April 2015 March 30, 2015

Posted by Richard Foote in 12c, In-Memory.

I’ll be running a free Oracle Database In-Memory Test Drive Workshop locally here in Canberra on Tuesday, 28th April 2015.

Just bring a laptop with at least 8GB of RAM and I’ll supply a VirtualBox image with the Oracle Database 12c In-Memory environment. Together we’ll go through a number of hands-on labs that cover:

  1. Configuring the Product Easily
  2. Understanding Fast Table Scans (with none of those pesky indexes)
  3. Understanding Query Optimisation
  4. Understanding Transactional Consistency

It’s sure to be a fun morning. It’s also sure to fill up really quickly so please register ASAP to avoid disappointment.

For all the necessary details including how to register click here.

Hope to see you then 🙂

UPDATE: This event is now officially FULL. Sorry to disappoint if you haven’t yet enrolled.

12.1.0.2 Introduction to Zone Maps Part III (Little By Little) November 24, 2014

Posted by Richard Foote in 12c, Attribute Clustering, Oracle Indexes, Zone Maps.
1 comment so far

I’ve previously discussed the new Zone Map database feature and how they work in a similar manner to Exadata Storage indexes.

Just like Storage Indexes (and conventional indexes for that matter), they work best when the data is well clustered in relation to the Zone Map or index. By having the data in the table ordered in the same manner as the Zone Map, the ranges of the min/max values for each 8M “zone” in the table can be as narrow as possible, making them more likely to eliminate zone accesses.

On the other hand, if the data in the table is not well clustered, then the min/max ranges within the Zone Map can be extremely wide, making their effectiveness limited.

In my previous example on the ALBUM_ID column in my first article on this subject, the data was extremely well clustered and so the associated Zone Map was very effective. But what if the data is poorly clustered ?

To illustrate, I’m going to create a Zone Map based on the poorly clustered ARTIST_ID column, which has its values randomly distributed throughout the whole table:

SQL> create materialized zonemap big_bowie_artist_id_zm on big_bowie(artist_id);
 create materialized zonemap big_bowie_artist_id_zm on big_bowie(artist_id)
 *
 ERROR at line 1:
 ORA-31958: fact table "BOWIE"."BIG_BOWIE" already has a zonemap
 "BOWIE"."BIG_BOWIE_ALBUM_ID_ZM" on it

Another difference between an index and a Zone Map is that there can only be one Zone Map defined per table, although a Zone Map can include multiple columns. As I already have a Zone Map defined on just the ALBUM_ID column, I can’t just create another.

So I’ll drop the current Zone Map and create a new one based on both the ARTIST_ID and ALBUM_ID columns:

SQL> drop materialized zonemap big_bowie_album_id_zm;

Materialized zonemap dropped.

SQL> create materialized zonemap big_bowie_zm on big_bowie(album_id, artist_id);

Materialized zonemap created.
    
 SQL> select measure, position_in_select, agg_function, agg_column_name
 from dba_zonemap_measures where zonemap_name='BIG_BOWIE_ZM';

MEASURE              POSITION_IN_SELECT AGG_FUNCTION  AGG_COLUMN_NAME
 -------------------- ------------------ ------------- --------------------
 "BOWIE"."BIG_BOWIE".                  5 MAX           MAX_2_ARTIST_ID
 "ARTIST_ID"

"BOWIE"."BIG_BOWIE".                  4 MIN           MIN_2_ARTIST_ID
 "ARTIST_ID"

"BOWIE"."BIG_BOWIE".                  3 MAX           MAX_1_ALBUM_ID
 "ALBUM_ID"

"BOWIE"."BIG_BOWIE".                  2 MIN           MIN_1_ALBUM_ID
 "ALBUM_ID"

So this new Zone Map has min/max details on each zone in the table for both the ARTIST_ID and ALBUM_ID columns.

The min/max ranges of a Zone Map provide an excellent visual representation of the clustering of the data. If I select Zone Map details of the ALBUM_ID column (see partial listing below):

SQL> select zone_id$, min_1_album_id, max_1_album_id, zone_rows$ from big_bowie_zm;

ZONE_ID$ MIN_1_ALBUM_ID MAX_1_ALBUM_ID ZONE_ROWS$
 ---------- -------------- -------------- ----------
 3.8586E+11              1              2      66234
 3.8586E+11              5              6      56715
 3.8586E+11              7              7      76562
 3.8586E+11              7              8      76632
 3.8586E+11              8              9      76633
 3.8586E+11             21             22      75615
 3.8586E+11             29             29      75582
 3.8586E+11             31             32      75545
 3.8586E+11             35             36      75617
 3.8586E+11             43             44      75615
 ...

3.8586E+11             76             77      75615
 3.8586E+11             79             80      75615
 3.8586E+11             86             87      75616
 3.8586E+11             88             89      75618
 3.8586E+11             97             97      75771
 3.8586E+11            100            100      15871

134 rows selected.

As the data in the table is effectively ordered based on the ALBUM_ID column (and so is extremely well clustered in relation to this column), the min/max ranges for each zone are extremely narrow. Each zone basically only contains one or two different values of ALBUM_ID and so if I’m just after a specific ALBUM_ID value, the Zone Map is very effective in eliminating zones from having to be accessed. Just what we want.

However, if we look at the Zone Map details of the poorly clustered ARTIST_ID column (again just a partial listing):

SQL> select zone_id$, min_2_artist_id, max_2_artist_id, zone_rows$ from big_bowie_zm;

ZONE_ID$ MIN_2_ARTIST_ID MAX_2_ARTIST_ID ZONE_ROWS$
 ---------- --------------- --------------- ----------
 3.8586E+11            3661           98244      66234
 3.8586E+11               1          100000      56715
 3.8586E+11            5273           81834      76562
 3.8586E+11               1          100000      76632
 3.8586E+11               1          100000      76633
 3.8586E+11               1          100000      75615
 3.8586E+11            2383           77964      75582
 3.8586E+11               1          100000      75545
 3.8586E+11               1          100000      75617
 3.8586E+11               1          100000      75615
 ...

3.8586E+11               1          100000      75615
 3.8586E+11               1          100000      75615
 3.8586E+11               1          100000      75615
 3.8586E+11               1          100000      75615
 3.8586E+11               1          100000      75616
 3.8586E+11               1          100000      75618
 3.8586E+11            4848           80618      75771
 3.8586E+11           84130          100000      15871

134 rows selected.

We notice the ranges for most of the zones are extremely large, with many actually having a min value of 1 (the actual minimum) and a max of 100000 (the actual maximum). This is a worst case scenario as a specific required value could potentially reside in most of the zones, thereby forcing Oracle to visit most zones and making the Zone Map totally ineffective.
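To make the difference concrete, here is a small Python sketch that applies the basic min/max test to the first ten ranges from each of the two listings above. It's only a toy model of the idea (a zone can be skipped if the value is below its min or above its max), not how Oracle implements it, and the function name is made up for illustration.

```python
def zones_to_scan(zones, value):
    """Return the zones whose [min, max] range could contain the value."""
    return [(lo, hi) for (lo, hi) in zones if not (lo > value or hi < value)]

# First ten ARTIST_ID ranges from the listing above
artist_zones = [
    (3661, 98244), (1, 100000), (5273, 81834), (1, 100000), (1, 100000),
    (1, 100000), (2383, 77964), (1, 100000), (1, 100000), (1, 100000),
]

# First ten ALBUM_ID ranges from the earlier listing
album_zones = [
    (1, 2), (5, 6), (7, 7), (7, 8), (8, 9),
    (21, 22), (29, 29), (31, 32), (35, 36), (43, 44),
]

print(len(zones_to_scan(artist_zones, 42)), "of", len(artist_zones))  # prints: 7 of 10
print(len(zones_to_scan(album_zones, 8)), "of", len(album_zones))     # prints: 2 of 10
```

So for these ten zones, a search on ARTIST_ID 42 can only skip 3 of them, whereas a search on ALBUM_ID 8 can skip all but 2.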

If we run a query searching for a specific ARTIST_ID:

SQL> select * from big_bowie where artist_id=42;

100 rows selected.

Elapsed: 00:00:00.69

Execution Plan
 ----------------------------------------------------------
 Plan hash value: 1980960934

----------------------------------------------------------------------------------------------------
 | Id  | Operation                              | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
 ----------------------------------------------------------------------------------------------------
 |   0 | SELECT STATEMENT                       |           |    99 |  9108 |  3291  (13)| 00:00:01 |
 |*  1 |  TABLE ACCESS STORAGE FULL WITH ZONEMAP| BIG_BOWIE |    99 |  9108 |  3291  (13)| 00:00:01 |
 ----------------------------------------------------------------------------------------------------
 Predicate Information (identified by operation id):
 ---------------------------------------------------

1 - storage("ARTIST_ID"=42)
 filter(SYS_ZMAP_FILTER('/* ZM_PRUNING */ SELECT "ZONE_ID$", CASE WHEN
 BITAND(zm."ZONE_STATE$",1)=1 THEN 1 ELSE CASE WHEN (zm."MIN_2_ARTIST_ID" > :1 OR
 zm."MAX_2_ARTIST_ID" < :2) THEN 3 ELSE 2 END END FROM "BOWIE"."BIG_BOWIE_ZM" zm WHERE
 zm."ZONE_LEVEL$"=0 ORDER BY zm."ZONE_ID$"',SYS_OP_ZONE_ID(ROWID),42,42)<3 AND
 "ARTIST_ID"=42)

      
 Statistics
 ----------------------------------------------------------
 141  recursive calls
 0  db block gets
 101614  consistent gets
 0  physical reads
 0  redo size
 5190  bytes sent via SQL*Net to client
 618  bytes received via SQL*Net from client
 8  SQL*Net roundtrips to/from client
 0  sorts (memory)
 0  sorts (disk)
 100  rows processed

We notice we are forced to perform a very high number of consistent gets (101,614) when returning just 100 rows, much higher than the 2,364 consistent gets required to return a full 100,000 rows for a specific ALBUM_ID and not far from the 135,085 consistent gets when performing a full table scan.

We need to improve the performance of these queries based on the ARTIST_ID column …

Let’s drop this zone map:

SQL> drop materialized zonemap big_bowie_zm;

Materialized zonemap dropped.

and change the physical clustering of the data in the table so that the data is primarily now clustered in ARTIST_ID order:

 

SQL> alter table big_bowie add clustering by linear order(artist_id, album_id) with materialized zonemap;

Table altered.

So we have added a clustering attribute to this table (previously discussed here) and based a new Zone Map on this clustering at the same time.

SQL> select zonemap_name from dba_zonemaps where fact_table='BIG_BOWIE';

ZONEMAP_NAME
---------------
ZMAP$_BIG_BOWIE

SQL> select zonemap_name, pruning, with_clustering, invalid, stale, unusable
from dba_zonemaps where zonemap_name = 'ZMAP$_BIG_BOWIE';

ZONEMAP_NAME    PRUNING  WITH_CLUSTERING INVALID STALE   UNUSABLE
--------------- -------- --------------- ------- ------- --------
ZMAP$_BIG_BOWIE ENABLED  YES             NO      NO      NO

However, as we haven’t actually reorganized the table, the rows in the table are still clustered the same as before:

SQL> select zone_id$, min_2_album_id, max_2_album_id, zone_rows$ from zmap$_big_bowie;

ZONE_ID$ MIN_2_ALBUM_ID MAX_2_ALBUM_ID ZONE_ROWS$
---------- -------------- -------------- ----------
3.8586E+11             43             44      75615
3.8586E+11              1              2      66234
3.8586E+11             81             82      75615
3.8586E+11             29             29      75582
3.8586E+11             50             50      75481
3.8586E+11             90             91      75484
3.8586E+11              5              6      56715
3.8586E+11              7              8      76632
3.8586E+11              8              9      76633
3.8586E+11             16             16      75481
...

3.8586E+11             44             44      75480
3.8586E+11             82             83      75616
3.8586E+11            100            100      15871
3.8586E+11             34             35      75576
3.8586E+11             14             15      75615
3.8586E+11             33             34      75616
3.8586E+11              3              5      75707

134 rows selected.

SQL> select zone_id$, min_1_artist_id, max_1_artist_id, zone_rows$ from zmap$_big_bowie;

ZONE_ID$ MIN_1_ARTIST_ID MAX_1_ARTIST_ID ZONE_ROWS$
---------- --------------- --------------- ----------
3.8586E+11               1          100000      75545
3.8586E+11               1          100000      75616
3.8586E+11               1          100000      75617
3.8586E+11               1          100000      75911
3.8586E+11               1          100000      75616
3.8586E+11               1          100000      75616
3.8586E+11               1          100000      75615
3.8586E+11               1          100000      75616
3.8586E+11             132           75743      75612
3.8586E+11               1          100000      75615
...

3.8586E+11               1          100000      66296
3.8586E+11               1          100000      75615
3.8586E+11            2360           96960      75701
3.8586E+11               1          100000      75615
3.8586E+11               1          100000      75616
3.8586E+11           23432           98911      75480
3.8586E+11               1          100000      75791
3.8586E+11           21104           96583      75480

134 rows selected.

But if we now reorganise the table so that the clustering attribute can take effect:

SQL> alter table big_bowie move;

Table altered.

We notice the characteristics of the Zone Map have changed dramatically. The previously well clustered ALBUM_ID now has a totally ineffective Zone Map with all the ranges effectively consisting of the full min/max values:

SQL> select zone_id$, min_2_album_id, max_2_album_id, zone_rows$ from zmap$_big_bowie;

ZONE_ID$ MIN_2_ALBUM_ID MAX_2_ALBUM_ID ZONE_ROWS$
---------- -------------- -------------- ----------
3.9704E+11              1            142      21185
3.9704E+11              1            100       9452
3.9704E+11              1            100      76516
3.9704E+11              1            100      75501
3.9704E+11              1            100      75497
3.9704E+11              1            100      75501
3.9704E+11              1            100      75499
3.9704E+11              1            100      75504
3.9704E+11              1            100      75500
3.9704E+11              1            100      75501
...

3.9704E+11              1            100      75503
3.9704E+11              1            100      75498
3.9704E+11              1            100      75501
3.9704E+11              1            100      75501
3.9704E+11              1            100      75501
3.9704E+11              1            100      75501
3.9704E+11              1            100      75794

144 rows selected.

While the previously ineffective Zone Map on the ARTIST_ID column is now much more effective with significantly smaller min/max ranges for each zone:

SQL> select zone_id$, min_1_artist_id, max_1_artist_id, zone_rows$ from zmap$_big_bowie;

ZONE_ID$ MIN_1_ARTIST_ID MAX_1_ARTIST_ID ZONE_ROWS$
---------- --------------- --------------- ----------
3.9704E+11              67            1036      21185
3.9704E+11            2359            2453       9452
3.9704E+11            8341            9106      76516
3.9704E+11           18933           19688      75501
3.9704E+11           22708           23463      75497
3.9704E+11           26483           27238      75501
3.9704E+11           27238           27993      75499
3.9704E+11           33278           34033      75504
3.9704E+11           36674           40449      75500
3.9704E+11           38563           39318      75501
...

3.9704E+11           49888           50643      75503
3.9704E+11           62723           63478      75498
3.9704E+11           77824           78579      75501
3.9704E+11           82354           83109      75501
3.9704E+11           88394           89149      75501
3.9704E+11           93679           94434      75501
3.9704E+11           98211           98969      75794

144 rows selected.

The same query now runs so much faster as the Zone Map can eliminate almost all zones from being accessed:

SQL> select * from big_bowie where artist_id=42;

100 rows selected.

Elapsed: 00:00:00.02

Execution Plan
----------------------------------------------------------
Plan hash value: 1980960934

----------------------------------------------------------------------------------------------------
| Id  | Operation                              | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                       |           |    99 |  9108 |  3291  (13)| 00:00:01 |
|*  1 |  TABLE ACCESS STORAGE FULL WITH ZONEMAP| BIG_BOWIE |    99 |  9108 |  3291  (13)| 00:00:01 |
----------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------

1 - storage("ARTIST_ID"=42)
filter(SYS_ZMAP_FILTER('/* ZM_PRUNING */ SELECT "ZONE_ID$", CASE WHEN
BITAND(zm."ZONE_STATE$",1)=1 THEN 1 ELSE CASE WHEN (zm."MIN_1_ARTIST_ID" > :1 OR
zm."MAX_1_ARTIST_ID" < :2) THEN 3 ELSE 2 END END FROM "BOWIE"."ZMAP$_BIG_BOWIE" zm WHERE
zm."ZONE_LEVEL$"=0 ORDER BY zm."ZONE_ID$"',SYS_OP_ZONE_ID(ROWID),42,42)<3 AND
"ARTIST_ID"=42)
Statistics
----------------------------------------------------------
187  recursive calls
0  db block gets
175  consistent gets
0  physical reads
0  redo size
5190  bytes sent via SQL*Net to client
618  bytes received via SQL*Net from client
8  SQL*Net roundtrips to/from client
9  sorts (memory)
0  sorts (disk)
100  rows processed

Consistent gets have reduced dramatically, down to just 175 from the previously massive 101,614.

As is common with changing the clustering of data, what improves one thing makes something else significantly worse. The previously efficient accesses based on the ALBUM_ID column are now nowhere near as efficient as before:

SQL> select * from big_bowie where album_id = 42;

100000 rows selected.

Elapsed: 00:00:01.27

Execution Plan
----------------------------------------------------------
Plan hash value: 1980960934

----------------------------------------------------------------------------------------------------
| Id  | Operation                              | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                       |           |   100K|  8984K|  3269  (12)| 00:00:01 |
|*  1 |  TABLE ACCESS STORAGE FULL WITH ZONEMAP| BIG_BOWIE |   100K|  8984K|  3269  (12)| 00:00:01 |
----------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------

1 - storage("ALBUM_ID"=42)
filter(SYS_ZMAP_FILTER('/* ZM_PRUNING */ SELECT "ZONE_ID$", CASE WHEN
BITAND(zm."ZONE_STATE$",1)=1 THEN 1 ELSE CASE WHEN (zm."MIN_2_ALBUM_ID" > :1 OR
zm."MAX_2_ALBUM_ID" < :2) THEN 3 ELSE 2 END END FROM "BOWIE"."ZMAP$_BIG_BOWIE" zm WHERE
zm."ZONE_LEVEL$"=0 ORDER BY zm."ZONE_ID$"',SYS_OP_ZONE_ID(ROWID),42,42)<3 AND "ALBUM_ID"=42)

Statistics
----------------------------------------------------------
187  recursive calls
0  db block gets
141568  consistent gets
0  physical reads
0  redo size
4399566  bytes sent via SQL*Net to client
73878  bytes received via SQL*Net from client
6668  SQL*Net roundtrips to/from client
9  sorts (memory)
0  sorts (disk)
100000  rows processed

We now have to perform a whopping 141,568 consistent gets, up from the previous 2,364 consistent gets.

So Zone Maps, like database indexes and Exadata Storage Indexes, can be extremely beneficial in reducing I/O but their effectiveness is very much dependent on the clustering of the underlying data.
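The effect of clustering can be mimicked with a few lines of Python. The sketch below is a toy model under made-up assumptions (10,000 rows, 1,000 distinct ids, zones of 100 rows, ids scattered with a simple modulo trick), none of which come from the BIG_BOWIE demo. It builds a min/max map over the same data twice, once scattered and once sorted, and counts how many zones a search for a single id has to visit.

```python
def build_zone_map(values, zone_rows):
    """Split the values into consecutive zones and record (min, max) for each zone."""
    zone_map = []
    for start in range(0, len(values), zone_rows):
        zone = values[start:start + zone_rows]
        zone_map.append((min(zone), max(zone)))
    return zone_map

def zones_visited(zone_map, value):
    """Count the zones whose min/max range cannot rule out the value."""
    return sum(1 for lo, hi in zone_map if lo <= value <= hi)

# Hypothetical data: each of the 1,000 ids appears 10 times, spread through the table
scattered = [(i * 37) % 1000 for i in range(10000)]

# The same rows after a reorganisation that stores them in id order
clustered = sorted(scattered)

scattered_map = build_zone_map(scattered, 100)
clustered_map = build_zone_map(clustered, 100)

print(zones_visited(scattered_map, 42), "of", len(scattered_map), "zones (scattered)")
print(zones_visited(clustered_map, 42), "of", len(clustered_map), "zones (clustered)")
```

With the data sorted, the 10 rows for id 42 all sit in a single zone, so just 1 of the 100 zones needs to be visited. With the data scattered, those same 10 rows live in 10 different zones, and many other zones can't be ruled out either as their min/max ranges are so wide.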

Index Advanced Compression vs. Bitmap Indexes (Candidate) October 31, 2014

Posted by Richard Foote in 12c, Advanced Index Compression, Bitmap Indexes, Oracle Indexes.
7 comments

A good question from Robert Thorneycroft I thought warranted its own post. He asked:

“I have a question regarding bitmapped indexes verses index compression. In your previous blog titled ‘So What Is A Good Cardinality Estimate For A Bitmap Index Column ? (Song 2)’ you came to the conclusion that ‘500,000 distinct values in a 1 million row table’ would still be a viable scenario for deploying bitmapped indexes over non-compressed b-tree indexes.

Now b-tree index compression is common, especially with the release of Advanced Index Compression how does this affect your conclusion? Are there still any rules of thumb which can be used to determine when to deploy bitmapped indexes instead of compressed b-tree indexes or has index compression made bitmapped indexes largely redundant?”

 

If you’re not familiar with Bitmap Indexes, it might be worth having a read of my previous posts on the subject.

Now Advanced Index Compression, introduced in 12.1.0.2, has certainly made compressing indexes a lot easier and in many scenarios, more efficient than was previously possible. Does that indeed mean Bitmap Indexes, which are relatively small and automatically compressed, are now largely redundant ?

The answer is no, Bitmap Indexes are still highly relevant in Data Warehouse environments as they have a number of key advantages in the manner they get compressed over B-Tree Indexes.

Compression of a B-Tree index is performed within a leaf block where Oracle effectively de-duplicates the index entries (or parts thereof). This means that a highly repeated index value might need to be stored repeatedly in each leaf block. Bitmap index entries on the other hand can potentially span the entire table and only need to be split if the overall size of the index entries exceeds 1/2 a block. Therefore, the number of indexed values stored in a Bitmap Index can be far fewer than with a B-tree.

However, it’s in the area of storing the associated rowids where Bitmap Indexes can have the main advantage. With a B-tree index, even when highly compressed, each and every index entry must have an associated rowid stored in the index. If you have say 1 million index entries, that’s 1 million rowids that need to be stored, regardless of the compression ratio. With a Bitmap Index, an index entry has 2 rowids to specify the range of rows covered by the index entry, but this might be sufficient to cover the entire table. So depending on the number of distinct values being indexed in say a million row table, there may be dramatically fewer than 1 million rowids stored in the Bitmap Index.
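A quick back-of-the-envelope calculation shows how big this difference can be. The Python sketch below counts only the bytes spent on rowids, assuming 6 bytes per stored rowid (my assumption for the arithmetic) and ignoring the bitmaps themselves, the indexed values and all block overheads. The index entry counts of 42, 1385 and 500,000 are the NUM_ROWS values reported for the Bitmap Indexes in the examples of this post.

```python
ROWID_BYTES = 6  # assumption: bytes needed for each rowid stored in an index

def btree_rowid_bytes(table_rows):
    """A B-tree stores one rowid per index entry, one entry per indexed row."""
    return table_rows * ROWID_BYTES

def bitmap_rowid_bytes(index_entries):
    """A bitmap index entry stores two rowids (start and end of the range covered)."""
    return index_entries * 2 * ROWID_BYTES

table_rows = 1_000_000

# Bitmap index entry counts taken from the NUM_ROWS column in this post's examples
for entries in (42, 1385, 500_000):
    print(entries, btree_rowid_bytes(table_rows), bitmap_rowid_bytes(entries))

# prints:
# 42 6000000 504
# 1385 6000000 16620
# 500000 6000000 6000000
```

Under these assumptions the rowid overhead only evens out once there is a bitmap index entry for every second row, which is why columns approaching uniqueness are where the Bitmap Index advantage disappears.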

To show how Bitmap Indexes are generally much smaller than corresponding compressed B-Tree indexes, here are a few simple examples.

In example 1, I’m going to create a B-Tree Index that is a perfect candidate for compression. This index has very large indexed values that are all duplicates and so will compress very effectively:

SQL> create table ziggy (id number, weird varchar2(100));

Table created.

SQL> insert into ziggy select rownum, 'THE RISE AND FALL OF ZIGGY STARDUST AND THE SPIDERS FROM MARS'
     from dual connect by level <= 1000000;

1000000 rows created.

SQL> commit;

Commit complete.

SQL> create index ziggy_weird_i on ziggy(weird) pctfree 0;

Index created.

SQL> select index_name, blevel, leaf_blocks, num_rows from dba_indexes where index_name='ZIGGY_WEIRD_I';

INDEX_NAME        BLEVEL LEAF_BLOCKS   NUM_ROWS
------------- ---------- ----------- ----------
ZIGGY_WEIRD_I          2        9175    1000000

SQL> drop index ziggy_weird_i;

Index dropped.

SQL> create index ziggy_weird_i on ziggy(weird) pctfree 0 compress advanced low;

Index created.

SQL> select index_name, blevel, leaf_blocks, num_rows from dba_indexes where index_name='ZIGGY_WEIRD_I';

INDEX_NAME        BLEVEL LEAF_BLOCKS   NUM_ROWS
------------- ---------- ----------- ----------
ZIGGY_WEIRD_I          2        1389    1000000

 

So this index has compressed down from 9175 leaf blocks to just 1389. That’s impressive.

However, this scenario is also the perfect case for a Bitmap Index with large, highly repeated index entries. If we compare the compressed B-Tree Index with a corresponding Bitmap index:

SQL> create bitmap index ziggy_weird_i on ziggy(weird) pctfree 0;

Index created.

SQL> select index_name, blevel, leaf_blocks, num_rows from dba_indexes where index_name='ZIGGY_WEIRD_I';

INDEX_NAME        BLEVEL LEAF_BLOCKS   NUM_ROWS
------------- ---------- ----------- ----------
ZIGGY_WEIRD_I          1          21         42

 

At just a tiny 21 leaf blocks, the Bitmap Index wins by a mile.

In example 2, I’m going to create an index that is still almost a perfect case for compressing a B-Tree Index, but far less so for a Bitmap Index. I’m going to create enough duplicate entries to just about fill a specific leaf block, so that each leaf block only has 1 or 2 distinct index values. However, as we’ll have many more distinct indexed values overall, this means we’ll need more index entries in the corresponding Bitmap Index.

SQL> create table ziggy2 (id number, weird varchar2(100));

Table created.

SQL> insert into ziggy2 select rownum, 'THE RISE AND FALL OF ZIGGY STARDUST AND THE SPIDERS FROM MARS'||mod(rownum,1385)
     from dual connect by level<=1000000;

1000000 rows created.

SQL> commit;

Commit complete.

SQL> create index ziggy2_weird_i on ziggy2(weird) pctfree 0;

Index created.

SQL> select index_name, blevel, leaf_blocks, num_rows from dba_indexes where index_name='ZIGGY2_WEIRD_I';

INDEX_NAME         BLEVEL LEAF_BLOCKS   NUM_ROWS
-------------- ---------- ----------- ----------
ZIGGY2_WEIRD_I          2        9568    1000000

SQL> drop index ziggy2_weird_i;

Index dropped.

SQL> create index ziggy2_weird_i on ziggy2(weird) pctfree 0 compress advanced low;

Index created.

SQL> select index_name, blevel, leaf_blocks, num_rows from dba_indexes where index_name='ZIGGY2_WEIRD_I';

INDEX_NAME         BLEVEL LEAF_BLOCKS   NUM_ROWS
-------------- ---------- ----------- ----------
ZIGGY2_WEIRD_I          2        1401    1000000

 

So we have a relatively large indexed column that has some 1385 distinct values but each value just about fills out a compressed leaf block. If we look at the compression of the index, we have reduced the index down from 9568 leaf blocks to just 1401 leaf blocks. Again, a very impressive compression ratio.

Unlike the previous example where we had just the one value, we now have some 1385 index entries that need to be created as a minimum for our Bitmap Index. So how does it compare now ?

SQL> drop index ziggy2_weird_I;

Index dropped.

SQL> create bitmap index ziggy2_weird_i on ziggy2(weird) pctfree 0;

Index created.

SQL> select index_name, blevel, leaf_blocks, num_rows from dba_indexes where index_name='ZIGGY2_WEIRD_I';

INDEX_NAME         BLEVEL LEAF_BLOCKS   NUM_ROWS
-------------- ---------- ----------- ----------
ZIGGY2_WEIRD_I          2         462       1385

 

Although the Bitmap Index is much larger than it was in the previous example, at just 462 leaf blocks it’s still significantly smaller than the corresponding compressed 1401 leaf block B-Tree index.

OK, example 3, we’re going to go into territory where no Bitmap Index should tread (or so many myths would suggest). We’re going to index a column in which each value only has the one duplicate. So for our 1 million row table, the column will have some 500,000 distinct values.

With relatively few duplicate column values, the compression of our B-Tree Indexes is not going to be as impressive. However, because the indexed values are still relatively large, any reduction here would likely have some overall impact:

SQL> create table ziggy3 (id number, weird varchar2(100));

Table created.

SQL> insert into ziggy3 select rownum, 'THE RISE AND FALL OF ZIGGY STARDUST AND THE SPIDERS FROM MARS'||mod(rownum,500000)
     from dual connect by level<=1000000;

1000000 rows created.

SQL> commit;

Commit complete.

SQL> create index ziggy3_weird_i on ziggy3(weird) pctfree 0;

Index created.

SQL> select index_name, blevel, leaf_blocks, num_rows from dba_indexes where index_name='ZIGGY3_WEIRD_I';

INDEX_NAME         BLEVEL LEAF_BLOCKS   NUM_ROWS
-------------- ---------- ----------- ----------
ZIGGY3_WEIRD_I          2        9891    1000000

SQL> drop index ziggy3_weird_i;

Index dropped.

SQL> create index ziggy3_weird_i on ziggy3(weird) pctfree 0 compress advanced low;

Index created.

SQL> select index_name, blevel, leaf_blocks, num_rows from dba_indexes where index_name='ZIGGY3_WEIRD_I';

INDEX_NAME         BLEVEL LEAF_BLOCKS   NUM_ROWS
-------------- ---------- ----------- ----------
ZIGGY3_WEIRD_I          2        6017    1000000

 

So the compression ratio is not as good now, coming down to 6017 leaf blocks from 9891. However, this will surely be better than a Bitmap Index with 500,000 distinct values …

 

SQL> drop index ziggy3_weird_i;

Index dropped.

SQL> create bitmap index ziggy3_weird_i on ziggy3(weird) pctfree 0;

Index created.

SQL> select index_name, blevel, leaf_blocks, num_rows from dba_indexes where index_name='ZIGGY3_WEIRD_I';

INDEX_NAME         BLEVEL LEAF_BLOCKS   NUM_ROWS
-------------- ---------- ----------- ----------
ZIGGY3_WEIRD_I          2        5740     500000

 

So even in this extreme example, the Bitmap Index at 5740 leaf blocks is still smaller than the corresponding compressed B-Tree Index at 6017 leaf blocks.

Example 4 is a scenario similar to the last one, except the index entries themselves are going to be much smaller (a few byte number column vs. the 60 odd byte varchar2). Therefore, the rowids of the index entries will be a much larger proportion of the overall index entry size. Reducing the storage of index values via compression will be far less effective, considering the prefix table in a compressed index comes with some overhead.

SQL> create table ziggy4 (id number, weird number);

Table created.

SQL> insert into ziggy4 select rownum, mod(rownum,500000) from dual connect by level <=1000000;

1000000 rows created.

SQL> commit;

Commit complete.

SQL> create index ziggy4_weird_i on ziggy4(weird) pctfree 0;

Index created.

SQL> select index_name, blevel, leaf_blocks, num_rows from dba_indexes where index_name='ZIGGY4_WEIRD_I';

INDEX_NAME         BLEVEL LEAF_BLOCKS   NUM_ROWS
-------------- ---------- ----------- ----------
ZIGGY4_WEIRD_I          2        1998    1000000

SQL> drop index ziggy4_weird_i;

Index dropped.

SQL> create index ziggy4_weird_i on ziggy4(weird) pctfree 0 compress advanced low;

Index created.

SQL> select index_name, blevel, leaf_blocks, num_rows from dba_indexes where index_name='ZIGGY4_WEIRD_I';

INDEX_NAME         BLEVEL LEAF_BLOCKS   NUM_ROWS
-------------- ---------- ----------- ----------
ZIGGY4_WEIRD_I          2        1998    1000000

 

So Index Advanced Compression has decided against compressing this index as it’s just not worth the effort. If we force compression:

 

SQL> drop index ziggy4_weird_i;

Index dropped.

SQL> create index ziggy4_weird_i on ziggy4(weird) pctfree 0 compress;

Index created.

SQL> select index_name, blevel, leaf_blocks, num_rows from dba_indexes where index_name='ZIGGY4_WEIRD_I';

INDEX_NAME         BLEVEL LEAF_BLOCKS   NUM_ROWS
-------------- ---------- ----------- ----------
ZIGGY4_WEIRD_I          2        2065    1000000

 

We notice the index has actually increased in size, up to 2065 leaf blocks from 1998. The overheads of the prefix table outweigh the small efficiencies of reducing the duplicate number indexed values.

Meanwhile the corresponding Bitmap Index:

SQL> drop index ziggy4_weird_i;

Index dropped.

SQL> create bitmap index ziggy4_weird_i on ziggy4(weird) pctfree 0;

Index created.

SQL> select index_name, blevel, leaf_blocks, num_rows from dba_indexes where index_name='ZIGGY4_WEIRD_I';

INDEX_NAME         BLEVEL LEAF_BLOCKS   NUM_ROWS
-------------- ---------- ----------- ----------
ZIGGY4_WEIRD_I          2        1817     500000

 

It is still smaller at 1817 leaf blocks than the best the B-Tree index has to offer.

So the answer is no, Bitmap Indexes are not redundant now that we have Index Advanced Compression. In Data Warehouse environments, as long as they don’t reference column values that are approaching uniqueness, Bitmap Indexes are likely going to be smaller than corresponding compressed B-Tree indexes.
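To put the four examples side by side, the following small Python sketch takes the LEAF_BLOCKS figures reported in the listings above and expresses the compressed B-Tree and Bitmap index sizes as a percentage of the uncompressed B-Tree. For example 4, the “compressed” figure used is the 1998 blocks of the COMPRESS ADVANCED LOW index (the forced compression result of 2065 was larger still). It’s just arithmetic on the numbers already shown, nothing more.

```python
# Leaf block counts as reported by the four examples in this post
results = {
    "example 1": {"btree": 9175, "compressed": 1389, "bitmap": 21},
    "example 2": {"btree": 9568, "compressed": 1401, "bitmap": 462},
    "example 3": {"btree": 9891, "compressed": 6017, "bitmap": 5740},
    "example 4": {"btree": 1998, "compressed": 1998, "bitmap": 1817},
}

def percent_of(part, whole):
    """Size of part as a percentage of whole, rounded to one decimal place."""
    return round(100.0 * part / whole, 1)

for name, blocks in results.items():
    print(
        name,
        "compressed B-Tree:", percent_of(blocks["compressed"], blocks["btree"]), "%",
        "bitmap:", percent_of(blocks["bitmap"], blocks["btree"]), "%",
    )
```

In every one of the four cases the Bitmap Index has fewer leaf blocks than the best compressed B-Tree, although the gap narrows considerably as the number of distinct values climbs.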

12.1.0.2 Introduction to Zone Maps Part II (Changes) October 30, 2014

Posted by Richard Foote in 12c, Exadata, Oracle Indexes, Zone Maps.
1 comment so far

In Part I, I discussed how Zone Maps are new index-like structures, similar to Exadata Storage Indexes, that enable the “pruning” of disk blocks during accesses of the table by storing the min and max values of selected columns for each “zone” of a table, a Zone being a range of contiguous (8M) blocks.

I showed how a Zone Map was relatively tiny but very effective in reducing the number of consistent gets for a well clustered column (ALBUM_ID).

In this post, we’re going to continue with the demo and look at what happens when we update data in the table with a Zone Map in place.

So let’s update the ALBUM_ID column (which currently has a Zone Map defined) for a few rows. The value of ALBUM_ID was previously 1 for all these rows (the full range of values is currently between 1 and 100) but we’re going to update them to 142:

SQL> update big_bowie set album_id=142 where id between 1 and 100;

100 rows updated.

SQL> commit;

Commit complete.

 

So the maximum value of ALBUM_ID is now 142, not 100. If we look at the maximum value as currently listed in the Zone Map:

 

SQL> select max(max_1_album_id) from  big_bowie_album_id_zm;

MAX(MAX_1_ALBUM_ID)
-------------------
100

 

We notice the maximum is still defined as being 100, so the update on the table has not actually updated the contents of the Zone Map. This is a big difference between Zone Maps and conventional indexes: indexes are automatically maintained during DML operations, whereas Zone Maps are not (unless the REFRESH ON COMMIT option is specified).

If we look at the state of Zone Map entries that have a minimum of 1 (the previous values of ALBUM_ID before the update):

SQL> select * from big_bowie_album_id_zm where min_1_album_id = 1;

  ZONE_ID$ MIN_1_ALBUM_ID MAX_1_ALBUM_ID ZONE_LEVEL$ ZONE_STATE$ ZONE_ROWS$
---------- -------------- -------------- ----------- ----------- ----------
3.8586E+11              1              2           0           0      66234
3.8586E+11              1              2           0           1      65787
3.8586E+11              1              2           0           0      66223

 

We notice that one of the entries has a status of 1, meaning that a specific zone has been marked as stale. However, all the other zones are still OK.
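The behaviour so far (DML flags a zone as stale but leaves its min/max alone) can be mimicked with a small Python model. This is purely a sketch under my own assumptions: the `Zone` and `ToyZoneMap` classes and the `note_update` method are hypothetical names, and the `stale` flag simply stands in for a ZONE_STATE$ of 1.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Zone:
    min_val: int
    max_val: int
    stale: bool = False  # plays the role of ZONE_STATE$ = 1

class ToyZoneMap:
    """Hypothetical model: min/max per zone, captured once at build time."""

    def __init__(self, rows: List[int], zone_size: int):
        self.zone_size = zone_size
        self.zones = [
            Zone(min(rows[i:i + zone_size]), max(rows[i:i + zone_size]))
            for i in range(0, len(rows), zone_size)
        ]

    def note_update(self, row_index: int) -> None:
        # DML only flags the affected zone; min/max are left untouched.
        self.zones[row_index // self.zone_size].stale = True

    def max_value(self) -> int:
        return max(z.max_val for z in self.zones)

# 1000 toy rows with album ids 1..100, stored in clustered order.
rows = [album for album in range(1, 101) for _ in range(10)]
zm = ToyZoneMap(rows, zone_size=50)

# Toy equivalent of updating a few rows from album_id 1 to 142.
for i in range(5):
    rows[i] = 142
    zm.note_update(i)

print(max(rows))                       # 142 in the "table"
print(zm.max_value())                  # still 100 in the toy zone map
print(sum(z.stale for z in zm.zones))  # 1 zone flagged as stale
```

The table and the toy zone map now disagree on the maximum value, but only the one affected zone is flagged, so the remaining zones can still be trusted.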

If we look at the status of the overall Zone Map:

SQL> select zonemap_name, pruning, refresh_mode, invalid, stale, unusable
from dba_zonemaps where zonemap_name='BIG_BOWIE_ALBUM_ID_ZM';

ZONEMAP_NAME              PRUNING  REFRESH_MODE      INVALID STALE   UNUSABLE
------------------------- -------- ----------------- ------- ------- --------
BIG_BOWIE_ALBUM_ID_ZM     ENABLED  LOAD DATAMOVEMENT NO      NO      NO

 

We notice that the Zone Map is still “hunky dory” after the update.

If we now re-run the query we ran in Part I:

 

SQL> select * from big_bowie where album_id = 42;

100000 rows selected.

Elapsed: 00:00:00.29

Execution Plan
----------------------------------------------------------
Plan hash value: 1980960934

----------------------------------------------------------------------------------------------------
| Id  | Operation                              | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                       |           |   100K|  8984K|  3269  (12)| 00:00:01 |
|*  1 |  TABLE ACCESS STORAGE FULL WITH ZONEMAP| BIG_BOWIE |   100K|  8984K|  3269  (12)| 00:00:01 |
----------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------

1 - storage("ALBUM_ID"=42)
filter(SYS_ZMAP_FILTER('/* ZM_PRUNING */ SELECT "ZONE_ID$", CASE WHEN
BITAND(zm."ZONE_STATE$",1)=1 THEN 1 ELSE CASE WHEN (zm."MIN_1_ALBUM_ID" > :1 OR
zm."MAX_1_ALBUM_ID" < :2) THEN 3 ELSE 2 END END FROM "BOWIE"."BIG_BOWIE_ALBUM_ID_ZM" zm
WHERE zm."ZONE_LEVEL$"=0 ORDER BY zm."ZONE_ID$"',SYS_OP_ZONE_ID(ROWID),42,42)<3 AND "ALBUM_ID"=42)
Statistics
----------------------------------------------------------
141  recursive calls
0  db block gets
3238  consistent gets
0  physical reads
0  redo size
3130019  bytes sent via SQL*Net to client
761  bytes received via SQL*Net from client
21  SQL*Net roundtrips to/from client
0  sorts (memory)
0  sorts (disk)
100000  rows processed

 

We see the Zone Map was still used by the CBO. The number of consistent gets has increased (up from 2364 to 3238) as we now have to additionally access all the blocks associated with this stale zone, but it’s still more efficient than reading all the blocks from the entire table.
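The CASE expression inside SYS_ZMAP_FILTER in the predicate section above can be read as a three-way classification of each zone. The following Python sketch is my interpretation of that expression only (the function names are made up, and the meaning attached to codes 1, 2 and 3 is inferred from the `< 3` comparison in the plan rather than from any documentation):

```python
def zone_filter_code(zone_state: int, zone_min: int, zone_max: int,
                     bind_low: int, bind_high: int) -> int:
    """Mirror the CASE expression shown in the plan's predicate section."""
    if (zone_state & 1) == 1:
        return 1  # BITAND(ZONE_STATE$, 1) = 1: zone is stale
    if zone_min > bind_low or zone_max < bind_high:
        return 3  # the value cannot be in this zone
    return 2      # the value may be in this zone

def must_scan(code: int) -> bool:
    """The plan keeps rows where the filter result is < 3."""
    return code < 3

# (ZONE_STATE$, MIN_1_ALBUM_ID, MAX_1_ALBUM_ID) for the three zones listed earlier.
zones = [(0, 1, 2), (1, 1, 2), (0, 1, 2)]

codes = [zone_filter_code(state, lo, hi, 42, 42) for state, lo, hi in zones]
print(codes)                          # [3, 1, 3]
print([must_scan(c) for c in codes])  # [False, True, False]
```

Under this reading, the stale zone is always scanned regardless of its recorded min/max, while the two zones that are still valid continue to be pruned for `album_id = 42`.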

If we want to remove the stale zone entries, we can refresh the Zone Map or rebuild it (for ON DEMAND refresh):

 

SQL> alter materialized zonemap big_bowie_album_id_zm rebuild;

Materialized zonemap altered.

 

If we now look at the Zone Map entry:

 

SQL> select * from big_bowie_album_id_zm where min_1_album_id = 1;

  ZONE_ID$ MIN_1_ALBUM_ID MAX_1_ALBUM_ID ZONE_LEVEL$ ZONE_STATE$ ZONE_ROWS$
---------- -------------- -------------- ----------- ----------- ----------
3.8586E+11              1              2           0           0      66234
3.8586E+11              1            142           0           0      65787
3.8586E+11              1              2           0           0      66223

 

We see that the entry is no longer stale and now correctly reflects the actual maximum value within the zone (142).

If we now re-run the query:

SQL> select * from big_bowie where album_id = 42;

100000 rows selected.

Elapsed: 00:00:00.30

Execution Plan
----------------------------------------------------------
Plan hash value: 1980960934

----------------------------------------------------------------------------------------------------
| Id  | Operation                              | Name      | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                       |           |   100K|  8984K|  3269  (12)| 00:00:01 |
|*  1 |  TABLE ACCESS STORAGE FULL WITH ZONEMAP| BIG_BOWIE |   100K|  8984K|  3269  (12)| 00:00:01 |
----------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------

1 - storage("ALBUM_ID"=42)
filter(SYS_ZMAP_FILTER('/* ZM_PRUNING */ SELECT "ZONE_ID$", CASE WHEN
BITAND(zm."ZONE_STATE$",1)=1 THEN 1 ELSE CASE WHEN (zm."MIN_1_ALBUM_ID" > :1 OR
zm."MAX_1_ALBUM_ID" < :2) THEN 3 ELSE 2 END END FROM "BOWIE"."BIG_BOWIE_ALBUM_ID_ZM" zm
WHERE zm."ZONE_LEVEL$"=0 ORDER BY zm."ZONE_ID$"',SYS_OP_ZONE_ID(ROWID),42,42)<3 AND "ALBUM_ID"=42)
Statistics
----------------------------------------------------------
141  recursive calls
0  db block gets
3238  consistent gets
0  physical reads
0  redo size
3130019  bytes sent via SQL*Net to client
761  bytes received via SQL*Net from client
21  SQL*Net roundtrips to/from client
0  sorts (memory)
0  sorts (disk)
100000  rows processed

 

We notice nothing has appreciably changed: the Zone Map is still being used, but the number of consistent gets remains the same as before. Why haven’t we returned to our previous 2364 consistent gets?

Well, as the range of possible values within the updated zone is now between 1 and 142, the required value of 42 could potentially be found within this zone and so still needs to be accessed just in case. We know that the value of 42 doesn’t exist within this zone, but Oracle has no way of knowing this based on the possible 1 to 142 range.
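A tiny Python illustration of this point, using made-up zone contents (the row counts below are hypothetical and chosen only to reproduce a min of 1 and a max of 142):

```python
# Hypothetical contents of one zone before the update: album ids 1 and 2 only.
zone_rows = [1] * 200 + [2] * 300

# The update: some of the rows with album_id 1 now have album_id 142.
zone_rows[:100] = [142] * 100

# A refresh/rebuild recomputes min and max from the rows actually in the zone.
zone_min, zone_max = min(zone_rows), max(zone_rows)

wanted = 42
value_present = wanted in zone_rows              # what we know
may_contain = zone_min <= wanted <= zone_max     # all a min/max range can tell us

print((zone_min, zone_max))  # (1, 142)
print(value_present)         # False
print(may_contain)           # True -> the zone still has to be read
```

The min/max pair says nothing about gaps inside the range, so a single outlier value is enough to make a zone unprunable for every value between the old and new extremes.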

Hence Zone Maps work best when the data is well clustered and the Min/Max ranges of each zone can be used to limit which zones need to be accessed. If the data was not well clustered and the Min/Max range of most zones spanned most of the column’s values, then Oracle wouldn’t be able to effectively prune many (if any) zones and the Zone Map would be useless.
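The effect of clustering can be sketched with two layouts of the same toy data in Python. Again this is only a simplified model under my own assumptions (zones counted in rows, a hypothetical `zones_scanned` helper, and a round-robin layout standing in for poorly clustered data):

```python
from typing import List

def zones_scanned(values: List[int], zone_size: int, wanted: int) -> int:
    """Count zones whose min/max range cannot rule out `wanted`."""
    scanned = 0
    for start in range(0, len(values), zone_size):
        chunk = values[start:start + zone_size]
        if min(chunk) <= wanted <= max(chunk):
            scanned += 1
    return scanned

n_rows, zone_size = 100_000, 2_500   # 40 toy zones

# Same 100 album ids, two physical orderings.
clustered = [i // 1000 + 1 for i in range(n_rows)]  # each id stored together
scattered = [i % 100 + 1 for i in range(n_rows)]    # ids interleaved round-robin

print(zones_scanned(clustered, zone_size, 42))  # 1
print(zones_scanned(scattered, zone_size, 42))  # 40
```

With the interleaved layout every zone has a range of 1 to 100, so not a single zone can be pruned and the min/max structure buys nothing.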

As we’ll see in Part III 🙂