jump to navigation

Storage Indexes vs Database Indexes Part III: Hidden Values (Hide Away) February 7, 2013

Posted by Richard Foote in Exadata, Oracle Indexes, Smart Scans, Storage Indexes.
trackback

OK, my holiday to Hawaii is now slowing fading away into distant memory. Time for a new post 🙂

In my previous post on differences between Exadata Storage Indexes and Database Indexes Part II, I discussed how the clustering of data within the data is an important factor (pun fully intended !!) in the performance and efficiency of both types of indexes. In this post, I’ll expand a little more on this point and highlight scenarios where a normal database index can significantly improve performance but be totally ineffective as a Storage Index.

Storage Indexes can only determine where required data can’t exist within a (1M default) storage region via the associated Min/Max values of the index. If a required value can’t possibly exist within the Min/Max range boundary, the specific storage region can be ignored and not be accessed during the full scan operation. However, in some scenarios, there can be limitations in the usefulness of only having Min/Max values to work with if both Min and Max values are relatively common and randomly distributed throughout the table. This is not an uncommon scenario when dealing with columns that have data such as code status values.

In the BIG_BOWIE table I’ve been using in this series, I have a FORMAT_ID column that was initially populated with just values 2,4,6,8 and 10, evenly distributed throughout the table. I then updated a relatively small number of these values in the following manner:

SQL> update big_bowie set format_id = 3 where mod(id,10000)=0;
1000 rows updated.

SQL> commit;

Commit complete.

SQL> update big_bowie set format_id = 5 where id between 424242 and 425241;

1000 rows updated.

SQL> commit;

Commit complete.

So I have relatively few FORMAT_ID = 3 (1000 in 10M rows, 0.01% of data) littered throughout the table and a few FORMAT_ID = 5 located in a specific portion/location of the table.

I’ll now create a normal database index on this FORMAT_ID column and a histogram to let the CBO know there are indeed relatively few occurances of values 3 and 5 within the table:

SQL> create index big_bowie_format_id_i on big_bowie(format_id);
Index created.

SQL> exec dbms_stats.gather_table_stats(ownname=>user, tabname=>'BIG_BOWIE', method_opt=> 'FOR COLUMNS FORMAT_ID SIZE 10');

PL/SQL procedure successfully completed.

OK, if we now run a query looking for all the FORMAT_ID = 3:

SQL> select album_id, total_sales from big_bowie where format_id = 3;
1000 rows selected.

Elapsed: 00:00:00.01

Execution Plan
 -----------------------------------------------------------------------------------------------------
 | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
 -----------------------------------------------------------------------------------------------------
 | 0 | SELECT STATEMENT | | 1000 | 11000 | 72 (0)| 00:00:01 |
 | 1 | TABLE ACCESS BY INDEX ROWID| BIG_BOWIE | 1000 | 11000 | 72 (0)| 00:00:01 |
 |* 2 | INDEX RANGE SCAN | BIG_BOWIE_FORMAT_ID_I | 1000 | | 4 (0)| 00:00:01 |
 -----------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
 ---------------------------------------------------

2 - access("FORMAT_ID"=3)

Statistics
 ----------------------------------------------------------
 0 recursive calls
 0 db block gets
 1072 consistent gets
 0 physical reads
 0 redo size
 22984 bytes sent via SQL*Net to client
 1250 bytes received via SQL*Net from client
 68 SQL*Net roundtrips to/from client
 0 sorts (memory)
 0 sorts (disk)
 1000 rows processed

We notice  the CBO has used the index to efficiently retrieve just the 0.01% of required rows because of the high selectivity of the query, even though the overall Clustering Factor of the index is appalling.

It does an even better job when accessing FORMAT_ID = 5 as these specific values are all well clustered and  found within in a very tight portion of the table:

SQL> select album_id, total_sales from big_bowie where format_id = 5;
1000 rows selected.

Elapsed: 00:00:00.00

Execution Plan
 -----------------------------------------------------------------------------------------------------
 | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
 -----------------------------------------------------------------------------------------------------
 | 0 | SELECT STATEMENT | | 1000 | 11000 | 72 (0)| 00:00:01 |
 | 1 | TABLE ACCESS BY INDEX ROWID| BIG_BOWIE | 1000 | 11000 | 72 (0)| 00:00:01 |
 |* 2 | INDEX RANGE SCAN | BIG_BOWIE_FORMAT_ID_I | 1000 | | 4 (0)| 00:00:01 |
 -----------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
 ---------------------------------------------------

2 - access("FORMAT_ID"=5)

Statistics
 ----------------------------------------------------------
 0 recursive calls
 0 db block gets
 154 consistent gets
 0 physical reads
 0 redo size
 22685 bytes sent via SQL*Net to client
 1250 bytes received via SQL*Net from client
 68 SQL*Net roundtrips to/from client
 0 sorts (memory)
 0 sorts (disk)
 1000 rows processed

At just 154 consistent gets in this example, the results return almost instantaneously. The database index is extremely effective in these examples because it can point just to the locations within the table where these relatively few rows of interest are located.

Let’s see how Storage Indexes cope in this scenario. First, we make the database index invisible:

SQL> alter index big_bowie_format_id_i invisible;
Index altered.

We now run the same query again (this can be repeated as many times as you like):

SQL> select album_id, total_sales from big_bowie where format_id = 3;
1000 rows selected.

Elapsed: 00:00:00.80

Execution Plan
 ---------------------------------------------------------------------------------------
 | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
 ---------------------------------------------------------------------------------------
 | 0 | SELECT STATEMENT | | 1000 | 11000 | 36682 (1)| 00:07:21 |
 |* 1 | TABLE ACCESS STORAGE FULL| BIG_BOWIE | 1000 | 11000 | 36682 (1)| 00:07:21 |
 ---------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
 ---------------------------------------------------

1 - storage("FORMAT_ID"=3)
 filter("FORMAT_ID"=3)

SQL> select name , value/1024/1024 MB from v$statname n, v$mystat s where n.statistic# = s.statistic# and n.name in ('cell physical IO interconnect bytes returned by smart scan', 'cell physical IO bytes saved by storage index');

NAME MB
 ---------------------------------------------------------------- ----------
 cell physical IO bytes saved by storage index 0
 cell physical IO interconnect bytes returned by smart scan .397476196

We notice no matter how often we try, that Storage Indexes have managed to save us precisely nothing. The entire table has had to be scanned and read during the Full Table Scan operation.

SQL> select album_id, total_sales from big_bowie where format_id = 5;
1000 rows selected.

Elapsed: 00:00:00.89

Execution Plan
 ---------------------------------------------------------------------------------------
 | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
 ---------------------------------------------------------------------------------------
 | 0 | SELECT STATEMENT | | 1000 | 11000 | 36682 (1)| 00:07:21 |
 |* 1 | TABLE ACCESS STORAGE FULL| BIG_BOWIE | 1000 | 11000 | 36682 (1)| 00:07:21 |
 ---------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
 ---------------------------------------------------

1 - storage("FORMAT_ID"=5)
 filter("FORMAT_ID"=5)

SQL> select name , value/1024/1024 MB from v$statname n, v$mystat s where n.statistic# = s.statistic# and n.name in ('cell physical IO interconnect bytes returned by smart scan', 'cell physical IO bytes saved by storage index');

NAME MB
 ---------------------------------------------------------------- ----------
 cell physical IO bytes saved by storage index 0
 cell physical IO interconnect bytes returned by smart scan .565643311

Same results when accessing the FORMAT_ID = 5. Any possible Storage Indexes have been totally ineffective and the Full Table Scan has had to read the entire table.

The result being performance is significantly slower than with the database indexes visible and in place.

Why ?

The data consists primarily of randomly distributed values 2,4,6,8 and 10, with the odd value 3 and 5 littered around. This means that if created, the Storage Index Min/Max values will effectively be 2 and 10 for all 1M storage regions throughout the entire table.

Therefore, both values 3 and 5 could possibly exist within all/any of the table storage regions as they sit inside the 2 – 10 Min/Max range boundaries. These 3 and 5 values are effectively “hidden” within the Storage Index Min/Max values making Storage Indexes totally ineffective in this scenario, as they can’t skip a thing.

A final point here. If one of these rarer values of interest were to be say value “12”, then a Storage Index could be useful when searching for this particular value as any Min/Max ranges of 2-10 that don’t therefore contain a value of 12 can be skipped. However, if values 3 and 5 were also of interest, the Storage Index would still only be useful in a subset of required cases. A database index would therefore still be required to cater all necessarily cases.

As Database indexes can directly point to indexed values of interest, this is another example of where we would likely not want to just drop a database index in Exadata.

Comments»

1. Latest data Industry news round up, Log Buffer #306 - February 9, 2013

[…] This is what happens when a vacation in Hawaii fades away for Richard Foote. […]

Like

2. vijay - June 13, 2013

Hi Foote,

Good presentation ,
Can we say , SI is ineffective low cardinality(here it’s 7) ?
and seems like SI effective if the number of distinct values is more .

Thanks
Vijay

Like

Richard Foote - June 26, 2013

Hi Vijay

Not necessarily. It depends on the actual selectivity of the values of interest and where they sit within the min/max range. So for example, if you have just 7 distinct values (1-7) but value 7 occurs very infrequently, then a SI would be very effective in eliminating I/O.

Like


Leave a comment