Improve Data Clustering on Multiple Columns Concurrently (Two Suns in the Sunset)

Improve Data Clustering on Multiple Columns Concurrently (Two Suns in the Sunset) March 12, 2018

Posted by Richard Foote in 12c, Attribute Clustering, Clustering Factor, Online DDL, Oracle Indexes.
trackback

I’ve had a couple of recent discussions around clustering and how if you attempt to improve the clustering of a table based on a column, you thereby ruin the current clustering that might exist for a different column. The common wisdom being you can only order the data one way and if you change the order, you might improve things for one column but totally stuff things up for another.

However, that’s not strictly correct. Depending on the characteristics of your data, you can potentially order (or interleave) data based on multiple columns concurrently. It’s quite possible to have good or good enough clustering on multiple columns and this is extremely important for indexes, as the efficiency of an index can be directly impacted by the clustering of data on the underlining tables.

So to illustrate, I’m going to create a table that initially has terrible clustering on two unrelated columns (code and grade) :

SQL> create table ziggy (id number, code number, grade number, name varchar2(42));

Table created.

SQL> insert into ziggy select rownum, mod(rownum, 100)+1, ceil(dbms_random.value(0,100)), 'ZIGGY STARDUST'
from dual connect by level  commit;

Commit complete.

SQL> exec dbms_stats.gather_table_stats(ownname=>null, tabname=> 'ZIGGY', method_opt=>'FOR ALL COLUMNS SIZE 1');

PL/SQL procedure successfully completed.

SQL> create index ziggy_code_i on ziggy(code);

Index created.

SQL> create index ziggy_grade_i on ziggy(grade);

Index created.

SQL> select index_name, clustering_factor, num_rows from user_indexes
where table_name='ZIGGY';

INDEX_NAME           CLUSTERING_FACTOR   NUM_ROWS
-------------------- ----------------- ----------
ZIGGY_CODE_I                   1748800    4000000
ZIGGY_GRADE_I                  1572829    4000000

So with values for both columns distributed all throughout the table, the Clustering Factor of both the CODE and GRADE indexes are both quite poor (values of 1748800 and 1572829 respectively). Even though both columns have 100 distinct values (and so a selectivity of 1%), the CBO will likely consider the indexes too inefficient to use:

SQL> select * from ziggy where code=42;

40000 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 2421001569

---------------------------------------------------------------------------
| Id  | Operation         | Name  | Rows  | Bytes | Cost (%CPU) | Time    |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |       | 40000 | 1054K |   4985 (10) | 00:00:01|
| * 1 | TABLE ACCESS FULL | ZIGGY | 40000 | 1054K |   4985 (10) | 00:00:0 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter("CODE"=42)

Statistics
----------------------------------------------------------
       0 recursive calls
       0 db block gets
   20292 consistent gets
       0 physical reads
       0 redo size
 1058750 bytes sent via SQL*Net to client
   29934 bytes received via SQL*Net from client
    2668 SQL*Net roundtrips to/from client
       0 sorts (memory)
       0 sorts (disk)
  40000 rows processed

SQL> select * from ziggy where grade=42;

40257 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 2421001569

---------------------------------------------------------------------------
| Id  | Operation         | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |       | 40000 | 1054K |  5021 (10) | 00:00:01 |
| * 1 | TABLE ACCESS FULL | ZIGGY | 40000 | 1054K |  5021 (10) | 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter("GRADE"=42)

Statistics
----------------------------------------------------------
       0 recursive calls
       0 db block gets
   20307 consistent gets
       0 physical reads
       0 redo size
 1065641 bytes sent via SQL*Net to client
   30121 bytes received via SQL*Net from client
    2685 SQL*Net roundtrips to/from client
       0 sorts (memory)
       0 sorts (disk)
   40257 rows processed

So even though the CBO has got the row estimates just about spot on, in both cases a Full Table Scan was chosen.

Let’s create another table based on the table above but this time order the data in CODE column order:

SQL> create table ziggy2 as select * from ziggy order by code;

Table created.

SQL> exec dbms_stats.gather_table_stats(ownname=>null, tabname=> 'ZIGGY2', method_opt=> 'FOR ALL COLUMNS SIZE 1');

PL/SQL procedure successfully completed.

SQL> create index ziggy2_code_i on ziggy2(code);

Index created.

SQL> create index ziggy2_grade_i on ziggy2(grade);

Index created.

SQL> select index_name, clustering_factor, num_rows from user_indexes 

where table_name='ZIGGY2';

INDEX_NAME           CLUSTERING_FACTOR   NUM_ROWS
-------------------- ----------------- ----------
ZIGGY2_CODE_I                    17561    4000000
ZIGGY2_GRADE_I                 1577809    4000000

We can see that by doing so, we have significantly reduced the Clustering Factor of the CODE index (down from 1748800 to just 17561) . The GRADE index though has changed little as there’s little co-relation between the CODE and GRADE columns.

If we now run the same query with the CODE based predicate:

SQL> select * from ziggy2 where code=42;

40000 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 16801974

-----------------------------------------------------------------------------------------------------
| Id | Operation                           | Name          | Rows  | Bytes | Cost (%CPU) | Time     |
-----------------------------------------------------------------------------------------------------
|  0 | SELECT STATEMENT                    |               | 40000 | 1054K |     264 (4) | 00:00:01 |
|  1 | TABLE ACCESS BY INDEX ROWID BATCHED | ZIGGY2        | 40000 | 1054K |     264 (4) | 00:00:01 |
|* 2 | INDEX RANGE SCAN                    | ZIGGY2_CODE_I | 40000 |       |      84 (5) | 00:00:01 |
-----------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - access("CODE"=42)

Statistics
----------------------------------------------------------
       0 recursive calls
       0 db block gets
     273 consistent gets
       0 physical reads
       0 redo size
 1272038 bytes sent via SQL*Net to client
     685 bytes received via SQL*Net from client
       9 SQL*Net roundtrips to/from client
       0 sorts (memory)
       0 sorts (disk)
   40000 rows processed

The CBO has not only used the index, but the query is much more efficient as a result, with just 273 consistent gets required to retrieve 40000 rows.

However the query based on the GRADE predicate still uses a FTS:

SQL> select * from ziggy2 where grade=42;

40257 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 1810052534

----------------------------------------------------------------------------
| Id | Operation         | Name   | Rows  | Bytes | Cost (%CPU) | Time     |
----------------------------------------------------------------------------
|  0 | SELECT STATEMENT  |        | 40000 | 1054K |   4920 (10) | 00:00:01 |
|* 1 | TABLE ACCESS FULL | ZIGGY2 | 40000 | 1054K |   4920 (10) | 00:00:01 |
----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter("GRADE"=42)

Statistics
----------------------------------------------------------
      0 recursive calls
     11 db block gets
  17602 consistent gets
      0 physical reads
      0 redo size
 434947 bytes sent via SQL*Net to client
    696 bytes received via SQL*Net from client
     10 SQL*Net roundtrips to/from client
      0 sorts (memory)
      0 sorts (disk)
  40257 rows processed

Now if we decide that actually the query based on GRADE is far more important to the business, we could of course reorder the data again. The following is yet another table, this time based on the CODE sorted ZIGGY2 table, but inserted in GRADE column order:

SQL> create table ziggy3 as select * from ziggy2 order by grade;

Table created.

SQL> exec dbms_stats.gather_table_stats(ownname=>null, tabname=> 'ZIGGY3', method_opt=> 'FOR ALL COLUMNS SIZE 1');

PL/SQL procedure successfully completed.

SQL> create index ziggy3_code_i on ziggy3(code);

Index created.

SQL> create index ziggy3_grade_i on ziggy3(grade);

Index created.

SQL> select index_name, clustering_factor, num_rows from user_indexes 

where table_name='ZIGGY3';

INDEX_NAME           CLUSTERING_FACTOR   NUM_ROWS
-------------------- ----------------- ----------
ZIGGY3_CODE_I                    30231    4000000
ZIGGY3_GRADE_I                   17582    4000000

We notice we now have an excellent, very low Clustering Factor for the GRADE index (down to just 17582). But notice also the Clustering Factor for CODE. Although it has increased from 17561 to 30231, it’s nowhere near as bad as it was initially when is was a massive 1748800.

The point being that with the data already ordered on CODE, Oracle inserting the data in GRADE order effectively had the data already sub-ordered on CODE. So we end up with perfect clustering on the GRADE column and “good enough” clustering on CODE as well.

If we now run the same queries again:

SQL> select * from ziggy3 where code=42;

40000 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 1004048030

-----------------------------------------------------------------------------------------------------
| Id | Operation                           | Name          | Rows  | Bytes | Cost (%CPU) | Time     |
-----------------------------------------------------------------------------------------------------
|  0 | SELECT STATEMENT                    |               | 40000 | 1054K |     392 (3) | 00:00:01 |
|  1 | TABLE ACCESS BY INDEX ROWID BATCHED | ZIGGY3        | 40000 | 1054K |     392 (3) | 00:00:01 |
|* 2 | INDEX RANGE SCAN                    | ZIGGY3_CODE_I | 40000 |       |      84 (5) | 00:00:01 |
-----------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - access("CODE"=42)

Statistics
----------------------------------------------------------
       0 recursive calls
       0 db block gets
     401 consistent gets
       0 physical reads
       0 redo size
 1272038 bytes sent via SQL*Net to client
     685 bytes received via SQL*Net from client
       9 SQL*Net roundtrips to/from client
       0 sorts (memory)
       0 sorts (disk)
   40000 rows processed

With the CODE based query, the CBO still uses the index and performance is still quite good with consistent gets having gone up a tad (401 up from 273). However, we now have the scenario where the GRADE based query is also efficient with the index access also selected by the CBO:

SQL> select * from ziggy3 where grade=42;

40257 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 844233985

------------------------------------------------------------------------------------------------------
| Id | Operation                           | Name           | Rows  | Bytes | Cost (%CPU) | Time     |
------------------------------------------------------------------------------------------------------
|  0 | SELECT STATEMENT                    |                | 40000 | 1054K |     264 (4) | 00:00:01 |
|  1 | TABLE ACCESS BY INDEX ROWID BATCHED | ZIGGY3         | 40000 | 1054K |     264 (4) | 00:00:01 |
|* 2 | INDEX RANGE SCAN                    | ZIGGY3_GRADE_I | 40000 |       |      84 (5) | 00:00:01 |
------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - access("GRADE"=42)

Statistics
----------------------------------------------------------
       0 recursive calls
       0 db block gets
     278 consistent gets
       0 physical reads
       0 redo size
 1280037 bytes sent via SQL*Net to client
     696 bytes received via SQL*Net from client
      10 SQL*Net roundtrips to/from client
       0 sorts (memory)
       0 sorts (disk)
   40257 rows processed

We are relying here however on how Oracle actually loads the data on the non-sorted columns, so we can guarantee good clustering on both these columns by simply ordering the data on both columns. Here’s table number 4 with data explicitly sorted on both columns (the values of CODE sub-sorted within the ordering of GRADE):

SQL> create table ziggy4 as select * from ziggy3 order by grade, code;

Table created.

SQL> exec dbms_stats.gather_table_stats(ownname=>null, tabname=> 'ZIGGY4', method_opt=> 'FOR ALL COLUMNS SIZE 1');

PL/SQL procedure successfully completed.

SQL> create index ziggy4_code_i on ziggy4(code);

Index created.

SQL> create index ziggy4_grade_i on ziggy4(grade);

Index created.

SQL> select index_name, clustering_factor, num_rows from user_indexes 

where table_name='ZIGGY4';

INDEX_NAME           CLUSTERING_FACTOR   NUM_ROWS
-------------------- ----------------- ----------
ZIGGY4_CODE_I                    27540    4000000
ZIGGY4_GRADE_I                   17583    4000000

We notice we have a near perfect Clustering Factor on the GRADE column (just 17583) and a “good enough” Clustering Factor on the CODE column (27540).

With 12c Rel 2, we can effectively “fix” the original poorly clustered table online on both columns by adding an appropriate Clustering Attribute to the table (new in 12.1) and performing a subsequent Online table reorg (new in 12.2):

SQL> alter table ziggy add clustering by linear order (grade, code);

Table altered.

SQL> alter table ziggy move online;

Table altered.

SQL> select index_name, clustering_factor, num_rows from user_indexes

where table_name='ZIGGY';

INDEX_NAME           CLUSTERING_FACTOR   NUM_ROWS
-------------------- ----------------- ----------
ZIGGY_CODE_I                     27525    4000000
ZIGGY_GRADE_I                    17578    4000000

We now have the same excellent Clustering Factor values as we had in the previous example.

Depending on data characteristics, you could potentially use the Interleave Clustering Attribute for good enough Clustering Factor values on your multiple columns, rather than perfect clustering on specific columns.

So it is entirely possible to have the necessary data ordering you need for effective data accesses on multiple columns concurrently.

Comments»

1. IUDITH MENTZEL - April 8, 2018: Hello Richard,

This seems very interesting indeed !
As by my understanding, performing a simple ORDER BY specifying the two columns, like in the sample ZIGGY4,
I would expect a good clustering factor by the leading column and a bad (or bad enough) one for the second column, because the values in that second column are in fact not sorted.

As by the Oracle Data Warehouse documentation, this is exactly the difference between the LINEAR and INTERLEAVED clustering orders,
that is, the INTERLEAVED clustering order is able to stored the data
physically clustered by several columns simultaneously,
and therefore such a clustering allows for good performance when querying by any of the clustering attributes, while the linear clustering
only ensures this for the leading (prefix) clustering attributes, in the order in which these were specified.

Could it be that when you order by two columns or add a LINEAR attribute clustering Oracle does however use behind the scenes the same “Z-order curve fitting” clustering technique as it does for the INTERLEAVED clustering ?

Thanks a lot once again for all your excellent posts 🙂

Best Regards,
Iudith Mentzel

LikeLike

Reply
Richard Foote - April 12, 2018: Hi ludith

When you say the second column is not sorted, that’s not quite correct. It is sorted but only within the values of the leading column. So the fewer distinct values you have in the leading column, the better the potential clustering of the second column. In the example above, there are only 100 distinct values for both columns, so that means values for the second column will be well clustered within only 100 different areas with the table, thereby enabling a relatively good clustering factor.

LikeLike

Reply
2. Oracle 19c Automatic Indexing: Poor Data Clustering With Autonomous Databases Part II (Wild Is The Wind) | Richard Foote's Oracle Blog - August 10, 2020: […] of such an SQL were critical for business requirements, there is a way to address this scenario, by re-clustering the data within the table to align itself with the index. Although the re-clustering table operation can now be very easily […]

LikeLike

Reply

	PP on Costing Concatenated Indexes W…
	Zeeshan Moin on The End Of Richard Foote…
	Richard Foote on The End Of Richard Foote…
	PP on The End Of Richard Foote…
	Ram on The Fake Index Trap (“No…
	Richard Foote on The End Of Richard Foote…
	NormanDunbar on The End Of Richard Foote…
	NormanDunbar on The End Of Richard Foote…
	Richard Foote on The End Of Richard Foote…
	Richard Foote on The End Of Richard Foote…
	Richard Foote on The End Of Richard Foote…
	Richard Foote on The End Of Richard Foote…
	Richard Foote on The End Of Richard Foote…
	Richard Foote on The End Of Richard Foote…
	Sayan Malakshinov on The End Of Richard Foote…

Richard Foote's Oracle Blog