Improve Data Clustering on Multiple Columns Concurrently (Two Suns in the Sunset) March 12, 2018
Posted by Richard Foote in 12c, Attribute Clustering, Clustering Factor, Online DDL, Oracle Indexes.trackback
I’ve had a couple of recent discussions around clustering and how if you attempt to improve the clustering of a table based on a column, you thereby ruin the current clustering that might exist for a different column. The common wisdom being you can only order the data one way and if you change the order, you might improve things for one column but totally stuff things up for another.
However, that’s not strictly correct. Depending on the characteristics of your data, you can potentially order (or interleave) data based on multiple columns concurrently. It’s quite possible to have good or good enough clustering on multiple columns and this is extremely important for indexes, as the efficiency of an index can be directly impacted by the clustering of data on the underlining tables.
So to illustrate, I’m going to create a table that initially has terrible clustering on two unrelated columns (code and grade) :
SQL> create table ziggy (id number, code number, grade number, name varchar2(42)); Table created. SQL> insert into ziggy select rownum, mod(rownum, 100)+1, ceil(dbms_random.value(0,100)), 'ZIGGY STARDUST' from dual connect by level commit; Commit complete. SQL> exec dbms_stats.gather_table_stats(ownname=>null, tabname=> 'ZIGGY', method_opt=>'FOR ALL COLUMNS SIZE 1'); PL/SQL procedure successfully completed. SQL> create index ziggy_code_i on ziggy(code); Index created. SQL> create index ziggy_grade_i on ziggy(grade); Index created. SQL> select index_name, clustering_factor, num_rows from user_indexes where table_name='ZIGGY'; INDEX_NAME CLUSTERING_FACTOR NUM_ROWS -------------------- ----------------- ---------- ZIGGY_CODE_I 1748800 4000000 ZIGGY_GRADE_I 1572829 4000000
So with values for both columns distributed all throughout the table, the Clustering Factor of both the CODE and GRADE indexes are both quite poor (values of 1748800 and 1572829 respectively). Even though both columns have 100 distinct values (and so a selectivity of 1%), the CBO will likely consider the indexes too inefficient to use:
SQL> select * from ziggy where code=42; 40000 rows selected. Execution Plan ---------------------------------------------------------- Plan hash value: 2421001569 --------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU) | Time | --------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 40000 | 1054K | 4985 (10) | 00:00:01| | * 1 | TABLE ACCESS FULL | ZIGGY | 40000 | 1054K | 4985 (10) | 00:00:0 | --------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 1 - filter("CODE"=42) Statistics ---------------------------------------------------------- 0 recursive calls 0 db block gets 20292 consistent gets 0 physical reads 0 redo size 1058750 bytes sent via SQL*Net to client 29934 bytes received via SQL*Net from client 2668 SQL*Net roundtrips to/from client 0 sorts (memory) 0 sorts (disk) 40000 rows processed SQL> select * from ziggy where grade=42; 40257 rows selected. Execution Plan ---------------------------------------------------------- Plan hash value: 2421001569 --------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | --------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 40000 | 1054K | 5021 (10) | 00:00:01 | | * 1 | TABLE ACCESS FULL | ZIGGY | 40000 | 1054K | 5021 (10) | 00:00:01 | --------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 1 - filter("GRADE"=42) Statistics ---------------------------------------------------------- 0 recursive calls 0 db block gets 20307 consistent gets 0 physical reads 0 redo size 1065641 bytes sent via SQL*Net to client 30121 bytes received via SQL*Net from client 2685 SQL*Net roundtrips to/from client 0 sorts (memory) 0 sorts (disk) 40257 rows processed
So even though the CBO has got the row estimates just about spot on, in both cases a Full Table Scan was chosen.
Let’s create another table based on the table above but this time order the data in CODE column order:
SQL> create table ziggy2 as select * from ziggy order by code; Table created. SQL> exec dbms_stats.gather_table_stats(ownname=>null, tabname=> 'ZIGGY2', method_opt=> 'FOR ALL COLUMNS SIZE 1'); PL/SQL procedure successfully completed. SQL> create index ziggy2_code_i on ziggy2(code); Index created. SQL> create index ziggy2_grade_i on ziggy2(grade); Index created. SQL> select index_name, clustering_factor, num_rows from user_indexes where table_name='ZIGGY2'; INDEX_NAME CLUSTERING_FACTOR NUM_ROWS -------------------- ----------------- ---------- ZIGGY2_CODE_I 17561 4000000 ZIGGY2_GRADE_I 1577809 4000000
We can see that by doing so, we have significantly reduced the Clustering Factor of the CODE index (down from 1748800 to just 17561) . The GRADE index though has changed little as there’s little co-relation between the CODE and GRADE columns.
If we now run the same query with the CODE based predicate:
SQL> select * from ziggy2 where code=42; 40000 rows selected. Execution Plan ---------------------------------------------------------- Plan hash value: 16801974 ----------------------------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU) | Time | ----------------------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 40000 | 1054K | 264 (4) | 00:00:01 | | 1 | TABLE ACCESS BY INDEX ROWID BATCHED | ZIGGY2 | 40000 | 1054K | 264 (4) | 00:00:01 | |* 2 | INDEX RANGE SCAN | ZIGGY2_CODE_I | 40000 | | 84 (5) | 00:00:01 | ----------------------------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 2 - access("CODE"=42) Statistics ---------------------------------------------------------- 0 recursive calls 0 db block gets 273 consistent gets 0 physical reads 0 redo size 1272038 bytes sent via SQL*Net to client 685 bytes received via SQL*Net from client 9 SQL*Net roundtrips to/from client 0 sorts (memory) 0 sorts (disk) 40000 rows processed
The CBO has not only used the index, but the query is much more efficient as a result, with just 273 consistent gets required to retrieve 40000 rows.
However the query based on the GRADE predicate still uses a FTS:
SQL> select * from ziggy2 where grade=42; 40257 rows selected. Execution Plan ---------------------------------------------------------- Plan hash value: 1810052534 ---------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU) | Time | ---------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 40000 | 1054K | 4920 (10) | 00:00:01 | |* 1 | TABLE ACCESS FULL | ZIGGY2 | 40000 | 1054K | 4920 (10) | 00:00:01 | ---------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 1 - filter("GRADE"=42) Statistics ---------------------------------------------------------- 0 recursive calls 11 db block gets 17602 consistent gets 0 physical reads 0 redo size 434947 bytes sent via SQL*Net to client 696 bytes received via SQL*Net from client 10 SQL*Net roundtrips to/from client 0 sorts (memory) 0 sorts (disk) 40257 rows processed
Now if we decide that actually the query based on GRADE is far more important to the business, we could of course reorder the data again. The following is yet another table, this time based on the CODE sorted ZIGGY2 table, but inserted in GRADE column order:
SQL> create table ziggy3 as select * from ziggy2 order by grade; Table created. SQL> exec dbms_stats.gather_table_stats(ownname=>null, tabname=> 'ZIGGY3', method_opt=> 'FOR ALL COLUMNS SIZE 1'); PL/SQL procedure successfully completed. SQL> create index ziggy3_code_i on ziggy3(code); Index created. SQL> create index ziggy3_grade_i on ziggy3(grade); Index created. SQL> select index_name, clustering_factor, num_rows from user_indexes where table_name='ZIGGY3'; INDEX_NAME CLUSTERING_FACTOR NUM_ROWS -------------------- ----------------- ---------- ZIGGY3_CODE_I 30231 4000000 ZIGGY3_GRADE_I 17582 4000000
We notice we now have an excellent, very low Clustering Factor for the GRADE index (down to just 17582). But notice also the Clustering Factor for CODE. Although it has increased from 17561 to 30231, it’s nowhere near as bad as it was initially when is was a massive 1748800.
The point being that with the data already ordered on CODE, Oracle inserting the data in GRADE order effectively had the data already sub-ordered on CODE. So we end up with perfect clustering on the GRADE column and “good enough” clustering on CODE as well.
If we now run the same queries again:
SQL> select * from ziggy3 where code=42; 40000 rows selected. Execution Plan ---------------------------------------------------------- Plan hash value: 1004048030 ----------------------------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU) | Time | ----------------------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 40000 | 1054K | 392 (3) | 00:00:01 | | 1 | TABLE ACCESS BY INDEX ROWID BATCHED | ZIGGY3 | 40000 | 1054K | 392 (3) | 00:00:01 | |* 2 | INDEX RANGE SCAN | ZIGGY3_CODE_I | 40000 | | 84 (5) | 00:00:01 | ----------------------------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 2 - access("CODE"=42) Statistics ---------------------------------------------------------- 0 recursive calls 0 db block gets 401 consistent gets 0 physical reads 0 redo size 1272038 bytes sent via SQL*Net to client 685 bytes received via SQL*Net from client 9 SQL*Net roundtrips to/from client 0 sorts (memory) 0 sorts (disk) 40000 rows processed
With the CODE based query, the CBO still uses the index and performance is still quite good with consistent gets having gone up a tad (401 up from 273). However, we now have the scenario where the GRADE based query is also efficient with the index access also selected by the CBO:
SQL> select * from ziggy3 where grade=42; 40257 rows selected. Execution Plan ---------------------------------------------------------- Plan hash value: 844233985 ------------------------------------------------------------------------------------------------------ | Id | Operation | Name | Rows | Bytes | Cost (%CPU) | Time | ------------------------------------------------------------------------------------------------------ | 0 | SELECT STATEMENT | | 40000 | 1054K | 264 (4) | 00:00:01 | | 1 | TABLE ACCESS BY INDEX ROWID BATCHED | ZIGGY3 | 40000 | 1054K | 264 (4) | 00:00:01 | |* 2 | INDEX RANGE SCAN | ZIGGY3_GRADE_I | 40000 | | 84 (5) | 00:00:01 | ------------------------------------------------------------------------------------------------------ Predicate Information (identified by operation id): --------------------------------------------------- 2 - access("GRADE"=42) Statistics ---------------------------------------------------------- 0 recursive calls 0 db block gets 278 consistent gets 0 physical reads 0 redo size 1280037 bytes sent via SQL*Net to client 696 bytes received via SQL*Net from client 10 SQL*Net roundtrips to/from client 0 sorts (memory) 0 sorts (disk) 40257 rows processed
We are relying here however on how Oracle actually loads the data on the non-sorted columns, so we can guarantee good clustering on both these columns by simply ordering the data on both columns. Here’s table number 4 with data explicitly sorted on both columns (the values of CODE sub-sorted within the ordering of GRADE):
SQL> create table ziggy4 as select * from ziggy3 order by grade, code; Table created. SQL> exec dbms_stats.gather_table_stats(ownname=>null, tabname=> 'ZIGGY4', method_opt=> 'FOR ALL COLUMNS SIZE 1'); PL/SQL procedure successfully completed. SQL> create index ziggy4_code_i on ziggy4(code); Index created. SQL> create index ziggy4_grade_i on ziggy4(grade); Index created. SQL> select index_name, clustering_factor, num_rows from user_indexes where table_name='ZIGGY4'; INDEX_NAME CLUSTERING_FACTOR NUM_ROWS -------------------- ----------------- ---------- ZIGGY4_CODE_I 27540 4000000 ZIGGY4_GRADE_I 17583 4000000
We notice we have a near perfect Clustering Factor on the GRADE column (just 17583) and a “good enough” Clustering Factor on the CODE column (27540).
With 12c Rel 2, we can effectively “fix” the original poorly clustered table online on both columns by adding an appropriate Clustering Attribute to the table (new in 12.1) and performing a subsequent Online table reorg (new in 12.2):
SQL> alter table ziggy add clustering by linear order (grade, code); Table altered. SQL> alter table ziggy move online; Table altered. SQL> select index_name, clustering_factor, num_rows from user_indexes where table_name='ZIGGY'; INDEX_NAME CLUSTERING_FACTOR NUM_ROWS -------------------- ----------------- ---------- ZIGGY_CODE_I 27525 4000000 ZIGGY_GRADE_I 17578 4000000
We now have the same excellent Clustering Factor values as we had in the previous example.
Depending on data characteristics, you could potentially use the Interleave Clustering Attribute for good enough Clustering Factor values on your multiple columns, rather than perfect clustering on specific columns.
So it is entirely possible to have the necessary data ordering you need for effective data accesses on multiple columns concurrently.
Hello Richard,
This seems very interesting indeed !
As by my understanding, performing a simple ORDER BY specifying the two columns, like in the sample ZIGGY4,
I would expect a good clustering factor by the leading column and a bad (or bad enough) one for the second column, because the values in that second column are in fact not sorted.
As by the Oracle Data Warehouse documentation, this is exactly the difference between the LINEAR and INTERLEAVED clustering orders,
that is, the INTERLEAVED clustering order is able to stored the data
physically clustered by several columns simultaneously,
and therefore such a clustering allows for good performance when querying by any of the clustering attributes, while the linear clustering
only ensures this for the leading (prefix) clustering attributes, in the order in which these were specified.
Could it be that when you order by two columns or add a LINEAR attribute clustering Oracle does however use behind the scenes the same “Z-order curve fitting” clustering technique as it does for the INTERLEAVED clustering ?
Thanks a lot once again for all your excellent posts 🙂
Best Regards,
Iudith Mentzel
LikeLike
Hi ludith
When you say the second column is not sorted, that’s not quite correct. It is sorted but only within the values of the leading column. So the fewer distinct values you have in the leading column, the better the potential clustering of the second column. In the example above, there are only 100 distinct values for both columns, so that means values for the second column will be well clustered within only 100 different areas with the table, thereby enabling a relatively good clustering factor.
LikeLike
[…] of such an SQL were critical for business requirements, there is a way to address this scenario, by re-clustering the data within the table to align itself with the index. Although the re-clustering table operation can now be very easily […]
LikeLike