jump to navigation

12.1.0.2 Introduction to Attribute Clustering (The Division Bell) August 26, 2014

Posted by Richard Foote in 12c, Attribute Clustering, Clustering Factor, Oracle Indexes.
6 comments

One of the really cool new features introduced in 12.1.0.2 is Attribute Clustering. This new table based attribute allows you to very easily cluster data in close physical proximity based on the content of specific columns.

As I’ve discussed many times, indexes love table data that is physically clustered in a similar manner to the index as it can significantly improve the efficiency of such indexes. A low Clustering Factor (CF) makes an index more viable and is one of the more important considerations in CBO calculations.

But not only database indexes benefit from well cluster data. Other index structures such as Exadata Storage Indexes and the new Zone Maps (to be discussed in future articles) all benefit from well clustered data. Additionally, compression is likely to be much more effective with data that is well clustered and this in turns also impacts the efficiency of In-memory data (again, to be discussed in future articles).

So having the capability to now easily cluster data in regular heap tables has potentially many benefits.

To illustrate, I’m first going to create a table with data that is not well clustered at all. The CODE column has data that is basically evenly distributed throughout the whole table structure:

SQL> create table ziggy (id number, code number, name varchar2(30));

Table created.

SQL> insert into ziggy select rownum, mod(rownum,100), 'DAVID BOWIE' from dual connect by level >= 2000000;

2000000 rows created.

SQL> commit;

Commit complete.

SQL> exec dbms_stats.gather_table_stats(ownname=>user, tabname=>'ZIGGY', estimate_percent=>null,
method_opt=>'FOR ALL COLUMNS SIZE 1');

PL/SQL procedure successfully completed.

I’ll next create an index on this CODE column and check out its default CF:

SQL> create index ziggy_code_i on ziggy(code);

Index created.

SQL> select index_name, clustering_factor, num_rows from user_indexes
where index_name='ZIGGY_CODE_I';

INDEX_NAME           CLUSTERING_FACTOR   NUM_ROWS
-------------------- ----------------- ----------
ZIGGY_CODE_I                    703133    2000000

For a table with 2 million rows, a CF of some 703,133 is very high and the index is going to be very inefficient when retrieving high numbers of rows.

Let’s run a query that returns a specific CODE value, approx. 1% of all the data (note I’ve set a large arraysize to minimize unnecessary fetches and resultant consistent  gets):

SQL> set arraysize 5000

SQL> select * from ziggy where code = 42;

20000 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 2421001569

-----------------------------------------------------------------------------------
| Id  | Operation                 | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |       | 20000 |   390K|   383  (17)| 00:00:01 |
|*  1 |  TABLE ACCESS STORAGE FULL| ZIGGY | 20000 |   390K|   383  (17)| 00:00:01 |
-----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - storage("CODE"=42)
filter("CODE"=42)

Statistics
----------------------------------------------------------
1  recursive calls
0  db block gets
15212  consistent gets
0  physical reads
0  redo size
211208  bytes sent via SQL*Net to client
585  bytes received via SQL*Net from client
5  SQL*Net roundtrips to/from client
0  sorts (memory)
0  sorts (disk)
20000  rows processed

The CBO has chosen a Full Table Scan and has decided to not use the index. If we hint the SQL:

SQL> select /*+ index (ziggy, ziggy_code_i) */ * from ziggy where code = 42;

20000 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 3294205578

----------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |              | 20000 |   390K|  7081   (1)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| ZIGGY        | 20000 |   390K|  7081   (1)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN                  | ZIGGY_CODE_I | 20000 |       |    43   (3)| 00:00:01 |
----------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - access("CODE"=42)

Statistics
----------------------------------------------------------
1  recursive calls
0  db block gets
7081  consistent gets
41  physical reads
0  redo size
511195  bytes sent via SQL*Net to client
585  bytes received via SQL*Net from client
5  SQL*Net roundtrips to/from client
0  sorts (memory)
0  sorts (disk)
20000  rows processed

At a cost of 7081, the index is way more expensive than the 383 cost for the FTS. The poor clustering of the CODE data within the table has made the index non viable.

Let’s now create another table, but this one with a clustering attribute set on the CODE column:

SQL> create table ziggy2 (id number, code number, name varchar2(30))

clustering by linear order (code) without materialized zonemap;

Table created.

The CLUSTERING BY LINEAR ORDER clause orders data in the table based on the specified columns, in this case the CODE column. Up to 10 columns can be included using this particular technique (there are other attribute clustering options which I’ll again cover in later articles, yes I’ll be writing quite a few new articles) 🙂 WITHOUT MATERIALIZED ZONEMAP means I don’t want to create these new Zone Maps index structures at this stage which could potentially reduce the amount of table storage needed to be accessed (again, I’ll discuss these at another time).

You must use a direct path insert to make use of attribute clustering (or reorganize the table as we’ll see).

So lets insert the exact same data into this new ZIGGY2 table via a straight direct path sub-select:

SQL> insert /*+ append */ into ziggy2 select * from ziggy;

2000000 rows created.

SQL> SELECT * FROM TABLE(dbms_xplan.display_cursor);

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
SQL_ID  0arqdyc9vznpg, child number 0
-------------------------------------
insert /*+ append */ into ziggy2 select * from ziggy

Plan hash value: 1975011999

--------------------------------------------------------------------------------------------------
| Id  | Operation                        | Name  | Rows  | Bytes |TempSpc| Cost(%CPU)| Time     |
--------------------------------------------------------------------------------------------------
|   0 | INSERT STATEMENT                 |       |       |       |       | 10596 (100)|          |
|   1 |  LOAD AS SELECT                  |       |       |       |       |            |          |
|   2 |   OPTIMIZER STATISTICS GATHERING |       |  2000K|    38M|       | 10596   (3)| 00:00:01 |
|   3 |    SORT ORDER BY                 |       |  2000K|    38M|    61M| 10596   (3)| 00:00:01 |
|   4 |     TABLE ACCESS STORAGE FULL    | ZIGGY |  2000K|    38M|       |   376  (16)| 00:00:01 |
--------------------------------------------------------------------------------------------------

SQL> commit;

Commit complete.

Notice the SORT ORDER BY step in the insert execution plan. This implicitly sorts the incoming data in CODE order to satisfy the attribute clustering requirement.

If we create an index on this table and examine the CF:

SQL> exec dbms_stats.gather_table_stats(ownname=>user, tabname=>'ZIGGY2',
estimate_percent=>null, method_opt=>'FOR ALL COLUMNS SIZE 1');

PL/SQL procedure successfully completed.

SQL> select index_name, clustering_factor, num_rows
from user_indexes where index_name='ZIGGY2_CODE_I';

INDEX_NAME           CLUSTERING_FACTOR   NUM_ROWS
-------------------- ----------------- ----------
ZIGGY2_CODE_I                     7072    2000000

We notice the default CF is indeed significantly lower at just 7072 than the previous value of 703133.

If we now run the equivalent query as before on this table:

SQL> select * from ziggy2 where code=42;

20000 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 16801974

-----------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name          | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |               | 20000 |   390K|   114   (1)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| ZIGGY2        | 20000 |   390K|   114   (1)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN                  | ZIGGY2_CODE_I | 20000 |       |    43   (3)| 00:00:01 |
-----------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - access("CODE"=42)

Statistics
----------------------------------------------------------
1  recursive calls
0  db block gets
121  consistent gets
41  physical reads
0  redo size
511195  bytes sent via SQL*Net to client
585  bytes received via SQL*Net from client
5  SQL*Net roundtrips to/from client
0  sorts (memory)
0  sorts (disk)
20000  rows processed

We notice the CBO has now decided to use the index. This is due to the cost of the index based execution plan being just 114, significantly lower than the previous index cost of 7081 or the FTS at a cost of 383. Just as importantly, the resultant number of consistent gets has also significantly reduced to just 121, significantly less than the previous 7081 consistent gets when using the index. So the index is indeed much more efficient to use and the CBO costs for this is just about spot on. The end result is that performance has improved.

So how to now likewise improve the performance of the first table? Simple add the attribute clustering and reorganize the table:

SQL> alter table ziggy add clustering by linear order(code) without materialized zonemap;

Table altered.

SQL> alter table ziggy move;

Table altered.

SQL> alter index ziggy_code_i rebuild;

Index altered.

SQL> select index_name, clustering_factor, num_rows from user_indexes where index_name='ZIGGY_CODE_I';

INDEX_NAME      CLUSTERING_FACTOR   NUM_ROWS
--------------- ----------------- ----------
ZIGGY_CODE_I                 7134    2000000

So as expected, the CF has likewise reduced. So if we now run the query:

SQL> select * from ziggy where code=42;

20000 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 3294205578

----------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |              | 20000 |   390K|   115   (1)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| ZIGGY        | 20000 |   390K|   115   (1)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN                  | ZIGGY_CODE_I | 20000 |       |    43   (3)| 00:00:01 |
----------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - access("CODE"=42)

Statistics
----------------------------------------------------------
0  recursive calls
0  db block gets
121  consistent gets
0  physical reads
0  redo size
511195  bytes sent via SQL*Net to client
585  bytes received via SQL*Net from client
5  SQL*Net roundtrips to/from client
0  sorts (memory)
0  sorts (disk)
20000  rows processed

The query likewise uses the index and with far less consistent gets and performance is significantly better.

So attribute clustering provides a nice mechanism by which data in a heap table (or importantly within a partition or sub-partition) can be physically clustered in a manner that can be potentially beneficial in various scenarios. Of course, the decision on how to actually cluster the data and on which columns is somewhat crucial 🙂