jump to navigation

12.1.0.2 Introduction to Attribute Clustering (The Division Bell) August 26, 2014

Posted by Richard Foote in 12c, Attribute Clustering, Clustering Factor, Oracle Indexes.
trackback

One of the really cool new features introduced in 12.1.0.2 is Attribute Clustering. This new table based attribute allows you to very easily cluster data in close physical proximity based on the content of specific columns.

As I’ve discussed many times, indexes love table data that is physically clustered in a similar manner to the index as it can significantly improve the efficiency of such indexes. A low Clustering Factor (CF) makes an index more viable and is one of the more important considerations in CBO calculations.

But not only database indexes benefit from well cluster data. Other index structures such as Exadata Storage Indexes and the new Zone Maps (to be discussed in future articles) all benefit from well clustered data. Additionally, compression is likely to be much more effective with data that is well clustered and this in turns also impacts the efficiency of In-memory data (again, to be discussed in future articles).

So having the capability to now easily cluster data in regular heap tables has potentially many benefits.

To illustrate, I’m first going to create a table with data that is not well clustered at all. The CODE column has data that is basically evenly distributed throughout the whole table structure:

SQL> create table ziggy (id number, code number, name varchar2(30));

Table created.

SQL> insert into ziggy select rownum, mod(rownum,100), 'DAVID BOWIE' from dual connect by level >= 2000000;

2000000 rows created.

SQL> commit;

Commit complete.

SQL> exec dbms_stats.gather_table_stats(ownname=>user, tabname=>'ZIGGY', estimate_percent=>null,
method_opt=>'FOR ALL COLUMNS SIZE 1');

PL/SQL procedure successfully completed.

I’ll next create an index on this CODE column and check out its default CF:

SQL> create index ziggy_code_i on ziggy(code);

Index created.

SQL> select index_name, clustering_factor, num_rows from user_indexes
where index_name='ZIGGY_CODE_I';

INDEX_NAME           CLUSTERING_FACTOR   NUM_ROWS
-------------------- ----------------- ----------
ZIGGY_CODE_I                    703133    2000000

For a table with 2 million rows, a CF of some 703,133 is very high and the index is going to be very inefficient when retrieving high numbers of rows.

Let’s run a query that returns a specific CODE value, approx. 1% of all the data (note I’ve set a large arraysize to minimize unnecessary fetches and resultant consistent  gets):

SQL> set arraysize 5000

SQL> select * from ziggy where code = 42;

20000 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 2421001569

-----------------------------------------------------------------------------------
| Id  | Operation                 | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |       | 20000 |   390K|   383  (17)| 00:00:01 |
|*  1 |  TABLE ACCESS STORAGE FULL| ZIGGY | 20000 |   390K|   383  (17)| 00:00:01 |
-----------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - storage("CODE"=42)
filter("CODE"=42)

Statistics
----------------------------------------------------------
1  recursive calls
0  db block gets
15212  consistent gets
0  physical reads
0  redo size
211208  bytes sent via SQL*Net to client
585  bytes received via SQL*Net from client
5  SQL*Net roundtrips to/from client
0  sorts (memory)
0  sorts (disk)
20000  rows processed

The CBO has chosen a Full Table Scan and has decided to not use the index. If we hint the SQL:

SQL> select /*+ index (ziggy, ziggy_code_i) */ * from ziggy where code = 42;

20000 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 3294205578

----------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |              | 20000 |   390K|  7081   (1)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| ZIGGY        | 20000 |   390K|  7081   (1)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN                  | ZIGGY_CODE_I | 20000 |       |    43   (3)| 00:00:01 |
----------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - access("CODE"=42)

Statistics
----------------------------------------------------------
1  recursive calls
0  db block gets
7081  consistent gets
41  physical reads
0  redo size
511195  bytes sent via SQL*Net to client
585  bytes received via SQL*Net from client
5  SQL*Net roundtrips to/from client
0  sorts (memory)
0  sorts (disk)
20000  rows processed

At a cost of 7081, the index is way more expensive than the 383 cost for the FTS. The poor clustering of the CODE data within the table has made the index non viable.

Let’s now create another table, but this one with a clustering attribute set on the CODE column:

SQL> create table ziggy2 (id number, code number, name varchar2(30))

clustering by linear order (code) without materialized zonemap;

Table created.

The CLUSTERING BY LINEAR ORDER clause orders data in the table based on the specified columns, in this case the CODE column. Up to 10 columns can be included using this particular technique (there are other attribute clustering options which I’ll again cover in later articles, yes I’ll be writing quite a few new articles) 🙂 WITHOUT MATERIALIZED ZONEMAP means I don’t want to create these new Zone Maps index structures at this stage which could potentially reduce the amount of table storage needed to be accessed (again, I’ll discuss these at another time).

You must use a direct path insert to make use of attribute clustering (or reorganize the table as we’ll see).

So lets insert the exact same data into this new ZIGGY2 table via a straight direct path sub-select:

SQL> insert /*+ append */ into ziggy2 select * from ziggy;

2000000 rows created.

SQL> SELECT * FROM TABLE(dbms_xplan.display_cursor);

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------------------
SQL_ID  0arqdyc9vznpg, child number 0
-------------------------------------
insert /*+ append */ into ziggy2 select * from ziggy

Plan hash value: 1975011999

--------------------------------------------------------------------------------------------------
| Id  | Operation                        | Name  | Rows  | Bytes |TempSpc| Cost(%CPU)| Time     |
--------------------------------------------------------------------------------------------------
|   0 | INSERT STATEMENT                 |       |       |       |       | 10596 (100)|          |
|   1 |  LOAD AS SELECT                  |       |       |       |       |            |          |
|   2 |   OPTIMIZER STATISTICS GATHERING |       |  2000K|    38M|       | 10596   (3)| 00:00:01 |
|   3 |    SORT ORDER BY                 |       |  2000K|    38M|    61M| 10596   (3)| 00:00:01 |
|   4 |     TABLE ACCESS STORAGE FULL    | ZIGGY |  2000K|    38M|       |   376  (16)| 00:00:01 |
--------------------------------------------------------------------------------------------------

SQL> commit;

Commit complete.

Notice the SORT ORDER BY step in the insert execution plan. This implicitly sorts the incoming data in CODE order to satisfy the attribute clustering requirement.

If we create an index on this table and examine the CF:

SQL> exec dbms_stats.gather_table_stats(ownname=>user, tabname=>'ZIGGY2',
estimate_percent=>null, method_opt=>'FOR ALL COLUMNS SIZE 1');

PL/SQL procedure successfully completed.

SQL> select index_name, clustering_factor, num_rows
from user_indexes where index_name='ZIGGY2_CODE_I';

INDEX_NAME           CLUSTERING_FACTOR   NUM_ROWS
-------------------- ----------------- ----------
ZIGGY2_CODE_I                     7072    2000000

We notice the default CF is indeed significantly lower at just 7072 than the previous value of 703133.

If we now run the equivalent query as before on this table:

SQL> select * from ziggy2 where code=42;

20000 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 16801974

-----------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name          | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |               | 20000 |   390K|   114   (1)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| ZIGGY2        | 20000 |   390K|   114   (1)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN                  | ZIGGY2_CODE_I | 20000 |       |    43   (3)| 00:00:01 |
-----------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - access("CODE"=42)

Statistics
----------------------------------------------------------
1  recursive calls
0  db block gets
121  consistent gets
41  physical reads
0  redo size
511195  bytes sent via SQL*Net to client
585  bytes received via SQL*Net from client
5  SQL*Net roundtrips to/from client
0  sorts (memory)
0  sorts (disk)
20000  rows processed

We notice the CBO has now decided to use the index. This is due to the cost of the index based execution plan being just 114, significantly lower than the previous index cost of 7081 or the FTS at a cost of 383. Just as importantly, the resultant number of consistent gets has also significantly reduced to just 121, significantly less than the previous 7081 consistent gets when using the index. So the index is indeed much more efficient to use and the CBO costs for this is just about spot on. The end result is that performance has improved.

So how to now likewise improve the performance of the first table? Simple add the attribute clustering and reorganize the table:

SQL> alter table ziggy add clustering by linear order(code) without materialized zonemap;

Table altered.

SQL> alter table ziggy move;

Table altered.

SQL> alter index ziggy_code_i rebuild;

Index altered.

SQL> select index_name, clustering_factor, num_rows from user_indexes where index_name='ZIGGY_CODE_I';

INDEX_NAME      CLUSTERING_FACTOR   NUM_ROWS
--------------- ----------------- ----------
ZIGGY_CODE_I                 7134    2000000

So as expected, the CF has likewise reduced. So if we now run the query:

SQL> select * from ziggy where code=42;

20000 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 3294205578

----------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |              | 20000 |   390K|   115   (1)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| ZIGGY        | 20000 |   390K|   115   (1)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN                  | ZIGGY_CODE_I | 20000 |       |    43   (3)| 00:00:01 |
----------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - access("CODE"=42)

Statistics
----------------------------------------------------------
0  recursive calls
0  db block gets
121  consistent gets
0  physical reads
0  redo size
511195  bytes sent via SQL*Net to client
585  bytes received via SQL*Net from client
5  SQL*Net roundtrips to/from client
0  sorts (memory)
0  sorts (disk)
20000  rows processed

The query likewise uses the index and with far less consistent gets and performance is significantly better.

So attribute clustering provides a nice mechanism by which data in a heap table (or importantly within a partition or sub-partition) can be physically clustered in a manner that can be potentially beneficial in various scenarios. Of course, the decision on how to actually cluster the data and on which columns is somewhat crucial 🙂

Comments»

1. Hemant K Chitale - August 26, 2014

New INSERTs on an Attribute Clustered table would work similar to INSERTs on an Index Organized Table ? Here you’ve show the clustering attribute and the index being the same column. What happens with multiple / additional indexes ? The same as with secondary indexes in an IOT ?

Like

Stefan Koehler - August 28, 2014

Hi Hemant,
unfortunately not (“Attribute clustering is ignored for conventional DML”) or you are using Direct Path Inserts all the time. However this is not the common use case in OLTP environments.

Attribute clustered tables and its supported operations is described here: http://docs.oracle.com/database/121/ADMIN/tables.htm#ADMIN13999

Regards
Stefan

Like

Richard Foote - September 3, 2014

That’s right, this doesn’t work for conventional inserts and is why I said in the piece “You must use a direct path insert to make use of attribute clustering or reorganize the table “.

Like

2. connormcdonald - October 31, 2014

I’m confused. It doesnt look to me like I get anything more than a “order by” for free….and since I’m doing direct mode anyway, I’m guessing (most of the time) I have control over the underlying insert statement, so I’d just add the order by myself…

Am I missing something ?

Like

Richard Foote - October 31, 2014

Hi Connor

A big difference is that it’s now an attribute of the table so every time you say Move the table as part of a re-org (remembering DMLs are not clustered and that you can move partitions online), the clustering attribute is implicitly enforced. Additionally, some of the interleaved clustering capabilities might not be so easy to manually implement and subsequently keep in check.

Go Crystal Palace 🙂

Like

3. Oracle 19c Automatic Indexing: Poor Data Clustering With Autonomous Databases Part I (Don’t Look Down) | Richard Foote's Oracle Blog - August 6, 2020

[…] discussed many times the importance of data clustering in relation to the efficiency of indexes. With respect to the efficiency of Automatic Indexes […]

Like


Leave a comment