jump to navigation

Indexing A Column With Just One Distinct Value (All The Madmen) August 10, 2011

Posted by Richard Foote in CBO, Oracle Indexes.
trackback

When one thinks about a column that might benefit from being indexed, one generally considers columns with lots of different values so that the selectivity of the column is such that relatively few rows get selected, making the index appealing to the cost based optimizer.
 
There are of course many exceptions to this generalisation and there are times when an index is the perfect and most efficient method when selecting 100% of all data, even if it involves a normal index range scan reading every row out of the table.
 
There are also times when a column that has very few distinct values should be indexed, even with a standard B-tree index.
 
In this example, I’ll index a column that only has just the 1 single value and the CBO will choose to use the index gladly.
 
Just going to create my standard little table and create an index on the CODE column:
 

 
SQL> create table bowie (id number, code number, name varchar2(50));
 
Table created.
 
SQL> create index bowie_code_i on bowie(code);
 
Index created.

 

 

I’ll now load the table with 1 million rows but all the CODE column will remain unpopulated and consist only of NULLS:
 

 
SQL> insert into bowie select rownum, null, 'Ziggy Stardust and the Spiders From Mars' from dual connect by level <= 1000000;
 
1000000 rows created.
 
SQL> commit;
 
Commit complete.

 

 

I’ll only populate one in every 10,000 rows that a CODE value of 42 (of course):
 

 
SQL> update bowie set code = 42 where mod(id,10000) = 0;
 
100 rows updated.
 
SQL> commit;
 
Commit complete.

 

 

Let’s collect accurate statistics, however note I’m not collecting histograms even though there are relatively few rows that have a CODE value of 42:
 

 
SQL> exec dbms_stats.gather_table_stats(ownname=>null, tabname=>'BOWIE', cascade=> true, estimate_percent=>null, method_opt=>'FOR ALL COLUMNS SIZE 1');
 
PL/SQL procedure successfully completed.

 

 

Indeed, the column only has the one value (42) which occurs relatively infrequently in my data, most rows however remain as NULL:
 

 
SQL> select code, count(*) from bowie group by code;
 
      CODE   COUNT(*)
---------- ----------
               999900
        42        100

 

 

Note that the index statistics clearly shows it only has the 1 distinct value. Remember, NULLS are not indexed by default in B-Tree indexes and as a result the index is tiny with just the one leaf block:
 

 
SQL> select blevel, leaf_blocks, distinct_keys from dba_indexes where index_name='BOWIE_CODE_I';
 
    BLEVEL LEAF_BLOCKS DISTINCT_KEYS
---------- ----------- -------------
         0           1             1

 

 

Note also that the column statistics clearly highlight there’s just the one distinct value, however it also records that there are many rows (999900) that are NULL:

 
 

 
SQL> select column_name, num_distinct, num_nulls from dba_tab_columns where table_name = 'BOWIE' and column_name = 'CODE';
 
COLUMN_NAME  NUM_DISTINCT  NUM_NULLS
------------ ------------ ----------
CODE                    1     999900

 

 

Therefore, Oracle with accurate statistics and without requiring any histograms has all the information it needs to know when selecting rows that contain the one and only distinct value of our CODE column, that it will only actually be selecting 100 rows out of the 1 million rows in the table.

 
 

 
SQL> select * from bowie where code = 42;
 
100 rows selected.
 

Execution Plan
----------------------------------------------------------
Plan hash value: 1602289932
 
--------------------------------------------------------------------------------------------
| Id  | Operation                   | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |              |   100 |  4700 |   101   (0)| 00:00:02 |
|   1 |  TABLE ACCESS BY INDEX ROWID| BOWIE        |   100 |  4700 |   101   (0)| 00:00:02 |
|*  2 |   INDEX RANGE SCAN          | BOWIE_CODE_I |   100 |       |     1   (0)| 00:00:01 |
--------------------------------------------------------------------------------------------
 

Predicate Information (identified by operation id):
---------------------------------------------------
 
   2 - access("CODE"=42)
 

Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
        102  consistent gets
          0  physical reads
          0  redo size
       1423  bytes sent via SQL*Net to client
        395  bytes received via SQL*Net from client
          2  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
        100  rows processed

 

 

The CBO has the rows estimate/cardinality spot on (100) and has decided that the index is indeed the most efficient access path to select the 100 rows of interest.

With flag and bollean type columns, it might be worth some consideration simply storing the one value when one value is relatively rare, such that the resultant index can take advantage of not having to store any of the corresponding NULL values.
 
Even columns with just the one distinct value can potentially benefit from being indexed …

Comments»

1. oracledisect - August 11, 2011

Good post! As usual…

However, rises the question about rareness or when the proportion of non-null one distinct value versus null population makes a good candidate, I mean a cuantitative criteria.

Like

Richard Foote - August 23, 2011

It all comes down as always to the relative costs associated with accessing the rows via the index (so the clustering factor as well as the overall number of rows is critical) vs. the cost of other alternatives (such as a full table scan).

Like

2. Devesh Shastri - August 11, 2011

Excellent Post.

Like

Richard Foote - August 23, 2011

Thanks Devesh 🙂

Like

3. Brian Tkatch - August 11, 2011

“With flag and bollean type columns, it might be worth some”

Personally, i despise boolean COLUMNs. In most cases, a “flag” should really be status, and only causes issues later on whena second COLUMN is ADDed to makeup for it.

Though, the idea of using NULL for “no data” and having the unstored advantage is well said.

Like

Richard Foote - August 23, 2011

Hi Brian

I agree with your views of boolean and its dangers.

Like

4. Log Buffer #233, A Carnival of the Vanities for DBAs | The Pythian Blog - August 12, 2011

[…] Foote and Indexes go hand in hand. Here is another sparkling post from […]

Like

5. fabio - August 19, 2011

Nice topic! Good job!

Like

Richard Foote - August 23, 2011

Thanks Fabio 🙂

Like

6. Jonathan Lewis - August 22, 2011

This, of course, is one of the great benefits of function-based indexes (when you can also modify the SQL) – you can take the multi-valued status column from Brian Tkatch and create as many function-based indexes on it as you like – one for each of the “rare” values – defining each index to expose the rows in a given status.

Like

Richard Foote - August 23, 2011

Hi Jonathan

Exactly. And the advanages of partitioning when you can’t modify the SQL.

Like


Leave a comment