
METHOD_OPT=> SIZE AUTO Quiz Solution (The Trickster) September 1, 2011

Posted by Richard Foote in CBO, Histograms, Oracle Indexes, Oracle Statistics.

I was going to leave it for a few days but there have already been so many comments and discussions on all this, I thought I better write something up. In case anyone was wondering, yes I probably am driving my colleagues at work mad with my “Question of the Day” !!

Unfortunately, some might be disappointed in both Oracle and myself 🙂

Yes, I did kinda set things up to trick the unwary and yes, perhaps the answer isn’t what many are expecting.

The answer to my previous question of which column is going to have a histogram when using the METHOD_OPT SIZE AUTO option is in fact Column 2. Well done to everyone who got it right.

Why ?

The simplest answer is that it’s the only column of the three that has 254 or fewer distinct values.

Here’s the key point. When using METHOD_OPT SIZE AUTO, every column with 254 or fewer distinct values that has been referenced within a predicate will have a Frequency-based histogram. Each and every one of them, regardless of whether the data is actually skewed or not. So Column 2, with only 254 distinct values AND having previously been referenced in a predicate, was guaranteed to have a histogram.
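
If you want to check for yourself which columns Oracle has recorded as being used in predicates (the workload information that feeds the SIZE AUTO decision), the underlying SYS.COL_USAGE$ table can be queried directly. A rough sketch only, assuming access to the SYS-owned dictionary tables, that INTCOL# lines up with COLUMN_ID (true for a simple table such as the BOWIE demo table below), and that the monitoring info has already been flushed (DBMS_STATS.FLUSH_DATABASE_MONITORING_INFO will force it):

SQL> select o.object_name, c.column_name, u.equality_preds, u.range_preds from sys.col_usage$ u, dba_objects o, dba_tab_columns c where u.obj# = o.object_id and c.owner = o.owner and c.table_name = o.object_name and c.column_id = u.intcol# and o.object_name = 'BOWIE';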

If a column has more than 254 distinct values, whether it then has a Height-based histogram depends on how the data is skewed. If the data is perfectly evenly distributed, then it won’t have a histogram. Column 1, having sequence-based unique values, will not meet the criteria and so will not have a histogram.

Column 3 is interesting. Having inserted the outlier value, it now has 255 distinct values and so no longer qualifies for an automatic Frequency-based histogram. However, if all its values are evenly distributed, then it won’t qualify for a Height-based histogram either, and Column 3 has just the one outlier value, with all its other values evenly distributed. Unfortunately, Oracle doesn’t pick up on rare outlier values (even if you collect 100% statistics and it’s one of the low/high points of the column) and so will not generate a Height-based histogram.

The only column that qualifies is Column 2.

A demo to illustrate. First, let’s create and populate our table:

SQL> create table bowie (id number, code1 number, code2 number);

Table created.

SQL> insert into bowie select rownum, mod(rownum,254), mod(rownum,254) from dual  connect by level <= 1000000;

1000000 rows created.

SQL> commit;

Commit complete.

Notice I’m using a MOD function to generate a perfectly even distribution of data. I’ve noticed a few examples (such as that by Charles Hooper in the comments of the Quiz posting) in which the DBMS_RANDOM function is used instead. Note this will almost certainly generate data with enough natural skewness on a 1M row table with 254 random values that, when the outlier 255th value is introduced, it will qualify for a Height-based histogram. It’s a very easy thing to test and find out: simply generate the 1M rows with 255 random values and I suggest a Height-based histogram will be created regardless.
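
If you want to test that claim for yourself, a minimal sketch along these lines should do it (the ZIGGY table and its single CODE column are purely my own invention for this little test, not part of the demo that follows):

SQL> create table ziggy (id number, code number);
SQL> insert into ziggy select rownum, trunc(dbms_random.value(0,255)) from dual connect by level <= 1000000;
SQL> commit;
SQL> select * from ziggy where code = 42;
SQL> exec dbms_stats.gather_table_stats(ownname=>user, tabname=>'ZIGGY', estimate_percent=>null, method_opt=>'FOR ALL COLUMNS SIZE AUTO');
SQL> select column_name, histogram from user_tab_columns where table_name = 'ZIGGY';

I would expect the CODE column to come back with a HEIGHT BALANCED histogram on most runs, simply because random data is never perfectly even, but by all means run it and see.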

OK, I’ll run some SQL to generate sufficient workload to qualify the columns for automatic histograms:

SQL> select * from bowie where id = 42;
SQL> select * from bowie where code1 = 42;
SQL> select * from bowie where code2 = 42;

BTW, the difference between the SIZE AUTO and SIZE SKEWONLY options is that AUTO requires previous workload to suggest a histogram might be relevant, whereas SKEWONLY does not.
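
For reference, a SKEWONLY gather looks something like this (just a sketch against the demo BOWIE table, which could be run without having issued any of the above queries first):

SQL> exec dbms_stats.gather_table_stats(ownname=>'BOWIE', tabname=>'BOWIE', estimate_percent=>null, cascade=>true, method_opt=>'FOR ALL COLUMNS SIZE SKEWONLY');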

If we were to collect statistics at this stage, we would notice that the second and third columns both have a Frequency-Based histogram as both columns only have 254 distinct values and so automatically qualify:

SQL> exec dbms_stats.gather_table_stats(ownname=>'BOWIE', tabname=>'BOWIE', estimate_percent=> null, cascade=>true);

PL/SQL procedure successfully completed.

SQL> select column_name, histogram from dba_tab_columns where table_name = 'BOWIE';

COLUMN_NAME                    HISTOGRAM
------------------------------ ---------------
ID                             NONE
CODE1                          FREQUENCY
CODE2                          FREQUENCY
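
If you want to look at the frequency histogram itself, the buckets are listed in DBA_TAB_HISTOGRAMS, with one endpoint per distinct value for a Frequency-based histogram, so the following count should come back as 254:

SQL> select count(*) from dba_tab_histograms where table_name = 'BOWIE' and column_name = 'CODE2';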

If we were to run a query using the third column, notice how the cardinality estimates aren’t too bad in this example:

SQL> select * from bowie where code2 > 600;

no rows selected

Execution Plan
----------------------------------------------------------
Plan hash value: 1845943507

---------------------------------------------------------------------------
| Id  | Operation         | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |       |     1 |    13 |   660   (2)| 00:00:08 |
|*  1 |  TABLE ACCESS FULL| BOWIE |     1 |    13 |   660   (2)| 00:00:08 |
---------------------------------------------------------------------------

There are no rows with a CODE2 value greater than 600, so an estimate of 1 row isn’t too bad at all.

OK, let’s add in this one, tiny little row and collect fresh, 100% accurate statistics (note: the accurate statistics are very important, as Niall’s examples have demonstrated):

SQL> insert into bowie values (1000001, 42, 99999999);

1 row created.

SQL> commit;

Commit complete.

SQL> exec dbms_stats.gather_table_stats(ownname=>'BOWIE', tabname=>'BOWIE', estimate_percent=> null, cascade=>true);

PL/SQL procedure successfully completed.

SQL> select column_name, histogram from dba_tab_columns where table_name = 'BOWIE';

COLUMN_NAME                    HISTOGRAM
------------------------------ ---------------
ID                             NONE
CODE1                          FREQUENCY
CODE2                          NONE

Note that the third column now has 255 distinct values and so no longer qualifies for the automatic Frequency-Based histogram. As most of its data is perfectly evenly distributed with just the one outlier value, the column doesn’t qualify for a Height-based histogram either and so now has no histogram at all.
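
You can confirm the distinct value counts straight from the dictionary, with NUM_DISTINCT for the CODE2 column now showing 255:

SQL> select column_name, num_distinct from dba_tab_columns where table_name = 'BOWIE';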

Note that as I collected 100% accurate statistics, Oracle is definitely aware of this outlier value:

SQL> select column_name, low_value, high_value from dba_tab_columns where table_name='BOWIE' and column_name='CODE2';

COLUMN_NAME  LOW_VALUE  HIGH_VALUE
------------ ---------- ------------
CODE2        80         C464646464

SQL> var high_num number
SQL> exec dbms_stats.convert_raw_value('C464646464',:high_num);

PL/SQL procedure successfully completed.

SQL> print high_num

  HIGH_NUM
----------
  99999999
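
For completeness, the LOW_VALUE of 80 decodes in exactly the same manner and should come back as 0, the lowest CODE2 value:

SQL> var low_num number
SQL> exec dbms_stats.convert_raw_value('80',:low_num);
SQL> print low_num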

But it’s not enough for Oracle to automatically generate a histogram. Which is a shame really, because now we can have all sorts of problems:

SQL> select * from bowie where code2 > 600;

Execution Plan
----------------------------------------------------------
Plan hash value: 1845943507

---------------------------------------------------------------------------
| Id  | Operation         | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |       |   999K|    12M|   660   (2)| 00:00:08 |
|*  1 |  TABLE ACCESS FULL| BOWIE |   999K|    12M|   660   (2)| 00:00:08 |
---------------------------------------------------------------------------

Where previously the cardinality estimates were spot on, now they’re terrible (expecting not 1 row but 999,000 rows !!), because without a histogram Oracle assumes an even distribution between the column’s low and high values. With a low value of 0 and a high value of 99,999,999, the selectivity of the > 600 predicate works out at roughly (99999999 - 600) / 99999999, which is very nearly 1 and hence very nearly all of the 1M rows.
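
One way around it (not something this demo does, just a sketch) is to take the decision away from AUTO and ask for a histogram on the column explicitly, for example:

SQL> exec dbms_stats.gather_table_stats(ownname=>'BOWIE', tabname=>'BOWIE', estimate_percent=>null, cascade=>true, method_opt=>'FOR COLUMNS CODE2 SIZE 254');

Whether the resulting Height-based histogram actually improves the estimate for a predicate such as CODE2 > 600 is well worth testing for yourself, but at least the decision is then yours rather than Oracle’s.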

I’m not a great fan of either the SIZE AUTO or SIZE SKEWONLY options 😉

Hope you’re enjoying these little quizzes, I’ll have another one for you all soon.