## METHOD_OPT=> SIZE AUTO Quiz Solution (The Trickster) September 1, 2011

Posted by Richard Foote in CBO, Histograms, Oracle Indexes, Oracle Statistics.

I was going to leave it for a few days but there have already been so many comments and discussions on all this, I thought I better write something up. In case anyone was wondering, yes I probably am driving my colleagues at work mad with my “Question of the Day” !!

Unfortunately, some might be disappointed at both Oracle and myself 🙂

Yes, I did kinda set things up to trick the unwary and yes, perhaps the answer isn’t what many are expecting.

The answer to my previous question of which column is going to have a histogram when using the METHOD_OPT  SIZE AUTO option is in fact Column 2. Well done to everyone who got it right.

Why ?

The simplest answer is because it’s the only column of the three that has 254 or less distinct values.

Here’s the key point. When using METHOD_OPT SIZE AUTO, every column with 254 or less distinct values that has been referenced within a predicate, will have a Frequency-based histogram. Each and every one of them, regardless of whether the data is actually skewed or not. So Column 2 with only 254 distinct values AND having previously been referenced in a predicate was guaranteed to have a histogram.

If a column has more than 254 distinct values, whether it then has a Height-Based histogram depends on how the data is skewed. If the data is perfectly evenly distributed, then it won’t have a histogram. Column 1, having sequenced based unique values will not meet the criteria and so not have a histogram.

Column 3 is interesting. Having inserted the outlier value, it now has 255 distinct values and so no longer qualifies for an automatic frequency based histogram. However, if all its values are evenly distributed, then it won’t qualify for a height based histogram either and Column 3 only has just the one outlier value, all other values are evenly distributed values. Unfortunately, Oracle doesn’t pick up on rare outlier values (even if you collect 100% statistics and it’s one of the low/high points of the column) and so will not generate a height-based histogram.

The only column that qualifies is Column 2.

A demo to illustrate. First, let’s create and populate our table:

SQL> create table bowie (id number, code1 number, code2 number);

Table created.

SQL> insert into bowie select rownum, mod(rownum,254), mod(rownum,254) from dual  connect by level <= 1000000;

1000000 rows created.

SQL> commit;

Commit complete.

Notice I’m using a MOD function to generate a perfectly even distribution of data. I’ve noticed a few examples (such as that by Charles Hooper in the comments of the Quiz posting), in which the DBMS_RANDOM function is used. Note this will almost certainly generate data with enough natural skewness on a 1M table with 254 random values that when the outlier 255th value is introduced, it will qualify for a height-based histogram. Very easy way to test and find out. Simply generate the 1M data with 255 random values and I suggest a height-based histogram is created regardless.

OK, I’ll run some SQL to generate sufficient workload to qualify the columns for automatic histograms:

SQL> select * from bowie where id = 42;
SQL> select * from bowie where code1 = 42;
SQL> select * from bowie where code2 = 42;

BTW, the difference between the SIZE AUTO and SIZE SKEWONLY options, is that AUTO requires previous workload to suggest a histogram might be relevant, SKEWONLY does not.

If we were to collect statistics at this stage, we would notice that the second and third columns both have a Frequency-Based histogram as both columns only have 254 distinct values and so automatically qualify:

SQL> exec dbms_stats.gather_table_stats(ownname=>'BOWIE', tabname=>'BOWIE', estimate_percent=> null, cascade=>true);

PL/SQL procedure successfully completed.

SQL> select column_name, histogram from dba_tab_columns where table_name = 'BOWIE';

COLUMN_NAME                    HISTOGRAM
------------------------------ ---------------
ID                             NONE
CODE1                          FREQUENCY
CODE2                          FREQUENCY

If we were to run a query using the third column, notice how the cardinality estimates aren’t too bad in this example:

SQL> select * from bowie where code2 > 600;

no rows selected

Execution Plan
----------------------------------------------------------
Plan hash value: 1845943507

---------------------------------------------------------------------------
| Id  | Operation         | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |       |     1 |    13 |   660   (2)| 00:00:08 |
|*  1 |  TABLE ACCESS FULL| BOWIE |     1 |    13 |   660   (2)| 00:00:08 |
---------------------------------------------------------------------------

There are no rows that are greater than 600 and so an estimate of 1 isn’t too bad at all.

OK, let’s add in this one, tiny little row and collect fresh, <strong>100% accurate statistics</strong> (Note: the accurate statistics is very important as Niall’s examples has demonstrated):

&nbsp;

SQL> insert into bowie values (1000001, 42, 99999999);

1 row created.

SQL> commit;

Commit complete.

SQL> exec dbms_stats.gather_table_stats(ownname=>'BOWIE', tabname=>'BOWIE', estimate_percent=> null, cascade=>true);

PL/SQL procedure successfully completed.

SQL> select column_name, histogram from dba_tab_columns where table_name = 'BOWIE';

COLUMN_NAME                    HISTOGRAM
------------------------------ ---------------
ID                             NONE
CODE1                          FREQUENCY
CODE2                          NONE

Note that the third column now has 255 distinct values and so no longer qualifies for the automatic Frequency-Based histogram. As most of its data is perfectly evenly distributed with just the one outlier value, the column doesn’t qualify for a Height-based histogram either and so now has no histogram at all.

Note as I collected 100% accurate statistics, Oracle is definitely aware of this outlier value:

SQL> select column_name, low_value, high_value from dba_tab_columns where table_name='BOWIE' and column_name='CODE2';

COLUMN_NAME  LOW_VALUE  HIGH_VALUE
------------ ---------- ------------
CODE2        80         C464646464

SQL> var high_num number
SQL> exec dbms_stats.convert_raw_value('C464646464',:high_num);

PL/SQL procedure successfully completed.

SQL> print high_num

HIGH_NUM
----------
99999999

But it’s not enough for Oracle to automatically generate a histogram. Which is a shame really, because now we can have all sorts of problems:

SQL> select * from bowie where code2 > 600;
Execution Plan
----------------------------------------------------------
Plan hash value: 1845943507

---------------------------------------------------------------------------
| Id  | Operation         | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |       |   999K|    12M|   660   (2)| 00:00:08 |
|*  1 |  TABLE ACCESS FULL| BOWIE |   999K|    12M|   660   (2)| 00:00:08 |
---------------------------------------------------------------------------

When previously it had the cardinality estimates spot on, now they’re terrible (expecting not 1 row but 999,000 rows !!) because without a histogram, Oracle is assuming even distribution between its low and high point values.

I’m not a great fan of either the SIZE AUTO or SIZE SKEWONLY options 😉

Hope you’re enjoying these little quizzes, I’ll have another one for you all soon.

1. Steve C - September 1, 2011

Nice. That’s a fun problem to think through. I guess the only caveat that we should mention in this discussion is that because we are talking “AUTO” Oracle could change this behavior at any future release.

Like

Richard Foote - September 1, 2011

Hi Steve

True.

Especially true if the optimizer folks listen to some of my suggestions 😉

Like

2. Yogesh Tiwari - September 1, 2011

Got it.

Now, I m just left with one question…in last article’s comment section..Niall and you discussed, that it (this behaviour) depends on release/version as well. Kindly elaborate on that too.

Like

Richard Foote - September 1, 2011

Hi Yogesh

You’ll find this behaviour in all versions of Oracle since the very earliest 10g to the very lastest 11g.

Like

3. Tony Sleight - September 2, 2011

That’s a great article, thanks for publishing. I have just tried out a version with both MOD and DBMS_RANDOM.VALUE and your hypothesis is borne out.

TESTDB01 USER  > drop table t1;

Table dropped.

Elapsed: 00:00:02.28
TESTDB01 USER  > create table t1 (id1 number, id2 number, id3 number, id4 number, id5 number);

Table created.

Elapsed: 00:00:00.18
TESTDB01 USER  > insert into t1 select rownum, mod(rownum, 254), mod(rownum*3, 254), trunc(dbms_random.value(0,254)), trunc(dbms_random.value(0, 254)) from dual connect by level < 1000001;

1000000 rows created.

Elapsed: 00:00:17.60
TESTDB01 USER  > commit;

Commit complete.

Elapsed: 00:00:00.01
TESTDB01 USER  > exec dbms_stats.gather_table_stats(ownname=>'USER',tabname=>'T1', estimate_percent=>null, method_opt=>'FOR ALL COLUMNS SIZE AUTO');

PL/SQL procedure successfully completed.

Elapsed: 00:00:22.10
TESTDB01 USER  > select COLUMN_NAME, NUM_DISTINCT, DENSITY, NUM_BUCKETS, HISTOGRAM from user_tab_columns where table_name = 'T1';

COLUMN_NAME                    NUM_DISTINCT    DENSITY NUM_BUCKETS HISTOGRAM
------------------------------ ------------ ---------- ----------- ---------------
ID1                                 1000000    .000001           1 NONE
ID2                                     254   .0000005         254 FREQUENCY
ID3                                     254   .0000005         254 FREQUENCY
ID4                                     254   .0000005         254 FREQUENCY
ID5                                     254   .0000005         254 FREQUENCY

Elapsed: 00:00:00.03
TESTDB01 USER  > insert into t1 values (1000001, 42, 999999999, 42, 99999999);

1 row created.

Elapsed: 00:00:00.01
TESTDB01 USER  > commit;

Commit complete.

Elapsed: 00:00:00.01
TESTDB01 USER  > exec dbms_stats.gather_table_stats(ownname=>'USER',tabname=>'T1', estimate_percent=>null, method_opt=>'FOR ALL COLUMNS SIZE AUTO');

PL/SQL procedure successfully completed.

Elapsed: 00:00:13.33
TESTDB01 USER  > select COLUMN_NAME, NUM_DISTINCT, DENSITY, NUM_BUCKETS, HISTOGRAM from user_tab_columns where table_name = 'T1';

COLUMN_NAME                    NUM_DISTINCT    DENSITY NUM_BUCKETS HISTOGRAM
------------------------------ ------------ ---------- ----------- ---------------
ID1                                 1000001    .000001           1 NONE
ID2                                     254   .0000005         254 FREQUENCY
ID3                                     255 .003921569           1 NONE
ID4                                     254   .0000005         254 FREQUENCY
ID5                                     255 .003937025         254 HEIGHT BALANCED

Elapsed: 00:00:00.01
TESTDB01 USER  >

you can see the MOD versions only have a histogram on ID3. The random value column with the outlier value has indeed been converted to a height balanced histogram by luck of my random skewness of column ID5

Like

Richard Foote - September 2, 2011

Thanks for that Tony.

It does highlight how data that is not particularly skewed is deemed skewed enough by stats collection.

Like I said, not a great fan.

Like

4. mdinh - September 2, 2011

I’m not a great fan of either the SIZE AUTO or SIZE SKEWONLY options.

What would you suggest as a better alternative or process?

Like

Richard Foote - September 2, 2011

Hi Mdinh

I only used SIZE 1 and then collect histograms just on those cases where its been necessary. Comes down to knowing your data, knowing your application and perhaps hitting issues previously where a histogram has proven to be beneficial.

The vast majority of columns I find don’t need histograms so just collect them on those that do.

Like

5. zhwsh - September 2, 2011

for example

hint:
WordPress.com

This file requires authorization:

You must both be a user of this blog as well as be currently logged into WordPress.com

You can create your own free blog on WordPress.com.

Like

Richard Foote - September 2, 2011

Hi Zhwsh

Yes, I know it’s a pain. Will get around to converting them, ummmm, one day 🙂

Like

6. Randolf Geist - September 3, 2011

Now that the questions raised regarding the generation of histograms by the SIZE AUTO option have been addressed, I think it is worth to point out that these “excess” frequency histograms that get generated for columns with a workload, uniform distribution and a low number of distinct values can have (more or less) subtle side effects one should be aware of:

– In general these histograms will take up space in the corresponding SYS cluster (C_OBJ#_INTCOL#), for larger databases with many (sub)partitions this can become significant, as the histograms might become collected for every (sub)partition

– These histograms potentially allocate space in the dictionary cache

– For every histogram an additional, separate pass is required while gathering statistics, so this can have a significant impact on the time it takes to gather statistics

– Join cardinality estimates will use a completely different code path in the optimizer if both sides of the join have histograms on the expressions/columns used, and this can lead to different cardinality estimates

– For versions >= 9.2 and <= 10.2.0.3:
For equal comparisons with values existing in the frequency histogram you'll end up with the famous “mystery of halving”, so the cardinality estimates will be off by 50%. Not a big problem on its own, but combine several of them and you might have a real cardinality estimate problem

– For versions >= 10.2.0.4
These versions address above “mystery of halving” problem, but they introduce another potential issue: Values that don’t exist in the frequency histogram no longer are estimated with a cardinality estimate of 1 but with half of the least popular value in the histogram. This can again lead to interesting scenarios, like this one.

So the basic idea behind this strategy seems to be good, since with a low number of distinct column values a frequency histogram gives the optimizer the perfect picture of the data distribution, but there are implementation details and side effects that seem to suggest using SIZE 1 and only generate histograms where appropriate and beneficial can be the better approach if one knows its data and application.

Like

Richard Foote - September 3, 2011

Hi Randolf

Thanks so very much for your comments, much appreciated. I couldn’t agree more.

In one of our large production systems we accidently left the automatic stats job running when we first migrated from 9i to 10g. In one fell swoop, we went from about 10 histograms which we manually collected to several 10,000s. The results were disasterous, resulting in row cache latching hell, CPU continously at 100% with parsing overheads and execution plans all over the place.

Like I said, I’m not a great fan of SIZE AUTO 🙂

Like

7. Frederick Tang - January 20, 2012

Hi Richard,

Thanks for the blog. What are some ideas to solve bad plan problems, caused by the fact that we don’t have histogram, when using column 3 as the predicate?

Fred

Like

Richard Foote - February 8, 2012

Hi Fred

If a bad plan can be resolved by having a histogram on a column, then sure go for it. That’s why Oracle has histograms 🙂

The point I was trying to make was to only collect histograms when they’re useful.

Like

8. Amardeep Sidhu - April 29, 2012

May be I am being stupid but where is the Quiz ? 😦

Like

Richard Foote - May 7, 2012