##
METHOD_OPT=> SIZE AUTO Quiz Solution (The Trickster)
*September 1, 2011*

*Posted by Richard Foote in CBO, Histograms, Oracle Indexes, Oracle Statistics.*

16 comments

16 comments

I was going to leave it for a few days but there have already been so many comments and discussions on all this, I thought I better write something up. In case anyone was wondering, yes I probably am driving my colleagues at work mad with my “Question of the Day” !!

Unfortunately, some might be disappointed at both Oracle and myself :)

Yes, I did kinda set things up to trick the unwary and yes, perhaps the answer isn’t what many are expecting.

The answer to my previous question of which column is going to have a histogram when using the METHOD_OPT SIZE AUTO option is in fact **Column 2**. Well done to everyone who got it right.

Why ?

The simplest answer is because it’s the only column of the three that has **254 or less distinct values**.

Here’s the key point. When using METHOD_OPT SIZE AUTO, every column with 254 or less distinct values that has been referenced within a predicate, will have a Frequency-based histogram. Each and every one of them, regardless of whether the data is actually skewed or not. So Column 2 with only 254 distinct values AND having previously been referenced in a predicate was guaranteed to have a histogram.

If a column has more than 254 distinct values, whether it then has a Height-Based histogram depends on how the data is skewed. If the data is perfectly evenly distributed, then it won’t have a histogram. Column 1, having sequenced based unique values will not meet the criteria and so not have a histogram.

Column 3 is interesting. Having inserted the outlier value, it now has 255 distinct values and so no longer qualifies for an automatic frequency based histogram. However, if all its values are evenly distributed, then it won’t qualify for a height based histogram either and Column 3 only has just the one outlier value, all other values are evenly distributed values. Unfortunately, Oracle doesn’t pick up on rare outlier values (even if you collect 100% statistics and it’s one of the low/high points of the column) and so will not generate a height-based histogram.

The only column that qualifies is Column 2.

A demo to illustrate. First, let’s create and populate our table:

SQL> create table bowie (id number, code1 number, code2 number); Table created. SQL> insert into bowie select rownum, mod(rownum,254), mod(rownum,254) from dual connect by level <= 1000000; 1000000 rows created. SQL> commit; Commit complete.

Notice I’m using a MOD function to generate a **perfectly even distribution of data**. I’ve noticed a few examples (such as that by Charles Hooper in the comments of the Quiz posting), in which the DBMS_RANDOM function is used. Note this will almost certainly generate data with enough natural skewness on a 1M table with 254 random values that when the outlier 255th value is introduced, it will qualify for a height-based histogram. Very easy way to test and find out. Simply generate the 1M data with 255 random values and I suggest a height-based histogram is created regardless.

OK, I’ll run some SQL to generate sufficient workload to qualify the columns for automatic histograms:

SQL> select * from bowie where id = 42; SQL> select * from bowie where code1 = 42; SQL> select * from bowie where code2 = 42;

BTW, the difference between the SIZE AUTO and SIZE SKEWONLY options, is that AUTO requires previous workload to suggest a histogram might be relevant, SKEWONLY does not.

If we were to collect statistics at this stage, we would notice that the second and third columns both have a Frequency-Based histogram as both columns only have 254 distinct values and so automatically qualify:

SQL> exec dbms_stats.gather_table_stats(ownname=>'BOWIE', tabname=>'BOWIE', estimate_percent=> null, cascade=>true); PL/SQL procedure successfully completed. SQL> select column_name, histogram from dba_tab_columns where table_name = 'BOWIE'; COLUMN_NAME HISTOGRAM ------------------------------ --------------- ID NONE CODE1 FREQUENCY CODE2 FREQUENCY

If we were to run a query using the third column, notice how the cardinality estimates aren’t too bad in this example:

SQL> select * from bowie where code2 > 600; no rows selected Execution Plan ---------------------------------------------------------- Plan hash value: 1845943507 --------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | --------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 1 | 13 | 660 (2)| 00:00:08 | |* 1 | TABLE ACCESS FULL| BOWIE | 1 | 13 | 660 (2)| 00:00:08 | ---------------------------------------------------------------------------

There are no rows that are greater than 600 and so an estimate of 1 isn’t too bad at all.

OK, let’s add in this one, tiny little row and collect fresh, <strong>100% accurate statistics</strong> (Note: the accurate statistics is very important as Niall’s examples has demonstrated):

SQL> insert into bowie values (1000001, 42, 99999999); 1 row created. SQL> commit; Commit complete. SQL> exec dbms_stats.gather_table_stats(ownname=>'BOWIE', tabname=>'BOWIE', estimate_percent=> null, cascade=>true); PL/SQL procedure successfully completed. SQL> select column_name, histogram from dba_tab_columns where table_name = 'BOWIE'; COLUMN_NAME HISTOGRAM ------------------------------ --------------- ID NONE CODE1 FREQUENCY CODE2 NONE

Note that the third column now has 255 distinct values and so no longer qualifies for the automatic Frequency-Based histogram. As most of its data is perfectly evenly distributed with just the one outlier value, the column doesn’t qualify for a Height-based histogram either and so now has no histogram at all.

Note as I collected 100% accurate statistics, Oracle is definitely aware of this outlier value:

SQL> select column_name, low_value, high_value from dba_tab_columns where table_name='BOWIE' and column_name='CODE2'; COLUMN_NAME LOW_VALUE HIGH_VALUE ------------ ---------- ------------ CODE2 80 C464646464 SQL> var high_num number SQL> exec dbms_stats.convert_raw_value('C464646464',:high_num); PL/SQL procedure successfully completed. SQL> print high_num HIGH_NUM ---------- 99999999

But it’s not enough for Oracle to automatically generate a histogram. Which is a shame really, because now we can have all sorts of problems:

SQL> select * from bowie where code2 > 600; Execution Plan ---------------------------------------------------------- Plan hash value: 1845943507 --------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | --------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 999K| 12M| 660 (2)| 00:00:08 | |* 1 | TABLE ACCESS FULL| BOWIE | 999K| 12M| 660 (2)| 00:00:08 | ---------------------------------------------------------------------------

When previously it had the cardinality estimates spot on, now they’re terrible (expecting not 1 row but 999,000 rows !!) because without a histogram, Oracle is assuming even distribution between its low and high point values.

I’m not a great fan of either the SIZE AUTO or SIZE SKEWONLY options ;)

Hope you’re enjoying these little quizzes, I’ll have another one for you all soon.

##
METHOD_OPT => SIZE AUTO Quiz (Automatic For The People)
*August 31, 2011*

*Posted by Richard Foote in Method_Opt Size AUTO, Oracle Indexes, Oracle Statistics.*

40 comments

40 comments

OK, a nice, easy, simple one today. No tricks, honest ;)

You have a table with 3 columns and lets say 1M rows.

Column 1 is effectively unique, so it has 1M distinct values. Let’s say 1 – 1000000, not that it really matters.

Column 2 has 254 distinct values, all evenly distributed so it has approximately the same number of rows for each value. Let’s say the values are 0-253 again it doesn’t really matter.

Column 3 is identical to Column 2, it also has 254 distinct values, all evenly distributed as well such that it also has approximately the same number of rows for each value. Let’s say the range of values are the same, 0-253, again it doesn’t really matter.

You have various queries in the database in which all 3 columns are referenced somewhere in WHERE conditions (eg. WHERE Column1 = 42).

You then insert just one row that has the following values based on our example: VALUES (1000001, 42, 99999999).

The key points here is that for Column1, it’s just another unique value, just 1 greater than the previous maximum value. Nothing special.

For Column2, it’s just another of the existing 254 values that doesn’t really change the even distribution of the data. Nothing special.

However, for Column 3, it’s not only a new value that didn’t previously exist (and so there’s just the one row with this value in the data, whereas all the other values correspond to roughly 1/254 of the rows) but it’s also a value that is way way way outside the normal range of existing values (nothing comes close to having a value of 99999999).

OK, we have the scenario, hopefully you can see where I going with this.

You decide to collect fresh statistics with DBMS_STATS, you want them to be completely accurate so you use a **100% sample size (or compute with estimate_percent=>null)**. But because you want to get with the AUTO program and make life easier for yourself, you decide to let Oracle work out which columns might require and benefit from a histogram by using METHOD_OPT=>’ FOR ALL COLUMNS **SIZE AUTO’**.

**Now finally comes the question. Of the three columns, only one column will have a histogram. Which one and why is it so ?**

If you understand how Oracle collects statistics, the answer will hopefully be obvious ;)