
METHOD_OPT => SIZE AUTO Quiz (Automatic For The People) August 31, 2011

Posted by Richard Foote in Method_Opt Size AUTO, Oracle Indexes, Oracle Statistics.

OK, a nice, easy, simple one today. No tricks, honest ;)

You have a table with 3 columns and let's say 1M rows.

Column 1 is effectively unique, so it has 1M distinct values. Let’s say 1 – 1000000, not that it really matters.

Column 2 has 254 distinct values, all evenly distributed so it has approximately the same number of rows for each value. Let's say the values are 0-253; again, it doesn't really matter.

Column 3 is identical to Column 2: it also has 254 distinct values, all evenly distributed as well such that it also has approximately the same number of rows for each value. Let's say the range of values is the same, 0-253; again, it doesn't really matter.

You have various queries in the database in which all 3 columns are referenced somewhere in WHERE conditions (eg. WHERE Column1 = 42).

You then insert just one row that has the following values based on our example: VALUES (1000001, 42, 99999999).

The key point here is that for Column1, it's just another unique value, just 1 greater than the previous maximum value. Nothing special.

For Column2, it’s just another of the existing 254 values that doesn’t really change the even distribution of the data. Nothing special.

However, for Column 3, it's not only a new value that didn't previously exist (and so there's just the one row with this value in the data, whereas all the other values correspond to roughly 1/254 of the rows), but it's also a value that is way way way outside the normal range of existing values (nothing comes close to having a value of 99999999).

OK, we have the scenario, hopefully you can see where I'm going with this.

You decide to collect fresh statistics with DBMS_STATS; you want them to be completely accurate, so you use a 100% sample size (or compute with estimate_percent=>null). But because you want to get with the AUTO program and make life easier for yourself, you decide to let Oracle work out which columns might require and benefit from a histogram by using METHOD_OPT=>'FOR ALL COLUMNS SIZE AUTO'.

Now finally comes the question. Of the three columns, only one column will have a histogram. Which one, and why is it so?

If you understand how Oracle collects statistics, the answer will hopefully be obvious ;)
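For anyone wanting to play along at home before reading on, the quiz data is easy to mock up outside the database. A quick sanity check in plain Python (purely an illustration of the scenario described above, not anything Oracle itself does):

```python
# Build the quiz table in memory: 1M rows, col1 unique, col2/col3 each
# cycling evenly through 254 values, then the single extra row from the post.
rows = [(i, i % 254, i % 254) for i in range(1, 1_000_001)]
rows.append((1_000_001, 42, 99_999_999))

# Distinct values per column
ndv = [len({r[c] for r in rows}) for c in range(3)]
print(ndv)  # [1000001, 254, 255] -- the insert gives column 3 a 255th distinct value
```

That 255th distinct value in column 3 turns out to matter, as the comments below explore.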


Comments»

1. Marcus Mönnig - August 31, 2011

You wrote no tricks, right? :-)

Since you asked for a 100% sample size, Oracle will definitely see the 99999999, thus know about the skewed data in column 3 and create a histogram for that column.

If this is not correct, I'll be disappointed in both you and Oracle. ;-)

Richard Foote - August 31, 2011

Hi Marcus

Oracle will definitely see the 99999999 value and have it as the highpoint for the column. Definitely.

I won’t give anything away yet but come on, you trust me don’t you :)

2. Niall Litchfield - August 31, 2011

Assuming a currently supported db version? I *believe* the correct answer is version dependent.

Richard Foote - August 31, 2011

Hi Niall

Yes, the answer I’m after should be applicable for any 10g or above version on any platform. Would certainly be interested if you can get different results with any specific version (but I doubt it).

Niall Litchfield - September 1, 2011

Hi Richard,

I thought that http://orawin.info/blog/2011/08/31/histograms/ was a little long for a comment! I had a little difficulty interpreting “You have various queries in the database in which all 3 columns are referenced somewhere in WHERE conditions (eg. WHERE Column1 = 42)” because the example doesn’t match my understanding of the words; so I tested both.

It appears to me that in 11.2.0.2 the 3rd column switches from a frequency to a height balanced histogram (sort of the answer I expected, but not quite, and not what you implied by the question), whereas in 10.2.0.4 it didn’t.

Of course I've probably misunderstood the question (hence the use of a test case so people can quibble with it and play around and so on). If running on 9i then the purge keyword will need to go from the drop table statement, and if you have a table called afp then read the script carefully!!

Niall Litchfield - September 1, 2011

Charles Hooper pointed out that I'd missed the requirement to calculate statistics (estimate_percent => null). When I repeat my test with that condition catered for, I am surprised by the results – but it's beer o'clock now so further thoughts may be delayed and even more incoherent.

3. Yogesh Tiwari - August 31, 2011

Answer: None.

I tried this on 11.2.0.2.3.

4. Yogesh Tiwari - August 31, 2011

Oops… I didn't run a workload… let me do it, and then I'll gather stats again. :)

Richard Foote - August 31, 2011

Hi Yogesh

Yes, with no workload, no histograms would be generated.

Before running a test, which column would you expect it to be?

Don’t forget, I need a “why” answer as well ;)

5. Charles Hooper - August 31, 2011

If I remember correctly, a histogram on the third column would have to be a height balanced histogram, because there can be a maximum of 254 buckets into which the column values could be associated. A histogram on the second column could be a frequency based histogram (a bit more useful to the optimizer) because there are still 254 or fewer distinct values.
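The bucket-limit rule Charles describes can be sketched in a few lines (a deliberate simplification: as the rest of this thread shows, the real AUTO decision also depends on column usage and observed skew):

```python
# Pre-12c histogram bucket ceiling
MAX_BUCKETS = 254

def possible_histogram(num_distinct: int) -> str:
    """Which histogram type can fit, given the distinct-value count (sketch only)."""
    if num_distinct <= MAX_BUCKETS:
        return "FREQUENCY"        # one bucket per distinct value
    return "HEIGHT BALANCED"      # distinct values must share the 254 buckets

print(possible_histogram(254))    # FREQUENCY
print(possible_histogram(255))    # HEIGHT BALANCED
```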

What is interesting is that I see three different results:
If no select statements are performed against the table, there are no histograms generated:

SELECT
  SUBSTR(COLUMN_NAME,1,3) COL,
  DENSITY,
  NUM_BUCKETS,
  LAST_ANALYZED
FROM
  DBA_TAB_COLUMNS
WHERE
  TABLE_NAME='T10'
ORDER BY
  1;
  
COL    DENSITY NUM_BUCKETS LAST_ANAL
--- ---------- ----------- ---------
C1     .000001           1 31-AUG-11
C2  .003937008           1 31-AUG-11
C3  .003921569           1 31-AUG-11

If the data for the second and third columns are generated with the MOD function, only a histogram on the second column is produced:

SELECT
  SUBSTR(COLUMN_NAME,1,3) COL,
  DENSITY,
  NUM_BUCKETS,
  LAST_ANALYZED
FROM
  DBA_TAB_COLUMNS
WHERE
  TABLE_NAME='T11'
ORDER BY
  1;
  
COL    DENSITY NUM_BUCKETS LAST_ANAL
--- ---------- ----------- ---------
C1     .000001           1 31-AUG-11
C2    .0000005         254 31-AUG-11
C3  .003921569           1 31-AUG-11

If the values for the second and third columns are populated randomly with TRUNC(DBMS_RANDOM.VALUE(0,254)) then a histogram is generated on the second and third columns:

SELECT
  SUBSTR(COLUMN_NAME,1,3) COL,
  DENSITY,
  NUM_BUCKETS,
  LAST_ANALYZED
FROM
  DBA_TAB_COLUMNS
WHERE
  TABLE_NAME='T10'
ORDER BY
  1;
 
COL    DENSITY NUM_BUCKETS LAST_ANAL
--- ---------- ----------- ---------
C1     .000001           1 31-AUG-11
C2    .0000005         254 31-AUG-11
C3  .003934458         254 31-AUG-11
Charles Hooper - September 1, 2011

Richard,

I am confused about why the DENSITY value for column C2 changed from 0.003937008 to 0.0000005 once queries had been executed against the table. This was the case for the table populated with MOD and the table populated with DBMS_RANDOM.VALUE.

Niall Litchfield - September 1, 2011

Because Oracle uses column usage (columns in WHERE clauses) when deciding which columns are candidates for generating histograms on. This means that if you gathered stats last week, this week someone ran a new query, and you have method_opt set as above, you'll quite possibly get new stats (and cardinality estimates) for your queries. If you are coming to UKOUG I cover some of this in the graveyard slot on Wed pm (Oracle calculates an expected cardinality for this talk as 1 :) ).

Martin Preiss - September 1, 2011

Strange results – and I got them too on 11.2.0.1. Maybe DBMS_RANDOM brings a little bit more skew into the data than MOD?

Reading in Wolfgang Breitling’s Paper “Histograms – Myths and Facts” (http://www.centrexcc.com/Histograms%20-%20Myths%20and%20Facts.pdf) I see that “size auto” is the same as “size skewonly” for columns used in predicates. There’s also an explanation for the strange density .0000005 (= 1/(num_rows * 2) for frequency histograms).
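The densities quoted in the outputs above line up exactly with the formulas Martin mentions; a quick check in Python (density = 1/(2 * num_rows) for a frequency histogram, and 1/NDV when there is no histogram):

```python
num_rows = 1_000_000

freq_density = 1 / (2 * num_rows)   # frequency histogram: 1/(2 * num_rows)
no_hist_254 = round(1 / 254, 9)     # no histogram: density = 1/NDV, 254 values
no_hist_255 = round(1 / 255, 9)     # ... and with 255 distinct values

print(freq_density)  # 5e-07        -> the .0000005 reported for C2
print(no_hist_254)   # 0.003937008  -> the density seen before any histogram exists
print(no_hist_255)   # 0.003921569  -> C3 once the 255th value appears and the histogram goes
```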

In my tests with MOD there are histograms on both columns when I create the table, issue a query and gather statistics. After adding the additional row with VALUES (1000001, 42, 99999999) there remains only one histogram for the second column. With Event 10046 I see queries like:

select min(minbkt),maxbkt,substrb(dump(min(val),16,0,32),1,120) minval,
  substrb(dump(max(val),16,0,32),1,120) maxval,sum(rep) sumrep, sum(repsq) 
  sumrepsq, max(rep) maxrep, count(*) bktndv, sum(case when rep=1 then 1 else 
  0 end) unqrep 
from
 (select val,min(bkt) minbkt, max(bkt) maxbkt, count(val) rep, count(val)
  *count(val) repsq from (select /*+ no_parallel(t) no_parallel_index(t) 
  dbms_stats cursor_sharing_exact use_weak_name_resl dynamic_sampling(0) 
  no_monitoring no_substrb_pad */"COL1" val, ntile(254) over (order by "COL1")
   bkt  from "DBADMIN"."TEST_HISTOGRAMS" t  where "COL1" is not null) group 
  by val) group by maxbkt order by maxbkt
Martin Preiss - September 1, 2011

And it seems that the decision to create the histogram is somehow based on BKTNDV and UNQREP that are 2 and 1 for the last ntile-row (with the maxvalue: Typ=2 Len=5: c4,64,64,64,64 aka 99999999). But I have no idea what that means…

Regards

Martin
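Martin's bktndv/unqrep observation can be re-enacted outside the database. A rough Python sketch of what the traced ntile(254) query computes for the last bucket (an approximation of that query's aggregates, not Oracle's actual code):

```python
from collections import Counter

# 1M evenly distributed values plus the one outlier, sorted as ntile would see them
values = sorted([i % 254 for i in range(1, 1_000_001)] + [99_999_999])
n, buckets = len(values), 254
base, extra = divmod(n, buckets)    # ntile: the first `extra` buckets get base+1 rows
last_bucket = values[-base:]        # the 254th bucket holds the final `base` rows

counts = Counter(last_bucket)
bktndv = len(counts)                                # distinct values in the last bucket
unqrep = sum(1 for c in counts.values() if c == 1)  # of those, values occurring just once
print(bktndv, unqrep)  # 2 1 -- matching Martin's observation for the last ntile row
```

The outlier shares the final bucket with the tail end of the 253s, giving exactly the BKTNDV = 2, UNQREP = 1 Martin saw in the trace.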

Richard Foote - September 1, 2011

Hi Martin

Indeed. With 1M rows and 254 dbms_random values, there’s likely enough natural skewness to warrant a histogram.

As I describe in my solution post, by adding the 255th value, the column no longer qualifies as an automatic frequency-based histogram.

And indeed, the density calculations differ between no histogram, frequency-based histograms and height-based histograms.

Richard Foote - September 1, 2011

Hi Charles.

With no workload, AUTO will not generate histograms regardless.

With MOD, data is perfectly evenly distributed and so no histograms once you get the 255th distinct value.

With Random, there’s likely enough natural skewness to make Oracle go for the height based histogram.

To test this out, see what happens when you populate the table with 255 randomly generated values and you'll likely still get the histogram.

6. Yogesh Tiwari - September 1, 2011

Answer:

create table hist_test (id1 number, id2 number, id3 number);

insert into hist_test (id1, id2, id3) select rownum, mod(rownum,254), mod(rownum,254) from dual connect by level <= 1000000;

insert into hist_test values (1000001, 42, 99999999);

commit;

exec dbms_stats.gather_table_stats('TECHM','HIST_TEST',cascade => true,estimate_percent =>null,method_opt => 'for all columns size auto');

select table_name, column_name, num_distinct, density, num_buckets, histogram from dba_tab_columns where table_name = 'HIST_TEST';

TABLE_NAME COLUMN_NAME NUM_DISTINCT DENSITY NUM_BUCKETS HISTOGRAM
------------------------------ ------------------------------ ------------ ---------- ----------- ---------------
HIST_TEST ID3 255 .003921569 1 NONE
HIST_TEST ID2 254 .0000005 254 FREQUENCY
HIST_TEST ID1 1000001 .000001 1 NONE

select /*+ gather_plan_statistics */ count(*) from hist_test where id3>999999;

COUNT(*)
----------
2

select * from table(dbms_xplan.display_cursor(null,null,'ALLSTATS LAST'));

------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers |
------------------------------------------------------------------------------------------
| 1 | SORT AGGREGATE | | 1 | 1 | 1 |00:00:00.04 | 2392 |
|* 2 | TABLE ACCESS FULL| HIST_TEST | 1 | 990K| 1 |00:00:00.04 | 2392 |

create index hist_id3x on hist_test(id3);

select /*+ gather_plan_statistics */ count(*) from hist_test where id3>999999;

COUNT(*)
----------
2

select * from table(dbms_xplan.display_cursor(null,null,'ALLSTATS LAST'));

------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | Reads |
------------------------------------------------------------------------------------------------------
| 1 | SORT AGGREGATE | | 1 | 1 | 1 |00:00:00.71 | 2051 | 2040 |
|* 2 | INDEX FAST FULL SCAN| HIST_ID3X | 1 | 990K| 1 |00:00:00.71 | 2051 | 2040 |

Result:

None on ID1
Frequency on ID2
None on ID3

I got same result on 10.2.0.4

I'm assuming… (the end of the quiz will definitely solve my doubt, I know)…

Now, since id3 has 255 distinct values, if Oracle creates a histogram it'll be a height-balanced one, and therefore it won't know how many values would be in the last bucket (as it'll divide values from 0-1000001 into 254 buckets). That is why it doesn't create a histogram.

However, if we delete a few rows, say… and gather stats again:

delete from hist_test where id3=253;

commit;

exec dbms_stats.gather_table_stats('TECHM','HIST_TEST',cascade => true,estimate_percent =>null,method_opt => 'for all columns size auto');

select table_name, column_name, num_distinct, density, num_buckets, histogram from dba_tab_columns where table_name = 'HIST_TEST';

TABLE_NAME COLUMN_NAME NUM_DISTINCT    DENSITY NUM_BUCKETS HISTOGRAM
---------- ----------- ------------ ---------- ----------- ---------
HIST_TEST  ID1               996064 .000001004           1 NONE
HIST_TEST  ID2                  253 5.0198E-07         253 FREQUENCY
HIST_TEST  ID3                  254 5.0198E-07         254 FREQUENCY

It gets a frequency histogram on id3.

Does it answer the "why"… I don't know.

Richard Foote - September 1, 2011

Hi Yogesh

Hopefully the solution post will make everything clear now :)

254 is the magic number when it comes to frequency based histograms being automatically generated.

7. Gokhan Atil - September 1, 2011

Richard,

IMHO, Column2 will be the only column having a histogram. Although you said there was no trick, I believe that there’s a trick here.

The maximum number of histogram buckets is 254. When we insert an extreme number into Column3, Oracle will need to put this new distinct value into a new bucket, but it won't be able to add a 255th bucket. So it will try to recalculate the buckets by redistributing all distinct values. This extreme number will force (mislead) Oracle into distributing all numbers into one bucket.

If you have less than 254 distinct values (or less than 254 buckets), adding this extreme number will not prevent creating histogram for Column3.

Regards

Gokhan

Richard Foote - September 1, 2011

Hi Gokhan

Well done on part I.

However, it’s not the extreme number that causes the lack of histogram, it’s the fact it’s the 255th distinct value and there’s not enough natural skewness in most of the data.

Gokhan Atil - September 2, 2011

Hi Richard,

I do my tests on 10.2.0.5, and in them I see that the "extreme number" is an important factor. If I insert "260" instead of 99999999, Oracle creates a "height balanced" histogram:

insert into bowie values (1000001, 42, 260 );
commit;

exec dbms_stats.gather_table_stats( user, tabname=>'BOWIE', estimate_percent=> null, cascade=>true);

select COLUMN_NAME, NUM_DISTINCT, DENSITY, NUM_BUCKETS, HISTOGRAM from dba_tab_columns where table_name = 'BOWIE';

COLUMN_NAME NUM_DISTINCT               DENSITY NUM_BUCKETS HISTOGRAM
----------- ------------ --------------------- ----------- ---------------
ID               1000001  0,000000999999000001           1 NONE
CODE1                254 0,0000004999995000005         254 FREQUENCY
CODE2                255   0,00393699631800763         254 HEIGHT BALANCED

Regards

Gokhan

8. histograms « Ukrainian Oracle User Group - September 1, 2011

[…] post is by way of reply to Richard Foote’s latest quiz available here Richard wants to know how many columns a histogram will be collected on given certain conditions. I […]

9. Tony Sleight - September 1, 2011

I seem to have different results from the above comments. My initial thought was that only column 2 would have a histogram, as all that had happened was that the data distribution had changed slightly with the insert.
However, I then wondered why column 3 would not just change to a height balanced histogram.

So I decided to test the theory. I am using Oracle 11G R2 on a Windows 32 bit XP laptop. The results are pasted below:

TESTDB01 USER  > create table t1 (id1 number, id2 number, id3 number);

Table created.

TESTDB01 USER  > insert into t1 select rownum, mod(rownum, 254), mod(rownum*3, 254) from dual connect by level < 1000001;

1000000 rows created.

TESTDB01 USER  > commit;

Commit complete.

TESTDB01 USER  > select id2, count(*) from t1 group by id2;

       ID2   COUNT(*)
---------- ----------
       227       3937
       244       3937
       245       3937
       247       3937
         1       3938
        22       3937
:
:  rows left out for brevity
:
        79       3937
        98       3937
       118       3937
       122       3937
       150       3937
       159       3937

254 rows selected.

TESTDB01 USER  > select id3, count(*) from t1 group by id3;

       ID3   COUNT(*)
---------- ----------
       152       3937
       167       3937
       173       3937
       179       3937
       227       3937
:
:  rows left out for brevity
:
        67       3937
        73       3937
        79       3937
       118       3937
       187       3937
        98       3937

254 rows selected.

TESTDB01 USER  > exec dbms_stats.gather_table_stats(ownname=>'LOG',tabname=>'T1', estimate_percent=>null, method_opt=>'FOR ALL COLUMNS SIZE 254');

PL/SQL procedure successfully completed.

TESTDB01 USER  > select COLUMN_NAME, NUM_DISTINCT, DENSITY, NUM_BUCKETS, HISTOGRAM from user_tab_columns where table_name = 'T1';

COLUMN_NAME                    NUM_DISTINCT    DENSITY NUM_BUCKETS HISTOGRAM
------------------------------ ------------ ---------- ----------- ---------------
ID1                                 1000000    .000001         254 HEIGHT BALANCED
ID2                                     254   .0000005         254 FREQUENCY
ID3                                     254   .0000005         254 FREQUENCY

TESTDB01 USER  > insert into t1 values (1000001, 128, 999999);

1 row created.

TESTDB01 USER  > commit;

Commit complete.

TESTDB01 USER  > exec dbms_stats.gather_table_stats(ownname=>'LOG',tabname=>'T1', estimate_percent=>null, method_opt=>'FOR ALL COLUMNS SIZE 254');

PL/SQL procedure successfully completed.

TESTDB01 USER  > select COLUMN_NAME, NUM_DISTINCT, DENSITY, NUM_BUCKETS, HISTOGRAM from user_tab_columns where table_name = 'T1';

COLUMN_NAME                    NUM_DISTINCT    DENSITY NUM_BUCKETS HISTOGRAM
------------------------------ ------------ ---------- ----------- ---------------
ID1                                 1000001    .000001         254 HEIGHT BALANCED
ID2                                     254   .0000005         254 FREQUENCY
ID3                                     255    .003937         254 HEIGHT BALANCED

TESTDB01 USER  >

TESTDB01 SYSTEM  > select * from sys.aux_stats$;

SNAME                          PNAME                               PVAL1
------------------------------ ------------------------------ ----------
SYSSTATS_INFO                  STATUS				COMPLETED
SYSSTATS_INFO                  DSTART				02-08-2011 09:17
SYSSTATS_INFO                  DSTOP				02-08-2011 09:17
SYSSTATS_INFO                  FLAGS                                   1
SYSSTATS_MAIN                  CPUSPEEDNW                           1947
SYSSTATS_MAIN                  IOSEEKTIM                              10
SYSSTATS_MAIN                  IOTFRSPEED                           4096
SYSSTATS_MAIN                  SREADTIM                                2
SYSSTATS_MAIN                  MREADTIM                                3
SYSSTATS_MAIN                  CPUSPEED                             1947
SYSSTATS_MAIN                  MBRC                                   15
SYSSTATS_MAIN                  MAXTHR                           20079616
SYSSTATS_MAIN                  SLAVETHR

The big surprise was an initial height-balanced histogram on ID1, and the second surprise was a conversion of the frequency histogram on ID3 to a height-balanced histogram, which seems to go against the flow of previous posts on this question.

Tony Sleight - September 1, 2011

I noticed my error: I mistakenly used 'SIZE 254' instead of 'SIZE AUTO'.

It’s early in the morning and I have not had enough caffeine yet!

Niall Litchfield - September 1, 2011

Tony

I’m not sure I’d necessarily describe it as an error, just an illustration that the behaviour of DBMS_STATS is *very* dependent upon the exact parameters passed (or set in the environment) as well as underlying data distribution.

Richard Foote - September 1, 2011

Hi Tony

When I don’t have my morning coffee by 9:30am, well just ask my colleagues at work what a disaster that can be :)

Tony Sleight - September 1, 2011

I now agree with the previous posts. There are initially two frequency histograms, one on ID2 and one on ID3. After the insert the histogram on ID3 is lost. However, I cannot explain why the frequency histogram on ID3 is not replaced by a height-balanced histogram, especially considering my flawed previous test did show a conversion from frequency to height balanced.

TESTDB01 USER  > exec dbms_stats.gather_table_stats(ownname=>'USER',tabname=>'T1', estimate_percent=>null, method_opt=>'FOR ALL COLUMNS SIZE AUTO');

PL/SQL procedure successfully completed.

Elapsed: 00:00:05.73
TESTDB01 USER  > select COLUMN_NAME, NUM_DISTINCT, DENSITY, NUM_BUCKETS, HISTOGRAM from user_tab_columns where table_name = 'T1';

COLUMN_NAME                    NUM_DISTINCT    DENSITY NUM_BUCKETS HISTOGRAM
------------------------------ ------------ ---------- ----------- ---------------
ID1                                 1000000    .000001           1 NONE
ID2                                     254   .0000005         254 FREQUENCY
ID3                                     254   .0000005         254 FREQUENCY

Elapsed: 00:00:00.07
TESTDB01 USER  > insert into t1 values (1000001, 42, 999999999);

1 row created.

Elapsed: 00:00:00.07
TESTDB01 USER  > commit;

Commit complete.

Elapsed: 00:00:00.01
TESTDB01 USER  > exec dbms_stats.gather_table_stats(ownname=>'USER',tabname=>'T1', estimate_percent=>null, method_opt=>'FOR ALL COLUMNS SIZE AUTO');

PL/SQL procedure successfully completed.

Elapsed: 00:00:05.74
TESTDB01 USER  > select COLUMN_NAME, NUM_DISTINCT, DENSITY, NUM_BUCKETS, HISTOGRAM from user_tab_columns where table_name = 'T1';

COLUMN_NAME                    NUM_DISTINCT    DENSITY NUM_BUCKETS HISTOGRAM
------------------------------ ------------ ---------- ----------- ---------------
ID1                                 1000001    .000001           1 NONE
ID2                                     254   .0000005         254 FREQUENCY
ID3                                     255 .003921569           1 NONE

Elapsed: 00:00:00.01
TESTDB01 USER  >
Richard Foote - September 1, 2011

Hi Tony

It's because Oracle is just not very good at spotting outlier values and their possible significance when deciding whether to go for a histogram or not.

The only reason it had the frequency-based histogram is that there were 254 or fewer distinct values; that's its only justification.

10. Richard Foote - September 1, 2011

@Niall

Yes, the key bit you missed in your example is having to collect 100% accurate stats. Without that, Oracle may miss the outlier value, not realise it has more than 254 distinct values and create a histogram regardless.
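A back-of-envelope way to see why the 100% sample matters (a simplification assuming uniform row sampling, not Oracle's actual sampling internals):

```python
# One special row among ~1M: under simple uniform row sampling, the chance
# the sample contains it equals the sample fraction. Miss it, and stats
# collection counts only 254 distinct values for column 3.
def p_outlier_seen(sample_pct: float) -> float:
    return sample_pct / 100.0

for pct in (1, 10, 100):
    print(f"{pct}% sample sees the 99999999 row with probability {p_outlier_seen(pct)}")
```

Only the compute (100%) gather is guaranteed to see the 255th distinct value.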

11. mdinh - September 1, 2011

Using Niall's modified script.

I was curious as to why I was not getting a frequency histogram, and it turns out that I had one too many buckets originally, 255 vs 254.

create table afp as
select 
level c1, 
mod(round(dbms_random.value(1,100000)),254) c2, 
mod(round(dbms_random.value(1,100000)),254) c3 
from dual connect by level < 1000001;


11.2.0.2.0 - 64bit Production
 

'Histogram Info with no queries'

TABLE_NAME COLUMN_NAME NUM_DISTINCT    DENSITY NUM_BUCKETS HISTOGRAM
---------- ----------- ------------ ---------- ----------- ----------------------------------
AFP        C1               1000000    .000001           1 NONE
AFP        C2                   254 .003937008           1 NONE
AFP        C3                   254 .003937008           1 NONE

'Histogram Info with single col queries'

TABLE_NAME COLUMN_NAME NUM_DISTINCT    DENSITY NUM_BUCKETS HISTOGRAM
---------- ----------- ------------ ---------- ----------- ----------------------------------
AFP        C1               1000000    .000001           1 NONE
AFP        C2                   254   .0000005         254 FREQUENCY
AFP        C3                   254   .0000005         254 FREQUENCY

'Histogram Info with queries on all columns'

TABLE_NAME COLUMN_NAME NUM_DISTINCT    DENSITY NUM_BUCKETS HISTOGRAM
---------- ----------- ------------ ---------- ----------- ----------------------------------
AFP        C1               1000000    .000001           1 NONE
AFP        C2                   254   .0000005         254 FREQUENCY
AFP        C3                   254   .0000005         254 FREQUENCY

'add data '

1 row created.


Commit complete.

'Histogram Info with no queries'

TABLE_NAME COLUMN_NAME                    NUM_DISTINCT    DENSITY NUM_BUCKETS HISTOGRAM
---------- ------------------------------ ------------ ---------- ----------- ---------------
AFP        C1                                  1000001    .000001           1 NONE
AFP        C2                                      254   .0000005         254 FREQUENCY
AFP        C3                                      255 .003934596         254 HEIGHT BALANCED

'Histogram Info with single col queries'

TABLE_NAME COLUMN_NAME                    NUM_DISTINCT    DENSITY NUM_BUCKETS HISTOGRAM
---------- ------------------------------ ------------ ---------- ----------- ---------------
AFP        C1                                  1000001    .000001           1 NONE
AFP        C2                                      254   .0000005         254 FREQUENCY
AFP        C3                                      255 .003934596         254 HEIGHT BALANCED

'Histogram Info with queries on all columns'

TABLE_NAME COLUMN_NAME                    NUM_DISTINCT    DENSITY NUM_BUCKETS HISTOGRAM
---------- ------------------------------ ------------ ---------- ----------- ---------------
AFP        C1                                  1000001    .000001           1 NONE
AFP        C2                                      254   .0000005         254 FREQUENCY
AFP        C3                                      255 .003934596         254 HEIGHT BALANCED
Richard Foote - September 1, 2011

Yes, that one extra distinct value makes all the difference.

Note BTW that you're using a random number to generate your values, which explains why you're getting a height-based histogram after the 255th value: the data is likely still skewed enough for stats collection to warrant a histogram.

12. Richard Foote - September 2, 2011

@Gokhan

How did you populate the other data? I think you'll find you have natural skewness in your existing data that has resulted in the histogram, and so it has nothing to do with adding the additional row. Adding a new row to data that is perfectly evenly distributed shouldn't result in a height-based histogram.

I’ve tried to simulate your example (in various 10/11g databases) and could only do so with data that is already somewhat skewed (such as using dbms_random to generate your data).

Gokhan Atil - September 2, 2011

Richard,

As I see it, the problem is how I populated the data. I tried to use your code to generate the data, but I used two queries because "connect by" caused a memory error:

insert into bowie select rownum, mod(rownum,254), mod(rownum,254) from dual connect by level <= 500000;
insert into bowie select rownum + 500000, mod(rownum,254), mod(rownum,254) from dual connect by level <= 500000;
commit;

I haven't added 500000 to col2 and col3, so the data is not "perfectly" evenly distributed. After I inserted the new row, here's the distribution:

select code2, count(*) from bowie group by code2 order by code2 desc;

     CODE2   COUNT(*)
---------- ----------
       260          1
       253       3936
       252       3936
       251       3936
...
         2       3938
         1       3938
         0       3936

255 rows selected

I have a question: you replied to me "…it's the fact it's the 255th distinct value and there's not enough natural skewness in most of the data…"

So the data I generated has "enough natural skewness" because it's not "perfectly even distribution of data"… How can we say the data I generated is naturally skewed?

Regards

Gokhan

Richard Foote - September 3, 2011

Hi Gokhan

IMHO, I don't think we really can say your data is skewed. I don't believe the data is really skewed in the examples that use the dbms_random function either. For most practical purposes, the average density would be close enough and sufficient.

Unfortunately, that’s not the way stats collection sees it and it will generally collect way way way more histograms than is actually necessary (and for the possible issues with this, see Randolf’s excellent comments). And yet still miss cases where it might be useful such as my outlier example.

So I just wanted to point out that it was the existing data and not the new 260 value that caused your histogram.

I’ll say again, I’m not a great fan of SIZE AUTO or SKEWONLY ;)

tonysleight - September 5, 2011

Gokhan,

If you’re interested in determining the skewness of your data I suggest you read this article

http://en.wikipedia.org/wiki/Skewness

The sections on sample skewness and Pearson's skewness can be applied to your data using SQL. The difference between the skewness coefficient and the mean value tells you how your data is skewed.
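For example, the sample (moment) skewness from that article, g1 = m3 / m2^1.5, can be computed directly. With the quiz's evenly distributed MOD data it is essentially zero, while the single 99999999 row swings it to a huge positive value (an illustration in plain Python; the same sums are easy to express in SQL):

```python
# Sample skewness: third central moment over the 1.5 power of the variance
def skewness(xs):
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5

even = [i % 254 for i in range(1, 1_000_001)]   # the quiz's MOD-style data
skew_even = skewness(even)                      # near 0: symmetric, even spread
skew_outlier = skewness(even + [99_999_999])    # strongly positive: one extreme value
print(skew_even, skew_outlier)
```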

13. METHOD_OPT=> SIZE AUTO Quiz Solution (The Trickster) « Richard Foote’s Oracle Blog - May 7, 2012

[…] answer to my previous question of which column is going to have a histogram when using the METHOD_OPT  SIZE AUTO option is in […]

14. Dax - May 16, 2012

Hi Richard,

Following is my test script and result in Oracle Database 11g Express Edition Release 11.2.0.2.0.

SQL> select temporary, count(1) from ind_test group by temporary order by temporary;

T   COUNT(1)
- ----------
F         82
M         10
N      38040
Y         82

Elapsed: 00:00:00.00
SQL> exec dbms_stats.gather_table_stats(user,'ind_test',method_opt=>'for columns temporary size 5');

PL/SQL procedure successfully completed.

Elapsed: 00:00:00.02
SQL> select table_name, num_distinct, low_value, high_value, num_buckets, histogram from dba_tab_col_statistics where table_name = 'IND_TEST';

TABLE_NAME NUM_DISTINCT LOW_VALUE HIGH_VALUE NUM_BUCKETS HISTOGRAM
---------- ------------ --------- ---------- ----------- ---------
IND_TEST              4 46        59                   4 FREQUENCY

Elapsed: 00:00:00.00
SQL> select column_name,endpoint_number,endpoint_value from user_tab_histograms where table_name = 'IND_TEST' ORDER BY ENDPOINT_NUMBER;

COLUMN_NAME ENDPOINT_NUMBER       ENDPOINT_VALUE
----------- --------------- --------------------
TEMPORARY                 7 3.63460780097438E+35
TEMPORARY                11 3.99806858107182E+35
TEMPORARY              5491 4.04999154965717E+35
TEMPORARY              5499 4.62114420409600E+35

Question: I could not understand ENDPOINT_NUMBER & ENDPOINT_VALUE. I was expecting ENDPOINT_NUMBER as 82, 92, 38142, 38224. Will you please explain how these numbers are generated?

Thanks
Dax

Richard Foote - May 16, 2012

Hi Dax

The reason you’re not getting the expected endpoint_number values is because you have only collected sample stats, not full compute stats. Add the estimate_percent parameter set to null and the numbers should be more accurate.

The endpoint_values are numeric representations of the actual values. Rather than going through it all, the following blog post by Martin Widlake does an excellent job of explaining it all:

http://mwidlake.wordpress.com/2009/08/11/decrypting-histogram-data/
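For the record, Dax's four ENDPOINT_VALUEs decode neatly with the scheme Martin Widlake's post explains: for a character column, the endpoint is roughly the string's leading bytes read as a base-256 number, so for single-character values, dividing by 256^14 recovers the first character. A hedged Python sketch:

```python
# Decode a character-column ENDPOINT_VALUE back to its first character.
# For longer strings further bytes follow, so this recovers only the
# leading character; it is exact for Dax's one-character values.
def first_char(endpoint_value: float) -> str:
    return chr(round(endpoint_value / 256 ** 14))

for ev in (3.63460780097438E+35, 3.99806858107182E+35,
           4.04999154965717E+35, 4.62114420409600E+35):
    print(first_char(ev))   # F, M, N, Y -- the four TEMPORARY values in the table
```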

Dax - May 24, 2012

Thank you very much Richard! Very good post from Martin

15. Richard Foote - May 24, 2012

Hi Dax.

No worries. Martin has written many excellent blog articles, worth checking out.

