METHOD_OPT => SIZE AUTO Quiz (Automatic For The People) August 31, 2011
Posted by Richard Foote in Method_Opt Size AUTO, Oracle Indexes, Oracle Statistics.trackback
OK, a nice, easy, simple one today. No tricks, honest 😉
You have a table with 3 columns and lets say 1M rows.
Column 1 is effectively unique, so it has 1M distinct values. Let’s say 1 – 1000000, not that it really matters.
Column 2 has 254 distinct values, all evenly distributed so it has approximately the same number of rows for each value. Let’s say the values are 0-253 again it doesn’t really matter.
Column 3 is identical to Column 2, it also has 254 distinct values, all evenly distributed as well such that it also has approximately the same number of rows for each value. Let’s say the range of values are the same, 0-253, again it doesn’t really matter.
You have various queries in the database in which all 3 columns are referenced somewhere in WHERE conditions (eg. WHERE Column1 = 42).
You then insert just one row that has the following values based on our example: VALUES (1000001, 42, 99999999).
The key points here is that for Column1, it’s just another unique value, just 1 greater than the previous maximum value. Nothing special.
For Column2, it’s just another of the existing 254 values that doesn’t really change the even distribution of the data. Nothing special.
However, for Column 3, it’s not only a new value that didn’t previously exist (and so there’s just the one row with this value in the data, whereas all the other values correspond to roughly 1/254 of the rows) but it’s also a value that is way way way outside the normal range of existing values (nothing comes close to having a value of 99999999).
OK, we have the scenario, hopefully you can see where I going with this.
You decide to collect fresh statistics with DBMS_STATS, you want them to be completely accurate so you use a 100% sample size (or compute with estimate_percent=>null). But because you want to get with the AUTO program and make life easier for yourself, you decide to let Oracle work out which columns might require and benefit from a histogram by using METHOD_OPT=>’ FOR ALL COLUMNS SIZE AUTO’.
Now finally comes the question. Of the three columns, only one column will have a histogram. Which one and why is it so ?
If you understand how Oracle collects statistics, the answer will hopefully be obvious 😉
You wrote no tricks, right? 🙂
Since you asked for a 100% sample size, Oracle will definetly see the 99999999, thus know about the skewed data in column 3 and create a histogram for that column.
If this is not correct, I’ll be disappointed in both, you and Oracle. 😉
LikeLike
Hi Marcus
Oracle will definitely see the 99999999 value and have it as the highpoint for the column. Definitely.
I won’t give anything away yet but come on, you trust me don’t you 🙂
LikeLike
Assuming a currently supported db version? I *believe* the correct answer is version dependent.
LikeLike
Hi Niall
Yes, the answer I’m after should be applicable for any 10g or above version on any platform. Would certainly be interested if you can get different results with any specific version (but I doubt it).
LikeLike
Hi Richard,
I thought that http://orawin.info/blog/2011/08/31/histograms/ was a little long for a comment! I had a little difficulty interpreting “You have various queries in the database in which all 3 columns are referenced somewhere in WHERE conditions (eg. WHERE Column1 = 42)” because the example doesn’t match my understanding of the words; so I tested both.
It appears to me that in 11.2.0.2 the 3rd column switches from a frequency to a height balanced histogram (sort of the answer I expected, but not quite, and not what you implied by the question), whereas in 10.2.0.4 it didn’t.
Of course I’ve probably misunderstood the question (hence the use of a test case so people can quibble with it and play around and so on). If running on 9i then the purge will need to go from drop table and if you have a table called afp then read the script carefully!!
LikeLike
Charles Hooper pointed out that I’d missed the requirement to calculate statistics (estimate_percent => null). When I repeat my test with that condition catered, I am surprised by the results – but it’s beer o’clock now so further thoughts may be delayed and even more incoherent.
LikeLike
Answer: None.
I tried this on 11.2.0.2.3.
LikeLike
Oops…I didnt run a workload…let me do it, and then I ll gather stats again. 🙂
LikeLike
Hi Yogesh
Yes, with no workload, no histograms would be generated.
Before running a test, what would you expect the column to be ?
Don’t forget, I need a “why” answer as well 😉
LikeLike
If I remember correctly, a histogram on the third column would have to be a height balanced histogram, because there can be a maximum of 254 buckets into which the column values could be associated. A histogram on the second column could be a frequency based histogram (a bit more useful to the optimizer) because there are still 254 or fewer distinct values.
What is interesting is that I see three different results:
If no select statements are performed against the table, there are no histograms generated:
If the data for the second and third columns are generated with the MOD function, only a histogram on the second column is produced:
If the values for the second and third columns are populated randomly with TRUNC(DBMS_RANDOM.VALUE(0,254)) then a histogram is generated on the second and third columns:
LikeLike
Richard,
I am confused why the DENISITY value for column C2 changed from 0.003937008 to 0.0000005 when queries have been executed against the table? This was the case for the table populated with MOD and the table populated with DBMS_RANDOM.VALUE.
LikeLike
Because oracle uses column usage (columns in where clauses) when deciding which columns are candidates for generating histograms on. This means that if you gathered stats last week, this week someone ran a new query, and you have method_opt set as above you’ll quite possibly get new stats (and cardinality estimates) for your queries. If you are coming to UKOUG I cover some of this in the graveyard slot on Wed pm (oracle calculates an expected cardinality for this talk as 1 🙂 ).
LikeLike
strange results – and I got them too on 11.2.0.1. Maybe DBMS_RANDOM brings a little bit more skew into the data than MOD?
Reading in Wolfgang Breitling’s Paper “Histograms – Myths and Facts” (http://www.centrexcc.com/Histograms%20-%20Myths%20and%20Facts.pdf) I see that “size auto” is the same as “size skewonly” for columns used in predicates. There’s also an explanation for the strange density .0000005 (= 1/(num_rows * 2) for frequency histograms).
In my tests with MOD there are histograms on both columns when I create the table, issue a query and gather statistics. After adding the additional row with VALUES (1000001, 42, 99999999) there remains only one histogram for the second column. With Event 10046 I see queries like:
LikeLike
And it seems that the decision to create the histogram is somehow based on BKTNDV and UNQREP that are 2 and 1 for the last ntile-row (with the maxvalue: Typ=2 Len=5: c4,64,64,64,64 aka 99999999). But I have no idea what that means…
Regards
Martin
LikeLike
Hi Martin
Indeed. With 1M rows and 254 dbms_random values, there’s likely enough natural skewness to warrant a histogram.
As I describe in my solution post, by adding the 255th value, the column no longer qualifies as an automatic frequency-based histogram.
And indeed, the density calculations differ between both No, Frequency and Height based histograms.
LikeLike
Hi Charles.
With no workload, AUTO will not generate histograms regardless.
With MOD, data is perfectly evenly distributed and so no histograms once you get the 255th distinct value.
With Random, there’s likely enough natural skewness to make Oracle go for the height based histogram.
To test this out, see what happens when you populate the table with 255 randonly generated values and you’ll likely still get the histogram.
LikeLike
Answer:
create table hist_test (id1 number, id2 number, id3 number);
insert into hist_test (id1, id2, id3) select rownum, mod(rownum,254), mod(rownum,254) from dual connect by level true,estimate_percent =>null,method_opt => 'for all columns size auto');
select table_name, column_name, num_distinct, density, num_buckets, histogram from dba_tab_columns where table_name = 'HIST_TEST';
TABLE_NAME COLUMN_NAME NUM_DISTINCT DENSITY NUM_BUCKETS HISTOGRAM
------------------------------ ------------------------------ ------------ ---------- ----------- ---------------
HIST_TEST ID3 255 .003921569 1 NONE
HIST_TEST ID2 254 .0000005 254 FREQUENCY
HIST_TEST ID1 1000001 .000001 1 NONE
select /*+ gather_plan_statistics */ count(*) from hist_test where id3>999999;
COUNT(*)
----------
2
select * from table(dbms_xplan.display_cursor(null,null,'ALLSTATS LAST'));
------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers |
------------------------------------------------------------------------------------------
| 1 | SORT AGGREGATE | | 1 | 1 | 1 |00:00:00.04 | 2392 |
|* 2 | TABLE ACCESS FULL| HIST_TEST | 1 | 990K| 1 |00:00:00.04 | 2392 |
create index hist_id3x on hist_test(id3);
select /*+ gather_plan_statistics */ count(*) from hist_test where id3>999999;
COUNT(*)
----------
2
select * from table(dbms_xplan.display_cursor(null,null,'ALLSTATS LAST'));
------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | Buffers | Reads |
------------------------------------------------------------------------------------------------------
| 1 | SORT AGGREGATE | | 1 | 1 | 1 |00:00:00.71 | 2051 | 2040 |
|* 2 | INDEX FAST FULL SCAN| HIST_ID3X | 1 | 990K| 1 |00:00:00.71 | 2051 | 2040 |
Result:
None on ID1
Frequency on ID2
None on ID3
I got same result on 10.2.0.4
I m assuming..(end of quiz will definitely solve my dount..I know)…
Now, since id3 has 255 distinct values, if oracle creates histogram, it ll be a HBH, therefore, it ll not know, how many values ‘d be there in last bucket..(as it ll divide values from 0-1000001 into 254 buckets.) That is why it doesnt create a histogram.
However, if we delete a few rows says…and gather stats again
delete from hist_test where id3=253;
commit;
exec dbms_stats.gather_table_stats(‘TECHM’,’HIST_TEST’,cascade => true,estimate_percent =>null,method_opt => ‘for all columns size auto’);
select table_name, column_name, num_distinct, density, num_buckets, histogram from dba_tab_columns where table_name = ‘HIST_TEST’;
TABLE_NAME COLUMN_NAME NUM_DISTINCT DENSITY NUM_BUCKETS HISTOGRAM
—————————— ——————– ———— ———- ———– —————
HIST_TEST ID1 996064 .000001004 1 NONE
HIST_TEST ID2 253 5.0198E-07 253 FREQUENCY
HIST_TEST ID3 254 5.0198E-07 254 FREQUENCY
It gets a frequency histogram on id3.
Does it answer it “Why”…I dont know.
LikeLike
Hi Yogesh
Hopefully the solution post will make everything clear now 🙂
254 is the magic number when it comes to frequency based histograms being automatically generated.
LikeLike
Richard,
IMHO, Column2 will be the only column having a histogram. Although you said there was no trick, I believe that there’s a trick here.
Maximum number of histogram buckets is 254. When we insert an extreme number to Column3, Oracle will need to put this new distinct value to a new bucket, but it won’t be able to add 255th bucket. So it will try to recalculate the buckets by redistributing all distinct values. This extreme number will force (mislead) Oracle to distribute all numbers in one bucket.
If you have less than 254 distinct values (or less than 254 buckets), adding this extreme number will not prevent creating histogram for Column3.
Regards
Gokhan
LikeLike
Hi Gokhan
Well done on part I.
However, it’s not the extreme number that causes the lack of histogram, it’s the fact it’s the 255th distinct value and there’s not enough natural skewness in most of the data.
LikeLike
Hi Richard,
I do my tests on 10.2.0.5 and in my tests, I see that “extreme number” is an important factor. If I insert “260” instead of 99999999, oracle creates “height balanced” histogram:
insert into bowie values (1000001, 42, 260 );
commit;
exec dbms_stats.gather_table_stats( user, tabname=>’BOWIE’, estimate_percent=> null, cascade=>true);
select COLUMN_NAME, NUM_DISTINCT, DENSITY, NUM_BUCKETS, HISTOGRAM from dba_tab_columns where table_name = ‘BOWIE’;
COLUMN_NAME NUM_DISTINCT DENSITY NUM_BUCKETS HISTOGRAM
—————————— ———————- ———————- ———————- —————
ID 1000001 0,000000999999000001 1 NONE
CODE1 254 0,0000004999995000005 254 FREQUENCY
CODE2 255 0,00393699631800763 254 HEIGHT BALANCED
Regards
Gokhan
LikeLike
[…] post is by way of reply to Richard Foote’s latest quiz available here Richard wants to know how many columns a histogram will be collected on given certain conditions. I […]
LikeLike
I seem to have different results to the above comments. My initial thought was that only column 2 would contain a histogram as all that had happened was the data distribution had changed slightly by the insert.
However, I then wondered why column 3 would not just change to a height balanced histogram.
So I decided to test the theory. I am using Oracle 11G R2 on a Windows 32 bit XP laptop. The results are pasted below:
The big surprise was an initial height balanced histogram on ID1 and the second surprise was a conversion of a frequency histogram on ID3 to a height balanced histogram which seems to go against the flow from previous posts on this question.
LikeLike
I notice my error, I have mistakenly used ‘SIZE 254’ instead of ‘SIZE AUTO’.
It’s early in the morning and I have not had enough caffeine yet!
LikeLike
Tony
I’m not sure I’d necessarily describe it as an error, just an illustration that the behaviour of DBMS_STATS is *very* dependent upon the exact parameters passed (or set in the environment) as well as underlying data distribution.
LikeLike
Hi Tony
When I don’t have my morning coffee by 9:30am, well just ask my colleagues at work what a disaster that can be 🙂
LikeLike
I now agree with the previous posts, There are initially two frequency histograms, one on ID2 and one on ID3. After the insert the histogram on ID3 is lost. However, I cannot explain why the frequency histogram on ID3 is not replaced by a height balanced histogram. Especially considering my flawed previous test did show a conversion from frequency to height balanced.
LikeLike
Hi Tony
It’s because Oracle is just not very good at spotting outlier values and its possible significance when deciding when to go for a histogram or not.
The only reason it had the frequency based histogram is just because there were 254 or less distinct values, that’s its only justification.
LikeLike
@Niall
Yes, the key bit you missed in your example is having to collect 100% accurate stats. Without that, Oracle may miss the outlier value, not realise it has more than 254 distinct values and create a histogram regardless.
LikeLike
Using Nial’s modified script.
Was curious as to why I was not getting Frequency histogram and it turns out that I had one too many bucket originally, 255 vs 254..
LikeLike
Yes, that one extra distinct value makes all the difference.
Note BTW that you’re using a random number to generate your values which explains why you’re getting a height based histogram after the 255th value as it likely is skewed enough still for stats collection to warrant a histogram.
LikeLike
@Gokhan
How did you populate the other data ? I think you find you have natural skewness in your existing data that has resulted in the histogram and so has nothing to do with adding the additional row. Adding a new row with data that is perfectly evenly distributed shouldn’t result in a height-based histogram.
I’ve tried to simulate your example (in various 10/11g databases) and could only do so with data that is already somewhat skewed (such as using dbms_random to generate your data).
LikeLike
Richard,
As I see the problem is how I populated the data. I have tried to use the your code to generate the data but I used two queries because “connect by” caused memory error:
insert into bowie select rownum, mod(rownum,254), mod(rownum,254) from dual connect by level <= 500000;
insert into bowie select rownum + 500000, mod(rownum,254), mod(rownum,254) from dual connect by level <= 500000;
commit;
I haven't added 500000 to col2 and col3, so the data is not "perfectly" evenly distributed. After I inserted a new data, here's the distribution:
select code2, count(*) from bowie group by code2 order by code2 desc;
CODE2 COUNT(*)
———————- ———————-
260 1
253 3936
252 3936
251 3936
…
2 3938
1 3938
0 3936
255 rows selected
I have a question: You replied me "…it’s the fact it’s the 255th distinct value and there’s not enough natural skewness in most of the data…"
So the data I generated has "enough natural skewness" because it's not "perfectly even distribution of data"… How can we say the data I generated is naturally skewed?
Regards
Gokhan
LikeLike
Hi Gokhan
IMHO, I don’t think we really can say your data in skewed. I don’t believe the data is really skewed in the examples that use the dbms_random function either. For most practical purposes, the average density would be close enough and sufficient.
Unfortunately, that’s not the way stats collection sees it and it will generally collect way way way more histograms than is actually necessary (and for the possible issues with this, see Randolf’s excellent comments). And yet still miss cases where it might be useful such as my outlier example.
So I just wanted to point out that is was the existing data and not the new 260 value that caused your histogram.
I’ll say again, I’m not a great fan of SIZE AUTO or SKEWONLY 😉
LikeLike
Gokhan,
If you’re interested in determining the skewness of your data I suggest you read this article
http://en.wikipedia.org/wiki/Skewness
The section on Sample Skewness and Parson’s skewness can be applied to your data using SQL. The difference between the skewness coefficient and the mean value tells you how you data is skewed.
LikeLike
[…] answer to my previous question of which column is going to have a histogram when using the METHOD_OPT SIZE AUTO option is in […]
LikeLike
Hi Richard,
Following is my test script and result in Oracle Database 11g Express Edition Release 11.2.0.2.0.
SQL> select temporary, count(1) from ind_test group by temporary order by temporary;
T COUNT(1)
– ———-
F 82
M 10
N 38040
Y 82
Elapsed: 00:00:00.00
SQL> exec dbms_stats.gather_table_stats(user,’ind_test’,method_opt=>’for columns temporary size 5′);
PL/SQL procedure successfully completed.
Elapsed: 00:00:00.02
SQL> select table_name, num_distinct, low_value, high_value, num_buckets, histogram from dba_tab_col_statistics where table_name = ‘IND_TEST’;
TABLE_NAME NUM_DISTINCT LOW_VALUE HIGH_VALUE NUM_BUCKETS HISTOGRAM
—————————— ———— —————————————————————- —————————————————————- ———– —————
IND_TEST 4 46 59 4 FREQUENCY
Elapsed: 00:00:00.00
SQL> select column_name,endpoint_number,endpoint_value from user_tab_histograms where table_name = ‘IND_TEST’ ORDER BY ENDPOINT_NUMBER;
COLUMN_NAM ENDPOINT_NUMBER ENDPOINT_VALUE
———- ——————– ——————–
TEMPORARY 7 3.63460780097438E+35
TEMPORARY 11 3.99806858107182E+35
TEMPORARY 5491 4.04999154965717E+35
TEMPORARY 5499 4.62114420409600E+35
Question : I could not understand ENDPOINT_NUMBER & ENDPOINT_VALUE? I was expecting ENDPOINT_NUMBER as 82,92,38142, 38224.. Will you pl. explain how this nos are generated?
Thanks
Dax
LikeLike
Hi Dax
The reason you’re not getting the expected endpoint_number values is because you have only collected sample stats, not full compute stats. Add the estimate_percent parameter set to null and the numbers should be more accurate.
The endpoint_values are numeric representation of the actual values. Rather than going through it all, the following blog post by Martin Widlake does an excellent job of explaining it all:
LikeLike
Thank you very much Richard! Very good post from Martin
LikeLike
Hi Dax.
No worries. Martin has written many excellent blog articles, worth checking out.
LikeLike