
Storing Date Values As Numbers (The Numbers) June 1, 2016

Posted by Richard Foote in 12c, CBO, Histograms, Oracle Indexes, Storing Dates As Numbers.

In my last couple of posts, I’ve been discussing how storing date data in a character based column is a really really bad idea.

In a follow-up question, I was asked if storing dates in NUMBER format was a better option. The answer is that it’s probably an improvement over storing dates as strings, but it’s still a really really bad idea. Storing dates in the DATE data type is easily the best option, as is storing any data in its native data type.

In this post, I’ll highlight a few of the classic issues with storing dates in basic number format as well as showing you some of the calculations on the CBO cardinality estimates.

As usual, the demo starts with a basic little table that I’ll populate with date data stored in a NUMBER column (ZIGGY_DATE):

SQL> create table ziggy (id number, code number, ziggy_date number);
    
Table created.

SQL> insert into ziggy select rownum, mod(rownum,1000), 
to_number(to_char(sysdate-mod(rownum,10000), 'YYYYMMDD')) 
from dual connect by level <=1000000;

1000000 rows created.

SQL> commit;

Commit complete.

We’ll now collect statistics on the table:

SQL> exec dbms_stats.gather_table_stats(ownname=>null, tabname=>'ZIGGY');

PL/SQL procedure successfully completed.

SQL> select column_name, num_distinct, density, histogram, hidden_column, virtual_column from dba_tab_cols where table_name='ZIGGY';

COLUMN_NAME NUM_DISTINCT    DENSITY HISTOGRAM       HID VIR
----------- ------------ ---------- --------------- --- ---
ZIGGY_DATE         10000      .0001 NONE            NO  NO
CODE                1000       .001 NONE            NO  NO
ID               1000000    .000001 NONE            NO  NO

So the ZIGGY_DATE column has 10,000 distinct dates (with 100 rows per distinct date), with a column density of 1/10000 = 0.0001.

Let’s now create a standard B-Tree index on the ZIGGY_DATE column:

SQL> create index ziggy_date_i on ziggy(ziggy_date);
                  
Index created.
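
As an aside, it’s worth noting the Clustering Factor of this index as it becomes relevant to the CBO’s costing decisions later on. A quick check (output not shown; given the manner in which the data was generated, with consecutive dates scattered throughout the table, the Clustering Factor is expected to be poor, i.e. close to the number of rows in the table):

SQL> select index_name, leaf_blocks, clustering_factor from dba_indexes where table_name='ZIGGY';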

If we look at a sample of the data in the column and at the min/max date ranges:

SQL> select * from ziggy where rownum <11;
        
        ID       CODE ZIGGY_DATE
---------- ---------- ----------
       776        776   20140412
       777        777   20140411
       778        778   20140410
       779        779   20140409
       780        780   20140408
       781        781   20140407
       782        782   20140406
       783        783   20140405
       784        784   20140404
       785        785   20140403

SQL> select min(ziggy_date) min, max(ziggy_date) max from ziggy;

       MIN        MAX
---------- ----------
  19890110   20160527

We see that all the data in the ZIGGY_DATE column are just number representations of dates, with a range between 10 Jan 1989 and 27 May 2016.

Note there are actually 10,000 days between the dates but the CBO would estimate a range of  270,417 possible days (20160527 – 19890110 = 270,417). The CBO has no idea that the “numbers” within the column are all dates and that there are ranges of values in which data is relatively popular (e.g. between say 20160101 and 20160131) and ranges of values in which data is relatively unpopular (e.g. say between 20154242 and 20159999).
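
To illustrate the point (a couple of quick counts that aren’t part of the original demo), the first range below spans a genuine month of dates and so should return roughly 3,100 rows (31 days x 100 rows per day), while the second range can’t possibly contain any valid dates and so should return no rows at all:

SQL> select count(*) from ziggy where ziggy_date between 20160101 and 20160131;

SQL> select count(*) from ziggy where ziggy_date between 20154242 and 20159999;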

Although not as bad as the range of possible unpopular values we saw previously when storing date data within a character data type, there is still enough data skew when storing dates as numbers to be problematic for the CBO.

If we select just one date with an equality predicate:

SQL> select * from ziggy where ziggy_date = 20150613;
                 
100 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 2700236208

----------------------------------------------------------------------------------------------------
| Id | Operation                           | Name         | Rows | Bytes | Cost (%CPU) | Time     |
----------------------------------------------------------------------------------------------------
| 0  | SELECT STATEMENT                    |              |  100 |  1500 |     103 (0) | 00:00:01 |
| 1  | TABLE ACCESS BY INDEX ROWID BATCHED | ZIGGY        |  100 |  1500 |     103 (0) | 00:00:01 |
|* 2 | INDEX RANGE SCAN                    | ZIGGY_DATE_I |  100 |       |       3 (0) | 00:00:01 |
----------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - access("ZIGGY_DATE"=20150613)

Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
110 consistent gets
0 physical reads
0 redo size
3883 bytes sent via SQL*Net to client
618 bytes received via SQL*Net from client
8 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
100 rows processed

The CBO gets things spot on, correctly estimating 100 rows to be returned, as the CBO knows there are only 10,000 distinct values of which just one is being selected.

Selectivity is basically the density of the column = 1/10000 = 0.0001, so the estimated cardinality is 0.0001 x 1M rows = 100 rows. Perfect.
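
If you ever want to sanity check such an estimate, the same figure can be pulled straight from the dictionary statistics (a quick sketch, hard-coding the known 1M row count):

SQL> select density * 1000000 est_rows from dba_tab_cols where table_name='ZIGGY' and column_name='ZIGGY_DATE';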

However, if we perform a range based query as follows:

SQL> select * from ziggy where ziggy_date between 20151010 and 20151111;
     
3300 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 2700236208

----------------------------------------------------------------------------------------------------
| Id | Operation                           | Name         | Rows | Bytes | Cost (%CPU) | Time     |
----------------------------------------------------------------------------------------------------
| 0  | SELECT STATEMENT                    |              |  573 |  8595 |     580 (1) | 00:00:01 |
| 1  | TABLE ACCESS BY INDEX ROWID BATCHED | ZIGGY        |  573 |  8595 |     580 (1) | 00:00:01 |
|* 2 | INDEX RANGE SCAN                    | ZIGGY_DATE_I |  573 |       |       4 (0) | 00:00:01 |
----------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - access("ZIGGY_DATE">=20151010 AND "ZIGGY_DATE"<=20151111)

Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
3531 consistent gets
0 physical reads
0 redo size
108973 bytes sent via SQL*Net to client
2961 bytes received via SQL*Net from client
221 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
3300 rows processed

The CBO has got things somewhat incorrect in this example and has underestimated the expected number of rows (573 rows vs. the 3,300 rows actually returned).

The actual number of days between these dates is 33, so the actual ratio of data returned is 33/10000 x 1M rows = 3,300 rows. This is a range of “numbers” that overall covers a relatively “popular” range of date values.

However, Oracle is estimating a range of some 20151111 – 20151010 = 101 days between these dates. As the total range of possible days is 20160527 – 19890110 = 270,417, the estimated ratio of returned rows is 101/270417, plus 2 x the selectivity of a single day for the 2 implicit equality conditions (as a between is effectively >= and <=). The selectivity of one day is just the density of the column, 0.0001, as illustrated in the previous query.

Therefore, the query selectivity is derived as being (101/270417) + (2 x 0.0001) = 0.000573 when multiplied by 1M rows = 573 rows as estimated by the CBO.
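
If you want to check the arithmetic, the same derivation can be reproduced with a quick query using the figures above:

SQL> select round(((20151111-20151010)/(20160527-19890110) + 2*0.0001) * 1000000) est_rows from dual;

which indeed comes back as the 573 rows estimated in the plan.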

So the CBO is rather significantly *under* estimating the rows to be returned which could result in a sub-optimal execution plan (such as the inappropriate use of an index range scan as in this example, noting the poor clustering of the data).

If we now look at another range scan below:

SQL> select * from ziggy where ziggy_date between 20151225 and 20160101;
    
800 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 2421001569

---------------------------------------------------------------------------
| Id | Operation         | Name  | Rows  | Bytes | Cost (%CPU) | Time     |
---------------------------------------------------------------------------
| 0  | SELECT STATEMENT  |       | 33023 |  483K |    810 (15) | 00:00:01 |
|* 1 | TABLE ACCESS FULL | ZIGGY | 33023 |  483K |    810 (15) | 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter("ZIGGY_DATE">=20151225 AND "ZIGGY_DATE"<=20160101)

Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
2824 consistent gets
0 physical reads
0 redo size
23850 bytes sent via SQL*Net to client
1135 bytes received via SQL*Net from client
55 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
800 rows processed

The actual number of days between these dates is only 8 so the actual ratio of data returned is 8/10000 x 1M rows = 800 rows. This is a range of “numbers” that overall covers a relatively “unpopular” range of date values.

However Oracle is estimating a range of some 20160101 – 20151225 = 8876 days between these dates. As the total range of possible days is 20160527-19890110 = 270,417, the estimated ratio of returned rows is 8876/270417 plus 2 x the selectivity of a single day again for the 2 implicit equality conditions.

Therefore, the query selectivity is derived as being (8876/270417) + (2 x 0.0001) = 0.033023 when multiplied by 1M rows = 33,023 rows as estimated by the CBO.

So the CBO is rather significantly *over* estimating the rows to be returned, which could again result in a sub-optimal execution plan (such as the inappropriate use of a Full Table Scan in this example). The CBO is simply not picking up the fact that most of the possible values between the “number” ranges aren’t valid dates and can’t possibly exist.

Of course, having dates stored as simple numbers means Oracle has no way of ensuring data integrity and can allow “invalid” dates to be inserted:

SQL> insert into ziggy values (1000001, 42, 20160599);
            
1 row created.

SQL> rollback;

Rollback complete.

As with dates stored as strings, we can again address these issues by either collecting histograms for such columns and/or by creating a function-based date index on the column:

SQL> create index ziggy_date_fn_i on ziggy(to_date(ziggy_date,'YYYYMMDD'));

Index created.

SQL> exec dbms_stats.gather_table_stats(ownname=>null, tabname=>'ZIGGY');

PL/SQL procedure successfully completed.

SQL> select column_name, num_distinct, density, histogram, hidden_column, virtual_column from dba_tab_cols where table_name='ZIGGY';

COLUMN_NAME  NUM_DISTINCT    DENSITY HISTOGRAM       HID VIR
------------ ------------ ---------- --------------- --- ---
SYS_NC00004$        10000      .0001 NONE            YES YES
ZIGGY_DATE          10000      .0001 HYBRID          NO  NO
CODE                 1000       .001 NONE            NO  NO
ID                1000000    .000001 NONE            NO  NO
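
Note the HYBRID histogram on the ZIGGY_DATE column has appeared courtesy of the earlier workload against the column combined with the default SIZE AUTO behaviour. If you didn’t want to rely on that, a histogram could also be requested explicitly, something along these lines (a sketch only, not how the statistics above were actually gathered):

SQL> exec dbms_stats.gather_table_stats(ownname=>null, tabname=>'ZIGGY', method_opt=>'FOR COLUMNS ZIGGY_DATE SIZE 254');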

The associated query with the equality predicate has accurate estimates as it did previously:

SQL> select * from ziggy where to_date(ziggy_date, 'YYYYMMDD') = '13-JUN-2015';

100 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 945728471

-------------------------------------------------------------------------------------------------------
| Id | Operation                           | Name            | Rows | Bytes  | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------------------
| 0  | SELECT STATEMENT                    |                 |  100 |   2300 |     103 (0)| 00:00:01 |
| 1  | TABLE ACCESS BY INDEX ROWID BATCHED | ZIGGY           |  100 |   2300 |     103 (0)| 00:00:01 |
|* 2 | INDEX RANGE SCAN                    | ZIGGY_DATE_FN_I |  100 |        |       3 (0)| 00:00:01 |
-------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - access(TO_DATE(TO_CHAR("ZIGGY_DATE"),'YYYYMMDD')=TO_DATE(' 2015-06-13 00:00:00',
'syyyy-mm-dd hh24:mi:ss'))

Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
111 consistent gets
0 physical reads
0 redo size
2877 bytes sent via SQL*Net to client
618 bytes received via SQL*Net from client
8 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
100 rows processed

As the virtual column created for the function-based index also has 10,000 distinct values and a corresponding density of 0.0001, the CBO is getting the cardinality estimate of 100 rows spot on.

But importantly, both associated range based queries are now also being accurately costed by the CBO as it now knows the data being searched is date based and hence can more accurately determine the actual expected dates to be returned within the specified “date” ranges.

SQL> select * from ziggy where to_date(ziggy_date, 'YYYYMMDD') between '10-OCT-2015' and '11-NOV-2015';

3300 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 2421001569

---------------------------------------------------------------------------
| Id | Operation         | Name  | Rows | Bytes | Cost (%CPU) | Time     |
---------------------------------------------------------------------------
| 0  | SELECT STATEMENT  |       | 3400 | 78200 |   1061 (35) | 00:00:01 |
|* 1 | TABLE ACCESS FULL | ZIGGY | 3400 | 78200 |   1061 (35) | 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter(TO_DATE(TO_CHAR("ZIGGY_DATE"),'YYYYMMDD')>=TO_DATE('
2015-10-10 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
TO_DATE(TO_CHAR("ZIGGY_DATE"),'YYYYMMDD')<=TO_DATE(' 2015-11-11
00:00:00', 'syyyy-mm-dd hh24:mi:ss'))

Statistics
----------------------------------------------------------
8 recursive calls
0 db block gets
2991 consistent gets
0 physical reads
0 redo size
95829 bytes sent via SQL*Net to client
2961 bytes received via SQL*Net from client
221 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
3300 rows processed

The CBO is now estimating not 573 rows, but 3,400 rows which is much closer to the actual 3,300 rows being returned. As a result, the CBO is now performing a more efficient Full Table Scan (due to the poor Clustering Factor of the index) than the Index Range Scan performed previously.

If we look at the other range scan query:

SQL> select * from ziggy where to_date(ziggy_date, 'YYYYMMDD') between '25-DEC-2015' and '01-JAN-2016';

800 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 945728471

-------------------------------------------------------------------------------------------------------
| Id | Operation                           | Name            | Rows | Bytes | Cost (%CPU) | Time     |
-------------------------------------------------------------------------------------------------------
| 0  | SELECT STATEMENT                    |                 |  900 | 20700 |      909 (1)| 00:00:01 |
| 1  | TABLE ACCESS BY INDEX ROWID BATCHED | ZIGGY           |  900 | 20700 |      909 (1)| 00:00:01 |
|* 2 | INDEX RANGE SCAN                    | ZIGGY_DATE_FN_I |  900 |       |        5 (0)| 00:00:01 |
-------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - access(TO_DATE(TO_CHAR("ZIGGY_DATE"),'YYYYMMDD')>=TO_DATE(' 2015-12-25 00:00:00',
'syyyy-mm-dd hh24:mi:ss') AND TO_DATE(TO_CHAR("ZIGGY_DATE"),'YYYYMMDD')<=TO_DATE(' 2016-01-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))

Statistics
----------------------------------------------------------
8 recursive calls
0 db block gets
861 consistent gets
7 physical reads
0 redo size
18917 bytes sent via SQL*Net to client
1135 bytes received via SQL*Net from client
55 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
800 rows processed

The CBO is now estimating not 33,023 rows, but 900 rows which is again much closer to the actual 800 rows being returned. As a result, the CBO is now performing a more efficient Index Range Scan than the Full Table Scan it was performing previously.

And of course, via the function-based date index, the database now has a manner in which to protect the integrity of the date data:

SQL> insert into ziggy values (1000001, 42, 20160599);
insert into ziggy values (1000001, 42, 20160599)
*
ERROR at line 1:
ORA-01847: day of month must be between 1 and last day of month

However, the best way in Oracle to store “Date” data is within a Date data type column …
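
For completeness, a rough sketch of how such a column might eventually be migrated to a proper DATE column (the ZIGGY_DATE_NEW name is purely for illustration and not part of the demo above):

SQL> alter table ziggy add (ziggy_date_new date);

SQL> update ziggy set ziggy_date_new = to_date(to_char(ziggy_date), 'YYYYMMDD');

SQL> commit;

SQL> alter table ziggy drop column ziggy_date;

SQL> alter table ziggy rename column ziggy_date_new to ziggy_date;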

METHOD_OPT=> SIZE AUTO Quiz Solution (The Trickster) September 1, 2011

Posted by Richard Foote in CBO, Histograms, Oracle Indexes, Oracle Statistics.

I was going to leave it for a few days but there have already been so many comments and discussions on all this, I thought I’d better write something up. In case anyone was wondering, yes I probably am driving my colleagues at work mad with my “Question of the Day” !!

Unfortunately, some might be disappointed at both Oracle and myself:)

Yes, I did kinda set things up to trick the unwary and yes, perhaps the answer isn’t what many are expecting.

The answer to my previous question of which column is going to have a histogram when using the METHOD_OPT  SIZE AUTO option is in fact Column 2. Well done to everyone who got it right.

Why ?

The simplest answer is because it’s the only column of the three that has 254 or fewer distinct values.

Here’s the key point. When using METHOD_OPT SIZE AUTO, every column with 254 or fewer distinct values that has been referenced within a predicate will have a Frequency-based histogram. Each and every one of them, regardless of whether the data is actually skewed or not. So Column 2, with only 254 distinct values AND having previously been referenced in a predicate, was guaranteed to have a histogram.

If a column has more than 254 distinct values, whether it then has a Height-Based histogram depends on how the data is skewed. If the data is perfectly evenly distributed, then it won’t have a histogram. Column 1, having sequence-based unique values, will not meet the criteria and so will not have a histogram.

Column 3 is interesting. Having inserted the outlier value, it now has 255 distinct values and so no longer qualifies for an automatic Frequency-based histogram. However, if all its values are evenly distributed, it won’t qualify for a Height-based histogram either, and Column 3 has just the one outlier value while all its other values are evenly distributed. Unfortunately, Oracle doesn’t pick up on rare outlier values (even if you collect 100% statistics and the outlier is one of the low/high points of the column) and so will not generate a Height-based histogram.

The only column that qualifies is Column 2.

A demo to illustrate. First, let’s create and populate our table:

SQL> create table bowie (id number, code1 number, code2 number);

Table created.

SQL> insert into bowie select rownum, mod(rownum,254), mod(rownum,254) from dual  connect by level <= 1000000;

1000000 rows created.

SQL> commit;

Commit complete.

Notice I’m using the MOD function to generate a perfectly even distribution of data. I’ve noticed a few examples (such as that by Charles Hooper in the comments of the Quiz posting) in which the DBMS_RANDOM function is used instead. Note this will almost certainly generate data with enough natural skewness on a 1M row table with 254 random values that, when the outlier 255th value is introduced, it will qualify for a height-based histogram. It’s a very easy thing to test and find out: simply generate the 1M rows with 255 random values and I suggest a height-based histogram will be created regardless (a sketch of such a test follows below).
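
For anyone wanting to try it, a sketch of such a test might look like the following (a hypothetical BOWIE2 table with 255 random values via DBMS_RANDOM; results deliberately not shown here):

SQL> create table bowie2 (id number, code1 number, code2 number);

SQL> insert into bowie2 select rownum, mod(rownum,254), trunc(dbms_random.value(0,255)) from dual connect by level <= 1000000;

SQL> commit;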

OK, I’ll run some SQL to generate sufficient workload to qualify the columns for automatic histograms:

SQL> select * from bowie where id = 42;
SQL> select * from bowie where code1 = 42;
SQL> select * from bowie where code2 = 42;

BTW, the difference between the SIZE AUTO and SIZE SKEWONLY options is that AUTO requires previous workload to suggest a histogram might be relevant, whereas SKEWONLY does not.
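
This previous workload is what SIZE AUTO keys on. If you want to see the column usage information Oracle has recorded, something along these lines should show it in recent releases (a sketch; DBMS_STATS.REPORT_COL_USAGE returns a CLOB report):

SQL> exec dbms_stats.flush_database_monitoring_info;

SQL> select dbms_stats.report_col_usage(user, 'BOWIE') from dual;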

If we were to collect statistics at this stage, we would notice that the second and third columns both have a Frequency-Based histogram as both columns only have 254 distinct values and so automatically qualify:

SQL> exec dbms_stats.gather_table_stats(ownname=>'BOWIE', tabname=>'BOWIE', estimate_percent=> null, cascade=>true);

PL/SQL procedure successfully completed.

SQL> select column_name, histogram from dba_tab_columns where table_name = 'BOWIE';

COLUMN_NAME                    HISTOGRAM
------------------------------ ---------------
ID                             NONE
CODE1                          FREQUENCY
CODE2                          FREQUENCY

If we were to run a query using the third column, notice how the cardinality estimates aren’t too bad in this example:

SQL> select * from bowie where code2 > 600;

no rows selected

Execution Plan
----------------------------------------------------------
Plan hash value: 1845943507

---------------------------------------------------------------------------
| Id  | Operation         | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |       |     1 |    13 |   660   (2)| 00:00:08 |
|*  1 |  TABLE ACCESS FULL| BOWIE |     1 |    13 |   660   (2)| 00:00:08 |
---------------------------------------------------------------------------

There are no rows that are greater than 600 and so an estimate of 1 isn’t too bad at all.

OK, let’s add in this one, tiny little row and collect fresh, 100% accurate statistics (note: the accurate statistics are very important, as Niall’s examples have demonstrated):


SQL> insert into bowie values (1000001, 42, 99999999);

1 row created.

SQL> commit;

Commit complete.

SQL> exec dbms_stats.gather_table_stats(ownname=>'BOWIE', tabname=>'BOWIE', estimate_percent=> null, cascade=>true);

PL/SQL procedure successfully completed.

SQL> select column_name, histogram from dba_tab_columns where table_name = 'BOWIE';

COLUMN_NAME                    HISTOGRAM
------------------------------ ---------------
ID                             NONE
CODE1                          FREQUENCY
CODE2                          NONE

Note that the third column now has 255 distinct values and so no longer qualifies for the automatic Frequency-Based histogram. As most of its data is perfectly evenly distributed with just the one outlier value, the column doesn’t qualify for a Height-based histogram either and so now has no histogram at all.

Note that as I collected 100% accurate statistics, Oracle is definitely aware of this outlier value:

SQL> select column_name, low_value, high_value from dba_tab_columns where table_name='BOWIE' and column_name='CODE2';

COLUMN_NAME  LOW_VALUE  HIGH_VALUE
------------ ---------- ------------
CODE2        80         C464646464

SQL> var high_num number
SQL> exec dbms_stats.convert_raw_value('C464646464',:high_num);

PL/SQL procedure successfully completed.

SQL> print high_num

  HIGH_NUM
----------
  99999999

But it’s not enough for Oracle to automatically generate a histogram, which is a shame really, because now we can have all sorts of problems:

SQL> select * from bowie where code2 > 600;
Execution Plan
----------------------------------------------------------
Plan hash value: 1845943507

---------------------------------------------------------------------------
| Id  | Operation         | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |       |   999K|    12M|   660   (2)| 00:00:08 |
|*  1 |  TABLE ACCESS FULL| BOWIE |   999K|    12M|   660   (2)| 00:00:08 |
---------------------------------------------------------------------------

Whereas previously the cardinality estimates were spot on, now they’re terrible (expecting not 1 row but 999,000 rows !!) because, without a histogram, Oracle assumes an even distribution of data between the column’s low and high values.
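
One possible mitigation (not tested here) would be to explicitly request a histogram on the column rather than relying on SIZE AUTO, along the lines of:

SQL> exec dbms_stats.gather_table_stats(ownname=>'BOWIE', tabname=>'BOWIE', estimate_percent=>null, method_opt=>'FOR COLUMNS CODE2 SIZE 254');

although with 255 distinct values the best Oracle can then do is a height-based histogram, so whether that adequately captures such a rare outlier is something you’d want to verify for yourself.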

I’m not a great fan of either the SIZE AUTO or SIZE SKEWONLY options😉

Hope you’re enjoying these little quizzes, I’ll have another one for you all soon.
