
Storing Date Values As Numbers (The Numbers) June 1, 2016

Posted by Richard Foote in 12c, CBO, Histograms, Oracle Indexes, Storing Dates As Numbers.

In my last couple of posts, I’ve been discussing how storing date data in a character-based column is a really, really bad idea.

In a follow-up question, I was asked if storing dates in NUMBER format was a better option. The answer is that it’s probably an improvement over storing dates as strings, but it’s still a really, really bad idea. Storing dates in a DATE data type is easily the best option, as is storing any data in its native data type.

In this post, I’ll highlight a few of the classic issues with storing dates in basic number format as well as showing you some of the calculations on the CBO cardinality estimates.

As usual, the demo starts with a basic little table that I’ll populate with date data stored in a NUMBER column (ZIGGY_DATE):

SQL> create table ziggy (id number, code number, ziggy_date number);
    
Table created.

SQL> insert into ziggy select rownum, mod(rownum,1000), 
to_number(to_char(sysdate-mod(rownum,10000), 'YYYYMMDD')) 
from dual connect by level <=1000000;

1000000 rows created.

SQL> commit;

Commit complete.

We’ll now collect statistics on the table:

SQL> exec dbms_stats.gather_table_stats(ownname=>null, tabname=>'ZIGGY');

PL/SQL procedure successfully completed.

SQL> select column_name, num_distinct, density, histogram, hidden_column, virtual_column from dba_tab_cols where table_name='ZIGGY';

COLUMN_NAME NUM_DISTINCT    DENSITY HISTOGRAM       HID VIR
----------- ------------ ---------- --------------- --- ---
ZIGGY_DATE         10000      .0001 NONE            NO  NO
CODE                1000       .001 NONE            NO  NO
ID               1000000    .000001 NONE            NO  NO

So the ZIGGY_DATE column has 10,000 distinct dates (with 100 rows per distinct date), with a column density of 1/10000 = 0.0001.

Let’s now create a standard B-Tree index on the ZIGGY_DATE column:

SQL> create index ziggy_date_i on ziggy(ziggy_date);
                  
Index created.

If we look at a sample of the data in the column and the min/max date range:

SQL> select * from ziggy where rownum <11;
        
        ID       CODE ZIGGY_DATE
---------- ---------- ----------
       776        776   20140412
       777        777   20140411
       778        778   20140410
       779        779   20140409
       780        780   20140408
       781        781   20140407
       782        782   20140406
       783        783   20140405
       784        784   20140404
       785        785   20140403

SQL> select min(ziggy_date) min, max(ziggy_date) max from ziggy;

       MIN        MAX
---------- ----------
  19890110   20160527

We see that all the data in the ZIGGY_DATE column is just number representations of dates, with a range between 10 Jan 1989 and 27 May 2016.

Note there are actually only 10,000 distinct dates between the minimum and maximum values, but the CBO sees a possible range of 270,417 values (20160527 – 19890110 = 270,417). The CBO has no idea that the “numbers” within the column are all dates, and that there are ranges of values in which data is relatively popular (e.g. between say 20160101 and 20160131) and ranges of values in which data is relatively unpopular (e.g. between say 20154242 and 20159999).

Although the range of possible unpopular values isn’t as extreme as with a character data type (as I discussed previously when storing date data as strings), there is still enough data skew when storing dates as numbers to be problematic for the CBO.

If we select just one date with an equality predicate:

SQL> select * from ziggy where ziggy_date = 20150613;
                 
100 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 2700236208

----------------------------------------------------------------------------------------------------
| Id | Operation                           | Name         | Rows | Bytes | Cost (%CPU) | Time     |
----------------------------------------------------------------------------------------------------
| 0  | SELECT STATEMENT                    |              |  100 |  1500 |     103 (0) | 00:00:01 |
| 1  | TABLE ACCESS BY INDEX ROWID BATCHED | ZIGGY        |  100 |  1500 |     103 (0) | 00:00:01 |
|* 2 | INDEX RANGE SCAN                    | ZIGGY_DATE_I |  100 |       |       3 (0) | 00:00:01 |
----------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - access("ZIGGY_DATE"=20150613)

Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
110 consistent gets
0 physical reads
0 redo size
3883 bytes sent via SQL*Net to client
618 bytes received via SQL*Net from client
8 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
100 rows processed

The CBO gets things spot on, correctly estimating 100 rows to be returned, as it knows there are only 10,000 distinct values, of which just one is being selected.

Selectivity is basically the density of the column = 1/10000 = 0.0001, so the estimated cardinality is 0.0001 x 1M rows = 100 rows. Perfect.
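If you want to double-check the arithmetic, it’s easy to reproduce with a trivial query against DUAL (purely illustrative arithmetic, not anything the CBO itself exposes):

-- equality predicate: estimated cardinality = column density x table rows
select round((1/10000) * 1000000) as estimated_rows from dual;

which returns the 100 rows estimated above.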

However, if we perform a range based query as follows:

SQL> select * from ziggy where ziggy_date between 20151010 and 20151111;
     
3300 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 2700236208

----------------------------------------------------------------------------------------------------
| Id | Operation                           | Name         | Rows | Bytes | Cost (%CPU) | Time     |
----------------------------------------------------------------------------------------------------
| 0  | SELECT STATEMENT                    |              |  573 |  8595 |     580 (1) | 00:00:01 |
| 1  | TABLE ACCESS BY INDEX ROWID BATCHED | ZIGGY        |  573 |  8595 |     580 (1) | 00:00:01 |
|* 2 | INDEX RANGE SCAN                    | ZIGGY_DATE_I |  573 |       |       4 (0) | 00:00:01 |
----------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - access("ZIGGY_DATE">=20151010 AND "ZIGGY_DATE"<=20151111)

Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
3531 consistent gets
0 physical reads
0 redo size
108973 bytes sent via SQL*Net to client
2961 bytes received via SQL*Net from client
221 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
3300 rows processed

The CBO has got things somewhat wrong in this example, underestimating the expected number of rows (573 rows vs. the 3,300 rows actually returned).

The range actually only covers 33 distinct dates, so the actual ratio of data returned is 33/10000 x 1M rows = 3,300 rows. This is a range of “numbers” that overall covers a relatively “popular” range of date values.

However, Oracle is estimating a range of some 20151111 – 20151010 = 101 possible values between these dates. As the total range of possible values is 20160527 – 19890110 = 270,417, the estimated ratio of returned rows is 101/270417, plus 2 x the selectivity of a single day for the 2 implicit equality conditions (as a BETWEEN is effectively >= and <=). The selectivity of one day is just the density of the column, 0.0001, as illustrated in the previous query.

Therefore, the query selectivity is derived as being (101/270417) + (2 x 0.0001) = 0.000573 when multiplied by 1M rows = 573 rows as estimated by the CBO.
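This arithmetic can again be reproduced with a simple illustrative query against DUAL:

-- range selectivity = (high - low) / (max - min) + 2 x density
select round(((20151111 - 20151010) / (20160527 - 19890110) + 2 * 0.0001) * 1000000) as estimated_rows
from dual;

which returns the 573 rows estimated by the CBO.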

So the CBO is rather significantly *under* estimating the rows to be returned which could result in a sub-optimal execution plan (such as the inappropriate use of an index range scan as in this example, noting the poor clustering of the data).

If we now look at another range scan below:

SQL> select * from ziggy where ziggy_date between 20151225 and 20160101;
    
800 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 2421001569

---------------------------------------------------------------------------
| Id | Operation         | Name  | Rows  | Bytes | Cost (%CPU) | Time     |
---------------------------------------------------------------------------
| 0  | SELECT STATEMENT  |       | 33023 |  483K |    810 (15) | 00:00:01 |
|* 1 | TABLE ACCESS FULL | ZIGGY | 33023 |  483K |    810 (15) | 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter("ZIGGY_DATE">=20151225 AND "ZIGGY_DATE"<=20160101)

Statistics
----------------------------------------------------------
1 recursive calls
0 db block gets
2824 consistent gets
0 physical reads
0 redo size
23850 bytes sent via SQL*Net to client
1135 bytes received via SQL*Net from client
55 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
800 rows processed

The range actually only covers 8 distinct dates, so the actual ratio of data returned is 8/10000 x 1M rows = 800 rows. This is a range of “numbers” that overall covers a relatively “unpopular” range of date values.

However, Oracle is estimating a range of some 20160101 – 20151225 = 8876 possible values between these dates. As the total range of possible values is 20160527 – 19890110 = 270,417, the estimated ratio of returned rows is 8876/270417, plus 2 x the selectivity of a single day, again for the 2 implicit equality conditions.

Therefore, the query selectivity is derived as being (8876/270417) + (2 x 0.0001) = 0.033023 when multiplied by 1M rows = 33,023 rows as estimated by the CBO.
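Once again, the arithmetic is easily reproduced against DUAL:

select round(((20160101 - 20151225) / (20160527 - 19890110) + 2 * 0.0001) * 1000000) as estimated_rows
from dual;

which returns the 33,023 rows estimated by the CBO.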

So the CBO is rather significantly *over* estimating the rows to be returned, which could again result in a sub-optimal execution plan (such as the inappropriate use of a Full Table Scan in this example). The CBO simply isn’t picking up the fact that most of the possible values within these “number” ranges aren’t valid dates and so can’t possibly exist.

Of course, having dates stored as simple numbers means Oracle has no way of ensuring data integrity and can allow “invalid” dates to be inserted:

SQL> insert into ziggy values (1000001, 42, 20160599);
            
1 row created.

SQL> rollback;

Rollback complete.

As with dates stored as strings, we can again address these issues by collecting histograms on such columns and/or by creating a function-based date index on the column:

SQL> create index ziggy_date_fn_i on ziggy(to_date(ziggy_date,'YYYYMMDD'));

Index created.

SQL> exec dbms_stats.gather_table_stats(ownname=>null, tabname=>'ZIGGY');

PL/SQL procedure successfully completed.

SQL> select column_name, num_distinct, density, histogram, hidden_column, virtual_column from dba_tab_cols where table_name='ZIGGY';

COLUMN_NAME  NUM_DISTINCT    DENSITY HISTOGRAM       HID VIR
------------ ------------ ---------- --------------- --- ---
SYS_NC00004$        10000      .0001 NONE            YES YES
ZIGGY_DATE          10000      .0001 HYBRID          NO  NO
CODE                 1000       .001 NONE            NO  NO
ID                1000000    .000001 NONE            NO  NO

The associated query with the equality predicate has accurate estimates as it did previously:

SQL> select * from ziggy where to_date(ziggy_date, 'YYYYMMDD') = '13-JUN-2015';

100 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 945728471

-------------------------------------------------------------------------------------------------------
| Id | Operation                           | Name            | Rows | Bytes  | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------------------
| 0  | SELECT STATEMENT                    |                 |  100 |   2300 |     103 (0)| 00:00:01 |
| 1  | TABLE ACCESS BY INDEX ROWID BATCHED | ZIGGY           |  100 |   2300 |     103 (0)| 00:00:01 |
|* 2 | INDEX RANGE SCAN                    | ZIGGY_DATE_FN_I |  100 |        |       3 (0)| 00:00:01 |
-------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - access(TO_DATE(TO_CHAR("ZIGGY_DATE"),'YYYYMMDD')=TO_DATE(' 2015-06-13 00:00:00',
'syyyy-mm-dd hh24:mi:ss'))

Statistics
----------------------------------------------------------
0 recursive calls
0 db block gets
111 consistent gets
0 physical reads
0 redo size
2877 bytes sent via SQL*Net to client
618 bytes received via SQL*Net from client
8 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
100 rows processed

As the virtual column created for the function-based index also has 10,000 distinct values and a corresponding density of 0.0001, the CBO is getting the cardinality estimate of 100 rows spot on.

But importantly, both associated range-based queries are now also accurately costed by the CBO, as it now knows the data being searched is date based and so can far more accurately determine the dates actually expected to be returned within the specified “date” ranges.

SQL> select * from ziggy where to_date(ziggy_date, 'YYYYMMDD') between '10-OCT-2015' and '11-NOV-2015';

3300 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 2421001569

---------------------------------------------------------------------------
| Id | Operation         | Name  | Rows | Bytes | Cost (%CPU) | Time     |
---------------------------------------------------------------------------
| 0  | SELECT STATEMENT  |       | 3400 | 78200 |   1061 (35) | 00:00:01 |
|* 1 | TABLE ACCESS FULL | ZIGGY | 3400 | 78200 |   1061 (35) | 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

1 - filter(TO_DATE(TO_CHAR("ZIGGY_DATE"),'YYYYMMDD')>=TO_DATE('
2015-10-10 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
TO_DATE(TO_CHAR("ZIGGY_DATE"),'YYYYMMDD')<=TO_DATE(' 2015-11-11
00:00:00', 'syyyy-mm-dd hh24:mi:ss'))

Statistics
----------------------------------------------------------
8 recursive calls
0 db block gets
2991 consistent gets
0 physical reads
0 redo size
95829 bytes sent via SQL*Net to client
2961 bytes received via SQL*Net from client
221 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
3300 rows processed

The CBO is now estimating not 573 rows, but 3,400 rows which is much closer to the actual 3,300 rows being returned. As a result, the CBO is now performing a more efficient Full Table Scan (due to the poor Clustering Factor of the index) than the Index Range Scan performed previously.
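Assuming the CBO is simply applying the same standard range arithmetic, just now over the virtual DATE column created for the function-based index, the 3,400 figure falls out neatly: the queried range spans 32 days out of a total min/max range of 9,999 days (10-JAN-1989 to 27-MAY-2016), plus the 2 implicit equality conditions:

select round(((date '2015-11-11' - date '2015-10-10') / (date '2016-05-27' - date '1989-01-10')
              + 2 * 0.0001) * 1000000) as estimated_rows
from dual;

which gives (32/9999 + 0.0002) x 1M = 3,400 rows.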

If we look at the other range scan query:

SQL> select * from ziggy where to_date(ziggy_date, 'YYYYMMDD') between '25-DEC-2015' and '01-JAN-2016';

800 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 945728471

-------------------------------------------------------------------------------------------------------
| Id | Operation                           | Name            | Rows | Bytes | Cost (%CPU) | Time     |
-------------------------------------------------------------------------------------------------------
| 0  | SELECT STATEMENT                    |                 |  900 | 20700 |      909 (1)| 00:00:01 |
| 1  | TABLE ACCESS BY INDEX ROWID BATCHED | ZIGGY           |  900 | 20700 |      909 (1)| 00:00:01 |
|* 2 | INDEX RANGE SCAN                    | ZIGGY_DATE_FN_I |  900 |       |        5 (0)| 00:00:01 |
-------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

2 - access(TO_DATE(TO_CHAR("ZIGGY_DATE"),'YYYYMMDD')>=TO_DATE(' 2015-12-25 00:00:00',
'syyyy-mm-dd hh24:mi:ss') AND TO_DATE(TO_CHAR("ZIGGY_DATE"),'YYYYMMDD')<=TO_DATE(' 2016-01-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))

Statistics
----------------------------------------------------------
8 recursive calls
0 db block gets
861 consistent gets
7 physical reads
0 redo size
18917 bytes sent via SQL*Net to client
1135 bytes received via SQL*Net from client
55 SQL*Net roundtrips to/from client
0 sorts (memory)
0 sorts (disk)
800 rows processed

The CBO is now estimating not 33,023 rows but 900 rows, which is again much closer to the actual 800 rows being returned. As a result, the CBO now performs a more efficient Index Range Scan rather than the Full Table Scan it performed previously.
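The same date-based arithmetic also accounts for the 900 row estimate here, with the queried range now spanning just 7 days out of the 9,999 day min/max range:

select round(((date '2016-01-01' - date '2015-12-25') / (date '2016-05-27' - date '1989-01-10')
              + 2 * 0.0001) * 1000000) as estimated_rows
from dual;

which gives (7/9999 + 0.0002) x 1M = 900 rows.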

And of course, via the function-based date index, the database now has a means by which to protect the integrity of the date data:

SQL> insert into ziggy values (1000001, 42, 20160599);
insert into ziggy values (1000001, 42, 20160599)
*
ERROR at line 1:
ORA-01847: day of month must be between 1 and last day of month

However, the best way in Oracle to store “Date” data is within a Date data type column …
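
For completeness, here’s a minimal sketch of the same demo setup done properly with a native DATE column (ZIGGY_OK is just an illustrative table name):

create table ziggy_ok (id number, code number, ziggy_date date);

insert into ziggy_ok
select rownum, mod(rownum,1000), trunc(sysdate) - mod(rownum,10000)
from dual connect by level <= 1000000;

commit;

create index ziggy_ok_date_i on ziggy_ok(ziggy_date);

exec dbms_stats.gather_table_stats(ownname=>null, tabname=>'ZIGGY_OK');

Invalid “dates” such as 20160599 now simply can’t be inserted, and the CBO understands the true date ranges from the word go, without needing histograms or function-based indexes just to get sensible range estimates (in this evenly distributed demo at least).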

Comments»

1. wolfgoe - June 7, 2016

Very interesting indeed, and especially with all the examples. Could there be any gain in storing the data in number format but, instead of the human-readable example you used (YYYYMMDD), using ‘seconds since 1970-01-01 00:00:00’ like in MySQL, for instance?


Richard Foote - June 20, 2016

Hi Wolfgoe

Short answer, no as this is basically what a DATE data type does automatically for you.


2. Kay - June 17, 2016

A valuable article, I hope someone at SAP reads this. Perhaps we’ll get real dates in the databases someday. But it seems they made the same mistake in their HANA.


Richard Foote - June 20, 2016

Hi Kay

Yes indeed, I’ve been asked to look at many SAP environments over the years. What SAP does in saving date data in the database is sad to say the least …


K von Murphy - September 1, 2016

I would say “criminal”.


3. K von Murphy - September 1, 2016

The real problem with non-native dates is data manipulation. If you compute 20160229 +1 and your algorithm is buggy, you may get a February 30th. I’ve seen it (not SAP).

Storing dates in anything other than dates is like storing numbers as strings (‘one’,’two’, ‘three’), a recipe for data corruption.


Richard Foote - September 2, 2016

I completely agree. However, it’s indeed criminal how often I see basics such as the wrong data type being implemented, which in turn leads to so many avoidable issues.


4. Daniel Stein - June 27, 2018

First of all, thank you for your many interesting blog posts. I’ve been playing with your example and I’ve come across something strange: It would be nice if you could take a look and write something about it.

P.S. I look forward to meeting you at Trivadis Performance Days 2018

Tested on 11GR2 and 12CR1

set autotrace on explain

Select * from ziggy where ziggy_date between 20151010 and 20151111;

/**
Same cardinality as in your example

----------------------------------------------------------------------------------------------------
| Id | Operation                           | Name         | Rows | Bytes | Cost (%CPU) | Time     |
----------------------------------------------------------------------------------------------------
| 0  | SELECT STATEMENT                    |              |  573 |  8595 |    655 (12) | 00:00:01 |
| 1  | TABLE ACCESS BY INDEX ROWID BATCHED | ZIGGY        |  573 |  8595 |    655 (12) | 00:00:01 |
|* 2 | INDEX RANGE SCAN                    | ZIGGY_DATE_I |  573 |       |      7 (43) | 00:00:01 |
----------------------------------------------------------------------------------------------------
**/

/**
But if I’m gonna to…
**/

alter table ziggy
add ziggy_asdate date;

update ziggy
set ziggy_asdate = to_date(ziggy_date, 'YYYYMMDD');

commit;

exec dbms_stats.gather_table_stats(ownname=>null, tabname=>'ZIGGY');

Select * from ziggy where ziggy_date between 20151010 and 20151111;

/**
If I reexecute the same statement from above, I now get a different cardinality…
—————————————————————————————————-
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
—————————————————————————————————-
| 0 | SELECT STATEMENT | | 3948 | 90804 | 4488 (12)| 00:00:01 |
| 1 | TABLE ACCESS BY INDEX ROWID BATCHED| ZIGGY | 3948 | 90804 | 4488 (12)| 00:00:01 |
|* 2 | INDEX RANGE SCAN | ZIGGY_DATE_I | 3948 | | 28 (58)| 00:00:01 |
—————————————————————————————————-

**/

exec dbms_stats.gather_table_stats(ownname=>null, tabname=>'ZIGGY');

Select * from ziggy where ziggy_date between 20151010 and 20151111;

/**
Even worse:
The cardinality changes every time I regather the table stats…

----------------------------------------------------------------------------------------------------
| Id | Operation                           | Name         | Rows | Bytes | Cost (%CPU) | Time     |
----------------------------------------------------------------------------------------------------
| 0  | SELECT STATEMENT                    |              | 2345 | 53935 |   2668 (12) | 00:00:01 |
| 1  | TABLE ACCESS BY INDEX ROWID BATCHED | ZIGGY        | 2345 | 53935 |   2668 (12) | 00:00:01 |
|* 2 | INDEX RANGE SCAN                    | ZIGGY_DATE_I | 2345 |       |     17 (53) | 00:00:01 |
----------------------------------------------------------------------------------------------------
**/


Richard Foote - August 1, 2018

Hi Daniel

Thanks for your example.

I suspect this is a case of mistaken identity, in that I’m not sure the action of adding a new column is the cause of the different cardinality estimates.

The issue here is good old histograms, which are automatically created and altered when you re-collect stats. Note that the first time the query is run, the table stats had been collected before the index was created and before the query had ever been run, so likely no histogram existed at that stage. When stats are collected the second time (regardless, I suspect, of the new column) after the index has been created and used, a histogram is now collected on the ziggy_date column. Subsequent stats collections then cause subtle changes to this histogram, which results in the changes to the cardinality estimates.

Just check the histogram stats for the ziggy_date column at the various steps and see if indeed new histogram stats are present each time.
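Something along these lines (a simple query against USER_TAB_COL_STATISTICS) would show whether a histogram has appeared and changed between collections:

select column_name, histogram, num_buckets, last_analyzed
from user_tab_col_statistics
where table_name = 'ZIGGY'
and column_name = 'ZIGGY_DATE';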

Looking forward indeed to the Trivadis Performance Days conference 🙂


Daniel Stein - August 6, 2018

You were right. There was a histogram on the ziggy_date column. It may have been created due to an entry for the ziggy_date column in sys.col_usage$ when the statistics were collected for the second time. When I deleted the histogram, I got the expected cardinality despite the new column ziggy_asdate.


