Storing Date Values As Numbers (The Numbers) June 1, 2016
Posted by Richard Foote in 12c, CBO, Histograms, Oracle Indexes, Storing Dates As Numbers.trackback
In my last couple of posts, I’ve been discussing how storing date data in a character based column is a really really bad idea.
In a follow-up question, I was asked if storing dates in NUMBER format was a better option. The answer is that it’s probably an improvement from storing dates as strings but it’s still a really really bad idea. Storing dates in DATE format is easily the best option as is storing any data in its native data type.
In this post, I’ll highlight a few of the classic issues with storing dates in basic number format as well as showing you some of the calculations on the CBO cardinality estimates.
As usual, the demo starts with a basic little table that I’ll populate with date data stored in a NUMBER column (ZIGGY_DATE):
SQL> create table ziggy (id number, code number, ziggy_date number); Table created. SQL> insert into ziggy select rownum, mod(rownum,1000), to_number(to_char(sysdate-mod(rownum,10000), 'YYYYMMDD')) from dual connect by level <=1000000; 1000000 rows created. SQL> commit; Commit complete.
We’ll now collect statistics on the table:
SQL> exec dbms_stats.gather_table_stats(ownname=>null, tabname=>'ZIGGY'); PL/SQL procedure successfully completed. SQL> select column_name, num_distinct, density, histogram, hidden_column, virtual_column from dba_tab_cols where table_name='ZIGGY'; COLUMN_NAME NUM_DISTINCT DENSITY HISTOGRAM HID VIR ----------- ------------ ---------- --------------- --- --- ZIGGY_DATE 10000 .0001 NONE NO NO CODE 1000 .001 NONE NO NO ID 1000000 .000001 NONE NO NO
So the ZIGGY_DATE column has 10,000 distinct dates (with 100 rows per distinct date), with a column density of 1/10000 = 0.0001.
Let’s now create a standard B-Tree index on the ZIGGY_DATE column:
SQL> create index ziggy_date_i on ziggy(ziggy_date); Index created.
If we look a sample of the data in the column and the min/max date ranges:
SQL> select * from ziggy where rownum <11; ID CODE ZIGGY_DATE ---------- ---------- ---------- 776 776 20140412 777 777 20140411 778 778 20140410 779 779 20140409 780 780 20140408 781 781 20140407 782 782 20140406 783 783 20140405 784 784 20140404 785 785 20140403 SQL> select min(ziggy_date) min, max(ziggy_date) max from ziggy; MIN MAX ---------- ---------- 19890110 20160527
We see that all the data in the ZIGGY_DATE column are just number representations of dates, with a range between 10 Jan 1989 and 27 May 2016.
Note there are actually 10,000 days between the dates but the CBO would estimate a range of 270,417 possible days (20160527 – 19890110 = 270,417). The CBO has no idea that the “numbers” within the column are all dates and that there are ranges of values in which data is relatively popular (e.g. between say 20160101 and 20160131) and ranges of values in which data is relatively unpopular (e.g. say between 20154242 and 20159999).
Although not as bad as the range of possible unpopular values found within a character data type as I discussed previously when storing date data as a string, there is still enough data skew when storing dates as numbers to be problematic to the CBO.
If we select just one date with an equality predicate:
SQL> select * from ziggy where ziggy_date = 20150613; 100 rows selected. Execution Plan ---------------------------------------------------------- Plan hash value: 2700236208 ---------------------------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU) | Time | ---------------------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 100 | 1500 | 103 (0) | 00:00:01 | | 1 | TABLE ACCESS BY INDEX ROWID BATCHED | ZIGGY | 100 | 1500 | 103 (0) | 00:00:01 | |* 2 | INDEX RANGE SCAN | ZIGGY_DATE_I | 100 | | 3 (0) | 00:00:01 | ---------------------------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 2 - access("ZIGGY_DATE"=20150613) Statistics ---------------------------------------------------------- 0 recursive calls 0 db block gets 110 consistent gets 0 physical reads 0 redo size 3883 bytes sent via SQL*Net to client 618 bytes received via SQL*Net from client 8 SQL*Net roundtrips to/from client 0 sorts (memory) 0 sorts (disk) 100 rows processed
The CBO gets things spot on, correctly estimating 100 rows to be returned, as the CBO knows there are only 10,000 distinct values of which only one of those values is being selected.
Selectivity is basically the density of the column = 1/10000 = 0.0001, so the estimated cardinality is 0.0001 x 1M rows = 100 rows. Perfect.
However, if we perform a range based query as follows:
SQL> select * from ziggy where ziggy_date between 20151010 and 20151111; 3300 rows selected. Execution Plan ---------------------------------------------------------- Plan hash value: 2700236208 ---------------------------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU) | Time | ---------------------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 573 | 8595 | 580 (1) | 00:00:01 | | 1 | TABLE ACCESS BY INDEX ROWID BATCHED | ZIGGY | 573 | 8595 | 580 (1) | 00:00:01 | |* 2 | INDEX RANGE SCAN | ZIGGY_DATE_I | 573 | | 4 (0) | 00:00:01 | ---------------------------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 2 - access("ZIGGY_DATE">=20151010 AND "ZIGGY_DATE"<=20151111) Statistics ---------------------------------------------------------- 0 recursive calls 0 db block gets 3531 consistent gets 0 physical reads 0 redo size 108973 bytes sent via SQL*Net to client 2961 bytes received via SQL*Net from client 221 SQL*Net roundtrips to/from client 0 sorts (memory) 0 sorts (disk) 3300 rows processed
The CBO has got things somewhat incorrect in this example and has underestimated the expect number of rows (573 rows vs. the 3,300 rows actually returned).
The actual number of days between these dates is 33 so the actual ratio of data returned is 33/10000 x 1M rows = 3,300 rows. This is a range of “numbers” that overall covers a relatively “popular” range of date values.
However Oracle is estimating a range of some 20151111 – 20151010 = 101 days between these dates. As the total range of possible days 20160527-19890110 = 270,417, the estimated ratio of returned rows is 101/270417 plus 2 x selectivity of a day for the implicit 2 equality conditions (as a between is effectively >= and <=). The selectivity of one day is just the density of the column, 0.0001 as illustrated in the previous query.
Therefore, the query selectivity is derived as being (101/270417) + (2 x 0.0001) = 0.000573 when multiplied by 1M rows = 573 rows as estimated by the CBO.
So the CBO is rather significantly *under* estimating the rows to be returned which could result in a sub-optimal execution plan (such as the inappropriate use of an index range scan as in this example, noting the poor clustering of the data).
If we now look at another range scan below:
SQL> select * from ziggy where ziggy_date between 20151225 and 20160101; 800 rows selected. Execution Plan ---------------------------------------------------------- Plan hash value: 2421001569 --------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU) | Time | --------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 33023 | 483K | 810 (15) | 00:00:01 | |* 1 | TABLE ACCESS FULL | ZIGGY | 33023 | 483K | 810 (15) | 00:00:01 | --------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 1 - filter("ZIGGY_DATE">=20151225 AND "ZIGGY_DATE"<=20160101) Statistics ---------------------------------------------------------- 1 recursive calls 0 db block gets 2824 consistent gets 0 physical reads 0 redo size 23850 bytes sent via SQL*Net to client 1135 bytes received via SQL*Net from client 55 SQL*Net roundtrips to/from client 0 sorts (memory) 0 sorts (disk) 800 rows processed
The actual number of days between these dates is only 8 so the actual ratio of data returned is 8/10000 x 1M rows = 800 rows. This is a range of “numbers” that overall covers a relatively “unpopular” range of date values.
However Oracle is estimating a range of some 20160101 – 20151225 = 8876 days between these dates. As the total range of possible days is 20160527-19890110 = 270,417, the estimated ratio of returned rows is 8876/270417 plus 2 x the selectivity of a single day again for the 2 implicit equality conditions.
Therefore, the query selectivity is derived as being (8876/270417) + (2 x 0.0001) = 0.033023 when multiplied by 1M rows = 33,023 rows as estimated by the CBO.
So the CBO is rather significantly *over* estimating the rows to be returned which could again result in a sub-optimal execution plan (or the inappropriate use of a Full Table Scan in this example). The CBO is simply not picking up the fact that most of the possible values between the “number” ranges aren’t valid dates and can’t possibly exist.
Of course, having dates stored as simple numbers means Oracle has no way of ensuring data integrity and can allow “invalid” dates to be inserted:
SQL> insert into ziggy values (1000001, 42, 20160599); 1 row created. SQL> rollback; Rollback complete.
As with dates stored as strings, we can again address these issues by either collecting histograms for such columns and/or by creating a function-based date index on the column:
SQL> create index ziggy_date_fn_i on ziggy(to_date(ziggy_date,'YYYYMMDD')); Index created. SQL> exec dbms_stats.gather_table_stats(ownname=>null, tabname=>'ZIGGY'); PL/SQL procedure successfully completed. SQL> select column_name, num_distinct, density, histogram, hidden_column, virtual_column from dba_tab_cols where table_name='ZIGGY'; COLUMN_NAME NUM_DISTINCT DENSITY HISTOGRAM HID VIR ------------ ------------ ---------- --------------- --- --- SYS_NC00004$ 10000 .0001 NONE YES YES ZIGGY_DATE 10000 .0001 HYBRID NO NO CODE 1000 .001 NONE NO NO ID 1000000 .000001 NONE NO NO
The associated query with the equality predicate has accurate estimates as it did previously:
SQL> select * from ziggy where to_date(ziggy_date, 'YYYYMMDD') = '13-JUN-2015'; 100 rows selected. Execution Plan ---------------------------------------------------------- Plan hash value: 945728471 ------------------------------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time | ------------------------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 100 | 2300 | 103 (0)| 00:00:01 | | 1 | TABLE ACCESS BY INDEX ROWID BATCHED | ZIGGY | 100 | 2300 | 103 (0)| 00:00:01 | |* 2 | INDEX RANGE SCAN | ZIGGY_DATE_FN_I | 100 | | 3 (0)| 00:00:01 | ------------------------------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 2 - access(TO_DATE(TO_CHAR("ZIGGY_DATE"),'YYYYMMDD')=TO_DATE(' 2015-06-13 00:00:00', 'syyyy-mm-dd hh24:mi:ss')) Statistics ---------------------------------------------------------- 0 recursive calls 0 db block gets 111 consistent gets 0 physical reads 0 redo size 2877 bytes sent via SQL*Net to client 618 bytes received via SQL*Net from client 8 SQL*Net roundtrips to/from client 0 sorts (memory) 0 sorts (disk) 100 rows processed
As the virtual column created for the function-based index also has 10,000 distinct values and a corresponding density of 0.0001, the CBO is getting the cardinality estimate of 100 rows spot on.
But importantly, both associated range based queries are now also being accurately costed by the CBO as it now knows the data being searched is date based and hence can more accurately determine the actual expected dates to be returned within the specified “date” ranges.
SQL> select * from ziggy where to_date(ziggy_date, 'YYYYMMDD') between '10-OCT-2015' and '11-NOV-2015'; 3300 rows selected. Execution Plan ---------------------------------------------------------- Plan hash value: 2421001569 --------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU) | Time | --------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 3400 | 78200 | 1061 (35) | 00:00:01 | |* 1 | TABLE ACCESS FULL | ZIGGY | 3400 | 78200 | 1061 (35) | 00:00:01 | --------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 1 - filter(TO_DATE(TO_CHAR("ZIGGY_DATE"),'YYYYMMDD')>=TO_DATE(' 2015-10-10 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND TO_DATE(TO_CHAR("ZIGGY_DATE"),'YYYYMMDD')<=TO_DATE(' 2015-11-11 00:00:00', 'syyyy-mm-dd hh24:mi:ss')) Statistics ---------------------------------------------------------- 8 recursive calls 0 db block gets 2991 consistent gets 0 physical reads 0 redo size 95829 bytes sent via SQL*Net to client 2961 bytes received via SQL*Net from client 221 SQL*Net roundtrips to/from client 0 sorts (memory) 0 sorts (disk) 3300 rows processed
The CBO is now estimating not 573 rows, but 3,400 rows which is much closer to the actual 3,300 rows being returned. As a result, the CBO is now performing a more efficient Full Table Scan (due to the poor Clustering Factor of the index) than the Index Range Scan performed previously.
If we look at the other range scan query:
SQL> select * from ziggy where to_date(ziggy_date, 'YYYYMMDD') between '25-DEC-2015' and '01-JAN-2016'; 800 rows selected. Execution Plan ---------------------------------------------------------- Plan hash value: 945728471 ------------------------------------------------------------------------------------------------------- | Id | Operation | Name | Rows | Bytes | Cost (%CPU) | Time | ------------------------------------------------------------------------------------------------------- | 0 | SELECT STATEMENT | | 900 | 20700 | 909 (1)| 00:00:01 | | 1 | TABLE ACCESS BY INDEX ROWID BATCHED | ZIGGY | 900 | 20700 | 909 (1)| 00:00:01 | |* 2 | INDEX RANGE SCAN | ZIGGY_DATE_FN_I | 900 | | 5 (0)| 00:00:01 | ------------------------------------------------------------------------------------------------------- Predicate Information (identified by operation id): --------------------------------------------------- 2 - access(TO_DATE(TO_CHAR("ZIGGY_DATE"),'YYYYMMDD')>=TO_DATE(' 2015-12-25 00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND TO_DATE(TO_CHAR("ZIGGY_DATE"),'YYYYMMDD')<=TO_DATE(' 2016-01-01 00:00:00', 'syyyy-mm-dd hh24:mi:ss')) Statistics ---------------------------------------------------------- 8 recursive calls 0 db block gets 861 consistent gets 7 physical reads 0 redo size 18917 bytes sent via SQL*Net to client 1135 bytes received via SQL*Net from client 55 SQL*Net roundtrips to/from client 0 sorts (memory) 0 sorts (disk) 800 rows processed
The CBO is now estimating not 33023 rows, but 900 rows which is again much closer to the actual 800 rows being returned. As a result, the CBO is now performing a more efficient Index Range Scan than the Full Table Scan is was previously.
And of course, the database via the function-based date index now has a manner in which protect the integrity of the date data:
SQL> insert into ziggy values (1000001, 42, 20160599); insert into ziggy values (1000001, 42, 20160599) * ERROR at line 1: ORA-01847: day of month must be between 1 and last day of month
However, the best way in Oracle to store “Date” data is within a Date data type column …
Very interesting indeed, and especially whith all the examples. Could there be any gain in storing data in number format but, instead of the human readable example you used (YYYYMMDD), using the ‘seconds since 1970-01-01 00:00:00’ like in MySQL, for instance?
LikeLike
Hi Wolfgoe
Short answer, no as this is basically what a DATE data type does automatically for you.
LikeLike
A valuable article, i hope someone at SAP read this. Perhaps we get real dates in the databases someday. But it seems they made the same mistake in their HANA.
LikeLike
Hi Kay
Yes indeed, I’ve been asked to look at many SAP environments over the years. What SAP does in saving date data in the database is sad to say the least …
LikeLike
I would say “criminal”.
LikeLike
The real problem with non-native dates is data manipulation. If you compute 20160229 +1 and your algorithm is buggy, you may get a February 30th. I’ve seen it (not SAP).
Storing dates in anything other than dates is like storing numbers as strings (‘one’,’two’, ‘three’), a recipe for data corruption.
LikeLike
I completely agree. However, it’s indeed criminal how often I see basics such the wrong data type being implemented, which in turn leads to so many avoidable issues.
LikeLike
First of all, thank you for your many interesting blog posts. I’ve been playing with your example and I’ve come across something strange: It would be nice if you could take a look and write something about it.
P.S. I look forward to meeting you at Trivadis Performance Days 2018
Tested on 11GR2 and 12CR1
set autotrace on explain
Select * from ziggy where ziggy_date between 20151010 and 20151111;
/**
Same cardinality as in your example
—————————————————————————————————-
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
—————————————————————————————————-
| 0 | SELECT STATEMENT | | 573 | 8595 | 655 (12)| 00:00:01 |
| 1 | TABLE ACCESS BY INDEX ROWID BATCHED| ZIGGY | 573 | 8595 | 655 (12)| 00:00:01 |
|* 2 | INDEX RANGE SCAN | ZIGGY_DATE_I | 573 | | 7 (43)| 00:00:01 |
—————————————————————————————————-
**/
/**
But if I’m gonna to…
**/
alter table ziggy
add ziggy_asdate date;
update ziggy
set ziggy_asdate = to_date(ziggy_date, ‘YYYYMMDD’);
commit;
exec dbms_stats.gather_table_stats(ownname=>null, tabname=>’ZIGGY’);
Select * from ziggy where ziggy_date between 20151010 and 20151111;
/**
If I reexecute the same statement from above, I now get a different cardinality…
—————————————————————————————————-
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
—————————————————————————————————-
| 0 | SELECT STATEMENT | | 3948 | 90804 | 4488 (12)| 00:00:01 |
| 1 | TABLE ACCESS BY INDEX ROWID BATCHED| ZIGGY | 3948 | 90804 | 4488 (12)| 00:00:01 |
|* 2 | INDEX RANGE SCAN | ZIGGY_DATE_I | 3948 | | 28 (58)| 00:00:01 |
—————————————————————————————————-
**/
exec dbms_stats.gather_table_stats(ownname=>null, tabname=>’ZIGGY’);
Select * from ziggy where ziggy_date between 20151010 and 20151111;
/**
Even worse:
The cardinality changes everytime I regather the table stats…
—————————————————————————————————-
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
—————————————————————————————————-
| 0 | SELECT STATEMENT | | 2345 | 53935 | 2668 (12)| 00:00:01 |
| 1 | TABLE ACCESS BY INDEX ROWID BATCHED| ZIGGY | 2345 | 53935 | 2668 (12)| 00:00:01 |
|* 2 | INDEX RANGE SCAN | ZIGGY_DATE_I | 2345 | | 17 (53)| 00:00:01 |
—————————————————————————————————-
**/
LikeLike
Hi Daniel
Thanks for your example.
I suspect this is a cause of mistaken identity in that I’m not sure the action of adding a new column is the cause of the different cardinality estimates.
The issue here are good old histograms that are being automatically created and altered when you re-collect stats, Note when the query is first run, table stats are collected before the index is created and the query ever run and so likely no histograms are created at this stage. When stats are collected the second time (I suspect regardless of the new column) after index creation and use of index, histograms are now collected on the ziggy_date column. Subsequent collection of stats causes subtle changes to these histograms that results in the changes to the cardinality estimates.
Just check the histogram stats for the ziggy_date column at the various steps and see if indeed there are new histogram stats present each time.
Looking forward indeed to the Trivadis Performance Days conference 🙂
LikeLike
You were right. On the column ziggy_date there was a histogram. This may have been created due to an entry of the column ziggy_date in sys.col_usage$ when the statistics were collected for the second time. When I deleted the histogram, the expected cardinality came out despite the new column ziggy_asdate.
LikeLike