Storing Date Values As Characters Part II (A Better Future) May 30, 2016
Posted by Richard Foote in 12c, CBO, Function Based Indexes, Oracle Indexes, Storing Dates as Characters.
In the previous post, I discussed how storing date values within a character data type is a really really bad idea and illustrated how the CBO can easily get its costings totally wrong as a result. A function-based date index helped the CBO get the correct costings and protect the integrity of the date data.
During the demo, I re-collected statistics on the table because the hidden virtual column associated with the newly created function-based index doesn't initially have any statistics.
Before re-collecting statistics:
SQL> select column_name, num_distinct, density, histogram, hidden_column, virtual_column from dba_tab_cols where table_name='BOWIE';

COLUMN_NAME  NUM_DISTINCT    DENSITY HISTOGRAM       HID VIR
------------ ------------ ---------- --------------- --- ---
SYS_NC00004$                         NONE            YES YES
BOWIE_DATE          10000      .0001 NONE            NO  NO
CODE                 1000       .001 NONE            NO  NO
ID                1000000    .000001 NONE            NO  NO
And afterwards:
SQL> exec dbms_stats.gather_table_stats(ownname=>null, tabname=>'BOWIE');

PL/SQL procedure successfully completed.

SQL> select column_name, num_distinct, density, histogram, hidden_column, virtual_column from dba_tab_cols where table_name='BOWIE';

COLUMN_NAME  NUM_DISTINCT    DENSITY HISTOGRAM       HID VIR
------------ ------------ ---------- --------------- --- ---
SYS_NC00004$        10000      .0001 NONE            YES YES
BOWIE_DATE          10000      .0001 HYBRID          NO  NO
CODE                 1000       .001 NONE            NO  NO
ID                1000000    .000001 NONE            NO  NO
We can see that the hidden virtual column now has statistics.
But we also notice another difference: the BOWIE_DATE column now has a histogram (of type HYBRID).
As discussed in the previous post, the issue here is that the date data within the character column covers only a very specific subset of all the potential character values that could reside within the column. Therefore the CBO is getting the range scan selectivity hopelessly incorrect.
Now that we’ve run a few queries featuring the BOWIE_DATE column in the predicates and as there’s effectively data skew within the column, the column becomes a candidate for a histogram with the default SIZE AUTO collection method.
The histogram now provides the CBO with a much more accurate picture of the distribution of the data within the BOWIE_DATE column, and in particular that between discrete "date" column values, only so many rows qualify.
As a result of the histogram, the CBO can now make much more accurate cardinality estimates.
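As a rough sketch of why a histogram helps here, the following toy model (a simplified equi-height histogram, not Oracle's actual hybrid histogram implementation) records a column value at each of 254 quantile boundaries over a replica of the demo data, then estimates a range predicate's cardinality by counting the boundaries that land inside the range:

```python
from datetime import date, timedelta

NROWS, NDATES, NBUCKETS = 1_000_000, 10_000, 254

# replicate the demo data: 10,000 consecutive dates ending 2016-05-26,
# stored as 'YYYY MM DD' strings, 100 rows per date
start = date(2016, 5, 26) - timedelta(days=NDATES - 1)
dates = [(start + timedelta(days=i)).strftime('%Y %m %d') for i in range(NDATES)]

# sorted row r (0-based) holds dates[r // 100], so each equi-height bucket
# endpoint is simply the value found at the corresponding quantile row
endpoints = [dates[(i * (NROWS - 1) // NBUCKETS) // 100]
             for i in range(NBUCKETS + 1)]

def estimate(lo, hi):
    # each bucket represents roughly NROWS / NBUCKETS rows; count the
    # bucket endpoints that fall inside the queried range
    inside = sum(lo <= e <= hi for e in endpoints[1:])
    return inside * (NROWS // NBUCKETS)

print(estimate('2015 10 10', '2015 12 31'))  # same ballpark as the true 8300
print(estimate('2000 10 10', '2016 12 31'))  # same ballpark as the true 570800
```

Oracle's real hybrid histograms additionally store a repeat count at each endpoint, which sharpens estimates for popular values, but the principle for range predicates is the same: the endpoints tell the CBO where the data actually lives.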
If we now re-run the query that actually returns 8300 rows but for which the CBO previously estimated only 100 rows would be returned:
SQL> select * from bowie where bowie_date between '2015 10 10' and '2015 12 31';

8300 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 1845943507

---------------------------------------------------------------------------
| Id  | Operation         | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |       |  4152 | 83040 |  1000  (12)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| BOWIE |  4152 | 83040 |  1000  (12)| 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("BOWIE_DATE">='2015 10 10' AND "BOWIE_DATE"<='2015 12 31')

Statistics
----------------------------------------------------------
         14  recursive calls
          0  db block gets
       4063  consistent gets
          0  physical reads
          0  redo size
     282075  bytes sent via SQL*Net to client
       6635  bytes received via SQL*Net from client
        555  SQL*Net roundtrips to/from client
          1  sorts (memory)
          0  sorts (disk)
       8300  rows processed
We see that at 4,152 rows, the estimate is now much better. Not perfect, but good enough to now choose the more efficient Full Table Scan execution plan.
If we re-run the query that returned over 1/2 the table at some 570,000 rows but with the CBO previously estimating only 116 rows:
SQL> select * from bowie where bowie_date between '2000 10 10' and '2016 12 31';

570800 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 1845943507

---------------------------------------------------------------------------
| Id  | Operation         | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |       |   572K|    10M|  1012  (13)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| BOWIE |   572K|    10M|  1012  (13)| 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("BOWIE_DATE">='2000 10 10' AND "BOWIE_DATE"<='2016 12 31')

Statistics
----------------------------------------------------------
         14  recursive calls
          0  db block gets
      41456  consistent gets
          4  physical reads
          0  redo size
   19292352  bytes sent via SQL*Net to client
     419135  bytes received via SQL*Net from client
      38055  SQL*Net roundtrips to/from client
          1  sorts (memory)
          0  sorts (disk)
     570800  rows processed
We see that at an estimate of 572K rows, it’s now got this just about right and again has made the right decision with the Full Table Scan execution plan.
Storing date data in character-based columns is still a really really bad idea and limits the manner in which date data can be analysed, protected and accessed, but with appropriate histograms in place, at least the CBO has some chance of making a reasonable fist of things with some range based queries.
As a follow-up, I was asked if storing dates in NUMBER format is a better option than as a string. I’ll discuss that next.
Storing Date Values As Characters (What’s Really Happening) May 26, 2016
Posted by Richard Foote in 12c, Function Based Indexes, Oracle Indexes, Storing Dates as Characters.
For something that’s generally considered an extremely bad idea, I’ve lost count of the number of times I’ve come across applications that insist on storing date values as characters within the database. We’ve all seen them …
I recently got called in to assist a customer who was having issues with a POC, in that the database didn't appear to want to use the Database In-Memory option as expected. In various key scenarios, the CBO kept coming up with execution plans that used these indexes (which they were hesitant to drop), when the queries ran so much faster if the Database In-Memory store was used instead. So I was called in to find out what was really happening, and it turned out the main culprit was indeed queries against columns where dates were stored as characters within the database. In the process, we found another issue with some "invalid" date values. Go figure.
Interestingly, both issues could be addressed by creating a new index on the date column …
I’ve kinda replicated the scenario here with the following little test case. I’ll begin by creating a table with a varchar2 field (bowie_date) that actually stores a whole bunch of “dates”:
SQL> create table bowie (id number, code number, bowie_date varchar2(42));

Table created.

SQL> insert into bowie select rownum, mod(rownum,1000), to_char(sysdate-mod(rownum,10000), 'YYYY MM DD') from dual connect by level <=1000000;

1000000 rows created.

SQL> commit;

Commit complete.
So the bowie_date column basically has 10000 different dates, with each date evenly distributed with 100 occurrences for each date.
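As a quick sanity check of that claim, mod(rownum, 10000) over one million rows maps each row to one of exactly 10,000 residues, each occurring exactly 100 times. Sketched outside the database:

```python
from collections import Counter

# mirror to_char(sysdate - mod(rownum, 10000), ...) over 1,000,000 rows:
# the date assigned to each row depends only on rownum mod 10,000
counts = Counter(rownum % 10_000 for rownum in range(1, 1_000_001))

print(len(counts))           # 10000 distinct values
print(set(counts.values()))  # {100}: every value occurs exactly 100 times
```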
I’ll now collect statistics on the table:
SQL> exec dbms_stats.gather_table_stats(ownname=>null, tabname=>'BOWIE');

PL/SQL procedure successfully completed.

SQL> select column_name, num_distinct, density, histogram, hidden_column, virtual_column from dba_tab_cols where table_name='BOWIE';

COLUMN_NAME  NUM_DISTINCT    DENSITY HISTOGRAM       HID VIR
------------ ------------ ---------- --------------- --- ---
BOWIE_DATE          10000      .0001 NONE            NO  NO
CODE                 1000       .001 NONE            NO  NO
ID                1000000    .000001 NONE            NO  NO
So the bowie_date column indeed has 10000 distinct dates.
I’ll now create a standard b-tree index on this column:
SQL> create index bowie_date_i on bowie(bowie_date); Index created.
So the data looks as follows, with bowie_date a varchar2 column that actually contains date data, and with the following min/max range:
SQL> select * from bowie where rownum <11;

        ID       CODE BOWIE_DATE
---------- ---------- ----------
       916        916 2013 11 22
       917        917 2013 11 21
       918        918 2013 11 20
       919        919 2013 11 19
       920        920 2013 11 18
       921        921 2013 11 17
       922        922 2013 11 16
       923        923 2013 11 15
       924        924 2013 11 14
       925        925 2013 11 13

10 rows selected.

SQL> select min(bowie_date) min, max(bowie_date) max from bowie;

MIN        MAX
---------- ----------
1989 01 09 2016 05 26
If we run a query that uses an equality predicate as follows:
SQL> select * from bowie where bowie_date = '2015 06 13';

100 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 1525056162

----------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |              |   100 |  2000 |   103   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| BOWIE        |   100 |  2000 |   103   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN                  | BOWIE_DATE_I |   100 |       |     3   (0)| 00:00:01 |
----------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("BOWIE_DATE"='2015 06 13')

Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
        110  consistent gets
         16  physical reads
          0  redo size
       4383  bytes sent via SQL*Net to client
        618  bytes received via SQL*Net from client
          8  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
        100  rows processed
The CBO gets things just about right. 100 rows are estimated and indeed 100 rows are retrieved. So we have confidence the CBO has made the right decision in using the index here as so few rows are actually retrieved.
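That estimate falls straight out of the standard no-histogram equality formula: estimated cardinality = num_rows × density, where density here is simply 1/num_distinct. A minimal sketch of the arithmetic:

```python
# equality selectivity with no histogram: density = 1 / num_distinct
num_rows, num_distinct = 1_000_000, 10_000
density = 1 / num_distinct           # matches the .0001 reported in dba_tab_cols
cardinality = round(num_rows * density)
print(cardinality)  # 100
```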
However, if we run a range scan predicate such as the following:
SQL> select * from bowie where bowie_date between '2015 10 10' and '2015 12 31';

8300 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 1525056162

----------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |              |   100 |  2000 |   104   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| BOWIE        |   100 |  2000 |   104   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN                  | BOWIE_DATE_I |   100 |       |     3   (0)| 00:00:01 |
----------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("BOWIE_DATE">='2015 10 10' AND "BOWIE_DATE"<='2015 12 31')

Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
       8881  consistent gets
         38  physical reads
          0  redo size
     315219  bytes sent via SQL*Net to client
       6635  bytes received via SQL*Net from client
        555  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
       8300  rows processed
The CBO has got the costings wrong here. It still estimates only 100 rows are to be returned when in actual fact 8300 rows come back.
If we select an even larger “date” range:
SQL> select * from bowie where bowie_date between '2000 10 10' and '2016 12 31';

570800 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 1525056162

----------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
----------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |              |   116 |  2320 |   120   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| BOWIE        |   116 |  2320 |   120   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN                  | BOWIE_DATE_I |   116 |       |     3   (0)| 00:00:01 |
----------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("BOWIE_DATE">='2000 10 10' AND "BOWIE_DATE"<='2016 12 31')

Statistics
----------------------------------------------------------
          1  recursive calls
          0  db block gets
     610491  consistent gets
       1704  physical reads
          0  redo size
   21575496  bytes sent via SQL*Net to client
     419135  bytes received via SQL*Net from client
      38055  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
     570800  rows processed
The CBO has got things seriously wrong here. We're actually returning over half the table, some 570,800 rows, but the CBO thinks only 116 rows will be returned. Why?
The problem comes back to storing date values as characters. The CBO has absolutely no idea that these “characters” are actually meant to be dates and has no idea that the only valid ranges of values are date values.
With a possible range between '1989 01 09' and '2016 05 26', any character combination (up to the column's size of 42 characters) could potentially exist in this column (the value '1zxgs$.jKN6tglhasgdlhlhd23bkbk?k', for example).
So the required range between ‘2000 10 10’ and ‘2016 12 31’ actually represents a relatively narrow range of possible values within the range of all possible values (especially as the leading column differs between the min/max).
Hence why the CBO is estimating such a low number of rows to be returned and hence why the CBO is deciding to incorrectly use the index. “Hiding” the meaning and distribution of values from the CBO in this manner can be problematic to say the least.
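A toy model makes the gap obvious (this is only a sketch of the principle, not Oracle's exact algorithm, which linearises roughly the leading bytes of each endpoint and makes further adjustments for closed ranges): interpolating the query range over raw character space yields a minuscule selectivity, while interpolating over date space yields the true ~57%:

```python
from datetime import date

def str_to_num(s, nbytes=10):
    # crude linearisation of a string endpoint: treat its leading
    # bytes as one big base-256 number
    return int.from_bytes(s.encode().ljust(nbytes, b'\x00')[:nbytes], 'big')

lo, hi = '2000 10 10', '2016 12 31'
col_min, col_max = '1989 01 09', '2016 05 26'   # column min/max from the stats

# selectivity if values were uniform across *character* space
char_sel = ((str_to_num(min(hi, col_max)) - str_to_num(lo)) /
            (str_to_num(col_max) - str_to_num(col_min)))

# selectivity if values were uniform across *date* space
to_d = lambda s: date(*map(int, s.split()))
date_sel = ((min(to_d(hi), to_d(col_max)) - to_d(lo)).days /
            (to_d(col_max) - to_d(col_min)).days)

print(f'char-space selectivity: {char_sel:.6f}')   # minuscule
print(f'date-space selectivity: {date_sel:.3f}')   # roughly 0.57
```

In character space, almost the entire interval between the column min and max is the unpopulated no-man's-land between '1999 …' and '2000 …' strings, so the queried range looks tiny; in date space the same range genuinely covers over half the data.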
Worse still is the possibility of "invalid" dates being entered, as the database has no implicit way to police the integrity of the data:
SQL> insert into bowie values (1000001, 42, '2016 6 31');

1 row created.

SQL> rollback;

Rollback complete.
We know there’s no such date as 31st June but Oracle has no idea that this is logically invalid data. Or the value ‘lh;ghsgdsd7gdGLH96bb’ for that matter …
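The kind of policing that a proper DATE datatype gives for free can be sketched outside the database. Here Python's datetime plays the role of the date check, with a hypothetical helper name; it illustrates the principle, not Oracle internals:

```python
from datetime import datetime

def is_valid_bowie_date(s):
    # accept only strings that parse as real calendar dates
    # in 'YYYY MM DD' format
    try:
        datetime.strptime(s, '%Y %m %d')
        return True
    except ValueError:
        return False

print(is_valid_bowie_date('2015 06 13'))            # True
print(is_valid_bowie_date('2016 6 31'))             # False: June has only 30 days
print(is_valid_bowie_date('lh;ghsgdsd7gdGLH96bb'))  # False: not a date at all
```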
Did I mention that storing dates in a character column is a really really bad idea …
Now there are a couple of ways to help address these issues if changing the column and its datatype is not possible due to the application. One way is to create a function-based index as follows on a date version of the column:
SQL> create index bowie_date_fn_i on bowie(to_date(bowie_date,'YYYY MM DD')); Index created.
If we run the equality query but this time referencing the to_date function:
SQL> select * from bowie where to_date(bowie_date, 'YYYY MM DD') = '13-JUN-2015';

100 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 960797537

-------------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name            | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |                 | 10000 |   273K|   893   (2)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| BOWIE           | 10000 |   273K|   893   (2)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN                  | BOWIE_DATE_FN_I |  4000 |       |     3   (0)| 00:00:01 |
-------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access(TO_DATE("BOWIE_DATE",'YYYY MM DD')=TO_DATE(' 2015-06-13
       00:00:00', 'syyyy-mm-dd hh24:mi:ss'))

Statistics
----------------------------------------------------------
         12  recursive calls
          0  db block gets
        113  consistent gets
         16  physical reads
          0  redo size
       3268  bytes sent via SQL*Net to client
        618  bytes received via SQL*Net from client
          8  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
        100  rows processed
We notice the CBO has got the row estimate way wrong here, thinking that 10000 rows rather than 100 will be returned. Why?
Remember, when we create a function-based index, Oracle creates a hidden virtual column on the table as Oracle needs a way to store the statistics associated with the result set from the function. But these statistics aren’t populated until we next collect statistics on the table (or explicitly for just the hidden columns) and without the column statistics, the CBO can make poor assumptions:
SQL> select column_name, data_type, num_distinct, density, histogram, hidden_column, virtual_column from dba_tab_cols where table_name='BOWIE';

COLUMN_NAME  DATA_TYPE NUM_DISTINCT    DENSITY HISTOGRAM       HID VIR
------------ --------- ------------ ---------- --------------- --- ---
SYS_NC00004$ DATE                              NONE            YES YES
BOWIE_DATE   VARCHAR2         10000      .0001 NONE            NO  NO
CODE         NUMBER            1000       .001 NONE            NO  NO
ID           NUMBER         1000000    .000001 NONE            NO  NO
There are no column statistics for the virtual column but we note the data type of the virtual column is DATE. So let’s collect new statistics on the table:
SQL> exec dbms_stats.gather_table_stats(ownname=>null, tabname=>'BOWIE');

PL/SQL procedure successfully completed.

SQL> select column_name, data_type, num_distinct, density, histogram, hidden_column, virtual_column from dba_tab_cols where table_name='BOWIE';

COLUMN_NAME  DATA_TYPE NUM_DISTINCT    DENSITY HISTOGRAM       HID VIR
------------ --------- ------------ ---------- --------------- --- ---
SYS_NC00004$ DATE             10000      .0001 NONE            YES YES
BOWIE_DATE   VARCHAR2         10000      .0001 HYBRID          NO  NO
CODE         NUMBER            1000       .001 NONE            NO  NO
ID           NUMBER         1000000    .000001 NONE            NO  NO
There are actually two key differences in the above statistics, but in this post I’ll just focus on the fact that the hidden virtual column now has associated statistics (I’ll discuss the other key difference in my next blog post).
When we re-run the query:
SQL> select * from bowie where to_date(bowie_date, 'YYYY MM DD') = '13-JUN-2015';

100 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 960797537

-------------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name            | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |                 |   100 |  2800 |   102   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| BOWIE           |   100 |  2800 |   102   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN                  | BOWIE_DATE_FN_I |   100 |       |     3   (0)| 00:00:01 |
-------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access(TO_DATE("BOWIE_DATE",'YYYY MM DD')=TO_DATE(' 2015-06-13
       00:00:00', 'syyyy-mm-dd hh24:mi:ss'))

Statistics
----------------------------------------------------------
          0  recursive calls
          0  db block gets
        108  consistent gets
          0  physical reads
          0  redo size
       3268  bytes sent via SQL*Net to client
        618  bytes received via SQL*Net from client
          8  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
        100  rows processed
The CBO has now got the estimate spot on.
If we now run the previous range scan query that accessed 1/2 the table, referencing the to_date function:
SQL> select * from bowie where to_date(bowie_date, 'YYYY MM DD') between '10-OCT-2000' and '31-DEC-2015';

556100 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 1845943507

---------------------------------------------------------------------------
| Id  | Operation         | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |       |   556K|    14M|  1328  (34)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| BOWIE |   556K|    14M|  1328  (34)| 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(TO_DATE("BOWIE_DATE",'YYYY MM DD')>=TO_DATE(' 2000-10-10
       00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
       TO_DATE("BOWIE_DATE",'YYYY MM DD')<=TO_DATE(' 2015-12-31
       00:00:00', 'syyyy-mm-dd hh24:mi:ss'))

Statistics
----------------------------------------------------------
          8  recursive calls
          0  db block gets
      40444  consistent gets
          0  physical reads
          0  redo size
   18804277  bytes sent via SQL*Net to client
     408355  bytes received via SQL*Net from client
      37075  SQL*Net roundtrips to/from client
          0  sorts (memory)
          0  sorts (disk)
     556100  rows processed
The CBO has got the estimates pretty well spot on and is now performing the far more efficient Full Table Scan. The CBO knows that the virtual column is of type DATE and therefore can much more accurately determine the actual cardinality estimates for the range scan on the “date” column.
If we now run the other corresponding range scan that returned a moderate number of rows:
SQL> select * from bowie where to_date(bowie_date, 'YYYY MM DD') between '10-OCT-2015' and '31-DEC-2015';

8300 rows selected.

Execution Plan
----------------------------------------------------------
Plan hash value: 1845943507

---------------------------------------------------------------------------
| Id  | Operation         | Name  | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |       |  8401 |   229K|  1205  (27)| 00:00:01 |
|*  1 |  TABLE ACCESS FULL| BOWIE |  8401 |   229K|  1205  (27)| 00:00:01 |
---------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(TO_DATE("BOWIE_DATE",'YYYY MM DD')>=TO_DATE(' 2015-10-10
       00:00:00', 'syyyy-mm-dd hh24:mi:ss') AND
       TO_DATE("BOWIE_DATE",'YYYY MM DD')<=TO_DATE(' 2015-12-31
       00:00:00', 'syyyy-mm-dd hh24:mi:ss'))

Statistics
----------------------------------------------------------
         14  recursive calls
          0  db block gets
       4058  consistent gets
          0  physical reads
          0  redo size
     282075  bytes sent via SQL*Net to client
       6635  bytes received via SQL*Net from client
        555  SQL*Net roundtrips to/from client
          1  sorts (memory)
          0  sorts (disk)
       8300  rows processed
We notice that again the CBO has got the row estimate just about right and because the index has a poor clustering factor, the CBO still decided to go down the Full Table Scan path.
Even if we don't use the index all that much (perhaps the Database In-Memory store is a better option for most queries), it can still play an important role in policing the integrity of the data. An attempt to insert an invalid date will now automatically be captured by the database and fail:
SQL> insert into bowie values (1000001, 42, '2016 6 31');
insert into bowie values (1000001, 42, '2016 6 31')
*
ERROR at line 1:
ORA-01839: date not valid for month specified
As the function-based index can only be populated or updated if a valid date is inserted into the table, any attempt to insert an invalid date will fail thus protecting the integrity of the data.
The best “fix” here is to store these dates in a date field within the database, where the above issues are automatically addressed. If this is not possible, then the introduction and usage of an associated function-based index can certainly assist the CBO in making the correct decision.
However, if data integrity is not a concern (the application does a fantastic job of it and no one ever has to perform manual data fixes directly in the database), then there’s another option to help make the CBO do the right thing.
The clue is back in how I collected the table statistics, which I’ll discuss in my next post.