jump to navigation

Outlier Values – An Enemy Of The Index December 13, 2007

Posted by Richard Foote in Index Access Path, Indexing Tricks, Oracle Cost Based Optimizer, Oracle General, Oracle Indexes, Outlier Values.
trackback

Outlier values are basically values that sit way way outside the standard range of a column’s normal value range.

Data can be a funny thing and sometimes there are values that are naturally “exceptional”. However, very commonly, outlier values are used by applications to represent bizarre default values, to avoid confusion with legitimate values. For example, I look after an application that uses the American Date Of Independence as it’s “default” date.

Usually, these weird outlier values are used to avoid nulls values, as nulls can be problematic and can not be indexed (well actually you can index a null column but we’ll leave that for another blog entry).

However, outlier values while (maybe) solving one problem, can introduce some very significant problems in return.

Firstly, the CBO “hates” outlier values as it potentially totally screws up the CBO’s selectivity calculations. The selectivity of a range scan is basically calculated by the CBO to be the number of values in the range of interest divided by the full range of possible values (IE. the max value minus the min value). Therefore if this calculation is invalidated by a massive and disprotionate “hole” in the full range of possible values, the CBO can get things horribly wrong.

See here for a simple demonstration:  Outlier Selectivity Problem

Additionally, indexes “hate” outlier values as it prevents Oracle using the 90-10 block split to keep indexes nice and compact and is forced to use 50-50 block splits instead. Basically a 90-10 block split is considered if and only if the index entry to be inserted is equal or greater than the current maximum value.  An outlier value that is also the maximum value,  usually means monotonically increasing values (such as sequences, dates, etc.) don’t actually insert the maximum value. Therefore, not only do indexes perform 50-50 splits but this 50% of free space is never used, as all new values are all almost, but not quite, maximum values.

Little demo to highlight this problem: Outlier Index Space Utilisation Problem 

In summary, avoid outlier values if at all possible.  They generally cause more problems than they solve !!

Comments»

1. Donald K. Burleson - December 13, 2007

Hi Richard,

As an index guru, you should explore the new 11g extended statistics . . . .

Like

2. Brian Tkatch - December 13, 2007

“In summary, avoid outlier values if at all possible. They generally cause more problems than they solve !!”

But how? The data is what it is.

Liked by 1 person

3. Don Seiler - December 13, 2007

Jonathan Lewis also touches on this subject in Chapter 6 of his CBO book in the section “Discrete Dangers”. Very good to highlight this for people developing applications and tempted to use these extreme values.

Like

4. Richard Foote - December 13, 2007

Don, I’ve already publically shown my love for extended statistics 🙂

http://groups.google.com/group/comp.databases.oracle.server/msg/d509e60eee76042d

Brian, the data is sometimes what we set it to be. As I mentioned, sometimes we use really bad values as defaults. For example, in the second demo, a default of 0 instead of the massive number would possibly have been a far better choice.

Sometimes the humble null is the best default value as well …

Like

5. Alberto Dell'Era - December 13, 2007

Minor observation but maybe worth mentioning – the “hole” problem can be thought as a manifestation of skewness, which has two faces: skewness in the multiplicity of values (some values occur much more often than others) and skewness in the relative “distance” of values (the “hole” problem). The former can be a problem (especially) for equality predicates and equijoins, the latter for range predicates. Histograms can help for both – but of course, collecting an histogram is an option that has to be carefully weighted against their gotchas, so it’s way much better to avoid “holes” whenever possible (which is of course what you’re saying here).

Like

6. Richard Foote - December 14, 2007

Hi Alberto, absolutely worth mentioning. Outlier values effectively causes a skew effect in the data distribution which could be addressed with histograms. However, histograms come with a cost and prevention (if possible) is the best cure. Also, it doesn’t of course address the index space problem.

PS. I also changed the wording in the Blog entry as per your (very good) suggestion. Thanks 🙂

Like

7. Brian Tkatch - December 14, 2007

“As I mentioned, sometimes we use really bad values as defaults. For example, in the second demo, a default of 0 instead of the massive number would possibly have been a far better choice.

Sometimes the humble null is the best default value as well …”

Ah, i see. So when using defaults keep it close to the expected values, or just use NULL.

But if the data itself actually has an outlier value, not much can be done about it?

I never even thought about the subject mentioned here. Thank you for mentioning it.

Like

8. Richard Foote - December 14, 2007

“Ah, i see. So when using defaults keep it close to the expected values, or just use NULL.”

Exactly !!

“But if the data itself actually has an outlier value, not much can be done about it?”

Correct, except perhaps creating a histogram so that the CBO understands the data range is skewed and not evenly distributed.

“I never even thought about the subject mentioned here. Thank you for mentioning it.”

No worries 🙂

Like

9. Brian Tkatch - December 17, 2007

“Correct, except perhaps creating a histogram so that the CBO understands the data range is skewed and not evenly distributed.”

And that is reading for another day. Sheesh, so much to learn. 🙂

Like

10. Richard Foote - December 18, 2007

One never stops learning, get’s a tad boring otherwise 🙂

Like

11. meet the needs - December 18, 2017

It’s truly very complex in this active life to listen news
on TV, therefore I simply use world wide web for that purpose,
and get the latest news.

Like

12. Twyla - October 22, 2018

Great post. I was checking constantly this blog and I am impressed!

Very helpful information particularly the last
part 🙂 I care for such information much. I was looking
for this certain information for a very long time. Thank you and best of luck.

Like

13. ceme online - October 29, 2018

Judi togel

Outlier Values

Like

14. Julie - April 11, 2019

Excellent post. I was checking continuously this weblog and
I am impressed! Very useful information specifically the
remaining phase 🙂 I care for such information much. I used to
be seeking this certain information for a long time.
Thanks and good luck.

Like

15. Taylah - September 28, 2020

First off I would like to say excellent blog!
I had a quick question that I’d like to ask if you don’t mind.
I was interested to find out how you center yourself and clear your thoughts before writing.
I’ve had a difficult time clearing my thoughts in getting my ideas out there.
I truly do take pleasure in writing but it just seems
like the first 10 to 15 minutes are generally lost simply just trying to figure out how to begin. Any recommendations or hints?
Cheers!

Like


Leave a comment