jump to navigation

Outlier Values – An Enemy Of The Index December 13, 2007

Posted by Richard Foote in Index Access Path, Indexing Tricks, Oracle Cost Based Optimizer, Oracle General, Oracle Indexes, Outlier Values.
trackback

Outlier values are basically values that sit way way outside the standard range of a column’s normal value range.

Data can be a funny thing and sometimes there are values that are naturally “exceptional”. However, very commonly, outlier values are used by applications to represent bizarre default values, to avoid confusion with legitimate values. For example, I look after an application that uses the American Date Of Independence as it’s “default” date.

Usually, these weird outlier values are used to avoid nulls values, as nulls can be problematic and can not be indexed (well actually you can index a null column but we’ll leave that for another blog entry).

However, outlier values while (maybe) solving one problem, can introduce some very significant problems in return.

Firstly, the CBO “hates” outlier values as it potentially totally screws up the CBO’s selectivity calculations. The selectivity of a range scan is basically calculated by the CBO to be the number of values in the range of interest divided by the full range of possible values (IE. the max value minus the min value). Therefore if this calculation is invalidated by a massive and disprotionate “hole” in the full range of possible values, the CBO can get things horribly wrong.

See here for a simple demonstration:  Outlier Selectivity Problem

Additionally, indexes “hate” outlier values as it prevents Oracle using the 90-10 block split to keep indexes nice and compact and is forced to use 50-50 block splits instead. Basically a 90-10 block split is considered if and only if the index entry to be inserted is equal or greater than the current maximum value.  An outlier value that is also the maximum value,  usually means monotonically increasing values (such as sequences, dates, etc.) don’t actually insert the maximum value. Therefore, not only do indexes perform 50-50 splits but this 50% of free space is never used, as all new values are all almost, but not quite, maximum values.

Little demo to highlight this problem: Outlier Index Space Utilisation Problem 

In summary, avoid outlier values if at all possible.  They generally cause more problems than they solve !!

About these ads

Comments»

1. Donald K. Burleson - December 13, 2007

Hi Richard,

As an index guru, you should explore the new 11g extended statistics . . . .

2. Brian Tkatch - December 13, 2007

“In summary, avoid outlier values if at all possible. They generally cause more problems than they solve !!”

But how? The data is what it is.

3. Don Seiler - December 13, 2007

Jonathan Lewis also touches on this subject in Chapter 6 of his CBO book in the section “Discrete Dangers”. Very good to highlight this for people developing applications and tempted to use these extreme values.

4. Richard Foote - December 13, 2007

Don, I’ve already publically shown my love for extended statistics :)

http://groups.google.com/group/comp.databases.oracle.server/msg/d509e60eee76042d

Brian, the data is sometimes what we set it to be. As I mentioned, sometimes we use really bad values as defaults. For example, in the second demo, a default of 0 instead of the massive number would possibly have been a far better choice.

Sometimes the humble null is the best default value as well …

5. Alberto Dell'Era - December 13, 2007

Minor observation but maybe worth mentioning – the “hole” problem can be thought as a manifestation of skewness, which has two faces: skewness in the multiplicity of values (some values occur much more often than others) and skewness in the relative “distance” of values (the “hole” problem). The former can be a problem (especially) for equality predicates and equijoins, the latter for range predicates. Histograms can help for both – but of course, collecting an histogram is an option that has to be carefully weighted against their gotchas, so it’s way much better to avoid “holes” whenever possible (which is of course what you’re saying here).

6. Richard Foote - December 14, 2007

Hi Alberto, absolutely worth mentioning. Outlier values effectively causes a skew effect in the data distribution which could be addressed with histograms. However, histograms come with a cost and prevention (if possible) is the best cure. Also, it doesn’t of course address the index space problem.

PS. I also changed the wording in the Blog entry as per your (very good) suggestion. Thanks :)

7. Brian Tkatch - December 14, 2007

“As I mentioned, sometimes we use really bad values as defaults. For example, in the second demo, a default of 0 instead of the massive number would possibly have been a far better choice.

Sometimes the humble null is the best default value as well …”

Ah, i see. So when using defaults keep it close to the expected values, or just use NULL.

But if the data itself actually has an outlier value, not much can be done about it?

I never even thought about the subject mentioned here. Thank you for mentioning it.

8. Richard Foote - December 14, 2007

“Ah, i see. So when using defaults keep it close to the expected values, or just use NULL.”

Exactly !!

“But if the data itself actually has an outlier value, not much can be done about it?”

Correct, except perhaps creating a histogram so that the CBO understands the data range is skewed and not evenly distributed.

“I never even thought about the subject mentioned here. Thank you for mentioning it.”

No worries :)

9. Brian Tkatch - December 17, 2007

“Correct, except perhaps creating a histogram so that the CBO understands the data range is skewed and not evenly distributed.”

And that is reading for another day. Sheesh, so much to learn. :)

10. Richard Foote - December 18, 2007

One never stops learning, get’s a tad boring otherwise :)


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 1,704 other followers

%d bloggers like this: