Outlier Values – An Enemy Of The Index | Richard Foote's Oracle Blog

Outlier Values – An Enemy Of The Index December 13, 2007

Posted by Richard Foote in Index Access Path, Indexing Tricks, Oracle Cost Based Optimizer, Oracle General, Oracle Indexes, Outlier Values.
trackback

Outlier values are basically values that sit way way outside the standard range of a column’s normal value range.

Data can be a funny thing and sometimes there are values that are naturally “exceptional”. However, very commonly, outlier values are used by applications to represent bizarre default values, to avoid confusion with legitimate values. For example, I look after an application that uses the American Date Of Independence as it’s “default” date.

Usually, these weird outlier values are used to avoid nulls values, as nulls can be problematic and can not be indexed (well actually you can index a null column but we’ll leave that for another blog entry).

However, outlier values while (maybe) solving one problem, can introduce some very significant problems in return.

Firstly, the CBO “hates” outlier values as it potentially totally screws up the CBO’s selectivity calculations. The selectivity of a range scan is basically calculated by the CBO to be the number of values in the range of interest divided by the full range of possible values (IE. the max value minus the min value). Therefore if this calculation is invalidated by a massive and disprotionate “hole” in the full range of possible values, the CBO can get things horribly wrong.

See here for a simple demonstration: Outlier Selectivity Problem

Additionally, indexes “hate” outlier values as it prevents Oracle using the 90-10 block split to keep indexes nice and compact and is forced to use 50-50 block splits instead. Basically a 90-10 block split is considered if and only if the index entry to be inserted is equal or greater than the current maximum value. An outlier value that is also the maximum value, usually means monotonically increasing values (such as sequences, dates, etc.) don’t actually insert the maximum value. Therefore, not only do indexes perform 50-50 splits but this 50% of free space is never used, as all new values are all almost, but not quite, maximum values.

Little demo to highlight this problem: Outlier Index Space Utilisation Problem

In summary, avoid outlier values if at all possible. They generally cause more problems than they solve !!

Comments»

1. Donald K. Burleson - December 13, 2007: Hi Richard,

As an index guru, you should explore the new 11g extended statistics . . . .

LikeLike

Reply
2. Brian Tkatch - December 13, 2007: “In summary, avoid outlier values if at all possible. They generally cause more problems than they solve !!”

But how? The data is what it is.

LikeLiked by 1 person

Reply
3. Don Seiler - December 13, 2007: Jonathan Lewis also touches on this subject in Chapter 6 of his CBO book in the section “Discrete Dangers”. Very good to highlight this for people developing applications and tempted to use these extreme values.

LikeLike

Reply
4. Richard Foote - December 13, 2007: Don, I’ve already publically shown my love for extended statistics 🙂

http://groups.google.com/group/comp.databases.oracle.server/msg/d509e60eee76042d

Brian, the data is sometimes what we set it to be. As I mentioned, sometimes we use really bad values as defaults. For example, in the second demo, a default of 0 instead of the massive number would possibly have been a far better choice.

Sometimes the humble null is the best default value as well …

LikeLike

Reply
5. Alberto Dell'Era - December 13, 2007: Minor observation but maybe worth mentioning – the “hole” problem can be thought as a manifestation of skewness, which has two faces: skewness in the multiplicity of values (some values occur much more often than others) and skewness in the relative “distance” of values (the “hole” problem). The former can be a problem (especially) for equality predicates and equijoins, the latter for range predicates. Histograms can help for both – but of course, collecting an histogram is an option that has to be carefully weighted against their gotchas, so it’s way much better to avoid “holes” whenever possible (which is of course what you’re saying here).

LikeLike

Reply
6. Richard Foote - December 14, 2007: Hi Alberto, absolutely worth mentioning. Outlier values effectively causes a skew effect in the data distribution which could be addressed with histograms. However, histograms come with a cost and prevention (if possible) is the best cure. Also, it doesn’t of course address the index space problem.

PS. I also changed the wording in the Blog entry as per your (very good) suggestion. Thanks 🙂

LikeLike

Reply
7. Brian Tkatch - December 14, 2007: “As I mentioned, sometimes we use really bad values as defaults. For example, in the second demo, a default of 0 instead of the massive number would possibly have been a far better choice.

Sometimes the humble null is the best default value as well …”

Ah, i see. So when using defaults keep it close to the expected values, or just use NULL.

But if the data itself actually has an outlier value, not much can be done about it?

I never even thought about the subject mentioned here. Thank you for mentioning it.

LikeLike

Reply
8. Richard Foote - December 14, 2007: “Ah, i see. So when using defaults keep it close to the expected values, or just use NULL.”

Exactly !!

“But if the data itself actually has an outlier value, not much can be done about it?”

Correct, except perhaps creating a histogram so that the CBO understands the data range is skewed and not evenly distributed.

“I never even thought about the subject mentioned here. Thank you for mentioning it.”

No worries 🙂

LikeLike

Reply
9. Brian Tkatch - December 17, 2007: “Correct, except perhaps creating a histogram so that the CBO understands the data range is skewed and not evenly distributed.”

And that is reading for another day. Sheesh, so much to learn. 🙂

LikeLike

Reply
10. Richard Foote - December 18, 2007: One never stops learning, get’s a tad boring otherwise 🙂

LikeLike

Reply
11. meet the needs - December 18, 2017: It’s truly very complex in this active life to listen news
on TV, therefore I simply use world wide web for that purpose,
and get the latest news.

LikeLike

Reply
12. Twyla - October 22, 2018: Great post. I was checking constantly this blog and I am impressed!

Very helpful information particularly the last
part 🙂 I care for such information much. I was looking
for this certain information for a very long time. Thank you and best of luck.

LikeLike

Reply
13. ceme online - October 29, 2018: Judi togel

Outlier Values

LikeLike

Reply
14. Julie - April 11, 2019: Excellent post. I was checking continuously this weblog and
I am impressed! Very useful information specifically the
remaining phase 🙂 I care for such information much. I used to
be seeking this certain information for a long time.
Thanks and good luck.

LikeLike

Reply
15. Taylah - September 28, 2020: First off I would like to say excellent blog!
I had a quick question that I’d like to ask if you don’t mind.
I was interested to find out how you center yourself and clear your thoughts before writing.
I’ve had a difficult time clearing my thoughts in getting my ideas out there.
I truly do take pleasure in writing but it just seems
like the first 10 to 15 minutes are generally lost simply just trying to figure out how to begin. Any recommendations or hints?
Cheers!

LikeLike

Reply

	PP on Costing Concatenated Indexes W…
	Zeeshan Moin on The End Of Richard Foote…
	Richard Foote on The End Of Richard Foote…
	PP on The End Of Richard Foote…
	Ram on The Fake Index Trap (“No…
	Richard Foote on The End Of Richard Foote…
	NormanDunbar on The End Of Richard Foote…
	NormanDunbar on The End Of Richard Foote…
	Richard Foote on The End Of Richard Foote…
	Richard Foote on The End Of Richard Foote…
	Richard Foote on The End Of Richard Foote…
	Richard Foote on The End Of Richard Foote…
	Richard Foote on The End Of Richard Foote…
	Richard Foote on The End Of Richard Foote…
	Sayan Malakshinov on The End Of Richard Foote…

Richard Foote's Oracle Blog

Outlier Values – An Enemy Of The Index December 13, 2007

Comments»

Leave a comment Cancel reply

Contact Details

Richard’s Art

Recent Comments

Blog Visits

Recent Posts

Archives

Richard Foote’s Pages

Top Posts

Richard’s Categories

Feeds

Richard Foote's Oracle Blog

Outlier Values – An Enemy Of The Index December 13, 2007

Share this:

Related

Comments»

Leave a comment Cancel reply

Contact Details

Richard’s Art

Recent Comments

Blog Visits

Recent Posts

Archives

Richard Foote’s Pages

Top Posts

Richard’s Categories

Richard's Cloud

Feeds