jump to navigation

Introduction To Linguistic Indexes – Part I January 3, 2008

Posted by Richard Foote in Index Access Path, Indexing Tricks, Linguistic Indexes, Oracle Indexes.
trackback

Characters are sorted by default based on numeric values defined by the default character encoding scheme (known as Binary Sorting). For us Australians, this is fine as we (generally) speak English and the English alphabet is nicely sorted in ascending order by ASCII and EBCDIC standards. However, many other languages are not so fortunate as the binary sort does not sort the data in many language’s alphabetic sort order.  Oracle has many Globalization Support features to help users in other languages get over these issues (all very interesting and topics for many a Blog entry in the future).

However, even us Australians have issues when it comes to “case-insensitive” searches, where data may be stored in many different cases (eg. Ziggy, ZIGGY, ZiGgY, etc.) and we want to return all data that matches a character value, regardless of its case.

The issue of course is that by default, all text searches are case-sensitive. For example a search WHERE name=’ZIGGY’ will only return ‘ZIGGY’ but not ‘Ziggy’ or ‘ZiGgY’ etc.

The standard fix is for the application to convert the data to a consistent case when performing the search. For example a search WHERE UPPER(Name) = ‘ZIGGY’ will return all values of “ZIGGY” regardless of their case but this will negate the use of any standard index on the Name column.

Therefore, a Function-Based index is required, say based on UPPER(Name), to ensure an efficient index access is possible for case insensitive searches.

However, this often requires an additional index to be created and for the application to be explicitly written to make use of the function-based index defined function.

Now the best cure for this problem is simply to ensure all data is stored in a consistent case (ZIGGY, ZIGGY, ZIGGY) but this may not always be practical or even desirable in some cases.

Another possible solution is the use of a Linguistic Index. This is an index that is created based on a specific case insensitive linguistic language or multilingual option that ensures the index entries are sorted in the linguistic language order, not on the default binary order of the database encoding scheme.

Basic steps are:

1) Create a Linguistic Index, eg.

CREATE INDEX case_search_ling_name_i ON case_search(NLSSORT(name,’NLS_SORT=GENERIC_M_CI’));

2) Set NLS_SORT in the session (or set parameter) to use the required Linguistic sort option , eg.

ALTER SESSION SET NLS_SORT=’GENERIC_M_CI';   

Simply append _CI in the Linguistic sort option to make it Case-Insensitive or _AI to make it Accent-Insensitive.

(Note: if binary ordering is generally adequate, NLS_SORT can simply be set to ‘BINARY_CI’ for Binary Case-Insensitive searches)

3) Set NLS_COMP in the session (or set parameter) to use Linguistic Sorts/Case Insensitive Searches, eg.

ALTER SESSION SET NLS_COMP=’LINGUISTIC';

A search now based on WHERE name=’ZIGGY’ will automatically perform a case-insensitive search without the need to modify the application to use specific functions.

For a full demo, see Use Linguistic Indexes Demo.

However, before you rush out and start using Linguistic Indexes to possibly simplify the use of case insensitive searches, note there are various disadvantages to Linguistic Indexes, which can somewhat dampen their appeal. These will be covered in Part II of this series.

About these ads

Comments»

1. Brian Tkatch - January 3, 2008

At the end of the example, there were three INDEXes.

INDEX 1: case_search_name_i: name
INDEX 2: case_search_upper_name_i: UPPER(name)
INDEX 3: case_search_ling_name_i: NLSSORT(name,’NLS_SORT=GENERIC_M_CI’)

Case 1: WHERE name = ‘ZIGGY';

– INDEX 1 used, because it was the only one.

Case 2: WHERE UPPER(name) = ‘ZIGGY';

– INDEX 1 skipped because there was a FUNCTION in the search
– INDEX 2 used because the FBI’s FUNCTION matched the one used.

Case 3: WHERE name = ‘ZIGGY';

– INDEX 1 skipped because the NLS_SORT did not match.
– INDEX 2 skipped because there is no FUNCTION in the statement.
– INDEX 3 used because the NLS_SORT matched.

Is that correct?

2. Richard Foote - January 4, 2008

Hi Brian

Yep, basically correct

Case 1, used because it was the only one and Oracle found it cheaper than performing a full table scan. But it can’t be used for case-insensitive searches unless you actually list all possible permutations in an IN list or some such.

Case 3, INDEX 2 was skipped also because it’s a binary index and binary indexes are ignored if the NLS_SORT value doesn’t match, as with INDEX 1.

But that’s a discussion for another day ;)

3. techblog » Blog Archive » Oracle Linguistic Indexes - January 4, 2008

[...] at Nominet. We have been following the writings of Richard Foote and in particular an article on Linguistic Indexes. I thought the article interesting though somewhat obscure and filed it at the back of [...]

4. César Arriaga - March 5, 2010

When I read it, I was very pleased about the solution given.

Afterwards I noticed that fixing LINGUISTIC, all my database’s PK and INDEXES of char types where not usable, because they wheren’t based on the linguistic function.

That is, I have fields (used in PKs and UNIQUE indexes) for which it has no sense to treat them as linguistic.

Then I tried to use the linguistic comparisson using NLSSORT in the where clause (only in these queries where the fields have sense, ie a person name) without fixing NLS_SORT to an AI collation:

SELECT * FROM people
WHERE NLSSORT(name, ‘NLS_SORT=BINARY_AI’) LIKE NLSSORT(‘Ziggy’, ‘NLS_SORT=BINARY_AI’);

But it presents to serious problems:
-The linguistic index is not used.
-I am not able to get results when the pattern has wildcards (%).

¿Any solution?

Richard Foote - March 5, 2010

Hi César

Yes, I mention these issues in Part II:

http://richardfoote.wordpress.com/2008/01/09/introduction-to-linguistic-indexes-part-ii/

They unfortunately need to be design and incorporated at the start of the design process else mixing binary and linguistic requirements becomes problematic.

A possible option would be to use the application or a logon trigger to set the necessary session parameters when and as necessary and then set then back again as appropriate.

César Arriaga - March 6, 2010

Thanks a lot Richard for your rapid response! (When you are in trouble that is well worth).

Sorry cause some of my comments were covered in part II, and didn’t notice.

But there is a thing still a bit uncovered, that is: As you say from 11g on, LIKE works fine and linguistic indexes are considered by CBO. The question is that, this is true only when you have set NLS variables and you use LIKE directly on fields-params, but it doesn’t work when you apply NLSSORT function in the WHERE clause.

These are the cases (afer having a BINARY_AI index created on “name” column, in 11.1.0.7g and NLS_SORT=BINARY_AI and NLS_COM=LINGUISTIC in all cases):

1. WHERE name LIKE ‘ZiGGy%';

Works fine (returns expected) and index is used as you stated.

2. Using ‘=’ with NLSSORT in WHERE:
WHERE NLSSORT(name, ‘NLS_SORT=BINARY_AI’) = NLSSORT(‘ZiGGy’, ‘NLS_SORT=BINARY_AI’);

Works fine (returns expected as there is a record with name=’Ziggy’) and INDEX IS USED!!!

3. Using LIKE:
a) WHERE NLSSORT(name, ‘NLS_SORT=BINARY_AI’) = NLSSORT(‘ZiGGy’, ‘NLS_SORT=BINARY_AI’);

Works fine (returns expected as there is a record with name=’Ziggy’) and but the index is NOT used.

b) Introducing ‘%':
WHERE NLSSORT(name, ‘NLS_SORT=BINARY_AI’) = NLSSORT(‘ZiGG%’, ‘NLS_SORT=BINARY_AI’);

No records, nor index used.

My questions to you could be:
-Do you think case 3.a (index not used with LIKE when ‘=’ uses it) can be considered a reportable bug to Oracle?
-What is wrong with 3.b?

Thanks in advance.

5. Richard Foote - March 6, 2010

Hi César

Please note I’m now on my third glass of champagne in Qantas club so I might not be mentally on top form ;)

However I believe, 3.b doesn’t work because it’s looking for the CI string ‘ZiGG%’ which doesn’t exist because it’s treating the % as a literal, not as a wildcard. To perform a LIKE type search, you need to specify the LIKE condition, else Oracle can’t know what you trying to do.

So are you really using LIKE in your 3. examples ???

6. César Arriaga - March 6, 2010

Hi Richard.

Qantas, Champagne and browsing the web at the same time… charming!!! (far from here I suppose).

Sorry again, cause 3. was wrong (cut & paste for this post), and this is what should have been (LIKE):

3.a: Without wildcards:
WHERE NLSSORT(name, ‘NLS_SORT=BINARY_AI’) LIKE NLSSORT(‘ZiGGy’, ‘NLS_SORT=BINARY_AI’);

3.b: With ‘%':
WHERE NLSSORT(name, ‘NLS_SORT=BINARY_AI’) LIKE NLSSORT(‘ZiGG%’, ‘NLS_SORT=BINARY_AI’);

So, about your question: In Oracle’s doc. I have read that instead of setting NLS_SORT to an AI collation and NLS_COMP to linguistic, what you can do is (always with an underlying linguistic index defined) use the NLSSORT() in the WHERE clause (as shown in 2. and 3. examples).

That sounded very interesting as I can maintain all the rest of char based indexes as no linguistic (think also of indexes defined by constraints that otherwise we should have to duplicate), and focus on the queries that need a linguistic search (using NLSSORT()).

In the 2. example (with ‘=’ instead of LIKE), CBO detects a linguistic predicate and searchs for a compatible linguistic index, that is found and used.

But when we issue the 3.a example, what I am not planning to use and just tried it to see how CBO was performing, we can see that the search works (without ‘%’) but the index is NOT being used.

3.b is the one I really would like to use as it eliminates the explained collateral problems on traditional indexes. In this case what is happening is that:
1. While with ‘=’ the linguistic index is being considered, with LIKE it is NOT (the same to 3.a). Why? Bug or feature?

2. The wildcard issue. I share your explanation. But it sounds to me a bug, cause if you set NLS parameters and issue a LIKE with wildcards CBO has no problems to separate de wildcard for the rest of the text (that CBO must convert to its linguistic form).

So the two big problems are:
-LIKE + NLSSORT() -> Index is not used (yes when ‘=’+NLSSORT())
-NLSSORT(‘%’): % is not working as wildcard so no matching records are get.

So close to this nice alternative…. :(

7. Richard Foote - March 8, 2010

Hi César

The short answer is I’ll need to check when I get back home. As I’ve done once before, I just discovered I didn’t bring my 3 pin power adaptor for my laptop here in the US so I’ve got about 3 hours of battery life life :(


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 1,821 other followers

%d bloggers like this: