User Lexer Sample Code

What's a User Lexer?

Oracle Text allows you to provide your own "plug-in" modules at various points in the indexing chain. For example, you can provide a user datastore, a user filter, and now (as of 9.2) a user lexer.

So what's a lexer?

The lexer is the component responsible for splitting text into individual words, or tokens. It is also responsible for processing compound words in languages such as German and Dutch, where the word "redsportscar" might need to be indexed as "red", "sports" and "car".

Why C?

The lexer has a lot of work to do: it is called for every row in the table as it is indexed, and again for every query. Experience has shown that neither PL/SQL nor Java is really fast enough for this task, so the lexer has to be implemented as an external procedure in C.

What is this demo?

This demo provides a simple user lexer which breaks text into whitespace-delimited tokens and upper-cases them. This is basically the same behavior as the default English lexer in Oracle Text. It is designed as a "shell" into which users can fit their own special language processing requirements.
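The core idea (split on whitespace, upper-case each token) can be sketched in a few lines of C. This is only an illustration and is not taken from user_lexer.c: the function name next_token and its signature are assumptions made for this example, and the real lexer must also follow the Oracle Text user lexer calling conventions, which are not shown here.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper, NOT part of user_lexer.c.
 * Copies the next whitespace-delimited token from *cursor into out,
 * upper-casing each byte, and advances *cursor past the token.
 * Tokens longer than out_size - 1 bytes are truncated.
 * Returns 1 if a token was produced, 0 at end of input. */
int next_token(const char **cursor, char *out, size_t out_size)
{
    const char *p = *cursor;
    size_t n = 0;

    if (out_size == 0)
        return 0;

    /* Skip any leading whitespace. */
    while (*p != '\0' && isspace((unsigned char)*p))
        p++;

    if (*p == '\0') {
        /* Nothing left but whitespace: report end of input. */
        *cursor = p;
        out[0] = '\0';
        return 0;
    }

    /* Copy up to the next whitespace character, upper-casing as we go. */
    while (*p != '\0' && !isspace((unsigned char)*p)) {
        if (n + 1 < out_size)
            out[n++] = (char)toupper((unsigned char)*p);
        p++;
    }
    out[n] = '\0';
    *cursor = p;
    return 1;
}
```

A caller would point a cursor at the start of the document text and call next_token in a loop until it returns 0. Note that this sketch works byte by byte, so it shares the single-byte assumption of the demo.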

How do I install it?

See the comments at the top of user_lexer.c

Instructions are currently provided for Windows only; Unix instructions will follow.

After compiling and installing the C dynamic link library, you should run user_lexer.sql, which will install the necessary PL/SQL calling procedures and demonstrate the use of the user lexer on a simple table.

Note that the C code currently writes debugging information to the file "C:\debug.txt". You may want to change this path, or remove the references to it completely.
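One way to make the debug output easier to redirect or switch off is to route it through a single macro. The following is a sketch only: the names LEXER_DEBUG, LEXER_DEBUG_FILE, DEBUG_LOG and debug_log_to are hypothetical and do not appear in user_lexer.c, so the existing debug calls would have to be adapted by hand.

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical names, NOT taken from user_lexer.c.
 * Override the path at compile time by defining LEXER_DEBUG_FILE
 * on the compiler command line. */
#ifndef LEXER_DEBUG_FILE
#define LEXER_DEBUG_FILE "C:\\debug.txt"
#endif

/* Appends one printf-style formatted line to the file at path.
 * Returns 0 on success, -1 if the file cannot be opened. */
int debug_log_to(const char *path, const char *fmt, ...)
{
    va_list args;
    FILE *fp = fopen(path, "a");

    if (fp == NULL)
        return -1;

    va_start(args, fmt);
    vfprintf(fp, fmt, args);
    va_end(args);

    fputc('\n', fp);
    fclose(fp);
    return 0;
}

/* With LEXER_DEBUG defined, DEBUG_LOG writes to LEXER_DEBUG_FILE;
 * without it, every DEBUG_LOG call compiles away to nothing. */
#ifdef LEXER_DEBUG
#define DEBUG_LOG(...) debug_log_to(LEXER_DEBUG_FILE, __VA_ARGS__)
#else
#define DEBUG_LOG(...) ((void)0)
#endif
```

A call such as DEBUG_LOG("token=%s", tok) then costs nothing in a build without LEXER_DEBUG. Opening and closing the file on every call is slow but keeps the sketch simple; the file is never left open if the external procedure is unloaded.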

What are the limitations?

  • It probably doesn't work with multi-byte characters (full UTF-8 support is the next planned enhancement)
  • There is no attempt at compound word processing
  • There is no marking of end-of-sentence or end-of-paragraph

Is it supported?

No. This is SAMPLE CODE, which is not supported. You can request help from Oracle Support on issues which arise from using this code, but you cannot expect them to debug problems with the code itself. If you email the author at roger.ford@oracle.com, he will do his best to help you, but cannot promise any specific level of support.

Is it tested?

Yes, but only on a single machine at present, and only with the data included in user_lexer.sql. Please send any feedback to the author.