Creating and Querying Oracle Text Indexes

 

Overview

Purpose

This tutorial covers the following:

  • Creating a CONTEXT Index and Querying with CONTAINS operator
  • Creating a CTXCAT Index and Querying with CATSEARCH operator
  • Creating a CTXRULE Index and Querying with MATCHES operator

Time to Complete

Approximately 1 hour

Introduction

An Oracle Text index is an Oracle Database domain index. To build your query application, you can create an index of type CONTEXT with a mixture of text and structured data columns, and query it with the CONTAINS operator.
You create an index from a populated text table. In a query application, the table must contain the text or pointers to where the text is stored. Text is usually a collection of documents, but can also be small text fragments.

You can use the CTXCAT index type when your application relies heavily on mixed queries to search small documents or descriptive text fragments based on related criteria such as dates or prices. Query this index with the CATSEARCH operator.
You can create an index of type CTXRULE to build a document classification application using simple or rule-based classification.

The benefits of a creating an Oracle Text index include fast response time for text queries with the CONTAINS, CATSEARCH, and MATCHES Oracle Text operators. With Oracle Text, you can create indexes of several types using CREATE INDEX.


Index Type Description Query Operator Notes
CONTEXT Use this index to build a text retrieval application when your text consists of large coherent documents.
You can index documents of different formats such as MS Word, HTML or plain text.
You can customize the index in a variety of ways. This index type requires CTX_DDL.SYNC_INDEX after DML on base table.
CONTAINS Grammar is called the CONTEXT grammar, which supports a rich set of operations.
The CTXCAT grammar can be used with query templating.
Supports all documents services and query services. Supports indexing of partitioned text tables. Supports FILTER BY and ORDER BY clauses of CREATE INDEX to also index structured column values for more efficient processing of mixed queries.
CTXCAT Use this index type for better mixed query performance. Typically, with this index type, you index small documents or text fragments.
Other columns in the base table, such as item names, prices, and descriptions can be included in the index to improve mixed query performance. This index type is transactional, automatically updating itself after DML to base table. No CTX_DDL.SYNC_INDEX is necessary.
CATSEARCH Grammar is called CTXCAT, which supports logical operations, phrase queries, and wildcarding. The CONTEXT grammar can be used with query templating. Theme querying is supported. This index is larger and takes longer to build than a CONTEXT index. The size of a CTXCAT index is related to the total amount of text to be indexed, the number of indexes in the index set, and the number of columns indexed. Carefully consider your queries and your resources before adding indexes to the index set. The CTXCAT index does not support index partitioning, documents services (highlighting, markup, themes, and gists) or query services (explain, query feedback, and browse words.)
CTXRULE
Use CTXRULE index to build a document classification or routing application. This index is created on a table of queries, where the queries define the classification or routing criteria.
MATCHES
Single documents (plain text, HTML, or XML) can be classified using the MATCHES operator, which turns a document into a set of queries and finds the matching rows in the CTXRULE index.

Prerequisites

Before starting this tutorial, you should:

 

Creating a CONTEXT Index and Querying with CONTAINS operator

To create a CONTEXT index, perform the following steps:

  1. Open a SQL Worksheet using myuser_conn and enter the following code to create a table called QUICK having a single column called TEXT which is VARCHAR2(80).
    Click the 
    Run Statement icon  to run the query.

    CREATE TABLE quick( text VARCHAR2(80));

    alt description here


    alt description here


  2. Run the SQL INSERT statement to load text to the QUICK table.

    INSERT INTO quick VALUES ('the cat sat on the mat');

    alt description here
  3. Enter the following SQL code to create a CONTEXT index on the text column to run a text query against the QUICK table.

    CREATE INDEX quickindex ON quick(text) INDEXTYPE IS CTXSYS.CONTEXT;

    alt description here
  4. Run the SELECT statement with CONTAINS operator.This retrieves the text that satisfy the query.

    SELECT text FROM quick WHERE CONTAINS (text, 'cat') > 0;

    Oracle Challenger bi-plane
  5.  

    Looking at the sub-tables used by Oracle Text

    1. Run the SELECT statement to query the tables from the USER_TABLES dictionary view.

      SELECT table_name FROM user_tables;

      alt description here

      Note: You can see the QUICK table you created, but you can also see that there are four additional tables, with names based on our index_name.

    2. Query the $I table.

      SELECT token_text FROM dr$quickindex$i;

      alt description here

      Note: Here, you can see the words from the "the cat sat on the mat" document. But  "the" and "on" did not appear. They did not appear because they are called the stop words - words that are considered not useful for searching, so they are not indexed. You can customize the list of stopwords for an index. The $I table is sometimes known as the "word list" table. It contains a list of words, each associated with some binary data (in the token_info column) which represents the locations of each word within the document set.

     

    Synchronizing a Text Index

    Synchronizing the index involves processing all pending updates, inserts, and deletes to the base table. You can do this in PL/SQL with the CTX_DDL.SYNC_INDEX procedure.

    1. Add a new row to the QUICK table using the following INSERT statement.

      INSERT INTO quick VALUES ('the cat sat on the dog');

      alt description here
    2. Query the $I table.

      SELECT token_text FROM dr$quickindex$i;

      alt description here

      Note: The word "dog" did not appear because the index has not yet been synchronized.

    3. Execute the following PL/SQL procedure to sync an index.

      EXECUTE ctx_ddl.sync_index('QUICKINDEX');

      alt description here
    4. Query the $I table.

      SELECT token_text FROM dr$quickindex$i;

      alt description here

      Note: The word "dog" appeared in our index because the index has been synced. But there's something else to notice here - do you see that there are multiple rows for "CAT" and "SAT"? This is index fragmentation - due to updates, our index is no longer in an optimal state. You can fix this by running another PL/SQL procedure optimize_index.

     

    Optimizing an Index

      Frequent index synchronization can fragment your CONTEXT index. Index fragmentation can adversely affect query response time. You can optimize your CONTEXT index to reduce fragmentation and index size and so improve query performance.
      This comes in various modes, which are beyond the scope of this tutorial, but for now we will just run it in FULL mode

    1. Execute the following PL/SQL procedure to optimize an index.

      EXECUTE ctx_ddl.optimize_index('QUICKINDEX', 'FULL');

      alt description here
    2. Query the $I table.

      SELECT token_text FROM dr$quickindex$i;

      alt description here

      Note: The index is now in an optimal state - best for space usage and for query performance. The more frequently you sync and index, the more fragmented it will become and the longer it will take to optimize. So any large scale Oracle Text implementation will have to plan how often to sync indexes, and how frequently to run optimize.A common scenario is to sync indexes every five minutes, and optimize them daily.

     

    Querying with CONTAINS operator

    When you create an index of type CONTEXT, you must use the CONTAINS operator to enter your query.

    1. Add two more rows to your QUICK table using the following INSERT statements:

      INSERT INTO quick VALUES ('the cat sat on the cat');
      INSERT INTO quick VALUES ('the dog sat on the cat');

      alt description here
    2. Enter the following code to synchronize your index:

      EXECUTE ctx_ddl.sync_index('QUICKINDEX');

      alt description here
    3. Run the following SELECT statement, specify the query in the WHERE clause with the CONTAINS operator:

      SELECT text FROM quick WHERE CONTAINS (text, 'cat sat') > 0;

      alt description here

      Note: cat sat' is a phrase search. Both words must appear together in the right order. Here, you did not find 'the dog sat on the cat'. If you want to find the words wherever they are we must use the AND operator (which can also be expressed as "&"). Here, the text 'the dog sat on the cat' did not appear.

    4. Run the following SELECT statement, specify the query in the WHERE clause with the CONTAINS operator and the AND operator:

      SELECT text FROM quick WHERE CONTAINS (text, 'cat AND sat') > 0;

      alt description here
     

    The SCORE() Operator

    CONTAINS returns a relevance score for every row selected. You obtain this score with the SCORE operator. The CONTAINS operator must always be followed by the > 0 syntax, which specifies that the score value returned by the CONTAINS operator must be greater than zero for the row to be returned.
    When the SCORE operator is called in the SELECT statement, the CONTAINS operator must reference the score label value in the third parameter.

    1. Run the following query specifying the SCORE() operator.

      SELECT text, score(99) FROM quick WHERE CONTAINS (text, 'cat', 99) > 0;

      alt description here

      Note: Here, the score for the second row is higher - there are two occurrences of "cat" in that one, so it is considered more 'relevant' to the search.

    2. You can order the results from the highest scoring documents to the lowest scoring documents using the ORDER BY clause as follows:

      SELECT text, score(99) FROM quick WHERE CONTAINS (text, 'cat', 99) > 0 ORDER BY SCORE(99) DESC;;

      alt description here
 

Creating a CTXCAT Index and Querying with CATSEARCH Operator

To create a CTXCAT index, perform the following steps:

  1. Open a SQL Worksheet using myuser_conn and enter the following code to create a table called AUCTION.
    Click the 
    Run Statement icon to run the query.

    create table auction(
    item_id number,
    title varchar2(100),
    category_id number,
    price number,
    bid_close date);

    alt description here


  2. Enter the following code to insert the data into the AUCTION table and click Run Script icon  to run the code.

    INSERT INTO auction values(1, 'Sony CD Player', 100, 201,'20-FEB-2000');
    INSERT INTO auction values(2, 'Sony CD Player', 200, 202,'24-FEB-2000');
    INSERT INTO auction values(3, 'Pioneer DVD Player', 250, 203, '25-FEB-2000');
    INSERT INTO auction values(4, 'Sony CD Player', 300, 204, '25-FEB-2000');
    INSERT INTO auction values(5, 'Bose Speaker', 350, 205, '22-FEB-2000');
    INSERT INTO auction values(6, 'Tascam CD Burner', 400, 206, '25-FEB-2000');
    INSERT INTO auction values(7, 'Nikon digital camera', 450, 207, '22-FEB-2000');

    alt description here


  3. A CTXCAT index (like a CONTEXT index) is a "domain" index. Therefore it supports the "PARAMETERS" clause. A number of possible parameter settings are shared with CONTEXT indexes. These are: LEXER, MEMORY, STOPLIST, STORAGE and WORDLIST (no other CONTEXT parameters are supported). Howevwer, the most important parameter is a new one: INDEX SET. INDEX SET defines the structured columns that are to be included in the CTXCAT index.

    To create the sub-indexes, create an index set to contain them using the following code:

    begin
    ctx_ddl.create_index_set('auction_set');
    ctx_ddl.add_index('auction_set', 'price');
    ctx_ddl.add_index('auction_set', 'bid_close');
    end;

    alt description here


  4. The CATSEARCH query operator takes a mandatory text clause and optional structured clause. Note that the title column is NOT part of the INDEX SET.
    In this tutorial, the title column is included in all the queries. Hence, you need to create an index on the title column. Create the combined catalog index using the following CREATE INDEX statement as follows:

    CREATE INDEX auction_index ON auction (title)
    INDEXTYPE IS CTXCAT
    PARAMETERS ('INDEX SET auction_set');

    alt description here


  5. Query the AUCTION table with the CATSEARCH operator as follows:

    SELECT * FROM auction WHERE CATSEARCH(title, 'CD', 'order by price desc')> 0;

    alt description here

    Note: The above query finds all rows that contain the word camera ordered by bid_close.

  6. Run the following query to find all rows that contain the phrase Sony CD Player and that have price less than 400:

    SELECT * FROM auction WHERE CATSEARCH(title, 'Sony CD Player', 'price < 400')> 0;

    alt description here
  7. Run the following query to find all rows that contain the phrase Sony CD Player having price less than 300 ordered by bid_close:

    SELECT * FROM auction WHERE CATSEARCH(title, 'Sony CD Player', 'price < 300
    ORDER BY bid_close')>0;

    alt description here
 

Creating a CTXRULE Index and Querying with MATCHES Operator

To create a CTXRULE index, perform the following steps:

  1. Open a SQL Worksheet using myuser_conn and enter the following code to create a table called myqueries to hold the category name and query text. Click the Run Statement icon to run the query.

    CREATE TABLE myqueries (
    queryid NUMBER PRIMARY KEY,
    category VARCHAR2(30),
    query VARCHAR2(2000));

    alt description here
  2. Insert data into the table using the following code:

    INSERT INTO myqueries VALUES(1, 'US Politics', 'democrat or republican');
    INSERT INTO myqueries VALUES(2, 'Music', 'ABOUT(music)');
    INSERT INTO myqueries VALUES(3, 'Soccer', 'ABOUT(soccer)');

    alt description here
  3. Enter the following CREATE INDEX statement to create the CTXRULE index:

    CREATE INDEX query_rule ON myqueries(query) INDEXTYPE IS ctxsys.ctxrule;

    alt description here
  4. Query the myqueries table with the MATCHES operator as follows:

    SELECT classification FROM myqueries querytable WHERE MATCHES(text, 'Smith is a common name in the United States') > 0;

    alt description here
 

Summary

In this tutorial, you learned to:

  • Create a CONTEXT Index and Querying with CONTAINS operator
  • Create a CTXCAT Index and Querying with CATSEARCH operator
  • Create a CTXRULE Index and Querying with MATCHES operator

Resources

To learn more about Oracle Text refer to:

Credits

Put credits here

  • Lead Curriculum Developer: Dimpi Sarmah
  • Other Contributors: Roger Ford