Using Regular Expressions

Purpose

In this tutorial, you learn how to use regular expression support. This new feature is introduced in Oracle Database 10g.

Time to Complete

Approximately 30 minutes

Topics

This tutorial covers the following topics:

	Overview
	Prerequisites
	Using Regular Expressions in Oracle Database 10g
	Using Basic Searches
	Using Multilingual Capabilities
	Regular Expressions and Check Constraints
	Summary

Viewing Screenshots

Place the cursor over this icon to load and view all the screenshots for this tutorial. (Caution: This action loads all screenshots simultaneously, so response time may be slow depending on your Internet connection.)

Note: Alternatively, you can place the cursor over an individual icon in the following steps to load and view only the screenshot associated with that step. You can hide an individual screenshot by clicking it.

Overview

In Oracle Database 10g, you can use both SQL and PL/SQL to implement regular expression support. Regular expressions are a method of describing both simple and complex patterns for searching and manipulating. String manipulation and searching contribute to a large percentage of the logic in a Web-based application. Usage ranges from the simple (for example, find the word San Francisco in a specified text) to the complex (for example, extract all URLs from the text) to the more complex (for instance, find all words in which every second character is a vowel).

Oracle Database 10g introduces support for regular expressions. The implementation complies with the Portable Operating System for UNIX (POSIX) standard, issued by the Institute of Electrical and Electronics Engineers (IEEE), for ASCII data-matching semantics and syntax. Oracle's multilingual capabilities extend the matching capabilities of the operators beyond the POSIX standard.

When coupled with native SQL, the use of regular expressions makes it possible to perform powerful search and manipulation operations on any data stored in an Oracle database. You can use this feature to easily solve problems that would otherwise be very complex to program.

Back to Topic List

Prerequisites

Before you perform this tutorial, you should:

1.	Perform the Installing Oracle Database 10g on Windows tutorial.
2.	Download and unzip the regexp.zip file into your working directory(c:\wkdir).

Back to Topic List

Using Regular Expressions in Oracle Database 10g

Matching Mechanism

If you have a string aabcd and you specify a search for a(b|c)d, the search looks for a followed by either b or c, which is then followed by d.

Regular Expression: 'a(b|c)d'
String to Match:    'aabcd'

a	a	b	c	d	Description	Result
*					Look for a and succeed	Match
	*				Look for b and fail	No match
	*				Look for c and fail, reset, and advance	No match
	*				Look for a and succeed	Match
		*			Look for b and succeed; remember c as alternative	Match
			*		Look for d and fail	No match
		*			Look for c as last remembered alternative and fail, reset, and advance	No match
		*			Look for a and fail, reset, and advance	No match
			*		Look for a and fail, reset, and advance	No match
				*	Look for a and fail, reset, and advance	No match

Given the string aabcd, a(b|c)d does not match it.

To implement regular expression support in either SQL or PL/SQL, you use a new set of functions. These functions are:

Function Name	Description
REGEXP_LIKE	Similar to the LIKE operator, but performs regular expression matching instead of simple pattern matching
REGEXP_INSTR	Searches for a given string for a regular expression pattern and returns the position were the match is found
REGEXP_REPLACE	Searches for a regular expression pattern and replaces it with a replacement string
REGEXP_SUBSTR	Searches for a regular expression pattern within a given string and returns the matched substring

POSIX Metacharacters in Oracle Database Regular Expressions

Metacharacters are special characters that have a special meaning, such as a wild card character, a repeating character, a nonmatching character, or a range of characters.

You can use several predefined metacharacter symbols in the pattern matching with the functions.

Symbol	Description
*	Matches zero or more occurrences
\|	Alternation operator for specifying alternative matches
^/$	Matches the start of line and the end of line
[]	Bracket expression for a matching list matching any one of the expressions represented in the list
[^exp]	If the caret is inside the bracket, it negates the expression.
{m}	Matches exactly m times
{m,n}	Matches at least m times but no more than n times
[: :]	Specifies a character class and matches any character in that class
\	Can have four different meanings: (1) stand for itself; (2) quote the next character; (3) introduce an operator; (4) do nothing
+	Matches one or more occurrences
?	Matches zero or one occurrence
.	Matches any character in the supported character set (except NULL)
()	Grouping expression (treated as a single subexpression)
\n	Backreference expression
[==]	Specifies equivalence classes
[..]	Specifies one collation element (such as a multicharacter element)

Perl Regular Expression Extensions

In addition to the POSIX standard, Oracle supports the common Perl-influenced meta characters. If you are a life sciences developer who relies on Perl to do pattern analysis on bioinformatics data stored in huge databases of DNAs and proteins, you can use SQL Regular Expression support directly on the data rather than from the middle tier. This provides you with a more efficient solution. The metacharacters added for Perl compatability are:

Operator	Description
\d	Match a digit character
\D	Match a non-digit character
\w	Match a word character
\W	Match a non-word character
\s	Match a whitespace character
\S	Match a non-whitespace character
\A	Match only at beginning of string
\Z	Match only at end of string, or before newline at the end
\z	Match only at end of string
*?	Match 0 or more times (non-greedy)
+?	Match 1 or more times (non-greedy)
??	Match 0 or 1 time ( non-greedy)
{n}?	Match exactly n times (non-greedy)
{n,}?	Match at least n times (non-greedy)
{n,m}?	Match at least n but not more than m times (non-greedy)

For more information on Perl-Influenced Extensions in Oracle Regular Expressions, refer to the Oracle Database Application Developer's Guide - Fundamentals 10g Release 2 (10.2), chapter 4 "Using Regular Expressions in Oracle Database".

Back to Topic List

Using Basic Searches

You can use the regular expression functions to perform basic searches.

From a terminal window, change to the wkdir directory.

cd\wkdir

Set your NLS_LANG environment setting as follows:

set NLS_LANG=.AL32UTF8

Start SQL*Plus, connect to Oracle with user ID and password oe/oe.

sqlplus oe/oe

Examine the syntax of the REGEXP_LIKE function:

REGEXP_LIKE(srcstr, pattern [,match_option])

In this function:

srcstr: is the search value
pattern: is the regular expression
match_option: provides the option to change default matching. This can include one or more of the following values:

'c': Use case-sensitive matching (default)
'i': Use case-insensitive matching
'n': Allow match-any-character operator to match the newline character
'm': Treat source string as multiple line

To locate all products with names containing SSP/S, SSP/V, SSS/V, or SSS/S in the PRODUCT_NAME column from the PRODUCT_INFORMATION table, execute the following script:

@relike.sql

SELECT product_name
FROM oe.product_information
WHERE regexp_like (product_name, 'SS[PS]/[VS]');

The REGEXP_INSTR function returns the position of a given pattern within a string. Examine the syntax:

REGEXP_INSTR(srcstr, pattern [, position [, occurrence
[, return_option [, match_option]]]])

In this function:

position: is the search starting position
occurrence: is the occurrence to search for
return_option: Indicates the start or end position of occurrence
match_option: provides the option to change default matching. This can include one or more of the following values:

'c': Use case-sensitive matching (default)
'i': Use case-insensitive matching
'n': Allow match-any-character operator to match the newline character
'm': Treat source string as multiple line

To search the product names to find the location of the first nonalphabetic character (regardless of whether it is uppercase or lowercase), execute the following script:

@reinstr.sql

COLUMN product_name FORMAT a45
COLUMN non_alpha FORMAT 9999999999
SELECT product_name, REGEXP_INSTR(product_name, '[^[:alpha:]]') non_alpha
FROM oe.product_information ;

Note that [^[:<class>:]] implies a character class and matches any character from within that class; [:alpha:] matches any alphabetic character. In this case, you are negating this expression by using ^.

The REGEXP_SUBSTR function returns a given string based on a pattern of occurrence. Examine the syntax:

REGEXP_SUBSTR(srcstr, pattern [, position
[, occurrence [, match_option]]])

In this function:

position: is the search starting position
occurrence: is the occurrence to search for
match_option: provides the option to change default matching. This can include one or more of the following values:

'c': Use case-sensitive matching (default)
'i': Use case-insensitive matching
'n': Allow match-any-character operator to match the newline character
'm': Treat source string as multiple line

You want to extract the e-mail names from the CUSTOMERS table, but extract only those e-mail names for the customers located in Switzerland. To do this, you need to return the contents in the CUST_EMAIL column that precedes the @ symbol for customers with an NLS_TERRITORY of Switzerland. Execute the following script:

@resubstr.sql

SELECT REGEXP_SUBSTR(cust_email, '[^@]+')
FROM oe.customers
WHERE nls_territory = 'SWITZERLAND' ;

Note that in this example, the result returns the first substring that does not have the @ symbol.

The REGEXP_REPLACE function returns a given string with every occurrence of the pattern replaced with a replacement string. Examine the syntax:

REGEXP_REPLACE(srcstr, pattern [,replacestr [, position
[, occurrence [, match_option]]]])

In this function:

position: is the search starting position
occurrence: is the occurrence to search for
replacestr: is the character string replacing pattern
match_option: provides the option to change default matching. This can include one or more of the following values:

'c': Use case-sensitive matching (default)
'i': Use case-insensitive matching
'n': Allow match-any-character operator
'm': Treat source string as multiple line

To return information from the CATALOG_URL column in the PRODUCT_INFORMATION table, you can perform a full scan on the column. However, this would result in hundreds of rows being returned, because it lists a specific HTML page location within a number of catalog domains.

In this example, you want to find only the individual domain names themselves and not the lower-level pages they contain. To find the domain names without all the extraneous information, use the REGEXP_REPLACE function. Execute the following script:

@rereplace.sql

SELECT UNIQUE REGEXP_REPLACE (catalog_url, 'http://([^/]+).*', '\1')
FROM oe.product_information ;

Here is an explanation of how the string was processed:

http://	The expression starts by looking for this string literal; there are no special metacharacters here.
([^/]+)	Then, the expression looks for a series of characters provided that they are "not" slash (/).
.*	The expression finishes by consuming the rest of the string with this part of the expression.
\1	The matching expression is replaced with backreference 1, which is whatever was matched between the first set of parentheses.

Back to Topic List

Using Multilingual Capabilities

Regular expression functions support multilingual capabilities and can be used in locale-sensitive applications. To combine the use of regular expressions with Oracle's NLS Language feature, perform the following steps:

1.	Execute the following script to find the product description in the Portuguese language: @multiport.sql SELECT regexp_substr(translated_name, '^[a-z]+') FROM oe.product_descriptions WHERE language_id = 'PT' AND translated_name like 'G%' ; Note that the data is not displayed. The ^ is outside the bracket, which means that you are searching for any strings or substrings that start with any character from a to z.
2.	Perform the same query, but this time use the case-insensitive 'i' switched on. Execute the following script: @multiport2.sql SELECT regexp_substr(translated_name, '^[a-z]+', 1, 1, 'i') FROM oe.product_descriptions WHERE language_id = 'PT' AND translated_name like 'G%' ;
3.	The results are still incomplete because the returned strings are trimmed as soon as a non-English character is encountered. This is because the range [a-z] is sensitive to NLS_LANGUAGE. Range expressions are sensitive to NLS_SORT, which in this example, is derived from NLS_LANGUAGE. You need to set the NLS_LANGUAGE parameter appropriately to return the complete results. Execute the following query: @multiport3.sql ALTER SESSION SET NLS_LANGUAGE=PORTUGUESE; SELECT regexp_substr(translated_name, '^[a-z]+', 1, 1, 'i') FROM oe.product_descriptions WHERE language_id = 'PT' AND translated_name like 'G%' ;
4.	The final step is to view the results in both English and Portuguese to ensure that the translation has taken place. Execute the following script: @multiport4.sql SELECT REGEXP_SUBSTR(i.product_name, '^[a-z]+', 1, 1, 'i') \|\| ' = ' \|\| regexp_substr(d.translated_name, '^[a-z]+', 1, 1, 'i') FROM oe.product_descriptions d, oe.product_information i WHERE d.language_id = 'PT' AND d.translated_name like 'G%' AND i.product_id = d.product_id ; ALTER SESSION SET NLS_LANGUAGE=AMERICAN;

Back to Topic List

Using Regular Expressions and Check Constraints

You can use regular expressions with check constraints. When you define the check constraint, you can add the regular expression syntax condition to check that the data complies with the constraint.

1.	Adding a check constraint on the CUST_EMAIL column of the CUSTOMERS table ensures that only strings containing an @ symbol are accepted. Execute the following script: @chkemail.sql INSERT INTO customers VALUES (9999,'Christian','Patel', cust_address_typ ('1003 Canyon Road','87501', 'Santa Fe','NM','US'), phone_list_typ ('+1 505 243 4144'),'us','AMERICA','100', 'ChrisP+creme.com', 149, null, null, null, null, null) ; Because there was no validation being performed, an e-mail address not containing an @ symbol was accepted. After the script runs, issue the ROLLBACK statement at the SQL> prompt before starting the next step. ROLLBACK;
2.	Implement the constraint by executing the following script: @chkemail2.sql ALTER TABLE customers ADD CONSTRAINT cust_email_addr CHECK(REGEXP_LIKE(cust_email,'@'))NOVALIDATE ;
3.	Test the constraint by executing chkemail.sql again. @chkemail.sql The check constraint is violated because the e-mail address does not contain the required symbol. The NOVALIDATE clause ensures that existing data is not checked.
4.	Remove the constraint by executing the following script: @chkemail3.sql ALTER TABLE customers DROP CONSTRAINT cust_email_addr ;

Back to Topic List

Summary

In this tutorial, you learned how to:

	Perform a Variety of Searches to Access Data by Using Regular Expressions
	Access Data in Other Languages by Using Regular Expressions

Back to Topic List

Purpose

Topics

Viewing Screenshots

Overview

Prerequisites

Using Regular Expressions in Oracle Database 10g

Matching Mechanism

POSIX Metacharacters in Oracle Database Regular Expressions

Using Basic Searches

Using Multilingual Capabilities

Using Regular Expressions and Check Constraints

Summary

Place the cursor over this icon to hide all screenshots.