Technical Article

Regular Expressions and the Java Programming Language

by Dana Nourie and Mike McCloskey
Published August 2001, Updated April 2002

Applications frequently require text processing for features like word searches, email validation, or XML document integrity. This often involves pattern matching. Languages like Perl, sed, or awk improve pattern matching with the use of regular expressions, strings of characters that define patterns used to search for matching text. Pattern matching in the Java programming language previously required the StringTokenizer class, along with many calls to the charAt and substring methods, to read through the characters or tokens and process the text. This often led to complex or messy code.

Until now.

The Java 2 Platform, Standard Edition (J2SE), version 1.4, contains a new package called java.util.regex, enabling the use of regular expressions. The functionality now includes metacharacters, which give regular expressions their versatility.

This article provides an overview of the use of regular expressions, and details how to use regular expressions with the java.util.regex package, using the following common scenarios as examples:

  • Simple word replacement
  • Email validation
  • Removal of control characters from a file
  • File searching

To compile the code in these examples and to use regular expressions in your applications, you'll need to install J2SE version 1.4. [Editor's note: The latest version of Java SE is available here.]

Regular Expression Constructs

A regular expression is a pattern of characters that describes a set of strings. You can use the java.util.regex package to find, display, or modify some or all of the occurrences of a pattern in an input sequence.

The simplest form of a regular expression is a literal string, such as "Java" or "programming." Regular expression matching also allows you to test whether a string fits into a specific syntactic form, such as an email address.

To develop regular expressions, ordinary and special characters are used:

  \  $  ^  .  *  +  ?  [  ]

Any other character appearing in a regular expression is ordinary, unless a \ precedes it.

Special characters serve a special purpose. For instance, the . matches any character except a newline. A regular expression like s.n matches any three-character string that begins with s and ends with n, including sun and son.
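As a minimal sketch of the s.n example, the following class tests a few candidate strings with Pattern.matches, which succeeds only when the whole input fits the pattern. The class name DotExample and the helper method fits are illustrative names, not part of the java.util.regex package.

```java
import java.util.regex.Pattern;

public class DotExample {
    // Returns true only if the entire candidate fits the pattern s.n
    static boolean fits(String candidate) {
        return Pattern.matches("s.n", candidate);
    }

    public static void main(String[] args) {
        System.out.println(fits("sun"));   // true
        System.out.println(fits("son"));   // true
        // By default the dot does not match a newline
        System.out.println(fits("s\nn"));  // false
        // Four characters, so the whole string does not fit
        System.out.println(fits("soon"));  // false
    }
}
```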

There are many special characters used in regular expressions: some find words at the beginning of lines, some control whether a match ignores case or is case-specific, and some give a range, such as a-e, meaning any letter from a to e.

Regular expression syntax in this new package is Perl-like, so if you are familiar with using regular expressions in Perl, you can use the same expression syntax in the Java programming language. If you're not familiar with regular expressions, here are a few constructs to get you started:

Construct         Matches

Characters
  x               The character x
  \\              The backslash character
  \0n             The character with octal value 0n (0 <= n <= 7)
  \0nn            The character with octal value 0nn (0 <= n <= 7)
  \0mnn           The character with octal value 0mnn (0 <= m <= 3, 0 <= n <= 7)
  \xhh            The character with hexadecimal value 0xhh
  \uhhhh          The character with hexadecimal value 0xhhhh
  \t              The tab character ('\u0009')
  \n              The newline (line feed) character ('\u000A')
  \r              The carriage-return character ('\u000D')
  \f              The form-feed character ('\u000C')
  \a              The alert (bell) character ('\u0007')
  \e              The escape character ('\u001B')
  \cx             The control character corresponding to x

Character Classes
  [abc]           a, b, or c (simple class)
  [^abc]          Any character except a, b, or c (negation)
  [a-zA-Z]        a through z or A through Z, inclusive (range)
  [a-z&&[^bc]]    a through z, except for b and c: [ad-z] (subtraction)
  [a-z&&[^m-p]]   a through z, except for m through p: [a-lq-z] (subtraction)
  [a-z&&[def]]    d, e, or f (intersection)

Predefined Character Classes
  .               Any character (may or may not match line terminators)
  \d              A digit: [0-9]
  \D              A non-digit: [^0-9]
  \s              A whitespace character: [ \t\n\x0B\f\r]
  \S              A non-whitespace character: [^\s]
  \w              A word character: [a-zA-Z_0-9]
  \W              A non-word character: [^\w]

Check the documentation about the Pattern class for more specific details and examples.
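The sketch below tries a few of these constructs. Two points are worth keeping in mind: in Java source code each regular-expression backslash must be written as \\ inside a string literal, and java.util.regex writes class intersection and subtraction with the && operator. The class name ConstructExample and its fits helper are illustrative only.

```java
import java.util.regex.Pattern;

public class ConstructExample {
    // Returns true only if the entire input fits the regular expression
    static boolean fits(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // \d is written "\\d" in a Java string literal
        System.out.println(fits("\\d\\d\\d", "404"));          // true
        // a through z, except for b and c
        System.out.println(fits("[a-z&&[^bc]]", "a"));         // true
        System.out.println(fits("[a-z&&[^bc]]", "b"));         // false
        // Word characters, one whitespace character, word characters
        System.out.println(fits("\\w+\\s\\w+", "two words"));  // true
    }
}
```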

Classes and Methods

The following classes match character sequences against patterns specified by regular expressions.

Pattern Class

An instance of the Pattern class represents a regular expression that is specified in string form in a syntax similar to that used by Perl.

A regular expression, specified as a string, must first be compiled into an instance of the Pattern class. The resulting pattern is used to create a Matcher object that matches arbitrary character sequences against the regular expression. Many matchers can share the same pattern because it is stateless.

The compile method compiles the given regular expression into a pattern, then the matcher method creates a matcher that will match the given input against this pattern. The pattern method returns the regular expression from which this pattern was compiled.
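A minimal sketch of these three methods follows: one compiled Pattern is shared by two Matcher objects, and pattern returns the original expression. The class name PatternBasics, the constant AB, and the wholeMatch helper are illustrative names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PatternBasics {
    // Compile once; the resulting Pattern can be shared by many matchers
    static final Pattern AB = Pattern.compile("a*b");

    // Creates a fresh matcher for the input and tests the whole sequence
    static boolean wholeMatch(String input) {
        Matcher m = AB.matcher(input);
        return m.matches();
    }

    public static void main(String[] args) {
        System.out.println(AB.pattern());         // a*b
        System.out.println(wholeMatch("aaaab"));  // true
        System.out.println(wholeMatch("cab"));    // false
    }
}
```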

The split method is a convenience method that splits the given input sequence around matches of this pattern. The following example demonstrates:



/*
 * Uses split to break up a string of input separated by
 * commas and/or whitespace.
 */
import java.util.regex.*;

public class Splitter {
    public static void main(String[] args) throws Exception {
        // Create a pattern to match breaks
        Pattern p = Pattern.compile("[,\\s]+");
        // Split input with the pattern
        String[] result = 
                 p.split("one,two, three   four ,  five");
        for (int i=0; i<result.length; i++)
            System.out.println(result[i]);
    }
}

Matcher Class

Instances of the Matcher class are used to match character sequences against a given pattern. Input is provided to matchers using the CharSequence interface to support matching against characters from a wide variety of input sources.

A matcher is created from a pattern by invoking the pattern's matcher method. Once created, a matcher can be used to perform three different kinds of match operations:

  • The matches method attempts to match the entire input sequence against the pattern.
  • The lookingAt method attempts to match the input sequence, starting at the beginning, against the pattern.
  • The find method scans the input sequence looking for the next sequence that matches the pattern.

Each of these methods returns a boolean indicating success or failure. More information about a successful match can be obtained by querying the state of the matcher.
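The difference between the three operations is easiest to see on the same pattern. The following sketch, with illustrative names MatchKinds, whole, prefix, anywhere, and firstMatchStart, applies the pattern cat to three inputs and then queries the matcher state after a successful find.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchKinds {
    static final Pattern CAT = Pattern.compile("cat");

    // The entire input must fit the pattern
    static boolean whole(String s)    { return CAT.matcher(s).matches(); }
    // The input must start with a match; the rest may be anything
    static boolean prefix(String s)   { return CAT.matcher(s).lookingAt(); }
    // A match may occur anywhere in the input
    static boolean anywhere(String s) { return CAT.matcher(s).find(); }

    // Returns the start index of the first match, or -1 if there is none
    static int firstMatchStart(String s) {
        Matcher m = CAT.matcher(s);
        return m.find() ? m.start() : -1;
    }

    public static void main(String[] args) {
        String[] inputs = { "cat", "catalog", "tomcat" };
        for (int i = 0; i < inputs.length; i++) {
            String s = inputs[i];
            System.out.println(s + ": matches=" + whole(s)
                + " lookingAt=" + prefix(s) + " find=" + anywhere(s));
        }
        System.out.println(firstMatchStart("tomcat"));  // 3
    }
}
```

Here cat succeeds with all three methods, catalog fails only with matches, and tomcat succeeds only with find.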

This class also defines methods for replacing matched sequences by new strings whose contents can, if desired, be computed from the match result.

The appendReplacement method appends everything up to the next match, followed by the replacement for that match. The appendTail method appends the remainder of the input after the last match.

For instance, when replacing cat with dog in the string blahcatblahcatblah, the first appendReplacement call appends blahdog. The second appendReplacement call appends blahdog again, and appendTail appends blah, resulting in: blahdogblahdogblah.
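Because the replacement string is passed on each call, it can be computed from the current match. The sketch below is one way to do that: it upper-cases each match rather than substituting a fixed word. The class name ComputedReplacement and the method shout are illustrative names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ComputedReplacement {
    // Replaces every occurrence of "cat" with its upper-case form
    static String shout(String input) {
        Matcher m = Pattern.compile("cat").matcher(input);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            // group() returns the text of the current match
            m.appendReplacement(sb, m.group().toUpperCase());
        }
        // Copy whatever follows the last match
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(shout("blahcatblahcatblah"));  // blahCATblahCATblah
    }
}
```

Note that $ and \ have special meaning in a replacement string, so computed replacements that may contain those characters need escaping first.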

CharSequence Interface

The CharSequence interface provides uniform, read-only access to many different types of character sequences. You supply the data to be searched from different sources. String, StringBuffer and CharBuffer implement CharSequence, so they are easy sources of data to search through. If you don't care for one of the available sources, you can write your own input source by implementing the CharSequence interface.
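As a rough sketch of this flexibility, the following class runs the same search over a String, a StringBuffer, a CharBuffer, and a small hand-written CharSequence. The names SequenceSources, CharArraySequence, and countNumbers are hypothetical, and the custom class omits the bounds checking a production implementation would need.

```java
import java.nio.CharBuffer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SequenceSources {
    // Hypothetical read-only view over part of a char array (no bounds checks)
    static class CharArraySequence implements CharSequence {
        private final char[] data;
        private final int start;
        private final int end;

        CharArraySequence(char[] data, int start, int end) {
            this.data = data;
            this.start = start;
            this.end = end;
        }
        public int length() { return end - start; }
        public char charAt(int index) { return data[start + index]; }
        public CharSequence subSequence(int from, int to) {
            return new CharArraySequence(data, start + from, start + to);
        }
        public String toString() { return new String(data, start, end - start); }
    }

    // Counts runs of digits in any CharSequence
    static int countNumbers(CharSequence input) {
        Matcher m = Pattern.compile("\\d+").matcher(input);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "10 apples, 20 pears, 30 plums";
        // Each call prints 3
        System.out.println(countNumbers(text));
        System.out.println(countNumbers(new StringBuffer(text)));
        System.out.println(countNumbers(CharBuffer.wrap(text)));
        System.out.println(countNumbers(
            new CharArraySequence(text.toCharArray(), 0, text.length())));
    }
}
```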

Example Regex Scenarios

The following code samples demonstrate the use of the java.util.regex package for various common scenarios:

Simple Word Replacement



/*
 * This code writes "one dog, two dogs in the yard"
 * to the standard-output stream.
 */
import java.util.regex.*;

public class Replacement {
    public static void main(String[] args) 
                         throws Exception {
        // Create a pattern to match cat
        Pattern p = Pattern.compile("cat");
        // Create a matcher with an input string
        Matcher m = p.matcher("one cat," +
                       " two cats in the yard");
        StringBuffer sb = new StringBuffer();
        boolean result = m.find();
        // Loop through and create a new String 
        // with the replacements
        while(result) {
            m.appendReplacement(sb, "dog");
            result = m.find();
        }
        // Add the last segment of input to 
        // the new String
        m.appendTail(sb);
        System.out.println(sb.toString());
    }
}
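When every match receives the same fixed replacement, the loop above can be shortened with the Matcher class's replaceAll method. The following is a sketch of that alternative, not a change to the listing above; the class name ReplaceAllExample and the method catsToDogs are illustrative.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ReplaceAllExample {
    // Replaces every occurrence of "cat" with "dog" in a single call
    static String catsToDogs(String input) {
        Matcher m = Pattern.compile("cat").matcher(input);
        return m.replaceAll("dog");
    }

    public static void main(String[] args) {
        // Prints: one dog, two dogs in the yard
        System.out.println(catsToDogs("one cat, two cats in the yard"));
    }
}
```

The explicit appendReplacement loop remains the better fit when the replacement differs from match to match.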

Email Validation

The following code is a sample of checks for characters that should or should not appear in an email address. It is not a complete email validation program that handles all possible email scenarios, but it can be extended as needed.



/*
* Checks for invalid characters
* in email addresses
*/
import java.util.regex.*;

public class EmailValidation {
   public static void main(String[] args) 
                                 throws Exception {
                                 
      String input = "@sun.com";
      //Checks for email addresses starting with
      //inappropriate symbols like dots or @ signs.
      Pattern p = Pattern.compile("^\\.|^\\@");
      Matcher m = p.matcher(input);
      if (m.find())
         System.err.println("Email addresses don't start" +
                            " with dots or @ signs.");
      //Checks for email addresses that start with
      //www. and prints a message if it does.
      p = Pattern.compile("^www\\.");
      m = p.matcher(input);
      if (m.find()) {
        System.out.println("Email addresses don't start" +
                " with \"www.\", only web pages do.");
      }
      //Removes any characters other than letters, digits,
      //and the symbols . @ _ - ~ #
      p = Pattern.compile("[^A-Za-z0-9\\.\\@_\\-~#]+");
      m = p.matcher(input);
      StringBuffer sb = new StringBuffer();
      boolean result = m.find();
      boolean deletedIllegalChars = false;

      while(result) {
         deletedIllegalChars = true;
         m.appendReplacement(sb, "");
         result = m.find();
      }

      // Add the last segment of input to the new String
      m.appendTail(sb);

      input = sb.toString();

      if (deletedIllegalChars) {
         System.out.println("It contained incorrect characters" +
                           ", such as spaces or commas.");
      }
   }
}

Removing Control Characters from a File



/* This class removes control characters from a named
*  file.
*/
import java.util.regex.*;
import java.io.*;

public class Control {
    public static void main(String[] args) 
                                 throws Exception {
                                 
        //Create file objects for the input and
        //output files:
        File fin = new File("fileName1");
        File fout = new File("fileName2");
        //Open an input and an output stream
        FileInputStream fis = 
                          new FileInputStream(fin);
        FileOutputStream fos = 
                        new FileOutputStream(fout);

        BufferedReader in = new BufferedReader(
                       new InputStreamReader(fis));
        BufferedWriter out = new BufferedWriter(
                      new OutputStreamWriter(fos));

        // The pattern matches control characters
        Pattern p = Pattern.compile("\\p{Cntrl}");
        Matcher m = p.matcher("");
        String aLine = null;
        while((aLine = in.readLine()) != null) {
            m.reset(aLine);
            //Replaces control characters with an empty
            //string.
            String result = m.replaceAll("");
            out.write(result);
            out.newLine();
        }
        in.close();
        out.close();
    }
}
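To try the control-character pattern without creating files, the replacement step can be isolated as in the sketch below. It assumes the POSIX character class \p{Cntrl}, which java.util.regex defines as the ASCII control characters; the class name StripControl and the method strip are illustrative.

```java
import java.util.regex.Pattern;

public class StripControl {
    // \p{Cntrl} matches ASCII control characters (0x00-0x1F and 0x7F)
    private static final Pattern CONTROL = Pattern.compile("\\p{Cntrl}");

    // Returns the line with every control character removed
    static String strip(String line) {
        return CONTROL.matcher(line).replaceAll("");
    }

    public static void main(String[] args) {
        // The bell and escape characters are dropped from the output
        System.out.println(strip("bell\u0007 and escape\u001B removed"));
    }
}
```

Because tab is also a control character, this pattern removes tabs as well; a narrower character class would be needed to keep them.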

File Searching



/*
 * Prints out the comments found in a .java file.
 */
import java.util.regex.*;
import java.io.*;
import java.nio.*;
import java.nio.charset.*;
import java.nio.channels.*;

public class CharBufferExample {
    public static void main(String[] args) throws Exception {
        // Create a pattern to match comments
        Pattern p = 
            Pattern.compile("//.*$", Pattern.MULTILINE);
        
        // Get a Channel for the source file
        File f = new File("Replacement.java");
        FileInputStream fis = new FileInputStream(f);
        FileChannel fc = fis.getChannel();
        
        // Get a CharBuffer from the source file
        ByteBuffer bb =
            fc.map(FileChannel.MapMode.READ_ONLY, 0, fc.size());
        Charset cs = Charset.forName("8859_1");
        CharsetDecoder cd = cs.newDecoder();
        CharBuffer cb = cd.decode(bb);
        
        // Run some matches
        Matcher m = p.matcher(cb);
        while (m.find())
            System.out.println("Found comment: "+m.group());
    }
}

Conclusion

Pattern matching in the Java programming language is now as flexible as in many other programming languages. Regular expressions can be put to use in applications to ensure data is formatted correctly before being entered into a database, or sent to some other part of an application, and they can be used for a wide variety of administrative tasks. In short, you can use regular expressions anywhere in your Java programming that calls for pattern matching.

About the Authors

Dana Nourie is a JDC technical writer. She enjoys exploring the Java platform, especially creating interactive web applications using servlets and JavaServer Pages technologies, such as the JDC Quizzes and Learning Paths and Step-by-Step pages. She is also a scuba diver and is looking for the Pacific Cold Water Seahorse.

Mike McCloskey is a Sun engineer, working in Core Libraries for J2SE. He has made contributions in java.lang, java.util, java.io and java.math, as well as the new packages java.util.regex and java.nio. He enjoys playing racquetball and writing science fiction.