Supplementary Characters in the Java Platform

By Norbert Lindenberg and Masayoshi Okutsu, Sun Microsystems, Inc., May 2004

Abstract

This article describes how supplementary characters are supported in the Java platform. Supplementary characters are characters in the Unicode standard whose code points are above U+FFFF, and which therefore cannot be represented as single 16-bit values of the char data type in the Java programming language. Such characters are generally rare, but some are used, for example, as part of Chinese and Japanese personal names, and so support for them is commonly required for government applications in East Asian countries.

The Java platform is being enhanced to enable processing of supplementary characters with minimal impact on existing applications. New low-level APIs enable operations on individual characters where necessary. Most text-processing APIs, however, use character sequences, such as the String class or character arrays. These are now interpreted as UTF-16 sequences, and the implementations of these APIs have been changed to handle supplementary characters correctly. The enhancements are part of version 5.0 of the Java 2 Platform, Standard Edition (J2SE).

Besides explaining these enhancements in detail, this article also provides guidelines for application developers for determining and implementing necessary changes to enable use of the complete Unicode character set.

Background

Unicode was originally designed as a fixed-width 16-bit character encoding. The primitive data type char in the Java programming language was intended to take advantage of this design by providing a simple data type that could hold any character. However, it turned out that the 65,536 characters possible in a 16-bit encoding are not sufficient to represent all characters that are or have been used on planet Earth. The Unicode standard therefore has been extended to allow up to 1,112,064 characters. Those characters that go beyond the original 16-bit limit are called supplementary characters. Version 2.0 of the Unicode standard was the first to include a design to enable supplementary characters, but it was only in version 3.1 that the first supplementary characters were assigned. Version 5.0 of the J2SE is required to support version 4.0 of the Unicode standard, so it has to support supplementary characters.

Support for supplementary characters is likely to also become a common business requirement in East Asian markets. Government applications are going to require them in order to correctly represent names that include rare Chinese characters. Publishing applications may need them in order to represent the full set of historical and variant characters. The Chinese government requires support for GB18030, a character encoding that encodes the entire Unicode character set, and so includes supplementary characters if Unicode version 3.1 or later is assumed. The Taiwanese standard CNS-11643 includes numerous characters that have been included in Unicode 3.1 as supplementary characters. The Hong Kong government defined a collection of characters that are needed for Cantonese, and some of these characters are supplementary characters in Unicode. Finally, some vendors in Japan are planning to use the large private use area in the supplementary character space for more than 50,000 kanji character variants in order to migrate from their proprietary systems to solutions based on the Java platform.

The Java platform therefore not only has to support supplementary characters, but it also has to make it easy for applications to do the same. Since supplementary characters break a fundamental assumption of the Java programming language and might require a fundamental change in the programming model, an expert group was convened under the Java Community Process to choose the right solution for the problem. The group is called the JSR-204 expert group, using the number of the Java Specification Request for Unicode Supplementary Character Support. Technically, the decisions of the expert group only apply to the J2SE platform, but since the Java 2 Platform, Enterprise Edition (J2EE) sits on top of the J2SE platform, it benefits directly, and we expect that the configurations of the Java 2 Platform, Micro Edition (J2ME) will adopt the same design approach.

But before we can look at the solution that the JSR-204 expert group came up with, we need to learn some terminology.

Code Points, Character Encoding Schemes, UTF-16: What's All This?

The introduction of supplementary characters unfortunately makes the character model quite a bit more complicated. Where in the past we could simply talk about "characters" and, in a Unicode-based environment such as the Java platform, assume that a character has 16 bits, we now need more terminology. We'll try to keep it relatively simple -- for a full-blown discussion with all the details you can read Chapter 2 of The Unicode Standard or Unicode Technical Report #17, "Character Encoding Model." Unicode experts may skip all but the last definition in this section.

A character is just an abstract minimal unit of text. It doesn't have a fixed shape (that would be a glyph), and it doesn't have a value. "A" is a character, and so is "€", the symbol for the common currency of Germany, France, and numerous other European countries.

A character set is a collection of characters. For example, the Han characters are the characters originally invented by the Chinese, which have been used to write Chinese, Japanese, Korean, and Vietnamese.

A coded character set is a character set where each character has been assigned a unique number. At the core of the Unicode standard is a coded character set that assigns the letter "A" the hexadecimal number 0041 and the character "€" the hexadecimal number 20AC. The Unicode standard always uses hexadecimal numbers, and writes them with the prefix "U+", so the number for "A" is written as "U+0041".

Code points are the numbers that can be used in a coded character set. A coded character set defines a range of valid code points, but doesn't necessarily assign characters to all those code points. The valid code points for Unicode are U+0000 to U+10FFFF. Unicode 4.0 assigns characters to 96,382 of these more than a million code points.

Supplementary characters are characters with code points in the range U+10000 to U+10FFFF, that is, those characters that could not be represented in the original 16-bit design of Unicode. The set of characters from U+0000 to U+FFFF is sometimes referred to as the Basic Multilingual Plane (BMP). Thus, each Unicode character is either in the BMP or a supplementary character.

A character encoding scheme is a mapping from the numbers of one or more coded character sets to sequences of one or more fixed-width code units. The most commonly used code units are bytes, but 16-bit or 32-bit integers can also be used for internal processing. UTF-32, UTF-16, and UTF-8 are character encoding schemes for the coded character set of the Unicode standard.

UTF-32 simply represents each Unicode code point as the 32-bit integer of the same value. It's clearly the most convenient representation for internal processing, but uses significantly more memory than necessary if used as a general string representation.

UTF-16 uses sequences of one or two unsigned 16-bit code units to encode Unicode code points. Values U+0000 to U+FFFF are encoded in one 16-bit unit with the same value. Supplementary characters are encoded in two code units, the first from the high-surrogates range (U+D800 to U+DBFF), the second from the low-surrogates range (U+DC00 to U+DFFF). This may seem similar in concept to multi-byte encodings, but there is an important difference: the values U+D800 to U+DFFF are reserved for use in UTF-16; no characters are assigned to them as code points. This means that software can tell for each individual code unit in a string whether it represents a one-unit character or whether it is the first or second unit of a two-unit character. This is a significant improvement over some traditional multi-byte character encodings, where the byte value 0x41 could mean the letter "A" or be the second byte of a two-byte character.
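
To make the encoding concrete, here is the arithmetic behind the surrogate pair representation, written as a small Java sketch (the code point U+10400 is used purely as an illustration; in practice the Character.toChars method described later in this article performs this conversion for you):

int codePoint = 0x10400;                         // a supplementary code point
int offset = codePoint - 0x10000;                // 20-bit offset: 0x00400
char high = (char) (0xD800 + (offset >> 10));    // high surrogate: '\uD801'
char low  = (char) (0xDC00 + (offset & 0x3FF));  // low surrogate:  '\uDC00'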

UTF-8 uses sequences of one to four bytes to encode Unicode code points. U+0000 to U+007F are encoded in one byte, U+0080 to U+07FF in two bytes, U+0800 to U+FFFF in three bytes, and U+10000 to U+10FFFF in four bytes. UTF-8 is designed so that the byte values 0x00 to 0x7F always represent code points U+0000 to U+007F (the Basic Latin block, which corresponds to the ASCII character set). These byte values never occur in the representation of other code points, a characteristic that makes UTF-8 convenient to use in software that assigns special meanings to certain ASCII characters.

The following table shows the different representations of a few characters in comparison:

Unicode code point     U+0041     U+00DF     U+6771        U+10400
Representative glyph   A          ß          東            𐐀
UTF-32 code units      00000041   000000DF   00006771      00010400
UTF-16 code units      0041       00DF       6771          D801 DC00
UTF-8 code units       41         C3 9F      E6 9D B1      F0 90 90 80

This article also uses the terms character sequence or char sequence in many places to summarize all the containers of character sequences that the Java 2 Platform knows: char[], implementations of java.lang.CharSequence (such as the String class), and implementations of java.text.CharacterIterator.

This is a lot of terminology. What does all this have to do with supporting supplementary characters in the Java platform?

Design Approach for Supplementary Characters in the Java Platform

The main decision the JSR-204 expert group had to make was how to represent supplementary characters in Java APIs, both for individual characters and for character sequences in all forms. A number of approaches were considered and rejected by the expert group:

  • Redefining the primitive type char to have 32 bits, which would also make char sequences in all forms UTF-32 sequences.
  • Introducing a new primitive 32-bit type for characters (e.g., char32) in addition to the existing 16-bit type char. Char sequences in all forms would be based on UTF-16.
  • Introducing a new primitive 32-bit type for characters (e.g., char32) in addition to the existing 16-bit type char. String and StringBuffer receive parallel APIs to interpret them either as UTF-16 sequences or as UTF-32 sequences; other char sequences continue to be based on UTF-16.
  • Using int to represent supplementary code points. String and StringBuffer receive parallel APIs to interpret them either as UTF-16 sequences or as UTF-32 sequences; other char sequences continue to be based on UTF-16.
  • Using surrogate char pairs to represent supplementary code points. char sequences in all forms would be based on UTF-16.
  • Introducing a class that encapsulates a character. String and StringBuffer receive new APIs to interpret them as sequences of such characters.
  • Using a combination of a CharSequence instance and an index to represent code points.

Some of these approaches were ruled out early on. Redefining the primitive type char to have 32 bits, for example, might have been very attractive for a brand-new platform, but for J2SE it would have been incompatible with existing Java virtual machines¹, serialization, and other interfaces, not to mention that UTF-32 based strings use twice as much memory as UTF-16 based ones. Adding a new type char32 would have been easier, but would still have created problems for virtual machines and serialization. Also, language changes usually have longer lead times than API changes, so the two previous approaches might have unacceptably delayed support for supplementary characters. To help determine the winner among the remaining ones, the implementation team actually implemented supplementary character support in a substantial piece of code that does low-level character processing (the java.util.regex package) using four different approaches, comparing them in terms of ease of development and runtime performance.

In the end, the decision was for a tiered approach:

  • Use the primitive type int to represent code points in low-level APIs, such as the static methods of the Character class.
  • Interpret char sequences in all forms as UTF-16 sequences, and promote their use in higher-level APIs.
  • Provide APIs to easily convert between various char and code point-based representations.

This approach provides a conceptually simple and efficient representation of individual characters where needed, while leveraging existing APIs that can be retrofitted to support supplementary characters. It also promotes the use of character sequences over individual characters, which is generally better for internationalized software.

With this approach, a char represents a UTF-16 code unit, which is not always sufficient to represent a code point. You'll see that the J2SE specifications now use the terms code point and UTF-16 code unit where the representation is relevant, and the generic term character where the representation is irrelevant to the discussion. APIs usually use the name codePoint for variables of type int that represent code points, while UTF-16 code units of course have type char.

We'll take a look at the actual changes in the J2SE platform in the next two sections -- one for the low-level APIs that work on individual code points, one for higher-level interfaces that work on character sequences.

Supplementary Characters in the Open: Code point-based APIs

The low-level APIs that were added fall into two broad categories: methods that convert between the various char-based and code point-based representations, and methods that analyze or map code points.

The most basic conversion methods are Character.toCodePoint(char high, char low), which converts two UTF-16 code units to a code point, and Character.toChars(int codePoint), which converts the given code point to one or two UTF-16 code units, wrapped into a char[]. However, since text most of the time comes in the form of a character sequence, there are also codePointAt and codePointBefore methods to extract a code point from the various character sequence representations: Character.codePointAt(char[] a, int index) and String.codePointBefore(int index) are two typical examples. For the most common cases of inserting code points into a character sequence, there are appendCodePoint(int codePoint) methods for the StringBuffer and StringBuilder classes and a String constructor that takes an int[] representing code points.
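
As a brief illustration, the following sketch moves back and forth between the char-based and code point-based representations using these methods (the code point value is arbitrary; any supplementary character behaves the same way):

int codePoint = 0x20000;                               // a supplementary character
char[] units = Character.toChars(codePoint);           // two code units: '\uD840', '\uDC00'
int sameCodePoint = Character.toCodePoint(units[0], units[1]);

String text = new String(units);
int extracted = text.codePointAt(0);                   // 0x20000, not just the high surrogate

StringBuffer buffer = new StringBuffer("code point: ");
buffer.appendCodePoint(codePoint);                     // appends both code units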

A few methods that analyze code units and code points help in the conversion process: The isHighSurrogate and isLowSurrogate methods in the Character class identify the char values that are used to represent supplementary characters, and the charCount(int codePoint) method determines whether a code point needs to be converted to one or two chars.
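
For example, here is a sketch (the method name is illustrative) of how these methods fit together when reading a code point out of a char array by hand; in real code, Character.codePointAt does this for you:

/**
 * Returns the code point starting at the given index of the array,
 * pairing up surrogate code units where necessary.
 */
int codePointStartingAt(char[] a, int index) {
    char c = a[index];
    if (Character.isHighSurrogate(c) && index + 1 < a.length
            && Character.isLowSurrogate(a[index + 1])) {
        return Character.toCodePoint(c, a[index + 1]);
    }
    return c;
}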

But most code point-based methods perform functions for the complete range of Unicode characters that older char based methods performed for BMP characters. Here are some typical examples:

  • Character.isLetter(int codePoint) identifies letters according to the Unicode standard.
  • Character.isJavaIdentifierStart(int codePoint) determines whether a code point can start an identifier according to the Java Language Specification.
  • Character.UnicodeBlock.of(int codePoint) looks up the Unicode block that the code point belongs to.
  • Character.toUpperCase(int codePoint) converts the given code point to its uppercase equivalent. While this method does support supplementary characters, it still cannot work around the fundamental issue that some case conversions cannot be done correctly on a character-by-character basis. The German character "ß", for example, should be converted to "SS", which requires use of the String.toUpperCase method.

Note that most methods that accept a code point do not check whether the given int value is in the range of valid Unicode code points (as mentioned above, only the range from 0x0 to 0x10FFFF is valid). In most cases the value is produced in a way that guarantees it is valid, and checking it repeatedly in these low-level APIs might adversely affect system performance. In cases where validity cannot be guaranteed, applications must use the Character.isValidCodePoint method to make sure that the code point is valid. The behavior of most methods for invalid code points is intentionally unspecified and may vary between implementations.
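
Where the value comes from an untrusted source, a check along these lines (a minimal sketch; the variable name is illustrative) prevents passing an invalid value to the code point-based methods:

// "value" is an int obtained from an external, untrusted source
if (!Character.isValidCodePoint(value)) {
    throw new IllegalArgumentException(
            "Not a valid Unicode code point: 0x" + Integer.toHexString(value));
}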

The API contains a number of convenience methods which could be implemented using other, lower-level APIs, but where the expert group felt that the methods would be used sufficiently often that adding them to the J2SE platform made sense. However, the expert group also rejected some proposed convenience methods, which gives us an opportunity to show how you can implement such methods yourself. For example, the expert group debated and rejected a new constructor for the String class which would create a String holding a single code point. Here's a simple way that an application can provide the functionality using the existing API:


  
/**
 * Creates a new String that contains just the given code point.
 */
String newString(int codePoint) {
    return new String(Character.toChars(codePoint));
}

You'll notice that in this simple implementation the toChars method always creates an intermediate array, which is used once and then immediately discarded. If the method shows up in your performance measurements, you may want to optimize for the very, very, very common case where the code point is a BMP character:


                
/**
 * Creates a new String that contains just the given code point.
 * Version that optimizes for BMP characters.
 */
String newString(int codePoint) {
    if (Character.charCount(codePoint) == 1) {
        return String.valueOf((char) codePoint);
    } else {
        return new String(Character.toChars(codePoint));
    }
}

Or, if you need to create a large number of such strings, you may want to write a bulk version that reuses the array used by the toChars method:


                
/**
 * Creates new Strings, each of which contains one of the given
 * code points.
 * Version that reuses a single char array for the conversions.
 */
String[] newStrings(int[] codePoints) {
    String[] result = new String[codePoints.length];
    char[] codeUnits = new char[2];
    for (int i = 0; i < codePoints.length; i++) {
        int count = Character.toChars(codePoints[i], codeUnits, 0);
        result[i] = new String(codeUnits, 0, count);
    }
    return result;
}

However, it may turn out that you actually want an entirely different solution. The new constructor String(int codePoint) was proposed as a code point-based alternative to String.valueOf(char). In many cases this method is used in the context of message generation, such as:

System.out.println("Character " + String.valueOf(char) + " is invalid.");

The new formatting API, which supports supplementary characters, provides a much simpler alternative:

System.out.printf("Character %c is invalid.%n", codePoint);

Using this higher-level API is not only simpler, it also has other advantages: it avoids the concatenation (which would make the message very hard to localize) and reduces the number of strings that need to be moved into a resource bundle from two to one.

Supplementary Characters under the Hood: Functionality Enhancements

Most changes in the Java 2 Platform that enable the use of supplementary characters are not reflected in new API. The general expectation is that all interfaces that handle character sequences will handle supplementary characters in a way that's appropriate for their functionality. This section highlights some enhancements that were made to meet this expectation.

Identifiers in the Java Programming Language

The Java Language Specification specifies that all Unicode letters and digits can be used in identifiers. Many supplementary characters are letters or digits, and so the Java Language Specification was updated to refer to new code point-based methods to define the legal characters in identifiers. The javac compiler and other tools that need to detect identifiers were changed to use these new methods.

Supplementary Character Support in Libraries

Numerous J2SE libraries have been enhanced to support supplementary characters through existing interfaces. Here are just a few examples:

  • String case conversion has been updated to handle supplementary characters and also to implement the special casing rules as specified in the Unicode standard.
  • The java.util.regex package has been updated so that both pattern strings and target strings can contain supplementary characters, which will be handled as complete units (see the sketch following this list).
  • Collation in the java.text package now treats supplementary characters as complete units.
  • The java.text.Bidi class has been updated to handle supplementary characters and other characters that are new in Unicode 4.0. Note that the supplementary characters in the Cypriot Syllabary block have right-to-left directionality.
  • Font rendering and printing in the Java 2D API have been enhanced to correctly render and measure strings containing supplementary characters.
  • The Swing text component implementation has been updated to handle text that contains supplementary characters.
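
As an illustration of the regular expression behavior, the following sketch (relying on the documented treatment of supplementary characters as complete units) shows that a supplementary character occupies two chars in a string but is matched by "." as a single character:

String deseret = new String(Character.toChars(0x10400));            // one supplementary character
System.out.println(deseret.length());                               // 2 -- two UTF-16 code units
System.out.println(java.util.regex.Pattern.matches(".", deseret));  // true -- matched as one unit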

Character Conversion

There are only a small number of character encodings that can represent supplementary characters. In the case of Unicode-based encodings such as UTF-8 and UTF-16LE, the character converters in previous releases of the J2RE already implemented the conversions in a way that handled supplementary characters correctly. For J2RE 5.0, the converters for other encodings that can represent supplementary characters have been updated: GB18030, x-EUC-TW (which now implements all planes of CNS 11643), Big5-HKSCS (which now implements HKSCS-2001).

Representing Supplementary Characters in Source Files

In Java programming language source files, supplementary characters are easiest to use if a character encoding is used that can represent supplementary characters directly. UTF-8 is an excellent choice. For cases where the character encoding used cannot represent the characters directly, the Java programming language provides a Unicode escape syntax. This syntax has not been enhanced to express supplementary characters directly. Instead, they are represented by the two consecutive Unicode escapes for the two code units in the UTF-16 representation of the character. For example, the character U+20000 is written as "\uD840\uDC00". You probably don't want to figure out these escape sequences yourself; it's best to write in an encoding that supports the supplementary characters that you need and then use a tool such as native2ascii to convert to escape sequences.
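
A quick way to check that a pair of Unicode escapes really denotes the intended supplementary character is to count code points rather than chars, as in this small sketch:

String s = "\uD840\uDC00";                                   // U+20000 as two Unicode escapes
System.out.println(s.length());                              // 2 -- UTF-16 code units
System.out.println(s.codePointCount(0, s.length()));         // 1 -- one character
System.out.println(Integer.toHexString(s.codePointAt(0)));   // 20000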

Properties files, unfortunately, are still limited to ISO 8859-1 as their encoding (unless your application uses the new XML format). This means you always have to use escape sequences for supplementary characters, and again probably will want to write in a different encoding and then convert with a tool such as native2ascii.

Modified UTF-8

Modified UTF-8 is not new to the Java platform, but it's something that application developers need to be more aware of when converting text that might contain supplementary characters to and from UTF-8. The main thing to remember is that some J2SE interfaces use an encoding that's similar to UTF-8 but incompatible with it. This encoding has in the past sometimes been called "Java modified UTF-8" or (incorrectly) just "UTF-8". For J2SE 5.0, the documentation is being updated to uniformly call it "modified UTF-8."

The incompatibility between modified UTF-8 and standard UTF-8 stems from two differences. First, modified UTF-8 represents the character U+0000 as the two-byte sequence 0xC0 0x80, whereas standard UTF-8 uses the single byte value 0x00. Second, modified UTF-8 represents supplementary characters by separately encoding the two surrogate code units of their UTF-16 representation. Each of the surrogate code units is represented by three bytes, for a total of six bytes. Standard UTF-8, on the other hand, uses a single four-byte sequence for the complete character.

Modified UTF-8 is used by the Java Virtual Machine and the interfaces attached to it (such as the Java Native Interface, the various tool interfaces, or Java class files), in the java.io.DataInput and DataOutput interfaces and classes implementing or using them, and for serialization. The Java Native Interface provides routines that convert to and from modified UTF-8. Standard UTF-8, on the other hand, is supported by the String class, by the java.io.InputStreamReader and OutputStreamWriter classes, the java.nio.charset facilities, and many APIs layered on top of them.
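
To see the difference in practice, the following sketch (the method name is illustrative) compares the bytes produced by java.io.DataOutputStream.writeUTF, which writes a two-byte length followed by modified UTF-8, with the bytes produced by String.getBytes("UTF-8"), which uses standard UTF-8:

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

void compareUTF8Forms() throws IOException {
    String s = new String(Character.toChars(0x10400));   // one supplementary character

    byte[] standard = s.getBytes("UTF-8");                // F0 90 90 80 -- four bytes

    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(bytes);
    out.writeUTF(s);                                      // 00 06 ED A0 81 ED B0 80
    byte[] modified = bytes.toByteArray();                // length prefix + six bytes of modified UTF-8
}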

Since modified UTF-8 is incompatible with standard UTF-8, it is critical not to use one where the other is needed. Modified UTF-8 can only be used with the Java interfaces described above. In all other cases, in particular for data streams that may come from or may be interpreted by software that's not based on the Java platform, standard UTF-8 must be used. The Java Native Interface routines that convert to and from modified UTF-8 cannot be used when standard UTF-8 is required.

Supporting Supplementary Characters in Your Application

Now, the question that matters most to most readers: What changes do you have to make to your application in order to support supplementary characters?

The answer depends on what kind of text processing is done within the application and which Java platform APIs are used.

Applications that deal with text only in the form of char sequences (char[], implementations of java.lang.CharSequence, implementations of java.text.CharacterIterator), and that only use Java APIs that accept and return such char sequences, will likely not have to make any changes. The implementation of the Java platform APIs should handle supplementary characters for you.

Applications that interpret individual characters themselves, pass individual characters to Java platform APIs, or call methods that return individual characters, need to consider the valid values for these characters. In many cases it turns out that support for supplementary characters is not required. For example, if an application scans a char sequence for HTML tags, checking each char individually, it knows that these tags only use characters from the Basic Latin block. If the text being scanned contains supplementary characters, then these characters cannot be confused with the tag characters, because UTF-16 represents supplementary characters using code units whose values are not used for BMP characters.
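
As a concrete illustration, a scanner along the following lines (a hypothetical sketch, not a platform API) keeps working unchanged, because the surrogate code units used for supplementary characters can never be equal to '<' or any other Basic Latin character:

/**
 * Counts the '<' characters that start HTML tags. Supplementary characters
 * in the text cannot be mistaken for '<', because their UTF-16 code units
 * all lie in the surrogate range U+D800 to U+DFFF.
 */
int countTagStarts(CharSequence html) {
    int count = 0;
    for (int i = 0; i < html.length(); i++) {
        if (html.charAt(i) == '<') {
            count++;
        }
    }
    return count;
}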

Only where applications interpret individual characters themselves, pass individual characters to Java platform APIs, or call methods that return individual characters, and these characters can be supplementary characters, does the application have to be changed. Where parallel APIs are available that use char sequences, it is best to convert to use such APIs. In the remaining cases, it will be necessary to use the new API to convert between char and code point-based representations, and call code point-based APIs. Unless, of course, you're lucky and find that there are newer and more convenient APIs in J2SE 5.0 that let you support supplementary characters and simplify your code at the same time, as in the formatting sample above.

You might wonder whether it's better to convert all text into code point representation (say, an int[]) and process it in that representation, or whether it's better to stick with char sequences most of the time and only convert to code points when needed. Well, the Java platform APIs in general certainly have a preference for char sequences, and using them will also save memory space.
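
If a particular algorithm really is easier to express over code points, the conversion is straightforward; here is a minimal sketch (the helper name is illustrative) that turns a String into an array of code points:

int[] toCodePoints(String s) {
    int[] codePoints = new int[s.codePointCount(0, s.length())];
    int out = 0;
    for (int i = 0; i < s.length(); ) {
        int codePoint = s.codePointAt(i);
        codePoints[out++] = codePoint;
        i += Character.charCount(codePoint);              // advance by one or two code units
    }
    return codePoints;
}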

For applications that need conversion to and from UTF-8, you will also need to consider carefully whether standard or modified UTF-8 is required, and use the proper Java platform facilities for each. The section "Modified UTF-8" provides the information needed to choose the right one.

Testing Your Application With Supplementary Characters

Whether the previous section led you to revise your application or not, it's always a good idea to test that it behaves correctly. For applications that don't include a graphical user interface, the information in "Representing Supplementary Characters in Source Files" helps in developing test cases. Here's additional information on testing with graphical user interfaces.

For text input, the Java 2 SDK provides a code point input method which accepts strings of the form "\Uxxxxxx", where the uppercase "U" indicates that the escape sequence contains six hexadecimal digits, thus allowing for supplementary characters. A lowercase "u" indicates the original form of the escape sequences, "\uxxxx". You can find this input method and its documentation in the directory demo/jfc/CodePointIM of the J2SDK.

For font rendering, you will need a font that can render at least some supplementary characters. One such font is James Kass's Code2001 font, which provides glyphs for scripts such as Deseret and Old Italic. Thanks to a new feature in the Java 2D library, you can simply install the font into the J2RE's lib/fonts/fallback directory, and it will be automatically added to all logical fonts used in 2D and XAWT rendering -- you don't need to edit font configuration files.

And with that, you can call your application ready for supplementary characters!

Conclusion

Support for supplementary characters has been introduced into the Java platform with an approach that enables most applications to handle these characters without code changes. Applications that interpret individual characters can use the new code point-based APIs in the Character class and in char sequence classes such as String and StringBuffer.

Acknowledgments

Supplementary character support in the Java platform was designed by the JSR-204 expert group within the Java Community Process. The specification leads are Masayoshi Okutsu and Brian Beck (Sun Microsystems), the other members of the expert group are Craig Cummings (Oracle), Mark Davis (IBM), Markus Eble (SAP AG), Jere Käpyaho (Nokia Corp.), Kazuhiro Kazama (NTT), Kenji Kazumura (Fujitsu Limited), Eiichi Kimura (NEC Corp.), Changshin Lee (Tmax Soft Inc.), and Toshiki Murata (Oki Electric Industry Co.). The reference implementation was done by the Java Internationalization team at Sun Microsystems with contributions from the IBM Globalization Center of Competency, San José. The technology compatibility kit for the specification is the Java Compatibility Kit, implemented by the JCK Team at Sun Microsystems.

References

Masayoshi Okutsu, Brian Beck (ed.): Unicode Supplementary Character Support. Proposed Final Draft. Sun Microsystems, 2004.

Java 2 Platform Standard Edition 5.0 API Specification. Sun Microsystems, 2004.

The Unicode Consortium: The Unicode Standard, Version 4.0. Addison-Wesley, 2003.

Ken Whistler, Mark Davis: Character Encoding Model. Unicode Technical Report #17. The Unicode Consortium, 2000.

About the Authors

Norbert Lindenberg is the technical lead for Java Internationalization in Sun Microsystems' Java Web Services group. Before joining Sun, he worked on a variety of internationalization projects at General Magic and Apple Computer. He holds an M.S. degree in Computer Science from Universität Karlsruhe, Germany.

Masayoshi Okutsu is an internationalization engineer in Sun Microsystems' Java Web Services group, and currently the specification lead for Java Specification Request 204, Unicode Supplementary Character Support. Before joining Sun Microsystems, he worked on a variety of internationalization projects at Digital Equipment Corporation. He holds a B.S. degree in Electronic Engineering from Yamagata University, Japan.

¹ As used on this web site, the terms "Java virtual machine" or "JVM" mean a virtual machine for the Java platform.