💻 Unicode in Code

Unicode in Java

Java's char type is a 16-bit UTF-16 code unit, not a full Unicode character, which creates subtle bugs when working with supplementary characters outside the BMP. This guide explains how Java handles Unicode strings, the difference between char and code points, and best practices for internationalized Java applications.


Java was one of the first mainstream languages to commit to Unicode from its inception. When Java 1.0 shipped in 1996, the char type was defined as a 16-bit unsigned integer representing a Unicode character -- a bold design choice at a time when most languages treated characters as single bytes. That decision shaped Java's string handling for decades, and understanding its implications is essential for any Java developer who works with international text, emoji, or characters outside the Basic Multilingual Plane.

The char Type and UTF-16

Java's char is a 16-bit type that holds a single UTF-16 code unit:

char letter = 'A';           // U+0041
char euro   = '\u20AC';      // U+20AC EURO SIGN
char sigma  = '\u03C3';      // U+03C3 GREEK SMALL LETTER SIGMA

When Unicode was young, 65,536 code points seemed sufficient for every living script, so Java's designers mapped char directly to a Unicode code point. Then Unicode 2.0 (1996) expanded the code space to over one million code points by introducing supplementary planes. Characters above U+FFFF -- including most emoji, historic scripts, and rare CJK ideographs -- cannot fit in a single char. Instead, they are encoded as surrogate pairs: two char values that together represent one code point.

// The emoji 🎵 (U+1F3B5 MUSICAL NOTE) requires a surrogate pair
String music = "\uD83C\uDFB5";    // surrogate pair
String music2 = "🎵";              // same thing, source-level literal
System.out.println(music.length());           // 2  (two char values)
System.out.println(music.codePointCount(0, music.length()));  // 1

This distinction between char count and code point count is the single biggest source of Unicode bugs in Java.

Strings: char[] Under the Hood

A String in Java is a sequence of char values -- that is, a sequence of UTF-16 code units, not a sequence of Unicode code points. The familiar methods length(), charAt(), and substring() all operate on char units:

| Method | Returns |
| --- | --- |
| length() | Number of char values (UTF-16 code units) |
| charAt(i) | The char at index i |
| codePointAt(i) | The full code point starting at char index i |
| codePointCount(begin, end) | Number of Unicode code points in the range |
| offsetByCodePoints(index, n) | char index that is n code points from index |
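
The difference between these methods is easiest to see side by side; a minimal sketch using one supplementary character:

```java
public class CodeUnitDemo {
    public static void main(String[] args) {
        String s = "a\uD835\uDD38b";  // a, 𝔸 (U+1D538), b
        System.out.println(s.length());                       // 4 -- char count
        System.out.println(s.codePointCount(0, s.length()));  // 3 -- code point count
        System.out.printf("U+%04X%n", s.codePointAt(1));      // U+1D538
        System.out.println(s.offsetByCodePoints(0, 2));       // 3 -- char index after 'a' and 𝔸
    }
}
```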

BMP Characters (U+0000 to U+FFFF)

For characters within the Basic Multilingual Plane, length() and code point count are identical. Most Latin, Greek, Cyrillic, CJK, and Arabic text falls in this range:

String hello = "こんにちは";
System.out.println(hello.length());                          // 5
System.out.println(hello.codePointCount(0, hello.length())); // 5

Supplementary Characters (U+10000 and Above)

Emoji, musical symbols, mathematical alphanumerics, and historic scripts use supplementary code points. Each one occupies two char slots:

String emoji = "Hello 🌍🌎🌏";
System.out.println(emoji.length());                          // 12
// "Hello " = 6 chars, each globe = 2 chars → 6 + 6 = 12
System.out.println(emoji.codePointCount(0, emoji.length())); // 9

Iterating Over Code Points

Never iterate over a string with charAt() if it may contain supplementary characters. Use codePoints() instead:

String text = "A\uD835\uDD38Z";  // A, 𝔸 (U+1D538 MATHEMATICAL DOUBLE-STRUCK A), Z

// WRONG: iterates over char values
for (int i = 0; i < text.length(); i++) {
    System.out.printf("char[%d] = %04X%n", i, (int) text.charAt(i));
}
// char[0] = 0041
// char[1] = D835  ← high surrogate (not a real character)
// char[2] = DD38  ← low surrogate
// char[3] = 005A

// CORRECT: iterates over code points
text.codePoints().forEach(cp ->
    System.out.printf("U+%04X %s%n", cp, Character.getName(cp))
);
// U+0041 LATIN CAPITAL LETTER A
// U+1D538 MATHEMATICAL DOUBLE-STRUCK CAPITAL A
// U+005A LATIN CAPITAL LETTER Z

The codePoints() method returns an IntStream of code points, which works seamlessly with Java's stream API.
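
For example, the stream can be filtered and collected back into a string; appendCodePoint rebuilds surrogate pairs correctly (the class name here is just for illustration):

```java
public class CodePointStreams {
    public static void main(String[] args) {
        String mixed = "abc\uD83C\uDFB5de\uD83C\uDFB6";  // "abc🎵de🎶"

        // Count supplementary code points (each occupies two chars)
        long supplementary = mixed.codePoints()
                .filter(Character::isSupplementaryCodePoint)
                .count();
        System.out.println(supplementary);  // 2

        // Rebuild a string from the BMP code points only
        String bmpOnly = mixed.codePoints()
                .filter(Character::isBmpCodePoint)
                .collect(StringBuilder::new,
                         StringBuilder::appendCodePoint,
                         StringBuilder::append)
                .toString();
        System.out.println(bmpOnly);        // "abcde"
    }
}
```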

The Character Class

java.lang.Character provides Unicode-aware classification and conversion for individual code points. Most methods have two overloads: one taking char (limited to the BMP) and one taking int (the full Unicode range). Always prefer the int overloads:

int cp = 0x1F600;  // 😀 GRINNING FACE

Character.isLetter(cp);         // false (emoji is not a letter)
Character.isDigit(cp);          // false
Character.getType(cp);          // Character.OTHER_SYMBOL (28)
Character.getName(cp);          // "GRINNING FACE"
Character.charCount(cp);        // 2 (needs a surrogate pair)
Character.isBmpCodePoint(cp);   // false
Character.isSupplementaryCodePoint(cp);  // true
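
Going the other direction, from an int code point back to a String, uses Character.toChars or StringBuilder.appendCodePoint:

```java
public class ToCharsDemo {
    public static void main(String[] args) {
        int cp = 0x1F600;  // 😀
        String a = new String(Character.toChars(cp));                  // via a char[] of 1 or 2 elements
        String b = new StringBuilder().appendCodePoint(cp).toString(); // same result
        System.out.println(a.equals(b));  // true
        System.out.println(a.length());   // 2 -- still two chars under the hood
    }
}
```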

Case Conversion

Character.toUpperCase('σ');      // 'Σ'
Character.toLowerCase('Σ');      // 'σ'

// For locale-aware case conversion, use String methods:
"straße".toUpperCase(Locale.GERMAN);  // "STRASSE"
"I".toLowerCase(Locale.forLanguageTag("tr"));  // "ı" (Turkish dotless i)

The Turkish locale is notorious for breaking naive case-insensitive comparisons because I lowercases to ı (not i) and i uppercases to İ (not I).
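
A short demonstration of the pitfall and two locale-independent alternatives (the literal strings are arbitrary):

```java
import java.util.Locale;

public class TurkishCase {
    public static void main(String[] args) {
        Locale tr = Locale.forLanguageTag("tr");

        // Naive lowercase-and-compare breaks: I lowercases to dotless ı
        System.out.println("FILE".toLowerCase(tr).equals("file"));          // false ("fıle")

        // Locale-independent alternatives:
        System.out.println("FILE".toLowerCase(Locale.ROOT).equals("file")); // true
        System.out.println("FILE".equalsIgnoreCase("file"));                // true
    }
}
```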

Unicode Escapes and String Literals

Java supports several ways to embed Unicode characters in source code:

// Unicode escape (processed by the compiler before parsing)
String s1 = "\u00E9";            // é

// Supplementary via surrogate pair
String s2 = "\uD83D\uDE00";     // 😀

// Direct source literal (if file is UTF-8)
String s3 = "é";
String s4 = "😀";

Warning: Java \u escapes are processed in an early translation phase, before the code is even tokenized -- they are recognized everywhere, including inside comments and string literals. This means \u000A is treated as a newline, and \u0022 is treated as a double quote, which can cause subtle bugs:

// This is a compilation error: \u000A turns into a real newline inside
// the string literal (backslash doubled below so this comment compiles):
// String bad = "line1\\u000Aline2";
// Use \n instead:
String ok = "line1\nline2";

Normalization

Java provides java.text.Normalizer for Unicode normalization:

import java.text.Normalizer;

String composed   = "\u00E9";       // é (NFC, 1 code point)
String decomposed = "e\u0301";      // é (NFD, 2 code points)

System.out.println(composed.equals(decomposed));  // false

String nfc = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
System.out.println(composed.equals(nfc));          // true

| Form | Description |
| --- | --- |
| NFC | Canonical decomposition + canonical composition (recommended) |
| NFD | Canonical decomposition |
| NFKC | Compatibility decomposition + composition |
| NFKD | Compatibility decomposition |

Always normalize user input to NFC before storing or comparing.
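
The compatibility forms additionally fold ligatures and styled variants into their plain equivalents; a quick sketch:

```java
import java.text.Normalizer;

public class CompatForms {
    public static void main(String[] args) {
        String ligature = "\uFB01";   // ﬁ LATIN SMALL LIGATURE FI
        String roman    = "\u2168";   // Ⅸ ROMAN NUMERAL NINE

        // NFC preserves compatibility characters; NFKC folds them
        System.out.println(Normalizer.normalize(ligature, Normalizer.Form.NFC));   // "ﬁ"
        System.out.println(Normalizer.normalize(ligature, Normalizer.Form.NFKC));  // "fi"
        System.out.println(Normalizer.normalize(roman,    Normalizer.Form.NFKC));  // "IX"
    }
}
```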

Encoding and Byte Conversion

Java strings present UTF-16 semantics at the API level (since Java 9, compact strings may store Latin-1 bytes internally, but this is invisible to the programmer). Converting to or from byte arrays requires specifying an encoding:

import java.nio.charset.StandardCharsets;

String text = "日本語";
byte[] utf8  = text.getBytes(StandardCharsets.UTF_8);
byte[] utf16 = text.getBytes(StandardCharsets.UTF_16);

// Convert back
String back = new String(utf8, StandardCharsets.UTF_8);

Always use StandardCharsets constants instead of string names to avoid UnsupportedEncodingException at runtime.
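
Note that new String(bytes, UTF_8) silently replaces malformed byte sequences with U+FFFD. When invalid input should be an error rather than silently repaired, a CharsetDecoder configured to REPORT is one option; a sketch:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictDecode {
    public static void main(String[] args) {
        byte[] truncated = {(byte) 0xE6, (byte) 0x97};  // first 2 bytes of a 3-byte UTF-8 sequence

        // Unlike new String(...), a REPORTing decoder throws on bad input
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(truncated));
            System.out.println("valid UTF-8");
        } catch (CharacterCodingException e) {
            System.out.println("invalid UTF-8: " + e);
        }
    }
}
```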

Regular Expressions

Java's java.util.regex package is Unicode-aware. Use \p{...} for Unicode property escapes:

import java.util.regex.*;

// Match any Unicode letter
Pattern letters = Pattern.compile("\\p{L}+");
Matcher m = letters.matcher("café 日本語");
while (m.find()) {
    System.out.println(m.group());   // "café", "日本語"
}

// Match any Unicode digit
Pattern.compile("\\p{Nd}+");        // decimal digits from any script

// Match a specific script
Pattern.compile("\\p{IsGreek}+");   // Greek characters
Pattern.compile("\\p{IsCyrillic}+");// Cyrillic characters

// UNICODE_CHARACTER_CLASS flag (Java 7+)
Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS);
// \\w now matches Unicode letters, not just [a-zA-Z0-9_]

Collation and Sorting

Code-point order is rarely the correct sort order for human-readable text. Use java.text.Collator for locale-aware sorting:

import java.text.Collator;
import java.util.*;

Collator collator = Collator.getInstance(Locale.GERMAN);
List<String> words = Arrays.asList("Zug", "Äpfel", "Apfel");
words.sort(collator);
System.out.println(words);  // [Apfel, Äpfel, Zug]  (German rules: Ä ≈ A)
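
Collator also lets you tune how much difference matters via setStrength; a sketch under German rules:

```java
import java.text.Collator;
import java.util.Locale;

public class CollatorStrength {
    public static void main(String[] args) {
        Collator c = Collator.getInstance(Locale.GERMAN);

        c.setStrength(Collator.PRIMARY);   // ignore accent and case differences
        System.out.println(c.compare("Äpfel", "apfel"));       // 0 -- equal at primary strength

        c.setStrength(Collator.TERTIARY);  // default: accents and case are significant
        System.out.println(c.compare("Äpfel", "apfel") == 0);  // false
    }
}
```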

Common Pitfalls

1. Using length() to Count Characters

// WRONG: counts char values
"Hello 🌍".length()            // 8  (not 7)

// CORRECT: counts code points
"Hello 🌍".codePointCount(0, "Hello 🌍".length())  // 7

2. Truncating Strings That Contain Supplementary Characters

Cutting at an arbitrary char index can split a surrogate pair:

String text = "Hi 🎉 there";
// WRONG: may split a surrogate pair
String bad = text.substring(0, 4);   // "Hi \uD83C"  ← broken!

// SAFE: advance by code points
int end = text.offsetByCodePoints(0, 4);
String safe = text.substring(0, end); // "Hi 🎉"

3. Comparing Without Normalization

String a = "\u00F1";       // ñ (precomposed)
String b = "n\u0303";      // ñ (n + combining tilde)
a.equals(b);               // false  ← same visual character!

Always normalize before comparing.
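
A small helper that normalizes both sides to NFC before comparing (the method name is illustrative):

```java
import java.text.Normalizer;

public class NormalizedEquals {
    // Compare after canonical (NFC) normalization
    static boolean canonicallyEquals(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        System.out.println(canonicallyEquals("\u00F1", "n\u0303"));  // true
    }
}
```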

Quick Reference

| Task | Code |
| --- | --- |
| Code point count | s.codePointCount(0, s.length()) |
| Iterate code points | s.codePoints().forEach(...) |
| Code point at index | s.codePointAt(i) |
| Code point to string | new String(Character.toChars(cp)) |
| Character name | Character.getName(cp) |
| Is supplementary | Character.isSupplementaryCodePoint(cp) |
| Normalize NFC | Normalizer.normalize(s, Normalizer.Form.NFC) |
| Encode to UTF-8 | s.getBytes(StandardCharsets.UTF_8) |
| Locale-aware sort | Collator.getInstance(locale) |
| Unicode regex | Pattern.compile("\\p{L}+") |

Java's UTF-16 foundation means that supplementary characters require constant vigilance. The good news is that the codePoints() stream, Character methods with int parameters, and Normalizer provide all the tools you need. The key discipline is to always think in code points, not char values, and to test with supplementary characters early in development.
