Unicode in Java
Java's char type is a 16-bit UTF-16 code unit, not a full Unicode character, which creates subtle bugs when working with supplementary characters outside the BMP. This guide explains how Java handles Unicode strings, the difference between char and code points, and best practices for internationalized Java applications.
Java was one of the first mainstream languages to commit to Unicode from its
inception. When Java 1.0 shipped in 1996, the char type was defined as a
16-bit unsigned integer representing a Unicode character -- a bold design choice
at a time when most languages treated characters as single bytes. That decision
shaped Java's string handling for decades, and understanding its implications is
essential for any Java developer who works with international text, emoji, or
characters outside the Basic Multilingual Plane.
The char Type and UTF-16
Java's char is a 16-bit type that holds a single UTF-16 code unit:
char letter = 'A'; // U+0041
char euro = '\u20AC'; // U+20AC EURO SIGN
char sigma = '\u03C3'; // U+03C3 GREEK SMALL LETTER SIGMA
When Unicode was young, 65,536 code points seemed sufficient for every living
script, so Java's designers mapped char directly to a Unicode code point.
Unicode 2.0 (1996) expanded the code space to over one million code points by
introducing supplementary planes. Characters above U+FFFF -- including most
emoji, historic scripts, and rare CJK ideographs -- cannot fit in a single
char. Instead, they are encoded as surrogate pairs: two char values
that together represent one code point.
// The emoji 🎵 (U+1F3B5 MUSICAL NOTE) requires a surrogate pair
String music = "\uD83C\uDFB5"; // surrogate pair
String music2 = "🎵"; // same thing, source-level literal
System.out.println(music.length()); // 2 (two char values)
System.out.println(music.codePointCount(0, music.length())); // 1
This distinction between char count and code point count is the single
biggest source of Unicode bugs in Java.
Strings: char[] Under the Hood
A String in Java is a sequence of char values -- that is, a sequence of
UTF-16 code units, not a sequence of Unicode code points. The familiar
methods length(), charAt(), and substring() all operate on char units:
| Method | Returns |
|---|---|
| length() | Number of char values (UTF-16 code units) |
| charAt(i) | The char at index i |
| codePointAt(i) | The full code point starting at char index i |
| codePointCount(begin, end) | Number of Unicode code points in the range |
| offsetByCodePoints(index, n) | The char index that is n code points from index |
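Taken together, these methods support manual code-point iteration. A minimal sketch (the class name is illustrative) that walks a string one code point at a time, advancing by one or two char positions as needed:

```java
public class CodePointWalk {
    public static void main(String[] args) {
        String s = "A\uD83C\uDFB5B"; // 'A', U+1F3B5 MUSICAL NOTE (surrogate pair), 'B'
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);    // full code point at char index i
            System.out.printf("U+%04X%n", cp);
            i += Character.charCount(cp); // advance 1 (BMP) or 2 (supplementary)
        }
        // Prints U+0041, U+1F3B5, U+0042
    }
}
```

This is the pre-streams idiom; on Java 8+, codePoints() (shown below) is usually more convenient.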
BMP Characters (U+0000 to U+FFFF)
For characters within the Basic Multilingual Plane, length() and code point
count are identical. Most Latin, Greek, Cyrillic, CJK, and Arabic text falls
in this range:
String hello = "こんにちは";
System.out.println(hello.length()); // 5
System.out.println(hello.codePointCount(0, hello.length())); // 5
Supplementary Characters (U+10000 and Above)
Emoji, musical symbols, mathematical alphanumerics, and historic scripts use
supplementary code points. Each one occupies two char slots:
String emoji = "Hello 🌍🌎🌏";
System.out.println(emoji.length()); // 12
// "Hello " = 6 chars, each globe = 2 chars → 6 + 6 = 12
System.out.println(emoji.codePointCount(0, emoji.length())); // 9
Iterating Over Code Points
Never iterate over a string with charAt() if it may contain supplementary
characters. Use codePoints() instead:
String text = "A\uD835\uDD38Z"; // A, 𝔸 (U+1D538 MATHEMATICAL DOUBLE-STRUCK A), Z
// WRONG: iterates over char values
for (int i = 0; i < text.length(); i++) {
System.out.printf("char[%d] = %04X%n", i, (int) text.charAt(i));
}
// char[0] = 0041
// char[1] = D835 ← high surrogate (not a real character)
// char[2] = DD38 ← low surrogate
// char[3] = 005A
// CORRECT: iterates over code points
text.codePoints().forEach(cp ->
System.out.printf("U+%04X %s%n", cp, Character.getName(cp))
);
// U+0041 LATIN CAPITAL LETTER A
// U+1D538 MATHEMATICAL DOUBLE-STRUCK CAPITAL A
// U+005A LATIN CAPITAL LETTER Z
The codePoints() method returns an IntStream of code points, which works
seamlessly with Java's stream API.
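For instance, standard IntStream operations can filter or rebuild text by code point; a small sketch (the string contents are arbitrary examples) that counts supplementary characters and strips them out:

```java
public class CodePointStream {
    public static void main(String[] args) {
        String s = "café \uD83C\uDF0D 日本語"; // contains 🌍 (U+1F30D)
        long supplementary = s.codePoints()
                .filter(Character::isSupplementaryCodePoint)
                .count();                                 // 1 (the globe emoji)
        String bmpOnly = s.codePoints()
                .filter(Character::isBmpCodePoint)
                .collect(StringBuilder::new,
                         StringBuilder::appendCodePoint,  // append each code point
                         StringBuilder::append)
                .toString();                              // "café  日本語"
        System.out.println(supplementary);
        System.out.println(bmpOnly);
    }
}
```

The three-argument collect with StringBuilder::appendCodePoint is the standard way to turn an IntStream of code points back into a String.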
The Character Class
java.lang.Character provides Unicode-aware classification and conversion for
individual code points. Most methods have two overloads: one taking char
(limited to the BMP) and one taking int (the full Unicode range). Always
prefer the int overloads:
int cp = 0x1F600; // 😀 GRINNING FACE
Character.isLetter(cp); // false (emoji is not a letter)
Character.isDigit(cp); // false
Character.getType(cp); // Character.OTHER_SYMBOL (28)
Character.getName(cp); // "GRINNING FACE"
Character.charCount(cp); // 2 (needs a surrogate pair)
Character.isBmpCodePoint(cp); // false
Character.isSupplementaryCodePoint(cp); // true
Case Conversion
Character.toUpperCase('σ'); // 'Σ'
Character.toLowerCase('Σ'); // 'σ'
// For locale-aware case conversion, use String methods:
"straße".toUpperCase(Locale.GERMAN); // "STRASSE"
"I".toLowerCase(Locale.forLanguageTag("tr")); // "ı" (Turkish dotless i)
The Turkish locale is notorious for breaking naive case-insensitive comparisons
because I lowercases to ı (not i) and i uppercases to İ (not I).
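A defensive pattern for protocol keywords and identifiers that must not vary by locale is to use Locale.ROOT (or equalsIgnoreCase, which compares char by char without locale rules). A minimal sketch of the trap:

```java
import java.util.Locale;

public class TurkishI {
    public static void main(String[] args) {
        Locale tr = Locale.forLanguageTag("tr");
        // Under Turkish rules, "FILE" lowercases to "fıle" (dotless ı)...
        System.out.println("FILE".toLowerCase(tr).equals("file"));          // false
        // ...so use the locale-neutral root locale for programmatic comparisons:
        System.out.println("FILE".toLowerCase(Locale.ROOT).equals("file")); // true
        // or equalsIgnoreCase, which ignores locale entirely:
        System.out.println("FILE".equalsIgnoreCase("file"));                // true
    }
}
```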
Unicode Escapes and String Literals
Java supports several ways to embed Unicode characters in source code:
// \\u escape (processed by the compiler before parsing)
String s1 = "\u00E9"; // é
// Supplementary via surrogate pair
String s2 = "\uD83D\uDE00"; // 😀
// Direct source literal (if file is UTF-8)
String s3 = "é";
String s4 = "😀";
Warning: Java \u escapes are processed at the lexer level, before the
compiler parses the code. This means \u000A is treated as a newline, and
\u0022 is treated as a double quote, which can cause subtle bugs:
// This is a compilation error because \\u000A becomes a literal newline:
// String bad = "line1\u000Aline2";
// Use \\n instead:
String ok = "line1\nline2";
Normalization
Java provides java.text.Normalizer for Unicode normalization:
import java.text.Normalizer;
String composed = "\u00E9"; // é (NFC, 1 code point)
String decomposed = "e\u0301"; // é (NFD, 2 code points)
System.out.println(composed.equals(decomposed)); // false
String nfc = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
System.out.println(composed.equals(nfc)); // true
| Form | Description |
|---|---|
| NFC | Canonical decomposition + canonical composition (recommended) |
| NFD | Canonical decomposition |
| NFKC | Compatibility decomposition + composition |
| NFKD | Compatibility decomposition |
Always normalize user input to NFC before storing or comparing.
Encoding and Byte Conversion
Java strings are always UTF-16 internally, but converting to/from byte arrays requires specifying an encoding:
import java.nio.charset.StandardCharsets;
String text = "日本語";
byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
byte[] utf16 = text.getBytes(StandardCharsets.UTF_16);
// Convert back
String back = new String(utf8, StandardCharsets.UTF_8);
Always use StandardCharsets constants instead of string names to avoid
UnsupportedEncodingException at runtime.
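A related caveat: the String(byte[], Charset) constructor silently replaces malformed byte sequences with U+FFFD. When decoding errors should fail loudly instead, a CharsetDecoder can be configured to report them; a sketch:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictDecode {
    public static void main(String[] args) {
        byte[] bad = {(byte) 0xFF, (byte) 0xFE}; // not valid UTF-8
        // Lenient default: each malformed byte becomes U+FFFD REPLACEMENT CHARACTER
        System.out.println(new String(bad, StandardCharsets.UTF_8));
        // Strict: configure the decoder to report malformed input instead
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bad));
            System.out.println("decoded cleanly");
        } catch (CharacterCodingException e) {
            System.out.println("malformed input detected"); // this branch runs here
        }
    }
}
```

Strict decoding is especially useful at system boundaries, where silently corrupted text is much harder to debug than an early exception.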
Regular Expressions
Java's java.util.regex package is Unicode-aware. Use \p{...} for Unicode
property escapes:
import java.util.regex.*;
// Match any Unicode letter
Pattern letters = Pattern.compile("\\p{L}+");
Matcher m = letters.matcher("café 日本語");
while (m.find()) {
System.out.println(m.group()); // "café", "日本語"
}
// Match any Unicode digit
Pattern.compile("\\p{Nd}+"); // decimal digits from any script
// Match a specific script
Pattern.compile("\\p{IsGreek}+"); // Greek characters
Pattern.compile("\\p{IsCyrillic}+");// Cyrillic characters
// UNICODE_CHARACTER_CLASS flag (Java 7+)
Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS);
// \\w now matches Unicode letters, not just [a-zA-Z0-9_]
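One more flag worth knowing: CASE_INSENSITIVE alone folds only US-ASCII letters; combining it with UNICODE_CASE extends case folding to the full Unicode range. A small sketch:

```java
import java.util.regex.Pattern;

public class UnicodeCaseDemo {
    public static void main(String[] args) {
        // ASCII-only folding: É does not match é
        boolean ascii = Pattern.compile("café", Pattern.CASE_INSENSITIVE)
                .matcher("CAFÉ").matches();
        // Unicode-aware folding: É matches é
        boolean unicode = Pattern.compile("café",
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE)
                .matcher("CAFÉ").matches();
        System.out.println(ascii);   // false
        System.out.println(unicode); // true
    }
}
```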
Collation and Sorting
Code-point order is rarely the correct sort order for human-readable text. Use
java.text.Collator for locale-aware sorting:
import java.text.Collator;
import java.util.*;
Collator collator = Collator.getInstance(Locale.GERMAN);
List<String> words = Arrays.asList("Zug", "Äpfel", "Apfel");
words.sort(collator);
System.out.println(words); // [Apfel, Äpfel, Zug] (German rules: Ä ≈ A)
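Collator also lets you tune how fine-grained comparisons are via its strength setting; at PRIMARY strength, case and accent differences are ignored. A sketch (the locale choice is illustrative):

```java
import java.text.Collator;
import java.util.Locale;

public class CollatorStrength {
    public static void main(String[] args) {
        Collator c = Collator.getInstance(Locale.FRENCH);
        c.setStrength(Collator.PRIMARY);  // compare base letters only
        System.out.println(c.compare("côte", "COTE")); // 0: equal at primary strength
        c.setStrength(Collator.TERTIARY); // default: accents and case significant
        System.out.println(c.compare("côte", "COTE") == 0); // false
    }
}
```

PRIMARY strength is a common choice for accent- and case-insensitive search.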
Common Pitfalls
1. Using length() to Count Characters
// WRONG: counts char values
"Hello 🌍".length() // 8 (not 7)
// CORRECT: counts code points
"Hello 🌍".codePointCount(0, "Hello 🌍".length()) // 7
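Note that even the code point count can disagree with what users perceive as one character: flags and other emoji sequences combine several code points into a single grapheme cluster. Since Java 9, the regex construct \X matches one extended grapheme cluster; a sketch using it as a grapheme counter:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GraphemeCount {
    // Count user-perceived characters via \X (extended grapheme cluster, Java 9+)
    static int graphemes(String s) {
        Matcher m = Pattern.compile("\\X").matcher(s);
        int count = 0;
        while (m.find()) count++;
        return count;
    }

    public static void main(String[] args) {
        String flag = "🇯🇵"; // U+1F1EF + U+1F1F5: two regional indicators, one flag
        System.out.println(flag.length());                         // 4 char values
        System.out.println(flag.codePointCount(0, flag.length())); // 2 code points
        System.out.println(graphemes(flag));                       // 1 grapheme
    }
}
```

Which count is "right" depends on the task: storage limits care about bytes or chars, APIs care about code points, and UI measurements care about graphemes.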
2. Truncating Strings That Contain Supplementary Characters
Cutting at an arbitrary char index can split a surrogate pair:
String text = "Hi 🎉 there";
// WRONG: may split a surrogate pair
String bad = text.substring(0, 4); // "Hi \uD83C" ← broken!
// SAFE: advance by code points
int end = text.offsetByCodePoints(0, 4);
String safe = text.substring(0, end); // "Hi 🎉"
3. Comparing Without Normalization
String a = "\u00F1"; // ñ (precomposed)
String b = "n\u0303"; // ñ (n + combining tilde)
a.equals(b); // false ← same visual character!
Always normalize before comparing.
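One way to follow that advice is a small helper that normalizes both operands before comparing (the name normalizedEquals is my own, not a standard API):

```java
import java.text.Normalizer;

public class NormalizedCompare {
    // Hypothetical helper: equality under canonical equivalence
    static boolean normalizedEquals(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        String precomposed = "\u00F1"; // ñ
        String combining = "n\u0303";  // n + combining tilde
        System.out.println(precomposed.equals(combining));           // false
        System.out.println(normalizedEquals(precomposed, combining)); // true
    }
}
```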
Quick Reference
| Task | Code |
|---|---|
| Code point count | s.codePointCount(0, s.length()) |
| Iterate code points | s.codePoints().forEach(...) |
| Code point at index | s.codePointAt(i) |
| Code point to string | new String(Character.toChars(cp)) |
| Character name | Character.getName(cp) |
| Is supplementary | Character.isSupplementaryCodePoint(cp) |
| Normalize NFC | Normalizer.normalize(s, Normalizer.Form.NFC) |
| Encode to UTF-8 | s.getBytes(StandardCharsets.UTF_8) |
| Locale-aware sort | Collator.getInstance(locale) |
| Unicode regex | Pattern.compile("\\p{L}+") |
Java's UTF-16 foundation means that supplementary characters require constant
vigilance. The good news is that the codePoints() stream, Character methods
with int parameters, and Normalizer provide all the tools you need. The key
discipline is to always think in code points, not char values, and to test
with supplementary characters early in development.