Unicode in Java
Java's char type is a 16-bit UTF-16 code unit, not a full Unicode character, which creates subtle bugs when working with supplementary characters outside the BMP. This guide explains how Java handles Unicode strings, the difference between char and code points, and best practices for internationalized Java applications.
Java was one of the first mainstream languages to commit to Unicode from its
inception. When Java 1.0 shipped in 1996, the char type was defined as a
16-bit unsigned integer representing a Unicode character -- a bold design choice
at a time when most languages treated characters as single bytes. That decision
shaped Java's string handling for decades, and understanding its implications is
essential for any Java developer who works with international text, emoji, or
characters outside the Basic Multilingual Plane.
The char Type and UTF-16
Java's char is a 16-bit type that holds a single UTF-16 code unit:
char letter = 'A'; // U+0041
char euro = '\u20AC'; // U+20AC EURO SIGN
char sigma = '\u03C3'; // U+03C3 GREEK SMALL LETTER SIGMA
When Unicode was young, 65,536 code points seemed sufficient for every living
script, so Java's designers mapped char directly to a Unicode code point.
Unicode 2.0 (1996) expanded the code space to over one million code points by
introducing supplementary planes. Characters above U+FFFF -- including most
emoji, historic scripts, and rare CJK ideographs -- cannot fit in a single
char. Instead, they are encoded as surrogate pairs: two char values
that together represent one code point.
// The emoji 🎵 (U+1F3B5 MUSICAL NOTE) requires a surrogate pair
String music = "\uD83C\uDFB5"; // surrogate pair
String music2 = "🎵"; // same thing, source-level literal
System.out.println(music.length()); // 2 (two char values)
System.out.println(music.codePointCount(0, music.length())); // 1
This distinction between char count and code point count is the single
biggest source of Unicode bugs in Java.
Strings: char[] Under the Hood
A String in Java is a sequence of char values -- that is, a sequence of
UTF-16 code units, not a sequence of Unicode code points. The familiar
methods length(), charAt(), and substring() all operate on char units:
| Method | Returns |
|---|---|
| length() | Number of char values (UTF-16 code units) |
| charAt(i) | The char at index i |
| codePointAt(i) | The full code point starting at char index i |
| codePointCount(begin, end) | Number of Unicode code points in the range |
| offsetByCodePoints(index, n) | The char index that is n code points from index |
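Taken together, these methods support manual code-point iteration. A minimal sketch (the class name is illustrative) that walks a string one code point at a time, advancing by one or two char positions as needed:

```java
public class CodePointWalk {
    public static void main(String[] args) {
        String s = "A\uD83C\uDFB5B"; // 'A', U+1F3B5 MUSICAL NOTE (surrogate pair), 'B'
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);    // full code point at char index i
            System.out.printf("U+%04X%n", cp);
            i += Character.charCount(cp); // advance 1 (BMP) or 2 (supplementary)
        }
        // Prints U+0041, U+1F3B5, U+0042
    }
}
```

This is the pre-streams idiom; on Java 8+, codePoints() (shown below) is usually more convenient.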
BMP Characters (U+0000 to U+FFFF)
For characters within the Basic Multilingual Plane, length() and code point
count are identical. Most Latin, Greek, Cyrillic, CJK, and Arabic text falls
in this range:
String hello = "こんにちは";
System.out.println(hello.length()); // 5
System.out.println(hello.codePointCount(0, hello.length())); // 5
Supplementary Characters (U+10000 and Above)
Emoji, musical symbols, mathematical alphanumerics, and historic scripts use
supplementary code points. Each one occupies two char slots:
String emoji = "Hello 🌍🌎🌏";
System.out.println(emoji.length()); // 12
// "Hello " = 6 chars, each globe = 2 chars → 6 + 6 = 12
System.out.println(emoji.codePointCount(0, emoji.length())); // 9
Iterating Over Code Points
Never iterate over a string with charAt() if it may contain supplementary
characters. Use codePoints() instead:
String text = "A\uD835\uDD38Z"; // A, 𝔸 (U+1D538 MATHEMATICAL DOUBLE-STRUCK A), Z
// WRONG: iterates over char values
for (int i = 0; i < text.length(); i++) {
System.out.printf("char[%d] = %04X%n", i, (int) text.charAt(i));
}
// char[0] = 0041
// char[1] = D835 ← high surrogate (not a real character)
// char[2] = DD38 ← low surrogate
// char[3] = 005A
// CORRECT: iterates over code points
text.codePoints().forEach(cp ->
System.out.printf("U+%04X %s%n", cp, Character.getName(cp))
);
// U+0041 LATIN CAPITAL LETTER A
// U+1D538 MATHEMATICAL DOUBLE-STRUCK CAPITAL A
// U+005A LATIN CAPITAL LETTER Z
The codePoints() method returns an IntStream of code points, which works
seamlessly with Java's stream API.
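For instance, standard IntStream operations can filter or rebuild text by code point; a small sketch (the string contents are arbitrary examples) that counts supplementary characters and strips them out:

```java
public class CodePointStream {
    public static void main(String[] args) {
        String s = "café \uD83C\uDF0D 日本語"; // contains 🌍 (U+1F30D)
        long supplementary = s.codePoints()
                .filter(Character::isSupplementaryCodePoint)
                .count();                                 // 1 (the globe emoji)
        String bmpOnly = s.codePoints()
                .filter(Character::isBmpCodePoint)
                .collect(StringBuilder::new,
                         StringBuilder::appendCodePoint,  // append each code point
                         StringBuilder::append)
                .toString();                              // "café  日本語"
        System.out.println(supplementary);
        System.out.println(bmpOnly);
    }
}
```

The three-argument collect with StringBuilder::appendCodePoint is the standard way to turn an IntStream of code points back into a String.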
The Character Class
java.lang.Character provides Unicode-aware classification and conversion for
individual code points. Most methods have two overloads: one taking char
(limited to the BMP) and one taking int (the full Unicode range). Always
prefer the int overloads:
int cp = 0x1F600; // 😀 GRINNING FACE
Character.isLetter(cp); // false (emoji is not a letter)
Character.isDigit(cp); // false
Character.getType(cp); // Character.OTHER_SYMBOL (28)
Character.getName(cp); // "GRINNING FACE"
Character.charCount(cp); // 2 (needs a surrogate pair)
Character.isBmpCodePoint(cp); // false
Character.isSupplementaryCodePoint(cp); // true
Case Conversion
Character.toUpperCase('σ'); // 'Σ'
Character.toLowerCase('Σ'); // 'σ'
// For locale-aware case conversion, use String methods:
"straße".toUpperCase(Locale.GERMAN); // "STRASSE"
"I".toLowerCase(Locale.forLanguageTag("tr")); // "ı" (Turkish dotless i)
The Turkish locale is notorious for breaking naive case-insensitive comparisons
because I lowercases to ı (not i) and i uppercases to İ (not I).
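A defensive pattern for protocol keywords and identifiers that must not vary by locale is to use Locale.ROOT (or equalsIgnoreCase, which compares char by char without locale rules). A minimal sketch of the trap:

```java
import java.util.Locale;

public class TurkishI {
    public static void main(String[] args) {
        Locale tr = Locale.forLanguageTag("tr");
        // Under Turkish rules, "FILE" lowercases to "fıle" (dotless ı)...
        System.out.println("FILE".toLowerCase(tr).equals("file"));          // false
        // ...so use the locale-neutral root locale for programmatic comparisons:
        System.out.println("FILE".toLowerCase(Locale.ROOT).equals("file")); // true
        // or equalsIgnoreCase, which ignores locale entirely:
        System.out.println("FILE".equalsIgnoreCase("file"));                // true
    }
}
```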
Unicode Escapes and String Literals
Java supports several ways to embed Unicode characters in source code:
// \\u escape (processed by the compiler before parsing)
String s1 = "\u00E9"; // é
// Supplementary via surrogate pair
String s2 = "\uD83D\uDE00"; // 😀
// Direct source literal (if file is UTF-8)
String s3 = "é";
String s4 = "😀";
Warning: Java \u escapes are processed at the lexer level, before the
compiler parses the code. This means \u000A is treated as a newline, and
\u0022 is treated as a double quote, which can cause subtle bugs:
// This is a compilation error because \\u000A becomes a literal newline:
// String bad = "line1\u000Aline2";
// Use \\n instead:
String ok = "line1\nline2";
Normalization
Java provides java.text.Normalizer for Unicode normalization:
import java.text.Normalizer;
String composed = "\u00E9"; // é (NFC, 1 code point)
String decomposed = "e\u0301"; // é (NFD, 2 code points)
System.out.println(composed.equals(decomposed)); // false
String nfc = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
System.out.println(composed.equals(nfc)); // true
| Form | Description |
|---|---|
| NFC | Canonical decomposition + canonical composition (recommended) |
| NFD | Canonical decomposition |
| NFKC | Compatibility decomposition + composition |
| NFKD | Compatibility decomposition |
Always normalize user input to NFC before storing or comparing.
Encoding and Byte Conversion
Java strings are always UTF-16 internally, but converting to/from byte arrays requires specifying an encoding:
import java.nio.charset.StandardCharsets;
String text = "日本語";
byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
byte[] utf16 = text.getBytes(StandardCharsets.UTF_16);
// Convert back
String back = new String(utf8, StandardCharsets.UTF_8);
Always use StandardCharsets constants instead of string names to avoid
UnsupportedEncodingException at runtime.
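A related caveat: the String(byte[], Charset) constructor silently replaces malformed byte sequences with U+FFFD. When decoding errors should fail loudly instead, a CharsetDecoder can be configured to report them; a sketch:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictDecode {
    public static void main(String[] args) {
        byte[] bad = {(byte) 0xFF, (byte) 0xFE}; // not valid UTF-8
        // Lenient default: each malformed byte becomes U+FFFD REPLACEMENT CHARACTER
        System.out.println(new String(bad, StandardCharsets.UTF_8));
        // Strict: configure the decoder to report malformed input instead
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bad));
            System.out.println("decoded cleanly");
        } catch (CharacterCodingException e) {
            System.out.println("malformed input detected"); // this branch runs here
        }
    }
}
```

Strict decoding is especially useful at system boundaries, where silently corrupted text is much harder to debug than an early exception.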
Regular Expressions
Java's java.util.regex package is Unicode-aware. Use \p{...} for Unicode
property escapes:
import java.util.regex.*;
// Match any Unicode letter
Pattern letters = Pattern.compile("\\p{L}+");
Matcher m = letters.matcher("café 日本語");
while (m.find()) {
System.out.println(m.group()); // "café", "日本語"
}
// Match any Unicode digit
Pattern.compile("\\p{Nd}+"); // decimal digits from any script
// Match a specific script
Pattern.compile("\\p{IsGreek}+"); // Greek characters
Pattern.compile("\\p{IsCyrillic}+");// Cyrillic characters
// UNICODE_CHARACTER_CLASS flag (Java 7+)
Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS);
// \\w now matches Unicode letters, not just [a-zA-Z0-9_]
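One more flag worth knowing: CASE_INSENSITIVE alone folds only US-ASCII letters; combining it with UNICODE_CASE extends case folding to the full Unicode range. A small sketch:

```java
import java.util.regex.Pattern;

public class UnicodeCaseDemo {
    public static void main(String[] args) {
        // ASCII-only folding: É does not match é
        boolean ascii = Pattern.compile("café", Pattern.CASE_INSENSITIVE)
                .matcher("CAFÉ").matches();
        // Unicode-aware folding: É matches é
        boolean unicode = Pattern.compile("café",
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE)
                .matcher("CAFÉ").matches();
        System.out.println(ascii);   // false
        System.out.println(unicode); // true
    }
}
```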
Collation and Sorting
Code-point order is rarely the correct sort order for human-readable text. Use
java.text.Collator for locale-aware sorting:
import java.text.Collator;
import java.util.*;
Collator collator = Collator.getInstance(Locale.GERMAN);
List<String> words = Arrays.asList("Zug", "Äpfel", "Apfel");
words.sort(collator);
System.out.println(words); // [Apfel, Äpfel, Zug] (German rules: Ä ≈ A)
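Collator also lets you tune how fine-grained comparisons are via its strength setting; at PRIMARY strength, case and accent differences are ignored. A sketch (the locale choice is illustrative):

```java
import java.text.Collator;
import java.util.Locale;

public class CollatorStrength {
    public static void main(String[] args) {
        Collator c = Collator.getInstance(Locale.FRENCH);
        c.setStrength(Collator.PRIMARY);  // compare base letters only
        System.out.println(c.compare("côte", "COTE")); // 0: equal at primary strength
        c.setStrength(Collator.TERTIARY); // default: accents and case significant
        System.out.println(c.compare("côte", "COTE") == 0); // false
    }
}
```

PRIMARY strength is a common choice for accent- and case-insensitive search.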
Common Pitfalls
1. Using length() to Count Characters
// WRONG: counts char values
"Hello 🌍".length() // 8 (not 7)
// CORRECT: counts code points
"Hello 🌍".codePointCount(0, "Hello 🌍".length()) // 7
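Note that even the code point count can disagree with what users perceive as one character: flags and other emoji sequences combine several code points into a single grapheme cluster. Since Java 9, the regex construct \X matches one extended grapheme cluster; a sketch using it as a grapheme counter:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GraphemeCount {
    // Count user-perceived characters via \X (extended grapheme cluster, Java 9+)
    static int graphemes(String s) {
        Matcher m = Pattern.compile("\\X").matcher(s);
        int count = 0;
        while (m.find()) count++;
        return count;
    }

    public static void main(String[] args) {
        String flag = "🇯🇵"; // U+1F1EF + U+1F1F5: two regional indicators, one flag
        System.out.println(flag.length());                         // 4 char values
        System.out.println(flag.codePointCount(0, flag.length())); // 2 code points
        System.out.println(graphemes(flag));                       // 1 grapheme
    }
}
```

Which count is "right" depends on the task: storage limits care about bytes or chars, APIs care about code points, and UI measurements care about graphemes.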
2. Truncating Strings That Contain Supplementary Characters
Cutting at an arbitrary char index can split a surrogate pair:
String text = "Hi 🎉 there";
// WRONG: may split a surrogate pair
String bad = text.substring(0, 4); // "Hi \uD83C" ← broken!
// SAFE: advance by code points
int end = text.offsetByCodePoints(0, 4);
String safe = text.substring(0, end); // "Hi 🎉"
3. Comparing Without Normalization
String a = "\u00F1"; // ñ (precomposed)
String b = "n\u0303"; // ñ (n + combining tilde)
a.equals(b); // false ← same visual character!
Always normalize before comparing.
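One way to follow that advice is a small helper that normalizes both operands before comparing (the name normalizedEquals is my own, not a standard API):

```java
import java.text.Normalizer;

public class NormalizedCompare {
    // Hypothetical helper: equality under canonical equivalence
    static boolean normalizedEquals(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        String precomposed = "\u00F1"; // ñ
        String combining = "n\u0303";  // n + combining tilde
        System.out.println(precomposed.equals(combining));           // false
        System.out.println(normalizedEquals(precomposed, combining)); // true
    }
}
```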
Quick Reference
| Task | Code |
|---|---|
| Code point count | s.codePointCount(0, s.length()) |
| Iterate code points | s.codePoints().forEach(...) |
| Code point at index | s.codePointAt(i) |
| Code point to string | new String(Character.toChars(cp)) |
| Character name | Character.getName(cp) |
| Is supplementary | Character.isSupplementaryCodePoint(cp) |
| Normalize NFC | Normalizer.normalize(s, Normalizer.Form.NFC) |
| Encode to UTF-8 | s.getBytes(StandardCharsets.UTF_8) |
| Locale-aware sort | Collator.getInstance(locale) |
| Unicode regex | Pattern.compile("\\p{L}+") |
Java's UTF-16 foundation means that supplementary characters require constant
vigilance. The good news is that the codePoints() stream, Character methods
with int parameters, and Normalizer provide all the tools you need. The key
discipline is to always think in code points, not char values, and to test
with supplementary characters early in development.