Java Unicode

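Java strings are sequences of UTF-16 code units. A char is a single 16-bit code unit, so on its own it can represent only BMP characters. For supplementary characters, use codePointAt() and Character.toChars(). Java's \uXXXX escapes are translated by the compiler before the source is tokenized, not at run time.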

What is Java Unicode Handling?

Java's approach to Unicode reflects the historical evolution of the Unicode Standard. Java was designed in the mid-1990s, when Unicode was a 16-bit encoding, and this assumption was baked into the language's char type and String class. When Unicode later expanded to a 21-bit code space to accommodate scripts like Egyptian Hieroglyphs and emoji, Java had to retrofit support for code points beyond U+FFFF, known as supplementary characters. The code point APIs arrived in Java 5.

The char Type: 16-bit BMP Only

Java's char is a 16-bit unsigned integer representing a UTF-16 code unit, not a full Unicode code point. A single char can therefore represent only characters in the Basic Multilingual Plane (U+0000 to U+FFFF). Characters outside this range, from U+10000 through U+10FFFF, require two char values called a surrogate pair.

char c = '\u0041';  // 'A' — works fine, BMP character
// char cannot hold U+1F600 (😀) — it requires a surrogate pair
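
To make the surrogate-pair mechanism concrete, the following sketch derives the two char values for U+1F600 by hand and compares them with what Character.toChars() returns. The class name SurrogateSketch and the method split are hypothetical names chosen for illustration; only the Character methods are part of the JDK.

```java
import java.util.Arrays;

final class SurrogateSketch {
    // Split a supplementary code point into its UTF-16 surrogate pair by hand.
    // (Hypothetical helper; Character.toChars() does this for you.)
    static char[] split(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;                // 20-bit value
        char high = (char) (0xD800 + (offset >> 10));    // top 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));   // bottom 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        char[] manual = split(0x1F600);
        char[] library = Character.toChars(0x1F600);
        System.out.printf("%04X %04X%n", (int) manual[0], (int) manual[1]); // D83D DE00
        System.out.println(Arrays.equals(manual, library));                 // true
        // Recombining the pair gives back the original code point.
        System.out.println(Character.toCodePoint(manual[0], manual[1]) == 0x1F600); // true
    }
}
```

In practice Character.toChars(), Character.highSurrogate(), and Character.lowSurrogate() should be preferred over the manual arithmetic; the hand-written version is only meant to show where the two values come from.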

String as UTF-16

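Java's String class exposes its content as a UTF-16 sequence. (Since Java 9 the JVM may store Latin-1-only strings more compactly, but the API still behaves as UTF-16.) For text confined to the BMP this is transparent, but supplementary characters produce a String where the visual character count differs from length():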

String emoji = "😀";          // U+1F600 GRINNING FACE
emoji.length();               // 2 — two UTF-16 code units (surrogate pair)
emoji.codePointCount(0, emoji.length());  // 1 — one Unicode code point
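
One place where the length() mismatch tends to bite is truncation: cutting at a char index can split a surrogate pair. The sketch below, using the hypothetical helper name truncate, limits a string by code points via offsetByCodePoints(). It assumes maxCodePoints is non-negative and does not attempt to keep grapheme clusters (such as emoji joined by zero-width joiners) intact.

```java
final class CodePointTruncate {
    // Keep at most maxCodePoints code points without cutting a surrogate pair.
    // Assumes maxCodePoints >= 0. (Hypothetical helper for illustration.)
    static String truncate(String s, int maxCodePoints) {
        int total = s.codePointCount(0, s.length());
        if (total <= maxCodePoints) {
            return s;
        }
        // Translate "N code points from the start" into a char index.
        int end = s.offsetByCodePoints(0, maxCodePoints);
        return s.substring(0, end);
    }

    public static void main(String[] args) {
        String s = "A\uD83D\uDE00B";   // 'A', U+1F600, 'B'
        String naive = s.substring(0, 2);   // ends with a lone high surrogate
        String safe = truncate(s, 2);       // 'A' followed by the whole emoji
        System.out.println(Character.isHighSurrogate(naive.charAt(1))); // true
        System.out.println(safe.length());                              // 3
        System.out.println(safe.codePointCount(0, safe.length()));      // 2
    }
}
```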

codePointAt() vs charAt()

The distinction between code-unit-based and code-point-based iteration is the central Java Unicode challenge:

String s = "A😀B";

// charAt() — returns UTF-16 code units
s.charAt(0);   // 'A'
s.charAt(1);   // '\uD83D' — high surrogate, NOT a printable character
s.charAt(2);   // '\uDE00' — low surrogate

// codePointAt() returns full Unicode code points; the index is still a char index
s.codePointAt(0);  // 65 (A)
s.codePointAt(1);  // 128512 (😀, U+1F600)
s.codePointAt(3);  // 66 (B)

For correct supplementary-character-aware iteration, use String.codePoints() in Java 8+:

s.codePoints().forEach(cp ->
    System.out.println(new String(Character.toChars(cp))));
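
When the char index of each code point is also needed (for example, to report an error position), an index-based loop that advances by Character.charCount() is a common alternative to the stream. This is a minimal sketch; the class CodePointWalk and the method describe are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

final class CodePointWalk {
    // List every code point together with the char index where it starts.
    static List<String> describe(String s) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            out.add(String.format("%d:U+%04X", i, cp));
            i += Character.charCount(cp);   // 1 for BMP, 2 for supplementary
        }
        return out;
    }

    public static void main(String[] args) {
        // 'A', U+1F600, 'B'
        System.out.println(describe("A\uD83D\uDE00B")); // [0:U+0041, 1:U+1F600, 3:U+0042]
    }
}
```

Index 2 never appears in the output because it is the low-surrogate half of the emoji, which the loop steps over.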

The Character Class

Character provides static utility methods for Unicode properties. Since Java 5, many methods have overloaded versions that accept int code points (not just char):

Character.isLetter('A');                   // true
Character.isLetter(0x1F600);               // false (emoji, not a letter)
Character.getType('A');                    // Character.UPPERCASE_LETTER
Character.toUpperCase(0x0073);             // 0x0053 ('S')
Character.isSupplementaryCodePoint(0x1F600); // true
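
The int overloads matter most for letters that live outside the BMP. The sketch below uses U+10400 (DESERET CAPITAL LETTER LONG I) as an example of a supplementary letter with a lowercase mapping; the class CodePointProps and the method summarize are hypothetical names, while Character.UnicodeScript is the JDK's own enum (Java 7+).

```java
final class CodePointProps {
    // Summarize a few properties using the int (code point) overloads.
    static String summarize(int cp) {
        return String.format("U+%04X letter=%b upper=%b script=%s",
                cp,
                Character.isLetter(cp),
                Character.isUpperCase(cp),
                Character.UnicodeScript.of(cp));
    }

    public static void main(String[] args) {
        int deseret = 0x10400;   // DESERET CAPITAL LETTER LONG I
        System.out.println(summarize(deseret));
        // Case mapping also works on supplementary code points.
        System.out.printf("U+%04X%n", Character.toLowerCase(deseret));
        // The char overload only ever sees one surrogate, which is not a letter.
        char[] units = Character.toChars(deseret);
        System.out.println(Character.isLetter(units[0]));
    }
}
```

Passing individual char values from a surrogate pair to the char overloads silently gives the wrong classification, which is why code that may see supplementary text should stay in int code points throughout.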

Pattern Matching and Unicode

Regular expressions in Java require the Pattern.UNICODE_CHARACTER_CLASS flag to make \w, \d, \s match Unicode categories rather than just ASCII:

Pattern p = Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS);
p.matcher("Héllo").matches();  // true — accented letter is a word char
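
For comparison, the sketch below runs the same input through \w+ with and without the flag, and also shows two alternatives that exist in java.util.regex: the embedded (?U) form of the flag, and \p{L}, which matches Unicode letters regardless of the flag. The class and field names are hypothetical.

```java
import java.util.regex.Pattern;

final class RegexUnicodeSketch {
    static final Pattern ASCII_WORD = Pattern.compile("\\w+");
    static final Pattern UNICODE_WORD =
            Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS);
    // (?U) is the embedded form of UNICODE_CHARACTER_CLASS.
    static final Pattern EMBEDDED_FLAG = Pattern.compile("(?U)\\w+");
    // \p{L} is Unicode-aware even without the flag.
    static final Pattern ANY_LETTERS = Pattern.compile("\\p{L}+");

    static boolean fullMatch(Pattern p, String input) {
        return p.matcher(input).matches();
    }

    public static void main(String[] args) {
        String word = "H\u00E9llo";   // "Hello" with a precomposed e-acute
        System.out.println(fullMatch(ASCII_WORD, word));     // false
        System.out.println(fullMatch(UNICODE_WORD, word));   // true
        System.out.println(fullMatch(EMBEDDED_FLAG, word));  // true
        System.out.println(fullMatch(ANY_LETTERS, word));    // true
    }
}
```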

Quick Facts

| Feature | Detail |
| --- | --- |
| `char` size | 16 bits (BMP only, U+0000–U+FFFF) |
| `String` encoding | UTF-16 internally |
| Supplementary chars | Surrogate pairs (two `char` values) |
| Code point API | `codePointAt()`, `codePoints()`, `offsetByCodePoints()` |
| `Character` utility | `isLetter()`, `getType()`, `toUpperCase()` (int overloads) |
| Regex Unicode flag | `Pattern.UNICODE_CHARACTER_CLASS` |
| Normalization | `java.text.Normalizer` (NFC, NFD, NFKC, NFKD) |
| Collation | `java.text.Collator`, `java.text.RuleBasedCollator` |
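
The normalization and collation entries are not illustrated elsewhere on this page, so here is a minimal sketch of both. It assumes NFC is the form you want for equality checks and uses the default collator for a given locale; the class NormalizeCollateSketch and its methods are hypothetical names.

```java
import java.text.Collator;
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class NormalizeCollateSketch {
    // Compare two strings after bringing both to NFC.
    static boolean canonicallyEqual(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    // Return a copy sorted with the locale's collation rules.
    static List<String> sortForLocale(List<String> words, Locale locale) {
        List<String> copy = new ArrayList<>(words);
        copy.sort(Collator.getInstance(locale));
        return copy;
    }

    public static void main(String[] args) {
        String composed = "\u00E9";      // precomposed e-acute
        String decomposed = "e\u0301";   // 'e' plus combining acute accent
        System.out.println(composed.equals(decomposed));            // false
        System.out.println(canonicallyEqual(composed, decomposed)); // true

        List<String> words = Arrays.asList("zebra", "\u00E9clair", "apple");
        // Plain String ordering compares code units, so the accented word sorts last.
        List<String> byCodeUnit = new ArrayList<>(words);
        byCodeUnit.sort(null);
        System.out.println(byCodeUnit);
        System.out.println(sortForLocale(words, Locale.FRENCH));
    }
}
```

Whether NFC or one of the compatibility forms (NFKC, NFKD) is appropriate depends on the application; the compatibility forms also fold characters such as ligatures and should not be applied blindly to text that will be displayed.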

More in Programming & Development

Python Unicode

Python 3 uses Unicode strings by default (str = UTF-8 internally via …

Rust Unicode

Rust strings (str/String) are guaranteed valid UTF-8. char type represents a Unicode …

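Unicode Regular Expressions

Regular expression patterns that use Unicode properties: \p{L} (any letter), \p{Script=Greek} (Greek script), \p{Emoji}. Support varies across languages and regex engines.

Unicode Escape Sequences

Syntax for representing Unicode characters in source code, which differs by language: \u2713 (Python/Java/JS), \u{2713} (JS/Ruby/Rust), \U00012345 (Python/C).

Invisible Characters

Characters with no visible glyph: whitespace, zero-width characters, control characters, and format characters. They can cause security problems such as spoofing and text steganography.

Mojibake

Surrogate Pair

The two 16-bit code units that together encode a supplementary character in UTF-16 (high surrogate U+D800–U+DBFF + low surrogate U+DC00–U+DFFF). 😀 = D83D DE00.

String

A sequence of characters in a programming language. Internal representations vary: UTF-8 (Go, Rust, newer Python), UTF-16 (Java, JavaScript, C#), or UTF-32 (Python).

String Length Ambiguity

The "length" of a Unicode string depends on the unit of measurement: code units (JavaScript .length), code points (Python len()), or grapheme clusters. 👨‍👩‍👧‍👦 = 7 code points, 1 grapheme.

Replacement Character

U+FFFD (�), the character a decoder displays when it encounters an invalid byte sequence; the generic sign that decoding went wrong.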
