Java Unicode
Java strings are sequences of UTF-16 code units. A char is 16 bits wide, so a single char can hold only a BMP character. For supplementary characters, use codePointAt() and Character.toChars(). Java's \uXXXX escapes are translated at compile time, before the source is tokenized.
What is Java Unicode Handling?
Java's approach to Unicode reflects the historical evolution of the Unicode Standard. Java was designed in the mid-1990s, when Unicode was a 16-bit encoding, and this assumption was baked into the language's char type and String class. When Unicode later grew to a 21-bit code space (up to U+10FFFF) to accommodate scripts like Egyptian Hieroglyphs and emoji, Java had to retrofit support for code points beyond U+FFFF, the supplementary characters. The code point APIs for this arrived in Java 5.
The char Type: 16-bit BMP Only
Java's char is a 16-bit unsigned integer representing a UTF-16 code unit, not a full Unicode code point. This means char can represent only characters in the Basic Multilingual Plane (U+0000 to U+FFFF). Characters outside this range — everything from U+10000 onward — require two char values called a surrogate pair.
char c = '\u0041'; // 'A' — works fine, BMP character
// char cannot hold U+1F600 (😀) — it requires a surrogate pair
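To make the surrogate-pair mechanics concrete, here is a minimal sketch that splits a code point into its UTF-16 code units and recombines them. The class and method names (SurrogatePairSketch, toUtf16, fromSurrogates) are hypothetical and chosen for illustration; the Character methods they call are standard.

```java
class SurrogatePairSketch {
    // Splits one code point into its UTF-16 code units:
    // one char for a BMP code point, two chars (a surrogate pair) otherwise.
    static char[] toUtf16(int codePoint) {
        return Character.toChars(codePoint);
    }

    // Recombines a high/low surrogate pair into the code point it encodes.
    static int fromSurrogates(char high, char low) {
        if (!Character.isSurrogatePair(high, low)) {
            throw new IllegalArgumentException("not a valid surrogate pair");
        }
        return Character.toCodePoint(high, low);
    }

    public static void main(String[] args) {
        char[] units = toUtf16(0x1F600); // GRINNING FACE
        System.out.printf("code units: %d (%04X %04X)%n",
                units.length, (int) units[0], (int) units[1]);
        System.out.printf("recombined: U+%X%n", fromSurrogates(units[0], units[1]));
    }
}
```

For U+1F600 the two code units are D83D and DE00, which matches the values charAt() reports for the same emoji.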
String as UTF-16
Java's String class exposes its content as a sequence of UTF-16 code units. For text that stays within the BMP this is transparent, but supplementary characters produce a String whose code point count differs from length():
String emoji = "😀"; // U+1F600 GRINNING FACE
emoji.length(); // 2 — two UTF-16 code units (surrogate pair)
emoji.codePointCount(0, emoji.length()); // 1 — one Unicode code point
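One practical consequence of the length() mismatch is truncation: substring() with a raw char index can cut a surrogate pair in half. The sketch below shows one way to truncate by code points with offsetByCodePoints(). The class and method names are hypothetical, and the method assumes maxCodePoints is non-negative.

```java
class CodePointTruncateSketch {
    // Returns at most maxCodePoints code points from the start of s,
    // never splitting a surrogate pair. Assumes maxCodePoints >= 0.
    static String truncate(String s, int maxCodePoints) {
        int total = s.codePointCount(0, s.length());
        if (total <= maxCodePoints) {
            return s;
        }
        // Convert the code point offset into a char index.
        int endIndex = s.offsetByCodePoints(0, maxCodePoints);
        return s.substring(0, endIndex);
    }

    public static void main(String[] args) {
        String s = "A\uD83D\uDE00B"; // "A😀B" written with escapes
        String safe = truncate(s, 2);
        System.out.println(safe.length());                          // char count of the result
        System.out.println(safe.codePointCount(0, safe.length()));  // code point count
    }
}
```

A plain s.substring(0, 2) on the same input would end with a lone high surrogate, which is not valid text on its own.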
codePointAt() vs charAt()
The distinction between code-unit-based and code-point-based iteration is the central Java Unicode challenge:
String s = "A😀B";
// charAt() — returns UTF-16 code units
s.charAt(0); // 'A'
s.charAt(1); // '\uD83D' — high surrogate, NOT a printable character
s.charAt(2); // '\uDE00' — low surrogate
// codePointAt() — returns full Unicode code points
s.codePointAt(0); // 65 (A)
s.codePointAt(1); // 128512 (😀, U+1F600)
s.codePointAt(3); // 66 (B)
For correct supplementary-character-aware iteration, use String.codePoints() in Java 8+:
s.codePoints().forEach(cp ->
System.out.println(new String(Character.toChars(cp))));
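When the char index of each code point is also needed (for example, to report positions), an index-based loop that advances by Character.charCount() is a common alternative to codePoints(). This is a sketch only; the class name and the "index:U+XXXX" output format are illustrative assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class CodePointLoopSketch {
    // Walks the string one code point at a time and records
    // "charIndex:U+XXXX" for each code point found.
    static List<String> describe(String s) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            out.add(String.format(Locale.ROOT, "%d:U+%04X", i, cp));
            // Advance by 1 for BMP code points, 2 for supplementary ones.
            i += Character.charCount(cp);
        }
        return out;
    }

    public static void main(String[] args) {
        for (String line : describe("A\uD83D\uDE00B")) { // "A😀B"
            System.out.println(line);
        }
    }
}
```

The loop never lands on index 2, the low surrogate, because the step after the emoji is two chars wide.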
The Character Class
Character provides static utility methods for Unicode properties. Since Java 5, many methods have overloaded versions that accept int code points (not just char):
Character.isLetter('A'); // true
Character.isLetter(0x1F600); // false (emoji, not a letter)
Character.getType('A'); // Character.UPPERCASE_LETTER
Character.toUpperCase(0x0073); // 0x0053 ('S')
Character.isSupplementaryCodePoint(0x1F600); // true
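The difference between the char and int overloads shows up once a string contains supplementary letters. The sketch below counts letters both ways; the class and method names are hypothetical. The sample uses U+10400 (DESERET CAPITAL LETTER LONG I), a letter outside the BMP.

```java
class CharacterIntOverloadSketch {
    // Counts letters per code point, using the int overload of isLetter.
    static long countLettersByCodePoint(String s) {
        return s.codePoints().filter(Character::isLetter).count();
    }

    // Counts letters per char, using the char overload. Each half of a
    // surrogate pair is tested alone, so supplementary letters are missed.
    static long countLettersByChar(String s) {
        long n = 0;
        for (int i = 0; i < s.length(); i++) {
            if (Character.isLetter(s.charAt(i))) {
                n++;
            }
        }
        return n;
    }

    public static void main(String[] args) {
        String s = "a\uD801\uDC00 1"; // 'a', U+10400, space, '1'
        System.out.println(countLettersByCodePoint(s)); // counts 'a' and U+10400
        System.out.println(countLettersByChar(s));      // counts only 'a'
    }
}
```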
Pattern Matching and Unicode
Regular expressions in Java require the Pattern.UNICODE_CHARACTER_CLASS flag to make \w, \d, \s match Unicode categories rather than just ASCII:
Pattern p = Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS);
p.matcher("Héllo").matches(); // true — accented letter is a word char
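To see what the flag changes, the following sketch runs the same inputs through \w+ and \d+ with and without it. The class, field, and method names are hypothetical; the embedded (?U) form is the inline equivalent of Pattern.UNICODE_CHARACTER_CLASS.

```java
import java.util.regex.Pattern;

class UnicodeRegexSketch {
    // Default: \w and \d are ASCII-only.
    static final Pattern ASCII_WORD = Pattern.compile("\\w+");
    // With the flag: \w and \d follow Unicode properties.
    static final Pattern UNICODE_WORD =
            Pattern.compile("\\w+", Pattern.UNICODE_CHARACTER_CLASS);
    // Same behavior, enabled inline with the embedded flag (?U).
    static final Pattern EMBEDDED_WORD = Pattern.compile("(?U)\\w+");
    static final Pattern ASCII_DIGITS = Pattern.compile("\\d+");
    static final Pattern UNICODE_DIGITS =
            Pattern.compile("\\d+", Pattern.UNICODE_CHARACTER_CLASS);

    // True if the whole input matches the pattern.
    static boolean matchesFully(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        String word = "H\u00E9llo";     // "Héllo"
        String digits = "\u0663\u0664"; // Arabic-Indic digits three and four
        System.out.println(matchesFully(ASCII_WORD, word));     // false
        System.out.println(matchesFully(UNICODE_WORD, word));   // true
        System.out.println(matchesFully(EMBEDDED_WORD, word));  // true
        System.out.println(matchesFully(ASCII_DIGITS, digits)); // false
        System.out.println(matchesFully(UNICODE_DIGITS, digits)); // true
    }
}
```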
Quick Facts
| Feature | Detail |
|---|---|
| char size | 16 bits (BMP only, U+0000–U+FFFF) |
| String encoding | UTF-16 internally |
| Supplementary chars | Surrogate pairs (two char values) |
| Code point API | codePointAt(), codePoints(), offsetByCodePoints() |
| Character utility | isLetter(), getType(), toUpperCase() (int overloads) |
| Regex Unicode flag | Pattern.UNICODE_CHARACTER_CLASS |
| Normalization | java.text.Normalizer (NFC, NFD, NFKC, NFKD) |
| Collation | java.text.Collator, java.text.RuleBasedCollator |
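The Normalization row can be illustrated briefly. The sketch below compares two strings that render as "é" but use different code point sequences, using java.text.Normalizer. The class and method names are hypothetical, and choosing NFC as the comparison form is an assumption made for this example.

```java
import java.text.Normalizer;

class NormalizationSketch {
    // Compares two strings after normalizing both to NFC, so that
    // precomposed and decomposed spellings of the same text are equal.
    static boolean canonicallyEqual(String a, String b) {
        return Normalizer.normalize(a, Normalizer.Form.NFC)
                .equals(Normalizer.normalize(b, Normalizer.Form.NFC));
    }

    public static void main(String[] args) {
        String precomposed = "\u00E9"; // é as one code point
        String decomposed = "e\u0301"; // e followed by COMBINING ACUTE ACCENT
        System.out.println(precomposed.equals(decomposed));           // false
        System.out.println(canonicallyEqual(precomposed, decomposed)); // true
    }
}
```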