Unicode Escape Sequences
A syntax for representing Unicode characters in source code. It differs between languages: \u2713 (Python/Java/JS), \u{2713} (JS/Ruby/Rust), \U00012345 (Python/C).
What Are Unicode Escape Sequences?
Unicode escape sequences are a notation for representing Unicode characters in source code using only ASCII characters. Instead of embedding the actual character (which may not be typeable or visible), you write a backslash-based sequence that the language parser converts to the character at compile or parse time.
The exact syntax varies by language, but two forms dominate:
\uXXXX: four hex digits; covers the Basic Multilingual Plane (U+0000–U+FFFF).
\UXXXXXXXX: eight hex digits; covers all of Unicode, including the supplementary planes (U+0000–U+10FFFF).
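Which of the two forms applies depends only on the size of the code point. A minimal Python sketch of that rule follows; the helper name `to_escape` is made up for illustration and is not part of any library.

```python
def to_escape(ch: str) -> str:
    """Return the 4-digit or 8-digit escape for a single character."""
    code_point = ord(ch)
    if code_point <= 0xFFFF:
        # Basic Multilingual Plane: four hex digits are enough.
        return f"\\u{code_point:04X}"
    # Supplementary planes: use the eight-digit form.
    return f"\\U{code_point:08X}"


print(to_escape("©"))   # \u00A9
print(to_escape("😀"))  # \U0001F600
```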
Language-by-Language Syntax
# Python
"\u00A9" # © (U+00A9, 4-digit BMP)
"\U0001F600" # 😀 (U+1F600, 8-digit supplementary)
"\N{SNOWMAN}" # ☃ (named character)
"\xA9" # © (also valid: exactly 2 hex digits, so U+0000–U+00FF only; "\x00A9" would be NUL followed by "A9")
# All identical:
"\u00A9" == "\U000000A9" == "©" # True
// JavaScript
"\u00A9" // © (BMP)
"\u{1F600}" // 😀 (ES6+ brace notation, any code point)
"\uD83D\uDE00" // 😀 (legacy: surrogate pair for supplementary)
// ES6 brace notation is recommended:
"\u{1F1FA}\u{1F1F8}" // 🇺🇸 (flag sequence)
// Java: only the 4-digit form (BMP); supplementary characters need surrogate pairs
"\u00A9" // ©
"\uD83D\uDE00" // 😀 (surrogate pair)
// Java has no \U syntax; use a surrogate pair or Character.toChars(0x1F600)
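The surrogate pair used in the JavaScript and Java examples above follows from the UTF-16 encoding rule: subtract 0x10000 from the code point, then split the remaining 20 bits into two 10-bit halves. The Python sketch below illustrates that arithmetic; the function names are hypothetical and chosen only for this example.

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a supplementary code point into UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only supplementary code points have surrogate pairs")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


def java_style_escape(code_point: int) -> str:
    """Format a code point using only 4-digit escapes, as Java requires."""
    if code_point <= 0xFFFF:
        return f"\\u{code_point:04X}"
    high, low = to_surrogate_pair(code_point)
    return f"\\u{high:04X}\\u{low:04X}"


print(java_style_escape(0x1F600))  # \uD83D\uDE00
```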
// C#
"\u00A9" // ©
"\U0001F600" // 😀
"\x00A9" // ©
// Rust
"\u{A9}" // © (brace notation, variable length)
"\u{1F600}" // 😀
// Go
"\u00A9" // © (string literal, BMP; the rune literal form is '\u00A9')
"\U0001F600" // 😀 (string literal, supplementary)
Java's Unusual Preprocessing
Java processes \uXXXX escapes during lexical preprocessing — before the tokenizer runs. This means a Unicode escape can appear in virtually any context including comments and string literals:
// The comment on the "int x" line below contains the escape for U+000A (line feed).
// The compiler turns it into a real newline, which ends the comment early.
int x = 1; // \u000A x = 2;
// Effectively parsed as:
// int x = 1;
// x = 2;
This is a subtle Java gotcha: Unicode escapes in comments can inject real source code.
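One way to guard against this is to scan Java sources for escapes that decode to a line terminator. The Python sketch below is a rough illustration, not a complete linter: the function name is hypothetical, it only looks for U+000A and U+000D, and it assumes the Java rule that a backslash can start a Unicode escape only when it is preceded by an even number of backslashes.

```python
import re

# A backslash, one or more 'u' characters (Java allows several), then 000A or 000D.
_LINE_BREAK_ESCAPE = re.compile(r"\\u+000[aAdD]")


def find_line_break_escapes(java_source: str) -> list[tuple[int, str]]:
    """Return (line number, escape) for each escape that acts as a line break."""
    hits = []
    for lineno, line in enumerate(java_source.splitlines(), start=1):
        for match in _LINE_BREAK_ESCAPE.finditer(line):
            # Count the backslashes immediately before the matched one.
            preceding = 0
            pos = match.start() - 1
            while pos >= 0 and line[pos] == "\\":
                preceding += 1
                pos -= 1
            # An odd count means the backslash is itself escaped: not a Unicode escape.
            if preceding % 2 == 0:
                hits.append((lineno, match.group()))
    return hits


source = 'int x = 1; // \\u000A x = 2;\nString s = "\\\\u000A";'
print(find_line_break_escapes(source))  # [(1, '\\u000A')]
```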
Using Escapes in Practice
# When to use escapes:
# 1. In code that must be ASCII-safe
ARROW = "\u2192" # → RIGHT ARROW
# 2. For control characters
NULL = "\u0000" # NUL
LINE_SEP = "\u2028" # LINE SEPARATOR
# 3. For documentation clarity
ZWJ = "\u200D" # Zero Width Joiner — invisible in source
NBSP = "\u00A0" # Non-Breaking Space — invisible in source
# Named escapes (of the languages shown here, only Python has them); most readable, no import needed
"\N{COPYRIGHT SIGN}" # ©
"\N{ZERO WIDTH JOINER}" # (ZWJ)
// ES6+ template literals with escapes
const message = `Copyright \u{A9} 2024 \u{2014} All rights reserved`;
// "Copyright © 2024 — All rights reserved"
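Invisible characters such as ZWJ and NBSP are also hard to spot in data at runtime. The sketch below uses Python's standard `unicodedata` module to list each code point in a string with its official name; the helper name `describe` and the `<unnamed>` placeholder are illustrative choices.

```python
import unicodedata


def describe(text: str) -> list[str]:
    """Return 'U+XXXX NAME' for every code point in text."""
    return [
        # Some code points (for example control characters) have no name,
        # so a placeholder is supplied as the default.
        f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"
        for ch in text
    ]


for entry in describe("a\u200Db\u00A0"):
    print(entry)
# U+0061 LATIN SMALL LETTER A
# U+200D ZERO WIDTH JOINER
# U+0062 LATIN SMALL LETTER B
# U+00A0 NO-BREAK SPACE
```

The reverse direction is `unicodedata.lookup("ZERO WIDTH JOINER")`, which returns the same character as the `"\N{ZERO WIDTH JOINER}"` literal.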
Escape vs. Direct Character
In UTF-8 source files, direct characters are generally preferred for readability:
# Readable — direct character
emoji = "😀"
# ASCII-safe — escape (useful in legacy systems)
emoji = "\U0001F600"
# Both produce identical runtime values
"😀" == "\U0001F600" # True
Quick Facts
| Language | BMP syntax | Full range syntax |
|---|---|---|
| Python | \uXXXX | \UXXXXXXXX or \N{name} |
| JavaScript | \uXXXX | \u{XXXXX} (ES6+) |
| Java | \uXXXX | Surrogate pairs only |
| C# | \uXXXX | \UXXXXXXXX |
| Rust | \u{X} to \u{XXXXXX} | Same (variable length) |
| Go | \uXXXX | \UXXXXXXXX |
| CSS | \XXXXXX | Same (1–6 hex digits) |
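For the Python row, existing text can be converted between direct characters and the escaped forms with the built-in `unicode_escape` codec. The sketch below assumes that round trip is all that is needed; the two function names are hypothetical. Note that the codec emits lowercase hex digits and uses the two-digit \x form for code points below U+0100.

```python
def to_ascii_safe(text: str) -> str:
    """Replace every non-ASCII character with a Python escape sequence."""
    return text.encode("unicode_escape").decode("ascii")


def from_ascii_safe(escaped: str) -> str:
    """Turn Python escape sequences back into the characters they denote."""
    return escaped.encode("ascii").decode("unicode_escape")


print(to_ascii_safe("→"))   # \u2192
print(to_ascii_safe("😀"))  # \U0001f600
print(from_ascii_safe("\\u2192") == "→")  # True
```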