Unicode Escape Sequences: Cross-Language Reference
Every programming language needs a way to include arbitrary Unicode characters in source code even when the editor, terminal, or transport layer is ASCII-only. Unicode escape sequences are the solution: a portable, ASCII-safe syntax that represents any code point, from \u0041 in Java to \N{LATIN SMALL LETTER A} in Python. This reference covers the escape syntax for Python, JavaScript, Java, C#, Go, Rust, and Swift, plus HTML, CSS, and URL encoding.
The Concept
A Unicode escape sequence is a text representation of a Unicode code point using only printable ASCII characters. The exact syntax varies by language. The code point is usually written in hexadecimal, although HTML also accepts decimal and URLs encode the character's UTF-8 bytes rather than the code point itself. For example, the RIGHTWARDS ARROW (→, U+2192) can be written as:
| Context | Escape |
|---|---|
| Python | \u2192 |
| JavaScript | \u2192 or \u{2192} |
| Java | \u2192 |
| C# | \u2192 |
| Go | \u2192 |
| Rust | \u{2192} |
| HTML | &#8594; or &#x2192; |
| CSS | \2192 |
| URL | %E2%86%92 (UTF-8 percent encoding) |
Python
Python string literals support several escape forms:
# \uXXXX: BMP code points (exactly 4 hex digits, U+0000 to U+FFFF)
arrow = "\u2192" # → U+2192 RIGHTWARDS ARROW
euro = "\u20AC" # € U+20AC EURO SIGN
alpha = "\u03B1" # α U+03B1 GREEK SMALL LETTER ALPHA
# \UXXXXXXXX: any code point (exactly 8 hex digits, needed for supplementary planes)
snake = "\U0001F40D" # 🐍 U+1F40D SNAKE
clef = "\U0001D11E" # 𝄞 U+1D11E MUSICAL SYMBOL G CLEF
flag = "\U0001F1FA\U0001F1F8" # 🇺🇸 (two code points)
# \xNN: exactly 2 hex digits, U+0000 to U+00FF in str literals
# In bytes literals, \xNN is a single byte value, which is where it is most useful
euro_bytes = b"\xE2\x82\xAC" # UTF-8 bytes for €
# \N{NAME}: named escape (name aliases accepted since Python 3.3)
arrow2 = "\N{RIGHTWARDS ARROW}" # "→"
# unicodedata.lookup() performs the same name lookup at runtime
from unicodedata import lookup
arrow3 = lookup("RIGHTWARDS ARROW") # "→"
In raw strings (r"..." / r'...'), escape sequences are not processed:
path = r"C:\users\name" # \u and \n are literal characters
pattern = r"\u2192" # six literal characters, not an arrow (the re module interprets \uXXXX itself)
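As a quick check of the forms above, the following sketch compares the different spellings of U+2192 and confirms that a raw string keeps its backslash. The variable names are illustrative only.

```python
import unicodedata

# Three spellings of U+2192 produce the same one-character string.
arrow_u = "\u2192"
arrow_named = "\N{RIGHTWARDS ARROW}"
arrow_chr = chr(0x2192)
same_arrow = arrow_u == arrow_named == arrow_chr

# The 8-digit form reaches the supplementary planes.
snake = "\U0001F40D"
snake_name = unicodedata.name(snake)

# A raw string keeps the backslash, so it holds six characters.
raw = r"\u2192"
raw_length = len(raw)

print(same_arrow, snake_name, raw_length)
```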
Python Bytes Literals
Bytes literals only support \xNN and the basic ASCII escapes. They do not
support \uXXXX:
raw_utf8 = b"\xE2\x86\x92" # 3 bytes of UTF-8 for →
raw_utf8.decode("utf-8") # "→"
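The relationship between a str escape and its UTF-8 bytes can be sketched as a round trip. The last step uses the standard `unicode_escape` codec, shown here only as one possible way to interpret escape text that arrives as ASCII bytes.

```python
# Encoding the one-character string yields three UTF-8 bytes.
arrow = "\u2192"
encoded = arrow.encode("utf-8")
round_trip = encoded.decode("utf-8")

# Escape text stored as ASCII bytes (backslash, u, 2, 1, 9, 2) is six bytes
# long until a codec interprets it.
escape_text = b"\\u2192"
decoded_escape = escape_text.decode("unicode_escape")

print(encoded, len(escape_text), decoded_escape == arrow)
```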
JavaScript
JavaScript's original \uXXXX escape is restricted to the BMP. ES2015
introduced \u{...} with variable-length hex to handle the full Unicode range:
// \uXXXX: BMP only (exactly 4 hex digits)
const arrow = "\u2192"; // → U+2192
const euro = "\u20AC"; // € U+20AC
const alpha = "\u03B1"; // α U+03B1
// \u{X...}: ES2015+, any code point (one or more hex digits, value up to 10FFFF)
const snake = "\u{1F40D}"; // 🐍 U+1F40D
const clef = "\u{1D11E}"; // 𝄞 U+1D11E
const space = "\u{20}"; // space (short values need no padding)
const smile = "\u{1F600}"; // 😀 U+1F600
// Surrogate pair (legacy, avoid in new code)
const snakeLegacy = "\uD83D\uDC0D"; // 🐍 via surrogate pair
// Template literals: the same escapes apply
const msg = `Arrow: \u{2192} Snake: \u{1F40D}`;
In regular expressions, the u flag enables \u{...} escapes and correct
supplementary character matching:
/\u{1F40D}/u.test("🐍") // true
/\u{2192}/u.test("→") // true
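The legacy surrogate pair shown in the JavaScript examples follows from the UTF-16 encoding rule. Here is a minimal sketch of that arithmetic in Python; the function names `to_surrogate_pair` and `js_legacy_escape` are hypothetical helpers, not part of any library.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point into UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000 to U+10FFFF use surrogate pairs")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low


def js_legacy_escape(code_point):
    """Format the pair as two 4-digit \\uXXXX escapes."""
    high, low = to_surrogate_pair(code_point)
    return f"\\u{high:04X}\\u{low:04X}"


print(js_legacy_escape(0x1F40D))
```

For U+1F40D this prints \uD83D\uDC0D, matching the snakeLegacy literal above.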
Java
Java uses \uXXXX (exactly 4 hex digits) at the language level. The compiler
translates these escapes before it tokenizes the source, so they work in identifiers and comments too:
// String literals
String arrow = "\u2192"; // → U+2192
String euro = "\u20AC"; // € U+20AC
// Supplementary characters need a surrogate pair in a literal,
// or Character.toChars() (Java 5+)
String snakePair = "\uD83D\uDC0D"; // 🐍 via surrogate pair
String snake = new String(Character.toChars(0x1F40D)); // 🐍
// Java 15+ text blocks (three double-quotes) translate Unicode escapes the same way
// String block = (three-double-quotes)
// Arrow: \u2192
// (three-double-quotes);
// Quirk: Unicode escapes are translated everywhere, even in comments.
// Writing the escape for U+000A (backslash, u, 000A) inside a line comment
// ends the comment at that point, and the rest of the line is compiled as code.
For code points above U+FFFF, use Character.toChars(codePoint) or
String.valueOf(Character.toChars(codePoint)).
C# / .NET
C# shares the \uXXXX syntax with Java for BMP characters, and adds
\UXXXXXXXX (uppercase U, 8 digits) for supplementary characters:
// \uXXXX: BMP (4 hex digits)
string arrow = "\u2192"; // → U+2192
string euro = "\u20AC"; // € U+20AC
// \UXXXXXXXX: supplementary (8 hex digits)
string snake = "\U0001F40D"; // 🐍 U+1F40D
// Verbatim strings (@"..."): escapes are NOT processed
string path = @"C:\users\name"; // literal backslashes
// so @"\u2192" is six literal characters, not an arrow
// char.ConvertFromUtf32: alternative for supplementary characters
string snake2 = char.ConvertFromUtf32(0x1F40D); // 🐍
Go
Go source files are UTF-8, and string literals support \uXXXX (4 digits)
and \UXXXXXXXX (8 digits):
// \uXXXX: BMP (4 hex digits)
arrow := "\u2192" // → U+2192
euro := "\u20AC" // € U+20AC
// \UXXXXXXXX: any code point (8 hex digits)
snake := "\U0001F40D" // 🐍 U+1F40D
// Go's rune type is an alias for int32 and represents a Unicode code point
r := '\u2192' // rune value 8594
fmt.Println(string(r)) // "→"
// Raw string literals use backticks: no escape processing
raw := `\u2192` // literal string: \u2192 (six characters)
Rust
Rust uses \u{...} (with braces, like ES2015 JavaScript) for all Unicode
code points. The hex value can be 1 to 6 digits:
// \u{HHHHHH}: 1 to 6 hex digits, any valid code point
let arrow = "\u{2192}"; // → U+2192
let snake = "\u{1F40D}"; // 🐍 U+1F40D
let clef = "\u{1D11E}"; // 𝄞 U+1D11E
// char literal uses the same syntax
let ch: char = '\u{2192}'; // single char, always a Unicode scalar value
// Rust's char is a Unicode scalar value (U+0000–U+D7FF, U+E000–U+10FFFF)
// Surrogate code points (U+D800–U+DFFF) are not valid Rust chars
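The scalar-value rule described in the comments above can be written as a simple range check. This is a sketch in Python for illustration; `is_unicode_scalar_value` is a hypothetical helper name. Python itself is more permissive than Rust here: `chr()` accepts a lone surrogate, and the error only appears when encoding to UTF-8.

```python
def is_unicode_scalar_value(code_point):
    """Return True for U+0000 to U+D7FF and U+E000 to U+10FFFF."""
    in_range = 0 <= code_point <= 0x10FFFF
    is_surrogate = 0xD800 <= code_point <= 0xDFFF
    return in_range and not is_surrogate


# Python's chr() accepts a lone surrogate, but UTF-8 encoding rejects it.
try:
    chr(0xD800).encode("utf-8")
    surrogate_encodes = True
except UnicodeEncodeError:
    surrogate_encodes = False

print(is_unicode_scalar_value(0x2192), is_unicode_scalar_value(0xD800), surrogate_encodes)
```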
Swift
Swift uses \u{HHHHHH} (braces, 1–8 hex digits), identical in structure to
Rust:
let arrow = "\u{2192}" // → U+2192
let snake = "\u{1F40D}" // 🐍 U+1F40D
let ch: Character = "\u{2192}"
HTML
HTML provides both decimal and hexadecimal numeric character references:
<!-- Decimal: &#NNNN; -->
&#8594; <!-- → U+2192 decimal 8594 -->
&#128013; <!-- 🐍 U+1F40D decimal 128013 -->
<!-- Hex: &#xHHHH; (case-insensitive x) -->
&#x2192; <!-- → U+2192 -->
&#x1F40D; <!-- 🐍 U+1F40D -->
&#X2192; <!-- also valid (uppercase X) -->
<!-- Named entities (only for defined names) -->
&rarr; <!-- → -->
&euro; <!-- € -->
&copy; <!-- © -->
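To generate or verify numeric character references programmatically, one option is Python's standard `html` module. The sketch below builds both reference forms for a character and decodes a few references back; `numeric_references` is a hypothetical helper name.

```python
import html


def numeric_references(char):
    """Return the decimal and hexadecimal references for one character."""
    code_point = ord(char)
    return f"&#{code_point};", f"&#x{code_point:X};"


decimal_ref, hex_ref = numeric_references("\u2192")
snake_refs = numeric_references("\U0001F40D")

# html.unescape() accepts decimal, hex (either case of x), and named forms.
decoded = html.unescape("&#8594; &#x2192; &#X2192; &rarr;")

print(decimal_ref, hex_ref, snake_refs, decoded)
```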
CSS
CSS uses a backslash followed by 1–6 hex digits, optionally terminated by a space (the terminating space is consumed):
/* Backslash + 1-6 hex digits */
.icon::before { content: "\2192"; } /* → */
.icon::before { content: "\1F40D"; } /* 🐍 */
/* A single space terminates the escape (that space is consumed) */
.icon::before { content: "\2192 text"; } /* "→text" */
.icon::before { content: "\2192  text"; } /* "→ text" (two spaces: the first is consumed, the second is kept) */
/* In selectors — escape non-identifier characters */
[data-value="\2192"] { ... }
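The space-termination rule is easy to get wrong, so here is a minimal sketch of a decoder for CSS hex escapes, written in Python. It handles only the hex form described above (not escapes such as a backslash before a quote), and the names `decode_css_hex_escapes` and `css_escape` are hypothetical.

```python
import re

# Backslash, 1 to 6 hex digits, then at most one whitespace character.
_CSS_HEX_ESCAPE = re.compile(r"\\([0-9A-Fa-f]{1,6})[ \t\n\r\f]?")


def _replace(match):
    code_point = int(match.group(1), 16)
    # Zero, surrogates, and out-of-range values become U+FFFD.
    if code_point == 0 or code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        return "\uFFFD"
    return chr(code_point)


def decode_css_hex_escapes(text):
    """Replace CSS hex escapes with the characters they denote."""
    return _CSS_HEX_ESCAPE.sub(_replace, text)


def css_escape(char):
    """Emit a hex escape with a trailing space so following text is safe."""
    return f"\\{ord(char):X} "


print(decode_css_hex_escapes(r"\2192 text"))
print(decode_css_hex_escapes(r"\2192  text"))
```

Emitting the trailing space unconditionally in `css_escape` is a defensive choice: without it, a following character such as "a" or "1" would be read as part of the hex value.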
URL / Percent Encoding
URLs encode non-ASCII characters as their UTF-8 byte sequence, with each byte
written as %XX (two hex digits; decoders accept either case, and uppercase is the conventional form):
→ U+2192 UTF-8: E2 86 92 URL: %E2%86%92
€ U+20AC UTF-8: E2 82 AC URL: %E2%82%AC
🐍 U+1F40D UTF-8: F0 9F 90 8D URL: %F0%9F%90%8D
In JavaScript:
encodeURIComponent("→") // "%E2%86%92"
decodeURIComponent("%E2%86%92") // "→"
encodeURI("https://example.com/café")
// "https://example.com/caf%C3%A9"
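For comparison, the same encoding is available in Python through the standard `urllib.parse` module. This sketch mirrors the examples above; the `safe=":/"` argument is an assumption chosen to leave the scheme and path separators untouched, roughly like encodeURI.

```python
from urllib.parse import quote, unquote

# quote() encodes to UTF-8 and writes each byte as %XX in uppercase hex.
encoded_arrow = quote("\u2192")
encoded_snake = quote("\U0001F40D")

# unquote() accepts hex digits in either case.
decoded_upper = unquote("%E2%86%92")
decoded_lower = unquote("%e2%86%92")

# Keep ':' and '/' unescaped when encoding a whole URL.
encoded_url = quote("https://example.com/caf\u00e9", safe=":/")

print(encoded_arrow, encoded_snake, encoded_url)
```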
Cross-Language Quick Reference
| Code Point | U+2192 → | U+20AC € | U+1F40D 🐍 |
|---|---|---|---|
| Python | \u2192 | \u20AC | \U0001F40D |
| JavaScript | \u2192 | \u20AC | \u{1F40D} |
| Java | \u2192 | \u20AC | \uD83D\uDC0D (surrogate pair) |
| C# | \u2192 | \u20AC | \U0001F40D |
| Go | \u2192 | \u20AC | \U0001F40D |
| Rust | \u{2192} | \u{20AC} | \u{1F40D} |
| HTML | &#x2192; | &#x20AC; | &#x1F40D; |
| CSS | \2192 | \20AC | \1F40D |
| URL | %E2%86%92 | %E2%82%AC | %F0%9F%90%8D |
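Several rows of the table can be reproduced mechanically from a code point. The following Python sketch does so for a subset of the contexts; `escape_forms` is a hypothetical helper, and it assumes the input is a single character that is not a lone surrogate.

```python
from urllib.parse import quote


def escape_forms(char):
    """Return escape spellings of one character for several contexts."""
    code_point = ord(char)
    is_bmp = code_point <= 0xFFFF
    four_digit = f"\\u{code_point:04X}"
    braced = f"\\u{{{code_point:X}}}"
    return {
        "Python": four_digit if is_bmp else f"\\U{code_point:08X}",
        "JavaScript": four_digit if is_bmp else braced,
        "Rust": braced,
        "HTML": f"&#x{code_point:X};",
        "CSS": f"\\{code_point:X}",
        "URL": quote(char, safe=""),
    }


for context, escape in escape_forms("\U0001F40D").items():
    print(f"{context}: {escape}")
```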
Key Takeaways
- 4-digit \uXXXX is the BMP-only form, supported in Python, JavaScript, Java, C#, and Go.
- 8-digit \UXXXXXXXX handles supplementary characters in Python, C#, and Go.
- Braced \u{...} with variable length is used in JavaScript (ES2015+), Rust, and Swift, and is the most ergonomic form.
- CSS uses a bare backslash without braces: \2192 (1–6 hex digits, optionally space-terminated).
- HTML uses &#xHHHH; or &#DDDD;. Note the &, #, x, and terminating ;.
- URL encoding is not a Unicode escape per se: it is the UTF-8 bytes of the character, percent-encoded.
- Surrogate pair escapes (\uD83D\uDC0D) are a legacy JavaScript workaround; always prefer \u{1F40D} in modern code.