Unicode Escape Sequences: Cross-Language Reference

Every major programming language has its own syntax for embedding Unicode characters as escape sequences in string literals, from \u0041 in Java to \N{LATIN SMALL LETTER A} in Python. This guide is a cross-language reference for Unicode escape sequence syntax, covering Python, JavaScript, Java, C#, Go, Rust, Swift, and more.

Every programming language needs a way to include arbitrary Unicode characters in source code even when the editor, terminal, or transport layer is ASCII-only. Unicode escape sequences are the solution: a portable, ASCII-safe syntax that represents any code point. This reference covers the escape syntax for the most common languages, plus HTML, CSS, and URL encoding.

The Concept

A Unicode escape sequence is a text representation of a Unicode code point using only printable ASCII characters. The exact syntax varies by language, but the code point value is almost always written as a hexadecimal number (HTML also accepts decimal). For example, RIGHTWARDS ARROW (→, U+2192) can be written as:

Context      Escape
Python       \u2192
JavaScript   \u2192 or \u{2192}
Java         \u2192
C#           \u2192
Go           \u2192
Rust         \u{2192}
HTML         &#8594; or &#x2192;
CSS          \2192
URL          %E2%86%92 (UTF-8 percent encoding)
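
To make the mapping concrete, here is a minimal Python sketch that derives several of the spellings above from a single character. The helper name escape_forms and its dictionary keys are illustrative assumptions, not part of any library.

```python
def escape_forms(ch: str) -> dict[str, str]:
    """Return common escape spellings for one character (illustrative helper)."""
    cp = ord(ch)  # the code point as an integer
    return {
        # Python/C#/Go style: 4 digits for the BMP, 8 digits above it
        "python": f"\\u{cp:04X}" if cp <= 0xFFFF else f"\\U{cp:08X}",
        # Braced form used by JavaScript (ES2015+), Rust, and Swift
        "braced": f"\\u{{{cp:X}}}",
        "html": f"&#x{cp:X};",
        "css": f"\\{cp:X}",
        # Percent-encode each UTF-8 byte
        "url": "".join(f"%{b:02X}" for b in ch.encode("utf-8")),
    }

forms = escape_forms("\u2192")
for context, spelling in forms.items():
    print(f"{context:8} {spelling}")
```

The sketch only formats text; it does not validate that the input is a single character or a non-surrogate code point.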

Python

Python string literals support several Unicode escape forms:

# \uXXXX: BMP code points (exactly 4 hex digits, U+0000 to U+FFFF)
arrow = "\u2192"     # → U+2192 RIGHTWARDS ARROW
euro  = "\u20AC"     # € U+20AC EURO SIGN
alpha = "\u03B1"     # α U+03B1 GREEK SMALL LETTER ALPHA

# \UXXXXXXXX: any code point (exactly 8 hex digits, needed for supplementary planes)
snake = "\U0001F40D"  # 🐍 U+1F40D SNAKE
clef  = "\U0001D11E"  # 𝄞 U+1D11E MUSICAL SYMBOL G CLEF
flag  = "\U0001F1FA\U0001F1F8"  # 🇺🇸 (two code points)

# \xNN: exactly 2 hex digits; in a str this is U+0000 to U+00FF,
# in a bytes literal it is a raw byte value
# Only recommended for bytes literals
euro_bytes = b"\xE2\x82\xAC"  # UTF-8 bytes for €

# \N{NAME}: named escape, resolved from the Unicode character name
arrow2 = "\N{RIGHTWARDS ARROW}"  # "→"

# unicodedata.lookup() performs the same name lookup at run time
from unicodedata import lookup
arrow3 = lookup("RIGHTWARDS ARROW")  # "→"

In raw strings (r"..." / r'...'), escape sequences are not processed:

path    = r"C:\users\name"   # \u and \n are literal characters
pattern = r"\u2192"          # six literal characters, not the arrow (the re module decodes \u2192 itself)
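
As a quick check of the rules above, the following sketch compares the escape forms with each other and with a raw string. The variable names are illustrative; the unicode_escape codec is a standard Python codec and is used here only to show how the literal characters can be decoded at run time.

```python
arrow = "\u2192"

# Each comparison is True: all of these spell the same one-character string.
same = [
    arrow == "\U00002192",
    arrow == "\N{RIGHTWARDS ARROW}",
    arrow == chr(0x2192),
]

raw = r"\u2192"                    # raw string: the escape is left untouched
lengths = (len(arrow), len(raw))   # (1, 6)

# Turn the six literal characters back into the arrow at run time.
decoded = raw.encode("ascii").decode("unicode_escape")
print(same, lengths, decoded == arrow)
```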

Python Bytes Literals

Bytes literals only support \xNN and the basic ASCII escapes. They do not support \uXXXX:

raw_utf8 = b"\xE2\x86\x92"   # 3 bytes of UTF-8 for →
raw_utf8.decode("utf-8")      # "→"
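
The difference between \xNN in a str and in a bytes literal is easy to miss, so here is a small sketch of it. The variable names are illustrative, and the final line shows one possible way to print bytes in \xNN notation.

```python
text = "\xE9"   # str: one character, code point U+00E9 (é)
data = b"\xE9"  # bytes: one raw byte with value 0xE9

# The same character occupies two bytes once encoded as UTF-8.
utf8 = text.encode("utf-8")           # b"\xc3\xa9"

# A lone 0xE9 byte is not valid UTF-8, but it is é in Latin-1.
latin1_text = data.decode("latin-1")  # "é"

# Spell the UTF-8 bytes of the arrow in \xNN notation.
escaped = "".join(f"\\x{b:02X}" for b in "\u2192".encode("utf-8"))
print(utf8, latin1_text, escaped)
```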

JavaScript

JavaScript's original \uXXXX escape is restricted to the BMP. ES2015 introduced \u{...} with variable-length hex to handle the full Unicode range:

// \uXXXX: BMP only (exactly 4 hex digits)
const arrow = "\u2192";   // → U+2192
const euro  = "\u20AC";   // € U+20AC
const alpha = "\u03B1";   // α U+03B1

// \u{XXXXX}: ES2015+, any code point (1 to 6 hex digits)
const snake  = "\u{1F40D}";          // 🐍 U+1F40D
const clef   = "\u{1D11E}";          // 𝄞 U+1D11E
const space  = "\u{20}";             // space (any length)
const smile  = "\u{1F600}";          // 😀 U+1F600

// Surrogate pair (legacy, avoid in new code)
const snakeLegacy = "\uD83D\uDC0D";  // 🐍 via surrogate pair

// Template literals: same escapes apply
const msg = `Arrow: \u{2192} Snake: \u{1F40D}`;

In regular expressions, the u flag enables \u{...} escapes and correct supplementary character matching:

/\u{1F40D}/u.test("🐍")   // true
/\u{2192}/u.test("→")     // true
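
The legacy surrogate pair form follows from the UTF-16 encoding rule for code points above U+FFFF. A minimal Python sketch of that arithmetic is shown below; the function name to_surrogate_pair is an illustrative assumption.

```python
def to_surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a supplementary code point into UTF-16 high and low surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF use surrogate pairs")
    offset = cp - 0x10000              # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low

high, low = to_surrogate_pair(0x1F40D)
legacy = f"\\u{high:04X}\\u{low:04X}"
print(legacy)
```

For U+1F40D this yields the pair D83D and DC0D, matching the snakeLegacy literal in the JavaScript example.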

Java

Java translates \uXXXX escapes (exactly 4 hex digits) at the very start of lexical processing, before the source is tokenized, so they work in identifiers and comments as well as in string literals:

// String literals
String arrow = "\u2192";         // → U+2192
String euro  = "\u20AC";         // € U+20AC

// Supplementary characters need a surrogate pair of two 4-digit escapes
// or Character.toChars() (available since Java 5)
String snakePair = "\uD83D\uDC0D";                      // 🐍 via surrogate pair
String snake = new String(Character.toChars(0x1F40D));  // 🐍

// Java 15+ text blocks (three double-quotes) support literal Unicode
// String block = (three-double-quotes)
//     Arrow: \u2192
// (three-double-quotes);

// Java processes Unicode escapes universally, even in comments!
// The following is a real Java quirk:
// the escape for U+000A (backslash, u, 000A) is a line terminator and WILL
// end a comment like this one if it is written out in source code

For code points above U+FFFF, use Character.toChars(codePoint) or String.valueOf(Character.toChars(codePoint)).
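
Because Java strings are sequences of UTF-16 code units, a Java-style escaped literal can be produced by encoding the text as UTF-16 and spelling each non-ASCII unit as \uXXXX. The Python sketch below illustrates this; java_escape is a hypothetical helper, and it passes ASCII through unchanged, so quotes and backslashes would still need ordinary Java escaping.

```python
def java_escape(text: str) -> str:
    """Spell non-ASCII characters as Java-style UTF-16 code unit escapes."""
    utf16 = text.encode("utf-16-be")  # two bytes per UTF-16 code unit
    out = []
    for i in range(0, len(utf16), 2):
        unit = int.from_bytes(utf16[i:i + 2], "big")
        # ASCII is kept as-is; everything else becomes a 4-digit escape
        out.append(chr(unit) if unit < 0x80 else f"\\u{unit:04X}")
    return "".join(out)

bmp = java_escape("\u2192")
supplementary = java_escape("Snake: \U0001F40D")
print(bmp, supplementary)
```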

C# / .NET

C# shares the \uXXXX syntax with Java for BMP characters, and adds \UXXXXXXXX (uppercase U, 8 digits) for supplementary characters:

// \uXXXX: BMP (4 hex digits)
string arrow = "\u2192";    // → U+2192
string euro  = "\u20AC";    // € U+20AC

// \UXXXXXXXX: supplementary (8 hex digits)
string snake = "\U0001F40D";  // 🐍 U+1F40D

// Verbatim strings (@"..."): escapes are NOT processed
string path = @"C:\users\name";   // literal backslashes
// so \u2192 inside a verbatim string stays as six literal characters

// char.ConvertFromUtf32: alternative for supplementary
string snake2 = char.ConvertFromUtf32(0x1F40D);  // 🐍

Go

Go source files are UTF-8, and string literals support \uXXXX (4 digits) and \UXXXXXXXX (8 digits):

// \uXXXX: BMP (4 hex digits)
arrow := "\u2192"     // → U+2192
euro  := "\u20AC"     // € U+20AC

// \UXXXXXXXX: any code point (8 hex digits)
snake := "\U0001F40D"  // 🐍 U+1F40D

// Go rune type is an int32 alias representing a Unicode code point
r := '\u2192'          // rune value 8594
fmt.Println(string(r)) // "→"

// Raw string literals use backticks: no escape processing
raw := `\u2192`        // literal string: \u2192 (six characters)

Rust

Rust uses \u{...} (with braces, like ES2015 JavaScript) for all Unicode code points. The hex value can be 1 to 6 digits:

// \u{HHHHHH}: 1 to 6 hex digits, any valid code point
let arrow = "\u{2192}";     // → U+2192
let snake = "\u{1F40D}";    // 🐍 U+1F40D
let clef  = "\u{1D11E}";    // 𝄞 U+1D11E

// char literal uses the same syntax
let ch: char = '\u{2192}';  // single char, always a Unicode scalar value

// Rust's char is a Unicode scalar value (U+0000–U+D7FF, U+E000–U+10FFFF)
// Surrogate code points (U+D800–U+DFFF) are not valid Rust chars
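
The scalar value rule in the comments above can be written as a simple range check. The following Python sketch is an illustration only; the function name is_scalar_value is an assumption, not a Rust or Python API.

```python
def is_scalar_value(cp: int) -> bool:
    """True if cp is a Unicode scalar value (any code point except surrogates)."""
    return 0 <= cp <= 0xD7FF or 0xE000 <= cp <= 0x10FFFF

checks = {
    "U+2192": is_scalar_value(0x2192),     # ordinary BMP character
    "U+1F40D": is_scalar_value(0x1F40D),   # supplementary character
    "U+D83D": is_scalar_value(0xD83D),     # high surrogate, not a scalar value
    "U+110000": is_scalar_value(0x110000), # beyond the Unicode range
}
print(checks)
```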

Swift

Swift uses \u{HHHHHH} (braces, 1–8 hex digits), identical in structure to Rust:

let arrow = "\u{2192}"    // → U+2192
let snake = "\u{1F40D}"   // 🐍 U+1F40D
let ch: Character = "\u{2192}"

HTML

HTML provides both decimal and hexadecimal numeric character references:

<!-- Decimal: &#NNNN; -->
&#8594;     <!-- → U+2192 decimal 8594 -->
&#128013;   <!-- 🐍 U+1F40D decimal 128013 -->

<!-- Hex: &#xHHHH; (case-insensitive x) -->
&#x2192;    <!-- → U+2192 -->
&#x1F40D;   <!-- 🐍 U+1F40D -->
&#X2192;    <!-- also valid (uppercase X) -->

<!-- Named entities (only for defined names) -->
&rarr;      <!-- → -->
&euro;      <!-- € -->
&copy;      <!-- © -->
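
The three reference styles above can be generated and decoded with Python's standard html module. In this sketch, numeric_refs is an illustrative helper name; html.unescape is the standard library function that resolves both numeric and named references.

```python
import html

def numeric_refs(ch: str) -> tuple[str, str]:
    """Return the decimal and hexadecimal character references for ch."""
    cp = ord(ch)
    return f"&#{cp};", f"&#x{cp:X};"

decimal_ref, hex_ref = numeric_refs("\u2192")

# Decimal, hex, uppercase-X hex, and named forms all decode to the same arrow.
decoded = [html.unescape(s) for s in (decimal_ref, hex_ref, "&#X2192;", "&rarr;")]
print(decimal_ref, hex_ref, decoded)
```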

CSS

CSS uses a backslash followed by 1–6 hex digits, optionally terminated by a space (the terminating space is consumed):

/* Backslash + 1-6 hex digits */
.icon::before { content: "\2192"; }    /* → */
.icon::before { content: "\1F40D"; }   /* 🐍 */

/* Space terminates the escape (the space is consumed) */
.icon::before { content: "\2192 text"; }  /* "→text" */
.icon::before { content: "\2192  text"; } /* "→ text" (two spaces → one preserved) */

/* In selectors — escape non-identifier characters */
[data-value="\2192"] { ... }
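
The space-termination rule matters when generating CSS programmatically: a hex digit or whitespace right after the escape would otherwise be absorbed into it. Here is a minimal Python sketch of that rule; css_escape is a hypothetical helper, not part of any CSS library.

```python
import string

def css_escape(ch: str, next_char: str = "") -> str:
    """Escape ch for CSS, adding a terminating space only when it is needed."""
    escaped = f"\\{ord(ch):X}"
    # A following hex digit would extend the escape, and a following
    # whitespace character would be consumed as its terminator.
    if next_char and (next_char in string.hexdigits or next_char.isspace()):
        escaped += " "
    return escaped + next_char

before_letter = css_escape("\u2192", "t")  # "t" is not a hex digit
before_hex = css_escape("\u2192", "a")     # "a" is a hex digit
before_space = css_escape("\u2192", " ")   # keep the real space
print(before_letter, before_hex, repr(before_space))
```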

URL / Percent Encoding

URLs encode non-ASCII characters as their UTF-8 byte sequence, with each byte written as %XX (two hex digits; uppercase is the conventional form, but decoders accept either case):

→  U+2192   UTF-8: E2 86 92   URL: %E2%86%92
€  U+20AC   UTF-8: E2 82 AC   URL: %E2%82%AC
🐍 U+1F40D  UTF-8: F0 9F 90 8D URL: %F0%9F%90%8D

In JavaScript:

encodeURIComponent("→")    // "%E2%86%92"
decodeURIComponent("%E2%86%92")  // "→"

encodeURI("https://example.com/café")
// "https://example.com/caf%C3%A9"
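
For comparison, the same round trip can be sketched in Python with the standard urllib.parse module. The variable names are illustrative.

```python
from urllib.parse import quote, unquote

encoded_arrow = quote("\u2192")        # UTF-8 bytes, percent-encoded
encoded_snake = quote("\U0001F40D")

# "/" is kept as-is by default, so a path can be quoted in one call.
encoded_path = quote("/caf\u00E9")

# Decoding accepts lowercase hex digits as well.
decoded_arrow = unquote("%e2%86%92")
print(encoded_arrow, encoded_snake, encoded_path, decoded_arrow)
```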

Cross-Language Quick Reference

Code Point   U+2192 →     U+20AC €     U+1F40D 🐍
Python       \u2192       \u20AC       \U0001F40D
JavaScript   \u2192       \u20AC       \u{1F40D}
Java         \u2192       \u20AC       \uD83D\uDC0D (surrogate pair)
C#           \u2192       \u20AC       \U0001F40D
Go           \u2192       \u20AC       \U0001F40D
Rust         \u{2192}     \u{20AC}     \u{1F40D}
HTML         &#x2192;     &#x20AC;     &#x1F40D;
CSS          \2192        \20AC        \1F40D
URL          %E2%86%92    %E2%82%AC    %F0%9F%90%8D

Key Takeaways

  • 4-digit \uXXXX is the BMP-only form, supported in Python, JavaScript, Java, C#, and Go.
  • 8-digit \UXXXXXXXX handles supplementary characters in Python, C#, and Go.
  • Braced \u{...} with variable length is used in JavaScript (ES2015+), Rust, and Swift; it is the most ergonomic form.
  • CSS uses a bare backslash without braces: \2192 (1–6 hex digits, optionally terminated by a space).
  • HTML uses &#xHHHH; or &#DDDD; (note the &, #, x, and terminating ;).
  • URL encoding is not a Unicode escape per se: it is the UTF-8 bytes of the character, percent-encoded.
  • Surrogate pair escapes (\uD83D\uDC0D) are a legacy workaround in JavaScript, where \u{1F40D} is preferred in modern code; Java string literals still need them for supplementary characters.
