Unicode Escape Sequences: Cross-Language Reference

Every major programming language has its own syntax for embedding Unicode characters as escape sequences in string literals, from \u0041 in Java to \N{LATIN SMALL LETTER A} in Python. This guide is a cross-language reference for Unicode escape sequence syntax, covering Python, JavaScript, Java, C#, Go, Rust, Swift, and more.

Published 2022-07-28 · Updated 2025-01-27

Every programming language needs a way to include arbitrary Unicode characters in source code even when the editor, terminal, or transport layer is ASCII-only. Unicode escape sequences are the solution: a portable, ASCII-safe syntax that represents any code point. This reference covers the escape syntax for the most common languages, plus HTML, CSS, and URL encoding.

The Concept

A Unicode escape sequence is a text representation of a Unicode code point using only printable ASCII characters. The exact syntax varies by language, but in most of them the code point value is written as a hexadecimal number (HTML also accepts decimal, and URLs encode the character's UTF-8 bytes rather than the code point itself). For example, the RIGHTWARDS ARROW (→, U+2192) can be written as:

Context	Escape
Python	`\u2192`
JavaScript	`\u2192` or `\u{2192}`
Java	`\u2192`
C#	`\u2192`
Go	`\u2192`
Rust	`\u{2192}`
HTML	`&#x2192;` or `&rarr;`
CSS	`\2192`
URL	`%E2%86%92` (UTF-8 percent encoding)
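
As a quick check of the first row of the table, the following Python sketch confirms that the escape and the literal character are the same one-character string. The variable names are illustrative only.

```python
# The escape and the literal character denote the same string.
escaped = "\u2192"
literal = "→"

same_string = escaped == literal      # True: identical one-character strings
length = len(escaped)                 # 1 code point, not 6 characters
code_point = ord(escaped)             # 0x2192, which is 8594 in decimal
rebuilt = chr(0x2192)                 # build the character from its code point
```

The escape only exists in source code. Once the literal is compiled, nothing distinguishes it from a character typed directly.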

Python

Python string literals support three numeric escape forms (\xNN, \uXXXX, \UXXXXXXXX) and the named form \N{NAME}:

# \uXXXX: BMP code points (exactly 4 hex digits, U+0000 to U+FFFF)
arrow = "\u2192"     # → U+2192 RIGHTWARDS ARROW
euro  = "\u20AC"     # € U+20AC EURO SIGN
alpha = "\u03B1"     # α U+03B1 GREEK SMALL LETTER ALPHA

# \UXXXXXXXX: any code point (exactly 8 hex digits, needed for supplementary planes)
snake = "\U0001F40D"  # 🐍 U+1F40D SNAKE
clef  = "\U0001D11E"  # 𝄞 U+1D11E MUSICAL SYMBOL G CLEF
flag  = "\U0001F1FA\U0001F1F8"  # 🇺🇸 (two code points)

# \xNN: exactly 2 hex digits. In a str literal this is a code point
# from U+0000 to U+00FF; in a bytes literal it is a single byte value.
euro_bytes = b"\xE2\x82\xAC"  # UTF-8 bytes for €

# \N{NAME}: named escape, resolved when the literal is compiled
arrow2 = "\N{RIGHTWARDS ARROW}"  # "→"

# unicodedata.lookup performs the same name lookup at run time
from unicodedata import lookup
arrow3 = lookup("RIGHTWARDS ARROW")  # "→"

In raw strings (r"..." / r'...'), escape sequences are not processed:

path    = r"C:\users\name"   # \u and \n stay as literal characters
pattern = r"\u2192"          # six literal characters, not the arrow (a regex engine may still interpret them)
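
The difference between a processed and a raw literal can be checked directly. This is a small sketch with illustrative variable names. It also relies on the standard re module interpreting \uXXXX inside patterns, which is why a raw pattern can still match the arrow.

```python
import re

cooked = "\u2192"    # escape processed: one character
raw = r"\u2192"      # raw string: backslash, 'u', and four digits

cooked_len = len(cooked)
raw_len = len(raw)

# The re module interprets \uXXXX itself, so the raw pattern matches the arrow.
matches_arrow = re.fullmatch(raw, "→") is not None
```

Here cooked_len is 1 and raw_len is 6. The regex match succeeds because the pattern compiler decodes the escape, not the string literal.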

Python Bytes Literals

Bytes literals only support \xNN, octal escapes, and the basic ASCII escapes. They do not support \uXXXX, \UXXXXXXXX, or \N{...}:

raw_utf8 = b"\xE2\x86\x92"   # 3 bytes of UTF-8 for →
raw_utf8.decode("utf-8")      # "→"
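
A round trip between str and bytes shows how the byte escapes relate to the code point. This is a minimal sketch and the variable names are illustrative.

```python
arrow = "→"                          # U+2192
arrow_utf8 = arrow.encode("utf-8")   # three bytes: E2 86 92
snake_utf8 = "🐍".encode("utf-8")    # four bytes for a supplementary character

# Render the bytes the way they would appear as \xNN escapes.
as_escapes = "".join(f"\\x{b:02X}" for b in arrow_utf8)

round_trip = arrow_utf8.decode("utf-8")
```

The number of \xNN escapes depends on the encoding. The same character would need different bytes in UTF-16 or UTF-32.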

JavaScript

JavaScript's original \uXXXX escape is restricted to the BMP. ES2015 introduced \u{...} with variable-length hex to handle the full Unicode range:

// \uXXXX: BMP only (exactly 4 hex digits)
const arrow = "\u2192";   // → U+2192
const euro  = "\u20AC";   // € U+20AC
const alpha = "\u03B1";   // α U+03B1

// \u{XXXXX}: ES2015+, any code point up to 10FFFF (variable number of hex digits)
const snake  = "\u{1F40D}";          // 🐍 U+1F40D
const clef   = "\u{1D11E}";          // 𝄞 U+1D11E
const space  = "\u{20}";             // space (short values need no padding)
const smile  = "\u{1F600}";          // 😀 U+1F600

// Surrogate pair (legacy, avoid in new code)
const snakeLegacy = "\uD83D\uDC0D";  // 🐍 via surrogate pair

// Template literals: same escapes apply
const msg = `Arrow: \u{2192} Snake: \u{1F40D}`;
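
The surrogate pair in the legacy example follows from standard UTF-16 arithmetic. Here is a short Python sketch of that calculation. The function name to_surrogate_pair is hypothetical, chosen only for this illustration.

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    # Only supplementary code points (above U+FFFF) are split into pairs.
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only supplementary code points have surrogate pairs")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low

high, low = to_surrogate_pair(0x1F40D)
legacy_escape = f"\\u{high:04X}\\u{low:04X}"
```

For U+1F40D this yields D83D and DC0D, the same two code units that appear in the snakeLegacy line.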

In regular expressions, the u flag enables \u{...} escapes and correct supplementary character matching:

/\u{1F40D}/u.test("🐍")   // true
/\u{2192}/u.test("→")     // true

Java

Java uses \uXXXX (exactly 4 hex digits) at the language level. The escapes are translated before the source is tokenized, so they work in identifiers and comments too:

// String literals
String arrow = "\u2192";         // → U+2192
String euro  = "\u20AC";         // € U+20AC

// Supplementary characters have no single escape: write the surrogate pair
// as two 4-digit escapes, or build the string with Character.toChars()
String snakePair = "\uD83D\uDC0D";                      // 🐍
String snake = new String(Character.toChars(0x1F40D));  // 🐍

// Java 15+ text blocks (three double-quotes) support literal Unicode and the same escapes
// String block = (three-double-quotes)
//     Arrow: \u2192
// (three-double-quotes);

Because the translation applies everywhere, a \u000A escape inside a // comment becomes a real line terminator and ends the comment, so the rest of that line is parsed as code. This is a real Java quirk.

For code points above U+FFFF, use a surrogate pair escape, Character.toChars(codePoint), or String.valueOf(Character.toChars(codePoint)).
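
To see which escapes a Java source file would need for a given string, one option is to walk its UTF-16 code units. The Python sketch below does that. The helper name java_escape is hypothetical, and it deliberately leaves quotes and backslashes alone, so it is not a complete string-literal escaper.

```python
def java_escape(text: str) -> str:
    out = []
    for ch in text:
        if ch.isascii() and ch.isprintable():
            out.append(ch)                      # keep printable ASCII as-is
            continue
        units = ch.encode("utf-16-be")          # 2 bytes per UTF-16 code unit
        for i in range(0, len(units), 2):
            unit = int.from_bytes(units[i:i + 2], "big")
            out.append(f"\\u{unit:04X}")        # one 4-digit escape per unit
    return "".join(out)

bmp_form = java_escape("→")
supplementary_form = java_escape("🐍")
```

A BMP character produces one escape, while a supplementary character produces two, which is the surrogate pair form.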

C# / .NET

C# shares the \uXXXX syntax with Java for BMP characters, and adds \UXXXXXXXX (uppercase U, 8 digits) for supplementary characters:

// \uXXXX: BMP (4 hex digits)
string arrow = "\u2192";    // → U+2192
string euro  = "\u20AC";    // € U+20AC

// \UXXXXXXXX: supplementary (8 hex digits)
string snake = "\U0001F40D";  // 🐍 U+1F40D

// Verbatim strings (@"..."): escapes are NOT processed
string path = @"C:\users\name";   // literal backslashes
// \u2192 inside a verbatim string likewise stays as six literal characters

// char.ConvertFromUtf32: alternative for supplementary
string snake2 = char.ConvertFromUtf32(0x1F40D);  // 🐍

Go

Go source files are UTF-8, and string literals support \uXXXX (4 digits) and \UXXXXXXXX (8 digits):

// \uXXXX: BMP (4 hex digits)
arrow := "\u2192"     // → U+2192
euro  := "\u20AC"     // € U+20AC

// \UXXXXXXXX: any code point (8 hex digits)
snake := "\U0001F40D"  // 🐍 U+1F40D

// Go's rune type is an alias for int32 and holds a Unicode code point
r := '\u2192'          // rune value 8594
fmt.Println(string(r)) // "→"

// Raw string literals use backticks: no escape processing
raw := `\u2192`        // literal string: \u2192 (six characters)

Rust

Rust uses \u{...} (with braces, like ES2015 JavaScript) for all Unicode code points. The hex value can be 1 to 6 digits:

// \u{HHHHHH}: 1 to 6 hex digits, any valid code point
let arrow = "\u{2192}";     // → U+2192
let snake = "\u{1F40D}";    // 🐍 U+1F40D
let clef  = "\u{1D11E}";    // 𝄞 U+1D11E

// char literal uses the same syntax
let ch: char = '\u{2192}';  // single char, always a Unicode scalar value

// Rust's char is a Unicode scalar value (U+0000–U+D7FF, U+E000–U+10FFFF)
// Surrogate code points (U+D800–U+DFFF) are not valid Rust chars
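
The scalar value ranges in the comments above translate into a simple range check. Here it is as a Python sketch. The function name is_scalar_value is hypothetical.

```python
def is_scalar_value(code_point: int) -> bool:
    # Scalar values are all code points except the surrogate block.
    return 0 <= code_point <= 0xD7FF or 0xE000 <= code_point <= 0x10FFFF

arrow_ok = is_scalar_value(0x2192)        # ordinary BMP character
surrogate_ok = is_scalar_value(0xD83D)    # high surrogate, not a scalar value
too_large = is_scalar_value(0x110000)     # beyond the Unicode range
```

Only the first check passes. The surrogate and the out-of-range value are both rejected.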

Swift

Swift uses \u{HHHHHH} (braces, 1–8 hex digits), identical in structure to Rust:

let arrow = "\u{2192}"    // → U+2192
let snake = "\u{1F40D}"   // 🐍 U+1F40D
let ch: Character = "\u{2192}"

HTML

HTML provides both decimal and hexadecimal numeric character references:

<!-- Decimal: &#NNNN; -->
&#8594;     <!-- → U+2192 decimal 8594 -->
&#128013;   <!-- 🐍 U+1F40D decimal 128013 -->

<!-- Hex: &#xHHHH; (case-insensitive x) -->
&#x2192;    <!-- → U+2192 -->
&#x1F40D;   <!-- 🐍 U+1F40D -->
&#X2192;    <!-- also valid (uppercase X) -->

<!-- Named entities (only for defined names) -->
&rarr;      <!-- → -->
&euro;      <!-- € -->
&copy;      <!-- © -->
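
The reference forms above can be decoded and produced with Python's standard library, which gives a convenient way to check them. This is a small sketch. The helper to_hex_references is hypothetical, while html.unescape and the xmlcharrefreplace error handler are part of the standard library.

```python
import html

decoded_decimal = html.unescape("&#8594;")     # decimal reference
decoded_hex = html.unescape("&#x2192;")        # hexadecimal reference
decoded_named = html.unescape("&rarr;")        # named entity

def to_hex_references(text: str) -> str:
    # Replace every non-ASCII character with a hexadecimal reference.
    return "".join(ch if ord(ch) < 128 else f"&#x{ord(ch):X};" for ch in text)

hex_form = to_hex_references("a → 🐍")

# The built-in error handler produces decimal references instead.
decimal_form = "→".encode("ascii", "xmlcharrefreplace").decode("ascii")
```

All three decoded values are the same arrow character.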

CSS

CSS uses a backslash followed by 1–6 hex digits, optionally terminated by a space (the terminating space is consumed):

/* Backslash + 1-6 hex digits */
.icon::before { content: "\2192"; }    /* → */
.icon::before { content: "\1F40D"; }   /* 🐍 */

/* Space terminates the escape (the space is consumed) */
.icon::before { content: "\2192 text"; }  /* "→text" */
.icon::before { content: "\2192  text"; } /* "→ text" (two spaces → one preserved) */

/* In selectors — escape non-identifier characters */
[data-value="\2192"] { ... }
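
The space-termination rule is easy to get wrong, so here is a Python sketch of an encoder that always emits the terminating space, plus a simplified decoder. Both function names are hypothetical. The decoder handles only hex escapes, not other backslash forms.

```python
import re

def css_escape(ch: str) -> str:
    # Always append the terminating space so a following hex digit
    # or space cannot be misread as part of the escape.
    return f"\\{ord(ch):X} "

# Simplified: a backslash, 1-6 hex digits, then one optional whitespace character.
_HEX_ESCAPE = re.compile(r"\\([0-9A-Fa-f]{1,6})[ \t\n\r\f]?")

def css_unescape(text: str) -> str:
    return _HEX_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)

joined = css_escape("→") + "text"     # escape followed directly by text
spaced = css_escape("→") + " text"    # extra space survives decoding
```

Decoding joined gives "→text" and decoding spaced gives "→ text", matching the two content examples above.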

URL / Percent Encoding

URLs encode non-ASCII characters as their UTF-8 byte sequence, with each byte written as %XX (two hex digits, conventionally uppercase; decoders accept either case):

→  U+2192   UTF-8: E2 86 92   URL: %E2%86%92
€  U+20AC   UTF-8: E2 82 AC   URL: %E2%82%AC
🐍 U+1F40D  UTF-8: F0 9F 90 8D URL: %F0%9F%90%8D

In JavaScript:

encodeURIComponent("→")    // "%E2%86%92"
decodeURIComponent("%E2%86%92")  // "→"

encodeURI("https://example.com/café")
// "https://example.com/caf%C3%A9"
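
The same encoding is available in Python through urllib.parse, and it can also be reproduced by hand from the UTF-8 bytes. This is a brief sketch with illustrative variable names.

```python
from urllib.parse import quote, unquote

encoded_arrow = quote("→")
encoded_snake = quote("🐍")
decoded_arrow = unquote("%E2%86%92")
decoded_lowercase = unquote("%e2%86%92")   # hex digits are case-insensitive

# Manual version: percent-encode each UTF-8 byte.
manual_euro = "".join(f"%{byte:02X}" for byte in "€".encode("utf-8"))
```

The manual version makes the two-step nature of the encoding explicit: first UTF-8, then one %XX per byte.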

Cross-Language Quick Reference

Code Point	U+2192 →	U+20AC €	U+1F40D 🐍
Python	`\u2192`	`\u20AC`	`\U0001F40D`
JavaScript	`\u2192`	`\u20AC`	`\u{1F40D}`
Java	`\u2192`	`\u20AC`	surrogate pair
C#	`\u2192`	`\u20AC`	`\U0001F40D`
Go	`\u2192`	`\u20AC`	`\U0001F40D`
Rust	`\u{2192}`	`\u{20AC}`	`\u{1F40D}`
HTML	`&#x2192;`	`&#x20AC;`	`&#x1F40D;`
CSS	`\2192`	`\20AC`	`\1F40D`
URL	`%E2%86%92`	`%E2%82%AC`	`%F0%9F%90%8D`
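
Several rows of this table can be generated from a single character. The Python sketch below does so for a handful of the syntaxes. The function name escape_forms and the dictionary keys are hypothetical, and the output covers only the forms that follow mechanically from the code point.

```python
from urllib.parse import quote

def escape_forms(ch: str) -> dict[str, str]:
    cp = ord(ch)
    is_bmp = cp <= 0xFFFF
    return {
        # Python, C#, and Go share the 4-digit / 8-digit split.
        "python": f"\\u{cp:04X}" if is_bmp else f"\\U{cp:08X}",
        # Rust, Swift, and ES2015+ JavaScript use the braced form.
        "braced": f"\\u{{{cp:X}}}",
        "html": f"&#x{cp:X};",
        "css": f"\\{cp:X}",
        "url": quote(ch, safe=""),
    }

snake_forms = escape_forms("🐍")
arrow_forms = escape_forms("→")
```

For the snake, the result contains \U0001F40D, \u{1F40D}, &#x1F40D;, \1F40D, and %F0%9F%90%8D, matching the last column of the table.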

Key Takeaways

4-digit \uXXXX is the BMP-only form, supported in Python, JavaScript, Java, C#, and Go.
8-digit \UXXXXXXXX handles supplementary characters in Python, C#, and Go.
Braced \u{...} with variable length is used in JavaScript (ES2015+), Rust, and Swift, and is the most ergonomic form.
CSS uses a bare backslash without braces: \2192 (1–6 hex digits, space-terminated).
HTML uses &#xHHHH; or &#DDDD;. Note the &, #, x, and terminating ;.
URL encoding is not a Unicode escape per se: it is the UTF-8 bytes of the character, percent-encoded.
Surrogate pair escapes (\uD83D\uDC0D) are a legacy workaround in JavaScript, where \u{1F40D} is preferred in modern code; Java's 4-digit escape syntax still requires them for supplementary characters.
