Unicode Glossary
150 essential Unicode terms explained — from character encoding fundamentals to security concepts.
Encoding (17)
American Standard Code for Information Interchange. 7-bit encoding covering 128 characters (0–127): control characters, digits, Latin letters, and basic symbols.
Visual art created from text characters, originally limited to the 95 printable ASCII characters. Unicode expands the palette with box-drawing …
Binary-to-text encoding that represents binary data using 64 ASCII characters (A–Z, a–z, 0–9, +, /). Used for embedding binary data …
Traditional Chinese character encoding used primarily in Taiwan and Hong Kong, encoding approximately 13,000 CJK characters.
U+FEFF placed at the start of a text stream to indicate byte order and encoding. Essential for UTF-16/32, optional and …
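As a small illustration of the byte order mark, the following Python sketch uses the standard `codecs` module and the built-in `utf-8-sig` codec. The sample string and variable names are arbitrary choices made for this example.

```python
import codecs

text = "hi"

# The "utf-8-sig" codec prepends the UTF-8 form of U+FEFF (bytes EF BB BF).
with_bom = text.encode("utf-8-sig")

# Decoding as plain "utf-8" keeps the BOM as a leading U+FEFF character,
# while "utf-8-sig" strips it.
kept = with_bom.decode("utf-8")
stripped = with_bom.decode("utf-8-sig")

# The generic "utf-16" codec writes a BOM in the platform's byte order.
utf16 = text.encode("utf-16")
has_utf16_bom = utf16[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)
```

Reading files of unknown origin with `utf-8-sig` is a common defensive choice, since it accepts input both with and without a BOM.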
A system that maps characters to byte sequences for digital storage and transmission. Every text file has an encoding — …
Extended Binary Coded Decimal Interchange Code. IBM mainframe encoding with non-contiguous letter ranges, still used in banking and enterprise mainframes.
Korean character encoding based on KS X 1001, mapping Hangul syllables and Hanja to double-byte sequences.
Simplified Chinese character encoding family: GB2312 (6,763 characters) evolved to GBK and then GB18030, the mandatory Unicode-compatible Chinese national standard.
Official registry of character encoding names maintained by IANA, used in HTTP Content-Type headers and MIME (e.g., charset=utf-8).
Family of 8-bit single-byte encodings for different language groups. ISO 8859-1 (Latin-1) was the basis for Unicode's first 256 code …
Japanese character encoding combining single-byte ASCII/JIS Roman with double-byte JIS X 0208 kanji. Still used in legacy Japanese systems.
Obsolete fixed-length 2-byte encoding covering only the BMP (U+0000–U+FFFF). Predecessor to UTF-16 that cannot represent supplementary characters.
Variable-length Unicode encoding using 2 or 4 bytes (1 or 2 code units of 16 bits). Used internally by Java, …
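Fixed-length Unicode encoding using exactly 4 bytes per character. Simple but space-inefficient. CPython 3.3+ uses a 4-byte-per-code-point layout internally only for strings that contain supplementary characters (PEP 393); other strings are stored with 1 or 2 bytes per code point.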
Variable-length Unicode encoding using 1–4 bytes per character. The dominant encoding of the web (98%+ of websites) with full ASCII …
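To make the size trade-offs between UTF-8, UTF-16, and UTF-32 concrete, here is a minimal Python sketch that measures how many bytes one character takes in each. The helper name `encoded_sizes` and the sample characters are illustrative choices, not part of any standard API.

```python
def encoded_sizes(ch):
    """Return the byte length of one character in UTF-8, UTF-16, and UTF-32."""
    # The "-le" codec variants are used so that no byte order mark is added.
    return tuple(len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le"))


# ASCII letter, Latin-1 letter, BMP symbol, supplementary-plane emoji.
sizes = {ch: encoded_sizes(ch) for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]}
```

Here `sizes` maps "A" to (1, 2, 4), "é" to (2, 2, 4), "€" to (3, 2, 4), and "😀" to (4, 4, 4).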
Microsoft's superset of ISO 8859-1, adding smart quotes, em dash, and euro sign in the 0x80–0x9F range. The most common …
Unicode Standard (25)
A unit of information used for organizing, controlling, or representing textual data — the conceptual entity before it receives a …
A code point that has been given a character designation in a Unicode version. As of Unicode 16.0, 154,998 code …
Plane 0 (U+0000–U+FFFF), containing the most commonly used characters including Latin, Greek, Cyrillic, CJK, Arabic, and most symbols. Characters here …
Chinese, Japanese, and Korean — the collective term for the unified Han ideograph block and related scripts in Unicode. CJK …
A numerical value in the Unicode code space (U+0000 to U+10FFFF), written as U+XXXX. Not all code points are assigned …
The complete range of possible Unicode code points: U+0000 to U+10FFFF (1,114,112 total), divided into 17 planes of 65,536 code …
The minimal unit of encoding: an 8-bit byte in UTF-8, a 16-bit word in UTF-16, a 32-bit word in UTF-32. …
The process of mapping Chinese, Japanese, and Korean ideographs that share a common historical origin to a single Unicode code …
The individual consonant and vowel components (jamo) of the Korean Hangul writing system. Unicode encodes both precomposed Hangul syllables (U+AC00–U+D7A3) …
International standard (ISO/IEC 10646) synchronized with Unicode, defining the same character repertoire and code points but without Unicode's additional algorithms …
Code points permanently reserved for internal use (66 total): U+FDD0–U+FDEF and U+nFFFE/U+nFFFF for each plane. Valid in text but should …
A contiguous block of 65,536 code points. Unicode has 17 planes (0–16): Plane 0 is the BMP, Plane 1 is …
Reserved ranges where organizations can assign their own characters: BMP PUA (U+E000–U+F8FF) plus Supplementary PUAs in Planes 15 and 16.
A code point set aside for future standardization, distinct from noncharacters (permanently reserved) and private use areas (user-assignable).
Planes 1–16 (U+10000–U+10FFFF), containing emoji, historic scripts, CJK extensions, and musical notation. Requires surrogate pairs in UTF-16.
Code points U+D800–U+DFFF reserved exclusively for UTF-16 surrogate pairs. Not valid Unicode scalar values and should never appear as standalone …
A code point not yet assigned a character in any Unicode version, categorized as Cn (Unassigned). May be assigned in …
Universal character encoding standard assigning a unique number (code point) to every character in every writing system. Version 16.0 contains …
Machine-readable collection of data files defining all Unicode character properties, including UnicodeData.txt, Blocks.txt, Scripts.txt, and many more.
Non-profit organization that develops and maintains the Unicode Standard. Members include Apple, Google, Microsoft, Meta, and many others.
Any code point except surrogate code points (U+D800–U+DFFF). The valid set of values that can represent actual characters, totaling 1,112,064.
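The total of 1,112,064 follows from subtracting the surrogate range from the full code space. A short Python sketch of that arithmetic, with a hypothetical helper `is_scalar_value` added for illustration:

```python
CODE_SPACE_SIZE = 0x10FFFF + 1            # U+0000..U+10FFFF -> 1,114,112 code points
SURROGATE_COUNT = 0xDFFF - 0xD800 + 1     # U+D800..U+DFFF   -> 2,048 code points
SCALAR_VALUE_COUNT = CODE_SPACE_SIZE - SURROGATE_COUNT


def is_scalar_value(code_point):
    """True if the integer is a code point outside the surrogate range."""
    in_code_space = 0 <= code_point <= 0x10FFFF
    is_surrogate = 0xD800 <= code_point <= 0xDFFF
    return in_code_space and not is_surrogate
```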
Guarantee that once a character is assigned, its code point and name never change. Properties may be refined but assignments …
Normative or informative documents that are integral parts of the Unicode Standard. UAX#9 (Bidi Algorithm), UAX#11 (East Asian Width), UAX#15 …
Documents published by the Unicode Consortium covering specific topics, such as security mechanisms (UTS#39, a Technical Standard), text segmentation (UAX#29, a Standard Annex), and line breaking …
Major releases of the Unicode Standard, each adding new characters, scripts, and features. The current version is Unicode 16.0 (September …
Properties (19)
The Unicode version in which a character was first assigned. Useful for determining character support across systems and software versions.
Property determining how a character behaves in bidirectional text (LTR, RTL, weak, neutral). Used by the Unicode Bidirectional Algorithm to …
A named contiguous range of code points (e.g., Basic Latin = U+0000–U+007F). Unicode 16.0 defines 338 blocks; every code point …
Two character sequences that are semantically identical and should be treated as equal. Example: é (U+00E9) ≡ e + ◌́ …
The rules for converting characters between uppercase, lowercase, and titlecase. Can be locale-dependent (Turkish I problem) and one-to-many (ß → …
Numeric value (0–254) controlling the ordering of combining marks during canonical decomposition, determining which combining marks can be reordered.
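The reordering effect can be observed with Python's standard `unicodedata` module. This is a minimal sketch; the two combining marks were picked for illustration because they have different combining classes.

```python
import unicodedata

acute = "\u0301"      # COMBINING ACUTE ACCENT, combining class 230 (above)
dot_below = "\u0323"  # COMBINING DOT BELOW, combining class 220 (below)

classes = (
    unicodedata.combining(acute),
    unicodedata.combining(dot_below),
    unicodedata.combining("a"),  # base letters (starters) have class 0
)

# Marks with different non-zero classes are sorted by class during
# normalization, so both typing orders end up as the same sequence.
typed = "a" + acute + dot_below
reordered = unicodedata.normalize("NFD", typed)
```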
Two character sequences with the same abstract content that may differ in appearance. Broader than canonical equivalence. Example: ﬁ (U+FB01) ≈ …
The mapping of a character to its component parts. Canonical decomposition preserves meaning (é → e + ◌́); compatibility decomposition …
Characters that should have no visible effect and can be ignored by processes that do not support them, including variation …
Unicode property (UAX#11) classifying characters as Narrow, Wide, Fullwidth, Halfwidth, Ambiguous, or Neutral. Wide characters (CJK ideographs, katakana) occupy two …
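A rough terminal-width estimate can be built on this property with `unicodedata.east_asian_width`. The function `display_width` below is a hypothetical helper and a simplification: it ignores combining marks, zero-width characters, and the context-dependent Ambiguous class.

```python
import unicodedata


def display_width(text):
    """Estimate terminal columns: 2 for Wide/Fullwidth characters, 1 otherwise."""
    width = 0
    for ch in text:
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            width += 2
        else:
            width += 1
    return width


widths = {
    "A": unicodedata.east_asian_width("A"),            # Narrow
    "\u4e2d": unicodedata.east_asian_width("\u4e2d"),  # CJK ideograph, Wide
    "\uff21": unicodedata.east_asian_width("\uff21"),  # FULLWIDTH LATIN CAPITAL LETTER A
    "\uff71": unicodedata.east_asian_width("\uff71"),  # HALFWIDTH KATAKANA LETTER A
}
```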
The user-perceived 'character' — what feels like a single unit. May consist of multiple code points (base + combining marks, …
Classification of every code point into one of 30 categories (Lu, Ll, Nd, So, etc.) grouped into 7 major classes: …
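In Python, the two-letter category code is available through `unicodedata.category`. A minimal sketch with a few sample characters chosen for illustration:

```python
import unicodedata

samples = ["A", "a", "5", "\u00a9", " ", "\u0301"]

# Map each sample character to its two-letter General Category code.
categories = {ch: unicodedata.category(ch) for ch in samples}

# The first letter of the code is the major class (L, M, N, P, S, Z, C).
major_classes = {ch: code[0] for ch, code in categories.items()}
```

For these samples the codes are Lu, Ll, Nd, So, Zs, and Mn respectively.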
Unicode property controlling how Arabic and Syriac characters connect to adjacent characters. Types include Right_Joining, Left_Joining, Dual_Joining, and Non_Joining.
Characters whose glyph should be horizontally mirrored in RTL context. Examples: ( → ), [ → ], { → }, …
Alternative names for characters, since Unicode names cannot change per the stability policy. Used for corrections, abbreviations, and figments.
The numeric interpretation of a character, if any: digit value (0–9), decimal value, or general numeric value (e.g., ½ = …
Characters used to organize and clarify written language: periods, commas, dashes, quotation marks, and more. Unicode General Category P covers …
The writing system a character belongs to (e.g., Latin, Cyrillic, Han). Unicode 16.0 defines 168 scripts; the Script property is …
Unicode property listing all scripts that use a character, broader than the single-valued Script property. Common characters like digits have …
Algorithms (15)
Mapping characters to a common case form for case-insensitive comparison. More comprehensive than lowercasing: German ß → ss, Turkish İ …
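Python exposes case folding as `str.casefold()`. The sketch below contrasts it with `str.lower()`; the helper `equal_ignoring_case` is a hypothetical name, and it does not apply normalization or locale-specific rules such as the Turkish ones.

```python
def equal_ignoring_case(a, b):
    """Compare two strings after full Unicode case folding."""
    return a.casefold() == b.casefold()


lowered = "Stra\u00dfe".lower()      # lower() keeps the sharp s
folded = "Stra\u00dfe".casefold()    # casefold() expands it to "ss"
```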
Characters excluded from canonical composition (NFC) to prevent non-starter decomposition and ensure algorithmic stability. Listed in CompositionExclusions.txt.
Rules (UAX#29) for determining where one user-perceived character ends and another begins. Critical for cursor movement, text selection, and correctly …
Normalization Form C: decompose then recompose canonically, producing the shortest form. Recommended for data storage and interchange; the web standard …
Normalization Form D: fully decompose without recomposing. Used by the macOS HFS+ filesystem. é (U+00E9) → e + ◌́ (U+0065 …
Normalization Form KC: compatibility decomposition then canonical composition. Merges visually similar characters (ﬁ→fi, ²→2, Ⅳ→IV). Used for identifier comparison.
Normalization Form KD: compatibility decomposition without recomposing. The most aggressive normalization, losing the most formatting information.
The position between sentences per Unicode rules. More complex than splitting on periods — handles abbreviations (Mr.), ellipsis (...), and …
Comparing Unicode strings requires normalization (NFC/NFD) and optionally collation (locale-aware sorting). Binary comparison of code points alone gives incorrect results …
Algorithm determining display order of characters in mixed-direction text (e.g., English + Arabic), using character bidi categories and explicit directional …
Standard algorithm for comparing and sorting Unicode strings using multi-level comparison: base character → accents → case → tie-breakers. Locale-customizable.
Rules for determining where text can wrap to the next line, considering character properties, CJK word boundaries, and break opportunities.
Process of converting Unicode text to a standard canonical form. Four forms: NFC (composed), NFD (decomposed), NFKC (compatibility composed), NFKD …
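The four forms can be compared side by side with `unicodedata.normalize` from the Python standard library. This is a minimal sketch; the sample strings are illustrative.

```python
import unicodedata

composed = "\u00e9"       # é as one precomposed code point
decomposed = "e\u0301"    # e followed by COMBINING ACUTE ACCENT
ligature = "\ufb01"       # LATIN SMALL LIGATURE FI

nfc = unicodedata.normalize("NFC", decomposed)   # recomposes to U+00E9
nfd = unicodedata.normalize("NFD", composed)     # splits into e + U+0301

# Canonical forms leave compatibility characters alone;
# the K forms replace them with their plain equivalents.
nfc_ligature = unicodedata.normalize("NFC", ligature)
nfkc_ligature = unicodedata.normalize("NFKC", ligature)
nfkd_mixed = unicodedata.normalize("NFKD", "\u00b2\u2163")  # superscript two, Roman numeral four
```

Normalizing both sides to the same form before comparing is what makes `composed` and `decomposed` compare equal.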
Algorithms for finding boundaries in text: grapheme cluster, word, and sentence boundaries. Critical for cursor movement, text selection, and text …
The position between words as determined by Unicode word break rules. Not a simple split on spaces — handles CJK …
Typography (19)
A character that attaches to the preceding base character to modify it. General Category: Mn (nonspacing), Mc (spacing combining), Me …
CSS @font-face descriptor specifying which Unicode code points a font should cover. Enables downloading only the font subset needed for …
Punctuation marks used to separate parts of a sentence or indicate ranges. Unicode defines multiple dashes: hyphen (‐), en dash …
A mark added to a letter to change pronunciation or meaning. Can be precomposed (é U+00E9) or combining (e + …
U+2026 HORIZONTAL ELLIPSIS (…). A single character replacing three periods, typographically correct and counting as 1 character instead of 3.
Em: a width equal to the font size. En: half an em. Used to define em dash width, em space, …
A specific implementation of a typeface at a particular size, weight, and style. In digital typography, a font file (TTF, …
The mechanism by which a rendering engine substitutes glyphs from a secondary font when the primary font lacks coverage for …
The visual representation of a character as rendered by a font. One character may have multiple glyphs (ligatures, contextual forms); …
Adjusting the spacing between specific character pairs for visual harmony (e.g., AV, To, LT). A font feature, not a Unicode …
Two or more characters joined into a single glyph. Can be typographic (fi → ﬁ via OpenType) or a Unicode …
U+00A0. A space that prevents line breaking at its position. HTML: &nbsp;. Used between numbers and units (100 km), in …
Modern font format developed by Microsoft and Adobe supporting up to 65,535 glyphs, advanced typographic features (ligatures, alternates, kerning), and …
Paired punctuation marks enclosing direct speech or quotations. Unicode includes straight (""), curly (“”), guillemets (« »), CJK corner brackets …
Text directionality where characters flow from right to left. Used by Arabic, Hebrew, Thaana, and other scripts; requires the Bidirectional …
Uppercase letterforms at the height of lowercase letters. CSS: font-variant: small-caps. Unicode also has actual small capital letters in Latin …
Fonts downloaded by the browser to render text, declared via CSS @font-face. WOFF2 is the standard format. Unicode subsetting and …
Characters that represent horizontal or vertical space but have no visible glyph. Unicode defines 17+ whitespace characters with different widths …
Characters with zero advance width — invisible in rendering but affecting text behavior. Includes ZWSP (word break), ZWJ (join), ZWNJ …
Input Methods (9)
Windows input method using Alt + numpad digits to type characters by their code page number (Alt+0169 → ©, Alt+0176 …
GUI utility for browsing and inserting Unicode characters. Windows: charmap.exe. Mac: Character Viewer (Control+Command+Space). Linux: gucharmap.
A system-level tool for browsing and inserting Unicode characters. macOS Character Viewer (Ctrl+Cmd+Space), Windows Character Map (charmap.exe), and Linux gucharmap …
UI component (native or web-based) for browsing and selecting characters visually. Emoji pickers on mobile are the most common example.
A key (usually Right Alt or custom-mapped) that starts a multi-key composition sequence. A Linux/Unix feature: Compose + a + …
A key that produces no output immediately but modifies the next keystroke. Used for diacritics: pressing ` then e produces …
Direct Unicode code point entry by typing the hex value. Mac: hold Option + hex + release. Windows: type hex …
Software component enabling input of complex characters (CJK, Korean, etc.) using a standard keyboard, converting keystroke sequences into characters through …
Any method for entering characters by their Unicode code point: hex input (Mac), U+XXXX entry via Ctrl+Shift+U (Linux), or Alt+X …
Web & HTML (16)
HTTP header parameter declaring the character encoding of a response (Content-Type: text/html; charset=utf-8). Overrides an in-document meta charset declaration, though in HTML a byte order mark takes precedence over both.
CSS property inserting generated content via ::before and ::after pseudo-elements using Unicode escapes: content: "\2713" inserts ✓.
CSS properties (direction, writing-mode, unicode-bidi) controlling text layout direction. Works with Unicode Bidi Algorithm for mixed LTR/RTL content in web …
Rendering a character with a colorful emoji glyph, typically using Variation Selector 16 (U+FE0F). Some characters default to emoji presentation, …
A textual representation of a character in HTML. Three forms: named (&amp;), decimal (&#38;), hexadecimal (&#x26;). Essential for characters that …
Domain names containing non-ASCII Unicode characters, internally stored as Punycode (xn--...) but displayed in Unicode to users. Security concern: homograph …
ECMAScript Internationalization API providing locale-aware string comparison (Collator), number formatting (NumberFormat), date formatting (DateTimeFormat), and segmentation (Segmenter).
HTML entity using a human-readable name: &copy; → ©, &mdash; → —. HTML5 defines 2,231 named references; they are case-sensitive.
HTML entity using the Unicode code point number: decimal (&#169; → ©) or hexadecimal (&#xA9; → ©). Works for any …
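Numeric references can be generated directly from a code point and decoded with `html.unescape` from the Python standard library. The helper `to_numeric_reference` is a hypothetical name used for this sketch.

```python
import html


def to_numeric_reference(ch, hexadecimal=True):
    """Build an HTML numeric character reference for a single character."""
    code_point = ord(ch)
    if hexadecimal:
        return f"&#x{code_point:X};"
    return f"&#{code_point};"


hex_ref = to_numeric_reference("\u00a9")                     # copyright sign
dec_ref = to_numeric_reference("\u00a9", hexadecimal=False)

# html.unescape resolves named, decimal, and hexadecimal references alike.
decoded = html.unescape("&copy; &#169; &#xA9;")
```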
Encoding non-ASCII and reserved characters in URLs by replacing each byte with %XX. UTF-8 is used first, then each byte …
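The two-step process (UTF-8 first, then %XX per byte) is what `urllib.parse.quote` does by default in Python. A minimal sketch with illustrative input strings:

```python
from urllib.parse import quote, unquote

# "é" becomes the UTF-8 bytes C3 A9, and each byte is written as %XX.
encoded_path = quote("caf\u00e9")

# Reserved characters such as space and "&" are escaped the same way.
encoded_reserved = quote("a b&c")

round_trip = unquote(encoded_path)
```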
ASCII-compatible encoding of Unicode domain names, converting internationalized labels to xn-- prefixed ASCII strings. münchen.de → xn--mnchen-3ya.de.
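The conversion can be reproduced with Python's built-in `punycode` and `idna` codecs. Note that the standard-library `idna` codec implements the older IDNA 2003 rules, so this is a sketch of the mechanism and not a full modern IDNA implementation.

```python
# Raw Punycode for a single label, without the "xn--" prefix.
label = "m\u00fcnchen".encode("punycode")

# The "idna" codec handles each dot-separated label and adds the prefix.
ascii_domain = "m\u00fcnchen.de".encode("idna")

# Decoding reverses the conversion for display.
unicode_domain = ascii_domain.decode("idna")
```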
Rendering a character with a plain monochrome text glyph rather than a colorful emoji, typically using Variation Selector 15 (U+FE0E) …
CSS supports Unicode via escape sequences (\2713 for ✓), the content property for generated text, unicode-range for font subsetting, and …
Characters (U+FE00–U+FE0F, U+E0100–U+E01EF) that select a specific glyph variant. VS15 (U+FE0E) = text presentation, VS16 (U+FE0F) = emoji presentation.
U+2060. A zero-width character that prevents line breaking. The modern replacement for U+FEFF (BOM) as a zero-width no-break space.
XML's version of numeric character references: &#10003; or &#x2713;. XML has only 5 named entities (&amp; &lt; &gt; &quot; &apos;), …
Programming & Development (13)
Encoding converts characters to bytes (str.encode('utf-8')); decoding converts bytes to characters (bytes.decode('utf-8')). Getting this right prevents mojibake.
Any character with no visible glyph: whitespace, zero-width characters, control characters, and formatting characters. Can cause security issues such as …
Java strings use UTF-16 internally. char is 16-bit (only BMP). For supplementary characters, use codePointAt() and Character.toChars(). Java's \uXXXX escapes …
Garbled text resulting from decoding bytes with the wrong encoding. Japanese term (文字化け). Example: 'café' stored as UTF-8 but read …
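Mojibake is easy to reproduce in Python by decoding UTF-8 bytes with a single-byte codec. In this sketch Windows-1252 (`cp1252`) is assumed as the wrong encoding; the repair step only works when the wrong decode lost no bytes.

```python
original = "caf\u00e9"

# Encode correctly as UTF-8: the é becomes the two bytes C3 A9.
utf8_bytes = original.encode("utf-8")

# Decode with the wrong codec: each byte turns into its own character.
garbled = utf8_bytes.decode("cp1252")

# Reverse the mistake: re-encode with the wrong codec, decode with the right one.
repaired = garbled.encode("cp1252").decode("utf-8")
```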
U+0000 (NUL). The first Unicode/ASCII character, used as a string terminator in C/C++. Security risk: null byte injection can truncate …
Python 3 uses Unicode strings by default (str is a sequence of code points; CPython stores each string with 1, 2, or 4 bytes per code point under PEP 393). Key features: \N{name} escapes, unicodedata module, …
U+FFFD (�). Displayed when a decoder encounters invalid byte sequences — the universal symbol for 'something went wrong with decoding'.
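In Python, the replacement character appears when decoding with `errors="replace"`; the default strict mode raises an exception instead. A minimal sketch using a Latin-1 byte that is invalid as UTF-8:

```python
# 0xE9 is "é" in Latin-1 but an incomplete sequence in UTF-8.
latin1_bytes = b"caf\xe9"

# Lenient decoding substitutes U+FFFD for the invalid byte.
lenient = latin1_bytes.decode("utf-8", errors="replace")

# Strict decoding (the default) refuses the input.
try:
    latin1_bytes.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```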
Rust strings (str/String) are guaranteed valid UTF-8. char type represents a Unicode scalar value (4 bytes). Iteration via .chars() yields …
A sequence of characters in a programming language. Internal representation varies: UTF-8 (Go, Rust), UTF-16 (Java, JavaScript, …
The 'length' of a Unicode string depends on the unit: code units (JavaScript .length), code points (Python len()), or grapheme …
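The different units can be counted for one string in Python. This sketch covers code points, UTF-16 code units, and UTF-8 bytes; counting grapheme clusters needs a segmentation library and is left out.

```python
# "é" written as e + combining acute, followed by an emoji.
# A reader would perceive this as two characters.
s = "e\u0301\U0001F600"

code_points = len(s)                               # what Python's len() counts
utf16_code_units = len(s.encode("utf-16-le")) // 2  # what JavaScript's .length counts
utf8_bytes = len(s.encode("utf-8"))                # storage size in UTF-8
```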
Two 16-bit code units (a high surrogate U+D800–U+DBFF + low surrogate U+DC00–U+DFFF) that together encode a supplementary character in UTF-16. …
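The mapping from a supplementary code point to its surrogate pair is simple arithmetic: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves. The function names below are hypothetical and chosen for this sketch.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only supplementary code points need surrogates")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Combine a (high, low) surrogate pair back into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


pair = to_surrogate_pair(0x1F600)  # GRINNING FACE
```

The result matches what Python's own UTF-16 encoder produces for the same character.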
Syntax for representing Unicode characters in source code. Varies by language: \u2713 (Python/Java/JS), \u{2713} (JS/Ruby/Rust), \U00012345 (Python/C).
Regex patterns using Unicode properties: \p{L} (any letter), \p{Script=Greek} (Greek script), \p{Emoji}. Support varies by language and regex engine.
Security (10)
Using Unicode bidirectional override characters (U+202A–U+202E, U+2066–U+2069) to disguise malicious file names or code. A file named 'readme' + U+202E + 'fdp.exe' displays as 'readmeexe.pdf'.
Exploiting Unicode bidirectional control characters to disguise malicious code or filenames. The 'Trojan Source' attack (CVE-2021-42574) uses bidi overrides to …
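A simple defensive check is to scan file names or source text for the bidirectional control characters in the ranges given above. The function `find_bidi_controls` is a hypothetical helper for this sketch; real tools may flag a wider set of characters.

```python
# U+202A..U+202E (embeddings and overrides) and U+2066..U+2069 (isolates).
BIDI_CONTROLS = {chr(cp) for cp in range(0x202A, 0x202F)} | {
    chr(cp) for cp in range(0x2066, 0x206A)
}


def find_bidi_controls(text):
    """Return (index, 'U+XXXX') for every bidi control character in text."""
    return [
        (index, f"U+{ord(ch):04X}")
        for index, ch in enumerate(text)
        if ch in BIDI_CONTROLS
    ]


suspicious = find_bidi_controls("readme\u202efdp.exe")
clean = find_bidi_controls("readme.pdf")
```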
Unicode's official term for character pairs that can be visually confused, defined in confusables.txt (UCD). Broader than homoglyphs — includes …
Characters from different scripts that look identical or very similar, such as Latin 'a' vs Cyrillic 'а'. Used in phishing, …
Using visually similar Unicode characters in domain names to impersonate legitimate sites. аpple.com (Cyrillic а) looks like apple.com. Browsers defend …
Identifying text that mixes characters from different scripts (e.g., Latin + Cyrillic). A primary defense against homoglyph attacks; browsers use …
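Python's standard library does not expose the Script property, so the sketch below uses a crude stand-in: the first word of each letter's Unicode name (LATIN, CYRILLIC, GREEK, and so on). This heuristic and the function names are assumptions made for illustration; production code should use real Script data, for example from ICU.

```python
import unicodedata


def rough_scripts(text):
    """Approximate the set of scripts in text from character names."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                # Heuristic: "CYRILLIC SMALL LETTER A" -> "CYRILLIC".
                scripts.add(name.split()[0])
    return scripts


def looks_mixed(text):
    """True if letters from more than one (approximate) script are present."""
    return len(rough_scripts(text)) > 1


spoofed = "\u0430pple"  # first letter is CYRILLIC SMALL LETTER A
```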
Exploiting Unicode normalization to bypass security filters. Input validated before normalization may change form after: 'ﬁ' (U+FB01) normalizes to 'fi', …
Using Unicode features to deceive users: homoglyphs for fake domains, bidi overrides for fake file extensions, or invisible characters for …
U+200D. Requests that adjacent characters be joined. Critical for emoji sequences (👩+ZWJ+💻=👩💻). In Indic scripts, requests ligature formation. Can also …
U+200C. Prevents joining of adjacent characters. Essential in Persian/Arabic for correct letter forms and used in Devanagari to prevent ligatures.
Emoji (6)
Pictographic Unicode characters originating from Japanese mobile phones. Now 3,790+ emoji across multiple blocks (Emoticons, Misc Symbols & Pictographs, Transport, …
Fitzpatrick scale skin tone modifiers (U+1F3FB–U+1F3FF) that change the skin color of human emoji by being placed immediately after a …
Multi-character emoji constructed by combining base emoji with modifiers, ZWJ characters, or variation selectors. Types include keycap sequences (#️⃣), flag …
Five Fitzpatrick scale modifiers (U+1F3FB–U+1F3FF, 🏻–🏿) that change human emoji skin color. Applied by appending the modifier after a base …
Emoji constructed by joining multiple emoji with Zero Width Joiner (U+200D). 👨👩👧👦 = Man + ZWJ + Woman + ZWJ …
26 characters (U+1F1E6–U+1F1FF, 🇦–🇿) that combine in pairs to form country flag emoji based on ISO 3166-1 country codes. 🇺+🇸 …
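Because the 26 characters run in alphabetical order from U+1F1E6, a flag sequence can be computed from a two-letter code by offsetting each letter. The helper `flag_sequence` is a hypothetical name; whether the pair renders as a flag depends on the font and platform.

```python
REGIONAL_INDICATOR_A = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A


def flag_sequence(country_code):
    """Map a two-letter ASCII country code to its regional indicator pair."""
    if len(country_code) != 2:
        raise ValueError("expected a two-letter code")
    code = country_code.upper()
    if not all("A" <= letter <= "Z" for letter in code):
        raise ValueError("expected ASCII letters A-Z")
    return "".join(
        chr(REGIONAL_INDICATOR_A + ord(letter) - ord("A")) for letter in code
    )


us_flag = flag_sequence("us")
```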