Unicode Glossary

150 essential Unicode terms explained — from character encoding fundamentals to security concepts.

Encoding (17)

ASCII

American Standard Code for Information Interchange. 7-bit encoding covering 128 characters (0–127): control characters, digits, Latin letters, and basic symbols.

ASCII Art

Visual art created from text characters, originally limited to the 95 printable ASCII characters. Unicode expands the palette with box-drawing …

Base64

Binary-to-text encoding that represents binary data using 64 ASCII characters (A–Z, a–z, 0–9, +, /). Used for embedding binary data …
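
A minimal Python sketch of the idea, using the standard `base64` module. The sample string is an arbitrary choice for illustration.

```python
import base64

# Text must become bytes first; Base64 then maps those bytes to ASCII.
raw = "\u00e9".encode("utf-8")          # é -> b'\xc3\xa9' (2 bytes)
encoded = base64.b64encode(raw)         # b'w6k=' ('=' is padding)
decoded = base64.b64decode(encoded).decode("utf-8")
```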

Big5

Traditional Chinese character encoding used primarily in Taiwan and Hong Kong, encoding approximately 13,000 CJK characters.

Byte Order Mark (BOM)

U+FEFF placed at the start of a text stream to indicate byte order and encoding. Essential for UTF-16/32, optional and …
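
As a rough illustration, a BOM can be sniffed from the first bytes of a stream. The helper name `sniff_bom` is hypothetical; the BOM constants come from Python's standard `codecs` module.

```python
import codecs

# UTF-32 must be checked before UTF-16: the UTF-32 LE BOM (FF FE 00 00)
# begins with the UTF-16 LE BOM (FF FE).
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data: bytes):
    """Return the encoding implied by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None
```

The absence of a BOM says nothing about the encoding, so a `None` result still needs another source of information.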

Character Encoding

A system that maps characters to byte sequences for digital storage and transmission. Every text file has an encoding — …

EBCDIC

Extended Binary Coded Decimal Interchange Code. IBM mainframe encoding with non-contiguous letter ranges, still used in banking and enterprise mainframes.

EUC-KR

Korean character encoding based on KS X 1001, mapping Hangul syllables and Hanja to double-byte sequences.

GB2312 / GB18030

Simplified Chinese character encoding family: GB2312 (6,763 characters) evolved to GBK and then GB18030, the mandatory Unicode-compatible Chinese national standard.

IANA Charset

Official registry of character encoding names maintained by IANA, used in HTTP Content-Type headers and MIME (e.g., charset=utf-8).

ISO 8859

Family of 8-bit single-byte encodings for different language groups. ISO 8859-1 (Latin-1) was the basis for Unicode's first 256 code …

Shift JIS

Japanese character encoding combining single-byte ASCII/JIS Roman with double-byte JIS X 0208 kanji. Still used in legacy Japanese systems.

UCS-2

Obsolete fixed-length 2-byte encoding covering only the BMP (U+0000–U+FFFF). Predecessor to UTF-16 that cannot represent supplementary characters.

UTF-16

Variable-length Unicode encoding using 2 or 4 bytes (1 or 2 code units of 16 bits). Used internally by Java, …

UTF-32

Fixed-length Unicode encoding using exactly 4 bytes per code point. Simple but space-inefficient; CPython 3 uses an equivalent 4-byte layout only for strings that contain supplementary characters (PEP 393).

UTF-8

Variable-length Unicode encoding using 1–4 bytes per character. The dominant encoding of the web (98%+ of websites) with full ASCII …
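
The size trade-offs between the three UTF encodings can be checked directly. This is a small sketch; the sample characters and the helper name `encoded_sizes` are chosen only for illustration.

```python
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]  # A, é, €, 😀

def encoded_sizes(ch: str) -> tuple:
    """Return byte counts as (UTF-8, UTF-16, UTF-32), without a BOM."""
    return (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),
        len(ch.encode("utf-32-le")),
    )

sizes = {ch: encoded_sizes(ch) for ch in samples}
# A -> (1, 2, 4), é -> (2, 2, 4), € -> (3, 2, 4), 😀 -> (4, 4, 4)
```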

Windows-1252

Microsoft's superset of ISO 8859-1, adding smart quotes, em dash, and euro sign in the 0x80–0x9F range. The most common …

Unicode Standard (25)

Abstract Character

A unit of information used for organizing, controlling, or representing textual data — the conceptual entity before it receives a …

Assigned Character

A code point that has been given a character designation in a Unicode version. As of Unicode 16.0, 154,998 code …

Basic Multilingual Plane (BMP)

Plane 0 (U+0000–U+FFFF), containing the most commonly used characters including Latin, Greek, Cyrillic, CJK, Arabic, and most symbols. Characters here …

CJK

Chinese, Japanese, and Korean — the collective term for the unified Han ideograph block and related scripts in Unicode. CJK …

Code Point

A numerical value in the Unicode code space (U+0000 to U+10FFFF), written as U+XXXX. Not all code points are assigned …

Code Space

The complete range of possible Unicode code points: U+0000 to U+10FFFF (1,114,112 total), divided into 17 planes of 65,536 code …

Code Unit

The minimal unit of encoding: an 8-bit byte in UTF-8, a 16-bit word in UTF-16, a 32-bit word in UTF-32. …

Han Unification

The process of mapping Chinese, Japanese, and Korean ideographs that share a common historical origin to a single Unicode code …

Hangul Jamo

The individual consonant and vowel components (jamo) of the Korean Hangul writing system. Unicode encodes both precomposed Hangul syllables (U+AC00–U+D7A3) …
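
Precomposed syllables relate to conjoining jamo by simple arithmetic. The sketch below follows the Hangul syllable decomposition constants from the Unicode Standard; the function name `decompose_hangul` is hypothetical.

```python
import unicodedata

S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28

def decompose_hangul(syllable: str) -> str:
    """Split one precomposed syllable into leading, vowel, and trailing jamo."""
    index = ord(syllable) - S_BASE
    if not 0 <= index < L_COUNT * V_COUNT * T_COUNT:
        raise ValueError("not a precomposed Hangul syllable")
    lead = L_BASE + index // (V_COUNT * T_COUNT)
    vowel = V_BASE + (index % (V_COUNT * T_COUNT)) // T_COUNT
    trail = index % T_COUNT            # 0 means no trailing consonant
    jamo = chr(lead) + chr(vowel)
    if trail:
        jamo += chr(T_BASE + trail)
    return jamo

han = decompose_hangul("\ud55c")       # 한 -> U+1112 U+1161 U+11AB
```

The result should match what `unicodedata.normalize("NFD", ...)` produces for the same syllable.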

ISO 10646 / Universal Character Set

International standard (ISO/IEC 10646) synchronized with Unicode, defining the same character repertoire and code points but without Unicode's additional algorithms …

Noncharacter

Code points permanently reserved for internal use (66 total): U+FDD0–U+FDEF and U+nFFFE/U+nFFFF for each plane. Valid in text but should …

Plane

A contiguous block of 65,536 code points. Unicode has 17 planes (0–16): Plane 0 is the BMP, Plane 1 is …

Private Use Area (PUA)

Reserved ranges where organizations can assign their own characters: BMP PUA (U+E000–U+F8FF) plus Supplementary PUAs in Planes 15 and 16.

Reserved Code Point

A code point set aside for future standardization, distinct from noncharacters (permanently reserved) and private use areas (user-assignable).

Supplementary Plane / Astral Plane

Planes 1–16 (U+10000–U+10FFFF), containing emoji, historic scripts, CJK extensions, and musical notation. Requires surrogate pairs in UTF-16.

Surrogate

Code points U+D800–U+DFFF reserved exclusively for UTF-16 surrogate pairs. Not valid Unicode scalar values and should never appear as standalone …

Unassigned Code Point

A code point not yet assigned a character in any Unicode version, categorized as Cn (Unassigned). May be assigned in …

Unicode

Universal character encoding standard assigning a unique number (code point) to every character in every writing system. Version 16.0 contains …

Unicode Character Database (UCD)

Machine-readable collection of data files defining all Unicode character properties, including UnicodeData.txt, Blocks.txt, Scripts.txt, and many more.

Unicode Consortium

Non-profit organization that develops and maintains the Unicode Standard. Members include Apple, Google, Microsoft, Meta, and many others.

Unicode Scalar Value

Any code point except surrogate code points (U+D800–U+DFFF). The valid set of values that can represent actual characters, totaling 1,112,064.
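
The totals quoted in this section follow from simple arithmetic, sketched here in Python. The helper name `is_scalar_value` is chosen for illustration.

```python
PLANES = 17
CODE_POINTS_PER_PLANE = 0x10000                  # 65,536
code_space = PLANES * CODE_POINTS_PER_PLANE      # 1,114,112
surrogates = 0xDFFF - 0xD800 + 1                 # 2,048
scalar_values = code_space - surrogates          # 1,112,064

def is_scalar_value(cp: int) -> bool:
    """True for any code point in range that is not a surrogate."""
    return 0 <= cp <= 0x10FFFF and not 0xD800 <= cp <= 0xDFFF
```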

Unicode Stability Policy

Guarantee that once a character is assigned, its code point and name never change. Properties may be refined but assignments …

Unicode Standard Annex (UAX)

Normative or informative documents that are integral parts of the Unicode Standard. UAX#9 (Bidi Algorithm), UAX#11 (East Asian Width), UAX#15 …

Unicode Technical Report (UTR)

Informational documents published by the Unicode Consortium on specific topics. Related specifications carry different designations: security mechanisms is a Technical Standard (UTS #39), while text segmentation (UAX #29) and line breaking …

Unicode Version

Major releases of the Unicode Standard, each adding new characters, scripts, and features. The current version is Unicode 16.0 (September …

Properties (19)

Age Property

The Unicode version in which a character was first assigned. Useful for determining character support across systems and software versions.

Bidirectional Category

Property determining how a character behaves in bidirectional text (LTR, RTL, weak, neutral). Used by the Unicode Bidirectional Algorithm to …

Block

A named contiguous range of code points (e.g., Basic Latin = U+0000–U+007F). Unicode 16.0 defines 338 blocks; every code point …

Canonical Equivalence

Two character sequences that are semantically identical and should be treated as equal. Example: é (U+00E9) ≡ e + ◌́ …

Case Mapping

The rules for converting characters between uppercase, lowercase, and titlecase. Can be locale-dependent (Turkish I problem) and one-to-many (ß → …

Combining Class

Numeric value (0–254) controlling the ordering of combining marks during canonical decomposition, determining which combining marks can be reordered.
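
A short sketch of how the combining class drives reordering, using Python's `unicodedata` module. The base letter and the two marks are arbitrary examples.

```python
import unicodedata

ACUTE, DOT_BELOW = "\u0301", "\u0323"
ccc_acute = unicodedata.combining(ACUTE)          # 230 (above)
ccc_dot_below = unicodedata.combining(DOT_BELOW)  # 220 (below)

# Marks with different non-zero classes are sorted into ascending order,
# so both typing orders normalize to the same sequence.
a = unicodedata.normalize("NFD", "q" + ACUTE + DOT_BELOW)
b = unicodedata.normalize("NFD", "q" + DOT_BELOW + ACUTE)
```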

Compatibility Equivalence

Two character sequences with the same abstract content that may differ in appearance. Broader than canonical equivalence. Example: ﬁ ≈ …

Decomposition

The mapping of a character to its component parts. Canonical decomposition preserves meaning (é → e + ́); compatibility decomposition …

Default Ignorable

Characters that should have no visible effect and can be ignored by processes that do not support them, including variation …

East Asian Width

Unicode property (UAX#11) classifying characters as Narrow, Wide, Fullwidth, Halfwidth, Ambiguous, or Neutral. Wide characters (CJK ideographs, katakana) occupy two …
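
One common use of this property is estimating terminal column width. The function below is a rough sketch only: it ignores combining marks, Ambiguous-width handling, and emoji sequences, and the name `display_width` is hypothetical.

```python
import unicodedata

def display_width(text: str) -> int:
    """Rough column count: Wide and Fullwidth take 2 cells, everything else 1."""
    return sum(
        2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
        for ch in text
    )

width = display_width("\u6f22\u5b57abc")   # 漢字abc -> 2 + 2 + 3 = 7
```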

Extended Grapheme Cluster

The user-perceived 'character' — what feels like a single unit. May consist of multiple code points (base + combining marks, …

General Category

Classification of every code point into one of 30 categories (Lu, Ll, Nd, So, etc.) grouped into 7 major classes: …
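
Categories can be looked up with Python's `unicodedata.category`. The sample string and the helper name `category_counts` are illustrative.

```python
import unicodedata
from collections import Counter

def category_counts(text: str) -> Counter:
    """Count characters by their two-letter General Category."""
    return Counter(unicodedata.category(ch) for ch in text)

counts = category_counts("Ab3 \u00a9")   # A, b, 3, space, ©
# One each of Lu, Ll, Nd, Zs, So
```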

Joining Type

Unicode property controlling how Arabic and Syriac characters connect to adjacent characters. Types include Right_Joining, Left_Joining, Dual_Joining, and Non_Joining.

Mirrored Property

Characters whose glyph should be horizontally mirrored in RTL context. Examples: ( → ), [ → ], { → }, …

Name Alias

Alternative names for characters, since Unicode names cannot change per the stability policy. Used for corrections, abbreviations, and figments.

Numeric Value

The numeric interpretation of a character, if any: digit value (0–9), decimal value, or general numeric value (e.g., ½ = …
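
The three kinds of numeric value map onto three functions in Python's `unicodedata` module. The characters below are arbitrary examples.

```python
import unicodedata

half = unicodedata.numeric("\u00bd")                    # ½ -> 0.5
roman_four = unicodedata.numeric("\u2163")              # Ⅳ -> 4.0
arabic_three = unicodedata.decimal("\u0663")            # ٣ -> 3
sup_two_digit = unicodedata.digit("\u00b2")             # ² -> 2
# Superscript two is a digit but not a decimal digit, so the default is returned.
sup_two_decimal = unicodedata.decimal("\u00b2", None)   # None
```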

Punctuation

Characters used to organize and clarify written language: periods, commas, dashes, quotation marks, and more. Unicode General Category P covers …

Script

The writing system a character belongs to (e.g., Latin, Cyrillic, Han). Unicode 16.0 defines 168 scripts; the Script property is …

Script Extensions

Unicode property listing all scripts that use a character, broader than the single-valued Script property. Common characters like digits have …

Algorithms (15)

Case Folding

Mapping characters to a common case form for case-insensitive comparison. More comprehensive than lowercasing: German ß → ss, Turkish İ …
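
In Python, `str.casefold()` applies the default (locale-independent) full case folding. The helper name below is hypothetical, and the sketch does not handle Turkish-specific rules or normalization.

```python
def same_ignoring_case(a: str, b: str) -> bool:
    """Caseless comparison using default full case folding."""
    return a.casefold() == b.casefold()

folded = same_ignoring_case("Stra\u00dfe", "STRASSE")    # True: ß folds to ss
lowered = "Stra\u00dfe".lower() == "STRASSE".lower()     # False: lower() keeps ß
```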

Composition Exclusion

Characters excluded from canonical composition (NFC) to prevent non-starter decomposition and ensure algorithmic stability. Listed in CompositionExclusions.txt.

Grapheme Cluster Boundary

Rules (UAX#29) for determining where one user-perceived character ends and another begins. Critical for cursor movement, text selection, and correctly …

NFC (Canonical Composition)

Normalization Form C: decompose then recompose canonically, producing the shortest form. Recommended for data storage and interchange; the web standard …

NFD (Canonical Decomposition)

Normalization Form D: fully decompose without recomposing. Used by the macOS HFS+ filesystem. é (U+00E9) → e + ◌́ (U+0065 …

NFKC (Compatibility Composition)

Normalization Form KC: compatibility decomposition then canonical composition. Merges visually similar characters (ﬁ→fi, ²→2, Ⅳ→IV). Used for identifier comparison.

NFKD (Compatibility Decomposition)

Normalization Form KD: compatibility decomposition without recomposing. The most aggressive normalization, losing the most formatting information.

Sentence Boundary

The position between sentences per Unicode rules. More complex than splitting on periods — handles abbreviations (Mr.), ellipsis (...), and …

String Comparison

Comparing Unicode strings requires normalization (NFC/NFD) and optionally collation (locale-aware sorting). Binary comparison of code points alone gives incorrect results …

Unicode Bidirectional Algorithm (UBA)

Algorithm determining display order of characters in mixed-direction text (e.g., English + Arabic), using character bidi categories and explicit directional …

Unicode Collation Algorithm (UCA)

Standard algorithm for comparing and sorting Unicode strings using multi-level comparison: base character → accents → case → tie-breakers. Locale-customizable.

Unicode Line Breaking Algorithm

Rules for determining where text can wrap to the next line, considering character properties, CJK word boundaries, and break opportunities.

Unicode Normalization

Process of converting Unicode text to a standard canonical form. Four forms: NFC (composed), NFD (decomposed), NFKC (compatibility composed), NFKD …
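
The four forms can be tried with Python's `unicodedata.normalize`. This is a small sketch with sample strings picked for illustration.

```python
import unicodedata

composed = "\u00e9"       # é as one code point
decomposed = "e\u0301"    # e + combining acute accent

nfc = unicodedata.normalize("NFC", decomposed)     # "\u00e9"
nfd = unicodedata.normalize("NFD", composed)       # "e\u0301"

# Compatibility forms also fold ligatures, superscripts, and Roman numerals.
nfkc = unicodedata.normalize("NFKC", "\ufb01 \u00b2 \u2163")   # "fi 2 IV"
nfkd = unicodedata.normalize("NFKD", "\u00e9\u00b2")           # "e\u03012"
```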

Unicode Text Segmentation

Algorithms for finding boundaries in text: grapheme cluster, word, and sentence boundaries. Critical for cursor movement, text selection, and text …

Word Boundary

The position between words as determined by Unicode word break rules. Not a simple split on spaces — handles CJK …

Typography (19)

Combining Character

A character that attaches to the preceding base character to modify it. General Category: Mn (nonspacing), Mc (spacing combining), Me …

CSS unicode-range

CSS @font-face descriptor specifying which Unicode code points a font should cover. Enables downloading only the font subset needed for …

Dash

Punctuation marks used to separate parts of a sentence or indicate ranges. Unicode defines multiple dashes: hyphen (‐), en dash …

Diacritical Mark / Diacritic

A mark added to a letter to change pronunciation or meaning. Can be precomposed (é U+00E9) or combining (e + …

Ellipsis

U+2026 HORIZONTAL ELLIPSIS (…). A single character replacing three periods, typographically correct and counting as 1 character instead of 3.

Em / En (Typographic Units)

Em: a width equal to the font size. En: half an em. Used to define em dash width, em space, …

Font

A specific implementation of a typeface at a particular size, weight, and style. In digital typography, a font file (TTF, …

Font Fallback

The mechanism by which a rendering engine substitutes glyphs from a secondary font when the primary font lacks coverage for …

Glyph

The visual representation of a character as rendered by a font. One character may have multiple glyphs (ligatures, contextual forms); …

Kerning

Adjusting the spacing between specific character pairs for visual harmony (e.g., AV, To, LT). A font feature, not a Unicode …

Ligature

Two or more characters joined into a single glyph. Can be typographic (fi → ﬁ via OpenType) or a Unicode …

Non-Breaking Space

U+00A0. A space that prevents line breaking at its position. HTML: &nbsp;. Used between numbers and units (100 km), in …

OpenType

Modern font format developed by Microsoft and Adobe supporting up to 65,535 glyphs, advanced typographic features (ligatures, alternates, kerning), and …

Quotation Mark

Paired punctuation marks enclosing direct speech or quotations. Unicode includes straight (""), curly (“”), guillemets (« »), CJK corner brackets …

RTL (Right-to-Left)

Text directionality where characters flow from right to left. Used by Arabic, Hebrew, Thaana, and other scripts; requires the Bidirectional …

Small Caps

Uppercase letterforms at the height of lowercase letters. CSS: font-variant: small-caps. Unicode also has actual small capital letters in Latin …

Web Fonts

Fonts downloaded by the browser to render text, declared via CSS @font-face. WOFF2 is the standard format. Unicode subsetting and …

Whitespace Character

Characters that represent horizontal or vertical space but have no visible glyph. Unicode defines 17+ whitespace characters with different widths …

Zero Width Character

Characters with zero advance width — invisible in rendering but affecting text behavior. Includes ZWSP (word break), ZWJ (join), ZWNJ …

Input Methods (9)

Web & HTML (16)

Content-Type Charset

HTTP header parameter declaring the character encoding of a response (Content-Type: text/html; charset=utf-8). Takes precedence over an in-document <meta> declaration, although in HTML a byte order mark, if present, takes precedence over both.

CSS Content Property

CSS property inserting generated content via ::before and ::after pseudo-elements using Unicode escapes: content: "\2713" inserts ✓.

CSS Text Direction

CSS properties (direction, writing-mode, unicode-bidi) controlling text layout direction. Works with Unicode Bidi Algorithm for mixed LTR/RTL content in web …

Emoji Presentation

Rendering a character with a colorful emoji glyph, typically using Variation Selector 16 (U+FE0F). Some characters default to emoji presentation, …

HTML Entity

A textual representation of a character in HTML. Three forms: named (&amp;), decimal (&#38;), hexadecimal (&#x26;). Essential for characters that …
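
Python's standard `html` module can convert in both directions, which makes the three forms easy to compare. The sample strings are illustrative.

```python
import html

# Named, decimal, and hexadecimal references all decode to the same character.
decoded = html.unescape("&copy; &#169; &#xA9;")     # "© © ©"

# Escaping replaces the characters that are significant in markup.
escaped = html.escape('<a href="x">&</a>')
# '&lt;a href=&quot;x&quot;&gt;&amp;&lt;/a&gt;'
```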

Internationalized Domain Name (IDN)

Domain names containing non-ASCII Unicode characters, internally stored as Punycode (xn--...) but displayed in Unicode to users. Security concern: homograph …

JavaScript Intl API

ECMAScript Internationalization API providing locale-aware string comparison (Collator), number formatting (NumberFormat), date formatting (DateTimeFormat), and segmentation (Segmenter).

Named Character Reference

HTML entity using a human-readable name: &copy; → ©, &mdash; → —. HTML5 defines 2,231 named references; they are case-sensitive.

Numeric Character Reference

HTML entity using the Unicode code point number: decimal (&#169; → ©) or hexadecimal (&#xA9; → ©). Works for any …

Percent-Encoding (URL Encoding)

Encoding non-ASCII and reserved characters in URLs by replacing each byte with %XX. UTF-8 is used first, then each byte …
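
A minimal sketch with Python's `urllib.parse`, which encodes to UTF-8 by default before percent-escaping each byte. The sample text is arbitrary.

```python
from urllib.parse import quote, unquote

text = "caf\u00e9 au lait"
quoted = quote(text)          # 'caf%C3%A9%20au%20lait' (é is bytes C3 A9)
restored = unquote(quoted)    # back to the original string
```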

Punycode

ASCII-compatible encoding of Unicode domain names, converting internationalized labels to xn-- prefixed ASCII strings. münchen.de → xn--mnchen-3ya.de.
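
Python ships `punycode` and `idna` codecs that reproduce this conversion. Note as a caveat that the built-in `idna` codec implements the older IDNA 2003 rules, so treat this as an illustration rather than a complete IDN implementation.

```python
# Raw Punycode for a single label (no "xn--" prefix).
label = "m\u00fcnchen".encode("punycode")      # b'mnchen-3ya'

# The idna codec handles each dot-separated label and adds the prefix.
domain = "m\u00fcnchen.de".encode("idna")      # b'xn--mnchen-3ya.de'
unicode_form = domain.decode("idna")           # 'münchen.de'
```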

Text Presentation

Rendering a character with a plain monochrome text glyph rather than a colorful emoji, typically using Variation Selector 15 (U+FE0E) …

Unicode in CSS

CSS supports Unicode via escape sequences (\2713 for ✓), the content property for generated text, unicode-range for font subsetting, and …

Variation Selector

Characters (U+FE00–U+FE0F, U+E0100–U+E01EF) that select a specific glyph variant. VS15 (U+FE0E) = text presentation, VS16 (U+FE0F) = emoji presentation.

Word Joiner

U+2060. A zero-width character that prevents line breaking. The modern replacement for U+FEFF (BOM) as a zero-width no-break space.

XML Character Reference

XML's version of numeric character references: &#10003; or &#x2713;. XML has only 5 named entities (&amp; &lt; &gt; &quot; &apos;), …

Programming & Development (13)

Encoding / Decoding

Encoding converts characters to bytes (str.encode('utf-8')); decoding converts bytes to characters (bytes.decode('utf-8')). Getting this right prevents mojibake.

Invisible Character

Any character with no visible glyph: whitespace, zero-width characters, control characters, and formatting characters. Can cause security issues such as …

Java Unicode

Java strings use UTF-16 internally. char is 16-bit (only BMP). For supplementary characters, use codePointAt() and Character.toChars(). Java's \uXXXX escapes …

Mojibake

Garbled text resulting from decoding bytes with the wrong encoding. Japanese term (文字化け). Example: 'café' stored as UTF-8 but read …
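
The effect is easy to reproduce. This sketch assumes Latin-1 as the wrong decoder, which is one common case; other mismatched encodings garble the text differently.

```python
original = "caf\u00e9"                                   # café

# UTF-8 stores é as bytes C3 A9; Latin-1 reads them as two characters.
garbled = original.encode("utf-8").decode("latin-1")     # 'cafÃ©'

# Reversing the wrong step recovers the text, as long as no bytes were lost.
repaired = garbled.encode("latin-1").decode("utf-8")
```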

Null Character

U+0000 (NUL). The first Unicode/ASCII character, used as a string terminator in C/C++. Security risk: null byte injection can truncate …

Python Unicode

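Python 3 uses Unicode strings by default (str is a sequence of code points; per PEP 393, CPython stores each string in a fixed-width 1-, 2-, or 4-byte-per-character form rather than UTF-8). Key features: \N{name} escapes, unicodedata module, …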

Replacement Character

U+FFFD (�). Displayed when a decoder encounters invalid byte sequences — the universal symbol for 'something went wrong with decoding'.
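
In Python, the replacement character appears when decoding with `errors="replace"`; the default strict mode raises instead. The byte string below is a made-up example of Latin-1 data mistaken for UTF-8.

```python
bad = b"caf\xe9"     # Latin-1 bytes; a lone 0xE9 is not valid UTF-8

lenient = bad.decode("utf-8", errors="replace")    # 'caf\ufffd'

try:
    bad.decode("utf-8")          # strict is the default error handler
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```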

Rust Unicode

Rust strings (str/String) are guaranteed valid UTF-8. char type represents a Unicode scalar value (4 bytes). Iteration via .chars() yields …

String

A sequence of characters in a programming language. Internal representation varies: UTF-8 (Go, Rust), UTF-16 (Java, JavaScript, …

String Length Ambiguity

The 'length' of a Unicode string depends on the unit: code units (JavaScript .length), code points (Python len()), or grapheme …
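
The different answers can be computed side by side for one emoji sequence. Counting grapheme clusters needs a UAX #29 implementation, which the Python standard library does not provide, so that count is only noted in a comment.

```python
s = "\U0001F469\u200D\U0001F4BB"   # woman + ZWJ + laptop: one visible emoji

code_points = len(s)                              # 3 (Python counts code points)
utf16_units = len(s.encode("utf-16-le")) // 2     # 5 (what JavaScript .length reports)
utf8_bytes = len(s.encode("utf-8"))               # 11
# Grapheme clusters: 1 (requires a text segmentation library)
```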

Surrogate Pair

Two 16-bit code units (a high surrogate U+D800–U+DBFF + low surrogate U+DC00–U+DFFF) that together encode a supplementary character in UTF-16. …
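
The pair is derived from the code point with a subtraction and a 10-bit split. A minimal sketch follows; the function names are hypothetical.

```python
def to_surrogates(cp: int) -> tuple:
    """Split a supplementary code point into (high, low) surrogates."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only supplementary code points need a surrogate pair")
    v = cp - 0x10000                       # 20 significant bits remain
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)

def from_surrogates(high: int, low: int) -> int:
    """Recombine a surrogate pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

pair = to_surrogates(0x1F600)              # (0xD83D, 0xDE00)
```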

Unicode Escape Sequence

Syntax for representing Unicode characters in source code. Varies by language: \u2713 (Python/Java/JS), \u{2713} (JS/Ruby/Rust), \U00012345 (Python/C).
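
For Python specifically, these spellings all produce the same string. This is a small illustrative check.

```python
check_a = "\u2713"            # 4 hex digits, BMP only
check_b = "\N{CHECK MARK}"    # by character name
check_c = chr(0x2713)         # from the code point at runtime

grin = "\U0001F600"           # 8 hex digits, needed beyond the BMP
```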

Unicode Regular Expression

Regex patterns using Unicode properties: \p{L} (any letter), \p{Script=Greek} (Greek script), \p{Emoji}. Support varies by language and regex engine.

Security (10)

Bidi Override Attack

Using Unicode bidirectional control characters (U+202A–U+202E, U+2066–U+2069) to disguise malicious file names or code. For example, the name 'readme' + U+202E (RIGHT-TO-LEFT OVERRIDE) + 'fdp.exe' displays as 'readmeexe.pdf'.

Bidi Text Attack

Exploiting Unicode bidirectional control characters to disguise malicious code or filenames. The 'Trojan Source' attack (CVE-2021-42574) uses bidi overrides to …
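
A simple defensive scan flags these control characters before text is displayed or compiled. The sketch covers only the embedding, override, and isolate ranges U+202A–U+202E and U+2066–U+2069; the function name `find_bidi_controls` is hypothetical.

```python
import unicodedata

BIDI_CONTROLS = {
    chr(cp) for cp in (*range(0x202A, 0x202F), *range(0x2066, 0x206A))
}

def find_bidi_controls(text: str) -> list:
    """Return (index, code point, name) for each bidi control found."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch))
        for i, ch in enumerate(text)
        if ch in BIDI_CONTROLS
    ]

hits = find_bidi_controls("readme\u202efdp.exe")
# [(6, 'U+202E', 'RIGHT-TO-LEFT OVERRIDE')]
```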

Confusable

Unicode's official term for character pairs that can be visually confused, defined in confusables.txt (UCD). Broader than homoglyphs — includes …

Homoglyph

Characters from different scripts that look identical or very similar, such as Latin 'a' vs Cyrillic 'а'. Used in phishing, …

IDN Homograph Attack

Using visually similar Unicode characters in domain names to impersonate legitimate sites. аpple.com (Cyrillic а) looks like apple.com. Browsers defend …

Mixed-Script Detection

Identifying text that mixes characters from different scripts (e.g., Latin + Cyrillic). A primary defense against homoglyph attacks; browsers use …
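
Python's standard library does not expose the Script property, so the sketch below approximates it from the first word of each character name. This is a loud assumption and a rough heuristic only; a real implementation would read Scripts.txt or use a library that exposes the property. The function name `rough_scripts` is hypothetical.

```python
import unicodedata

def rough_scripts(text: str) -> set:
    """Approximate the scripts in text from character names (heuristic only)."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            # e.g. "CYRILLIC SMALL LETTER A" -> "CYRILLIC"
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts

spoof = rough_scripts("\u0430pple")   # first letter is Cyrillic а
plain = rough_scripts("apple")
```

A label that yields more than one script name is a candidate for closer inspection, not proof of an attack.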

Normalization Attack

Exploiting Unicode normalization to bypass security filters. Input validated before normalization may change form after: 'ﬁ' (U+FB01) normalizes to 'fi', …
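
The ordering problem can be shown with a deliberately naive filter. Everything here is an illustrative assumption (the filter, the payload, and the function names), not a recommended sanitizer.

```python
import unicodedata

def naive_filter_ok(s: str) -> bool:
    """Toy check that rejects an ASCII '<script' substring."""
    return "<script" not in s.lower()

payload = "\uff1cscript\uff1e"        # fullwidth ＜ and ＞ around 'script'

passes_raw = naive_filter_ok(payload)                       # True: filter is bypassed
normalized = unicodedata.normalize("NFKC", payload)         # '<script>'

def filter_after_normalizing(s: str) -> bool:
    """Normalize first, then validate the form that will actually be used."""
    return naive_filter_ok(unicodedata.normalize("NFKC", s))

passes_normalized = filter_after_normalizing(payload)       # False
```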

Unicode Spoofing

Using Unicode features to deceive users: homoglyphs for fake domains, bidi overrides for fake file extensions, or invisible characters for …

Zero Width Joiner (ZWJ)

U+200D. Requests that adjacent characters be joined. Critical for emoji sequences (👩+ZWJ+💻=👩‍💻). In Indic scripts, requests ligature formation. Can also …

Zero Width Non-Joiner (ZWNJ)

U+200C. Prevents joining of adjacent characters. Essential in Persian/Arabic for correct letter forms and used in Devanagari to prevent ligatures.

Emoji (6)

Miscellaneous (1)