Combining Characters and Diacritics Guide

Diacritics are accents and other marks attached to letters to indicate pronunciation or meaning. Unicode can represent them as combining characters that follow their base character. This guide explains how combining diacritics work, lists the most common ones, and describes the difference between precomposed and decomposed forms.

Combining characters and diacritics are one of Unicode's most elegant and complex features. Rather than encoding every possible accented letter as a separate code point, Unicode allows base characters to be modified by zero-width combining marks that attach visually to the preceding character. Understanding this system is essential for anyone working with multilingual text processing.

What Are Combining Characters?

A combining character is a Unicode code point that is not displayed on its own but instead modifies the character that immediately precedes it. The Combining Diacritical Marks block spans U+0300–U+036F and contains the most common combining marks used in Latin, Greek, and related scripts.

For example, the letter é can be represented in two distinct ways:

  • Precomposed: U+00E9 (LATIN SMALL LETTER E WITH ACUTE), a single code point
  • Decomposed: U+0065 (e) + U+0301 (COMBINING ACUTE ACCENT), two code points

Both render identically in a properly implemented text renderer, but they are different byte sequences and will not match in a naive string comparison.
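
The mismatch is easy to reproduce. The following is a minimal sketch using only built-in Python string operations; the variable names are illustrative.

```python
precomposed = "\u00e9"   # é as a single code point
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

# The two strings render the same but differ in length and content.
print(len(precomposed), len(decomposed))   # 1 2
print(precomposed == decomposed)           # False

# Their UTF-8 encodings differ as well.
print(precomposed.encode("utf-8"))         # b'\xc3\xa9'
print(decomposed.encode("utf-8"))          # b'e\xcc\x81'
```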

Common Combining Marks

Symbol  Code Point  Name                             Example
◌̀       U+0300      COMBINING GRAVE ACCENT           à
◌́       U+0301      COMBINING ACUTE ACCENT           á
◌̂       U+0302      COMBINING CIRCUMFLEX ACCENT      â
◌̃       U+0303      COMBINING TILDE                  ã
◌̈       U+0308      COMBINING DIAERESIS (Umlaut)     ä
◌̧       U+0327      COMBINING CEDILLA                ç
◌̄       U+0304      COMBINING MACRON                 ā
◌̊       U+030A      COMBINING RING ABOVE             å
◌̋       U+030B      COMBINING DOUBLE ACUTE ACCENT    ő
◌̌       U+030C      COMBINING CARON (Háček)          š
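
The rows above can be reproduced programmatically. The sketch below assumes Python's standard unicodedata module; it pairs each mark with the base letter from the Example column, composes the pair with NFC, and looks up the mark's official name. The ROWS list and the describe helper are illustrative names, not part of any library.

```python
import unicodedata

# (base letter, combining mark) pairs transcribed from the table above.
ROWS = [
    ("a", "\u0300"), ("a", "\u0301"), ("a", "\u0302"), ("a", "\u0303"),
    ("a", "\u0308"), ("c", "\u0327"), ("a", "\u0304"), ("a", "\u030a"),
    ("o", "\u030b"), ("s", "\u030c"),
]

def describe(base: str, mark: str) -> tuple[str, str, str]:
    """Return (code point label, mark name, NFC-composed example)."""
    label = f"U+{ord(mark):04X}"
    composed = unicodedata.normalize("NFC", base + mark)
    return label, unicodedata.name(mark), composed

for base, mark in ROWS:
    print(*describe(base, mark))
```

Each of these pairs has a precomposed equivalent, so NFC collapses it to a single code point. Pairs without one simply stay as base plus mark.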

Unicode Normalization: NFC vs NFD

The existence of both composed and decomposed representations creates a real-world problem: two strings that look identical may not compare as equal. Unicode Normalization Forms were created to address this:

Form  Process                                                  Result
NFC   Canonical decomposition, then canonical composition      Precomposed where possible
NFD   Canonical decomposition                                  Fully decomposed into base + combining marks
NFKC  Compatibility decomposition, then canonical composition  Precomposed, with compatibility variants also normalized
NFKD  Compatibility decomposition                              Fully decomposed, with compatibility variants also normalized

NFC is the generally recommended form for storage and interchange: it produces shorter strings for most Latin-script text and gives canonically equivalent strings a single representation. NFD is useful when you need to process or strip diacritics: decompose the string, then filter out all code points whose general category is Mn (Nonspacing Mark).

In Python, normalization is straightforward:

import unicodedata

text = "caf\u00e9"  # precomposed é (U+00E9)
nfd = unicodedata.normalize("NFD", text)
# nfd now contains: c + a + f + e + U+0301 (combining acute accent)
stripped = "".join(c for c in nfd if unicodedata.category(c) != "Mn")
# stripped == "cafe"
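
To see how the four forms differ, it can help to print the code points each one produces. The sketch below is illustrative only: the samples dictionary and the code_points helper are made-up names, and the ligature ﬁ (U+FB01) is included as one example of a compatibility variant.

```python
import unicodedata

samples = {
    "precomposed é": "\u00e9",
    "decomposed é": "e\u0301",
    "ligature ﬁ": "\ufb01",
}

def code_points(s: str) -> str:
    """Format a string as space-separated U+XXXX labels."""
    return " ".join(f"U+{ord(c):04X}" for c in s)

for label, text in samples.items():
    for form in ("NFC", "NFD", "NFKC", "NFKD"):
        result = unicodedata.normalize(form, text)
        print(f"{label:14} {form:5} {code_points(result)}")
```

The canonical forms (NFC, NFD) leave the ligature untouched, while the compatibility forms (NFKC, NFKD) replace it with the plain letters f and i.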

Precomposed vs Decomposed in Practice

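In practice, most text produced on modern systems is already in NFC, although operating systems do not generally enforce it. macOS is a notable exception for filenames: HFS+ stored them in a variant of NFD, while APFS keeps the form it is given and, in current versions, treats canonically equivalent names as the same file. The HFS+ behavior caused notorious problems when transferring files between macOS and Linux systems, where filenames are treated as raw bytes: a filename containing é could end up with different byte sequences on the two systems, causing apparent duplicates or missing files.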

The practical rule: always normalize to a consistent form before comparing, storing, or hashing text that may contain diacritics.
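
One way to apply this rule is to funnel every comparison and hash through a single normalizing helper. The sketch below is a minimal illustration; canonical and stable_digest are hypothetical helper names, and SHA-256 is chosen only as an example hash.

```python
import hashlib
import unicodedata

def canonical(text: str) -> str:
    """Normalize to NFC so canonically equivalent strings become identical."""
    return unicodedata.normalize("NFC", text)

def stable_digest(text: str) -> str:
    """Hash the UTF-8 bytes of the NFC form (SHA-256 is just an example)."""
    return hashlib.sha256(canonical(text).encode("utf-8")).hexdigest()

a = "caf\u00e9"    # precomposed
b = "cafe\u0301"   # decomposed

print(a == b)                                 # False
print(canonical(a) == canonical(b))           # True
print(stable_digest(a) == stable_digest(b))   # True
```

Normalizing at the boundary, when text enters the system, usually means the rest of the code can compare strings directly.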

Stacking Combining Marks

Unicode allows multiple combining marks to be stacked on a single base character. The rendering order matters: combining marks are applied from innermost (closest to the base) to outermost. The Unicode standard defines canonical ordering for combining marks using a "Canonical Combining Class" (CCC) value:

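  • CCC 0: Base characters, spacing marks, and other characters that are never reordered; a CCC 0 character blocks reordering across it
  • CCC 1–199: Overlays, nuktas, viramas, and script-specific fixed-position classes
  • CCC 202: Marks attached below (cedilla, ogonek)
  • CCC 220: Most below-base marks (dot below, ring below)
  • CCC 230: Most above-base marks (acute, grave, circumflex)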

When two adjacent combining marks have different non-zero CCC values, normalization sorts them into ascending CCC order, so either input order yields the same result. When they have the same CCC value, their order is significant and is preserved.
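
The reordering behavior can be checked with Python's unicodedata.combining, which returns a character's CCC value. This is a small sketch; the constant names are illustrative.

```python
import unicodedata

ACUTE = "\u0301"       # above-base mark
DOT_BELOW = "\u0323"   # below-base mark
GRAVE = "\u0300"       # above-base mark, same class as acute

for mark in (ACUTE, DOT_BELOW, GRAVE):
    print(unicodedata.name(mark), unicodedata.combining(mark))

# Different classes: NFD sorts the marks by ascending CCC, so both
# input orders normalize to the same sequence (dot below first).
s1 = unicodedata.normalize("NFD", "a" + ACUTE + DOT_BELOW)
s2 = unicodedata.normalize("NFD", "a" + DOT_BELOW + ACUTE)
print(s1 == s2)   # True

# Same class: the input order is kept, so these remain distinct.
t1 = unicodedata.normalize("NFD", "a" + ACUTE + GRAVE)
t2 = unicodedata.normalize("NFD", "a" + GRAVE + ACUTE)
print(t1 == t2)   # False
```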

The Zalgo Text Phenomenon

Zalgo text exploits Unicode's combining character stacking to create intentionally unreadable, glitchy-looking text by stacking dozens or hundreds of combining marks on a single base character. The result looks like corrupted or "cursed" text:

H̴̨̡̘̮̙̼͈͒̒͌̈́̔͘͝ę̶̛̥̫̩̊̾̃͘l̷̡̳̙͕̗̓̈́͛l̴̢̨̖̗̦̋̏͜o̵̧̟̺̹͈̎̒̚͝

Each visible letter carries far more combining marks than any font was designed to position. While browsers and text renderers have become more resilient to Zalgo text over time, it remains a useful reminder that text processing code must account for grapheme clusters (a base character plus all its combining marks) rather than individual code points when measuring string length or splitting text.
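
Python's standard library has no full grapheme cluster segmenter, so the sketch below uses a rough approximation: every character whose general category starts with "M" is attached to the character before it. Real segmentation follows more rules (emoji sequences, Hangul jamo, and others), and approx_clusters and limit_marks are hypothetical helper names.

```python
import unicodedata

def approx_clusters(text: str) -> list[str]:
    """Group each non-mark character with the marks that follow it.

    Approximation only: full grapheme segmentation handles emoji
    sequences, Hangul jamo, and other cases that are ignored here.
    """
    clusters: list[str] = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        if is_mark and clusters:
            clusters[-1] += ch    # attach the mark to the current cluster
        else:
            clusters.append(ch)   # start a new cluster
    return clusters

def limit_marks(text: str, max_marks: int = 2) -> str:
    """Keep at most max_marks marks per cluster (an illustrative policy)."""
    return "".join(c[: 1 + max_marks] for c in approx_clusters(text))

noisy = "H\u0334\u0328\u0321\u0352e\u0301\u0336\u031bllo"
print(len(noisy))                    # 12 code points
print(len(approx_clusters(noisy)))   # 5 clusters
print(limit_marks(noisy, 1))
```

Capping the number of marks per cluster is one possible way to tame Zalgo input; whether that is acceptable depends on the scripts an application needs to support, since some legitimately stack several marks.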

Diacritics in Language Processing

For natural language processing tasks like text search and spell checking, handling diacritics correctly is critical:

  • Accent-insensitive search: Normalize both query and corpus to NFD, then strip marks
  • Spell checking: Normalize input to the same form as the dictionary entries (typically NFC)
  • Sort order: Many languages sort accented variants near their base letter; others treat them as distinct letters (e.g., Swedish sorts Å after Z)
  • Transliteration: Converting diacritics to ASCII approximations requires language-specific rules, not just stripping marks
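
The accent-insensitive search item can be sketched as follows, assuming Python's unicodedata module. The fold and matches helpers are hypothetical names, and case folding is added here as an extra assumption because search is usually case-insensitive as well.

```python
import unicodedata

def fold(text: str) -> str:
    """Decompose to NFD, drop nonspacing marks (Mn), and casefold."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    return stripped.casefold()

def matches(query: str, document: str) -> bool:
    """Accent- and case-insensitive substring test."""
    return fold(query) in fold(document)

print(matches("resume", "Her RÉSUMÉ is attached"))   # True
print(matches("cafe", "cafe\u0301 au lait"))         # True

# Stripping marks is not transliteration: ø has no canonical decomposition,
# and German conventionally writes ü as "ue" rather than "u".
print(fold("Søren"))    # søren
print(fold("Müller"))   # muller
```

The last two lines illustrate the transliteration caveat: mark stripping leaves ø untouched and produces "muller" where a German-aware scheme would give "mueller".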

The Combining Diacritical Marks Supplement (U+1DC0–U+1DFF) and Combining Diacritical Marks Extended (U+1AB0–U+1AFF) blocks contain additional marks used in phonetic transcription and historical linguistics.
