Combining Characters and Diacritics Guide
Diacritics are accent marks and other marks that attach to letters to indicate pronunciation or meaning, represented in Unicode as combining characters that follow their base character. This guide explains how combining diacritics work, lists the most common ones, and explains the difference between precomposed and decomposed forms.
Combining characters and diacritics are one of Unicode's most elegant and complex features. Rather than encoding every possible accented letter as a separate code point, Unicode allows base characters to be modified by zero-width combining marks that attach visually to the preceding character. Understanding this system is essential for anyone working with multilingual text processing.
What Are Combining Characters?
A combining character is a Unicode code point that is not displayed on its own but instead modifies the character that immediately precedes it. The Combining Diacritical Marks block spans U+0300–U+036F and contains the most common combining marks used in Latin, Greek, and related scripts.
For example, the letter é can be represented in two distinct ways:
- Precomposed: U+00E9 (LATIN SMALL LETTER E WITH ACUTE) — a single code point
- Decomposed: U+0065 (e) + U+0301 (COMBINING ACUTE ACCENT) — two code points
Both render identically in a properly implemented text renderer, but they are different byte sequences and will not match in a naive string comparison.
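The mismatch is easy to reproduce. The following is a minimal sketch using only Python's built-in string handling; the variable names are illustrative and not part of any API.

```python
# Two spellings of the same visible character "é".
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE (one code point)
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT (two code points)

print(len(precomposed), len(decomposed))   # 1 2
print(precomposed == decomposed)           # False: a naive comparison fails
print(precomposed.encode("utf-8"))         # b'\xc3\xa9'
print(decomposed.encode("utf-8"))          # b'e\xcc\x81'
```

Note that `len()` counts code points here, not the characters a reader would perceive.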
Common Combining Marks
| Symbol | Code Point | Name | Example |
|---|---|---|---|
| ◌̀ | U+0300 | COMBINING GRAVE ACCENT | à |
| ◌́ | U+0301 | COMBINING ACUTE ACCENT | á |
| ◌̂ | U+0302 | COMBINING CIRCUMFLEX ACCENT | â |
| ◌̃ | U+0303 | COMBINING TILDE | ã |
| ◌̈ | U+0308 | COMBINING DIAERESIS (Umlaut) | ä |
| ◌̧ | U+0327 | COMBINING CEDILLA | ç |
| ◌̄ | U+0304 | COMBINING MACRON | ā |
| ◌̊ | U+030A | COMBINING RING ABOVE | å |
| ◌̋ | U+030B | COMBINING DOUBLE ACUTE ACCENT | ő |
| ◌̌ | U+030C | COMBINING CARON (Háček) | š |
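As a rough illustration of how the marks in the table attach to a base letter, the sketch below appends a mark to a base character and asks Python's `unicodedata` module for the NFC form. The helper name `attach` is hypothetical and chosen only for this example.

```python
import unicodedata

def attach(base: str, mark: str) -> str:
    """Hypothetical helper: append a combining mark, then compose with NFC."""
    return unicodedata.normalize("NFC", base + mark)

# (base letter, combining mark) pairs taken from the table above
pairs = [("a", "\u0300"), ("a", "\u0301"), ("c", "\u0327"), ("s", "\u030c")]

for base, mark in pairs:
    composed = attach(base, mark)
    print(f"{base} + U+{ord(mark):04X} {unicodedata.name(mark)} -> {composed}")
```

When no precomposed code point exists for a given pair, NFC simply leaves the base and the mark as separate code points.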
Unicode Normalization: NFC vs NFD
The existence of both composed and decomposed representations creates a real-world problem: two strings that look identical may not compare as equal. Unicode Normalization Forms were created to address this:
| Form | Name | Description |
|---|---|---|
| NFC | Canonical Decomposition, then Canonical Composition | Precomposed where possible |
| NFD | Canonical Decomposition | Fully decomposed into base + combining marks |
| NFKC | Compatibility Decomposition, then Canonical Composition | Also normalizes compatibility variants |
| NFKD | Compatibility Decomposition | Fully decomposed including compatibility |
NFC is the recommended form for storage and interchange: it produces shorter strings for most Latin-script text while still being semantically unambiguous. NFD is useful when you need to process or strip diacritics: decompose the string, then filter out every code point whose general category is Mn (Nonspacing Mark). That category covers the Combining Diacritical Marks block as well as nonspacing marks from other scripts.
In Python, normalization is straightforward:
import unicodedata
text = "caf\u00e9"  # precomposed é
nfd = unicodedata.normalize("NFD", text)
# nfd now contains: c + a + f + e + combining acute
stripped = "".join(c for c in nfd if unicodedata.category(c) != "Mn")
# stripped == "cafe"
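To see how the four forms in the table differ, it can help to run one string through all of them. The sketch below uses a sample string chosen for this example (it is not from any particular data set): it starts with the compatibility ligature U+FB01 and ends with a precomposed é.

```python
import unicodedata

# "ﬁancé": U+FB01 LATIN SMALL LIGATURE FI + "anc" + precomposed U+00E9
sample = "\ufb01anc\u00e9"

lengths = {}
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    normalized = unicodedata.normalize(form, sample)
    lengths[form] = len(normalized)
    code_points = " ".join(f"U+{ord(ch):04X}" for ch in normalized)
    print(f"{form:4} {len(normalized)} code points: {code_points}")

# NFC keeps the ligature and é (5), NFD splits only é (6),
# NFKC splits only the ligature (6), NFKD splits both (7).
```

The compatibility forms change what the text contains (the ligature becomes plain "fi"), so they suit search and matching better than storage of original text.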
Precomposed vs Decomposed in Practice
Most text produced on modern systems is already in NFC, but filesystems differ in how they treat filenames. macOS is the notable case: HFS+ stored filenames in a variant of NFD, while APFS keeps the form it is given and treats canonically equivalent names as the same file. Linux filesystems generally treat filenames as opaque bytes and do not normalize at all. This caused notorious problems when transferring files between macOS and Linux systems: a filename containing é could be stored with different byte sequences on the two systems, causing apparent duplicates or missing files.
The practical rule: always normalize to a consistent form before comparing, storing, or hashing text that may contain diacritics.
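One way to apply this rule is to funnel every string through a single normalizing function before it is compared or hashed. The sketch below assumes NFC as the chosen form; `canonical_key` and `stable_digest` are hypothetical names introduced for this example.

```python
import hashlib
import unicodedata

def canonical_key(text: str) -> str:
    """Hypothetical helper: bring text into one agreed form (NFC assumed here)."""
    return unicodedata.normalize("NFC", text)

def stable_digest(text: str) -> str:
    """Hash the normalized text so equivalent spellings share one digest."""
    return hashlib.sha256(canonical_key(text).encode("utf-8")).hexdigest()

nfc_name = "caf\u00e9.txt"    # precomposed é
nfd_name = "cafe\u0301.txt"   # e + combining acute

print(nfc_name == nfd_name)                                # False
print(canonical_key(nfc_name) == canonical_key(nfd_name))  # True
print(stable_digest(nfc_name) == stable_digest(nfd_name))  # True
```

Whichever form is chosen, the important part is that every code path uses the same one.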
Stacking Combining Marks
Unicode allows multiple combining marks to be stacked on a single base character. The rendering order matters: combining marks are applied from innermost (closest to the base) to outermost. The Unicode standard defines canonical ordering for combining marks using a "Canonical Combining Class" (CCC) value:
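- CCC 0: Base characters and other starters; marks are never reordered across one
- CCC 1–199: Overlays, nuktas, viramas, and script-specific fixed-position classes
- CCC 202: Marks attached below (cedilla, ogonek)
- CCC 220: Most below-base marks (dot below, ring below)
- CCC 230: Most above-base marks (acute, grave, circumflex)
When two adjacent combining marks have different nonzero CCC values, normalization sorts them into ascending CCC order, so either input order yields the same normalized string. When they have the same CCC value, their order is significant and is preserved.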
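The reordering behavior can be checked directly with `unicodedata.combining`, which returns a character's CCC value. This is a small sketch; the sample strings are made up for illustration.

```python
import unicodedata

# CCC values: 0 for a base letter, 230 for acute, 220 for dot below
for ch in ("a", "\u0301", "\u0323"):
    print(f"U+{ord(ch):04X} ccc={unicodedata.combining(ch)}")

# Different classes: acute (230) and dot below (220) are sorted by normalization,
# so both input orders end up as a + dot below + acute.
above_first = unicodedata.normalize("NFD", "a\u0301\u0323")
below_first = unicodedata.normalize("NFD", "a\u0323\u0301")
print(above_first == below_first)   # True

# Same class: acute and diaeresis are both 230, so their order is preserved.
acute_first = unicodedata.normalize("NFD", "a\u0301\u0308")
diaeresis_first = unicodedata.normalize("NFD", "a\u0308\u0301")
print(acute_first == diaeresis_first)   # False
```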
The Zalgo Text Phenomenon
Zalgo text exploits Unicode's combining character stacking to create intentionally unreadable, glitchy-looking text by stacking dozens or hundreds of combining marks on a single base character. The result looks like corrupted or "cursed" text:
H̴̨̡̘̮̙̼͈͒̒͌̈́̔͘͝ę̶̛̥̫̩̊̾̃͘l̷̡̳̙͕̗̓̈́͛l̴̢̨̖̗̦̋̏͜o̵̧̟̺̹͈̎̒̚͝
Each visible letter carries a long run of combining marks, far more than any font was designed to handle. While browsers and text renderers have become more resilient to Zalgo text over time, it remains a useful reminder that text processing code must work with grapheme clusters (a base character plus all of its combining marks) rather than individual code points when measuring string length, truncating, or splitting text.
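A simplified way to count perceived characters, and to tame Zalgo-style input, is to group each base character with the marks that follow it. The sketch below is an approximation only: it treats any code point whose general category starts with "M" as part of the preceding cluster, and does not implement the full Unicode grapheme segmentation rules (emoji sequences, Hangul jamo, and so on). The function names `split_clusters` and `cap_marks` are hypothetical.

```python
import unicodedata

def split_clusters(text: str) -> list[str]:
    """Approximate grapheme clusters: a base character plus its trailing marks."""
    clusters: list[str] = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch      # mark: attach to the current cluster
        else:
            clusters.append(ch)     # anything else starts a new cluster
    return clusters

def cap_marks(text: str, max_marks: int = 2) -> str:
    """Keep at most max_marks marks after each cluster's first code point."""
    return "".join(cluster[: 1 + max_marks] for cluster in split_clusters(text))

noisy = "H\u0334\u0328\u0321e\u0301llo"   # "H" with three marks, "e" with one
print(len(noisy))                    # 9 code points
print(len(split_clusters(noisy)))    # 5 perceived characters
print(cap_marks(noisy, max_marks=1) == "H\u0334e\u0301llo")   # True
```

For production use, a library that implements the segmentation rules of Unicode Standard Annex #29 is a safer choice than this heuristic.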
Diacritics in Language Processing
For natural language processing tasks like text search and spell checking, handling diacritics correctly is critical:
- Accent-insensitive search: Normalize both query and corpus to NFD, then strip marks
- Spell checking: Use NFC forms matching dictionary entries
- Sort order: Many languages sort accented variants near their base letter; others treat them as distinct letters (e.g., Swedish sorts Å after Z)
- Transliteration: Converting diacritics to ASCII approximations requires language-specific rules, not just stripping marks
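The accent-insensitive search described in the list above could be sketched as follows. This assumes that stripping nonspacing marks and case-folding is acceptable for the language in question; `fold_accents` and `search` are hypothetical names, and the corpus is invented for the example.

```python
import unicodedata

def fold_accents(text: str) -> str:
    """Decompose to NFD, drop nonspacing marks (category Mn), then case-fold."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return stripped.casefold()

def search(query: str, corpus: list[str]) -> list[str]:
    """Return the corpus entries that contain the query, ignoring accents and case."""
    needle = fold_accents(query)
    return [entry for entry in corpus if needle in fold_accents(entry)]

corpus = ["Caf\u00e9 au lait", "cafe\u0301 noir", "Green tea"]
print(search("CAFE", corpus))   # matches both spellings of "café"

# Limitation: letters without a canonical decomposition keep their shape,
# so "Ł" only becomes "ł", not "l".
print(fold_accents("\u0141\u00f3d\u017a"))   # "łodz"
```

The last line shows why transliteration needs language-specific rules: mark stripping alone cannot turn every modified letter into an ASCII approximation.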
The Combining Diacritical Marks Supplement (U+1DC0–U+1DFF) and Combining Diacritical Marks Extended (U+1AB0–U+1AFF) blocks contain additional marks used in phonetic transcription and historical linguistics.