What is Normalization?

The process of converting Unicode text to a standard, canonical form. There are four forms: NFC (composed), NFD (decomposed), NFKC (compatibility composed), and NFKD (compatibility decomposed).

What is NFC (Canonical Composition)?

Normalization Form C: decompose canonically, then recompose, which generally produces the shortest form. Recommended for data storage and interchange; the standard form on the web.

What is Canonical Equivalence?

Two character sequences that are semantically identical and should be treated as equal. Example: é (U+00E9) ≡ e + ◌́ (U+0065 + U+0301).
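
A minimal sketch of how canonical equivalence can be tested in practice: normalize both strings to the same form and compare the results. The helper name `canonically_equivalent` is hypothetical, chosen for illustration; only `unicodedata.normalize` comes from the Python standard library.

```python
import unicodedata

def canonically_equivalent(a: str, b: str) -> bool:
    # Two strings are canonically equivalent when their NFD forms are identical
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

precomposed = "\u00e9"   # é as a single code point
decomposed = "e\u0301"   # e + COMBINING ACUTE ACCENT

print(precomposed == decomposed)                        # False
print(canonically_equivalent(precomposed, decomposed))  # True
```

Comparing NFC forms instead of NFD forms would give the same answer; what matters is that both sides use the same form.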

Algorithms

NFD (Canonical Decomposition)

Normalization Form D: fully decompose without recomposing. Used by the macOS HFS+ file system. é (U+00E9) → e + ◌́ (U+0065 + U+0301).

2022-06-27 · Updated 2024-05-17

NFD: Decomposing Characters to Their Components

NFD (Normalization Form D, Canonical Decomposition) is the form in which every precomposed character is broken down into its constituent parts: a base character followed by one or more combining marks in canonical order. Unlike NFC, which recomposes after decomposing, NFD leaves everything decomposed.

For é (U+00E9, LATIN SMALL LETTER E WITH ACUTE), NFD yields the two-code-point sequence e (U+0065) + ́ (U+0301, COMBINING ACUTE ACCENT). For a string like "über", NFD yields u + combining diaeresis + b + e + r — five code points for a four-character word.
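
The decomposition described above can be inspected directly. The following sketch lists the code points of a string's NFD form; `describe_nfd` is a hypothetical helper name used only for this illustration.

```python
import unicodedata

def describe_nfd(text: str) -> list[str]:
    # Return "U+XXXX NAME" for each code point of the NFD form
    nfd = unicodedata.normalize("NFD", text)
    return [f"U+{ord(c):04X} {unicodedata.name(c)}" for c in nfd]

for line in describe_nfd("\u00e9"):
    print(line)
# U+0065 LATIN SMALL LETTER E
# U+0301 COMBINING ACUTE ACCENT

word = "\u00fcber"  # "über" with a precomposed ü
print(len(word), len(unicodedata.normalize("NFD", word)))  # 4 5
```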

When NFD is Used

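Apple's HFS+ file system stores filenames in a variant of NFD: when macOS writes a filename to an HFS+ volume, the name is decomposed on disk. APFS, by contrast, keeps the name in whatever form it was created with, while still treating canonically equivalent names as the same file. This is why filenames created on a Mac can cause issues when transferred to Linux systems: the file café is stored as 5 code points (NFD) on HFS+, but most Linux applications expect 4 code points (NFC), and Linux file systems compare names byte for byte. Python's os module returns filenames as the file system reports them, which on HFS+ means the NFD form, unless you normalize them yourself.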
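
One way to sidestep the mismatch is to normalize names at the boundary, right after reading them from disk. This is a sketch under the assumption that NFC is the form the rest of the application uses; `list_dir_nfc` and `same_filename` are hypothetical helper names.

```python
import os
import unicodedata

def list_dir_nfc(path: str = ".") -> list[str]:
    # Normalize each directory entry to NFC so names compare equal
    # regardless of how the file system stored them
    return [unicodedata.normalize("NFC", name) for name in os.listdir(path)]

def same_filename(a: str, b: str) -> bool:
    # Compare two names after bringing both to the same normalization form
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

nfd_name = "cafe\u0301"  # 5 code points, as an HFS+ volume would report it
nfc_name = "caf\u00e9"   # 4 code points, as typically typed on Linux

print(nfd_name == nfc_name)               # False
print(same_filename(nfd_name, nfc_name))  # True
```

The normalized name is suitable for comparison and display; when opening the file, passing the name exactly as the file system reported it is the safer choice.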

NFD is also useful when you need to strip diacritics:

import unicodedata

def strip_accents(text: str) -> str:
    # Decompose to base + combining marks, then remove all combining marks
    nfd = unicodedata.normalize("NFD", text)
    return "".join(
        c for c in nfd
        if unicodedata.category(c) != "Mn"  # Mn = Mark, Nonspacing
    )

print(strip_accents("naïve"))   # naive
print(strip_accents("résumé"))  # resume
print(strip_accents("über"))    # uber

Combining Mark Order

A subtle but important aspect of NFD: when multiple combining marks follow a base character, Unicode orders them by Canonical Combining Class (CCC). Base characters have CCC=0, and marks with lower nonzero CCC values come first; marks that share the same CCC keep their original relative order. The acute accent has CCC=230, the cedilla has CCC=202. NFD ensures combining marks are always in this canonical order, which is necessary for correct canonical equivalence testing.

import unicodedata

# Check combining class of a character
print(unicodedata.combining("\u0301"))  # 230 (COMBINING ACUTE ACCENT)
print(unicodedata.combining("\u0327"))  # 202 (COMBINING CEDILLA)
print(unicodedata.combining("a"))       # 0   (not a combining mark)
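
To see the reordering itself, the sketch below builds the same two marks on a base letter in both possible orders and normalizes each. The variable names and the `code_points` helper are illustrative only.

```python
import unicodedata

# The same two marks on "c", typed in different orders
acute_first = "c\u0301\u0327"    # acute (CCC 230), then cedilla (CCC 202)
cedilla_first = "c\u0327\u0301"  # cedilla (CCC 202), then acute (CCC 230)

def code_points(text: str) -> list[str]:
    # Render each code point as U+XXXX for easy inspection
    return [f"U+{ord(ch):04X}" for ch in text]

nfd_a = unicodedata.normalize("NFD", acute_first)
nfd_b = unicodedata.normalize("NFD", cedilla_first)

print(acute_first == cedilla_first)  # False
print(code_points(nfd_a))            # ['U+0063', 'U+0327', 'U+0301']
print(nfd_a == nfd_b)                # True
```

Because the cedilla has the lower combining class, it sorts ahead of the acute accent in both results, so the two inputs compare equal after normalization.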

Quick Facts

Property	Value
Full name	Normalization Form D (Canonical Decomposition)
Algorithm	Recursive canonical decomposition, then CCC sorting
macOS HFS+	Filenames stored in a variant of NFD
Typical use	Diacritic stripping, internal processing, canonical comparison
Python	`unicodedata.normalize("NFD", s)`
Handles compatibility chars?	No
Relation to NFC	NFD then compose = NFC
String length	Equal or longer than NFC (decomposed forms use more code points)
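
Several rows of the table can be checked with a short experiment. This is an illustrative sketch; the sample string is made up and mixes a decomposed é, a precomposed é, and the compatibility ligature U+FB01.

```python
import unicodedata

# Decomposed é, precomposed é, and the "fi" ligature (a compatibility character)
s = "re\u0301sum\u00e9 \ufb01"

nfd = unicodedata.normalize("NFD", s)
nfc = unicodedata.normalize("NFC", s)
nfkd = unicodedata.normalize("NFKD", s)

# NFD is never shorter than NFC
print(len(nfc), len(nfd))  # 8 10

# Composing the NFD form gives the NFC form
print(unicodedata.normalize("NFC", nfd) == nfc)  # True

# NFD leaves compatibility characters alone; NFKD replaces the ligature with "fi"
print("\ufb01" in nfd)   # True
print("\ufb01" in nfkd)  # False
```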

More in Algorithms

Case Folding

Mapping characters to a common case form for case-insensitive comparison. More comprehensive …

Grapheme Cluster Boundary

Rules (UAX#29) for determining where one user-perceived character ends and another begins. …

NFC (Canonical Composition)

Normalization Form C: decompose canonically, then recompose into the shortest form …

NFKC (Compatibility Composition)

Normalization Form KC: compatibility decomposition followed by canonical composition. Visually similar …

NFKD (Compatibility Decomposition)

Normalization Form KD: compatibility decomposition without recomposing. The most aggressive normalization, the most …

String Comparison

Comparing Unicode strings requires normalization (NFC/NFD) and optionally collation (locale-aware sorting). Binary …

Bidirectional Algorithm

The display order of characters in mixed-direction text (e.g., English + Arabic) …

Line Breaking Algorithm

Rules for determining where text can wrap to the next line …

Text Segmentation

Algorithms for finding boundaries in text: grapheme cluster, word, and sentence …

Sentence Boundary

The position between sentences according to Unicode rules. Splitting on periods …
