What is Normalization?

The process of converting Unicode text into a standard canonical form. There are four forms: NFC (composed), NFD (decomposed), NFKC (compatibility composed), and NFKD (compatibility decomposed).
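The difference between the four forms is easiest to see on a concrete string. The following is a minimal sketch using Python's standard `unicodedata` module; the sample string and the helper names `code_points` and `all_forms` are chosen here for illustration only.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def code_points(text: str) -> list:
    """Render each character of `text` as a U+XXXX label."""
    return [f"U+{ord(ch):04X}" for ch in text]

def all_forms(text: str) -> dict:
    """Map each normalization form name to the code points it produces."""
    return {form: code_points(unicodedata.normalize(form, text)) for form in FORMS}

# Illustrative sample: precomposed "é" (U+00E9) followed by the "ﬁ" ligature (U+FB01).
sample = "\u00e9\ufb01"
for form, points in all_forms(sample).items():
    print(f"{form:5} {' '.join(points)}")
```

The canonical forms (NFC, NFD) leave the ligature alone and only compose or decompose the accent; the compatibility forms (NFKC, NFKD) also replace the ligature with plain "f" and "i".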

What is NFKC (Compatibility Composition)?

Normalization Form KC: compatibility decomposition followed by canonical composition. It folds compatibility variants into their plain equivalents (ﬁ→fi, ²→2, Ⅳ→IV) and is used for comparing identifiers.
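The three mappings above can be checked directly. This is a small sketch with Python's `unicodedata`; `nfkc` and `same_identifier` are hypothetical helper names, not part of any library.

```python
import unicodedata

def nfkc(text: str) -> str:
    """Apply Normalization Form KC."""
    return unicodedata.normalize("NFKC", text)

def same_identifier(a: str, b: str) -> bool:
    """Compare two identifiers after NFKC normalization."""
    return nfkc(a) == nfkc(b)

print(nfkc("\ufb01"))   # ligature ﬁ (U+FB01)
print(nfkc("\u00b2"))   # superscript two (U+00B2)
print(nfkc("\u2163"))   # Roman numeral four (U+2163)
print(same_identifier("\ufb01le", "file"))
```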

What is NFKD (Compatibility Decomposition)?

Normalization Form KD: compatibility decomposition with no recomposition. It is the most aggressive form and discards the most formatting information.
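A short sketch of "no recomposition", assuming Python's `unicodedata`: the same word is normalized with NFKD and NFKC, and only NFKC puts the accent back onto its base letter. The sample word is an illustrative choice.

```python
import unicodedata

# "ﬁancé" written with the ﬁ ligature (U+FB01) and a precomposed é (U+00E9).
word = "\ufb01anc\u00e9"

nfkd = unicodedata.normalize("NFKD", word)
nfkc = unicodedata.normalize("NFKC", word)

# NFKD leaves "e" and the combining acute accent as two code points.
print(len(word), len(nfkd), len(nfkc))
print(nfkd.endswith("e\u0301"))
print(nfkc.endswith("\u00e9"))
```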

Security

Normalization Attack

Exploiting Unicode normalization to bypass security filters. Input validated before normalization may change form after: 'ﬁ' (U+FB01) normalizes to 'fi', potentially bypassing keyword filters.

What is a Unicode Normalization Attack?

A Unicode normalization attack exploits the fact that the same logical text can be represented in multiple ways in Unicode, and that different parts of a system may apply different normalization rules (or no normalization at all). If a security check is performed on a non-normalized form but the data is later normalized before use, the normalized form can bypass the check.

How Filter Bypass Works

Consider an application that blocks the string <script> to prevent cross-site scripting. If the filter checks the raw input but the database or rendering layer applies NFKC normalization, an attacker can submit:

＜script＞   (fullwidth less-than U+FF1C, fullwidth greater-than U+FF1E)

The filter sees ＜script＞ — no match for <script>. But NFKC normalization maps U+FF1C to U+003C (<) and U+FF1E to U+003E (>), so the database stores or the browser renders <script>, executing the payload.
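The bypass can be reproduced in a few lines. This is a minimal sketch, assuming a substring filter and a storage layer that applies NFKC; `naive_filter`, `safe_filter`, and the payload are hypothetical names and values for illustration.

```python
import unicodedata

BLOCKED = "<script>"

def naive_filter(raw: str) -> bool:
    """Return True if the input passes a check made on the raw string."""
    return BLOCKED not in raw.lower()

def safe_filter(raw: str) -> bool:
    """Return True if the input passes a check made after NFKC normalization."""
    normalized = unicodedata.normalize("NFKC", raw)
    return BLOCKED not in normalized.lower()

# Payload written with fullwidth angle brackets (U+FF1C, U+FF1E).
payload = "\uff1cscript\uff1ealert(1)\uff1c/script\uff1e"

# What a later layer would hold after applying NFKC.
stored = unicodedata.normalize("NFKC", payload)

print(naive_filter(payload))  # the raw check lets the payload through
print(safe_filter(payload))   # the normalized check rejects it
print(stored)
```

The two filters differ only in the order of operations: the second normalizes first, so it inspects the same string the later layer will use.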

Similar bypass potential exists with:

The fi ligature (ﬁ, U+FB01) → normalizes to "fi" under NFKD/NFKC
Superscript digits (¹ U+00B9) → normalize to "1"
Roman numerals (Ⅷ U+2167) → normalize to "VIII"
Compatibility characters like ① (U+2460) → normalizes to "1"
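One way to surface such characters is to list every code point that NFKC would rewrite. The sketch below assumes Python 3.8+ for `unicodedata.is_normalized`; `compatibility_suspects` is a hypothetical helper name.

```python
import unicodedata

def compatibility_suspects(text: str) -> list:
    """List (code point, replacement) pairs for characters NFKC would rewrite.

    Characters are checked one at a time, so sequences that only change
    when combined (such as a letter followed by a combining accent) are
    not reported here.
    """
    suspects = []
    for ch in text:
        replacement = unicodedata.normalize("NFKC", ch)
        if replacement != ch:
            suspects.append((f"U+{ord(ch):04X}", replacement))
    return suspects

sample = "\ufb01le \u00b9 \u2167 \u2460"
print(unicodedata.is_normalized("NFKC", sample))  # quick yes/no check
print(compatibility_suspects(sample))
```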

Username Normalization Attacks

Many platforms normalize usernames on registration to prevent homoglyph squatting. If the normalization is applied inconsistently, account takeover becomes possible.

A classic scenario: a platform stores usernames as entered at registration but normalizes them to NFC at login. Combining characters do not disappear under NFC, but they can merge with the preceding letter. An attacker registers jose followed by U+0301 (combining acute accent), which NFC turns into josé with the precomposed é (U+00E9), and the login system treats the attacker's name as equivalent to the existing josé account.

Alternatively, if a platform applies NFKC normalization only at display time, an attacker could register ＡＤＭＩＮ (fullwidth Latin letters), which is stored as a different string from ADMIN but maps to the same effective identity after normalization.
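Both scenarios come down to registration and login using different keys. A minimal sketch of one consistent approach follows, assuming NFKC plus case folding as the uniqueness key; the `Registry` class and `username_key` function are hypothetical, and a production system would more likely follow a published identifier profile.

```python
import unicodedata

def username_key(name: str) -> str:
    """Derive the comparison key used for BOTH registration and login."""
    return unicodedata.normalize("NFKC", name).casefold()

class Registry:
    """Toy in-memory username registry keyed by the normalized form."""

    def __init__(self) -> None:
        self._taken = {}

    def register(self, name: str) -> bool:
        key = username_key(name)
        if key in self._taken:
            return False  # collides with an existing account
        self._taken[key] = name
        return True

registry = Registry()
print(registry.register("admin"))
print(registry.register("\uff21\uff24\uff2d\uff29\uff2e"))  # fullwidth ADMIN
print(registry.register("jos\u00e9"))    # precomposed é
print(registry.register("jose\u0301"))   # e + combining acute accent
```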

Case Folding Attacks

Case folding is Unicode's locale-independent method for case-insensitive comparison, defined in CaseFolding.txt. Inconsistent application creates vulnerabilities:

ß (U+00DF) case-folds to ss, so a filter blocking SS might miss ß
Greek capital sigma Σ (U+03A3) case-folds to σ
Turkish dotted I: İ (U+0130) lowercases to i in Turkish and Azerbaijani locales but to i\u0307 (i plus combining dot above) elsewhere

If a filter relies on plain lowercasing such as str.lower(), or lowercases under the wrong locale, instead of full case folding, certain characters will not be caught.
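The gap between lowercasing and case folding can be seen with Python's built-in `str.lower()` and `str.casefold()`, both of which are locale-independent. The filter functions below are hypothetical illustrations.

```python
def naive_blocked(text: str, banned: str = "ss") -> bool:
    """Check for the banned substring using plain lowercasing."""
    return banned in text.lower()

def folded_blocked(text: str, banned: str = "ss") -> bool:
    """Check for the banned substring using full Unicode case folding."""
    return banned.casefold() in text.casefold()

word = "Stra\u00dfe"  # "Straße"
print(naive_blocked(word))    # lower() keeps ß, so "ss" is not found
print(folded_blocked(word))   # casefold() expands ß to "ss"

# Both sigma forms fold to the same letter.
print("\u03a3".casefold() == "\u03c2".casefold() == "\u03c3")

# Outside Turkic locale handling, İ lowercases to "i" plus U+0307.
print("\u0130".lower() == "i\u0307")
```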

WAF Bypass Techniques

Web Application Firewalls (WAFs) that operate on raw bytes before normalization are vulnerable to Unicode-based bypass. Attack patterns include:

Overlong UTF-8 encoding — now invalid in modern systems, but some parsers historically accepted non-minimal encodings
Compatibility decomposition — submitting compatibility characters that decompose to blocked keywords
Mixed NFC/NFD input — deliberately submitting NFD-encoded input to a filter expecting NFC
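Two of these patterns can be illustrated with the standard library alone. The sketch below assumes a filter that first decodes bytes strictly and then compares in a single normalization form; `decode_strict` and `matches_keyword` are hypothetical names, and the blocked word is an illustrative choice.

```python
import unicodedata
from typing import Optional

def decode_strict(raw: bytes) -> Optional[str]:
    """Decode UTF-8 strictly; return None for invalid input such as overlong forms."""
    try:
        return raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return None

def matches_keyword(text: str, keyword: str) -> bool:
    """Compare in NFC so that NFD-encoded input cannot slip past the check."""
    nfc_text = unicodedata.normalize("NFC", text)
    nfc_keyword = unicodedata.normalize("NFC", keyword)
    return nfc_keyword in nfc_text

# 0xC0 0xBC and 0xC0 0xBE are overlong two-byte encodings of "<" and ">".
print(decode_strict(b"\xc0\xbcscript\xc0\xbe"))
print(decode_strict(b"<script>"))

blocked = "caf\u00e9"          # keyword stored in NFC
submitted = "cafe\u0301"       # same word submitted in NFD
print(blocked in submitted)    # raw comparison misses it
print(matches_keyword(submitted, blocked))
```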

Defense Strategies

Normalize at the perimeter: apply NFKC normalization to all user input at the earliest entry point, before any security check
Consistent normalization: ensure the same normalization form is applied at input validation, storage, and retrieval
Case folding before comparison: use Unicode case folding, not locale-specific toLowerCase()
Restrict username characters: consider limiting allowed code points to a safe subset (e.g., Identifier_Status=Allowed from UTS #39)
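The four strategies can be combined into one entry-point function. This is a sketch under stated assumptions: the ASCII allowlist below is a stand-in for a real Identifier_Status=Allowed lookup (that data is not in Python's standard library), and `canonical_username` and the length limits are hypothetical.

```python
import re
import unicodedata

# Assumption: a conservative ASCII allowlist, used here in place of UTS #39 data.
ALLOWED = re.compile(r"[a-z0-9_]{3,32}")

def canonical_username(raw: str) -> str:
    """Normalize, case-fold, then validate; return the form to store and compare."""
    normalized = unicodedata.normalize("NFKC", raw)
    # Re-normalize after folding, since folding can produce new sequences.
    folded = unicodedata.normalize("NFKC", normalized.casefold())
    if not ALLOWED.fullmatch(folded):
        raise ValueError(f"username not allowed: {folded!r}")
    return folded

print(canonical_username("\uff21dmin_01"))  # fullwidth A is folded to plain "a"

try:
    canonical_username("\u0430dmin")        # Cyrillic а survives NFKC and is rejected
except ValueError as exc:
    print("rejected:", exc)
```

Storing and comparing only the returned value keeps validation, storage, and retrieval on the same form.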

Quick Facts

Topic	Summary
Filter bypass	Compatibility chars normalize to blocked strings
Username collision	NFC of two different inputs is identical
Case folding	Language-specific folding bypasses ASCII-only checks
WAF bypass	Submit decomposed/compatibility form, normalized on parsing
Defense	NFKC normalize early, apply checks on normalized form
Relevant standards	UTR #36 (Unicode Security Considerations), UTS #39 (Unicode Security Mechanisms)
Key properties	`Identifier_Status`, `Identifier_Type` (UTS #39)

Related Terms

Normalization, NFKC (Compatibility Composition), NFKD (Compatibility Decomposition), Unicode Spoofing

More in Security

Bidi Text Attack

Exploiting Unicode bidirectional control characters to disguise malicious code or filenames. The …

Zero Width Joiner (ZWJ)

U+200D requests that adjacent characters be joined. It is essential for emoji sequences (👩+ZWJ+💻=👩‍💻), and in Indic scripts it requests conjunct forms. It can also be used to hide text boundaries.

Zero Width Non-Joiner (ZWNJ)

U+200C prevents adjacent characters from joining. It is required in Persian and Arabic for correct letterforms, and is used in Devanagari to prevent conjunct formation.

Mixed-Script Detection

Identifying text that mixes characters from different scripts (e.g., Latin with Cyrillic). It is a primary defense against homoglyph attacks, and browsers use it to decide when to display Punycode.
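A rough sketch of the idea follows. Python's standard library does not expose the Unicode Script property, so this example approximates it with the first word of each character's name; that shortcut is an assumption made for illustration and misclassifies some characters (fullwidth letters, for example). `rough_scripts` and `looks_mixed` are hypothetical names.

```python
import unicodedata

def rough_scripts(text: str) -> set:
    """Approximate the set of scripts in `text` from character names."""
    scripts = set()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                # e.g. "LATIN SMALL LETTER A" -> "LATIN"
                scripts.add(name.split()[0])
    return scripts

def looks_mixed(text: str) -> bool:
    """Flag text whose letters appear to come from more than one script."""
    return len(rough_scripts(text)) > 1

print(rough_scripts("\u0430pple"))  # Cyrillic а followed by Latin letters
print(looks_mixed("\u0430pple"))
print(looks_mixed("apple"))
```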

Unicode Spoofing

Using Unicode features to deceive users: homoglyphs for fake domains, bidi overrides for fake file extensions, or invisible characters for hidden text.

Bidi Override Attack

Using Unicode bidirectional control characters (U+202A–U+202E, U+2066–U+2069) to disguise malicious filenames or code. The name 'readme' + U+202E + 'fdp.exe' displays as 'readmeexe.pdf'.
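Because these characters are invisible, a simple scan is often the first line of defense. The sketch below reports the position of every code point in the two ranges named above; `find_bidi_controls` is a hypothetical helper name.

```python
# The two ranges named above: U+202A..U+202E and U+2066..U+2069.
BIDI_CONTROLS = (
    {chr(cp) for cp in range(0x202A, 0x202F)}
    | {chr(cp) for cp in range(0x2066, 0x206A)}
)

def find_bidi_controls(text: str) -> list:
    """Return (index, code point label) for each bidi control character found."""
    return [
        (index, f"U+{ord(ch):04X}")
        for index, ch in enumerate(text)
        if ch in BIDI_CONTROLS
    ]

filename = "readme\u202efdp.exe"
print(find_bidi_controls(filename))
print(find_bidi_controls("readme.pdf"))
```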

IDN Homograph Attack

Using visually similar Unicode characters in domain names to impersonate legitimate websites. аpple.com (with a Cyrillic а) looks like apple.com. Browsers defend against this with Punycode display rules.
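The Punycode form makes the difference visible. This sketch uses Python's built-in `idna` codec, which implements the older IDNA 2003 rules; it is meant only to show the ASCII form of a label, not to reproduce any browser's display policy. `to_ascii` and `is_punycoded` are hypothetical helper names.

```python
def to_ascii(domain: str) -> str:
    """Convert a domain to its ASCII (Punycode) form using the built-in codec."""
    return domain.encode("idna").decode("ascii")

def is_punycoded(domain: str) -> bool:
    """Report whether any label needed Punycode, i.e. contained non-ASCII characters."""
    return any(label.startswith("xn--") for label in to_ascii(domain).split("."))

spoofed = "\u0430pple.com"   # first letter is Cyrillic а (U+0430)
genuine = "apple.com"

print(to_ascii(spoofed))
print(to_ascii(genuine))
print(is_punycoded(spoofed), is_punycoded(genuine))
```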

Confusable Characters

The official Unicode term for pairs of characters that may be visually confused, defined in confusables.txt (published with UTS #39). It is broader than homoglyph: it covers characters that are merely similar, not only those that look identical.

Homoglyph

Characters from different scripts that look identical or nearly identical, such as Latin 'a' and Cyrillic 'а'. Used in phishing, spoofing, and social engineering.
