Security

Normalization Attack

An attack that exploits Unicode normalization to bypass security filters. Input that is validated before normalization may change form afterwards: 'ﬁ' (U+FB01) normalizes to 'fi' under NFKC, potentially bypassing keyword filters.

What is a Unicode Normalization Attack?

A Unicode normalization attack exploits the fact that the same logical text can be represented in multiple ways in Unicode, and that different parts of a system may apply different normalization rules (or no normalization at all). If a security check is performed on a non-normalized form but the data is later normalized before use, the normalized form can bypass the check.

How Filter Bypass Works

Consider an application that blocks the string <script> to prevent cross-site scripting. If the filter checks the raw input but the database or rendering layer applies NFKC normalization, an attacker can submit:

＜script＞   (fullwidth less-than sign U+FF1C, fullwidth greater-than sign U+FF1E)

The filter sees ＜script＞ (the word wrapped in U+FF1C and U+FF1E), which does not match <script>. But NFKC normalization maps U+FF1C to U+003C (<) and U+FF1E to U+003E (>), so the database stores, and the page later serves, <script>, and the browser executes the payload.
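
The check-then-normalize ordering described above can be reproduced with Python's standard `unicodedata` module. This is a minimal sketch: `naive_filter_allows` and `store` are hypothetical stand-ins for an application's filter and storage layer, not part of any real framework.

```python
import unicodedata

BLOCKED = "<script>"

def naive_filter_allows(raw: str) -> bool:
    """Hypothetical filter: inspects the raw input only."""
    return BLOCKED not in raw.lower()

def store(raw: str) -> str:
    """Hypothetical storage layer that applies NFKC after the check."""
    return unicodedata.normalize("NFKC", raw)

# Fullwidth less-than (U+FF1C) and greater-than (U+FF1E) around the tag names.
payload = "\uff1cscript\uff1ealert(1)\uff1c/script\uff1e"

allowed = naive_filter_allows(payload)  # the filter finds no ASCII "<script>"
stored = store(payload)                 # NFKC maps the fullwidth signs to ASCII

print(allowed)            # True
print(BLOCKED in stored)  # True
```

The filter and the storage layer each behave correctly in isolation; the vulnerability comes only from the order in which they run.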

Similar bypass potential exists with:

  • The fi ligature (ﬁ, U+FB01) → normalizes to "fi" under NFKD/NFKC
  • Superscript digits (¹, U+00B9) → normalize to "1"
  • Roman numerals (Ⅷ, U+2167) → normalize to "VIII"
  • Compatibility characters like ① (U+2460) → normalizes to "1"
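
The compatibility mappings listed above can be checked directly. The sketch below compares NFC with NFKC for each character; the helper name `compat_report` is an illustrative assumption.

```python
import unicodedata

# Character -> expected NFKC result, taken from the examples in the text.
SAMPLES = {
    "\ufb01": "fi",    # LATIN SMALL LIGATURE FI
    "\u00b9": "1",     # SUPERSCRIPT ONE
    "\u2167": "VIII",  # ROMAN NUMERAL EIGHT
    "\u2460": "1",     # CIRCLED DIGIT ONE
}

def compat_report(samples):
    """Return (code point, NFC form, NFKC form) for each sample character."""
    rows = []
    for char in samples:
        nfc = unicodedata.normalize("NFC", char)
        nfkc = unicodedata.normalize("NFKC", char)
        rows.append((f"U+{ord(char):04X}", nfc, nfkc))
    return rows

for code_point, nfc, nfkc in compat_report(SAMPLES):
    print(code_point, repr(nfc), "->", repr(nfkc))
```

NFC leaves all four characters unchanged, so a system that only applies canonical normalization (NFC/NFD) is not exposed to these particular mappings; the risk appears where NFKC or NFKD runs after a check.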

Username Normalization Attacks

Many platforms normalize usernames on registration to prevent homoglyph squatting. If the normalization is applied inconsistently, account takeover becomes possible.

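A classic scenario: a platform normalizes usernames to NFC on login but stores them as entered on registration. An attacker registers a name that differs from an existing account only in its encoded form, for example josé typed as e followed by U+0301 (combining acute accent) when the existing account uses the precomposed é (U+00E9). The two are stored as different code point sequences, but NFC composes the combining sequence into the precomposed character, so the login system treats the attacker's name as equivalent to the existing account. A plain ASCII name such as admin is unaffected by NFC itself: combining marks compose under NFC rather than disappear.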

Alternatively, if a platform applies NFKC normalization only at display time, an attacker could register ＡＤＭＩＮ (fullwidth Latin letters), which is stored as a different code point sequence from ADMIN, and gain a username that maps to the same effective identity after normalization.
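
One way to avoid both collisions is to derive a single canonical key and use it for every uniqueness check and lookup. The following is a sketch under stated assumptions: `username_key` and `Registry` are hypothetical names, and NFKC followed by case folding is one reasonable choice of key, not a prescribed standard.

```python
import unicodedata

def username_key(name: str) -> str:
    """Hypothetical canonical key: NFKC, case fold, then NFKC again."""
    folded = unicodedata.normalize("NFKC", name).casefold()
    # Case folding can produce text that is no longer normalized.
    return unicodedata.normalize("NFKC", folded)

class Registry:
    """Toy in-memory user store that enforces uniqueness on the key."""

    def __init__(self):
        self._taken = {}

    def register(self, name: str) -> bool:
        key = username_key(name)
        if key in self._taken:
            return False  # collides with an existing account
        self._taken[key] = name
        return True

registry = Registry()
print(registry.register("admin"))                           # True
print(registry.register("\uff21\uff24\uff2d\uff29\uff2e"))  # False (fullwidth ADMIN)
print(registry.register("jos\u00e9"))                       # True
print(registry.register("jose\u0301"))                      # False (decomposed form)
```

Because registration and login would both go through `username_key`, there is no stage at which two spellings of the same name are treated differently.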

Case Folding Attacks

Case folding is Unicode's locale-independent method for case-insensitive comparison, defined in CaseFolding.txt. Inconsistent application creates vulnerabilities:

  • ß (U+00DF) case-folds to ss — a filter blocking SS might miss ß
  • Greek capital sigma Σ (U+03A3) case-folds to σ
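  • Turkish dotted capital İ (U+0130) lowercases to i followed by U+0307 (combining dot above) under the default, locale-independent mapping, but to plain i in Turkish and Azerbaijani locales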

If a filter lowercases with str.lower() or a locale-sensitive function instead of applying full Unicode case folding, characters such as ß and İ will not be caught.
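
In Python the difference between lowercasing and case folding is visible with the built-in `str.lower()` and `str.casefold()` methods, both of which are locale-independent. The helper `caseless_equal` below is an illustrative name, and the sketch ignores normalization, which a real comparison would also apply.

```python
def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings using full Unicode case folding."""
    return a.casefold() == b.casefold()

street = "Stra\u00dfe"  # contains ß (U+00DF)

print(street.lower() == "strasse")          # False: lower() keeps ß
print(caseless_equal(street, "STRASSE"))    # True: ß folds to "ss"

# Capital sigma and final sigma both fold to small sigma (U+03C3).
print("\u03a3".casefold() == "\u03c3")      # True
print("\u03c2".casefold() == "\u03c3")      # True

# Dotted capital I lowercases to "i" plus a combining dot above (U+0307).
print("\u0130".lower() == "i\u0307")        # True
```

A filter that compares `text.lower()` against a lowercase blocklist therefore misses ß-based spellings that a `casefold()`-based comparison catches.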

WAF Bypass Techniques

Web Application Firewalls (WAFs) that operate on raw bytes before normalization are vulnerable to Unicode-based bypass. Attack patterns include:

  1. Overlong UTF-8 encoding — now invalid in modern systems, but some parsers historically accepted non-minimal encodings
  2. Compatibility decomposition — submitting compatibility characters that decompose to blocked keywords
  3. Mixed NFC/NFD input — deliberately submitting NFD-encoded input to a filter expecting NFC
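
Two of the patterns above can be illustrated briefly. The sketch assumes a hypothetical blocklist term (`BLOCKED_NFC`) and hypothetical filter functions; the only library behavior relied on is Python's strict UTF-8 decoder and `unicodedata.normalize`.

```python
import unicodedata

def decode_strict(data: bytes):
    """Decode UTF-8 strictly; return None for ill-formed input."""
    try:
        return data.decode("utf-8")  # errors="strict" is the default
    except UnicodeDecodeError:
        return None

# 0xC0 0xBC is an overlong (non-minimal) two-byte encoding of "<" (U+003C).
overlong_lt = b"\xc0\xbc"
overlong_result = decode_strict(overlong_lt)  # rejected, so None

# Hypothetical blocklist term stored in NFC (precomposed n with tilde).
BLOCKED_NFC = "pi\u00f1ata"

def filter_without_normalizing(text: str) -> bool:
    """Return True if the blocked term is found in the raw text."""
    return BLOCKED_NFC in text

def filter_with_nfc(text: str) -> bool:
    """Return True if the blocked term is found after NFC normalization."""
    return BLOCKED_NFC in unicodedata.normalize("NFC", text)

# Attacker submits the decomposed form: n followed by U+0303.
nfd_input = "buy a pin\u0303ata"

print(overlong_result)                        # None
print(filter_without_normalizing(nfd_input))  # False
print(filter_with_nfc(nfd_input))             # True
```

Rejecting ill-formed byte sequences outright, rather than repairing them, keeps the filter and every downstream parser looking at the same text.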

Defense Strategies

  1. Normalize at the perimeter — apply NFKC normalization to all user input at the earliest entry point, before any security check
  2. Consistent normalization — ensure the same normalization form is applied at input validation, storage, and retrieval
  3. Case folding before comparison — use Unicode case folding, not locale-specific toLowerCase()
  4. Restrict username characters — consider limiting allowed code points to a safe subset (e.g., Identifier_Status=Allowed from Unicode TR39)
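
The four strategies can be combined into one small input pipeline. This is a sketch, not a complete defense: the function names, the blocklist, and the ASCII-only username pattern are all assumptions made for illustration. The pattern stands in for a TR39-based allowlist because Python's standard library does not ship the TR39 data files.

```python
import re
import unicodedata

BLOCKLIST = ("<script",)                          # illustrative only
USERNAME_RE = re.compile(r"[a-z0-9_]{3,32}")      # assumed safe subset

def canonicalize(raw: str) -> str:
    """Step 1: normalize at the perimeter, before any check."""
    return unicodedata.normalize("NFKC", raw)

def accept_text(raw: str):
    """Return the canonical text to store, or None if it is rejected."""
    text = canonicalize(raw)
    folded = text.casefold()                      # step 3: fold before comparing
    if any(term in folded for term in BLOCKLIST):
        return None
    return text                                   # step 2: store what was checked

def accept_username(raw: str):
    """Return the canonical username key, or None if it is not allowed."""
    key = canonicalize(canonicalize(raw).casefold())
    return key if USERNAME_RE.fullmatch(key) else None  # step 4: restrict

print(accept_text("\uff1cSCRIPT\uff1e"))   # None
print(accept_text("\ufb01le"))             # file
print(accept_username("\uff21dmin"))       # admin
print(accept_username("\u0430dmin"))       # None (Cyrillic а, U+0430)
```

The value returned by each function is the normalized form, so storage and later retrieval operate on exactly the string that passed the check.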

Quick Facts

Attack Type | Mechanism
Filter bypass | Compatibility chars normalize to blocked strings
Username collision | NFC of two different inputs is identical
Case folding | Language-specific folding bypasses ASCII-only checks
WAF bypass | Submit decomposed/compatibility form, normalized on parsing
Defense | NFKC normalize early, apply checks on normalized form
Relevant standard | Unicode TR36 (Security Considerations), TR39 (Security Mechanisms)
Key properties | IdentifierStatus, IdentifierType (TR39 confusables)

Related Terms

More in Security

Bidi Text Attack

Exploiting Unicode bidirectional control characters to disguise malicious code or filenames. The …

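IDN Homograph Attack

An attack that uses visually similar Unicode characters in domain names to impersonate legitimate websites: аpple.com (with Cyrillic а) looks like apple.com. Browsers defend against it with Punycode display rules.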

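Unicode Spoofing

Using Unicode features to deceive users: homoglyphs for fake domain names, bidirectional overrides for forged file extensions, and invisible characters for hidden text.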

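Bidirectional Override Attack

An attack that uses Unicode bidirectional control characters (U+202A–U+202E, U+2066–U+2069) to disguise malicious filenames or code: 'readme' + U+202E + 'fdp.exe' displays as 'readmeexe.pdf'.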

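Homoglyph

Characters from different scripts that look identical or very similar, such as Latin 'a' and Cyrillic 'а', used in phishing, spoofing, and social engineering attacks.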

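Confusable

Unicode's official term for pairs of characters that may be visually confused, defined in confusables.txt (part of the Unicode TR39 security data). It is broader than homoglyph and includes characters that are merely similar rather than identical.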

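Mixed-Script Detection

Identifying text that mixes characters from different scripts (such as Latin + Cyrillic). It is a primary defense against homograph attacks, and browsers use it to trigger Punycode display.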

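Zero Width Joiner (ZWJ)

U+200D, requests that adjacent characters be joined. It is essential to emoji sequences (👩+ZWJ+💻=👩‍💻), requests ligature formation in Indic scripts, and can also be used to hide text boundaries.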

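Zero Width Non-Joiner (ZWNJ)

U+200C, prevents adjacent characters from joining. It is required for correct letter forms in Persian/Arabic and is also used in Devanagari to prevent ligatures.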