Normalization Attack
Exploiting Unicode normalization to bypass security filters. Input that is validated before normalization may change form afterward: 'ﬁ' (U+FB01) normalizes to 'fi' under NFKC, potentially bypassing keyword filters.
What is a Unicode Normalization Attack?
A Unicode normalization attack exploits the fact that the same logical text can be represented in multiple ways in Unicode, and that different parts of a system may apply different normalization rules (or no normalization at all). If a security check is performed on a non-normalized form but the data is later normalized before use, the normalized form can bypass the check.
How Filter Bypass Works
Consider an application that blocks the string <script> to prevent cross-site scripting. If the filter checks the raw input but the database or rendering layer applies NFKC normalization, an attacker can submit:
＜script＞ (fullwidth less-than U+FF1C, fullwidth greater-than U+FF1E)
The filter sees ＜script＞, which does not match <script>. But NFKC normalization maps U+FF1C to U+003C (<) and U+FF1E to U+003E (>), so the database stores or the browser renders <script>, executing the payload.
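The bypass described above can be reproduced in a few lines. This is a minimal sketch using Python's standard `unicodedata` module; `naive_filter` and `BLOCKED` are hypothetical names chosen for illustration, not part of any real filter.

```python
import unicodedata

BLOCKED = "<script>"  # hypothetical blocked keyword


def naive_filter(raw: str) -> bool:
    """Return True if the raw input passes the (flawed) pre-normalization check."""
    return BLOCKED not in raw.lower()


# Payload written with fullwidth less-than (U+FF1C) and greater-than (U+FF1E).
payload = "\uFF1Cscript\uFF1E"

# The check runs on the raw string, so the payload is accepted.
passes_filter = naive_filter(payload)

# A later layer applies NFKC, which maps the fullwidth signs to ASCII.
stored = unicodedata.normalize("NFKC", payload)

print(passes_filter, stored == BLOCKED)
```

The check and the normalization happen in the wrong order: the value that is eventually stored is not the value that was inspected.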
Similar bypass potential exists with:
- The fi ligature (ﬁ, U+FB01) → normalizes to "fi" under NFKD/NFKC
- Superscript digits (¹ U+00B9) → normalize to "1"
- Roman numerals (Ⅷ U+2167) → normalize to "VIII"
- Compatibility characters like ① (U+2460) → normalizes to "1"
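The mappings listed above can be checked directly. The sketch below assumes only Python's standard `unicodedata` module; the `samples` dictionary is an illustrative name and pairs each character with the result the list claims.

```python
import unicodedata

# Each compatibility character paired with its expected NFKC result.
samples = {
    "\uFB01": "fi",    # LATIN SMALL LIGATURE FI
    "\u00B9": "1",     # SUPERSCRIPT ONE
    "\u2167": "VIII",  # ROMAN NUMERAL EIGHT
    "\u2460": "1",     # CIRCLED DIGIT ONE
}

# Compatibility normalization (NFKC) rewrites every sample to plain ASCII.
nfkc_results = {ch: unicodedata.normalize("NFKC", ch) for ch in samples}

# Canonical normalization (NFC) leaves the same characters untouched.
nfc_unchanged = all(unicodedata.normalize("NFC", ch) == ch for ch in samples)

for ch, result in nfkc_results.items():
    print(f"U+{ord(ch):04X} -> {result!r}")
```

Only the compatibility forms (NFKC and NFKD) perform these rewrites, which is why a system that mixes NFC in one layer with NFKC in another is exposed.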
Username Normalization Attacks
Many platforms normalize usernames on registration to prevent homoglyph squatting. If the normalization is applied inconsistently, account takeover becomes possible.
A classic scenario: a platform normalizes usernames to NFC on login but stores them as entered on registration. An attacker registers a name that is a different code point sequence from an existing account but canonically equivalent to it, for example josé spelled with e + U+0301 (combining acute accent) instead of the precomposed é (U+00E9). NFC composes the pair into the single code point, so the login system treats the attacker's name as the existing account.
Alternatively, if a platform applies NFKC normalization only at display time, an attacker could register ＡＤＭＩＮ (fullwidth Latin letters), which is a different stored string from ADMIN, and gain a username that maps to the same effective identity after normalization.
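One way to close both gaps is to derive a single comparison key at registration and reuse it at login. The sketch below is an assumption-laden illustration: `username_key` and `Registry` are hypothetical names, and the NFKC, casefold, NFKC sequence only approximates Unicode's NFKC_Casefold mapping.

```python
import unicodedata


def username_key(name: str) -> str:
    """Hypothetical canonical key: NFKC, full case folding, then NFKC again."""
    folded = unicodedata.normalize("NFKC", name).casefold()
    return unicodedata.normalize("NFKC", folded)


class Registry:
    """Toy in-memory account store keyed by the canonical form."""

    def __init__(self) -> None:
        self._by_key: dict[str, str] = {}

    def register(self, name: str) -> bool:
        """Return False if an equivalent username is already taken."""
        key = username_key(name)
        if key in self._by_key:
            return False
        self._by_key[key] = name
        return True


registry = Registry()
first_admin = registry.register("admin")
# Fullwidth "ADMIN" collapses to the same key and is rejected.
fullwidth_admin = registry.register("\uFF21\uFF24\uFF2D\uFF29\uFF2E")
first_jose = registry.register("jos\u00E9")  # precomposed e-acute
# Decomposed spelling (e + U+0301) is canonically equivalent and is rejected.
decomposed_jose = registry.register("jose\u0301")
```

Because uniqueness is enforced on the key rather than on the raw string, the same function must also be used when looking accounts up at login.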
Case Folding Attacks
Case folding is Unicode's locale-independent method for case-insensitive comparison, defined in CaseFolding.txt. Inconsistent application creates vulnerabilities:
- ß (U+00DF) case-folds to ss, so a filter blocking SS might miss ß
- Greek capital sigma Σ (U+03A3) case-folds to σ
- Turkish dotted I: İ (U+0130) lowercases to i in Turkish and Azerbaijani locales but to i\u0307 (i followed by combining dot above) in others
If a filter relies on simple lowercasing such as str.lower(), or on lowercasing with the wrong locale, instead of full case folding, certain characters will not be caught.
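The difference between lowercasing and case folding is easy to observe in Python, where `str.lower()` is locale-independent and `str.casefold()` applies full Unicode case folding. In this sketch, `fold` and `contains_blocked` are hypothetical helper names.

```python
import unicodedata


def fold(text: str) -> str:
    """Hypothetical helper: compatibility-normalize, then apply full case folding."""
    return unicodedata.normalize("NFKC", text).casefold()


def contains_blocked(text: str, keyword: str) -> bool:
    """Case-insensitive substring check performed on folded forms."""
    return fold(keyword) in fold(text)


# lower() leaves sharp s untouched, so a lowercase comparison misses it.
lower_check = "SS".lower() in "stra\u00DFe".lower()

# casefold() expands sharp s to "ss", so the folded comparison matches.
folded_check = contains_blocked("stra\u00DFe", "SS")

# Capital sigma and final sigma both fold to small sigma (U+03C3).
sigma_match = "\u03A3".casefold() == "\u03C2".casefold() == "\u03C3"

# Outside Turkish-aware tooling, dotted capital I lowercases to two code points.
dotted_i = "\u0130".lower()
```

Folding both the keyword and the input with the same function is the important part; folding only one side reintroduces the mismatch.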
WAF Bypass Techniques
Web Application Firewalls (WAFs) that operate on raw bytes before normalization are vulnerable to Unicode-based bypass. Attack patterns include:
- Overlong UTF-8 encoding — now invalid in modern systems, but some parsers historically accepted non-minimal encodings
- Compatibility decomposition — submitting compatibility characters that decompose to blocked keywords
- Mixed NFC/NFD input — deliberately submitting NFD-encoded input to a filter expecting NFC
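Two of these patterns can be illustrated with the standard library alone. The sketch below assumes a strict UTF-8 decoder (Python's built-in one) and uses hypothetical names `decode_strict` and `matches_keyword`; it is not a model of any particular WAF.

```python
import unicodedata
from typing import Optional


def decode_strict(data: bytes) -> Optional[str]:
    """Decode UTF-8 strictly; return None for malformed input such as overlong forms."""
    try:
        return data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return None


def matches_keyword(text: str, keyword: str, normalize: bool) -> bool:
    """Substring check, optionally normalizing both sides to NFC first."""
    if normalize:
        text = unicodedata.normalize("NFC", text)
        keyword = unicodedata.normalize("NFC", keyword)
    return keyword in text


# 0xC0 0xBC and 0xC0 0xBE are overlong two-byte encodings of '<' and '>'.
overlong = decode_strict(b"\xC0\xBCscript\xC0\xBE")
well_formed = decode_strict(b"<script>")

# Keyword in NFC (precomposed e-acute), input in NFD (e + U+0301).
keyword = "caf\u00E9"
nfd_input = "visit the cafe\u0301 today"
raw_match = matches_keyword(nfd_input, keyword, normalize=False)
normalized_match = matches_keyword(nfd_input, keyword, normalize=True)
```

Rejecting malformed byte sequences outright, rather than repairing them, avoids having a lenient parser downstream reinterpret what the filter already approved.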
Defense Strategies
- Normalize at the perimeter — apply NFKC normalization to all user input at the earliest entry point, before any security check
- Consistent normalization — ensure the same normalization form is applied at input validation, storage, and retrieval
- Case folding before comparison: use Unicode case folding, not locale-specific toLowerCase()
- Restrict username characters: consider limiting allowed code points to a safe subset (e.g., IdentifierStatus=Allowed from Unicode TR39)
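The strategies above can be combined into a small input pipeline. This is a sketch under stated assumptions: `canonicalize`, `check_input`, and `valid_username` are hypothetical names, the blocklist is purely illustrative, and the ASCII pattern stands in for a TR39-based allowlist, which Python's standard library does not provide.

```python
import re
import unicodedata

BLOCKED_SUBSTRINGS = ("<script",)  # illustrative blocklist only
USERNAME_RE = re.compile(r"[a-z0-9_]{3,32}")  # assumed safe subset (ASCII)


def canonicalize(raw: str) -> str:
    """Normalize at the perimeter: NFKC followed by full case folding."""
    return unicodedata.normalize("NFKC", raw).casefold()


def check_input(raw: str) -> str:
    """Run security checks on the canonical form and return that form for storage."""
    text = canonicalize(raw)
    for blocked in BLOCKED_SUBSTRINGS:
        if blocked in text:
            raise ValueError("blocked content after normalization")
    return text


def valid_username(raw: str) -> bool:
    """Accept a username only if its canonical form stays inside the allowed subset."""
    return USERNAME_RE.fullmatch(canonicalize(raw)) is not None


ascii_name_ok = valid_username("alice_01")
# Cyrillic small a (U+0430) survives NFKC and falls outside the subset.
cyrillic_name_ok = valid_username("\u0430dmin")
```

Returning the canonical string from `check_input` and storing only that value keeps validation, storage, and retrieval on the same form.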
Quick Facts
| Attack Type | Mechanism |
|---|---|
| Filter bypass | Compatibility chars normalize to blocked strings |
| Username collision | NFC of two different inputs is identical |
| Case folding | Language-specific folding bypasses ASCII-only checks |
| WAF bypass | Submit decomposed/compatibility form, normalized on parsing |
| Defense | NFKC normalize early, apply checks on normalized form |
| Relevant standard | Unicode TR36 (Security Considerations), TR39 (Security Mechanisms) |
| Key properties | IdentifierStatus, IdentifierType (TR39 identifier data) |