Confusable Characters
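The official Unicode term for pairs of characters that can be visually mistaken for one another, as listed in confusables.txt (the data file published with the Unicode security mechanisms). It is a broader concept than homoglyphs and also covers characters that are merely similar.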
What is a Confusable Character?
In Unicode security terminology, a confusable is any character that a human reader might mistake for a different character due to visual similarity. The Unicode Consortium formally defines and tracks confusables in Unicode Technical Standard #39 (UTS #39, Unicode Security Mechanisms) and maintains a publicly available data file, confusables.txt, as part of the security data that accompanies that standard.
Strict homoglyphs are nearly pixel-perfect matches; confusables form a broader category. It includes characters that are visually similar under normal reading conditions even if a careful side-by-side comparison would reveal differences. The degree of confusion depends heavily on font, rendering size, and the reader's familiarity with the scripts involved.
The Unicode Confusables Dataset
The confusables.txt file maps thousands of Unicode code points to a prototype: the character, or short character sequence, that each one most resembles. For example:
- Cyrillic а (U+0430) maps to Latin a (U+0061)
- Greek ο (U+03BF) maps to Latin o (U+006F)
- Fullwidth Latin A (U+FF21) maps to Latin A (U+0041)
- Mathematical bold 𝐚 (U+1D41A) maps to Latin a (U+0061)
Each row in the raw data is a one-way "looks like" relationship from a source character to its prototype; the file does not list pairs in both directions. Security implementations typically build a bidirectional index from this data so that any two characters that share a prototype are considered mutually confusable.
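A minimal sketch of how such an index might be built. The `SAMPLE` rows are hand-written in the semicolon-separated `source ; target ; type # comment` layout of confusables.txt and simply mirror the four examples above; the function names (`parse_confusables`, `build_groups`, `mutually_confusable`) are hypothetical and not part of any Unicode tooling.

```python
from collections import defaultdict

# Hand-written sample rows (an assumption for illustration). The real file is
# far larger and would be read from disk instead.
SAMPLE = """\
# source ; target ; type # comment
0430 ; 0061 ; MA # CYRILLIC SMALL LETTER A -> LATIN SMALL LETTER A
03BF ; 006F ; MA # GREEK SMALL LETTER OMICRON -> LATIN SMALL LETTER O
FF21 ; 0041 ; MA # FULLWIDTH LATIN CAPITAL LETTER A -> LATIN CAPITAL LETTER A
1D41A ; 0061 ; MA # MATHEMATICAL BOLD SMALL A -> LATIN SMALL LETTER A
"""


def parse_confusables(text):
    """Return {source_char: prototype_string} from confusables-style rows."""
    prototypes = {}
    for line in text.splitlines():
        # Drop a leading byte-order mark, trailing comments, and whitespace.
        line = line.lstrip("\ufeff").split("#", 1)[0].strip()
        if not line:
            continue
        source_hex, target_hex = [field.strip() for field in line.split(";")[:2]]
        source = chr(int(source_hex, 16))
        # A target may be a sequence of several space-separated code points.
        target = "".join(chr(int(cp, 16)) for cp in target_hex.split())
        prototypes[source] = target
    return prototypes


def build_groups(prototypes):
    """Group every character with all others that share its prototype."""
    groups = defaultdict(set)
    for source, target in prototypes.items():
        groups[target].add(source)
        groups[target].add(target)  # the prototype belongs to its own group
    return groups


def mutually_confusable(a, b, prototypes):
    """Two characters are treated as confusable if their prototypes match."""
    return prototypes.get(a, a) == prototypes.get(b, b)


prototypes = parse_confusables(SAMPLE)
groups = build_groups(prototypes)
print(sorted(f"U+{ord(ch):04X}" for ch in groups["a"]))
```

Grouping by prototype is what makes the relationship usable in both directions: Cyrillic а and mathematical bold 𝐚 never appear in the same row, yet both land in the group keyed by Latin a.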
Categories of Confusables
Confusables arise from several distinct sources:
- Cross-script lookalikes — Characters from different scripts that developed independently but converged on the same glyph shape (Latin vs. Cyrillic vs. Greek)
- Letterlike symbols — Mathematical, technical, and letterlike Unicode blocks contain styled versions of Latin letters (bold, italic, script, fraktur) that are visually indistinguishable from plain letters in some contexts
- Fullwidth and halfwidth forms — The Halfwidth and Fullwidth Forms block (U+FF00–U+FFEF) duplicates ASCII characters for CJK compatibility
- Digraphs and ligatures: Characters like ﬁ (U+FB01, the fi ligature) can be confused with the two-character sequence "fi"
- Digit lookalikes — The digit 0 (U+0030) and uppercase letter O (U+004F) are confusable; so are 1 (U+0031), lowercase l (U+006C), and uppercase I (U+0049)
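The categories above behave differently under standard normalization, which a short experiment can show. This sketch uses Python's `unicodedata` module and NFKC compatibility normalization; NFKC is not mentioned in this entry and is brought in only as an illustration. The `describe` helper and the `SAMPLES` table are hypothetical names.

```python
import unicodedata

# One representative character per category, taken from the examples above.
SAMPLES = {
    "cross-script": "\u0430",      # CYRILLIC SMALL LETTER A
    "letterlike": "\U0001D41A",    # MATHEMATICAL BOLD SMALL A
    "fullwidth": "\uFF21",         # FULLWIDTH LATIN CAPITAL LETTER A
    "ligature": "\uFB01",          # LATIN SMALL LIGATURE FI
}


def describe(ch):
    """Report a character's code point, name, and NFKC-normalized form."""
    folded = unicodedata.normalize("NFKC", ch)
    return {
        "code_point": f"U+{ord(ch):04X}",
        "name": unicodedata.name(ch),
        "nfkc": folded,
        "folds_to_ascii": folded.isascii(),
    }


for label, ch in SAMPLES.items():
    info = describe(ch)
    print(f"{label}: {info['code_point']} {info['name']} -> NFKC {info['nfkc']!r}")
```

The letterlike, fullwidth, and ligature samples fold to plain ASCII under NFKC, while the Cyrillic letter is left unchanged. That suggests normalization alone is not enough for cross-script lookalikes, which is where the confusables data comes in.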
Security Implications
Confusables are the technical foundation of several attack classes:
- Phishing domains: Registering a domain with confusable characters to impersonate a legitimate site
- Username squatting: Creating social media accounts that appear identical to celebrity or brand accounts
- Code injection via source files: Inserting confusable characters into identifiers so that malicious code passes visual inspection
- Credential stuffing assistance: Generating variant spellings of passwords or usernames
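To make the phishing-domain case concrete, the sketch below compares a genuine domain with a lookalike whose first letter is Cyrillic. The domain strings are illustrative only, `non_ascii_report` is a hypothetical helper, and the ASCII form comes from Python's built-in `idna` codec, which may differ from what a browser or registrar produces.

```python
import unicodedata

genuine = "apple.com"
spoofed = "\u0430pple.com"  # first letter is CYRILLIC SMALL LETTER A


def non_ascii_report(text):
    """List (index, code point, name) for every non-ASCII character."""
    return [
        (index, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for index, ch in enumerate(text)
        if ord(ch) > 0x7F
    ]


# The two strings can render identically but are different code point sequences.
print(genuine == spoofed)
print(non_ascii_report(spoofed))

# The ASCII-compatible form exposes the substitution through an "xn--" label.
print(spoofed.encode("idna"))
```

Showing the ASCII-compatible form instead of the rendered Unicode is one common way to make this kind of substitution visible to a reader.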
Defensive Strategies
UTS #39 defines three classes of confusable strings that identifier checks can test for:
- Single-script confusables: Two different strings written in the same script that look alike, for example "paypal" and "paypa1" (digit one in place of lowercase l)
- Whole-script confusables: Two strings that look alike where each is written entirely in a single script but the scripts differ, for example Latin "scope" and an all-Cyrillic "ѕсоре"
- Mixed-script confusables: Two strings that look alike but do not share a common script, typically because one of them mixes scripts, for example "paypal" with Cyrillic а substituted for each Latin a
Modern web frameworks, domain registrars, and identity platforms increasingly apply these checks automatically.
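The three checks listed above can be sketched roughly as follows. This is a simplified illustration, not the UTS #39 algorithm: `PROTOTYPES` is a tiny hand-picked table standing in for the full confusables data, `scripts` guesses a letter's script from the first word of its Unicode name (Python's standard library has no script property), and all function names are hypothetical.

```python
import unicodedata

# Tiny hand-picked prototype table (an assumption for this sketch); a real
# check would load the full confusables.txt data instead.
PROTOTYPES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043E": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0455": "s",  # CYRILLIC SMALL LETTER DZE
    "1": "l",       # DIGIT ONE
}


def skeleton(text):
    """Simplified skeleton: NFD, map each character to its prototype, NFD."""
    decomposed = unicodedata.normalize("NFD", text)
    mapped = "".join(PROTOTYPES.get(ch, ch) for ch in decomposed)
    return unicodedata.normalize("NFD", mapped)


def scripts(text):
    """Rough script guess: first word of each letter's Unicode name."""
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in text
        if ch.isalpha()
    }


def classify(a, b):
    """Classify a pair of strings using the three classes described above."""
    # Identical strings, or strings with different skeletons, are not flagged.
    if a == b or skeleton(a) != skeleton(b):
        return "not confusable"
    scripts_a, scripts_b = scripts(a), scripts(b)
    both_single = len(scripts_a) <= 1 and len(scripts_b) <= 1
    # Strings without letters (digits, punctuation) are compatible with any script.
    if both_single and (scripts_a == scripts_b or not scripts_a or not scripts_b):
        return "single-script confusable"
    if both_single:
        return "whole-script confusable"
    return "mixed-script confusable"


print(classify("paypal", "paypa1"))
print(classify("scope", "\u0455\u0441\u043E\u0440\u0435"))
print(classify("paypal", "p\u0430yp\u0430l"))
```

Comparing skeletons rather than raw strings is the step that makes the check independent of which lookalike an attacker happened to pick; the script test afterwards only decides which class the pair falls into.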
Quick Facts
| Property | Value |
|---|---|
| Governing standard | UTS #39 (Unicode Technical Standard #39), Unicode Security Mechanisms |
| Data file | confusables.txt (security data published with UTS #39) |
| Number of confusable mappings | 3,000+ code points mapped |
| Update frequency | With each Unicode release |
| Prototype concept | Characters map to a single "representative" lookalike |
| Related attacks | IDN homograph, phishing, code obfuscation |
| Key distinction from homoglyph | Broader — includes near-matches, not just identical glyphs |