Unicode Confusables: A Security Guide
Unicode confusables are characters that look identical or nearly identical to other characters, enabling homograph attacks in which a malicious URL or username appears legitimate.
In 2005, the registration of pаypal.com (with a Cyrillic "а", U+0430, instead of a Latin "a",
U+0061) demonstrated a class of attack that Unicode's success had inadvertently enabled.
The domain looked identical to paypal.com in the IDN-capable browsers of the era. This class of attack,
which uses visually similar characters from different scripts to impersonate legitimate identifiers,
is called a homograph (or homoglyph) attack or IDN spoofing, and the Unicode characters that enable it
are known as confusables.
This guide explains how confusable characters work, the specific attack vectors they enable, and the practical techniques available to detect and prevent them.
What Are Confusables?
A confusable (or homoglyph) is a Unicode character that looks identical or nearly identical to a different character, particularly when rendered in common fonts. The Unicode Consortium maintains an official confusables.txt data file listing thousands of these pairs.
Some representative examples:
| Intended | Confusable | Code Points | Scripts |
|---|---|---|---|
| a (Latin) | а (Cyrillic) | U+0061 vs U+0430 | Latin vs Cyrillic |
| o (Latin) | о (Cyrillic) | U+006F vs U+043E | Latin vs Cyrillic |
| e (Latin) | е (Cyrillic) | U+0065 vs U+0435 | Latin vs Cyrillic |
| l (Latin) | І (Cyrillic I) | U+006C vs U+0406 | Latin vs Cyrillic |
| 0 (digit) | O (Latin) | U+0030 vs U+004F | Digits vs Latin |
| rn (two chars) | m (one char) | — | Latin vs Latin |
| v (Latin) | ν (Greek nu) | U+0076 vs U+03BD | Latin vs Greek |
| K (Latin) | Κ (Greek kappa) | U+004B vs U+039A | Latin vs Greek |
The threat isn't limited to cross-script lookalikes. Even within the same script, many characters are visually indistinguishable in common sans-serif fonts:
- l (lowercase L), I (uppercase i), 1 (digit one)
- 0 (digit zero), O (uppercase o)
- rn (r + n) looks like m at small sizes
Attack Vectors
1. IDN Spoofing (Internationalized Domain Names)
Internationalized Domain Names (IDN) allow non-ASCII characters in domain names by encoding each
non-ASCII label with Punycode. pаypal.com with a Cyrillic "а" becomes xn--pypal-4ve.com in DNS, but
a browser that renders the Unicode form displays it as pаypal.com.
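The mapping can be reproduced with Python's standard library. The sketch below uses the built-in punycode codec and adds the xn-- prefix by hand. The helper name to_ace_label is made up for this illustration, and the code skips the normalization and validation that a full IDNA implementation performs.

```python
def to_ace_label(label: str) -> str:
    """Return the ASCII-compatible (xn--) form of a single domain label.

    Simplified sketch: real IDNA processing also normalizes and validates.
    """
    if label.isascii():
        return label  # pure-ASCII labels are used unchanged
    return "xn--" + label.encode("punycode").decode("ascii")


spoofed = "p\u0430ypal"  # second character is CYRILLIC SMALL LETTER A
print(to_ace_label(spoofed))   # xn--pypal-4ve
print(to_ace_label("paypal"))  # paypal
```

Logging or displaying the ASCII-compatible form next to the Unicode form is one way to make the difference between the two labels visible.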
Modern browsers apply IDN homoglyph protection: if a domain mixes characters from different scripts (e.g., Latin and Cyrillic), most browsers will show the Punycode form instead. However, a domain composed entirely of Cyrillic lookalikes passes this check:
аррӏе.com → all Cyrillic lookalikes → browsers display as аррӏе.com
apple.com → all Latin → legitimate
These are different domains at the DNS level but visually identical in some rendering contexts.
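A mixed-script test alone does not catch this case, because every character belongs to one script. One possible complement is to fold known lookalikes to their Latin counterparts and compare the result with a list of names worth protecting. The sketch below assumes a tiny hand-written mapping (LOOKALIKES) and a hypothetical PROTECTED_NAMES set; neither is taken from a real browser or registry.

```python
from typing import Optional

# Hand-written illustrative subset, not the full confusables.txt data.
LOOKALIKES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u04cf": "l",  # CYRILLIC SMALL LETTER PALOCHKA
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
}

# Hypothetical list of names an operator wants to protect.
PROTECTED_NAMES = {"apple", "paypal"}


def imitated_name(label: str) -> Optional[str]:
    """Return the protected name that label imitates, or None."""
    lowered = label.casefold()
    folded = "".join(LOOKALIKES.get(ch, ch) for ch in lowered)
    # Only flag labels that changed under folding; the genuine name is fine.
    if folded != lowered and folded in PROTECTED_NAMES:
        return folded
    return None


print(imitated_name("\u0430\u0440\u0440\u04cf\u0435"))  # apple
print(imitated_name("apple"))                           # None
```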
2. Username Spoofing
Applications that display usernames, especially in security-sensitive contexts (e.g., "@admin" in a forum, "root" in a shell prompt), are vulnerable:
- An attacker registers аdmin (Cyrillic а) and impersonates the admin account.
- Notification emails reference the attacker's username; victims don't notice the difference.
- Social engineering becomes trivial when two accounts look identical in the UI.
3. Source Code Injection
Unicode confusables can be embedded in source code identifiers to create code that looks correct but does something different. Python 3 allows Unicode identifiers:
# Legitimate function
def calculate_total(price, tax):
    return price + tax

# Attacker's version with Cyrillic confusables in the function name
# Looks identical in many editors
def саlculate_total(price, tax):  # "с" is Cyrillic, not Latin "c"
    return price * tax  # Different logic!
This is particularly dangerous in code reviews, where reviewers rarely check character-level identity. Both the Unicode Consortium's guidance on source code security and the Trojan Source research document these risks.
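One way to surface such identifiers during review or in CI is to tokenize the source and report every name that contains non-ASCII characters. The following is a minimal sketch using Python's standard tokenize module. The function name non_ascii_identifiers is made up for this example, and a project with legitimately non-English identifiers would need a more selective policy.

```python
import io
import tokenize


def non_ascii_identifiers(source: str) -> list[tuple[int, str]]:
    """Return (line number, name) for each identifier with non-ASCII characters."""
    findings = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME and not tok.string.isascii():
            findings.append((tok.start[0], tok.string))
    return findings


sample = (
    "def calculate_total(price, tax):\n"
    "    return price + tax\n"
    "def \u0441\u0430lculate_total(price, tax):\n"  # starts with Cyrillic letters
    "    return price * tax\n"
)
for line_no, name in non_ascii_identifiers(sample):
    # ascii() escapes the non-ASCII characters so they cannot hide in the output.
    print(line_no, ascii(name))
```

Printing the name with ascii() matters here: echoing the raw identifier would reproduce the same visual ambiguity the check is meant to expose.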
4. Bidirectional (Bidi) Text Attacks
Unicode supports right-to-left scripts (Arabic, Hebrew) via bidirectional control characters. The Trojan Source attack (CVE-2021-42574) demonstrated that these characters can be used to make code appear to have one structure while its actual syntactic structure is different.
Key bidirectional control characters:
| Character | Code Point | Name | Effect |
|---|---|---|---|
| RLO | U+202E | RIGHT-TO-LEFT OVERRIDE | Forces subsequent text right-to-left |
| LRO | U+202D | LEFT-TO-RIGHT OVERRIDE | Forces subsequent text left-to-right |
| RLI | U+2067 | RIGHT-TO-LEFT ISOLATE | Isolates right-to-left span |
| PDF | U+202C | POP DIRECTIONAL FORMATTING | Ends the most recent embedding or override |
A simplified Trojan Source example in Python, with the control characters written as escapes so they stay visible:
# What a reviewer sees once an editor applies bidi rendering (approximately):
#     if access_level != "user":  # Check if admin
access_level = "user"
# What the file actually contains. In a real attack the raw control
# characters are embedded instead of the \u escapes shown here.
if access_level != "user\u202e \u2066# Check if admin\u2069 \u2066":
    # The apparent comment is part of the string literal, so the comparison
    # is never against plain "user" and this branch always runs.
    print("You are an admin.")
GitHub, GitLab, and most modern code hosts now warn about or block bidirectional control characters in source files.
Detection Techniques
Checking Against the Unicode Confusables List
The Unicode Consortium publishes confusables.txt, which maps each confusable character to a
prototype. Applying that mapping to every character of a string yields its "skeleton", a
normalized form that visually similar strings share. If two strings have the same skeleton,
they are confusable:
# Simplified skeleton computation (real ICU implementation is more complex)
import unicodedata

# The full confusables mapping would be loaded from confusables.txt
# Here we show the principle with a subset
CONFUSABLE_MAP: dict[str, str] = {
    "\u0430": "a",  # Cyrillic а → Latin a
    "\u043E": "o",  # Cyrillic о → Latin o
    "\u0435": "e",  # Cyrillic е → Latin e
    "\u03BD": "v",  # Greek ν → Latin v
}

def skeleton(text: str) -> str:
    '''Compute a simplified confusable skeleton.'''
    nfkd = unicodedata.normalize("NFKD", text)
    return "".join(CONFUSABLE_MAP.get(c, c) for c in nfkd)

legitimate = "paypal"
spoofed = "p\u0430yp\u0430l"  # Cyrillic а

print(skeleton(legitimate))  # paypal
print(skeleton(spoofed))     # paypal ← same skeleton!
print(legitimate == spoofed)  # False (different code points)
print(skeleton(legitimate) == skeleton(spoofed))  # True (confusable!)
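The full implementation uses the complete confusables.txt mapping. UTS #39 defines the skeleton with
NFD normalization applied before and after the mapping; the simplified version above uses NFKD, which
also folds compatibility forms. ICU provides this as com.ibm.icu.text.SpoofChecker in ICU4J (Java)
and as the equivalent uspoof API in ICU4C (C/C++).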
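To move from a hard-coded subset to the published data, the file has to be parsed first. The sketch below assumes the confusables.txt layout of `source ; target ; type # comment` with code points in hexadecimal, and it runs on a few illustrative lines written in that layout instead of the real file. The names parse_confusables and skeleton are chosen for this example. The skeleton follows the NFD, map, NFD sequence but omits the removal of default-ignorable code points.

```python
import unicodedata

# Illustrative lines in the confusables.txt layout; the real file is far larger.
SAMPLE_DATA = """\
# confusables.txt (excerpt for illustration)
0430 ;\t0061 ;\tMA\t# CYRILLIC SMALL LETTER A -> LATIN SMALL LETTER A
043E ;\t006F ;\tMA\t# CYRILLIC SMALL LETTER O -> LATIN SMALL LETTER O
006D ;\t0072 006E ;\tMA\t# LATIN SMALL LETTER M -> r, n
"""


def _decode(field: str) -> str:
    """Turn space-separated hex code points into a string."""
    return "".join(chr(int(cp, 16)) for cp in field.split())


def parse_confusables(text: str) -> dict[str, str]:
    """Build a source -> prototype mapping from confusables.txt-style text."""
    mapping: dict[str, str] = {}
    for raw in text.splitlines():
        # Drop a leading BOM and anything after '#', then skip empty lines.
        line = raw.replace("\ufeff", "").split("#", 1)[0].strip()
        if not line:
            continue
        source, target = line.split(";")[:2]
        mapping[_decode(source)] = _decode(target)
    return mapping


def skeleton(text: str, mapping: dict[str, str]) -> str:
    """NFD, replace each character with its prototype, then NFD again."""
    decomposed = unicodedata.normalize("NFD", text)
    mapped = "".join(mapping.get(ch, ch) for ch in decomposed)
    return unicodedata.normalize("NFD", mapped)


mapping = parse_confusables(SAMPLE_DATA)
print(skeleton("p\u0430yp\u0430l", mapping) == skeleton("paypal", mapping))  # True
print(skeleton("modern", mapping) == skeleton("rnodern", mapping))           # True
```

Because a prototype can be longer than one character, skeletons are meant for comparison only and should not be shown to users or stored as the canonical form of a name.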
Script Mixing Detection
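A simple heuristic: if a string mixes characters from multiple scripts (e.g., Latin and Cyrillic), it's suspicious. Unicode assigns every character a Script property. Python's unicodedata module does not expose that property, so the example below approximates the script from each character's name: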
import unicodedata

def get_scripts(text: str) -> set[str]:
    '''Approximate the scripts used in text from character names (heuristic).'''
    scripts: set[str] = set()
    for char in text:
        name = unicodedata.name(char, "")
        if "LATIN" in name:
            scripts.add("Latin")
        elif "CYRILLIC" in name:
            scripts.add("Cyrillic")
        elif "GREEK" in name:
            scripts.add("Greek")
        elif "ARABIC" in name:
            scripts.add("Arabic")
        # ... etc
    return scripts

def is_mixed_script(text: str) -> bool:
    scripts = get_scripts(text)
    # Allow mixing with "Common" script (digits, punctuation). The name-based
    # get_scripts above simply skips such characters; the subtraction matters
    # only if it is replaced by a lookup that reports real Script values.
    pure_scripts = scripts - {"Common", "Inherited"}
    return len(pure_scripts) > 1

print(is_mixed_script("paypal"))        # False
print(is_mixed_script("p\u0430ypal"))  # True (Latin + Cyrillic)
For production use, the fontTools library provides access to full Unicode script data, and the
PyICU package wraps the comprehensive ICU SpoofChecker.
Detecting Bidirectional Control Characters
BIDI_CONTROL_CHARS = {
    "\u061c",  # ARABIC LETTER MARK
    "\u200e",  # LEFT-TO-RIGHT MARK
    "\u200f",  # RIGHT-TO-LEFT MARK
    "\u202a",  # LEFT-TO-RIGHT EMBEDDING
    "\u202b",  # RIGHT-TO-LEFT EMBEDDING
    "\u202c",  # POP DIRECTIONAL FORMATTING
    "\u202d",  # LEFT-TO-RIGHT OVERRIDE
    "\u202e",  # RIGHT-TO-LEFT OVERRIDE
    "\u2066",  # LEFT-TO-RIGHT ISOLATE
    "\u2067",  # RIGHT-TO-LEFT ISOLATE
    "\u2068",  # FIRST STRONG ISOLATE
    "\u2069",  # POP DIRECTIONAL ISOLATE
}

def contains_bidi_controls(text: str) -> bool:
    return any(c in BIDI_CONTROL_CHARS for c in text)

def strip_bidi_controls(text: str) -> str:
    return "".join(c for c in text if c not in BIDI_CONTROL_CHARS)

// In JavaScript
function hasBidiControls(str) {
  return /[\u061c\u200e\u200f\u202a-\u202e\u2066-\u2069]/.test(str);
}
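A yes/no answer is often not enough in a CI log or a review comment, where the exact position of each control character is what helps. The sketch below reports line and column for every hit. The function name find_bidi_controls and the report format are assumptions made for this example, and the character set also includes U+061C ARABIC LETTER MARK, which carries the Bidi_Control property as well.

```python
import unicodedata

# Characters with the Bidi_Control property.
BIDI_CONTROLS = frozenset(
    "\u061c\u200e\u200f\u202a\u202b\u202c\u202d\u202e\u2066\u2067\u2068\u2069"
)


def find_bidi_controls(source: str) -> list[tuple[int, int, str]]:
    """Return (line, column, description) for each bidi control, 1-based."""
    findings = []
    # Split on "\n" only, so line numbers match what editors and diffs show.
    for line_no, line in enumerate(source.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                desc = f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}"
                findings.append((line_no, col, desc))
    return findings


code = 'a = 1\nb = "x\u202ey"\n'
for line_no, col, desc in find_bidi_controls(code):
    print(f"line {line_no}, column {col}: {desc}")
# line 2, column 7: U+202E RIGHT-TO-LEFT OVERRIDE
```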
Python unicodedata Security Checks
import unicodedata

def is_identifier_safe(name: str) -> bool:
    '''Check if an identifier uses only ASCII-range characters.'''
    return name.isascii()

# Relies on get_scripts, is_mixed_script, and contains_bidi_controls
# from the earlier examples.
def audit_identifier(name: str) -> dict:
    return {
        "value": name,
        "is_ascii": is_identifier_safe(name),
        "scripts": get_scripts(name),
        "mixed_script": is_mixed_script(name),
        "has_bidi": contains_bidi_controls(name),
        "nfc_form": unicodedata.normalize("NFC", name),
        "code_points": [f"U+{ord(c):04X} {unicodedata.name(c, 'UNKNOWN')}" for c in name],
    }
Mitigations for Application Developers
| Layer | Mitigation | Notes |
|---|---|---|
| Input validation | Reject non-ASCII in identifiers/usernames | Strict but safe for most Western apps |
| Input validation | Reject mixed-script identifiers | Still permits single-script non-Latin names for global apps |
| Input validation | Strip bidi control characters | Safe for most text |
| Storage | NFKC normalize usernames | Flatten compatibility variants |
| Display | Render Unicode identifiers in monospace font | Helps distinguish lookalikes |
| Display | Show Punycode for IDNs from mixed scripts | Follow browser behavior |
| Security | Use ICU SpoofChecker for authentication paths | Battle-tested implementation |
| Code review | Configure editors to show non-ASCII characters | VSCode: editor.renderControlCharacters: true |
| CI/CD | Add linter rules for suspicious Unicode | Ruff RUF001–RUF003 flag ambiguous characters in strings, docstrings, and comments; Pylint's non-ascii-name check covers identifiers |
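Several of the input-validation and storage rows can be combined at registration time. The sketch below is one possible arrangement, not a reference implementation: UsernameRegistry, canonical_username, and collision_key are hypothetical names, the lookalike table is a hand-picked subset, and production code would more likely rely on the full confusables.txt data or ICU's SpoofChecker.

```python
import unicodedata

# Hypothetical hand-picked subset of lookalike mappings.
LOOKALIKES = {"\u0430": "a", "\u043e": "o", "\u0435": "e"}


def canonical_username(name: str) -> str:
    """Form to store: NFKC folds compatibility variants, casefold ignores case."""
    return unicodedata.normalize("NFKC", name).casefold()


def collision_key(name: str) -> str:
    """Key for uniqueness checks: canonical form with lookalikes folded."""
    return "".join(LOOKALIKES.get(ch, ch) for ch in canonical_username(name))


class UsernameRegistry:
    def __init__(self) -> None:
        self._taken: dict[str, str] = {}  # collision key -> stored username

    def register(self, requested: str) -> str:
        # Category Cf covers bidi controls and other invisible format characters.
        if any(unicodedata.category(ch) == "Cf" for ch in requested):
            raise ValueError("format characters are not allowed in usernames")
        key = collision_key(requested)
        if key in self._taken:
            raise ValueError(f"too similar to existing user {self._taken[key]!r}")
        stored = canonical_username(requested)
        self._taken[key] = stored
        return stored


registry = UsernameRegistry()
print(registry.register("Admin"))  # admin
try:
    registry.register("\u0430dmin")  # Cyrillic first letter
except ValueError as exc:
    print("rejected:", exc)
```

Rejecting every Cf character is deliberately strict. It also blocks zero-width joiners that some scripts use legitimately, so an application serving those scripts would need a narrower rule.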
Key Takeaways
- Confusables are Unicode characters that look visually identical or nearly identical to other characters, often from different scripts (e.g., Cyrillic "а" vs Latin "a").
- The main attack vectors are: IDN spoofing (fake domains), username impersonation, source code injection (Trojan identifiers), and bidi text attacks (Trojan Source).
- The Unicode Consortium publishes the official confusables.txt data file mapping confusable characters to a normalized "skeleton".
- Detection strategies include: skeleton comparison, script mixing detection, and bidi control character scanning.
- For production: use the ICU SpoofChecker (available via PyICU in Python) for comprehensive confusable detection on security-sensitive paths.
- Always strip or reject bidi control characters (U+202A–U+202E, U+2066–U+2069) from source code, identifiers, and user-visible names.