
Unicode Confusables: A Security Guide

Unicode confusables are characters that look identical or nearly identical to others, enabling homograph attacks where a malicious URL or username appears legitimate. This guide explains what confusables are, how attackers exploit them, and how to detect and prevent confusable-based spoofing.

Published 2021-11-08 · Updated 2024-07-29

In 2005, the registration of pаypal.com (with a Cyrillic "а", U+0430, in place of the Latin "a", U+0061) demonstrated a class of attack that Unicode's success had inadvertently enabled. The domain looked identical to paypal.com in IDN-capable browsers of the era. This class of attack, which uses visually similar characters from different scripts to impersonate legitimate identifiers, is called a homograph (or homoglyph) attack, or IDN spoofing when it targets domain names. The Unicode characters that enable it are known as confusables.

This guide explains how confusable characters work, the specific attack vectors they enable, and the practical techniques available to detect and prevent them.

What Are Confusables?

A confusable (or homoglyph) is a Unicode character that looks identical or nearly identical to a different character, particularly when rendered in common fonts. The Unicode Consortium maintains an official confusables.txt data file listing thousands of these pairs.

Some representative examples:

Intended	Confusable	Code Points	Scripts
a (Latin)	а (Cyrillic)	U+0061 vs U+0430	Latin vs Cyrillic
o (Latin)	о (Cyrillic)	U+006F vs U+043E	Latin vs Cyrillic
e (Latin)	е (Cyrillic)	U+0065 vs U+0435	Latin vs Cyrillic
l (Latin)	І (Cyrillic I)	U+006C vs U+0406	Latin vs Cyrillic
0 (digit)	O (Latin)	U+0030 vs U+004F	Common vs Latin
rn (two chars)	m (one char)	—	Latin vs Latin
ν (Greek nu)	v (Latin)	U+03BD vs U+0076	Greek vs Latin
Κ (Greek kappa)	K (Latin)	U+039A vs U+004B	Greek vs Latin

The threat isn't limited to cross-script lookalikes. Even within the same script, many characters are visually indistinguishable in common sans-serif fonts:

l (lowercase L), I (uppercase i), 1 (digit one)
0 (digit zero), O (uppercase o)
rn (r + n) looks like m at small sizes

Attack Vectors

1. IDN Spoofing (Internationalized Domain Names)

Internationalized Domain Names (IDNs) allow non-ASCII characters in domain names by encoding them as ASCII with Punycode. pаypal.com with a Cyrillic "а" becomes xn--pypal-4ve.com in DNS, but a browser that renders the Unicode form displays it as pаypal.com.

Modern browsers apply IDN homoglyph protection: if a domain mixes characters from different scripts (e.g., Latin and Cyrillic), most browsers will show the Punycode form instead. However, a domain composed entirely of Cyrillic lookalikes passes this check:

аррӏе.com  → all Cyrillic lookalikes → single script, so some browsers still display аррӏе.com
apple.com  → all Latin → legitimate

These are different domains at the DNS level but visually identical in some rendering contexts.
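The Punycode mapping described above can be reproduced with Python's built-in idna codec. This is a minimal sketch for illustration: the codec implements the older IDNA 2003 rules, so it is not a substitute for a full IDN library.

```python
# Latin "paypal.com" versus a lookalike whose first "a" is Cyrillic U+0430.
legit = "paypal.com"
spoof = "p\u0430ypal.com"

# The built-in "idna" codec converts each label to its ASCII (Punycode) form.
legit_ascii = legit.encode("idna")
spoof_ascii = spoof.encode("idna")

print(legit_ascii)   # b'paypal.com'
print(spoof_ascii)   # b'xn--pypal-4ve.com'

# Decoding the ASCII form restores the Unicode label a browser might display.
print(spoof_ascii.decode("idna") == spoof)  # True
```

Comparing the ASCII forms, rather than the rendered text, makes the difference between the two domains obvious.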

2. Username Spoofing

Applications that display usernames, especially in security-sensitive contexts (e.g., "@admin" in a forum, "root" in a shell prompt), are vulnerable:

An attacker registers аdmin (Cyrillic а) and impersonates the admin account.
Notification emails reference the attacker's username; victims don't notice the difference.
Social engineering becomes trivial when two accounts look identical in the UI.
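The sketch below shows why a typical uniqueness check does not stop this. The `registered` set and `is_taken_naive` helper are hypothetical names used only for illustration; the point is that case folding compares code points, and the Cyrillic lookalike has different code points.

```python
import unicodedata

# Hypothetical set of already-registered usernames, stored case-folded.
registered = {"admin", "root"}

def is_taken_naive(username: str) -> bool:
    """Naive uniqueness check: case-insensitive code point comparison."""
    return username.casefold() in registered

attacker = "\u0430dmin"  # first letter is Cyrillic U+0430, not Latin "a"

print(is_taken_naive("Admin"))         # True: case variants are caught
print(is_taken_naive(attacker))        # False: the lookalike slips through
print(unicodedata.name(attacker[0]))   # CYRILLIC SMALL LETTER A
```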

3. Source Code Injection

Unicode confusables can be embedded in source code identifiers to create code that looks correct but does something different. Python 3 allows Unicode identifiers:

# Legitimate function
def calculate_total(price, tax):
    return price + tax

# Attacker's version with Cyrillic confusables in the function name
# Looks identical in many editors
def саlculate_total(price, tax):   # "с" is Cyrillic, not Latin "c"
    return price * tax  # Different logic!

This is particularly dangerous in code review, where reviewers rarely check character-level identity. The Unicode Consortium's guidance on source code security (UTS #55, Unicode Source Code Security) and the Trojan Source research both document these risks.
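One way to surface such identifiers during review is to tokenize the source and flag any name that is not pure ASCII. The following is a rough sketch using the standard library's tokenize module; `non_ascii_identifiers` is a hypothetical helper name, and a project that legitimately uses non-Latin identifiers would need a more selective rule.

```python
import io
import tokenize

def non_ascii_identifiers(source: str) -> list[tuple[int, int, str]]:
    """Return (line, column, name) for every identifier containing non-ASCII characters."""
    hits = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME and not tok.string.isascii():
            hits.append((tok.start[0], tok.start[1], tok.string))
    return hits

sample = (
    "def calculate_total(price, tax):\n"
    "    return price + tax\n"
    "\n"
    "def \u0441alculate_total(price, tax):\n"   # first letter is Cyrillic U+0441
    "    return price * tax\n"
)

for line, col, name in non_ascii_identifiers(sample):
    print(f"line {line}, column {col}: {name!r}")
```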

4. Bidirectional (Bidi) Text Attacks

Unicode supports right-to-left scripts (Arabic, Hebrew) via bidirectional control characters. The Trojan Source attack (CVE-2021-42574) demonstrated that these characters can be used to make code appear to have one structure while its actual syntactic structure is different.

Key bidirectional control characters:

Character	Code Point	Name	Effect
RLO	U+202E	RIGHT-TO-LEFT OVERRIDE	Forces subsequent text right-to-left
LRO	U+202D	LEFT-TO-RIGHT OVERRIDE	Forces subsequent text left-to-right
RLI	U+2067	RIGHT-TO-LEFT ISOLATE	Isolates right-to-left span
PDF	U+202C	POP DIRECTIONAL FORMATTING	Ends a directional override

A simplified Trojan Source example in Python (the invisible control characters are written as \u escapes so that they are visible here):

# What a reviewer sees after bidi rendering (approximately):
#   access_level = "user"
#   if access_level != "user":  # Check if admin
#       print("You are an admin.")

# What the source actually contains:
access_level = "user"
if access_level != "user\u202e \u2066# Check if admin\u2069 \u2066":
    print("You are an admin.")
# The apparent comment is really part of the string literal, so the
# comparison is against a different string and the branch always runs.

GitHub, GitLab, and most modern code hosts now warn about or block bidirectional control characters in source files.

Detection Techniques

Checking Against the Unicode Confusables List

The Unicode Consortium publishes confusables.txt (part of UTS #39), which maps each confusable character to a prototype. Replacing every character in a string with its prototype yields the string's "skeleton", a normalized form shared by all visually similar strings. If two strings have the same skeleton, they are confusable:

# Simplified skeleton computation (real ICU implementation is more complex)
import unicodedata

# The full confusables mapping would be loaded from confusables.txt
# Here we show the principle with a subset
CONFUSABLE_MAP: dict[str, str] = {
    "\u0430": "a",   # Cyrillic а → Latin a
    "\u043E": "o",   # Cyrillic о → Latin o
    "\u0435": "e",   # Cyrillic е → Latin e
    "\u03BD": "v",   # Greek ν → Latin v
}

def skeleton(text: str) -> str:
    '''Compute a simplified confusable skeleton (UTS #39 also re-normalizes after mapping).'''
    nfd = unicodedata.normalize("NFD", text)
    return "".join(CONFUSABLE_MAP.get(c, c) for c in nfd)

legitimate = "paypal"
spoofed    = "p\u0430yp\u0430l"   # Cyrillic а

print(skeleton(legitimate))   # paypal
print(skeleton(spoofed))      # paypal  ← same skeleton!
print(legitimate == spoofed)  # False (different code points)
print(skeleton(legitimate) == skeleton(spoofed))  # True (confusable!)

A full implementation uses the complete confusables.txt mapping and, as specified in UTS #39, applies NFD normalization both before and after the mapping step. ICU provides this as com.ibm.icu.text.SpoofChecker in ICU4J (Java) and through the equivalent uspoof.h API in ICU4C (C/C++).
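As a rough sketch of how the data file could be loaded, the code below parses a few sample lines written in the confusables.txt layout (source ; target ; type, followed by a comment) and computes a skeleton with the NFD, map, NFD steps that UTS #39 describes. `SAMPLE_DATA` and `parse_confusables` are illustrative names, the sample is a tiny stand-in for the real file, and the sketch omits the removal of default-ignorable code points.

```python
import unicodedata

# A few lines in the confusables.txt layout. A real loader would read the
# full file published alongside UTS #39 instead of this small sample.
SAMPLE_DATA = """\
# source ;	target ;	type	# comment
0430 ;	0061 ;	MA	# CYRILLIC SMALL LETTER A -> LATIN SMALL LETTER A
043E ;	006F ;	MA	# CYRILLIC SMALL LETTER O -> LATIN SMALL LETTER O
006D ;	0072 006E ;	MA	# LATIN SMALL LETTER M -> r, n (multi-character target)
"""

def parse_confusables(data: str) -> dict[str, str]:
    """Build a source-character -> prototype-string mapping."""
    mapping = {}
    for line in data.splitlines():
        line = line.split("#", 1)[0].strip()   # drop comments and blank lines
        if not line:
            continue
        source, target = (field.strip() for field in line.split(";")[:2])
        src = "".join(chr(int(cp, 16)) for cp in source.split())
        dst = "".join(chr(int(cp, 16)) for cp in target.split())
        mapping[src] = dst
    return mapping

def skeleton(text: str, mapping: dict[str, str]) -> str:
    """NFD-normalize, map each character to its prototype, then NFD again."""
    nfd = unicodedata.normalize("NFD", text)
    mapped = "".join(mapping.get(ch, ch) for ch in nfd)
    return unicodedata.normalize("NFD", mapped)

mapping = parse_confusables(SAMPLE_DATA)
print(skeleton("p\u0430yp\u0430l", mapping) == skeleton("paypal", mapping))  # True
print(skeleton("modern", mapping) == skeleton("rnodern", mapping))            # True
```

Note that a skeleton is only meant for comparison: as the second example shows, it can expand characters, so it should never be displayed or stored as the user's chosen name.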

Script Mixing Detection

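A simple heuristic: if a string mixes characters from multiple scripts (e.g., Latin and Cyrillic), it's suspicious. Unicode assigns every character a Script property. Python's unicodedata module does not expose that property, so the sketch below approximates the script from each character's name: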

import unicodedata

def get_scripts(text: str) -> set[str]:
    scripts: set[str] = set()
    for char in text:
        name = unicodedata.name(char, "")
        if "LATIN" in name:
            scripts.add("Latin")
        elif "CYRILLIC" in name:
            scripts.add("Cyrillic")
        elif "GREEK" in name:
            scripts.add("Greek")
        elif "ARABIC" in name:
            scripts.add("Arabic")
        # ... etc
    return scripts

def is_mixed_script(text: str) -> bool:
    scripts = get_scripts(text)
    # Allow mixing with "Common" script (digits, punctuation)
    pure_scripts = scripts - {"Common", "Inherited"}
    return len(pure_scripts) > 1

print(is_mixed_script("paypal"))       # False
print(is_mixed_script("p\u0430ypal")) # True (Latin + Cyrillic)

For production use, fontTools.unicodedata exposes the full Unicode script data, and the PyICU package wraps ICU's comprehensive SpoofChecker.
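Script mixing detection alone misses whole-script spoofs such as the all-Cyrillic аррӏе.com shown earlier, because every character belongs to a single script. One possible complement, sketched below, is to fold known lookalikes to ASCII and compare the result against a list of names worth protecting. The mapping is a hand-picked subset and `PROTECTED_NAMES` is a hypothetical list; both are assumptions for illustration.

```python
# Hand-picked Cyrillic -> Latin lookalikes for illustration only; a real check
# would use the full confusables.txt data or ICU's SpoofChecker.
CYRILLIC_LOOKALIKES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u04cf": "l",  # CYRILLIC SMALL LETTER PALOCHKA
}

# Hypothetical list of names worth protecting.
PROTECTED_NAMES = {"apple", "paypal"}

def impersonates_protected(label: str) -> bool:
    """True if a non-ASCII label collapses onto a protected ASCII name."""
    if label.isascii():
        return False  # the genuine name, or an unrelated ASCII label
    folded = "".join(CYRILLIC_LOOKALIKES.get(ch, ch) for ch in label)
    return folded in PROTECTED_NAMES

all_cyrillic = "\u0430\u0440\u0440\u04cf\u0435"  # renders like "apple"
print(impersonates_protected(all_cyrillic))          # True
print(impersonates_protected("apple"))               # False
print(impersonates_protected("\u043c\u0438\u0440"))  # False: ordinary Cyrillic word
```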

Detecting Bidirectional Control Characters

BIDI_CONTROL_CHARS = {
    "\u200e",  # LEFT-TO-RIGHT MARK
    "\u200f",  # RIGHT-TO-LEFT MARK
    "\u202a",  # LEFT-TO-RIGHT EMBEDDING
    "\u202b",  # RIGHT-TO-LEFT EMBEDDING
    "\u202c",  # POP DIRECTIONAL FORMATTING
    "\u202d",  # LEFT-TO-RIGHT OVERRIDE
    "\u202e",  # RIGHT-TO-LEFT OVERRIDE
    "\u2066",  # LEFT-TO-RIGHT ISOLATE
    "\u2067",  # RIGHT-TO-LEFT ISOLATE
    "\u2068",  # FIRST STRONG ISOLATE
    "\u2069",  # POP DIRECTIONAL ISOLATE
}

def contains_bidi_controls(text: str) -> bool:
    return any(c in BIDI_CONTROL_CHARS for c in text)

def strip_bidi_controls(text: str) -> str:
    return "".join(c for c in text if c not in BIDI_CONTROL_CHARS)

// The same check in JavaScript
function hasBidiControls(str) {
    return /[\u200e\u200f\u202a-\u202e\u2066-\u2069]/.test(str);
}
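For code review or CI output, a yes/no answer is usually less helpful than a location. The sketch below, with the hypothetical helper `find_bidi_controls`, reports the line, column, and Unicode name of each control character. It covers the same code points as the set above.

```python
import unicodedata

# Bidi controls: U+200E-U+200F, U+202A-U+202E, U+2066-U+2069.
BIDI_CONTROLS = {
    chr(cp)
    for cp in (0x200E, 0x200F, *range(0x202A, 0x202F), *range(0x2066, 0x206A))
}

def find_bidi_controls(source: str) -> list[tuple[int, int, str]]:
    """Return (line, column, character name) per bidi control; lines are 1-based."""
    findings = []
    for line_no, line in enumerate(source.split("\n"), start=1):
        for col, ch in enumerate(line):
            if ch in BIDI_CONTROLS:
                findings.append((line_no, col, unicodedata.name(ch)))
    return findings

sample = (
    'access_level = "user"\n'
    'if access_level != "user\u202e \u2066# Check if admin\u2069 \u2066":\n'
    "    pass\n"
)

for line_no, col, name in find_bidi_controls(sample):
    print(f"{line_no}:{col} {name}")
```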

Python `unicodedata` Security Checks

import unicodedata

def is_identifier_safe(name: str) -> bool:
    '''Check if an identifier uses only ASCII-range characters.'''
    return name.isascii()  # str.isascii() requires Python 3.7+

# Reuses get_scripts, is_mixed_script, and contains_bidi_controls defined above.
def audit_identifier(name: str) -> dict:
    return {
        "value": name,
        "is_ascii": is_identifier_safe(name),
        "scripts": get_scripts(name),
        "mixed_script": is_mixed_script(name),
        "has_bidi": contains_bidi_controls(name),
        "nfc_form": unicodedata.normalize("NFC", name),
        "code_points": [f"U+{ord(c):04X} {unicodedata.name(c, 'UNKNOWN')}" for c in name],
    }

Mitigations for Application Developers

Layer	Mitigation	Notes
Input validation	Reject non-ASCII in identifiers/usernames	Strict but safe for most Western apps
Input validation	Reject mixed-script identifiers	Still permits single-script non-Latin names, which suits global apps
Input validation	Strip bidi control characters	Safe for most text
Storage	NFKC normalize usernames	Flatten compatibility variants
Display	Render Unicode identifiers in monospace font	Helps distinguish same-script lookalikes such as l, I, and 1; does not separate cross-script homoglyphs
Display	Show Punycode for IDNs from mixed scripts	Follow browser behavior
Security	Use ICU SpoofChecker for authentication paths	Battle-tested implementation
Code review	Configure editors to highlight non-ASCII and control characters	VS Code: `editor.unicodeHighlight.nonBasicASCII` and `editor.renderControlCharacters`
CI/CD	Add linter rules for non-ASCII identifiers and ambiguous characters	`ruff` rules `PLC2401` (non-ASCII name) and `RUF001` to `RUF003` (ambiguous Unicode in strings, docstrings, and comments)
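To make the storage-layer row concrete, here is a small sketch of NFKC normalization combined with case folding. `canonical_username` is a hypothetical helper name. The last line shows the limit of this mitigation: NFKC flattens compatibility variants such as fullwidth letters and ligatures, but it does not map Cyrillic lookalikes to Latin, so it complements skeleton comparison without replacing it.

```python
import unicodedata

def canonical_username(raw: str) -> str:
    """NFKC-normalize, then case-fold, before storing or comparing."""
    return unicodedata.normalize("NFKC", raw).casefold()

print(canonical_username("\uff21\uff44\uff4d\uff49\uff4e"))  # admin (fullwidth letters)
print(canonical_username("\ufb01le"))                        # file (fi ligature)
print(canonical_username("\u0430dmin") == "admin")           # False: Cyrillic a is untouched
```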

Key Takeaways

Confusables are Unicode characters that look identical or nearly identical to other characters, often from different scripts (e.g., Cyrillic "а" vs Latin "a").
The main attack vectors are: IDN spoofing (fake domains), username impersonation, source code injection (Trojan identifiers), and bidi text attacks (Trojan Source).
The Unicode Consortium publishes the official confusables.txt data file mapping confusable characters to a normalized "skeleton".
Detection strategies include: skeleton comparison, script mixing detection, and bidi control character scanning.
For production: use the ICU SpoofChecker (available via PyICU in Python) for comprehensive confusable detection on security-sensitive paths.
Always strip or reject bidi control characters (U+202A–U+202E, U+2066–U+2069) from source code, identifiers, and user-visible names.