Unicode Confusables: A Security Guide

Unicode confusables are characters that look identical or nearly identical to others, enabling homograph attacks where a malicious URL or username appears legitimate. This guide explains what confusables are, how attackers exploit them, and how to detect and prevent confusable-based spoofing.

In 2005, the registration of pаypal.com (with a Cyrillic "а", U+0430, instead of a Latin "a", U+0061) publicly demonstrated a class of attack that Unicode's success had inadvertently enabled. The domain looked identical to paypal.com in most browsers of the era. This class of attack, which uses visually similar characters from different scripts to impersonate legitimate identifiers, is called a homograph (or homoglyph) attack, or IDN spoofing when it targets domain names. The Unicode characters that enable it are known as confusables.

This guide explains how confusable characters work, the specific attack vectors they enable, and the practical techniques available to detect and prevent them.

What Are Confusables?

A confusable (or homoglyph) is a Unicode character that looks identical or nearly identical to a different character, particularly when rendered in common fonts. The Unicode Consortium maintains an official confusables.txt data file listing thousands of these pairs.

Some representative examples:

| Intended | Confusable | Code Points | Scripts |
| --- | --- | --- | --- |
| a (Latin) | а (Cyrillic) | U+0061 vs U+0430 | Latin vs Cyrillic |
| o (Latin) | о (Cyrillic) | U+006F vs U+043E | Latin vs Cyrillic |
| e (Latin) | е (Cyrillic) | U+0065 vs U+0435 | Latin vs Cyrillic |
| l (Latin) | І (Cyrillic I) | U+006C vs U+0406 | Latin vs Cyrillic |
| 0 (digit) | O (Latin) | U+0030 vs U+004F | Digits vs Latin |
| rn (two chars) | m (one char) | U+0072 U+006E vs U+006D | Latin vs Latin |
| ν (Greek nu) | v (Latin) | U+03BD vs U+0076 | Greek vs Latin |
| Κ (Greek kappa) | K (Latin) | U+039A vs U+004B | Greek vs Latin |
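
Because confusable pairs render alike, the dependable way to tell them apart is to compare code points rather than glyphs. The following is a minimal sketch using only the standard library; the helper names `describe` and `diff_code_points` are illustrative assumptions, not part of any Unicode tooling.

```python
import unicodedata

def describe(ch: str) -> str:
    """Return a 'U+XXXX NAME' label for a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}"

def diff_code_points(a: str, b: str) -> list[tuple[int, str, str]]:
    """List the positions where two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have the same number of code points")
    return [
        (i, describe(x), describe(y))
        for i, (x, y) in enumerate(zip(a, b))
        if x != y
    ]

# The second string contains Cyrillic U+0430 in place of Latin "a".
for pos, left, right in diff_code_points("paypal", "p\u0430ypal"):
    print(pos, left, "vs", right)
# 1 U+0061 LATIN SMALL LETTER A vs U+0430 CYRILLIC SMALL LETTER A
```

The comparison is positional, so it only suits strings that already look the same length; multi-character lookalikes such as rn vs m need a skeleton-style comparison instead.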

The threat isn't limited to cross-script lookalikes. Even within the same script, many characters are visually indistinguishable in common sans-serif fonts:

  • l (lowercase L), I (uppercase i), 1 (digit one)
  • 0 (digit zero), O (uppercase o)
  • rn (r + n) looks like m at small sizes

Attack Vectors

1. IDN Spoofing (Internationalized Domain Names)

Internationalized Domain Names (IDNs) allow non-ASCII characters in domain names by encoding them as Punycode. pаypal.com with a Cyrillic "а" becomes xn--pypal-4ve.com in DNS, but a browser that simply decodes the name for display shows it as pаypal.com.
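
The Punycode mapping can be reproduced with Python's built-in "idna" codec. This is a sketch for illustration only: the codec implements the older IDNA 2003 rules, and the helper `to_display_safe` is a hypothetical name for one conservative display policy (show the ASCII form whenever a domain is not pure ASCII).

```python
# The second letter of the spoofed name is Cyrillic U+0430, not Latin "a".
spoofed = "p\u0430ypal.com"

ace = spoofed.encode("idna")           # ASCII-compatible form used in DNS
print(ace)                             # b'xn--pypal-4ve.com'
print(ace.decode("idna") == spoofed)   # True

def to_display_safe(domain: str) -> str:
    """Return the Punycode form whenever the domain is not pure ASCII."""
    if domain.isascii():
        return domain
    return domain.encode("idna").decode("ascii")

print(to_display_safe("paypal.com"))   # paypal.com
print(to_display_safe(spoofed))        # xn--pypal-4ve.com
```

Always showing Punycode is stricter than what browsers do, and it makes legitimate non-Latin domains unreadable, so it is a trade-off rather than a general recommendation.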

Modern browsers apply IDN homograph protection: if a domain mixes characters from different scripts (e.g., Latin and Cyrillic), most browsers will show the Punycode form instead. However, a domain composed entirely of lookalikes from a single script passes a mixed-script check, so browsers need additional whole-script rules to catch it:

аррӏе.com  → all Cyrillic lookalikes → a mixed-script check alone displays it as аррӏе.com
apple.com  → all Latin → legitimate

These are different domains at the DNS level but visually identical in some rendering contexts.
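
One way to catch a whole-script spoof is to ask whether a non-ASCII label turns into pure ASCII once every lookalike is folded to its Latin counterpart. The sketch below assumes a small hand-picked mapping (`LATIN_LOOKALIKES`) chosen for illustration; it is not an excerpt of confusables.txt, and `looks_like_ascii` is a hypothetical helper name.

```python
from typing import Optional

# Hand-picked illustrative subset; a real check would load confusables.txt.
LATIN_LOOKALIKES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u04cf": "l",  # CYRILLIC SMALL LETTER PALOCHKA
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
}

def looks_like_ascii(label: str) -> Optional[str]:
    """Return the ASCII string a non-ASCII label imitates, or None."""
    if label.isascii():
        return None  # already ASCII, so there is nothing to imitate
    folded = "".join(LATIN_LOOKALIKES.get(c, c) for c in label)
    return folded if folded.isascii() else None

print(looks_like_ascii("\u0430\u0440\u0440\u04cf\u0435"))        # apple
print(looks_like_ascii("apple"))                                  # None
print(looks_like_ascii("\u043f\u0440\u0438\u0432\u0435\u0442"))  # None
```

The last call passes the ordinary Russian word привет, which contains letters with no Latin lookalike, so it is not flagged. A label that does fold to ASCII could then be compared against a list of protected names.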

2. Username Spoofing

Applications that display usernames, especially in security-sensitive contexts (e.g., "@admin" in a forum, "root" in a shell prompt), are vulnerable:

  • An attacker registers аdmin (Cyrillic а) and impersonates the admin account.
  • Notification emails reference the attacker's username; victims don't notice the difference.
  • Social engineering becomes trivial when two accounts look identical in the UI.
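
A registration-time check can block this kind of impersonation by indexing accounts by their skeleton rather than by their raw name. The sketch below is a simplified illustration: `PROTOTYPES` is a tiny assumed subset of lookalike mappings, and `UsernameRegistry` is a hypothetical in-memory stand-in for a real user store.

```python
import unicodedata

# Illustrative subset only; production code would load the full confusables.txt.
PROTOTYPES = {"\u0430": "a", "\u043e": "o", "\u0435": "e", "\u0456": "i"}

def skeleton(name: str) -> str:
    """Simplified skeleton: NFD, map lookalikes to prototypes, NFD again."""
    nfd = unicodedata.normalize("NFD", name)
    mapped = "".join(PROTOTYPES.get(c, c) for c in nfd)
    return unicodedata.normalize("NFD", mapped)

class UsernameRegistry:
    """Hypothetical registry that rejects visually colliding usernames."""

    def __init__(self) -> None:
        self._by_skeleton: dict[str, str] = {}

    def register(self, name: str) -> bool:
        key = skeleton(name.casefold())
        if key in self._by_skeleton:
            return False  # collides with an existing account
        self._by_skeleton[key] = name
        return True

registry = UsernameRegistry()
print(registry.register("admin"))        # True
print(registry.register("\u0430dmin"))   # False: Cyrillic U+0430 collides with "admin"
print(registry.register("alice"))        # True
```

Keying on the skeleton means the first account to claim a visual form wins, which is usually the desired behavior for reserved names such as "admin".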

3. Source Code Injection

Unicode confusables can be embedded in source code identifiers to create code that looks correct but does something different. Python 3 allows Unicode identifiers:

# Legitimate function
def calculate_total(price, tax):
    return price + tax

# Attacker's version with Cyrillic confusables in the function name
# Looks identical in many editors
def саlculate_total(price, tax):   # "с" is Cyrillic, not Latin "c"
    return price * tax  # Different logic!

This is particularly dangerous in code reviews, where reviewers rarely check character-level identity. The Unicode Consortium's guidance on source code handling (UTS #55) and the Trojan Source research both document these risks.
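
A review aid for this case is to list every identifier that contains non-ASCII characters. The sketch below uses the standard-library `tokenize` module so that strings and comments are ignored; the function name `non_ascii_identifiers` is an assumption made for this example.

```python
import io
import tokenize

def non_ascii_identifiers(source: str) -> list[tuple[int, str]]:
    """Return (line number, identifier) for each name token with non-ASCII characters."""
    hits = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME and not tok.string.isascii():
            hits.append((tok.start[0], tok.string))
    return hits

sample = (
    "def calculate_total(price, tax):\n"
    "    return price + tax\n"
    "def \u0441alculate_total(price, tax):\n"  # first letter is Cyrillic U+0441
    "    return price * tax\n"
)

for line_no, name in non_ascii_identifiers(sample):
    # Print the escaped form so the hidden character is visible in logs.
    print(line_no, name.encode("unicode_escape").decode("ascii"))
# 3 \u0441alculate_total
```

Projects that intentionally use non-Latin identifiers would need an allowlist or a confusable check instead of a blanket ASCII rule.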

4. Bidirectional (Bidi) Text Attacks

Unicode supports right-to-left scripts (Arabic, Hebrew) via bidirectional control characters. The Trojan Source attack (CVE-2021-42574) demonstrated that these characters can be used to make code appear to have one structure while its actual syntactic structure is different.

Key bidirectional control characters:

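| Character | Code Point | Name | Effect |
| --- | --- | --- | --- |
| RLO | U+202E | RIGHT-TO-LEFT OVERRIDE | Forces subsequent text right-to-left |
| LRO | U+202D | LEFT-TO-RIGHT OVERRIDE | Forces subsequent text left-to-right |
| LRI | U+2066 | LEFT-TO-RIGHT ISOLATE | Isolates a left-to-right span |
| RLI | U+2067 | RIGHT-TO-LEFT ISOLATE | Isolates a right-to-left span |
| PDF | U+202C | POP DIRECTIONAL FORMATTING | Ends a directional embedding or override |
| PDI | U+2069 | POP DIRECTIONAL ISOLATE | Ends a directional isolate |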

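A simplified Trojan Source example in Python. The control characters are written as escapes here so they stay visible; in a real attack they appear in the file as raw, invisible characters:

# What a reviewer sees when the raw controls are rendered (roughly):
#   access_level = "user"  # Check if admin
#
# What the string actually contains (U+202E RLO, U+2066 LRI, U+2069 PDI):
access_level = "user\u202e \u2066# Check if admin\u2069 \u2066"
# The text "# Check if admin" is part of the string literal. The bidi
# controls only reorder the display so it looks like a trailing comment.
if access_level != "user":
    # This branch runs: the value is not the plain string "user".
    print("privileged branch")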

GitHub, GitLab, and most modern code hosts now warn about or block bidirectional control characters in source files.
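
A similar warning can be produced locally. The sketch below locates explicit bidi formatting characters by their Bidi_Class, which `unicodedata.bidirectional` exposes, instead of a hard-coded list of code points. The function name `find_bidi_controls` is an assumption for this example, and the approach does not flag the implicit marks U+200E and U+200F, whose classes are plain L and R.

```python
import unicodedata

# Bidi classes of the explicit embedding, override, and isolate controls.
EXPLICIT_BIDI_CLASSES = {
    "LRE", "RLE", "LRO", "RLO", "PDF", "LRI", "RLI", "FSI", "PDI",
}

def find_bidi_controls(source: str) -> list[tuple[int, int, str]]:
    """Return (line, column, character name) for each explicit bidi control."""
    hits = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if unicodedata.bidirectional(ch) in EXPLICIT_BIDI_CLASSES:
                hits.append((line_no, col, unicodedata.name(ch)))
    return hits

# A line carrying raw controls, built from escapes so they are visible here.
sample = 'access_level = "user\u202e \u2066# Check if admin\u2069 \u2066"\n'
for line_no, col, name in find_bidi_controls(sample):
    print(f"line {line_no}, column {col}: {name}")
```

Reporting the line and column makes the finding actionable in a pre-commit hook or CI job, where a bare yes/no answer is hard to act on.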

Detection Techniques

Checking Against the Unicode Confusables List

The Unicode Consortium publishes confusables.txt, which maps each confusable character to a prototype. Applying that mapping to a whole string yields its "skeleton", a normalized form that all visually similar strings share. If two strings have the same skeleton, they are confusable:

# Simplified skeleton computation (real ICU implementation is more complex)
import unicodedata

# The full confusables mapping would be loaded from confusables.txt
# Here we show the principle with a subset
CONFUSABLE_MAP: dict[str, str] = {
    "\u0430": "a",   # Cyrillic а → Latin a
    "\u043E": "o",   # Cyrillic о → Latin o
    "\u0435": "e",   # Cyrillic е → Latin e
    "\u03BD": "v",   # Greek ν → Latin v
}

def skeleton(text: str) -> str:
    '''Compute a simplified confusable skeleton: NFD, map, NFD again.'''
    nfd = unicodedata.normalize("NFD", text)
    mapped = "".join(CONFUSABLE_MAP.get(c, c) for c in nfd)
    return unicodedata.normalize("NFD", mapped)

legitimate = "paypal"
spoofed    = "p\u0430yp\u0430l"   # Cyrillic а

print(skeleton(legitimate))   # paypal
print(skeleton(spoofed))      # paypal  ← same skeleton!
print(legitimate == spoofed)  # False (different code points)
print(skeleton(legitimate) == skeleton(spoofed))  # True (confusable!)

The full algorithm, specified in UTS #39, uses the complete confusables.txt mapping, applies NFD normalization before and after the mapping, and removes default-ignorable code points. The ICU library provides com.ibm.icu.text.SpoofChecker (Java), and ICU4C offers equivalent functionality through its uspoof C API.
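
Loading the real data file is mostly a matter of parsing its semicolon-separated fields: a source code point, a target sequence of one or more code points, and a type tag, followed by a comment. The sketch below parses a few sample lines written in that layout for illustration (they are not a verbatim excerpt of the file); `parse_confusables` is a hypothetical helper name.

```python
# Sample lines in the confusables.txt field layout (illustrative, not verbatim):
#   source ; target sequence ; type # comment
SAMPLE = """\
# comment lines and blank lines are skipped

0430 ; 0061 ; MA # CYRILLIC SMALL LETTER A -> LATIN SMALL LETTER A
043E ; 006F ; MA # CYRILLIC SMALL LETTER O -> LATIN SMALL LETTER O
006D ; 0072 006E ; MA # LATIN SMALL LETTER M -> LATIN SMALL LETTER R, N
"""

def parse_confusables(text: str) -> dict[str, str]:
    """Build a {source character: prototype string} map."""
    mapping: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        if len(fields) < 2:
            continue  # not a data line
        source = chr(int(fields[0], 16))
        # The target may be several code points separated by spaces.
        target = "".join(chr(int(cp, 16)) for cp in fields[1].split())
        mapping[source] = target
    return mapping

table = parse_confusables(SAMPLE)
print(table["\u0430"])  # a
print(table["m"])       # rn
```

When reading the published file from disk, opening it with encoding="utf-8-sig" is a reasonable precaution so that a leading byte order mark does not end up in the first field.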

Script Mixing Detection

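A simple heuristic: if a string mixes characters from multiple scripts (e.g., Latin and Cyrillic), it's suspicious. Unicode assigns every character a Script property, but Python's unicodedata module does not expose it, so the sketch below approximates the script from each character's name: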

import unicodedata

def get_scripts(text: str) -> set[str]:
    scripts: set[str] = set()
    for char in text:
        name = unicodedata.name(char, "")
        if "LATIN" in name:
            scripts.add("Latin")
        elif "CYRILLIC" in name:
            scripts.add("Cyrillic")
        elif "GREEK" in name:
            scripts.add("Greek")
        elif "ARABIC" in name:
            scripts.add("Arabic")
        # ... etc
    return scripts

def is_mixed_script(text: str) -> bool:
    scripts = get_scripts(text)
    # Allow mixing with "Common" script (digits, punctuation)
    pure_scripts = scripts - {"Common", "Inherited"}
    return len(pure_scripts) > 1

print(is_mixed_script("paypal"))       # False
print(is_mixed_script("p\u0430ypal")) # True (Latin + Cyrillic)

For production use, the fontTools library provides access to full Unicode script data, and the PyICU package wraps ICU, including its comprehensive SpoofChecker.

Detecting Bidirectional Control Characters

BIDI_CONTROL_CHARS = {
    "\u200e",  # LEFT-TO-RIGHT MARK
    "\u200f",  # RIGHT-TO-LEFT MARK
    "\u202a",  # LEFT-TO-RIGHT EMBEDDING
    "\u202b",  # RIGHT-TO-LEFT EMBEDDING
    "\u202c",  # POP DIRECTIONAL FORMATTING
    "\u202d",  # LEFT-TO-RIGHT OVERRIDE
    "\u202e",  # RIGHT-TO-LEFT OVERRIDE
    "\u2066",  # LEFT-TO-RIGHT ISOLATE
    "\u2067",  # RIGHT-TO-LEFT ISOLATE
    "\u2068",  # FIRST STRONG ISOLATE
    "\u2069",  # POP DIRECTIONAL ISOLATE
}

def contains_bidi_controls(text: str) -> bool:
    return any(c in BIDI_CONTROL_CHARS for c in text)

def strip_bidi_controls(text: str) -> str:
    return "".join(c for c in text if c not in BIDI_CONTROL_CHARS)

// The same check in JavaScript
function hasBidiControls(str) {
    return /[\u200e\u200f\u202a-\u202e\u2066-\u2069]/.test(str);
}

Python unicodedata Security Checks

import unicodedata

def is_identifier_safe(name: str) -> bool:
    '''Check if an identifier uses only ASCII-range characters.'''
    # str.isascii() (Python 3.7+) avoids the encode/except round trip
    return name.isascii()

def audit_identifier(name: str) -> dict:
    return {
        "value": name,
        "is_ascii": is_identifier_safe(name),
        "scripts": get_scripts(name),
        "mixed_script": is_mixed_script(name),
        "has_bidi": contains_bidi_controls(name),
        "nfc_form": unicodedata.normalize("NFC", name),
        "code_points": [f"U+{ord(c):04X} {unicodedata.name(c, 'UNKNOWN')}" for c in name],
    }

Mitigations for Application Developers

| Layer | Mitigation | Notes |
| --- | --- | --- |
| Input validation | Reject non-ASCII in identifiers/usernames | Strict but safe for most Western apps |
| Input validation | Script mixing detection | Allow legitimate script combinations in global apps (e.g., Japanese mixes Han, Hiragana, and Katakana) |
| Input validation | Strip bidi control characters | Safe for most text |
| Storage | NFKC-normalize usernames | Flattens compatibility variants |
| Display | Render Unicode identifiers in a monospace font | Helps distinguish lookalikes |
| Display | Show Punycode for mixed-script IDNs | Follows browser behavior |
| Security | Use ICU SpoofChecker on authentication paths | Battle-tested implementation |
| Code review | Configure editors to show control and non-ASCII characters | VS Code: editor.renderControlCharacters: true |
| CI/CD | Add a linter rule: no non-ASCII in identifiers | e.g., Pylint's non-ascii-name check; Ruff RUF001 to RUF003 cover ambiguous Unicode in strings, docstrings, and comments |
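
The storage and input-validation rows can be combined into a single canonicalization step applied before a username is stored or compared. The sketch below is one possible ordering under stated assumptions (strip bidi controls, apply NFKC, then case-fold); the function name `canonical_username` is hypothetical, and the last line shows that NFKC alone does not fold cross-script confusables.

```python
import unicodedata

BIDI_CONTROLS = set(
    "\u200e\u200f\u202a\u202b\u202c\u202d\u202e\u2066\u2067\u2068\u2069"
)

def canonical_username(raw: str) -> str:
    """Strip bidi controls, apply NFKC, then case-fold."""
    visible = "".join(c for c in raw if c not in BIDI_CONTROLS)
    return unicodedata.normalize("NFKC", visible).casefold()

# Fullwidth letters are compatibility variants that NFKC maps to ASCII.
print(canonical_username("\uff21\uff44\uff4d\uff49\uff4e"))  # admin
# An embedded RIGHT-TO-LEFT OVERRIDE is removed.
print(canonical_username("ad\u202emin"))                     # admin
# Cyrillic U+0430 survives NFKC, so a skeleton check is still needed.
print(canonical_username("\u0430dmin") == "admin")           # False
```

Storing the canonical form alongside the display form keeps lookups consistent while still letting the UI show what the user typed.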

Key Takeaways

  • Confusables are Unicode characters that look visually identical or nearly identical to other characters, often from different scripts (e.g., Cyrillic "а" vs Latin "a").
  • The main attack vectors are: IDN spoofing (fake domains), username impersonation, source code injection (Trojan identifiers), and bidi text attacks (Trojan Source).
  • The Unicode Consortium publishes the official confusables.txt data file, which maps confusable characters to prototypes from which a normalized "skeleton" is computed.
  • Detection strategies include: skeleton comparison, script mixing detection, and bidi control character scanning.
  • For production: use the ICU SpoofChecker (available via PyICU in Python) for comprehensive confusable detection on security-sensitive paths.
  • Always strip or reject bidi control characters (U+202A–U+202E, U+2066–U+2069) from source code, identifiers, and user-visible names.
