The Unicode Odyssey · Chapter 8
Security: The Dark Side of Unicode
Unicode's power can be exploited. Homoglyph attacks, bidirectional text abuse, zero-width character injection, and IDN spoofing — this chapter explores the security risks and how to defend against them.
Every powerful system has a shadow. Unicode's power — representing every human writing system, providing rich character composition, supporting bidirectional text — creates an attack surface that security researchers have been documenting for decades. The vulnerabilities are subtle, the attacks often invisible to the naked eye, and the consequences can range from phishing success to complete system compromise. This chapter maps the dark side.
Homograph Attacks: Characters That Look Identical
The most classic Unicode security attack exploits the fact that characters from different scripts can look visually indistinguishable to human readers. The homograph attack (also called homoglyph attack) substitutes one script's character for another's to deceive users.
The canonical example: the Cyrillic letter "а" (U+0430, CYRILLIC SMALL LETTER A) is visually identical to the Latin "a" (U+0061, LATIN SMALL LETTER A) in most fonts. To a user reading "аpple.com" in their browser's address bar, there is no visible difference — but one domain is Cyrillic-а and the other is Latin-a. A malicious actor registers the Cyrillic variant and creates a perfect phishing site.
Common homograph character pairs:
| Lookalike Char | Unicode | Legitimate | Unicode |
|---|---|---|---|
| а (Cyrillic a) | U+0430 | a (Latin) | U+0061 |
| е (Cyrillic e) | U+0435 | e (Latin) | U+0065 |
| о (Cyrillic o) | U+043E | o (Latin) | U+006F |
| р (Cyrillic er) | U+0440 | p (Latin) | U+0070 |
| с (Cyrillic es) | U+0441 | c (Latin) | U+0063 |
| ν (Greek nu) | U+03BD | v (Latin) | U+0076 |
| Ι (Greek capital iota) | U+0399 | I (Latin) | U+0049 |
| ℓ (script small l) | U+2113 | l (Latin) | U+006C |
Several letters in a word can be substituted at once: "соm" written with Cyrillic с and о appears identical to "com" (lowercase m has no close Cyrillic lookalike in upright fonts, so it stays Latin). A URL such as https://microsоft.соm containing Cyrillic characters would appear legitimate to most users.
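One way to catch substitutions like these is to check whether a string mixes scripts. The sketch below is a rough heuristic, not a full implementation: Python's standard library does not expose the Unicode Script property, so the helper (a hypothetical `scripts_used`) treats the first word of each letter's Unicode name as its script.

```python
import unicodedata

def scripts_used(text: str) -> set:
    """Approximate the scripts in `text` from Unicode character names.

    Heuristic only (an assumption of this sketch): the first word of a
    letter's name, such as LATIN or CYRILLIC, stands in for the real
    Script property.
    """
    scripts = set()
    for ch in text:
        if ch.isalpha():
            # e.g. "CYRILLIC SMALL LETTER O" -> "CYRILLIC"
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts

def is_mixed_script(text: str) -> bool:
    """True when letters from more than one (approximate) script appear."""
    return len(scripts_used(text)) > 1

genuine = "microsoft"
spoofed = "micros\u043Eft"  # U+043E CYRILLIC SMALL LETTER O replaces Latin o

print(scripts_used(genuine))     # {'LATIN'}
print(is_mixed_script(spoofed))  # True
```

A check like this only catches mixed-script strings. A label written entirely in one lookalike script passes it, which is why confusables data is also needed.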
IDN Homograph Attacks
Internationalized Domain Names (IDNs) allow non-ASCII characters in domain names. In DNS they are stored as ASCII labels encoded with Punycode and prefixed with "xn--". The domain "аpple.com" (with Cyrillic а) becomes "xn--pple-43d.com", but browsers may display the Unicode form.
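The conversion can be reproduced with Python's built-in `idna` codec. This is a small sketch: the codec implements the older IDNA 2003 rules, which is enough to illustrate the encoding, and `looks_like_idn` is a hypothetical helper name.

```python
# First letter is U+0430 CYRILLIC SMALL LETTER A, not Latin "a".
unicode_domain = "\u0430pple.com"

# Encode each label to its ASCII-compatible form (IDNA 2003 rules).
ascii_domain = unicode_domain.encode("idna").decode("ascii")
print(ascii_domain)  # xn--pple-43d.com

def looks_like_idn(domain: str) -> bool:
    """True if any label carries the "xn--" ASCII-compatible prefix."""
    return any(label.startswith("xn--") for label in domain.lower().split("."))

print(looks_like_idn(ascii_domain))  # True
print(looks_like_idn("apple.com"))   # False
```

Showing the `xn--` form instead of the decoded Unicode form is what makes the substitution visible to the user.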
Browsers have developed mitigation strategies:
- Show punycode instead of Unicode if the domain mixes scripts
- Maintain lists of confusable characters
- Some browsers show a warning for potentially deceptive IDNs
The Unicode Consortium publishes confusables data (the confusables.txt file that accompanies UTS #39, Unicode Security Mechanisms), which maps each character to a prototype shared by the characters it could be confused with. Security libraries use this data to detect potential homograph attacks.
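The idea behind confusables matching can be sketched with a "skeleton" comparison: map every lookalike to its prototype, then compare the results. The mapping below is a hand-picked subset taken from the table earlier in this chapter, purely for illustration. A real implementation would load the full confusables data, and the function names here are assumptions.

```python
import unicodedata

# Illustrative subset only (assumption): lookalike -> Latin prototype.
CONFUSABLES = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043E": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u03BD": "v",  # GREEK SMALL LETTER NU
    "\u0399": "I",  # GREEK CAPITAL LETTER IOTA
    "\u2113": "l",  # SCRIPT SMALL L
}

def skeleton(text: str) -> str:
    """Decompose, replace lookalikes with prototypes, decompose again."""
    decomposed = unicodedata.normalize("NFD", text)
    mapped = "".join(CONFUSABLES.get(ch, ch) for ch in decomposed)
    return unicodedata.normalize("NFD", mapped)

def is_confusable(a: str, b: str) -> bool:
    """Different strings that collapse to the same skeleton."""
    return a != b and skeleton(a) == skeleton(b)

print(is_confusable("\u0430pple", "apple"))  # True
print(is_confusable("apple", "apply"))       # False
```

A registration system could store the skeleton of each existing username and reject new names whose skeleton is already taken.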
Bidi Override: The Trojan Source Attack
In 2021, researchers at the University of Cambridge published Trojan Source — a class of attacks exploiting Unicode bidirectional control characters to make source code appear different to human reviewers than it does to compilers.
The attack uses characters like:
| Character | Codepoint | Effect |
|---|---|---|
| RIGHT-TO-LEFT OVERRIDE | U+202E | Everything after becomes RTL |
| RIGHT-TO-LEFT EMBEDDING | U+202B | Open RTL embedding |
| LEFT-TO-RIGHT EMBEDDING | U+202A | Open LTR embedding |
| POP DIRECTIONAL FORMATTING | U+202C | Close embedding |
| RIGHT-TO-LEFT ISOLATE | U+2067 | RTL isolate |
These invisible characters change the visual display order of text without changing the logical byte order. A compiler or interpreter sees the bytes in logical order (actual execution semantics), while a human code reviewer sees the visually reordered display.
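A concrete attack example (pseudocode adapted from the commenting-out pattern described in the Trojan Source paper, not an actual exploit):
# <RLO> = U+202E, <LRI> = U+2066 (left-to-right isolate), <PDI> = U+2069 (pop isolate)
# Logical order, as stored in the file and read by the compiler.
# Everything between /* and */ is a comment, including the check:
#   /* <RLO> } <LRI>if (is_admin)<PDI> <LRI> begin admins only */
# Rendered order, as shown to the reviewer. The check looks like live code:
#   /* begin admins only */ if (is_admin) {
# The compiler discards the comment, so the block that follows
# runs for every user, not only for admins.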
The attack is particularly dangerous in code review workflows — a human reviewer sees benign-looking code, approves it, and the compiled result executes a different logical flow.
Mitigation: Most modern code editors and code review tools (GitHub, GitLab) now warn when source files contain bidi control characters. Linters can flag them as well: pylint ships a bidirectional-unicode check, and ESLint can do so through plugins. Unicode Technical Report #36 recommends that text editing tools make bidi control characters visible.
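Where no such tooling is available, a scan for bidi controls is simple to write. The following is a minimal sketch of what a pre-commit or CI check might do. The function name and the sample text are illustrative, and the set covers the embedding, override, and isolate controls (U+202A to U+202E and U+2066 to U+2069).

```python
import unicodedata

BIDI_CONTROLS = {
    "\u202A", "\u202B", "\u202C", "\u202D", "\u202E",  # LRE, RLE, PDF, LRO, RLO
    "\u2066", "\u2067", "\u2068", "\u2069",            # LRI, RLI, FSI, PDI
}

def find_bidi_controls(source: str):
    """Return (line, column, character name) for each bidi control found.

    Lines and columns are 1-based; columns count code points.
    """
    findings = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in BIDI_CONTROLS:
                findings.append((line_no, col, unicodedata.name(ch)))
    return findings

sample = 'access_level = "user"\n# note \u202E hidden \u202C\n'
for finding in find_bidi_controls(sample):
    print(finding)
# (2, 8, 'RIGHT-TO-LEFT OVERRIDE')
# (2, 17, 'POP DIRECTIONAL FORMATTING')
```

A pipeline could fail the build whenever the returned list is non-empty, with an allowlist for files that legitimately contain right-to-left text.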
Invisible Characters in Source Code
Beyond bidi overrides, several Unicode characters are completely invisible and can hide malicious content in source code or data:
| Character | Codepoint | Usage |
|---|---|---|
| Zero Width Space | U+200B | Word boundary hint, commonly used in text obfuscation |
| Zero Width Non-Joiner | U+200C | Prevents ligature formation |
| Zero Width Joiner | U+200D | Creates ZWJ emoji sequences |
| Word Joiner | U+2060 | Prevents line breaking, invisible |
| Soft Hyphen | U+00AD | Optional hyphen, invisible unless word-wrapped |
| Mongolian Vowel Separator | U+180E | Classified as a space until Unicode 6.3, now an invisible format character |
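Some of these can appear inside identifiers in some languages, creating identifiers that look identical but are different. JavaScript, for example, accepts U+200C and U+200D in identifier names:
// Two different JavaScript variable names that render identically:
let paypal = "legitimate";
let pay\u200Cpal = "malicious"; // zero-width non-joiner inside the name
Python 3 rejects U+200B inside identifiers and normalizes identifiers to NFKC before comparison, which collapses compatibility variants such as fullwidth letters and prevents many such attacks. JavaScript does not normalize identifiers and has historically been more permissive.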
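Both behaviors are easy to observe from Python itself. The sketch below also includes a hypothetical helper, `invisible_characters`, that reports format characters (general category Cf), the category to which all the characters in the table above belong.

```python
import unicodedata

def invisible_characters(text: str):
    """Return (index, codepoint, name) for each format (Cf) character."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNNAMED"))
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cf"
    ]

hidden = "pay\u200Bpal"  # ZERO WIDTH SPACE between "pay" and "pal"
print(invisible_characters(hidden))  # [(3, 'U+200B', 'ZERO WIDTH SPACE')]

# U+200B is not a valid identifier character in Python.
print(hidden.isidentifier())  # False

# NFKC folds compatibility variants: U+FF50 is a fullwidth "p".
print(unicodedata.normalize("NFKC", "\uFF50aypal") == "paypal")  # True
```

Bidi controls are also category Cf, so the same helper reports them. Whether to strip or reject such characters depends on the context, since ZWJ and ZWNJ are legitimate in some scripts and in emoji sequences.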
Zero-Width Character Fingerprinting
Zero-width characters can be used to uniquely fingerprint documents for tracking who leaked a file. By inserting different patterns of zero-width spaces and zero-width non-joiners between or inside the words of a document, an organization can create multiple visually identical versions of a sensitive document, each with a unique "fingerprint."
When the document leaks, comparing the pattern of invisible characters identifies which copy was shared and thus who the leaker was.
This dual-use technique — useful for security (tracking leaks) and harmful for privacy (tracking users) — illustrates how the same Unicode features can serve opposing purposes.
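The mechanics can be sketched in a few lines. The example below assumes a simple scheme invented for illustration: a recipient number is written as 16 bits, with U+200B for 0 and U+200C for 1, and hidden after the first word. Real watermarking tools may work quite differently.

```python
ZERO = "\u200B"  # ZERO WIDTH SPACE encodes bit 0
ONE = "\u200C"   # ZERO WIDTH NON-JOINER encodes bit 1

def embed_id(text: str, recipient_id: int, bits: int = 16) -> str:
    """Hide `recipient_id` as zero-width characters after the first word."""
    if not 0 <= recipient_id < 2 ** bits:
        raise ValueError("recipient_id does not fit in the given bit width")
    payload = "".join(
        ONE if (recipient_id >> i) & 1 else ZERO
        for i in reversed(range(bits))  # most significant bit first
    )
    head, sep, tail = text.partition(" ")
    return head + payload + sep + tail

def extract_id(text: str):
    """Recover the hidden number, or None if no mark is present."""
    marks = [ch for ch in text if ch in (ZERO, ONE)]
    if not marks:
        return None
    return int("".join("1" if ch == ONE else "0" for ch in marks), 2)

def strip_marks(text: str) -> str:
    """Remove the fingerprint characters used by this scheme."""
    return text.replace(ZERO, "").replace(ONE, "")

original = "Quarterly results are confidential."
marked = embed_id(original, 1234)
print(extract_id(marked))               # 1234
print(strip_marks(marked) == original)  # True
```

The last function shows the defensive side: stripping zero-width characters before sharing text removes this kind of mark.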
Normalization-Based Bypasses
Security filters that check for dangerous patterns before normalization are vulnerable to normalization-based bypasses. Consider a web application that:
- Checks input for dangerous strings like ../ (path traversal)
- Stores the input
- Normalizes at use time
Input containing the percent-encoded form ..%2F, or Unicode lookalikes of the dot and slash characters, can pass the filter check and still become ../ once it is decoded or normalized at the point of use.
The NFKC normalization of certain characters can produce surprising results:
FULLWIDTH SOLIDUS ／ (U+FF0F) → normalizes to / (U+002F) under NFKC
FRACTION SLASH ⁄ (U+2044) → stays as ⁄ under NFC and NFKC, but may be treated as / by some parsers
Applications that perform security checks before normalization, or that use different normalization forms at different stages, create windows for bypass attacks.
Mitigation: Normalize input as early as possible (at the system boundary), before any security checks. Then apply security checks to the normalized form.
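The difference between the two orderings can be shown with a deliberately simplified filter. This is a sketch only: the function names are illustrative, and a substring check for "../" is not a complete path traversal defense.

```python
import unicodedata

def contains_traversal(path: str) -> bool:
    """Simplified filter: look for a literal "../" sequence."""
    return "../" in path

def check_then_normalize(path: str) -> str:
    """Vulnerable order: filter first, normalize afterwards."""
    if contains_traversal(path):
        raise ValueError("path traversal rejected")
    return unicodedata.normalize("NFKC", path)

def normalize_then_check(path: str) -> str:
    """Safer order: normalize at the boundary, then filter."""
    normalized = unicodedata.normalize("NFKC", path)
    if contains_traversal(normalized):
        raise ValueError("path traversal rejected")
    return normalized

# Two U+FF0E FULLWIDTH FULL STOP, then U+FF0F FULLWIDTH SOLIDUS.
payload = "\uFF0E\uFF0E\uFF0Fetc/passwd"

print(check_then_normalize(payload))  # ../etc/passwd (filter bypassed)

try:
    normalize_then_check(payload)
except ValueError as err:
    print(err)  # path traversal rejected
```

The same payload is harmless to the second function because the filter sees exactly the string that later code will use.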
Case Folding and Case-Insensitive Matching Attacks
Unicode case folding (converting text to a canonical case-insensitive form for comparison) has edge cases that can create security bypasses.
The Turkish dotted/dotless I is the classic example:
- Turkish: 'i' uppercases to 'İ' (U+0130, Latin capital I with dot above)
- Turkish: 'I' lowercases to 'ı' (U+0131, Latin small letter dotless i)
- English: 'i' uppercases to 'I', 'I' lowercases to 'i'
If a system lowercases input using Turkish locale rules but validates against patterns built for the English locale, the string "ADMIN" becomes "admın" (with dotless ı, U+0131), which does not match the expected "admin". Mismatches of this kind between components create authentication bypass possibilities.
The German sharp S (ß) uppercases to "SS" in traditional German, meaning a single character expands to two on case conversion. Code that allocates space for the original string length but writes the uppercase form into it can overflow the buffer, so buffers should be sized from the converted string.
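These edge cases can be observed directly in Python, whose string methods apply the locale-independent Unicode mappings. The snippet is only a demonstration of the length changes and collisions described above. Turkish locale rules themselves are not available through `str.upper()` or `str.lower()`.

```python
# One code point becomes two when the German sharp s is uppercased.
word = "stra\u00DFe"  # "straße"
upper = word.upper()
print(upper, len(word), len(upper))  # STRASSE 6 7

# Dotless i (U+0131) uppercases to plain "I", so two distinct
# lowercase strings collide after uppercasing.
dotless = "adm\u0131n"
print(dotless == "admin")                  # False
print(dotless.upper() == "admin".upper())  # True

# Capital I with dot above (U+0130) lowercases to "i" plus
# U+0307 COMBINING DOT ABOVE: again one code point becomes two.
dotted_lower = "\u0130".lower()
print(len(dotted_lower))  # 2
```

The collision in the middle example is why comparisons should use one consistent operation on both sides, rather than uppercasing in one place and lowercasing in another.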
Username Confusion and Account Takeover
Many systems perform case-insensitive username matching: "Alice" and "alice" are the same account. If normalization isn't applied consistently, "Ａlice" (with fullwidth Latin A, U+FF21) might be treated as different from "Alice" at registration but match at login, or vice versa, creating account enumeration or takeover possibilities.
The PRECIS framework (RFC 8264) provides guidance for application protocols that need to process usernames and passwords with Unicode, defining profile rules for when to apply specific normalizations and restrictions. RFC 8265 defines the concrete username and password profiles built on it.
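The core requirement, using the same canonical form at registration and at login, can be sketched as follows. This is not an implementation of PRECIS. It is a simplified stand-in that applies NFKC and case folding, and the `UserStore` class is hypothetical.

```python
import unicodedata

def canonical_username(name: str) -> str:
    """Simplified canonical form: NFKC, case fold, NFKC again."""
    folded = unicodedata.normalize("NFKC", name).casefold()
    return unicodedata.normalize("NFKC", folded)

class UserStore:
    """Toy in-memory store keyed by the canonical username."""

    def __init__(self):
        self._users = {}

    def register(self, name: str, password_hash: str) -> None:
        key = canonical_username(name)
        if key in self._users:
            raise ValueError("username already taken")
        self._users[key] = password_hash

    def lookup(self, name: str):
        # Login uses the same canonical form as registration.
        return self._users.get(canonical_username(name))

store = UserStore()
store.register("Alice", "hash-1")

try:
    store.register("\uFF21lice", "hash-2")  # fullwidth A, U+FF21
except ValueError as err:
    print(err)  # username already taken

print(store.lookup("ALICE"))  # hash-1
```

Because every path goes through `canonical_username`, the fullwidth variant cannot be registered as a second account and cannot behave differently at login.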
Defense in Depth
No single mitigation addresses all Unicode security issues. Robust defense requires multiple layers:
- Normalize at input boundaries: Apply NFC or NFKC to all text inputs immediately
- Validate after normalization: All security checks operate on normalized data
- Reject invisible characters: Strip or reject zero-width characters, bidi controls, and other invisible codepoints in sensitive contexts (usernames, code)
- Use confusables data: For usernames and identifiers, check against Unicode confusables database
- Display punycode for mixed-script domains: Don't display IDN Unicode form when scripts are mixed
- Lint for bidi controls: Integrate bidi-control warnings into code review pipelines
- Apply Unicode security profiles: Follow OWASP's Unicode guidance, UTS #39 (Unicode Security Mechanisms) for identifiers, and RFC 5892 for IDNA domain labels
Unicode security is an ongoing research area. As new character blocks are added and new rendering behaviors emerge, the attack surface evolves. Staying informed about Unicode security advisories and incorporating them into your security model is part of the cost of working with truly internationalized software.