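Security

Unicode Spoofing

Exploiting Unicode features to deceive users: homoglyphs for fake domain names, bidirectional overrides for forged file extensions, and invisible characters for hidden text.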

What is Unicode Spoofing?

Unicode spoofing is a class of cyberattack that exploits the visual similarity between Unicode characters to deceive users, systems, or automated tools. Rather than breaking into servers or stealing credentials directly, it manipulates human perception: something malicious is made to appear legitimate by substituting characters that look the same but have different Unicode code points.

The attack is possible because Unicode encodes over 149,000 characters from more than 150 scripts, and many characters across those scripts look identical or nearly identical when rendered on screen. A Latin a (U+0061) and a Cyrillic а (U+0430) are different code points that most fonts render identically, and a Greek α (U+03B1) is close enough to be mistaken for either at a glance or at small sizes.
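The difference between lookalike characters is easy to see once the code points are printed instead of the glyphs. The following is a minimal sketch using only Python's standard `unicodedata` module; the helper name `describe` is hypothetical and chosen for illustration.

```python
import unicodedata


def describe(text: str) -> list[str]:
    """Return one 'U+XXXX NAME' string per character in text."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"
        for ch in text
    ]


latin_a = "\u0061"      # a
cyrillic_a = "\u0430"   # а
greek_alpha = "\u03b1"  # α

for line in describe(latin_a + cyrillic_a + greek_alpha):
    print(line)
# U+0061 LATIN SMALL LETTER A
# U+0430 CYRILLIC SMALL LETTER A
# U+03B1 GREEK SMALL LETTER ALPHA

# The glyphs may look alike, but string comparison sees different characters.
print(latin_a == cyrillic_a)  # False
```

Printing code points this way is a quick manual check when a string looks right but fails an equality comparison.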

How Unicode Spoofing Works

The general pattern involves three steps:

  1. Identify a target string — a domain name, username, file name, or code identifier that the attacker wants to impersonate
  2. Substitute lookalike characters — replace one or more characters with visually identical Unicode equivalents from a different script or block
  3. Deploy the spoofed string — register the domain, create the account, commit the file, or send the message

To a human reader — and to many software systems that do not perform script analysis — the spoofed string appears identical to the original.

Common Attack Scenarios

Phishing via IDN homograph attack: An attacker registers аpple.com, where а is Cyrillic (U+0430) instead of Latin (U+0061). The domain resolves to a phishing server. Users who click a link to this domain see what looks like apple.com in the address bar, especially in older browsers or email clients that do not display Punycode.
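The ASCII (Punycode) form of a domain shows what is actually registered. The sketch below relies on Python's built-in `idna` codec, which implements the older IDNA 2003 rules; the helper name `to_ascii` is hypothetical, and a production system would more likely use a library that follows current IDNA rules.

```python
def to_ascii(domain: str) -> str:
    """Encode a domain name to its ASCII (Punycode) form, label by label."""
    return domain.encode("idna").decode("ascii")


spoofed = "\u0430pple.com"  # first letter is CYRILLIC SMALL LETTER A
genuine = "apple.com"

print(to_ascii(spoofed))  # xn--pple-43d.com
print(to_ascii(genuine))  # apple.com

# The two names look alike on screen but are different domains on the wire.
print(to_ascii(spoofed) == to_ascii(genuine))  # False
```

Logging or displaying the ASCII form alongside the Unicode form is one way to make such a substitution visible.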

Username impersonation: On platforms that allow Unicode usernames, an attacker creates @elоn with a Cyrillic о (U+043E). Followers of the real @elon may be deceived into interacting with the fake account, especially in notifications or @mentions.

Source code backdoors: A malicious contributor submits code that defines a function verify_раssword(...), where р (U+0440) and а (U+0430) are Cyrillic. In code review the name reads as verify_password, but it is a distinct identifier. Call sites that use the lookalike name never reach the real verify_password function, which allows authentication bypass on those paths.

File name spoofing: A malicious file named report_finalе.pdf uses a Cyrillic е (U+0435) at the end. File managers display it identically to report_finale.pdf. Combined with bidirectional override characters, the displayed filename can be made to look entirely different from the actual filename.
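A filename can be screened for bidirectional control characters before it is shown to a user. This is a minimal sketch: the name `find_bidi_controls` is hypothetical, and the character set covers only the U+202A–U+202E and U+2066–U+2069 ranges, so other invisible characters would need their own checks.

```python
# Bidirectional embedding, override, and isolate controls.
BIDI_CONTROLS = {
    chr(cp) for cp in (*range(0x202A, 0x202F), *range(0x2066, 0x206A))
}


def find_bidi_controls(name: str) -> list[tuple[int, str]]:
    """Return (index, 'U+XXXX') for every bidi control character in name."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(name)
        if ch in BIDI_CONTROLS
    ]


# RIGHT-TO-LEFT OVERRIDE makes the tail render reversed, so the name
# is displayed as 'readmeexe.pdf' although it really ends in '.exe'.
suspicious = "readme\u202efdp.exe"

print(find_bidi_controls(suspicious))    # [(6, 'U+202E')]
print(find_bidi_controls("report.pdf"))  # []
print(suspicious.endswith(".exe"))       # True
```

Rejecting or escaping any filename for which the function returns a non-empty list is a conservative policy; some applications may need to allow these characters in legitimate right-to-left text.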

Mitigation Techniques

At the browser level: Modern browsers display internationalized domain names that mix scripts in their Punycode form (e.g., xn--pple-43d.com) instead of the Unicode form, which alerts users to potential spoofing.
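Mixed-script detection can be approximated in a few lines. Python's standard library does not expose the Unicode Script property, so the sketch below assumes that the first word of a character's Unicode name (for example `CYRILLIC` or `LATIN`) is a good enough stand-in. That is a heuristic, not what browsers actually implement, and the function names are hypothetical.

```python
import unicodedata


def scripts_in(label: str) -> set[str]:
    """Approximate the scripts used by the letters in label.

    Uses the first word of each letter's Unicode name as a script tag.
    Digits, hyphens, and other non-letters are ignored.
    """
    scripts = set()
    for ch in label:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                scripts.add(name.split()[0])
    return scripts


def is_mixed_script(label: str) -> bool:
    """Return True if the letters in label come from more than one script."""
    return len(scripts_in(label)) > 1


print(sorted(scripts_in("\u0430pple")))  # ['CYRILLIC', 'LATIN']
print(is_mixed_script("\u0430pple"))     # True
print(is_mixed_script("apple"))          # False
```

A heuristic like this flags legitimate mixtures too (Japanese text routinely combines several scripts), and it cannot catch a name written entirely in one lookalike script, so it works best as one signal among several.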

At the platform level: Social platforms can normalize usernames by mapping confusable characters to a canonical form, then preventing registration of two usernames that normalize identically.
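The normalize-then-compare idea can be sketched as follows. The mapping table here is a tiny hand-picked subset used purely for illustration (a real deployment would load the full confusables data), the NFKC-plus-casefold step is a simplification rather than the exact skeleton procedure defined by Unicode, and `skeleton` and `UsernameRegistry` are hypothetical names.

```python
import unicodedata

# Hypothetical, hand-picked subset of confusable mappings (illustration only).
CONFUSABLE_TO_LATIN = {
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE
    "\u043e": "o",  # CYRILLIC SMALL LETTER O
    "\u0440": "p",  # CYRILLIC SMALL LETTER ER
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u03bf": "o",  # GREEK SMALL LETTER OMICRON
}


def skeleton(username: str) -> str:
    """Map a username to a canonical form for collision checks."""
    folded = unicodedata.normalize("NFKC", username).casefold()
    return "".join(CONFUSABLE_TO_LATIN.get(ch, ch) for ch in folded)


class UsernameRegistry:
    """Rejects usernames whose skeleton matches an existing one."""

    def __init__(self) -> None:
        self._taken: dict[str, str] = {}  # skeleton -> original username

    def register(self, username: str) -> bool:
        key = skeleton(username)
        if key in self._taken:
            return False
        self._taken[key] = username
        return True


registry = UsernameRegistry()
print(registry.register("elon"))       # True
print(registry.register("el\u043en"))  # False: Cyrillic o collides with "elon"
print(registry.register("Elon"))       # False: case-folds to the same skeleton
```

Storing the skeleton alongside the displayed username keeps the original spelling intact while still enforcing uniqueness on the canonical form.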

At the application level: Developers can apply the confusables checks from Unicode Technical Standard #39 (UTS #39, often cited as TR39) to any identifier or string that will be displayed to users alongside other identifiers.

At the code review level: Security-aware editors and static analysis tools can flag source files that contain characters outside the expected ASCII or script range.
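A basic version of such a check fits in a short function. The sketch below assumes a codebase whose source is expected to be ASCII-only; `find_non_ascii` is a hypothetical helper, not part of any particular linter, and projects with legitimate non-ASCII strings or comments would need an allowlist.

```python
import unicodedata


def find_non_ascii(source: str) -> list[tuple[int, int, str]]:
    """Return (line, column, description) for each non-ASCII character.

    Line and column numbers are 1-based.
    """
    findings = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) > 127:
                desc = f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"
                findings.append((line_no, col, desc))
    return findings


# The function name below contains Cyrillic 'р' and 'а'.
sample = "def verify_\u0440\u0430ssword(pw):\n    return True\n"

for finding in find_non_ascii(sample):
    print(finding)
# (1, 12, 'U+0440 CYRILLIC SMALL LETTER ER')
# (1, 13, 'U+0430 CYRILLIC SMALL LETTER A')
```

Run as a pre-commit hook or CI step, a check like this surfaces lookalike identifiers that are invisible in a normal diff view.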

Quick Facts

Property: Value
Root cause: Visual equivalence across Unicode scripts
Key enabling standard: Unicode TR39 confusables dataset
Primary attack surfaces: Domain names, usernames, source code, filenames
Technical name for domain variant: IDN homograph attack
Browser defense: Punycode fallback rendering
Source code defense: Linters, Unicode character set whitelisting
Year of notable browser fix: 2005 (Firefox added Punycode fallback)

Related Terms

More in Security

Bidi Text Attack

Exploiting Unicode bidirectional control characters to disguise malicious code or filenames. The …

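IDN Homograph Attack

An attack that uses visually similar Unicode characters in a domain name to impersonate a legitimate website. For example, аpple.com (with a Cyrillic а) looks like apple.com. Browsers defend against it with Punycode display rules.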

Normalization Attack

Exploiting Unicode normalization to bypass security filters. Input validated before normalization may …

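Bidirectional Override Attack

An attack that uses Unicode bidirectional control characters (U+202A–U+202E, U+2066–U+2069) to disguise malicious filenames or code. For example, 'readme' followed by U+202E and 'fdp.exe' is displayed as 'readmeexe.pdf'.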

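Homoglyph

Characters from different writing systems that look identical or very similar, such as Latin 'a' and Cyrillic 'а'. They are used in phishing, spoofing, and social engineering attacks.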

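Confusables

Unicode's official term for pairs of characters that may be visually confused, defined in confusables.txt (part of the UTS #39 security data). The term is broader than homoglyph: it includes characters that are merely similar rather than identical.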

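Mixed-Script Detection

Identifying text that mixes characters from different writing systems (such as Latin plus Cyrillic). It is a primary defense against homoglyph attacks, and browsers use it to trigger Punycode display.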

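Zero Width Joiner (ZWJ)

U+200D. Requests that adjacent characters be joined. It is key to emoji sequences (👩 + ZWJ + 💻 = 👩‍💻), requests conjunct forms in Indic scripts, and can also be used to hide text boundaries.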

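Zero Width Non-Joiner (ZWNJ)

U+200C. Prevents adjacent characters from joining. It is required for correct letter forms in Persian and Arabic, and is also used in Devanagari to prevent conjuncts.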