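Security

Homoglyph

Characters from different scripts that look identical or very similar, such as Latin 'a' and Cyrillic 'а'. They are used in phishing, spoofing, and social engineering attacks.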


What is a Homoglyph?

A homoglyph is a character that looks visually identical or nearly identical to another character but has a completely different Unicode code point, name, and meaning. The word comes from the Greek homos (same) and glyphe (carving or symbol). Because most typefaces draw these characters with the same shape, readers cannot tell them apart by eye, and software that does not inspect the underlying code points can be misled as well.

The most well-known example is the Latin lowercase letter a (U+0061) and the Cyrillic lowercase letter а (U+0430). They are rendered identically in most fonts, yet they are entirely different code points belonging to different Unicode scripts. Dozens of such pairs exist across Latin, Greek, Cyrillic, Armenian, and many other scripts.
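A quick way to confirm that two look-alike characters really are different is to print their code points and official names. The sketch below uses Python's standard `unicodedata` module; the helper name `describe` is made up for this illustration.

```python
import unicodedata


def describe(text: str) -> list[str]:
    """Return one 'U+XXXX NAME' string per character in text."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"
        for ch in text
    ]


latin_a = "\u0061"     # Latin small letter a
cyrillic_a = "\u0430"  # Cyrillic small letter a

for line in describe(latin_a + cyrillic_a):
    print(line)
# U+0061 LATIN SMALL LETTER A
# U+0430 CYRILLIC SMALL LETTER A

print(latin_a == cyrillic_a)  # False: same shape, different code points
```

Writing the characters as `\u` escapes keeps the example unambiguous even in an editor where the two glyphs are indistinguishable.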

Why Homoglyphs Are a Security Problem

The Unicode Standard encodes characters from over 150 scripts, and many scripts independently developed symbols that resemble those in other scripts. This is expected and linguistically valid. The security problem arises when attackers deliberately substitute one character for another to trick users into believing they are looking at something they are not.

Common targets include:

  • Domain names: The Internationalized Domain Names in Applications (IDNA) standard allows non-ASCII characters in domain names. Where a registry permits it, an attacker can register pаypal.com using a Cyrillic а and create a convincing phishing site that appears to be paypal.com to a casual viewer.
  • Usernames and handles: Social platforms that allow Unicode usernames are vulnerable to impersonation attacks where a fake account mimics a real one character-for-character.
  • Source code and filenames: Homoglyphs in variable names or filenames can introduce subtle backdoors that are nearly impossible to spot during code review.
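For the domain-name case, converting a name to its ASCII (Punycode) form makes a hidden substitution visible. The sketch below relies on Python's built-in `idna` codec, which implements the older IDNA 2003 rules; the helper name `to_ascii` is an assumption made for this example, and production code would more likely use a library that follows the current IDNA rules.

```python
def to_ascii(domain: str) -> str:
    """Encode a domain name with Python's built-in 'idna' codec (IDNA 2003)."""
    return domain.encode("idna").decode("ascii")


genuine = "paypal.com"
spoofed = "p\u0430ypal.com"  # second letter is CYRILLIC SMALL LETTER A

print(to_ascii(genuine))  # pure ASCII, so it is returned unchanged
print(to_ascii(spoofed))  # the label with the Cyrillic letter gains an xn-- prefix

# The two names look alike on screen but resolve to different ASCII names.
print(to_ascii(genuine) == to_ascii(spoofed))  # False
```

A label that starts with `xn--` where a plain ASCII name was expected is a strong hint that the name contains non-ASCII characters worth inspecting.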

Common Homoglyph Pairs

Many scripts contribute characters that visually overlap with Latin letters:

  • Latin o (U+006F), Cyrillic о (U+043E), Greek ο (U+03BF) — all look like "o"
  • Latin p (U+0070) and Cyrillic р (U+0440) — identical lowercase forms
  • Latin c (U+0063) and Cyrillic с (U+0441) — identical lowercase forms
  • Latin e (U+0065) and Cyrillic е (U+0435) — identical lowercase forms
  • Latin H (U+0048) and Cyrillic Н (U+041D) — identical uppercase forms

This means the word "COPE" written entirely in Cyrillic capitals, СОРЕ (U+0421, U+041E, U+0420, U+0415), looks exactly like the Latin word "COPE" in most fonts.
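It is worth checking what ordinary string handling does with such a pair. The sketch below, using only Python's standard library, shows that neither plain comparison, nor case folding, nor any of the four Unicode normalization forms treats the Cyrillic and Latin spellings as equal, which is why a dedicated confusables mapping is needed.

```python
import unicodedata

latin = "COPE"
cyrillic = "\u0421\u041e\u0420\u0415"  # Cyrillic capitals ES, O, ER, IE

# Plain comparison sees four different code points.
print(latin == cyrillic)  # False

# Unicode normalization does not unify look-alikes across scripts.
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    same = unicodedata.normalize(form, cyrillic) == latin
    print(form, same)  # False for every form

# Case folding does not help either.
print(latin.casefold() == cyrillic.casefold())  # False
```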

How to Detect and Prevent Homoglyph Attacks

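Unicode Technical Standard #39 (Unicode Security Mechanisms) defines a confusables data file, confusables.txt, that maps thousands of characters to a prototype character or sequence. Two strings are considered confusable when they reduce to the same "skeleton" under this mapping. Software can use the data to normalize or flag suspicious text.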
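The confusables data can be applied by mapping each character to its prototype and comparing the results, often called comparing "skeletons". The sketch below is a simplified version of that idea under stated assumptions: the `SAMPLE` text is a hand-written excerpt in the `source ; target ; type # comment` layout containing only pairs mentioned in this article, and the function names are invented for the example. A real application would load the full published file and follow the complete procedure in the specification.

```python
import unicodedata

# Hand-written sample in the confusables.txt layout (assumption: a tiny
# subset for illustration, not the published data file).
SAMPLE = """\
# source ; target ; type # comment
0430 ; 0061 ; MA  # CYRILLIC SMALL LETTER A -> LATIN SMALL LETTER A
043E ; 006F ; MA  # CYRILLIC SMALL LETTER O -> LATIN SMALL LETTER O
0440 ; 0070 ; MA  # CYRILLIC SMALL LETTER ER -> LATIN SMALL LETTER P
0441 ; 0063 ; MA  # CYRILLIC SMALL LETTER ES -> LATIN SMALL LETTER C
0435 ; 0065 ; MA  # CYRILLIC SMALL LETTER IE -> LATIN SMALL LETTER E
03BF ; 006F ; MA  # GREEK SMALL LETTER OMICRON -> LATIN SMALL LETTER O
"""


def parse_confusables(text: str) -> dict[str, str]:
    """Build a {source: target} map from confusables-style lines."""
    table: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        if len(fields) < 2:
            continue
        # Each field is one or more space-separated hexadecimal code points.
        source = "".join(chr(int(cp, 16)) for cp in fields[0].split())
        target = "".join(chr(int(cp, 16)) for cp in fields[1].split())
        table[source] = target
    return table


def skeleton(text: str, table: dict[str, str]) -> str:
    """Simplified skeleton: decompose, map each character, decompose again."""
    decomposed = unicodedata.normalize("NFD", text)
    mapped = "".join(table.get(ch, ch) for ch in decomposed)
    return unicodedata.normalize("NFD", mapped)


def is_confusable(a: str, b: str, table: dict[str, str]) -> bool:
    """True when two strings reduce to the same skeleton."""
    return skeleton(a, table) == skeleton(b, table)


TABLE = parse_confusables(SAMPLE)
print(is_confusable("paypal.com", "p\u0430ypal.com", TABLE))  # True
print(is_confusable("paypal.com", "paypal.org", TABLE))       # False
```

Skeletons are meant for comparison only. Storing or displaying the skeleton in place of the user's original text would corrupt legitimate non-Latin input.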

Common defenses include:

  1. Script mixing detection: reject or warn when a string contains characters from more than one script
  2. Confusables normalization: map potentially confusing characters to a canonical form before storage or comparison
  3. Punycode display: many browsers show an internationalized domain name in its Punycode (xn--...) form when a label mixes scripts in a suspicious way; the exact policy differs between browsers
  4. Visual diff tools: security-aware editors can highlight characters that are not in the expected script
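The first defense in the list can be approximated in a few lines. Python's standard library does not expose the Unicode Script property, so the sketch below uses the first word of each character's name (LATIN, CYRILLIC, GREEK, and so on) as a rough stand-in. This is a heuristic chosen for the example, not the algorithm from the Unicode specification, and the function names are hypothetical.

```python
import unicodedata
from typing import Optional


def rough_script(ch: str) -> Optional[str]:
    """Guess a letter's script from the first word of its Unicode name.

    Returns None for digits, punctuation, and other non-letters, which
    are shared between scripts and should not count as mixing.
    """
    if not ch.isalpha():
        return None
    name = unicodedata.name(ch, "")
    return name.split(" ", 1)[0] if name else None


def scripts_in(text: str) -> set[str]:
    """Collect the approximate scripts of all letters in text."""
    return {s for s in map(rough_script, text) if s is not None}


def is_mixed_script(text: str) -> bool:
    """True when letters from more than one approximate script appear."""
    return len(scripts_in(text)) > 1


print(is_mixed_script("paypal.com"))        # False: Latin only
print(is_mixed_script("p\u0430ypal.com"))   # True: Latin plus Cyrillic
print(sorted(scripts_in("p\u0430ypal")))    # ['CYRILLIC', 'LATIN']

# Limitation: a string written entirely in one foreign script is not flagged.
print(is_mixed_script("\u0421\u041e\u0420\u0415"))  # False
```

The name-based heuristic also misjudges legitimate combinations such as Japanese text, where kana and Han characters normally appear together, so a production check would use real script data. As the last line shows, mixed-script detection does not catch whole-script spoofs, which is why it is usually paired with a confusables comparison.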

Quick Facts

| Property | Value |
|---|---|
| Term origin | Greek homos (same) + glyphe (symbol) |
| Key Unicode document | UTS #39, Unicode Security Mechanisms |
| Confusables data file | confusables.txt, published with UTS #39 |
| Most exploited scripts | Latin, Cyrillic, Greek, Armenian |
| Primary attack surface | Domain names (IDN), usernames, source code |
| Browser defense | Punycode fallback for mixed-script domains |
| Related terms | Confusable, IDN homograph attack |

Related Terms

More in Security

Bidi Text Attack

Exploiting Unicode bidirectional control characters to disguise malicious code or filenames. The …

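IDN Homograph Attack

An attack that uses visually similar Unicode characters in a domain name to impersonate a legitimate website. For example, аpple.com (with a Cyrillic а) looks like apple.com. Browsers defend against it with Punycode display rules.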

Normalization Attack

Exploiting Unicode normalization to bypass security filters. Input validated before normalization may …

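Unicode Spoofing

Using Unicode features to deceive users: homoglyphs for fake domain names, bidirectional overrides for forged file extensions, and invisible characters for hidden text.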

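Bidirectional Override Attack

An attack that uses Unicode bidirectional control characters (U+202A–U+202E, U+2066–U+2069) to disguise malicious filenames or code. For example, 'readme' followed by U+202E and 'fdp.exe' is displayed as 'readmeexe.pdf'.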

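Confusable

Unicode's official term for character pairs that may be visually confused, defined in confusables.txt (published with UTS #39). It is broader than homoglyph and includes characters that are merely similar rather than identical.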

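Mixed-Script Detection

Identifying text that mixes characters from different scripts (such as Latin and Cyrillic). It is a primary defense against homoglyph attacks, and browsers use it to trigger Punycode display.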

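Zero Width Joiner (ZWJ)

U+200D. Requests that adjacent characters join. It is essential to emoji sequences (👩 + ZWJ + 💻 = 👩‍💻), requests ligature formation in Indic scripts, and can also be used to hide text boundaries.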

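Zero Width Non-Joiner (ZWNJ)

U+200C. Prevents adjacent characters from joining. It is required for correct letter forms in Persian and Arabic, and is also used in Devanagari to prevent ligatures.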