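Mixed-Script Detection
Identifying text that mixes characters from different scripts (for example, Latin + Cyrillic) is a primary defense against homoglyph attacks; browsers use it to decide when to display a domain name in Punycode.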
What is Mixed-Script Detection?
Mixed-script detection is a security technique that identifies text strings containing characters from more than one Unicode script, flagging them as potentially deceptive. Because legitimate text in most contexts is written in a single script — English in Latin, Russian in Cyrillic, Arabic in the Arabic script — the presence of multiple scripts in a single identifier, domain name, or username is a strong signal of a spoofing attempt.
Unicode Technical Standard #39 (Unicode Security Mechanisms), commonly referred to as TR39, formalizes mixed-script detection as one of the primary defenses against homoglyph and confusable attacks.
Unicode Scripts
Unicode organizes characters into scripts — named systems of writing associated with particular languages and cultures. Every Unicode character (except a small set of "Common" and "Inherited" script characters) belongs to exactly one script. The Unicode Character Database includes a Scripts.txt property file assigning each code point to its script.
Examples of scripts: Latin, Cyrillic, Greek, Armenian, Hebrew, Arabic, Devanagari, Bengali, CJK (Han), Hiragana, Katakana, Thai, Georgian.
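Characters that are shared across writing systems, such as digits (0–9) and punctuation like ., -, and @, have the script property Common. Combining marks that take on the script of the character they attach to have the script property Inherited. Both are allowed in any script context without triggering mixed-script detection.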
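Python's standard unicodedata module does not expose the Script property, so the following is only a rough sketch of a per-character script lookup. It assumes the first word of a character's Unicode name is a usable stand-in for its script, which holds for the scripts listed above but is not how Scripts.txt is defined. The names `NAME_PREFIX_TO_SCRIPT` and `approx_script` are made up for this illustration.

```python
import unicodedata

# Assumption: the first word of the character name approximates the script.
# This is a heuristic stand-in for the real Script property in Scripts.txt.
NAME_PREFIX_TO_SCRIPT = {
    "LATIN": "Latin", "CYRILLIC": "Cyrillic", "GREEK": "Greek",
    "ARMENIAN": "Armenian", "HEBREW": "Hebrew", "ARABIC": "Arabic",
    "DEVANAGARI": "Devanagari", "BENGALI": "Bengali", "CJK": "Han",
    "HIRAGANA": "Hiragana", "KATAKANA": "Katakana", "THAI": "Thai",
    "GEORGIAN": "Georgian",
}

def approx_script(ch):
    """Guess the script of a single character from its Unicode name."""
    first_word = unicodedata.name(ch, "").split(" ", 1)[0]
    if first_word == "COMBINING":
        return "Inherited"  # generic combining marks follow their base character
    if first_word in NAME_PREFIX_TO_SCRIPT:
        return NAME_PREFIX_TO_SCRIPT[first_word]
    if unicodedata.category(ch).startswith("L"):
        return "Unknown"    # a letter from a script this sketch does not cover
    return "Common"         # digits, punctuation, symbols, spaces

# Latin a, Cyrillic а, Greek α, a digit, and a hyphen
for ch in "a\u0430\u03b17-":
    print(f"U+{ord(ch):04X} {approx_script(ch)}")
```

For production use, a library that ships the actual Script data (for example the third-party regex module with `\p{Script=...}` classes) is a safer choice than this name-based guess.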
How Mixed-Script Detection Works
The algorithm examines all characters in a string and collects the set of scripts represented:
- Characters with script Common or Inherited are ignored for mixing purposes
- Characters with a specific script (Latin, Cyrillic, etc.) are added to the script set
- If the resulting set contains more than one script, the string is mixed-script
For example:
- paypal: all Latin; single script, clean
- раураl: Cyrillic р, а, у, р, а + Latin l; mixed script, flagged
- münchen: all Latin (ü is itself a Latin-script letter; in decomposed form its combining diaeresis is Inherited); single script, clean
- аррlе: Cyrillic а, р, р + Latin l, е; mixed script, flagged
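The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not a TR39 implementation: it assumes the script of a character can be guessed from the first word of its Unicode name, and it only recognizes the scripts in the hypothetical `SCRIPT_PREFIXES` table.

```python
import unicodedata

# Assumption: a name prefix stands in for the Script property (heuristic).
SCRIPT_PREFIXES = {
    "LATIN": "Latin", "CYRILLIC": "Cyrillic", "GREEK": "Greek",
    "ARMENIAN": "Armenian", "HEBREW": "Hebrew", "ARABIC": "Arabic",
    "CJK": "Han", "HIRAGANA": "Hiragana", "KATAKANA": "Katakana",
}

def scripts_in(text):
    """Collect the specific scripts in text, ignoring Common and Inherited."""
    found = set()
    for ch in text:
        prefix = unicodedata.name(ch, "").split(" ", 1)[0]
        script = SCRIPT_PREFIXES.get(prefix)
        if script is not None:      # unmapped characters are treated as Common
            found.add(script)
    return found

def is_mixed_script(text):
    """A string is mixed-script when more than one specific script appears."""
    return len(scripts_in(text)) > 1

print(is_mixed_script("paypal"))            # all Latin
print(is_mixed_script("p\u0430yp\u0430l"))  # Latin with Cyrillic а (U+0430)
print(is_mixed_script("example-123.com"))   # Latin plus Common characters
```

Writing the Cyrillic letters as `\u0430` escapes keeps the test strings unambiguous, since the lookalike characters cannot be told apart by eye in source code.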
Augmented Script Sets
Unicode TR39 defines the concept of augmented script sets to handle writing systems that legitimately combine several scripts. For example, Japanese is written with Han characters together with Hiragana and Katakana, while Han characters are also used for Chinese. Each character's script set is augmented with combined writing-system values (such as Jpan for Japanese or Kore for Korean), and a string counts as single-script when the augmented sets of all its characters share at least one value. This prevents false positives for legitimate text such as Japanese.
This means Japanese text containing Hiragana, Katakana, and Han characters is not flagged as mixed-script, because the augmented script set of each of those characters contains Jpan. Only truly suspicious combinations, such as Latin mixed with Cyrillic, trigger the detection.
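As a rough sketch of how augmented script sets resolve, the code below intersects the augmented set of every character in a string. The augmentation rules (Han adds Hanb, Jpan, and Kore; Hiragana and Katakana add Jpan; Hangul adds Kore; Bopomofo adds Hanb) follow TR39, but the script lookup is a name-based guess and Script_Extensions are ignored, so treat it as an illustration only. All function and table names are made up for this example.

```python
import unicodedata

ALL = None  # sentinel for "every script" (used for Common and Inherited)

# Assumption: a name prefix stands in for the Script property (heuristic).
PREFIX_TO_SCRIPT = {
    "LATIN": "Latin", "CYRILLIC": "Cyrillic", "GREEK": "Greek", "CJK": "Han",
    "HIRAGANA": "Hiragana", "KATAKANA": "Katakana", "HANGUL": "Hangul",
    "BOPOMOFO": "Bopomofo",
}

# Extra writing-system values that TR39 adds to a character's script set.
AUGMENT = {
    "Han": {"Hanb", "Jpan", "Kore"},
    "Hiragana": {"Jpan"},
    "Katakana": {"Jpan"},
    "Hangul": {"Kore"},
    "Bopomofo": {"Hanb"},
}

def augmented_script_set(ch):
    prefix = unicodedata.name(ch, "").split(" ", 1)[0]
    script = PREFIX_TO_SCRIPT.get(prefix)
    if script is None:
        return ALL  # treated as Common/Inherited: compatible with anything
    return {script} | AUGMENT.get(script, set())

def resolved_script_set(text):
    """Intersect the augmented script sets of all characters in text."""
    resolved = ALL
    for ch in text:
        aug = augmented_script_set(ch)
        if aug is ALL:
            continue
        resolved = aug if resolved is ALL else resolved & aug
    return resolved

def is_single_script(text):
    resolved = resolved_script_set(text)
    return resolved is ALL or len(resolved) > 0

japanese = "\u65e5\u672c\u8a9e\u306e\u30c6\u30ad\u30b9\u30c8"  # Han + Hiragana + Katakana
print(resolved_script_set(japanese), is_single_script(japanese))
print(is_single_script("p\u0430ypal"))  # Latin with one Cyrillic letter
```

Using an intersection rather than a simple count of scripts is what lets the Japanese string pass: every character's set still contains Jpan, whereas Latin and Cyrillic have nothing in common.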
Spoof Checks Defined in TR39
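Unicode TR39 defines three classes of confusable strings, based on comparing "skeletons" built from the confusables data; implementations commonly add a case-insensitive variant as a fourth check:
- Single-script confusable: Two confusable strings that are each single-script and share the same script (e.g., a Latin string that looks like another Latin string)
- Mixed-script confusable: Two confusable strings that are not single-script confusables, typically because one of them mixes scripts (e.g., a Latin word with one letter replaced by a Cyrillic lookalike)
- Whole-script confusable: A mixed-script confusable pair in which each string is itself single-script (e.g., an all-Cyrillic lookalike of a Latin word)
- Any-case confusable: Not a separate class in TR39 but an option offered by some implementations, applying the above checks case-insensitively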
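To make these classes concrete, here is a toy sketch of skeleton-based classification. It assumes a tiny hand-written prototype table (`TOY_PROTOTYPES`, a hypothetical excerpt; the real mappings live in confusables.txt) and only distinguishes Latin from Cyrillic, so it illustrates the definitions rather than implementing TR39.

```python
import unicodedata

# Hypothetical excerpt of confusable mappings; the real data is confusables.txt.
TOY_PROTOTYPES = {
    "\u0430": "a", "\u0441": "c", "\u0435": "e", "\u043e": "o",
    "\u0440": "p", "\u0455": "s", "\u0443": "y",
    "m": "rn",  # same-script lookalike: "rn" resembles "m"
}

def skeleton(text):
    """NFD-normalize, map each character to its prototype, normalize again."""
    decomposed = unicodedata.normalize("NFD", text)
    mapped = "".join(TOY_PROTOTYPES.get(ch, ch) for ch in decomposed)
    return unicodedata.normalize("NFD", mapped)

def scripts_in(text):
    """Heuristic script lookup limited to Latin and Cyrillic."""
    found = set()
    for ch in text:
        prefix = unicodedata.name(ch, "").split(" ", 1)[0]
        if prefix in ("LATIN", "CYRILLIC"):
            found.add(prefix.capitalize())
    return found

def classify(x, y):
    """Return the most specific confusable class for the pair (x, y)."""
    if x == y or skeleton(x) != skeleton(y):
        return "not confusable"
    sx, sy = scripts_in(x), scripts_in(y)
    both_single = len(sx) <= 1 and len(sy) <= 1
    # An empty set means only Common characters, which match any script.
    if both_single and (sx & sy or not sx or not sy):
        return "single-script confusable"
    if both_single:
        return "whole-script confusable"
    return "mixed-script confusable"

print(classify("modern", "rnodern"))                     # both Latin
print(classify("paypal", "p\u0430yp\u0430l"))            # Latin vs Latin+Cyrillic
print(classify("scope", "\u0455\u0441\u043e\u0440\u0435"))  # Latin vs all-Cyrillic
```

Because whole-script confusables are a subset of mixed-script confusables, the function reports the narrower label when both strings are single-script.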
Implementation in Practice
Browser implementations use mixed-script detection to decide whether to display an internationalized domain name as Unicode or fall back to Punycode. Domain registrars apply it to block registration of mixed-script domains. Programming language toolchains use it to warn about suspicious identifiers.
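The browser behaviour can be approximated as follows: render each label of a domain as Unicode unless it looks mixed-script, in which case show its Punycode form. This is a simplified sketch, assuming a name-prefix heuristic for scripts; real browsers apply considerably more elaborate policies, and `display_label` and `display_domain` are hypothetical helper names. The only library feature relied on is Python's built-in "punycode" codec.

```python
import unicodedata

def scripts_in(label):
    """Heuristic: use the first word of each letter's Unicode name as its script."""
    return {
        unicodedata.name(ch, "").split(" ", 1)[0]
        for ch in label
        if unicodedata.category(ch).startswith("L")  # digits and hyphens are ignored
    }

def display_label(label):
    """Return the label as Unicode, or as xn-- Punycode if it mixes scripts."""
    if len(scripts_in(label)) > 1:
        return "xn--" + label.encode("punycode").decode("ascii")
    return label

def display_domain(domain):
    """Apply the per-label display decision to a dotted domain name."""
    return ".".join(display_label(part) for part in domain.split("."))

print(display_domain("paypal.com"))        # all Latin, shown unchanged
print(display_domain("p\u0430ypal.com"))   # Cyrillic а in a Latin label
print(display_domain("\u043f\u0440\u0438\u043c\u0435\u0440.com"))  # all-Cyrillic label
```

An all-Cyrillic label is left in Unicode here because it is single-script; catching whole-script lookalikes requires the confusable checks rather than mixed-script detection alone.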
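Python 3 allows non-ASCII identifiers in source code and normalizes them to NFKC, but the interpreter itself does not reject mixed-script identifiers, so this check is left to linters and review tools. The standard unicodedata module does not expose the Script property; script lookups need a third-party package, such as the regex module with its \p{Script=...} classes. Third-party libraries like confusable-homoglyphs implement TR39-based mixed-script and confusable checks.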
Quick Facts
| Property | Value |
|---|---|
| Governing standard | Unicode Technical Standard #39 (TR39): Unicode Security Mechanisms |
| Script data file | Scripts.txt in Unicode Character Database |
| Special script values | Common, Inherited (excluded from mixing checks) |
| Japanese exception | Han + Hiragana + Katakana allowed via augmented sets |
| Primary defense against | IDN homograph attacks, username spoofing |
| Browser application | Determines Unicode vs. Punycode URL rendering |
| Related concept | Whole-script confusable, confusables dataset |