Invisible Character Detection and Removal
Zero-width and other invisible Unicode characters can be used to fingerprint text for tracking, hide malicious payloads in code, or bypass content filters while remaining undetectable to the human eye. This guide explains how to detect, visualize, and remove invisible Unicode characters from user input and stored text using code examples in Python and JavaScript.
Invisible characters are among the most insidious features of Unicode from a security perspective. They occupy positions in a string, affect text processing, and can alter rendering behavior, all without producing any visible output. From zero-width spaces that bypass keyword filters to bidirectional controls that reorder displayed text, invisible characters create a class of vulnerabilities that are hard to see by definition. This guide catalogs the invisible characters in Unicode, explains their legitimate purposes, demonstrates how they can be abused, and provides detection and removal techniques for developers.
Catalog of Invisible Characters
Zero-Width Characters
These characters have zero advance width — they produce no visible glyph and occupy no horizontal space:
| Character | Code Point | Name | Legitimate Purpose |
|---|---|---|---|
| ZWSP | U+200B | Zero Width Space | Word-break opportunity without visible space |
| ZWNJ | U+200C | Zero Width Non-Joiner | Prevent ligature/joining (Persian, Indic) |
| ZWJ | U+200D | Zero Width Joiner | Force ligature/joining (emoji sequences) |
| WJ | U+2060 | Word Joiner | Prevent line break (replacement for U+FEFF) |
| ZWNBSP | U+FEFF | Zero Width No-Break Space / BOM | Byte order mark at file start |
| IA | U+2061 | Function Application | Mathematical notation |
| IT | U+2062 | Invisible Times | Mathematical notation |
| IS | U+2063 | Invisible Separator | Mathematical notation |
| IP | U+2064 | Invisible Plus | Mathematical notation |
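As a quick check of the table above, the following sketch uses Python's standard `unicodedata` module to print the name and general category of each listed code point. The helper name `describe` is illustrative and not part of any library.

```python
import unicodedata

# Code points from the zero-width table above
ZERO_WIDTH = [0x200B, 0x200C, 0x200D, 0x2060, 0xFEFF,
              0x2061, 0x2062, 0x2063, 0x2064]

def describe(cp):
    """Return (code point label, Unicode name, general category)."""
    ch = chr(cp)
    return f"U+{cp:04X}", unicodedata.name(ch, "<unnamed>"), unicodedata.category(ch)

for cp in ZERO_WIDTH:
    print(*describe(cp), sep="  ")

# Each character adds to the string length even though nothing is drawn
print(len("ab"), len("a\u200Bb"))
```

All nine characters report the general category `Cf` (format), which is why category-based checks are a common starting point for detection.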
Bidirectional Control Characters
These control the direction of text display without producing visible output:
| Character | Code Point | Name |
|---|---|---|
| LRM | U+200E | Left-to-Right Mark |
| RLM | U+200F | Right-to-Left Mark |
| LRE | U+202A | Left-to-Right Embedding |
| RLE | U+202B | Right-to-Left Embedding |
| PDF | U+202C | Pop Directional Formatting |
| LRO | U+202D | Left-to-Right Override |
| RLO | U+202E | Right-to-Left Override |
| LRI | U+2066 | Left-to-Right Isolate |
| RLI | U+2067 | Right-to-Left Isolate |
| FSI | U+2068 | First Strong Isolate |
| PDI | U+2069 | Pop Directional Isolate |
| ALM | U+061C | Arabic Letter Mark |
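The characters in this table can also be recognized by their Bidi_Class property rather than by a hard-coded list. The sketch below relies on `unicodedata.bidirectional()` from the standard library; the names `BIDI_TABLE`, `EXPLICIT_BIDI_CLASSES`, and `is_explicit_bidi_control` are assumptions made for illustration.

```python
import unicodedata

# Abbreviations from the table above, keyed by code point
BIDI_TABLE = {
    0x200E: "LRM", 0x200F: "RLM", 0x061C: "ALM",
    0x202A: "LRE", 0x202B: "RLE", 0x202C: "PDF",
    0x202D: "LRO", 0x202E: "RLO",
    0x2066: "LRI", 0x2067: "RLI", 0x2068: "FSI", 0x2069: "PDI",
}

# Bidi classes that mark explicit embeddings, overrides, and isolates
EXPLICIT_BIDI_CLASSES = {"LRE", "RLE", "PDF", "LRO", "RLO",
                         "LRI", "RLI", "FSI", "PDI"}

def is_explicit_bidi_control(ch):
    """True for embedding, override, and isolate controls (not the marks)."""
    return unicodedata.bidirectional(ch) in EXPLICIT_BIDI_CLASSES

for cp, abbr in BIDI_TABLE.items():
    ch = chr(cp)
    print(f"U+{cp:04X} {abbr:<4} bidi class={unicodedata.bidirectional(ch)}")
```

The three marks (LRM, RLM, ALM) report the ordinary strong classes `L`, `R`, and `AL`, so this check deliberately leaves them out; they behave like invisible letters rather than explicit direction changes.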
Special Spaces and Format Characters
| Character | Code Point | Name | Width |
|---|---|---|---|
| NBSP | U+00A0 | No-Break Space | Same as regular space |
| NNBSP | U+202F | Narrow No-Break Space | Narrower than regular |
| MMSP | U+205F | Medium Mathematical Space | Medium |
| HAIR | U+200A | Hair Space | Very thin |
| THIN | U+2009 | Thin Space | Thin |
| SHY | U+00AD | Soft Hyphen | Zero (visible only at line break) |
| CGJ | U+034F | Combining Grapheme Joiner | Zero |
| IDSP | U+3000 | Ideographic Space | Full-width blank |
| ㅤ | U+3164 | Hangul Filler | Used as visible blank in Korean |
| ᅠ | U+115F | Hangul Choseong Filler | Leading consonant placeholder |
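One way to handle the special spaces in this table is compatibility normalization. The sketch below assumes NFKC is acceptable for the text in question; `fold_spaces` is a hypothetical helper name.

```python
import unicodedata

# NBSP, NNBSP, MMSP, hair space, thin space
SPECIAL_SPACES = ["\u00A0", "\u202F", "\u205F", "\u200A", "\u2009"]

def fold_spaces(text):
    """Apply NFKC, which maps the compatibility spaces above to U+0020."""
    return unicodedata.normalize("NFKC", text)

for ch in SPECIAL_SPACES:
    print(f"U+{ord(ch):04X} isspace={ch.isspace()} folded={fold_spaces(ch)!r}")

# Soft hyphen has no compatibility mapping, so NFKC leaves it in place
print(fold_spaces("co\u00ADop") == "co\u00ADop")
```

NFKC also rewrites other compatibility characters (ligatures, full-width forms, and so on), so it suits identifiers and search keys better than text that must be stored exactly as typed. Soft hyphen and the combining grapheme joiner still need explicit removal.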
Variation Selectors
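Unicode defines 256 variation selectors (VS1–VS256) plus 4 Mongolian free variation selectors (3 before Unicode 14.0). They modify the preceding character's glyph but are themselves invisible. U+180E, which sits inside the Mongolian range, is the Mongolian Vowel Separator and not a variation selector:
| Range | Code Points | Count |
|---|---|---|
| VS1–VS16 | U+FE00–U+FE0F | 16 |
| VS17–VS256 | U+E0100–U+E01EF | 240 |
| Mongolian FVS1–FVS4 | U+180B–U+180D, U+180F | 4 |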
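To illustrate how a variation selector changes a string without changing what most readers perceive as "the character", here is a small sketch. The function names are hypothetical, and the Mongolian selectors are left out for brevity.

```python
def is_variation_selector(ch):
    """True for VS1-VS16 and VS17-VS256 (Mongolian selectors not covered)."""
    cp = ord(ch)
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF

def strip_variation_selectors(text):
    return "".join(ch for ch in text if not is_variation_selector(ch))

heart_emoji = "\u2764\uFE0F"   # HEAVY BLACK HEART + VS16 (emoji presentation)
heart_text = "\u2764\uFE0E"    # same base + VS15 (text presentation)

print(len(heart_emoji), len(strip_variation_selectors(heart_emoji)))
print(heart_emoji == heart_text)
print(strip_variation_selectors(heart_emoji) == strip_variation_selectors(heart_text))
```

Stripping selectors makes the two hearts compare equal, but it also discards the author's requested presentation, so this is better suited to comparison keys than to stored text.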
Legitimate Uses
Invisible characters exist for good reasons:
ZWNJ in Persian/Arabic
In Persian, the ZWNJ (U+200C) is essential for correct spelling. It prevents unwanted joining between letters that should remain separate within a word:
| Without ZWNJ | With ZWNJ | Meaning |
|---|---|---|
| میخواهم | می‌خواهم | "I want" (the correct spelling has ZWNJ, U+200C, after می) |
Stripping ZWNJ from Persian text produces incorrect and potentially meaningless text.
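A script-aware filter can keep ZWNJ where it is orthographically meaningful and drop it elsewhere. The sketch below is one possible approach, not a standard algorithm: it treats any character whose Unicode name starts with `ARABIC` as Arabic-script, which is an approximation because the standard library has no Script property lookup. Both helper names are hypothetical.

```python
import unicodedata

ZWNJ = "\u200C"

def is_arabic_script(ch):
    """Rough script test based on the character name (an approximation)."""
    return unicodedata.name(ch, "").startswith("ARABIC")

def strip_zwnj_outside_arabic(text):
    out = []
    for i, ch in enumerate(text):
        if ch == ZWNJ:
            prev_ok = i > 0 and is_arabic_script(text[i - 1])
            next_ok = i + 1 < len(text) and is_arabic_script(text[i + 1])
            if not (prev_ok and next_ok):
                continue  # drop ZWNJ that is not between Arabic-script characters
        out.append(ch)
    return "".join(out)

# "I want" in Persian: meem, farsi yeh, ZWNJ, khah, waw, alef, heh, meem
persian = "\u0645\u06CC\u200C\u062E\u0648\u0627\u0647\u0645"
print(strip_zwnj_outside_arabic(persian) == persian)
print(strip_zwnj_outside_arabic("pass\u200Cword"))
```

Indic scripts that rely on ZWNJ would need their own name prefixes or a real script database; this sketch covers only the Persian/Arabic case discussed here.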
ZWJ in Emoji
The ZWJ (U+200D) creates composite emoji by joining multiple emoji into a single glyph:
# Family emoji: Man + ZWJ + Woman + ZWJ + Girl
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"
print(family) # Displays as family emoji on supporting platforms
print(len(family)) # 5 code points
BOM (U+FEFF) as File Signature
The byte order mark at the beginning of a UTF-8 file (EF BB BF) signals UTF-8 encoding to editors and tools. While not recommended for new files, it is widespread in Windows-generated files.
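When reading such files in Python, the built-in `utf-8-sig` codec consumes a leading BOM, while plain `utf-8` keeps it as U+FEFF at the start of the text. The following sketch shows the difference; `strip_leading_bom` is a hypothetical helper for text that has already been decoded.

```python
raw = b"\xef\xbb\xbfname,value\n"   # UTF-8 bytes with a leading BOM

as_utf8 = raw.decode("utf-8")        # BOM survives as U+FEFF
as_sig = raw.decode("utf-8-sig")     # BOM is consumed by the codec

print(repr(as_utf8[0]))
print(as_sig.startswith("name"))

def strip_leading_bom(text):
    """Remove a single leading U+FEFF from already-decoded text."""
    return text[1:] if text.startswith("\ufeff") else text

print(strip_leading_bom(as_utf8) == as_sig)
```

A U+FEFF that appears anywhere other than the start of the text is not a file signature and can be treated like any other zero-width character.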
Attack Vectors
1. Filter Bypass
# Keyword filter: block "password"
blocked_words = {"password", "admin", "root"}
# Attacker inserts ZWSP (U+200B)
malicious = "pass\u200Bword"
print(malicious in blocked_words) # False — bypasses filter
print(malicious) # Displays as "password" (ZWSP is invisible)
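A possible mitigation is to strip format characters before matching. The sketch below is a minimal illustration that assumes substring matching against a small blocklist; `strip_format_chars` and `contains_blocked_word` are hypothetical names.

```python
import unicodedata

BLOCKED_WORDS = {"password", "admin", "root"}

def strip_format_chars(text):
    """Drop every character in general category Cf (format)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")

def contains_blocked_word(text):
    # Clean a copy for matching only; the original text is left untouched
    cleaned = strip_format_chars(text).casefold()
    return any(word in cleaned for word in BLOCKED_WORDS)

print(contains_blocked_word("pass\u200Bword"))
print(contains_blocked_word("my pass\u2060word is secret"))
print(contains_blocked_word("hello"))
```

Because category `Cf` includes ZWNJ and ZWJ, the cleaned copy should be used only for the comparison and not stored in place of the user's text.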
2. Username Impersonation
# Two visually identical usernames
user_a = "admin"
user_b = "admin\u200B" # with trailing ZWSP
user_c = "a\u200Bdmin" # with internal ZWSP
print(user_a == user_b) # False
print(user_a == user_c) # False
# All three display as "admin" in most interfaces
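One way to catch these collisions at registration time is to compare usernames by a canonical key instead of by raw string. The sketch below assumes NFKC normalization plus removal of category `Cf` characters is acceptable for usernames; `username_key` and `is_taken` are hypothetical names.

```python
import unicodedata

def username_key(name):
    """Comparison key: NFKC-normalize, drop format characters, casefold."""
    normalized = unicodedata.normalize("NFKC", name)
    visible = "".join(ch for ch in normalized
                      if unicodedata.category(ch) != "Cf")
    return visible.casefold()

def is_taken(candidate, existing_names):
    taken_keys = {username_key(n) for n in existing_names}
    return username_key(candidate) in taken_keys

existing = ["admin"]
print(is_taken("admin\u200B", existing))
print(is_taken("a\u200Bdmin", existing))
print(is_taken("alice", existing))
```

This addresses only invisible characters and compatibility forms. Look-alike letters from other scripts need a separate confusables check.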
3. Text Watermarking
Invisible characters can create unique fingerprints in text to track copy-paste:
def watermark(text, user_id):
    # Encode user_id as binary, insert as ZWJ/ZWNJ pattern
    binary = format(user_id, "016b")
    marker = ""
    for bit in binary:
        marker += "\u200D" if bit == "1" else "\u200C"
    return text[:len(text)//2] + marker + text[len(text)//2:]
def extract_watermark(text):
    bits = ""
    for ch in text:
        if ch == "\u200D":
            bits += "1"
        elif ch == "\u200C":
            bits += "0"
    return int(bits, 2) if bits else None
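From the defender's side, a watermark of this kind shows up as an unusually long run of zero-width characters. The sketch below flags such runs; the threshold of four and the function names are assumptions chosen for illustration.

```python
import re

# Runs of ZWSP/ZWNJ/ZWJ/WJ/BOM; a long run is unusual in natural text
ZERO_WIDTH_RUN = re.compile("[\u200B\u200C\u200D\u2060\uFEFF]{4,}")

def find_zero_width_runs(text):
    """Return (start, length) for each run of 4+ zero-width characters."""
    return [(m.start(), len(m.group())) for m in ZERO_WIDTH_RUN.finditer(text)]

def strip_zero_width_runs(text):
    return ZERO_WIDTH_RUN.sub("", text)

marked = "Quarterly" + "\u200D\u200C" * 8 + " report"
print(find_zero_width_runs(marked))
print(strip_zero_width_runs(marked))
```

Emoji ZWJ sequences and Persian words use single, isolated joiners, so a run-length threshold leaves them alone. A watermark that scatters single characters through the text would evade this check and needs per-character detection.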
4. BiDi Source Code Attacks
# The Trojan Source vulnerability (CVE-2021-42574)
# BiDi overrides in comments or strings can reorder
# the visual presentation of source code
# Making malicious code appear benign in code review
# Detection: scan for BiDi control characters in source files
BIDI_CONTROLS = set(range(0x202A, 0x202F)) | set(range(0x2066, 0x206A))
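The scan mentioned in the comment above could look like the following sketch, which reports the line and column of each BiDi control in a source string. `scan_source` and the sample snippet are hypothetical; a real check would read files from a repository or run as a pre-commit hook.

```python
BIDI_CONTROLS = set(range(0x202A, 0x202F)) | set(range(0x2066, 0x206A))

def scan_source(source):
    """Return (line_no, column, 'U+XXXX') for each BiDi control found.

    Line and column numbers are 1-based.
    """
    findings = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) in BIDI_CONTROLS:
                findings.append((line_no, col, f"U+{ord(ch):04X}"))
    return findings

# A string literal that hides an override and an isolate
sample = 'access = "user"\nif access != "admin\u202E \u2066# check":\n    pass\n'
for finding in scan_source(sample):
    print(finding)
```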
Detection and Removal
Comprehensive Detection
import unicodedata
# All invisible/format characters worth detecting
INVISIBLE_CODEPOINTS = {
    # Zero-width characters
    0x200B, 0x200C, 0x200D, 0x2060, 0xFEFF,
    # Invisible math operators
    0x2061, 0x2062, 0x2063, 0x2064,
    # BiDi controls
    0x200E, 0x200F,
    0x202A, 0x202B, 0x202C, 0x202D, 0x202E,
    0x2066, 0x2067, 0x2068, 0x2069,
    0x061C,
    # Deprecated format chars
    0x206A, 0x206B, 0x206C, 0x206D, 0x206E, 0x206F,
    # Soft hyphen
    0x00AD,
    # Combining grapheme joiner
    0x034F,
    # Not in this set: tag characters (U+E0001–U+E007F), which appear in
    # emoji subdivision flag sequences.
    # Variation selectors (U+FE00–U+FE0F, U+E0100–U+E01EF) are matched
    # by range in detect_invisible instead of being listed here.
}
def detect_invisible(text):
    findings = []
    for i, ch in enumerate(text):
        cp = ord(ch)
        if cp in INVISIBLE_CODEPOINTS:
            name = unicodedata.name(ch, f"U+{cp:04X}")
            findings.append({
                "position": i,
                "codepoint": f"U+{cp:04X}",
                "name": name,
                "char": repr(ch),
            })
        # Also check for variation selectors
        elif 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF:
            findings.append({
                "position": i,
                "codepoint": f"U+{cp:04X}",
                "name": unicodedata.name(ch, "VARIATION SELECTOR"),
                "char": repr(ch),
            })
    return findings
# Usage
text = "Hello\u200B World\u200E!"
results = detect_invisible(text)
for r in results:
    print(f" Position {r['position']}: {r['name']} ({r['codepoint']})")
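For visualization, one simple option is to replace each invisible character with a readable marker so that it shows up in logs and review tools. The sketch below marks every character in general category `Cf`; the function name and the `<U+XXXX>` marker format are assumptions.

```python
import unicodedata

def visualize_invisible(text):
    """Replace each format character (category Cf) with a <U+XXXX> marker."""
    parts = []
    for ch in text:
        if unicodedata.category(ch) == "Cf":
            parts.append(f"<U+{ord(ch):04X}>")
        else:
            parts.append(ch)
    return "".join(parts)

print(visualize_invisible("Hello\u200B World\u200E!"))
```

Variation selectors and the combining grapheme joiner are classified as nonspacing marks (`Mn`) rather than `Cf`, so they would need an additional range check to be shown this way.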
Context-Aware Removal
def remove_invisible(text, preserve_zwj_emoji=True, preserve_zwnj_persian=False):
    result = []
    for i, ch in enumerate(text):
        cp = ord(ch)
        # Always remove BiDi embeddings, overrides, and isolates;
        # they are rarely legitimate in user input
        if cp in {0x202A, 0x202B, 0x202C, 0x202D, 0x202E,
                  0x2066, 0x2067, 0x2068, 0x2069}:
            continue
        # Preserve ZWJ in emoji contexts if requested
        if cp == 0x200D and preserve_zwj_emoji:
            if i > 0 and i < len(text) - 1:
                prev_cp = ord(text[i - 1])
                next_cp = ord(text[i + 1])
                # Rough heuristic: keep the ZWJ if either neighbor lies above
                # U+1F000, where most pictographic emoji are encoded
                if prev_cp > 0x1F000 or next_cp > 0x1F000:
                    result.append(ch)
                    continue
        # Preserve every ZWNJ if requested (no script check is performed)
        if cp == 0x200C and preserve_zwnj_persian:
            result.append(ch)
            continue
        # Remove all other invisible characters
        if cp in INVISIBLE_CODEPOINTS:
            continue
        result.append(ch)
    return "".join(result)
JavaScript Detection
// Regex matching common invisible characters
const invisiblePattern = /[\u200B\u200C\u200D\u200E\u200F\u202A-\u202E\u2060-\u2064\u2066-\u2069\uFEFF\u00AD\u034F\u061C]/g;
function detectInvisible(text) {
  const matches = [];
  let match;
  const regex = new RegExp(invisiblePattern.source, "g");
  while ((match = regex.exec(text)) !== null) {
    matches.push({
      position: match.index,
      codepoint: `U+${match[0].codePointAt(0).toString(16).toUpperCase().padStart(4, "0")}`,
    });
  }
  return matches;
}
function removeInvisible(text) {
  return text.replace(invisiblePattern, "");
}
Best Practices
| Context | Recommendation |
|---|---|
| Usernames | Strip all invisible characters; normalize to NFKC |
| Passwords | Allow ZWNJ/ZWJ (multilingual input); normalize before hashing |
| Search queries | Strip invisible chars; normalize to NFC |
| Source code | Reject BiDi overrides; warn on any invisible characters |
| Rich text | Preserve ZWNJ (Persian/Indic), ZWJ (emoji); strip overrides |
| Domain validation | Apply full IDN/UTS #39 checks |
| Keyword filters | Strip invisible characters before matching |
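A few rows of this table can be expressed as a small policy lookup. The sketch below is a simplification under stated assumptions: it covers only three contexts, treats "invisible" as general category `Cf`, and for rich text strips every format character except ZWNJ and ZWJ, which is stricter than removing overrides alone. `POLICIES` and `sanitize` are hypothetical names.

```python
import unicodedata

ZWNJ, ZWJ = "\u200C", "\u200D"

# context -> (invisible characters to keep, normalization form)
POLICIES = {
    "username": (set(), "NFKC"),
    "search": (set(), "NFC"),
    "rich_text": ({ZWNJ, ZWJ}, "NFC"),
}

def sanitize(text, context):
    """Strip format characters per the context policy, then normalize."""
    keep, form = POLICIES[context]
    kept = "".join(ch for ch in text
                   if ch in keep or unicodedata.category(ch) != "Cf")
    return unicodedata.normalize(form, kept)

persian = "\u0645\u06CC\u200C\u062E\u0648\u0627\u0647\u0645"
print(sanitize("ad\u200Bmin", "username"))
print(sanitize(persian + "\u202E", "rich_text") == persian)
print(sanitize("cafe\u0301\u200B", "search") == "caf\u00E9")
```

Keeping the policy in a table makes it easy to review which contexts tolerate which characters, and to add rows (for example, source code or passwords) as separate decisions.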
Key Takeaways
- Unicode contains 30+ invisible characters including zero-width spaces, BiDi controls, invisible math operators, and format characters — each with legitimate uses but significant abuse potential.
- ZWNJ is essential for Persian and Indic scripts, and ZWJ is essential for emoji sequences — naive stripping of all invisible characters breaks legitimate text.
- Invisible characters enable filter bypass (splitting blocked words), username impersonation (visually identical but distinct strings), text watermarking (tracking copy-paste), and source code attacks (BiDi reordering).
- Detection should use a comprehensive set of known invisible code points, and removal should be context-aware (preserving ZWJ in emoji, ZWNJ in Persian).
- BiDi embedding, override, and isolate controls (U+202A–U+202E, U+2066–U+2069) should be stripped from almost all user input, because they are rarely legitimate outside of specialized text editors.
- Always normalize text (NFC or NFKC) after removing invisible characters to collapse any remaining variations.