Invisible Character Detection and Removal

Invisible characters are among the most insidious features of Unicode from a security perspective. They occupy space in a string, affect text processing, and can alter rendering behavior, all without producing any visible output. From zero-width spaces that bypass keyword filters to bidirectional controls that reorder displayed text, they create a class of vulnerabilities that visual inspection cannot catch. This guide catalogs the invisible characters in Unicode, explains their legitimate purposes, demonstrates how they can be abused, and provides detection and removal techniques for developers.

Catalog of Invisible Characters

Zero-Width Characters

These characters have zero advance width — they produce no visible glyph and occupy no horizontal space:

Character	Code Point	Name	Legitimate Purpose
ZWSP	U+200B	Zero Width Space	Word-break opportunity without visible space
ZWNJ	U+200C	Zero Width Non-Joiner	Prevent ligature/joining (Persian, Indic)
ZWJ	U+200D	Zero Width Joiner	Force ligature/joining (emoji sequences)
WJ	U+2060	Word Joiner	Prevent line break (replacement for U+FEFF)
ZWNBSP	U+FEFF	Zero Width No-Break Space / BOM	Byte order mark at file start
IA	U+2061	Function Application	Mathematical notation
IT	U+2062	Invisible Times	Mathematical notation
IS	U+2063	Invisible Separator	Mathematical notation
IP	U+2064	Invisible Plus	Mathematical notation

Bidirectional Control Characters

These control the direction of text display without producing visible output:

Character	Code Point	Name
LRM	U+200E	Left-to-Right Mark
RLM	U+200F	Right-to-Left Mark
LRE	U+202A	Left-to-Right Embedding
RLE	U+202B	Right-to-Left Embedding
PDF	U+202C	Pop Directional Formatting
LRO	U+202D	Left-to-Right Override
RLO	U+202E	Right-to-Left Override
LRI	U+2066	Left-to-Right Isolate
RLI	U+2067	Right-to-Left Isolate
FSI	U+2068	First Strong Isolate
PDI	U+2069	Pop Directional Isolate
ALM	U+061C	Arabic Letter Mark

Special Spaces and Format Characters

Character	Code Point	Name	Width
NBSP	U+00A0	No-Break Space	Same as regular space
NNBSP	U+202F	Narrow No-Break Space	Narrower than regular
MMSP	U+205F	Medium Mathematical Space	Medium
HAIR	U+200A	Hair Space	Very thin
THIN	U+2009	Thin Space	Thin
SHY	U+00AD	Soft Hyphen	Zero (visible only at line break)
CGJ	U+034F	Combining Grapheme Joiner	Zero
IDSP	U+3000	Ideographic Space	Full-width blank
ㅤ	U+3164	Hangul Filler	Used as visible blank in Korean
ᅠ	U+115F	Hangul Choseong Filler	Leading consonant placeholder
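
Several of the spaces in this table carry compatibility decompositions to U+0020, so NFKC normalization folds them into an ordinary space. The following is a minimal sketch using Python's standard unicodedata module; the helper name fold_spaces is illustrative only. Note that NFKC also rewrites other compatibility characters (full-width letters, ligatures), and that the soft hyphen has no decomposition, so it passes through unchanged and still needs explicit removal.

```python
import unicodedata

def fold_spaces(text: str) -> str:
    """Fold compatibility spaces (NBSP, thin space, ...) into U+0020 via NFKC."""
    return unicodedata.normalize("NFKC", text)

# NBSP, thin space, narrow NBSP, and medium mathematical space
sample = "a\u00A0b\u2009c\u202Fd\u205Fe"
print(fold_spaces(sample) == "a b c d e")         # True

# The soft hyphen has no decomposition, so NFKC leaves it in place
print(fold_spaces("co\u00ADop") == "co\u00ADop")  # True
```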

Variation Selectors

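As of Unicode 14.0, there are 260 variation selectors (VS1–VS256 plus 4 Mongolian free variation selectors). They modify the preceding character's glyph but are themselves invisible. U+180E, which sits inside the Mongolian range, is the Mongolian Vowel Separator and not a variation selector:

Range	Code Points	Count
VS1–VS16	U+FE00–U+FE0F	16
VS17–VS256	U+E0100–U+E01EF	240
Mongolian FVS1–FVS4	U+180B–U+180D, U+180F	4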
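
The ranges above translate directly into a membership test. This is a sketch with an illustrative helper name, is_variation_selector; it deliberately skips U+180E, which is the Mongolian Vowel Separator and not a variation selector. The example also shows that two strings differing only in their variation selector compare as unequal.

```python
def is_variation_selector(ch: str) -> bool:
    """Return True if ch is one of the Unicode variation selectors."""
    cp = ord(ch)
    return (
        0xFE00 <= cp <= 0xFE0F          # VS1-VS16
        or 0xE0100 <= cp <= 0xE01EF     # VS17-VS256
        or 0x180B <= cp <= 0x180D       # Mongolian FVS1-FVS3
        or cp == 0x180F                 # Mongolian FVS4
    )

text_style = "\u2764\uFE0E"   # heavy black heart + VS15 (text presentation)
emoji_style = "\u2764\uFE0F"  # heavy black heart + VS16 (emoji presentation)

print(text_style == emoji_style)  # False
print([f"U+{ord(c):04X}" for c in emoji_style if is_variation_selector(c)])  # ['U+FE0F']
```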

Legitimate Uses

Invisible characters exist for good reasons:

ZWNJ in Persian/Arabic

In Persian, the ZWNJ (U+200C) is essential for correct spelling. It prevents unwanted joining between letters that should remain separate within a word:

Without ZWNJ	With ZWNJ	Meaning
میخواهم	می‌خواهم	"I want" (correct with ZWNJ after می)

Stripping ZWNJ from Persian text produces incorrect and potentially meaningless text.

ZWJ in Emoji

The ZWJ (U+200D) creates composite emoji by joining multiple emoji into a single glyph:

# Family emoji: Man + ZWJ + Woman + ZWJ + Girl
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"
print(family)  # Displays as family emoji on supporting platforms
print(len(family))  # 5 code points

BOM (U+FEFF) as File Signature

The byte order mark at the beginning of a UTF-8 file (EF BB BF) signals UTF-8 encoding to editors and tools. While not recommended for new files, it is widespread in Windows-generated files.
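
When reading such files in Python, the choice of codec decides whether the BOM ends up in the string. A short sketch: the plain utf-8 codec keeps the BOM as a leading U+FEFF, while the standard utf-8-sig codec consumes it if present.

```python
raw = b"\xef\xbb\xbfhello"         # UTF-8 bytes with a leading BOM

plain = raw.decode("utf-8")        # BOM survives as U+FEFF
signed = raw.decode("utf-8-sig")   # BOM is consumed by the codec

print(repr(plain))   # '\ufeffhello'
print(repr(signed))  # 'hello'

# utf-8-sig is also safe on input that has no BOM
print(b"hello".decode("utf-8-sig"))  # hello
```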

Attack Vectors

1. Filter Bypass

# Keyword filter: block "password"
blocked_words = {"password", "admin", "root"}

# Attacker inserts ZWSP (U+200B)
malicious = "pass\u200Bword"
print(malicious in blocked_words)  # False — bypasses filter
print(malicious)  # Displays as "password" (ZWSP is invisible)
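
One possible countermeasure is to strip zero-width characters from a copy of the input before matching. The sketch below assumes a substring-style filter; the names ZERO_WIDTH and contains_blocked are hypothetical, and the code point list is a small illustrative subset. Only the copy used for matching is stripped, so text that legitimately needs ZWNJ or ZWJ can still be stored unmodified.

```python
# Map each code point to None so str.translate deletes it
ZERO_WIDTH = dict.fromkeys([0x200B, 0x200C, 0x200D, 0x2060, 0xFEFF, 0x00AD])
BLOCKED_WORDS = {"password", "admin", "root"}

def contains_blocked(text: str) -> bool:
    """Check for blocked words after removing zero-width characters."""
    cleaned = text.translate(ZERO_WIDTH).casefold()
    return any(word in cleaned for word in BLOCKED_WORDS)

print(contains_blocked("pass\u200Bword"))  # True
print(contains_blocked("hello"))           # False
```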

2. Username Impersonation

# Two visually identical usernames
user_a = "admin"
user_b = "admin\u200B"       # with trailing ZWSP
user_c = "a\u200Bdmin"       # with internal ZWSP

print(user_a == user_b)  # False
print(user_a == user_c)  # False
# All three display as "admin" in most interfaces
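
A registration check can guard against this by comparing canonical keys and not raw strings. The following is a sketch under stated assumptions: the helper username_key is hypothetical, and it treats every character in Unicode general category Cf (format) as removable, which covers zero-width and BiDi characters but may be too aggressive for scripts that rely on ZWNJ.

```python
import unicodedata

def username_key(name: str) -> str:
    """Comparison key: NFKC-normalize, drop format (Cf) characters, casefold."""
    normalized = unicodedata.normalize("NFKC", name)
    visible = "".join(ch for ch in normalized if unicodedata.category(ch) != "Cf")
    return visible.casefold()

taken = {username_key("admin")}

# The first three candidates collide with "admin"; "alice" does not
for candidate in ["admin\u200B", "a\u200Bdmin", "Admin", "alice"]:
    print(ascii(candidate), username_key(candidate) in taken)
```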

3. Text Watermarking

Invisible characters can create unique fingerprints in text to track copy-paste:

def watermark(text, user_id):
    # Encode user_id as binary, insert as ZWJ/ZWNJ pattern
    binary = format(user_id, "016b")
    marker = ""
    for bit in binary:
        marker += "\u200D" if bit == "1" else "\u200C"
    return text[:len(text)//2] + marker + text[len(text)//2:]

def extract_watermark(text):
    bits = ""
    for ch in text:
        if ch == "\u200D":
            bits += "1"
        elif ch == "\u200C":
            bits += "0"
    return int(bits, 2) if bits else None
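
A watermark of this kind shows up as a long run of consecutive ZWNJ/ZWJ characters, which ordinary text rarely contains. A rough detection sketch follows; the threshold of eight and the name find_watermark_runs are assumptions chosen for illustration, not an established rule.

```python
import re

# Eight or more consecutive ZWNJ/ZWJ characters (threshold is an assumption)
WATERMARK_RUN = re.compile("[\u200C\u200D]{8,}")

def find_watermark_runs(text: str):
    """Return (start index, run length) for each suspicious run."""
    return [(m.start(), len(m.group())) for m in WATERMARK_RUN.finditer(text)]

marked = "hello" + "\u200C" * 10 + "\u200D\u200C\u200D\u200C\u200D\u200C" + " world"
print(find_watermark_runs(marked))  # [(5, 16)]

# A single ZWNJ in Persian text is not flagged
print(find_watermark_runs("می\u200Cخواهم"))  # []
```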

4. BiDi Source Code Attacks

# The Trojan Source vulnerability (CVE-2021-42574)
# BiDi overrides in comments or strings can reorder
# the visual presentation of source code
# Making malicious code appear benign in code review

# Detection: scan for BiDi control characters in source files
BIDI_CONTROLS = set(range(0x202A, 0x202F)) | set(range(0x2066, 0x206A))
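
The scan itself can be sketched as a function that reports where each control appears, so a reviewer or CI job can point at the exact location. The name scan_bidi and the 1-based line/column convention are choices made for this example.

```python
BIDI_CONTROLS = set(range(0x202A, 0x202F)) | set(range(0x2066, 0x206A))

def scan_bidi(source: str):
    """Return (line, column, code point) for each BiDi control; both 1-based."""
    hits = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) in BIDI_CONTROLS:
                hits.append((line_no, col, f"U+{ord(ch):04X}"))
    return hits

sample = 'x = 1\naccess = "user\u202E" # check\n'
print(scan_bidi(sample))  # [(2, 15, 'U+202E')]
```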

Detection and Removal

Comprehensive Detection

import unicodedata

# All invisible/format characters worth detecting
INVISIBLE_CODEPOINTS = {
    # Zero-width characters
    0x200B, 0x200C, 0x200D, 0x2060, 0xFEFF,
    # Invisible math operators
    0x2061, 0x2062, 0x2063, 0x2064,
    # BiDi controls
    0x200E, 0x200F,
    0x202A, 0x202B, 0x202C, 0x202D, 0x202E,
    0x2066, 0x2067, 0x2068, 0x2069,
    0x061C,
    # Deprecated format chars
    0x206A, 0x206B, 0x206C, 0x206D, 0x206E, 0x206F,
    # Soft hyphen
    0x00AD,
    # Combining grapheme joiner
    0x034F,
    # Not listed individually here:
    #   tag characters (U+E0001, U+E0020–U+E007F), used in emoji subdivision-flag sequences
    #   variation selectors (U+FE00–U+FE0F, U+E0100–U+E01EF), range-checked in detect_invisible
}

def detect_invisible(text):
    findings = []
    for i, ch in enumerate(text):
        cp = ord(ch)
        if cp in INVISIBLE_CODEPOINTS:
            name = unicodedata.name(ch, f"U+{cp:04X}")
            findings.append({
                "position": i,
                "codepoint": f"U+{cp:04X}",
                "name": name,
                "char": repr(ch),
            })
        # Also check variation selectors and tag characters by range
        elif (0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF
              or 0xE0001 <= cp <= 0xE007F):
            findings.append({
                "position": i,
                "codepoint": f"U+{cp:04X}",
                "name": unicodedata.name(ch, f"U+{cp:04X}"),
                "char": repr(ch),
            })
    return findings

# Usage
text = "Hello\u200B World\u200E!"
results = detect_invisible(text)
for r in results:
    print(f"  Position {r['position']}: {r['name']} ({r['codepoint']})")
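
A fixed list can fall behind new Unicode versions. As a complement, the Unicode general category can be used: most of the characters above are in category Cf (format). The sketch below uses the hypothetical name find_format_chars. It is not a full replacement, since variation selectors and the combining grapheme joiner are category Mn, the special spaces are Zs, and Cf also contains a few visible characters such as the Arabic number signs.

```python
import unicodedata

def find_format_chars(text: str):
    """List (index, code point, name) for every character in category Cf."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cf"
    ]

for hit in find_format_chars("Hello\u200B World\u200E!"):
    print(hit)
# (5, 'U+200B', 'ZERO WIDTH SPACE')
# (12, 'U+200E', 'LEFT-TO-RIGHT MARK')
```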

Context-Aware Removal

def remove_invisible(text, preserve_zwj_emoji=True, preserve_zwnj_persian=False):
    result = []
    for i, ch in enumerate(text):
        cp = ord(ch)
        # Always remove BiDi embeddings, overrides, and isolates; rarely legitimate in user input
        if cp in {0x202A, 0x202B, 0x202C, 0x202D, 0x202E,
                  0x2066, 0x2067, 0x2068, 0x2069}:
            continue
        # Preserve ZWJ in emoji contexts if requested
        if cp == 0x200D and preserve_zwj_emoji:
            if i > 0 and i < len(text) - 1:
                prev_cp = ord(text[i - 1])
                next_cp = ord(text[i + 1])
                # Heuristic: keep ZWJ when either neighbor is above U+1F000 (most emoji)
                if prev_cp > 0x1F000 or next_cp > 0x1F000:
                    result.append(ch)
                    continue
        # Preserve ZWNJ in Persian/Arabic if requested
        if cp == 0x200C and preserve_zwnj_persian:
            result.append(ch)
            continue
        # Remove all other invisible characters
        if cp in INVISIBLE_CODEPOINTS:
            continue
        result.append(ch)
    return "".join(result)

JavaScript Detection

// Regex matching common invisible characters
const invisiblePattern = /[\u200B\u200C\u200D\u200E\u200F\u202A-\u202E\u2060-\u2064\u2066-\u2069\uFEFF\u00AD\u034F\u061C]/g;

function detectInvisible(text) {
  const matches = [];
  let match;
  const regex = new RegExp(invisiblePattern.source, "g");
  while ((match = regex.exec(text)) !== null) {
    matches.push({
      position: match.index,
      codepoint: `U+${match[0].codePointAt(0).toString(16).toUpperCase().padStart(4, "0")}`,
    });
  }
  return matches;
}

function removeInvisible(text) {
  return text.replace(invisiblePattern, "");
}

Best Practices

Context	Recommendation
Usernames	Strip all invisible characters; normalize to NFKC
Passwords	Allow ZWNJ/ZWJ (multilingual input); normalize before hashing
Search queries	Strip invisible chars; normalize to NFC
Source code	Reject BiDi overrides; warn on any invisible characters
Rich text	Preserve ZWNJ (Persian/Indic), ZWJ (emoji); strip overrides
Domain validation	Apply full IDN/UTS #39 checks
Keyword filters	Strip invisible characters before matching

Key Takeaways

Unicode contains 30+ invisible characters including zero-width spaces, BiDi controls, invisible math operators, and format characters — each with legitimate uses but significant abuse potential.
ZWNJ is essential for Persian and Indic scripts, and ZWJ is essential for emoji sequences — naive stripping of all invisible characters breaks legitimate text.
Invisible characters enable filter bypass (splitting blocked words), username impersonation (visually identical but distinct strings), text watermarking (tracking copy-paste), and source code attacks (BiDi reordering).
Detection should use a comprehensive set of known invisible code points, and removal should be context-aware (preserving ZWJ in emoji, ZWNJ in Persian).
BiDi embedding, override, and isolate controls (U+202A–U+202E, U+2066–U+2069) should be stripped from almost all user input; they are rarely legitimate outside of specialized text editors.
Always normalize text (NFC or NFKC) after removing invisible characters to collapse any remaining variations.