Invisible Character Detection and Removal

Zero-width and other invisible Unicode characters can be used to fingerprint text for tracking, hide malicious payloads in code, or bypass content filters while remaining undetectable to the human eye. This guide explains how to detect, visualize, and remove invisible Unicode characters from user input and stored text using code examples in Python and JavaScript.

Invisible characters are among the most insidious features of Unicode from a security perspective. They add code points to a string, affect text processing, and can alter rendering behavior, all without producing any visible output. From zero-width spaces that bypass keyword filters to bidirectional marks that reorder displayed text, they create a class of vulnerabilities that cannot be spotted by eye. This guide catalogs the invisible characters in Unicode, explains their legitimate purposes, demonstrates how they can be abused, and provides detection and removal techniques for developers.

Catalog of Invisible Characters

Zero-Width Characters

These characters have zero advance width — they produce no visible glyph and occupy no horizontal space:

| Character | Code Point | Name | Legitimate Purpose |
| --- | --- | --- | --- |
| ZWSP | U+200B | Zero Width Space | Word-break opportunity without visible space |
| ZWNJ | U+200C | Zero Width Non-Joiner | Prevent ligature/joining (Persian, Indic) |
| ZWJ | U+200D | Zero Width Joiner | Force ligature/joining (emoji sequences) |
| WJ | U+2060 | Word Joiner | Prevent line break (replacement for U+FEFF) |
| ZWNBSP | U+FEFF | Zero Width No-Break Space / BOM | Byte order mark at file start |
| IA | U+2061 | Function Application | Mathematical notation |
| IT | U+2062 | Invisible Times | Mathematical notation |
| IS | U+2063 | Invisible Separator | Mathematical notation |
| IP | U+2064 | Invisible Plus | Mathematical notation |
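
The properties in this table can be checked directly. The following is a minimal sketch using only Python's standard unicodedata module; the ZERO_WIDTH list is an illustrative name, not a standard constant. It prints the general category of each character (all of them are Cf, "format") and shows that a zero-width character still changes a string's length and equality.

```python
import unicodedata

# Code points from the zero-width table (illustrative name, not a standard constant)
ZERO_WIDTH = [0x200B, 0x200C, 0x200D, 0x2060, 0xFEFF,
              0x2061, 0x2062, 0x2063, 0x2064]

for cp in ZERO_WIDTH:
    ch = chr(cp)
    # General category "Cf" marks format characters with no glyph of their own
    print(f"U+{cp:04X}  {unicodedata.category(ch)}  {unicodedata.name(ch)}")

plain = "ab"
padded = "a\u200Bb"
print(len(plain), len(padded))   # the padded string is one code point longer
print(plain == padded)           # and the two strings compare unequal
```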

Bidirectional Control Characters

These control the direction of text display without producing visible output:

| Character | Code Point | Name |
| --- | --- | --- |
| LRM | U+200E | Left-to-Right Mark |
| RLM | U+200F | Right-to-Left Mark |
| LRE | U+202A | Left-to-Right Embedding |
| RLE | U+202B | Right-to-Left Embedding |
| PDF | U+202C | Pop Directional Formatting |
| LRO | U+202D | Left-to-Right Override |
| RLO | U+202E | Right-to-Left Override |
| LRI | U+2066 | Left-to-Right Isolate |
| RLI | U+2067 | Right-to-Left Isolate |
| FSI | U+2068 | First Strong Isolate |
| PDI | U+2069 | Pop Directional Isolate |
| ALM | U+061C | Arabic Letter Mark |
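
Python's unicodedata module exposes the bidirectional class of each character, which is one way to recognize these controls programmatically. This is a small sketch; BIDI_CHARS and bidi_classes are illustrative names introduced here.

```python
import unicodedata

# Code points from the bidirectional control table (illustrative name)
BIDI_CHARS = [0x200E, 0x200F, 0x202A, 0x202B, 0x202C, 0x202D, 0x202E,
              0x2066, 0x2067, 0x2068, 0x2069, 0x061C]

# Map each code point to its Unicode bidirectional class
bidi_classes = {cp: unicodedata.bidirectional(chr(cp)) for cp in BIDI_CHARS}

for cp, cls in bidi_classes.items():
    print(f"U+{cp:04X}  {cls:<4} {unicodedata.name(chr(cp))}")
```

The marks (LRM, RLM, ALM) report the class of an ordinary strong character, while embeddings, overrides, and isolates each have a dedicated class.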

Special Spaces and Format Characters

| Character | Code Point | Name | Width |
| --- | --- | --- | --- |
| NBSP | U+00A0 | No-Break Space | Same as regular space |
| NNBSP | U+202F | Narrow No-Break Space | Narrower than regular |
| MMSP | U+205F | Medium Mathematical Space | Medium |
| HAIR | U+200A | Hair Space | Very thin |
| THIN | U+2009 | Thin Space | Thin |
| SHY | U+00AD | Soft Hyphen | Zero (visible only at line break) |
| CGJ | U+034F | Combining Grapheme Joiner | Zero |
| IDSP | U+3000 | Ideographic Space | Full-width blank |
| | U+3164 | Hangul Filler | Renders as a blank; used as a filler in Korean |
| | U+115F | Hangul Choseong Filler | Leading consonant placeholder |
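
These characters do not share a single Unicode category, which matters for detection: a check based only on whitespace or only on format characters will miss some of them. The sketch below (SPACES_AND_FILLERS is an illustrative name) prints the category of each and shows that Python's str.strip() leaves a Hangul filler in place.

```python
import unicodedata

# A selection of code points from the table above (illustrative name)
SPACES_AND_FILLERS = [0x00A0, 0x202F, 0x205F, 0x200A, 0x2009,
                      0x00AD, 0x034F, 0x3164, 0x115F]

for cp in SPACES_AND_FILLERS:
    ch = chr(cp)
    print(f"U+{cp:04X}  category={unicodedata.category(ch)}  "
          f"isspace={ch.isspace()}  {unicodedata.name(ch)}")

# str.strip() only trims characters Python treats as whitespace,
# so a Hangul filler (category Lo, a letter) survives it
padded = "\u3164admin\u3164"
print(len(padded.strip()))
```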

Variation Selectors

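Unicode defines 260 variation selectors as of Unicode 14.0: VS1–VS256 plus four Mongolian free variation selectors. They modify the preceding character's glyph but are themselves invisible:

| Range | Code Points | Count |
| --- | --- | --- |
| VS1–VS16 | U+FE00–U+FE0F | 16 |
| VS17–VS256 | U+E0100–U+E01EF | 240 |
| Mongolian FVS1–FVS4 | U+180B–U+180D, U+180F | 4 |

U+180E sits between the Mongolian selectors but is the Mongolian Vowel Separator, a format character rather than a variation selector.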
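
As a rough illustration, the two main ranges can be tested with simple comparisons. The helper names below are hypothetical, and the sketch deliberately leaves the Mongolian selectors out. It shows that the same base character followed by VS15 (text presentation) or VS16 (emoji presentation) yields two distinct strings that collapse to one once the selectors are removed.

```python
def is_variation_selector(ch: str) -> bool:
    """True for VS1-VS16 and VS17-VS256 (Mongolian selectors not covered)."""
    cp = ord(ch)
    return 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF

def strip_variation_selectors(text: str) -> str:
    return "".join(ch for ch in text if not is_variation_selector(ch))

heart_text = "\u2764\uFE0E"    # HEAVY BLACK HEART + VS15 (text presentation)
heart_emoji = "\u2764\uFE0F"   # HEAVY BLACK HEART + VS16 (emoji presentation)

print(heart_text == heart_emoji)
print(strip_variation_selectors(heart_text) == strip_variation_selectors(heart_emoji))
```

Stripping selectors changes how emoji render, so this is better suited to comparison keys than to text that will be displayed.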

Legitimate Uses

Invisible characters exist for good reasons:

ZWNJ in Persian/Arabic

In Persian, the ZWNJ (U+200C) is essential for correct spelling. It prevents unwanted joining between letters that should remain separate within a word:

| Without ZWNJ | With ZWNJ | Meaning |
| --- | --- | --- |
| میخواهم | می‌خواهم | "I want" (correct with ZWNJ after می) |

Stripping ZWNJ from Persian text produces incorrect and potentially meaningless text.
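
The effect is easy to reproduce. This short sketch builds the word from explicit code points so the ZWNJ is visible in the source, then removes it the way a naive sanitizer would.

```python
ZWNJ = "\u200C"

# "می‌خواهم" by code point: the prefix می, then ZWNJ, then خواهم
correct = "\u0645\u06CC" + ZWNJ + "\u062E\u0648\u0627\u0647\u0645"

# What a strip-everything sanitizer would store instead
stripped = correct.replace(ZWNJ, "")

print(len(correct), len(stripped))
print(correct == stripped)
```

The stripped form is a different string, and the letters on either side of the removed ZWNJ now join when rendered.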

ZWJ in Emoji

The ZWJ (U+200D) creates composite emoji by joining multiple emoji into a single glyph:

# Family emoji: Man + ZWJ + Woman + ZWJ + Girl
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"
print(family)  # Displays as family emoji on supporting platforms
print(len(family))  # 5 code points

BOM (U+FEFF) as File Signature

The byte order mark at the beginning of a UTF-8 file (EF BB BF) signals UTF-8 encoding to editors and tools. While not recommended for new files, it is widespread in Windows-generated files.
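
In Python, the difference shows up when decoding. Under the plain utf-8 codec the BOM survives as a leading U+FEFF character, while the utf-8-sig codec drops it. The byte string below is made-up sample data.

```python
# Sample bytes as a Windows tool might write them: UTF-8 BOM, then content
raw = b"\xef\xbb\xbfname,value\n"

as_utf8 = raw.decode("utf-8")          # BOM is kept as U+FEFF
as_utf8_sig = raw.decode("utf-8-sig")  # BOM is removed during decoding

print(repr(as_utf8[0]))
print(as_utf8_sig.startswith("name"))
```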

Attack Vectors

1. Filter Bypass

# Keyword filter: block "password"
blocked_words = {"password", "admin", "root"}

# Attacker inserts ZWSP (U+200B)
malicious = "pass\u200Bword"
print(malicious in blocked_words)  # False — bypasses filter
print(malicious)  # Displays as "password" (ZWSP is invisible)
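
One possible mitigation is to clean the input before matching, while leaving the stored text untouched. The sketch below treats Unicode general category Cf as a rough proxy for "invisible"; that is an assumption of this example, since Cf also covers a few visible format characters such as Arabic number signs. The function names are hypothetical.

```python
import unicodedata

BLOCKED_WORDS = {"password", "admin", "root"}

def strip_format_chars(text: str) -> str:
    # Drop every character in general category Cf (ZWSP, ZWJ, BiDi controls, ...)
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")

def is_blocked(text: str) -> bool:
    # Match against a cleaned, case-folded copy; the original text is unchanged
    cleaned = strip_format_chars(text).casefold()
    return cleaned in BLOCKED_WORDS

print(is_blocked("pass\u200Bword"))
print(is_blocked("passport"))
```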

2. Username Impersonation

# Two visually identical usernames
user_a = "admin"
user_b = "admin\u200B"       # with trailing ZWSP
user_c = "a\u200Bdmin"       # with internal ZWSP

print(user_a == user_b)  # False
print(user_a == user_c)  # False
# All three display as "admin" in most interfaces
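
A common defense is to compare usernames by a canonical key rather than by the raw string. The sketch below is one way to do that, under assumptions made for this example: format characters (category Cf) are dropped, the result is normalized to NFKC, and case is folded. The username_key function and the taken set are hypothetical names.

```python
import unicodedata

def username_key(name: str) -> str:
    # Remove format characters first, then normalize and fold case
    no_format = "".join(ch for ch in name if unicodedata.category(ch) != "Cf")
    return unicodedata.normalize("NFKC", no_format).casefold()

# Keys of usernames that already exist
taken = {username_key("admin")}

for candidate in ["admin\u200B", "a\u200Bdmin", "Admin", "alice"]:
    status = "rejected" if username_key(candidate) in taken else "available"
    print(repr(candidate), status)
```

This does not address look-alike letters from other scripts, which need separate confusable checks.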

3. Text Watermarking

Invisible characters can create unique fingerprints in text to track copy-paste:

def watermark(text, user_id):
    # Encode user_id as 16 bits: ZWJ (U+200D) for 1, ZWNJ (U+200C) for 0
    binary = format(user_id, "016b")
    marker = "".join("\u200D" if bit == "1" else "\u200C" for bit in binary)
    # Hide the marker in the middle of the text
    mid = len(text) // 2
    return text[:mid] + marker + text[mid:]

def extract_watermark(text):
    # Assumes the text contains no other ZWJ/ZWNJ characters;
    # legitimate ones (emoji sequences, Persian) would corrupt the result
    bits = ""
    for ch in text:
        if ch == "\u200D":
            bits += "1"
        elif ch == "\u200C":
            bits += "0"
    return int(bits, 2) if bits else None
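
A fingerprint of this kind tends to stand out once you look for it, because legitimate text rarely contains long unbroken runs of joiners. The following sketch flags such runs; the threshold of eight and the function name are arbitrary choices made for illustration.

```python
import re

# Runs of ZWNJ/ZWJ; eight or more in a row is an arbitrary threshold
ZW_RUN = re.compile(r"[\u200C\u200D]{8,}")

def find_zero_width_runs(text: str) -> list:
    """Return (start_index, run_length) for each suspicious run."""
    return [(m.start(), len(m.group())) for m in ZW_RUN.finditer(text)]

# 16-bit pattern for the hypothetical user id 1234, embedded mid-sentence
marker = "".join("\u200D" if bit == "1" else "\u200C"
                 for bit in format(1234, "016b"))
sample = "Quarterly" + marker + " report"

print(find_zero_width_runs(sample))
```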

4. BiDi Source Code Attacks

# Trojan Source (CVE-2021-42574): BiDi control characters placed in
# comments or string literals reorder how source code is displayed,
# so malicious code can look benign during code review.

# Detection: scan source files for BiDi embedding, override, and isolate controls
BIDI_CONTROLS = set(range(0x202A, 0x202F)) | set(range(0x2066, 0x206A))
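
A set like this can drive a simple scanner. The sketch below repeats the set so that it runs on its own and reports the line, column, and name of each control it finds. The find_bidi_controls function and the sample snippet are illustrative, not taken from any particular tool.

```python
import unicodedata

BIDI_CONTROLS = set(range(0x202A, 0x202F)) | set(range(0x2066, 0x206A))

def find_bidi_controls(source: str):
    """Yield (line_number, column, character_name) for each BiDi control."""
    for line_no, line in enumerate(source.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ord(ch) in BIDI_CONTROLS:
                yield line_no, col, unicodedata.name(ch)

# Made-up source text with an override hidden inside a string literal
snippet = 'access = "user"\nif access != "admin\u202E":  # check\n    pass\n'

for hit in find_bidi_controls(snippet):
    print(hit)
```

In a repository, the same function could be run over each file's decoded contents as part of a pre-commit or CI check.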

Detection and Removal

Comprehensive Detection

import unicodedata

# All invisible/format characters worth detecting
INVISIBLE_CODEPOINTS = {
    # Zero-width characters
    0x200B, 0x200C, 0x200D, 0x2060, 0xFEFF,
    # Invisible math operators
    0x2061, 0x2062, 0x2063, 0x2064,
    # BiDi controls
    0x200E, 0x200F,
    0x202A, 0x202B, 0x202C, 0x202D, 0x202E,
    0x2066, 0x2067, 0x2068, 0x2069,
    0x061C,
    # Deprecated format chars
    0x206A, 0x206B, 0x206C, 0x206D, 0x206E, 0x206F,
    # Soft hyphen
    0x00AD,
    # Combining grapheme joiner
    0x034F,
    # Not listed here: tag characters (U+E0001, U+E0020–U+E007F), used in
    # emoji subdivision-flag sequences, and variation selectors
    # (U+FE00–U+FE0F, U+E0100–U+E01EF), which detect_invisible checks by range
}

def detect_invisible(text):
    findings = []
    for i, ch in enumerate(text):
        cp = ord(ch)
        if cp in INVISIBLE_CODEPOINTS:
            name = unicodedata.name(ch, f"U+{cp:04X}")
            findings.append({
                "position": i,
                "codepoint": f"U+{cp:04X}",
                "name": name,
                "char": repr(ch),
            })
        # Also check for variation selectors
        elif 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF:
            findings.append({
                "position": i,
                "codepoint": f"U+{cp:04X}",
                "name": unicodedata.name(ch, "VARIATION SELECTOR"),
                "char": repr(ch),
            })
    return findings

# Usage
text = "Hello\u200B World\u200E!"
results = detect_invisible(text)
for r in results:
    print(f"  Position {r['position']}: {r['name']} ({r['codepoint']})")
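
Detection output is easier to act on when the characters are also made visible in place. The sketch below is one way to do that: it replaces each invisible character with a tag such as <U+200B>. To stay self-contained it uses general category Cf plus the variation selector ranges and CGJ as a rough stand-in for the full code point set, which is an assumption of this example; Cf also includes a few visible format characters.

```python
import unicodedata

def visualize_invisible(text: str) -> str:
    out = []
    for ch in text:
        cp = ord(ch)
        is_vs = 0xFE00 <= cp <= 0xFE0F or 0xE0100 <= cp <= 0xE01EF
        # Cf covers zero-width and BiDi controls; U+034F (CGJ) is category Mn
        if unicodedata.category(ch) == "Cf" or is_vs or cp == 0x034F:
            out.append(f"<U+{cp:04X}>")
        else:
            out.append(ch)
    return "".join(out)

print(visualize_invisible("Hello\u200B World\u200E!"))
# Hello<U+200B> World<U+200E>!
```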

Context-Aware Removal

def remove_invisible(text, preserve_zwj_emoji=True, preserve_zwnj_persian=False):
    result = []
    for i, ch in enumerate(text):
        cp = ord(ch)
        # Always remove BiDi embeddings, overrides, and isolates; rarely legitimate in user input
        if cp in {0x202A, 0x202B, 0x202C, 0x202D, 0x202E,
                  0x2066, 0x2067, 0x2068, 0x2069}:
            continue
        # Preserve ZWJ in emoji contexts if requested
        if cp == 0x200D and preserve_zwj_emoji:
            if i > 0 and i < len(text) - 1:
                prev_cp = ord(text[i - 1])
                next_cp = ord(text[i + 1])
                # Rough heuristic: keep ZWJ when either neighbor is above
                # U+1F000, where most emoji live
                if prev_cp > 0x1F000 or next_cp > 0x1F000:
                    result.append(ch)
                    continue
        # Preserve ZWNJ in Persian/Arabic if requested
        if cp == 0x200C and preserve_zwnj_persian:
            result.append(ch)
            continue
        # Remove all other invisible characters
        if cp in INVISIBLE_CODEPOINTS:
            continue
        result.append(ch)
    return "".join(result)

JavaScript Detection

// Regex matching common invisible characters
const invisiblePattern = /[\u200B\u200C\u200D\u200E\u200F\u202A-\u202E\u2060-\u2064\u2066-\u2069\uFEFF\u00AD\u034F\u061C]/g;

function detectInvisible(text) {
  const matches = [];
  let match;
  const regex = new RegExp(invisiblePattern.source, "g");
  while ((match = regex.exec(text)) !== null) {
    matches.push({
      position: match.index,
      codepoint: `U+${match[0].codePointAt(0).toString(16).toUpperCase().padStart(4, "0")}`,
    });
  }
  return matches;
}

function removeInvisible(text) {
  return text.replace(invisiblePattern, "");
}

Best Practices

| Context | Recommendation |
| --- | --- |
| Usernames | Strip all invisible characters; normalize to NFKC |
| Passwords | Allow ZWNJ/ZWJ (multilingual input); normalize before hashing |
| Search queries | Strip invisible chars; normalize to NFC |
| Source code | Reject BiDi overrides; warn on any invisible characters |
| Rich text | Preserve ZWNJ (Persian/Indic), ZWJ (emoji); strip overrides |
| Domain validation | Apply full IDN/UTS #39 checks |
| Keyword filters | Strip invisible characters before matching |

Key Takeaways

  • Unicode contains 30+ invisible characters including zero-width spaces, BiDi controls, invisible math operators, and format characters — each with legitimate uses but significant abuse potential.
  • ZWNJ is essential for Persian and Indic scripts, and ZWJ is essential for emoji sequences — naive stripping of all invisible characters breaks legitimate text.
  • Invisible characters enable filter bypass (splitting blocked words), username impersonation (visually identical but distinct strings), text watermarking (tracking copy-paste), and source code attacks (BiDi reordering).
  • Detection should use a comprehensive set of known invisible code points, and removal should be context-aware (preserving ZWJ in emoji, ZWNJ in Persian).
  • BiDi embedding, override, and isolate controls (U+202A–U+202E, U+2066–U+2069) should be stripped from almost all user input; they are rarely legitimate outside of specialized text editors.
  • Always normalize text (NFC or NFKC) after removing invisible characters to collapse any remaining variations.
