
Zero Width Characters: What They Are and Why They Matter

Zero-width characters are invisible Unicode code points that affect text layout, joining, and direction without occupying any visible space. This guide explains the most important zero-width characters, their legitimate uses, and how they are abused for text fingerprinting, identifier spoofing, and parser confusion.

Published 2021-11-22 · Updated 2024-12-16

Some Unicode characters occupy no visible space. They render as nothing. You cannot see them, cannot easily select them in most text editors, and will not notice them unless you compare a string's length against what is displayed. Yet they are present, silently influencing text rendering, word breaking, and ligature formation, and they are increasingly used for user tracking and security exploits. These are zero-width characters, and every developer working with user-supplied text needs to understand them.

The Zero-Width Characters

Unicode defines a family of characters that produce no visible glyph. Here are the most important ones:

Character	Code Point	Name	Primary Purpose
ZWSP	U+200B	ZERO WIDTH SPACE	Suggest line-break opportunity without visible space
ZWJ	U+200D	ZERO WIDTH JOINER	Join adjacent characters (e.g., emoji sequences)
ZWNJ	U+200C	ZERO WIDTH NON-JOINER	Prevent joining (e.g., in Arabic or Persian)
WJ	U+2060	WORD JOINER	Prevent line break (ZWSP's opposite)
ZWNBSP	U+FEFF	ZERO WIDTH NO-BREAK SPACE	Byte order mark (BOM); its legacy no-break use is deprecated in favor of WJ
SHY	U+00AD	SOFT HYPHEN	Suggests hyphenation point; invisible if no break
LRM / RLM	U+200E / U+200F	LEFT-TO-RIGHT MARK / RIGHT-TO-LEFT MARK	Influence bidirectional text direction

In Python:

zwsp  = "\u200b"   # ZERO WIDTH SPACE
zwj   = "\u200d"   # ZERO WIDTH JOINER
zwnj  = "\u200c"   # ZERO WIDTH NON-JOINER
wj    = "\u2060"   # WORD JOINER
feff  = "\ufeff"   # ZERO WIDTH NO-BREAK SPACE (BOM)

# All have length 1 but render as nothing
for c in [zwsp, zwj, zwnj, wj, feff]:
    print(f"U+{ord(c):04X}: len={len(c)}, repr={repr(c)}")

Legitimate Uses

Zero-width characters have genuine, important uses in text typography and script rendering. Understanding the legitimate uses prevents over-aggressive filtering.

ZERO WIDTH SPACE (U+200B) — Line Break Hints

Some languages write words without spaces (Thai, Japanese, Chinese). ZWSP can be inserted at valid word boundaries to hint to the browser's line-breaking algorithm where it may wrap:

<!-- Long Thai phrase that can line-break at ZWSP positions -->
<p>เดินทางไป&#x200B;กรุงเทพ&#x200B;มหานคร</p>

This is semantically clean: ZWSP means "this is a valid place to break the line" without adding a visible space.
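As a minimal sketch of how such hints might be generated, the snippet below joins already-segmented words with ZWSP. The helper name `join_with_break_hints` and the hand-written word list are assumptions for illustration; real Thai segmentation needs a dictionary-based tokenizer, which is not shown here.

```python
ZWSP = "\u200b"  # ZERO WIDTH SPACE

def join_with_break_hints(words: list[str]) -> str:
    """Join pre-segmented words with ZWSP so a renderer may wrap between them."""
    return ZWSP.join(words)

# Hypothetical, hand-made segmentation of the phrase from the HTML example
words = ["เดินทางไป", "กรุงเทพ", "มหานคร"]
hinted = join_with_break_hints(words)

print(hinted.count(ZWSP))                           # 2
print(hinted.replace(ZWSP, "") == "".join(words))   # True
```

The visible text is unchanged; only the two invisible break opportunities are added.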

ZERO WIDTH JOINER (U+200D) — Emoji Sequences

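ZWJ (U+200D) is the backbone of complex emoji. Modern emoji such as family emoji, profession emoji, and multi-person combinations are built from sequences of simpler emoji joined by ZWJ. (Skin-tone variants of a single emoji use a modifier code point placed directly after the base emoji, with no ZWJ.)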

# Family: man + ZWJ + woman + ZWJ + girl + ZWJ + boy
family = "\U0001F468\u200d\U0001F469\u200d\U0001F467\u200d\U0001F466"
print(family)   # 👨‍👩‍👧‍👦 (single rendered glyph on supporting platforms)
print(len(family))  # 7 code points (4 emoji + 3 ZWJ)

# Profession emoji: woman + ZWJ + laptop
developer = "\U0001F469\u200d\U0001F4BB"
print(developer)   # 👩‍💻

Without ZWJ awareness, processing emoji sequences one code point at a time will split joined family emoji into their component parts — a common source of emoji-handling bugs.
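The following is a simplified sketch of ZWJ-aware grouping, not a full grapheme cluster implementation. The function name `split_keeping_zwj_sequences` is hypothetical, and the logic handles ZWJ only; variation selectors, skin-tone modifiers, and combining marks are deliberately ignored (the complete rules are in UAX #29).

```python
ZWJ = "\u200d"  # ZERO WIDTH JOINER

def split_keeping_zwj_sequences(text: str) -> list[str]:
    """Group code points so that anything glued together by ZWJ stays in one chunk."""
    chunks: list[str] = []
    for c in text:
        # Attach to the previous chunk if this is a ZWJ or directly follows one
        if chunks and (c == ZWJ or chunks[-1].endswith(ZWJ)):
            chunks[-1] += c
        else:
            chunks.append(c)
    return chunks

family = "\U0001F468\u200d\U0001F469\u200d\U0001F467\u200d\U0001F466"
text = "a" + family + "b"

print(len(list(text)))                          # 9 code points
print(len(split_keeping_zwj_sequences(text)))   # 3 chunks: "a", the family, "b"
```

Operations such as truncation or reversal can then work on chunks rather than on individual code points, so the family emoji is never cut in half.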

ZERO WIDTH NON-JOINER (U+200C) — Script Cursive Breaking

Arabic and Persian are cursive scripts where adjacent letters naturally join. ZWNJ (U+200C) prevents the cursive connection between two adjacent characters, so each takes the shape it would have if the other were not there.

In Persian, this is orthographically required: the plural suffix "ها" must not join with the preceding noun in certain contexts:

# Without ZWNJ: the last letter of the noun joins to the suffix (wrong for this word)
# With ZWNJ: noun and suffix stay unjoined, with no visible space between them
persian_correct = "کتاب\u200cها"  # کتاب + ZWNJ + ها = "books" (correct non-joining)
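Because ZWNJ carries meaning here, a filter should not remove it blindly. The sketch below keeps ZWNJ only when it sits between two Arabic-script letters and drops it elsewhere. The helper names are hypothetical, and using the Unicode character name as the script test is an assumption made for brevity; Indic scripts, which also use ZWNJ, are not covered.

```python
import unicodedata

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER

def is_arabic_letter(c: str) -> bool:
    """Rough script test based on the Unicode character name."""
    return unicodedata.name(c, "").startswith("ARABIC LETTER")

def strip_stray_zwnj(text: str) -> str:
    """Drop ZWNJ unless both neighbours are Arabic-script letters."""
    out = []
    for i, c in enumerate(text):
        if c == ZWNJ:
            prev_ok = i > 0 and is_arabic_letter(text[i - 1])
            next_ok = i + 1 < len(text) and is_arabic_letter(text[i + 1])
            if not (prev_ok and next_ok):
                continue  # ZWNJ outside an Arabic-script word: remove it
        out.append(c)
    return "".join(out)

# کتاب + ZWNJ + ها, written with escapes so the ZWNJ is explicit
books = "\u06a9\u062a\u0627\u0628" + ZWNJ + "\u0647\u0627"

print(strip_stray_zwnj(books) == books)        # True: the Persian ZWNJ is kept
print(strip_stray_zwnj("adm" + ZWNJ + "in"))   # admin
```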

SOFT HYPHEN (U+00AD) — Hyphenation Hints

SHY (U+00AD) is invisible but tells the browser: "if you need to break this word here, insert a hyphen at this point." It's used for long technical terms in justified text:

<p>This is a very long word: anti&#xAD;dis&#xAD;estab&#xAD;lish&#xAD;ment</p>

Only the hyphen at the actual break point is rendered; all others remain invisible.
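One practical consequence is that soft hyphens stay in the string even though they are invisible, so substring search can miss a word that is plainly on screen. A small sketch, with hypothetical helper names and hand-picked break points:

```python
SHY = "\u00ad"  # SOFT HYPHEN

def add_soft_hyphens(parts: list[str]) -> str:
    """Join word fragments with soft hyphens at the permitted break points."""
    return SHY.join(parts)

def strip_soft_hyphens(text: str) -> str:
    """Remove soft hyphens, e.g. before indexing or comparing text."""
    return text.replace(SHY, "")

word = add_soft_hyphens(["anti", "dis", "estab", "lish", "ment"])

print(len(word))                                  # 24 (20 letters + 4 soft hyphens)
print("establish" in word)                        # False: a SHY sits inside it
print("establish" in strip_soft_hyphens(word))    # True
```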

Zero-Width Characters as Security Threats

The same invisibility that makes zero-width characters useful for typography makes them dangerous when injected into identifiers, passwords, or tracking payloads.

Text Fingerprinting / Watermarking

Zero-width characters can be used to encode a unique bit pattern within text, invisibly tagging each copy distributed to a different recipient. By varying the presence or absence of ZWSP, ZWJ, and ZWNJ at specific positions, an attacker or someone tracing leaks can embed a binary identifier:

ZWSP = "\u200b"

# Simple example: encode a 3-bit ID as zero-width characters
# 0 = no character, 1 = ZWSP inserted just before a space
def encode_watermark(text: str, bit_id: int, num_bits: int = 3) -> str:
    '''Embed a binary watermark by inserting ZWSP before the first num_bits spaces.'''
    result = []
    bit_index = 0
    for c in text:
        if c == " " and bit_index < num_bits:
            # Most significant bit is written first
            if (bit_id >> (num_bits - 1 - bit_index)) & 1:
                result.append(ZWSP)
            bit_index += 1
        result.append(c)
    return "".join(result)

def decode_watermark(text: str, num_bits: int = 3) -> int:
    '''Extract the watermark: a space preceded by ZWSP counts as a 1 bit.'''
    bits = 0
    bit_index = 0
    prev = ""
    for c in text:
        if c == " " and bit_index < num_bits:
            bits = (bits << 1) | int(prev == ZWSP)
            bit_index += 1
        prev = c
    # Pad with zero bits if the text had fewer than num_bits spaces
    return bits << (num_bits - bit_index)

This technique has been used to identify which internal source leaked a confidential document. With 16 bit positions, each either holding a zero-width character or not, you can encode 65,536 (2^16) unique IDs, which is enough to tag every employee in a large organization.
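To make the 16-bit arithmetic concrete, here is a self-contained sketch of a slightly different scheme: instead of presence or absence at fixed positions, it writes every bit as one of two invisible characters. The function names and the choice of ZWNJ for 0 and ZWSP for 1 are assumptions made for this illustration.

```python
from typing import Optional

ZERO = "\u200c"  # ZWNJ stands for bit 0 (arbitrary choice for this sketch)
ONE = "\u200b"   # ZWSP stands for bit 1

def embed_id(text: str, ident: int, num_bits: int = 16) -> str:
    """Hide ident as num_bits invisible characters after the first character."""
    if not 0 <= ident < 2 ** num_bits:
        raise ValueError("ident does not fit in num_bits")
    bits = format(ident, f"0{num_bits}b")
    payload = "".join(ONE if b == "1" else ZERO for b in bits)
    return text[:1] + payload + text[1:]

def extract_id(text: str) -> Optional[int]:
    """Read the hidden bits back; None if no marker characters are present."""
    bits = "".join("1" if c == ONE else "0" for c in text if c in (ZERO, ONE))
    return int(bits, 2) if bits else None

original = "Quarterly results are confidential."
tagged = embed_id(original, 48879)

print(len(tagged) - len(original))   # 16 invisible characters added
print(extract_id(tagged))            # 48879
print(extract_id(original))          # None
```

This sketch assumes the carrier text contains no ZWSP or ZWNJ of its own; a real scheme would have to account for that.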

Password Confusion

Suppose a password was set by pasting text that contained an invisible ZWSP. The stored password now includes that character. When the user later types the password by hand, it looks correct on screen but fails authentication. This is a particularly insidious support burden.

# Password stored: "secret\u200bpass" (11 code points)
# User types:      "secretpass"       (10 code points)
# They look identical → support ticket

stored = "secret\u200bpass"
typed  = "secretpass"

print(stored == typed)   # False — because of ZWSP!
print(stored.replace("\u200b", "") == typed)  # True

Identifier Spoofing

Zero-width characters can be inserted into usernames, variable names, or identifiers to create strings that display identically but compare as different:

real_admin  = "admin"
fake_admin  = "adm\u200bin"   # admin + ZWSP in the middle

print(real_admin)   # admin
print(fake_admin)   # admin (visually identical)
print(real_admin == fake_admin)  # False

This enables account spoofing attacks in applications that display usernames without rendering zero-width characters visibly.
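One possible defense is to compare usernames in a canonical form at registration time. The sketch below is an assumption about how that might look, not a prescribed design: `canonical_username` and `UsernameRegistry` are hypothetical names, and the in-memory set stands in for whatever uniqueness check a real user store provides.

```python
import unicodedata

def canonical_username(name: str) -> str:
    """Strip format characters (category Cf) and case-fold for comparison."""
    visible = "".join(c for c in name if unicodedata.category(c) != "Cf")
    return visible.casefold()

class UsernameRegistry:
    """Toy registry that rejects names colliding after canonicalization."""

    def __init__(self) -> None:
        self._taken: set[str] = set()

    def register(self, name: str) -> bool:
        key = canonical_username(name)
        if not key or key in self._taken:
            return False
        self._taken.add(key)
        return True

registry = UsernameRegistry()
print(registry.register("admin"))          # True
print(registry.register("adm\u200bin"))    # False: collides with "admin"
print(registry.register("\u200b\u200d"))   # False: nothing visible remains
```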

Code Injection via Zero-Width Characters

In some template engines and markdown parsers, zero-width characters inside strings can break escaping logic or syntax parsing, since the parser doesn't expect invisible characters inside what appears to be a clean identifier.
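A small illustration of the parsing side, using only Python's standard library: a tokenizer built on `\w+` sees two tokens where a human reader sees one identifier. This is just one way the mismatch can surface, not a description of any particular template engine.

```python
import re

name = "user\u200bname"   # renders like "username"

print(re.fullmatch(r"\w+", name))   # None: ZWSP is not a word character
print(re.findall(r"\w+", name))     # ['user', 'name']
print(name.isidentifier())          # False
```

Any validation that runs on the displayed form, while the parser runs on the raw string, can disagree in this way.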

Detection

Finding Zero-Width Characters

ZERO_WIDTH_CHARS = {
    "\u200b",  # ZERO WIDTH SPACE
    "\u200c",  # ZERO WIDTH NON-JOINER
    "\u200d",  # ZERO WIDTH JOINER
    "\u200e",  # LEFT-TO-RIGHT MARK
    "\u200f",  # RIGHT-TO-LEFT MARK
    "\u2060",  # WORD JOINER
    "\u2061",  # FUNCTION APPLICATION
    "\u2062",  # INVISIBLE TIMES
    "\u2063",  # INVISIBLE SEPARATOR
    "\u2064",  # INVISIBLE PLUS
    "\ufeff",  # ZERO WIDTH NO-BREAK SPACE (BOM)
    "\u00ad",  # SOFT HYPHEN
    "\u180e",  # MONGOLIAN VOWEL SEPARATOR
}

def find_zero_width(text: str) -> list[tuple[int, str, str]]:
    '''Return list of (index, char, name) for all zero-width characters found.'''
    import unicodedata
    results = []
    for i, c in enumerate(text):
        if c in ZERO_WIDTH_CHARS:
            name = unicodedata.name(c, f"U+{ord(c):04X}")
            results.append((i, c, name))
    return results

suspicious = "Hello\u200bWorld"
hits = find_zero_width(suspicious)
for idx, char, name in hits:
    print(f"  Position {idx}: {name} (U+{ord(char):04X})")
# Position 5: ZERO WIDTH SPACE (U+200B)
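For logs, moderation tools, or admin screens, it can help to make these characters visible instead of only listing them. A minimal sketch, assuming category Cf is an acceptable definition of "invisible" for the purpose; the function name `reveal_format_chars` is hypothetical.

```python
import unicodedata

def reveal_format_chars(text: str) -> str:
    """Replace each format character (category Cf) with a visible <U+XXXX> tag."""
    return "".join(
        f"<U+{ord(c):04X}>" if unicodedata.category(c) == "Cf" else c
        for c in text
    )

print(reveal_format_chars("Hello\u200bWorld"))   # Hello<U+200B>World
print(reveal_format_chars("adm\u200din"))        # adm<U+200D>in
```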

Using Unicode Category

The unicodedata category Cf ("Format character") covers most zero-width and invisible characters. Filtering on this category catches a broad class:

import unicodedata

def strip_format_chars(text: str) -> str:
    '''Remove all Unicode format characters (category Cf).'''
    return "".join(c for c in text if unicodedata.category(c) != "Cf")

def has_format_chars(text: str) -> bool:
    return any(unicodedata.category(c) == "Cf" for c in text)

print(has_format_chars("Hello\u200bWorld"))  # True
print(strip_format_chars("Hello\u200bWorld"))  # HelloWorld

Note: This will also strip ZWJ from emoji sequences and ZWNJ from Persian or Arabic text, which may break complex emoji display and change how words render. Apply carefully based on context.
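A context-aware variant might keep ZWJ only where it plausibly belongs to an emoji sequence. The sketch below uses Unicode general categories as a rough stand-in for "emoji-like", which is an assumption: the standard library has no real emoji property test, so this heuristic will misjudge some sequences. The function name is hypothetical.

```python
import unicodedata

ZWJ = "\u200d"
# Heuristic only: Symbol-other (most pictographs), modifier symbols
# (skin tones), and nonspacing marks (variation selectors).
EMOJI_LIKE = {"So", "Sk", "Mn"}

def strip_format_keep_emoji_zwj(text: str) -> str:
    """Remove category-Cf characters, except ZWJ between emoji-like neighbours."""
    out = []
    for i, c in enumerate(text):
        if unicodedata.category(c) != "Cf":
            out.append(c)
            continue
        if c == ZWJ and 0 < i < len(text) - 1:
            before = unicodedata.category(text[i - 1])
            after = unicodedata.category(text[i + 1])
            if before in EMOJI_LIKE and after == "So":
                out.append(c)  # keep the joiner inside an emoji sequence
    return "".join(out)

developer = "\U0001F469\u200d\U0001F4BB"   # woman + ZWJ + laptop

print(strip_format_keep_emoji_zwj(developer) == developer)   # True
print(strip_format_keep_emoji_zwj("adm\u200din"))            # admin
print(strip_format_keep_emoji_zwj("Hello\u200bWorld"))       # HelloWorld
```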

JavaScript Detection

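// Regex matching common zero-width characters.
// No "g" flag on the test regex: a global regex keeps lastIndex between
// .test() calls, so repeated checks would return inconsistent results.
const ZERO_WIDTH_RE = /[\u200b-\u200f\u2060-\u2064\ufeff\u00ad\u180e]/;
const ZERO_WIDTH_RE_GLOBAL = new RegExp(ZERO_WIDTH_RE.source, "g");

function hasZeroWidth(str) {
    return ZERO_WIDTH_RE.test(str);
}

function stripZeroWidth(str) {
    return str.replace(ZERO_WIDTH_RE_GLOBAL, "");
}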

// Detecting ZWJ in emoji sequences
function countGraphemeClusters(str) {
    // Intl.Segmenter uses proper grapheme cluster boundaries (including ZWJ sequences)
    const segmenter = new Intl.Segmenter();
    return [...segmenter.segment(str)].length;
}

const family = "\u{1F468}\u200d\u{1F469}\u200d\u{1F467}\u200d\u{1F466}";
console.log(family.length);                    // 11 (UTF-16 code units: 4 surrogate pairs + 3 ZWJ)
console.log([...family].length);               // 7 (code points)
console.log(countGraphemeClusters(family));    // 1 (one rendered glyph)

Sanitization Strategy by Context

Context	Strategy
Usernames / identifiers	Strip all format characters (category Cf)
Passwords	Strip all format characters (or reject non-ASCII)
Plain text content	Strip ZWSP, WJ, LRM, RLM; preserve ZWJ (emoji), ZWNJ (Arabic/Persian)
Source code identifiers	Reject any non-ASCII
HTML content (user generated)	Strip all zero-width except ZWJ in emoji context
Document watermarking output	Preserve intentional ZWSP encoding
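Three of the rows above can be sketched as a single dispatch function. The context labels ("identifier", "password", "plain_text", "source_identifier") and the function name `sanitize` are assumptions for illustration; the HTML and watermarking rows are left out because they need more context than a character filter can provide.

```python
import unicodedata

# Characters the table says to drop from plain text: ZWSP, WJ, LRM, RLM
PLAIN_TEXT_STRIP = {"\u200b", "\u2060", "\u200e", "\u200f"}

def sanitize(text: str, context: str) -> str:
    """Apply the per-context strategy; raise on input that must be rejected."""
    if context in ("identifier", "password"):
        # Strip every format character (category Cf)
        return "".join(c for c in text if unicodedata.category(c) != "Cf")
    if context == "plain_text":
        # ZWJ and ZWNJ are not in the strip set, so they survive
        return "".join(c for c in text if c not in PLAIN_TEXT_STRIP)
    if context == "source_identifier":
        if not text.isascii():
            raise ValueError("non-ASCII character in source identifier")
        return text
    raise ValueError(f"unknown context: {context!r}")

print(sanitize("adm\u200bin", "identifier"))                        # admin
print(sanitize("a\u200bb\u200cc", "plain_text") == "ab\u200cc")     # True
```

Whatever strategy is chosen for passwords, it should be applied identically when the password is set and when it is verified.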

Key Takeaways

Zero-width characters are Unicode code points that produce no visible glyph but are present in the string and affect behavior.
Legitimate uses: ZWSP for line-break hints (Thai/CJK), ZWJ for emoji sequences (👨‍👩‍👧‍👦), ZWNJ for cursive script control (Arabic/Persian), SHY for hyphenation hints.
Security risks: text fingerprinting/watermarking, password confusion, identifier spoofing, and parsing exploits.
Detection in Python: check unicodedata.category(c) == "Cf" for format characters, or maintain an explicit set of known zero-width code points.
Detection in JavaScript: use a regex such as /[\u200b-\u200f\u2060-\u2064\ufeff\u00ad\u180e]/ (add the g flag only for replacing, not for .test()) and Intl.Segmenter for correct grapheme cluster counting.
Sanitization: for identifiers and passwords, strip all format characters. For rich text, preserve ZWJ (needed for complex emoji) and ZWNJ (needed for some scripts) but strip the rest.
Length is not what you see: Python's len() counts code points and JavaScript's .length counts UTF-16 code units, and neither counts grapheme clusters. A string containing only ZWJ characters has a non-zero length despite appearing empty.