Zero-Width Characters: What They Are and Why They Matter
Zero-width characters are invisible Unicode code points that affect line breaking, joining, and text direction without occupying any visible space. This guide explains the most important zero-width characters, their legitimate uses, and how they are abused for text fingerprinting, identifier spoofing, and other attacks.
Some Unicode characters occupy no visible space. They render as nothing. You cannot see them,
cannot select them by clicking around them in most text editors, and will not notice them
unless you compare a string's length against what appears on screen. Yet they are present, silently
influencing text rendering, word breaking, ligature formation, and, increasingly, user tracking and
security exploits. These are zero-width characters, and every developer working with user-supplied
text needs to understand them.
The Zero-Width Characters
Unicode defines a family of characters that produce no visible glyph. Here are the most important ones:
| Character | Code Point | Name | Primary Purpose |
|---|---|---|---|
| ZWSP | U+200B | ZERO WIDTH SPACE | Suggest line-break opportunity without visible space |
| ZWJ | U+200D | ZERO WIDTH JOINER | Join adjacent characters (e.g., emoji sequences) |
| ZWNJ | U+200C | ZERO WIDTH NON-JOINER | Prevent joining (e.g., in Arabic or Persian) |
| WJ | U+2060 | WORD JOINER | Prevent line break (ZWSP's opposite) |
| ZWNBSP | U+FEFF | ZERO WIDTH NO-BREAK SPACE | Byte order mark (BOM); its older use to prevent line breaks is deprecated in favor of WJ |
| SHY | U+00AD | SOFT HYPHEN | Suggests hyphenation point; invisible if no break |
| LRM / RLM | U+200E / U+200F | LEFT-TO-RIGHT MARK / RIGHT-TO-LEFT MARK | Set the direction of adjacent neutral characters in bidirectional text |
In Python:
zwsp = "\u200b" # ZERO WIDTH SPACE
zwj = "\u200d" # ZERO WIDTH JOINER
zwnj = "\u200c" # ZERO WIDTH NON-JOINER
wj = "\u2060" # WORD JOINER
feff = "\ufeff" # ZERO WIDTH NO-BREAK SPACE (BOM)
# All have length 1 but render as nothing
for c in [zwsp, zwj, zwnj, wj, feff]:
    print(f"U+{ord(c):04X}: len={len(c)}, repr={repr(c)}")
Legitimate Uses
Zero-width characters have genuine, important uses in text typography and script rendering. Understanding the legitimate uses prevents over-aggressive filtering.
ZERO WIDTH SPACE (U+200B) — Line Break Hints
Some languages write words without spaces (Thai, Japanese, Chinese). ZWSP can be inserted at valid word boundaries to hint to the browser's line-breaking algorithm where it may wrap:
<!-- Thai text that can line-break at ZWSP positions -->
<p>เดินทางไป​กรุงเทพ​มหานคร</p>
This is semantically clean: ZWSP means "this is a valid place to break the line" without adding a visible space.
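As a rough illustration, text that has already been split into words can be joined with ZWSP so that it looks unchanged but carries break opportunities. The segmentation is assumed to come from elsewhere (for example a dictionary-based tokenizer), and the helper name `join_with_zwsp` is hypothetical.

```python
ZWSP = "\u200b"  # ZERO WIDTH SPACE

def join_with_zwsp(words: list[str]) -> str:
    """Join pre-segmented words with ZWSP so a renderer may wrap between them."""
    return ZWSP.join(words)

# Hypothetical segmentation of the Thai phrase from the HTML example
words = ["เดินทางไป", "กรุงเทพ", "มหานคร"]
hinted = join_with_zwsp(words)

print(hinted.count(ZWSP))                          # 2
print(hinted.replace(ZWSP, "") == "".join(words))  # True
```

Removing the ZWSP characters recovers the original text exactly, which is why the hint is safe to add and safe to strip.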
ZERO WIDTH JOINER (U+200D) — Emoji Sequences
ZWJ (U+200D) is the backbone of complex emoji. Modern emoji like family emoji, profession emoji, and skin-tone combinations are built from sequences of simpler emoji joined by ZWJ:
# Family: man + ZWJ + woman + ZWJ + girl + ZWJ + boy
family = "\U0001F468\u200d\U0001F469\u200d\U0001F467\u200d\U0001F466"
print(family) # 👨👩👧👦 (single rendered glyph on supporting platforms)
print(len(family)) # 7 code points (4 emoji + 3 ZWJ)
# Profession emoji: woman + ZWJ + laptop
developer = "\U0001F469\u200d\U0001F4BB"
print(developer) # 👩💻
Without ZWJ awareness, processing emoji sequences one code point at a time will split joined family emoji into their component parts — a common source of emoji-handling bugs.
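The sketch below shows what naive per-code-point processing sees in a ZWJ sequence, using only the standard `unicodedata` module. It is an illustration of the pitfall, not a grapheme segmenter.

```python
import unicodedata

ZWJ = "\u200d"  # ZERO WIDTH JOINER
family = "\U0001F468\u200d\U0001F469\u200d\U0001F467\u200d\U0001F466"

# Iterating code point by code point yields seven separate items
names = [unicodedata.name(c) for c in family]
print(names)
# ['MAN', 'ZERO WIDTH JOINER', 'WOMAN', 'ZERO WIDTH JOINER',
#  'GIRL', 'ZERO WIDTH JOINER', 'BOY']

# Splitting on ZWJ recovers the four component emoji
components = family.split(ZWJ)
print(len(components))  # 4
```

Any operation that slices, reverses, or truncates this string by index can separate the components and change what is displayed.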
ZERO WIDTH NON-JOINER (U+200C) — Script Cursive Breaking
Arabic and Persian are cursive scripts where adjacent letters naturally join. ZWNJ (U+200C) prevents two adjacent characters from joining, so each takes the form it would have if the other were not there.
In Persian, this is grammatically necessary: the suffix "ها" (plural marker) must not join with the preceding noun in certain contexts:
# Without ZWNJ: letters join cursively (may be grammatically wrong)
# With ZWNJ: no cursive connection between the two letters at the boundary
persian_correct = "کتاب\u200cها" # کتاب + ZWNJ + ها = "books" (correct non-joining)
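Because the ZWNJ is part of the spelling, removing it produces a different string. A minimal check, with the Persian word written as escapes so the code point count is unambiguous:

```python
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER

# کتاب + ZWNJ + ها, spelled out as individual code points
with_zwnj = "\u06a9\u062a\u0627\u0628" + ZWNJ + "\u0647\u0627"
without_zwnj = with_zwnj.replace(ZWNJ, "")

print(len(with_zwnj), len(without_zwnj))  # 7 6
print(with_zwnj == without_zwnj)          # False
```

A filter that strips every zero-width character therefore alters Persian text rather than merely cleaning it.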
SOFT HYPHEN (U+00AD) — Hyphenation Hints
SHY (U+00AD) is invisible but tells the browser: "if you need to break this word here, insert a hyphen at this point." It's used for long technical terms in justified text:
<p>This is a very long word: anti­dis­estab­lish­ment</p>
Only the hyphen at the actual break point is rendered; all others remain invisible.
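A small sketch of producing such a word in Python, assuming the syllable boundaries are already known (the helper name `add_hyphenation_hints` is hypothetical):

```python
SHY = "\u00ad"  # SOFT HYPHEN

def add_hyphenation_hints(syllables: list[str]) -> str:
    """Join syllables with soft hyphens to mark permitted break points."""
    return SHY.join(syllables)

word = add_hyphenation_hints(["anti", "dis", "estab", "lish", "ment"])

print(len(word))              # 24: 20 letters plus 4 soft hyphens
print(word.replace(SHY, ""))  # antidisestablishment
```

The soft hyphens count toward the string's length, so search and comparison code usually needs to strip them first.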
Zero-Width Characters as Security Threats
The same invisibility that makes zero-width characters useful for typography makes them dangerous when injected into identifiers, passwords, or tracking payloads.
Text Fingerprinting / Watermarking
Zero-width characters can be used to uniquely encode a bit pattern within text, invisibly tagging each copy distributed to a different recipient. By varying the presence or absence of ZWSP, ZWJ, and ZWNJ at specific positions, an attacker or leaker-tracker can embed a binary identifier:
# Simple example: encode a 3-bit ID as zero-width characters
# 0 = no character, 1 = ZWSP inserted immediately before a space
ZWSP = "\u200b"

def encode_watermark(text: str, bit_id: int, num_bits: int = 3) -> str:
    '''Embed a binary watermark by inserting ZWSP before the first num_bits spaces.'''
    result = []
    space_index = 0
    for c in text:
        if c == " " and space_index < num_bits:
            # Most significant bit maps to the first space
            if (bit_id >> (num_bits - 1 - space_index)) & 1:
                result.append(ZWSP)
            space_index += 1
        result.append(c)
    return "".join(result)

def decode_watermark(text: str, num_bits: int = 3) -> int:
    '''Recover the bit pattern by checking for ZWSP before each of the first spaces.'''
    bits = 0
    space_index = 0
    for i, c in enumerate(text):
        if c == " " and space_index < num_bits:
            if i > 0 and text[i - 1] == ZWSP:
                bits |= 1 << (num_bits - 1 - space_index)
            space_index += 1
    return bits

marked = encode_watermark("the quick brown fox", 0b101)
print(decode_watermark(marked))  # 5
This technique has been used to identify which internal source leaked a confidential document. With 16 insertion positions, each carrying one bit, you can encode 2^16 = 65,536 unique IDs, enough to tag every employee in a large organization.
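The capacity arithmetic can be checked directly. The sketch below assumes one bit per insertion position, as in the scheme described here; `bits_needed` is a hypothetical helper name.

```python
def bits_needed(num_recipients: int) -> int:
    """Smallest number of one-bit positions giving each recipient a unique ID."""
    if num_recipients < 1:
        raise ValueError("need at least one recipient")
    # IDs run from 0 to num_recipients - 1; bit_length() sizes the largest one
    return (num_recipients - 1).bit_length()

print(2 ** 16)              # 65536
print(bits_needed(65_536))  # 16
print(bits_needed(1_000))   # 10
```

Even a short paragraph has enough spaces to carry an identifier for a sizeable distribution list.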
Password Confusion
Suppose a user sets a password by pasting text that contains an invisible ZWSP. The zero-width character is stored as part of the password. When the user later types the password by hand, it looks correct on screen but fails authentication. This is a particularly insidious support burden.
# Password stored: "secret\u200bpass" (11 code points)
# User types: "secretpass" (10 code points)
# They look identical → support ticket
stored = "secret\u200bpass"
typed = "secretpass"
print(stored == typed) # False — because of ZWSP!
print(stored.replace("\u200b", "") == typed) # True
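One possible mitigation, sketched under the assumption that the same normalization runs both when the password is set and when it is checked, is to drop format characters before hashing. The function name `normalize_password` is hypothetical, and the hashing step itself is out of scope here.

```python
import unicodedata

def normalize_password(raw: str) -> str:
    """Drop Unicode format characters (category Cf) before hashing or comparing."""
    return "".join(c for c in raw if unicodedata.category(c) != "Cf")

at_signup = normalize_password("secret\u200bpass")  # pasted, with hidden ZWSP
at_login = normalize_password("secretpass")         # typed by hand

print(at_signup == at_login)  # True
```

Introducing this on an existing system needs care, since passwords already stored with embedded format characters would stop matching.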
Identifier Spoofing
Zero-width characters can be inserted into usernames, variable names, or identifiers to create strings that display identically but compare as different:
real_admin = "admin"
fake_admin = "adm\u200bin" # admin + ZWSP in the middle
print(real_admin) # admin
print(fake_admin) # admin (visually identical)
print(real_admin == fake_admin) # False
This enables account spoofing attacks in applications that display usernames without rendering zero-width characters visibly.
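A common defense is to compare usernames by a canonical key instead of the raw string. The sketch below is one minimal way to do that; the names `identity_key` and `is_available` are hypothetical, and the added case folding is an assumption beyond what the text describes. It does not address look-alike letters from other scripts.

```python
import unicodedata

def identity_key(username: str) -> str:
    """Canonical form for uniqueness checks: no format characters, case-folded."""
    stripped = "".join(c for c in username if unicodedata.category(c) != "Cf")
    return stripped.casefold()

# Keys of usernames that are already registered
taken = {identity_key("admin")}

def is_available(username: str) -> bool:
    return identity_key(username) not in taken

print(is_available("adm\u200bin"))  # False: collides with "admin"
print(is_available("alice"))        # True
```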
Code Injection via Zero-Width Characters
In some template engines and markdown parsers, zero-width characters inside strings can break escaping logic or syntax parsing, since the parser doesn't expect invisible characters inside what appears to be a clean identifier.
Detection
Finding Zero-Width Characters
import unicodedata

ZERO_WIDTH_CHARS = {
    "\u200b",  # ZERO WIDTH SPACE
    "\u200c",  # ZERO WIDTH NON-JOINER
    "\u200d",  # ZERO WIDTH JOINER
    "\u200e",  # LEFT-TO-RIGHT MARK
    "\u200f",  # RIGHT-TO-LEFT MARK
    "\u2060",  # WORD JOINER
    "\u2061",  # FUNCTION APPLICATION
    "\u2062",  # INVISIBLE TIMES
    "\u2063",  # INVISIBLE SEPARATOR
    "\u2064",  # INVISIBLE PLUS
    "\ufeff",  # ZERO WIDTH NO-BREAK SPACE (BOM)
    "\u00ad",  # SOFT HYPHEN
    "\u180e",  # MONGOLIAN VOWEL SEPARATOR
}

def find_zero_width(text: str) -> list[tuple[int, str, str]]:
    '''Return list of (index, char, name) for all zero-width characters found.'''
    results = []
    for i, c in enumerate(text):
        if c in ZERO_WIDTH_CHARS:
            # Fall back to the U+XXXX form if the character has no name
            name = unicodedata.name(c, f"U+{ord(c):04X}")
            results.append((i, c, name))
    return results

suspicious = "Hello\u200bWorld"
hits = find_zero_width(suspicious)
for idx, char, name in hits:
    print(f" Position {idx}: {name} (U+{ord(char):04X})")
# Position 5: ZERO WIDTH SPACE (U+200B)
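Once a hit is found, it often helps to make it visible in logs or an admin view. The following is a minimal sketch with a hypothetical helper name, `reveal_format_chars`, that replaces each format character with a readable placeholder:

```python
import unicodedata

def reveal_format_chars(text: str) -> str:
    """Replace each format character (category Cf) with a visible <U+XXXX> tag."""
    return "".join(
        f"<U+{ord(c):04X}>" if unicodedata.category(c) == "Cf" else c
        for c in text
    )

print(reveal_format_chars("Hello\u200bWorld"))  # Hello<U+200B>World
print(reveal_format_chars("adm\u200din"))       # adm<U+200D>in
```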
Using Unicode Category
The unicodedata category Cf ("Format character") covers most zero-width and invisible
characters. Filtering on this category catches a broad class:
import unicodedata
def strip_format_chars(text: str) -> str:
    '''Remove all Unicode format characters (category Cf).'''
    return "".join(c for c in text if unicodedata.category(c) != "Cf")

def has_format_chars(text: str) -> bool:
    '''Return True if the text contains any format character (category Cf).'''
    return any(unicodedata.category(c) == "Cf" for c in text)
print(has_format_chars("Hello\u200bWorld")) # True
print(strip_format_chars("Hello\u200bWorld")) # HelloWorld
Note: This also strips ZWJ from emoji sequences and ZWNJ from Persian and Arabic text, which can break complex emoji display and change spelling. Apply it only where the context calls for it.
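One way to soften that trade-off is to keep a ZWJ only when it sits between emoji-like characters. Python's standard library does not expose the Unicode Emoji properties, so the sketch below uses a rough heuristic that is purely an assumption: general category So ("Symbol, other") or a skin-tone modifier counts as emoji-like. It will misjudge some sequences and is not a substitute for real emoji data.

```python
import unicodedata

ZWJ = "\u200d"   # ZERO WIDTH JOINER
VS16 = "\ufe0f"  # VARIATION SELECTOR-16, may sit between an emoji and a ZWJ

def _emoji_like(c: str) -> bool:
    # Heuristic only: "Symbol, other" or an emoji skin-tone modifier
    return unicodedata.category(c) == "So" or 0x1F3FB <= ord(c) <= 0x1F3FF

def strip_format_keep_emoji_zwj(text: str) -> str:
    """Remove Cf characters, except a ZWJ flanked by emoji-like characters."""
    out = []
    for i, c in enumerate(text):
        if unicodedata.category(c) != "Cf":
            out.append(c)
            continue
        if c == ZWJ and 0 < i < len(text) - 1:
            prev = text[i - 1]
            if prev == VS16 and i >= 2:
                prev = text[i - 2]  # look past the variation selector
            if _emoji_like(prev) and _emoji_like(text[i + 1]):
                out.append(c)
    return "".join(out)

developer = "\U0001F469\u200d\U0001F4BB"  # woman + ZWJ + laptop
print(strip_format_keep_emoji_zwj(developer) == developer)  # True
print(strip_format_keep_emoji_zwj("adm\u200din"))           # admin
```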
JavaScript Detection
// Regexes matching common zero-width characters.
// The detection regex is deliberately non-global: test() on a /g regex
// advances lastIndex, so repeated calls can return inconsistent results.
const ZERO_WIDTH_ANY_RE = /[\u200b-\u200f\u2060-\u2064\ufeff\u00ad\u180e]/;
const ZERO_WIDTH_RE = /[\u200b-\u200f\u2060-\u2064\ufeff\u00ad\u180e]/g;
function hasZeroWidth(str) {
  return ZERO_WIDTH_ANY_RE.test(str);
}
function stripZeroWidth(str) {
  return str.replace(ZERO_WIDTH_RE, "");
}
// Counting grapheme clusters so that a ZWJ sequence counts once
function countGraphemeClusters(str) {
  // Intl.Segmenter uses proper grapheme cluster boundaries (including ZWJ sequences)
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return [...segmenter.segment(str)].length;
}
const family = "\u{1F468}\u200d\u{1F469}\u200d\u{1F467}\u200d\u{1F466}";
console.log(family.length); // 11 (UTF-16 code units: 4 surrogate pairs + 3 ZWJ)
console.log([...family].length); // 7 (code points)
console.log(countGraphemeClusters(family)); // 1 (one rendered glyph)
Sanitization Strategy by Context
| Context | Strategy |
|---|---|
| Usernames / identifiers | Strip all format characters (category Cf) |
| Passwords | Strip all format characters (or reject non-ASCII) |
| Plain text content | Strip ZWSP, WJ, LRM, RLM; preserve ZWJ (emoji), ZWNJ (Arabic/Persian) |
| Source code identifiers | Reject any non-ASCII |
| HTML content (user generated) | Strip all zero-width except ZWJ in emoji context |
| Document watermarking output | Preserve intentional ZWSP encoding |
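The first three rows of the table can be sketched as a small policy lookup. The context names and the `sanitize` function are hypothetical, and the remaining rows are left out because they need more than a per-character filter.

```python
import unicodedata

def _is_format(c: str) -> bool:
    return unicodedata.category(c) == "Cf"

# ZWSP, WJ, LRM, RLM: stripped from plain text; ZWJ and ZWNJ are kept
PLAIN_TEXT_STRIP = {"\u200b", "\u2060", "\u200e", "\u200f"}

def _plain_text_strip(c: str) -> bool:
    return c in PLAIN_TEXT_STRIP

# Each context maps to a predicate that says whether to drop a character
POLICIES = {
    "identifier": _is_format,
    "password": _is_format,
    "plain_text": _plain_text_strip,
}

def sanitize(text: str, context: str) -> str:
    """Apply the stripping policy for the given context."""
    should_strip = POLICIES[context]
    return "".join(c for c in text if not should_strip(c))

print(sanitize("adm\u200bin", "identifier"))              # admin
print(sanitize("a\u200bb\u200dc", "plain_text") == "ab\u200dc")  # True
```

Keeping the policy in one table makes it harder for two code paths to sanitize the same field differently.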
Key Takeaways
- Zero-width characters are Unicode code points that produce no visible glyph but are present in the string and affect behavior.
- Legitimate uses: ZWSP for line-break hints (Thai/CJK), ZWJ for emoji sequences (👨👩👧👦), ZWNJ for cursive script control (Arabic/Persian), SHY for hyphenation hints.
- Security risks: text fingerprinting/watermarking, password confusion, identifier spoofing, and parsing exploits.
- Detection in Python: check unicodedata.category(c) == "Cf" for format characters, or maintain an explicit set of known zero-width code points.
- Detection in JavaScript: use the regex /[\u200b-\u200f\u2060-\u2064\ufeff\u00ad\u180e]/g and Intl.Segmenter for correct grapheme cluster counting.
- Sanitization: for identifiers and passwords, strip all format characters. For rich text, preserve ZWJ (needed for complex emoji) and ZWNJ (needed for some scripts) but strip the rest.
- Neither language counts grapheme clusters: Python's len() counts code points and JavaScript's .length counts UTF-16 code units, so a string containing only ZWJ characters has a non-zero length despite appearing empty.