Whitespace and Invisible Characters Guide
Unicode defines dozens of invisible characters beyond the ordinary space, including zero-width spaces, word joiners, soft hyphens, and various format control characters that affect text layout without appearing on screen. This guide catalogs all significant invisible Unicode characters, explains their legitimate uses, and shows how to detect and remove unwanted ones.
Unicode contains far more space characters than the single spacebar key suggests. Beyond the ordinary space (U+0020), the standard defines over 25 distinct whitespace and invisible characters — each with specific typographic width, line-breaking behavior, and use case. Knowing which space to use, and how to detect unwanted invisible characters, is essential for both typographers and developers.
Quick Copy-Paste Table
| Name | Code Point | Width | Breaks? |
|---|---|---|---|
| Space | U+0020 | Variable (word space) | Yes |
| No-Break Space (NBSP) | U+00A0 | Variable (word space) | No |
| En Space | U+2002 | 0.5 em | Yes |
| Em Space | U+2003 | 1 em | Yes |
| Three-Per-Em Space | U+2004 | 1/3 em | Yes |
| Four-Per-Em Space | U+2005 | 1/4 em | Yes |
| Six-Per-Em Space | U+2006 | 1/6 em | Yes |
| Figure Space | U+2007 | Same as digit | No |
| Punctuation Space | U+2008 | Same as period | Yes |
| Thin Space | U+2009 | 1/5 em (approx) | Yes |
| Hair Space | U+200A | Thinner than thin | Yes |
| Zero Width Space | U+200B | Zero | Yes (hint) |
| Zero Width Joiner | U+200D | Zero | No |
| Zero Width Non-Joiner | U+200C | Zero | No |
| Word Joiner | U+2060 | Zero | No |
| Narrow No-Break Space (NNBSP) | U+202F | Narrow | No |
| Left-to-Right Mark | U+200E | Zero | No |
| Right-to-Left Mark | U+200F | Zero | No |
| Ideographic Space | U+3000 | Full em (CJK) | Yes |
| Horizontal Tab | U+0009 | Variable | Yes |
Ordinary Space: U+0020
The standard space (U+0020) is the ASCII space character — the one produced by the spacebar. Its width is defined by the font's word-space metric and expands slightly in justified text. It is a legal line-break opportunity: renderers may break a line at any standard space.
Almost all whitespace normalization in programming treats U+0020 as the canonical space. When text is stripped or normalized, this is what you're collapsing to.
Non-Breaking Space: U+00A0
The non-breaking space (NBSP, U+00A0) has the same visual width as an ordinary space but prevents a line break at that position. It is the most common "special" space in everyday writing.
When to use NBSP
- Titles and abbreviations: Mr. + NBSP + Smith (prevents "Mr." being stranded at a line end)
- Numbers with units: 100 + NBSP + km (keeps the number with its unit)
- Dates: January + NBSP + 14 (prevents the day being separated from the month)
- Currency: $ + NBSP + 4.99 (keeps the symbol with the number)
- French typography: Required before :, ;, !, ?, and », and after «
<!-- HTML -->
Mr.&nbsp;Smith
100&nbsp;km
# Python
NBSP = "\u00A0"
title = f"Mr.{NBSP}Smith"
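The rules above can be partly automated. The following is a minimal sketch, assuming a small hand-picked list of titles and units; `bind_with_nbsp`, `TITLES`, and `UNITS` are hypothetical names introduced here for illustration, and real text would need broader lists and language-specific rules.

```python
import re

NBSP = "\u00A0"

# Hypothetical, deliberately short lists; extend them for real content.
TITLES = ("Mr.", "Mrs.", "Ms.", "Dr.")
UNITS = ("km", "kg", "cm", "mm", "m", "g")


def bind_with_nbsp(text: str) -> str:
    """Replace the ordinary space after a title, or between a number and a unit, with NBSP."""
    # Title followed by a space: "Mr. Smith" becomes "Mr.<NBSP>Smith"
    title_pattern = "|".join(re.escape(t) for t in TITLES)
    text = re.sub(rf"\b({title_pattern}) ", rf"\1{NBSP}", text)
    # Digit, space, known unit: "100 km" becomes "100<NBSP>km"
    unit_pattern = "|".join(UNITS)
    text = re.sub(rf"(\d) ({unit_pattern})\b", rf"\1{NBSP}\2", text)
    return text


fixed = bind_with_nbsp("Mr. Smith ran 100 km")
```

Words that merely start with a unit letter, such as "100 miles", are left alone because the pattern requires a word boundary after the unit.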
Fixed-Width Spaces
The following spaces have widths defined relative to the font's em square, making them useful for precise typographic alignment that ordinary word spaces (which expand in justification) cannot provide.
Em Space: U+2003 (1 em wide)
One em equals the point size of the current font — in 12pt text, an em space is 12pt wide. It is used for paragraph indentation in some typographic traditions (East Asian typography uses the ideographic space U+3000 instead).
En Space: U+2002 (0.5 em wide)
Exactly half an em. Traditionally the width of the letter "n". Can be used to indent the second line of a list item or to separate elements when an em space is too wide.
Three-Per-Em Space: U+2004 (⅓ em)
Divides the em into three equal parts. Used in mathematical typesetting when thin but non-trivial separation is needed between elements.
Four-Per-Em Space: U+2005 (¼ em)
Also called mid space. Quarter-em width.
Six-Per-Em Space: U+2006 (⅙ em)
Also called small space or sixth-em space. Often used between a number and a following percent sign in some European typographic traditions.
Figure Space: U+2007 (digit width, non-breaking)
The figure space has the same width as a digit (0–9) in the current font. This is useful in fonts with tabular figures, where every digit shares one width. Figure spaces allow numbers with varying digit counts to align in tables and lists (each leading ␣ below stands for one figure space):
␣1 apple
12 oranges
␣3 bananas
Crucially, U+2007 is non-breaking — it will not cause a line break, which is correct behavior when it replaces a missing digit in an aligned column.
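As a rough illustration of that alignment, the sketch below right-aligns counts by padding with figure spaces. The helper name `pad_number` and the sample rows are assumptions made for this example, and the columns only line up visually in fonts whose digits all share one width.

```python
FIGURE_SPACE = "\u2007"


def pad_number(n: int, width: int) -> str:
    """Right-align n in a column `width` digits wide, padding with figure spaces."""
    return str(n).rjust(width, FIGURE_SPACE)


rows = [(1, "apple"), (12, "oranges"), (3, "bananas")]
# The widest count decides the column width
width = max(len(str(count)) for count, _ in rows)
lines = [f"{pad_number(count, width)} {name}" for count, name in rows]
```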
Punctuation Space: U+2008 (period width)
The width of a period or comma. Used in typesetting to replace punctuation for alignment purposes.
Thin and Hair Spaces
Thin Space: U+2009 (~1/5 em)
The thin space is the most commonly needed fixed-width space in typographic work. It is used in:
- Thousands separators in French and SI notation: 1 234 567
- Between a number and its unit per SI rules: 37 °C, 100 km
- Around ellipses in some styles: word … word
- Between quotation marks and content in French typography: « content »
- In mathematical typesetting around operators and punctuation
1&thinsp;234&thinsp;567 <!-- French thousands separator -->
37&thinsp;°C <!-- SI unit spacing -->
Hair Space: U+200A (very thin)
The hair space is thinner than the thin space — the thinnest space defined in Unicode. It is used:
- Between a number and a following percent sign: 42 %
- In some mathematical spacing conventions
- In fine typography to add just a hint of air between adjacent characters
Both thin space and hair space permit line breaks, which is occasionally unwanted. Combine with a word joiner (U+2060) or use U+202F (narrow no-break space) instead if a break would be harmful.
Zero-Width Characters
These characters take up no horizontal space but carry typographic or text-processing semantics.
Zero Width Space: U+200B
The zero-width space (ZWSP) is invisible with no width, but it marks a legal line-break opportunity. It is invaluable in:
- Languages without whitespace word boundaries (Thai, Tibetan, Khmer, CJK) — the renderer can break a long sequence at ZWSP positions
- URLs displayed in prose — insert ZWSP before slashes to enable wrapping without adding visible characters
- Long strings in narrow containers
<!-- Allow wrapping in a long URL without hyphenating -->
<a href="...">https://example.com&#8203;/very&#8203;/long&#8203;/path</a>
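The same idea can be applied programmatically when generating link text. This is a minimal sketch, assuming the result is used only as visible text and never as the link target itself; `make_url_wrappable` is a hypothetical helper name.

```python
ZWSP = "\u200B"


def make_url_wrappable(url: str) -> str:
    """Insert a zero-width space before each slash after the scheme, for display only."""
    scheme, sep, rest = url.partition("://")
    if not sep:
        # No scheme present: process the whole string
        scheme, rest = "", url
    return scheme + sep + rest.replace("/", ZWSP + "/")


display_text = make_url_wrappable("https://example.com/very/long/path")
```

Because the inserted characters are invisible, a reader who copies the displayed text also copies them, so the unmodified URL should stay in the href attribute.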
Zero Width Non-Joiner: U+200C
The ZWNJ separates characters that would otherwise ligate or form a cursive connection. In the Arabic script (used for Arabic and Persian, among other languages) and in Devanagari, some adjacent characters join by default. ZWNJ prevents that joining:
- Persian: Breaking a cursive connection within a word without inserting visible space
- Devanagari: Preventing a conjunct consonant from forming
Zero Width Joiner: U+200D
The ZWJ causes characters that could join (but would not by default) to join. Its most visible use is in emoji sequences:
- 👨‍💻 = Man Technologist = U+1F468 + ZWJ + U+1F4BB
- 👨‍👩‍👧 = Family = U+1F468 + ZWJ + U+1F469 + ZWJ + U+1F467
- 🏳️‍🌈 = Rainbow Flag = U+1F3F3 + U+FE0F + ZWJ + U+1F308
ZWJ sequences are how most complex emoji are built from simpler base characters.
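To make the structure of these sequences concrete, the sketch below assembles them from their parts and lists the resulting code points. The helper names `join_emoji` and `code_points` are assumptions for this example. The rainbow flag is built here with the variation selector U+FE0F after the white flag, which is how the fully qualified sequence is defined.

```python
ZWJ = "\u200D"


def join_emoji(*parts: str) -> str:
    """Join emoji components with zero-width joiners."""
    return ZWJ.join(parts)


def code_points(text: str) -> list[str]:
    """List the code points of a string in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


man_technologist = join_emoji("\U0001F468", "\U0001F4BB")
family = join_emoji("\U0001F468", "\U0001F469", "\U0001F467")
rainbow_flag = join_emoji("\U0001F3F3\uFE0F", "\U0001F308")
```

Whether each sequence renders as a single glyph depends on the font and platform; the underlying string is always several code points long.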
Word Joiner: U+2060
The word joiner is a zero-width, non-breaking, non-printing character that prevents a line break at its position — similar to NBSP but without adding any visible space. It is preferred over the legacy U+FEFF (BOM) for this purpose in modern Unicode text.
Narrow No-Break Space: U+202F
The narrow no-break space (NNBSP) combines the properties of the thin space (narrow width) with the no-break property of NBSP. It is the ideal character for:
- Thousands separators where a break would be confusing: 1 234 567
- SI unit spacing in contexts where wrapping must be prevented: 37 °C
- French typography before :, ;, !, ? (the Imprimerie nationale, or French national printing office, style)
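The thousands-separator case is easy to sketch in code. The example below reuses Python's built-in comma grouping and swaps the comma for a narrow no-break space; `format_thousands` is a hypothetical helper name, and locale-aware formatting libraries may be a better fit for production use.

```python
NNBSP = "\u202F"


def format_thousands(n: int, sep: str = NNBSP) -> str:
    """Group the digits of n in threes, separated by `sep`."""
    # f"{n:,}" produces comma grouping, e.g. 1,234,567
    return f"{n:,}".replace(",", sep)


population = format_thousands(1234567)
```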
Bidirectional Marks
These zero-width characters affect text direction in bidirectional text (mixing left-to-right and right-to-left scripts).
Left-to-Right Mark: U+200E
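An invisible, zero-width character with strong left-to-right directionality. It does not override the direction of other characters; it influences how the Unicode Bidirectional Algorithm resolves adjacent neutral characters such as punctuation.
Right-to-Left Mark: U+200F
The right-to-left counterpart: an invisible, zero-width character with strong right-to-left directionality. Used in Arabic, Hebrew, and other RTL scripts to correctly order punctuation or numbers within RTL text.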
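The directional behavior of these marks comes from their bidirectional class, which can be inspected with Python's standard `unicodedata` module. The sketch below only looks up classes and does not run the bidirectional algorithm itself; the name `bidi_class` is an assumption for this example.

```python
import unicodedata

LRM = "\u200E"
RLM = "\u200F"


def bidi_class(ch: str) -> str:
    """Return the Unicode bidirectional class of a single character."""
    return unicodedata.bidirectional(ch)


classes = {
    "LRM": bidi_class(LRM),          # strong left-to-right
    "RLM": bidi_class(RLM),          # strong right-to-left
    "exclamation": bidi_class("!"),  # neutral, resolved from its neighbors
    "hebrew alef": bidi_class("\u05D0"),
}
```

A neutral character such as "!" takes its direction from surrounding strong characters, which is why placing a mark next to it changes where it appears.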
Ideographic Space: U+3000
The ideographic space is the full-width space used in CJK (Chinese, Japanese, Korean) typography. It is exactly one em wide, matching the width of a CJK ideograph, and is the standard space character in CJK text, which generally does not separate words with spaces. In Japanese, paragraph indentation is typically one ideographic space.
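For developers, the main point is that U+3000 is ordinary Unicode whitespace with a full-width presentation. A short sketch using the standard `unicodedata` module shows its properties; `to_ascii_spaces` is a hypothetical helper for cases where search or tokenizing code expects U+0020.

```python
import unicodedata

IDEOGRAPHIC_SPACE = "\u3000"

info = {
    "category": unicodedata.category(IDEOGRAPHIC_SPACE),
    "east_asian_width": unicodedata.east_asian_width(IDEOGRAPHIC_SPACE),
    "is_whitespace": IDEOGRAPHIC_SPACE.isspace(),
}


def to_ascii_spaces(text: str) -> str:
    """Replace ideographic spaces with ordinary ASCII spaces."""
    return text.replace(IDEOGRAPHIC_SPACE, " ")


# str.split() with no arguments already treats U+3000 as a separator
parts = "東京\u3000大阪".split()
```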
Developer Reference
Detecting and Stripping Invisible Characters
import unicodedata

# Unicode general categories relevant to whitespace and invisible characters:
# Zs = Space separator, Cc = Control, Cf = Format
ZERO_WIDTH_CHARS = {
    "\u200B",  # Zero Width Space
    "\u200C",  # Zero Width Non-Joiner
    "\u200D",  # Zero Width Joiner
    "\u200E",  # Left-to-Right Mark
    "\u200F",  # Right-to-Left Mark
    "\u2060",  # Word Joiner
    "\uFEFF",  # BOM / Zero Width No-Break Space (legacy)
}


def has_invisible_chars(text: str) -> bool:
    """Return True if text contains zero-width or invisible Unicode characters."""
    return any(char in ZERO_WIDTH_CHARS for char in text)


def strip_invisible_chars(text: str) -> str:
    """Remove all zero-width and invisible Unicode formatting characters."""
    return "".join(ch for ch in text if ch not in ZERO_WIDTH_CHARS)


def classify_spaces(text: str) -> list[tuple[str, str, str]]:
    """Return list of (char, code_point, name) for all space-like characters."""
    results = []
    for char in text:
        cat = unicodedata.category(char)
        if cat in ("Zs", "Cc", "Cf") or char in ZERO_WIDTH_CHARS:
            name = unicodedata.name(char, "UNKNOWN")
            cp = f"U+{ord(char):04X}"
            results.append((repr(char), cp, name))
    return results
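A hand-maintained set is easy to audit but can miss characters. As an alternative sketch, the function below removes every character in the Cf (Format) general category, with an optional allow-list. The name `strip_format_chars` and the `keep` parameter are assumptions for this example. Note that Cf also covers ZWJ, ZWNJ, and the soft hyphen, so blanket removal can break emoji sequences and correctly typeset Persian or Indic text.

```python
import unicodedata


def strip_format_chars(text: str, keep: frozenset = frozenset()) -> str:
    """Remove all Cf (format) characters except those listed in `keep`."""
    return "".join(
        ch for ch in text
        if ch in keep or unicodedata.category(ch) != "Cf"
    )


cleaned = strip_format_chars("ad\u200Bmin\uFEFF")

# Preserve ZWJ so emoji sequences survive cleaning
emoji = "\U0001F468\u200D\U0001F4BB"
kept = strip_format_chars(emoji, keep=frozenset({"\u200D"}))
```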
Normalizing Whitespace
import unicodedata

# Collapse any run of Unicode whitespace to a single ASCII space
def normalize_whitespace(text: str) -> str:
    # For str patterns, \s in Python 3's re module already matches Unicode whitespace.
    # This version spells the rule out explicitly: Zs-category characters plus tab, LF, and CR.
    result = []
    i = 0
    while i < len(text):
        char = text[i]
        if unicodedata.category(char) == "Zs" or char in ("\t", "\n", "\r"):
            result.append(" ")
            # Skip consecutive whitespace
            while i + 1 < len(text) and (
                unicodedata.category(text[i + 1]) == "Zs"
                or text[i + 1] in ("\t", "\n", "\r")
            ):
                i += 1
        else:
            result.append(char)
        i += 1
    return "".join(result).strip()
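A shorter variant is possible with a regular expression. For str patterns in Python 3, `\s` matches Unicode whitespace such as NBSP and the em space, but not zero-width format characters such as U+200B, which are not classified as whitespace. The sketch below is an alternative rather than a drop-in replacement: `\s` also matches a few characters the explicit version does not, such as form feed and the line and paragraph separators. The name `normalize_whitespace_re` is an assumption for this example.

```python
import re


def normalize_whitespace_re(text: str) -> str:
    """Collapse runs of Unicode whitespace to one ASCII space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


collapsed = normalize_whitespace_re("a\u00A0\u2003 b\t\nc")
# A zero-width space is a format character, so it is left in place
untouched = normalize_whitespace_re("a\u200Bb")
```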
HTML Reference
<!-- Common named entities -->
&nbsp; <!-- U+00A0 Non-breaking space -->
&ensp; <!-- U+2002 En space -->
&emsp; <!-- U+2003 Em space -->
&thinsp; <!-- U+2009 Thin space -->
&hairsp; <!-- U+200A Hair space (HTML5 only) -->
&ZeroWidthSpace; <!-- U+200B Zero-width space (HTML5) -->
&zwnj; <!-- U+200C Zero-width non-joiner -->
&zwj; <!-- U+200D Zero-width joiner -->
&lrm; <!-- U+200E Left-to-right mark -->
&rlm; <!-- U+200F Right-to-left mark -->
&#x202F; <!-- U+202F Narrow no-break space (no named entity) -->
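One way to check which entity names resolve to which characters is Python's standard `html.unescape`, which uses the HTML5 named character reference table. The mapping below is a sketch for verification purposes; the dictionary name `ENTITIES` is an assumption for this example. The HTML5 name for U+200B is `&ZeroWidthSpace;`, and an undefined name such as `&zwsp;` is left undecoded.

```python
import html

# Entity or numeric reference mapped to the character it should decode to
ENTITIES = {
    "&nbsp;": "\u00A0",
    "&ensp;": "\u2002",
    "&emsp;": "\u2003",
    "&thinsp;": "\u2009",
    "&hairsp;": "\u200A",
    "&ZeroWidthSpace;": "\u200B",
    "&zwnj;": "\u200C",
    "&zwj;": "\u200D",
    "&lrm;": "\u200E",
    "&rlm;": "\u200F",
    "&#x202F;": "\u202F",  # no named entity, so a numeric reference is used
}

# Collect any reference that does not decode to the expected character
mismatches = [ref for ref, char in ENTITIES.items() if html.unescape(ref) != char]

# Unknown names pass through unchanged
unknown = html.unescape("&zwsp;")
```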
Security: Invisible Character Attacks
Invisible and zero-width characters have been exploited in various attacks:
- Trojan Source (2021): Bidirectional control characters (overrides, embeddings, and isolates) injected into source code comments and strings can reorder the displayed text, making code look different to humans than to compilers.
- Invisible-character spoofing (related to homoglyph attacks): Invisible characters can be inserted into identifiers to create visually identical but technically different strings (admin vs. admin with an embedded ZWSP).
- Text fingerprinting / steganography: Services have embedded unique combinations of zero-width characters in exported text to watermark and identify leaks.
def is_suspicious_text(text: str) -> bool:
    """Flag text containing unusual invisible Unicode characters."""
    suspicious = {
        "\u200B",  # ZWSP
        "\u200C",  # ZWNJ
        "\u200D",  # ZWJ
        "\u200E",  # LRM
        "\u200F",  # RLM
        "\u202A",  # Left-to-Right Embedding
        "\u202B",  # Right-to-Left Embedding
        "\u202C",  # Pop Directional Formatting
        "\u202D",  # Left-to-Right Override
        "\u202E",  # Right-to-Left Override <- frequently used in attacks
        "\u2066",  # Left-to-Right Isolate
        "\u2067",  # Right-to-Left Isolate
        "\u2068",  # First Strong Isolate
        "\u2069",  # Pop Directional Isolate
    }
    return bool(set(text) & suspicious)
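A yes-or-no flag is often not enough when reviewing a suspicious string. The sketch below reports the position, code point, and name of every Cf (Format) character, which makes spoofed identifiers easier to inspect. The name `invisible_report` is an assumption for this example, and legitimate text such as emoji sequences will also produce entries.

```python
import unicodedata


def invisible_report(text: str) -> list[tuple[int, str, str]]:
    """Return (index, code_point, name) for every format character in text."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cf"
    ]


# Looks like "admin" on screen but compares unequal
spoofed = "ad\u200Bmin"
report = invisible_report(spoofed)
```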
Space Width Summary Chart
From widest to narrowest:
| Space | Width | Code Point |
|---|---|---|
| Ideographic space | 1 em (full-width) | U+3000 |
| Em space | 1 em | U+2003 |
| En space | ½ em | U+2002 |
| Three-per-em space | ⅓ em | U+2004 |
| Four-per-em (mid) space | ¼ em | U+2005 |
| Thin space | ~⅕ em | U+2009 |
| Narrow no-break space | ~⅕ em (no-break) | U+202F |
| Six-per-em space | ⅙ em | U+2006 |
| Hair space | < thin | U+200A |
| Zero-width space | 0 | U+200B |
| Zero-width joiner | 0 | U+200D |
| Zero-width non-joiner | 0 | U+200C |
| Word joiner | 0 | U+2060 |