Whitespace and Invisible Characters Guide

Unicode defines dozens of invisible characters beyond the ordinary space, including zero-width spaces, word joiners, soft hyphens, and various format control characters that affect text layout without appearing on screen. This guide catalogs all significant invisible Unicode characters, explains their legitimate uses, and shows how to detect and remove unwanted ones.

Unicode contains far more space characters than the single spacebar key suggests. Beyond the ordinary space (U+0020), the standard defines over 25 distinct whitespace and invisible characters — each with specific typographic width, line-breaking behavior, and use case. Knowing which space to use, and how to detect unwanted invisible characters, is essential for both typographers and developers.

Quick Copy-Paste Table

Name                           Code Point  Width                  Breaks?
Space                          U+0020      Variable (word space)  Yes
No-Break Space (NBSP)          U+00A0      Variable (word space)  No
En Space                       U+2002      0.5 em                 Yes
Em Space                       U+2003      1 em                   Yes
Three-Per-Em Space             U+2004      1/3 em                 Yes
Four-Per-Em Space              U+2005      1/4 em                 Yes
Six-Per-Em Space               U+2006      1/6 em                 Yes
Figure Space                   U+2007      Same as digit          No
Punctuation Space              U+2008      Same as period         Yes
Thin Space                     U+2009      1/5 em (approx)        Yes
Hair Space                     U+200A      Thinner than thin      Yes
Zero Width Space               U+200B      Zero                   Yes (hint)
Zero Width Non-Joiner          U+200C      Zero                   No
Zero Width Joiner              U+200D      Zero                   No
Word Joiner                    U+2060      Zero                   No
Narrow No-Break Space (NNBSP)  U+202F      Narrow                 No
Left-to-Right Mark             U+200E      Zero                   No
Right-to-Left Mark             U+200F      Zero                   No
Ideographic Space              U+3000      Full em (CJK)          Yes
Horizontal Tab                 U+0009      Variable               Yes

Ordinary Space: U+0020

The standard space (U+0020) is the ASCII space character — the one produced by the spacebar. Its width is defined by the font's word-space metric and expands slightly in justified text. It is a legal line-break opportunity: renderers may break a line at any standard space.

Almost all whitespace normalization in programming treats U+0020 as the canonical space. When text is stripped or normalized, this is what you're collapsing to.


Non-Breaking Space: U+00A0

The non-breaking space (NBSP, U+00A0) has the same visual width as an ordinary space but prevents a line break at that position. It is the most common "special" space in everyday writing.

When to use NBSP

  • Titles and abbreviations: Mr. + NBSP + Smith — prevents Mr. being stranded at line end
  • Numbers with units: 100 + NBSP + km — keeps the number with its unit
  • Dates: January + NBSP + 14 — prevents the day being separated from the month
  • Currency: $ + NBSP + 4.99 — keeps the symbol with the number
  • French typography: Required before : ; ! ? and », and after «

<!-- HTML -->
Mr.&nbsp;Smith
100&nbsp;km

# Python
NBSP = "\u00A0"
title = f"Mr.{NBSP}Smith"
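
The cases listed above can be automated for simple text. The following is a minimal sketch, not a complete typography engine: the function name `bind_with_nbsp`, the unit list, and the honorific list are all assumptions chosen for illustration, and real content will need a longer, locale-aware list.

```python
import re

NBSP = "\u00A0"

# Hypothetical unit list for illustration; extend it for real content.
UNITS = ("km", "kg", "cm", "mm", "m", "g")

# A digit, one ordinary space, then a known unit as a whole word.
_NUMBER_UNIT = re.compile(r"(\d) (?=(?:%s)\b)" % "|".join(UNITS))
# An honorific such as "Mr." followed by one ordinary space.
_HONORIFIC = re.compile(r"\b(Mr|Mrs|Ms|Dr)\. ")


def bind_with_nbsp(text: str) -> str:
    """Replace selected ordinary spaces with U+00A0 so pairs stay on one line."""
    text = _NUMBER_UNIT.sub(lambda m: m.group(1) + NBSP, text)
    return _HONORIFIC.sub(lambda m: m.group(1) + "." + NBSP, text)


print(repr(bind_with_nbsp("Mr. Smith ran 100 km.")))
# 'Mr.\xa0Smith ran 100\xa0km.'
```

The lookahead keeps the unit itself out of the match, so only the space is replaced and words that merely start with a unit letter (such as "meters") are left alone.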

Fixed-Width Spaces

The following spaces have widths defined relative to the font's em square, making them useful for precise typographic alignment that ordinary word spaces (which expand in justification) cannot provide.

Em Space: U+2003 (1 em wide)

One em equals the point size of the current font — in 12pt text, an em space is 12pt wide. It is used for paragraph indentation in some typographic traditions (East Asian typography uses the ideographic space U+3000 instead).

En Space: U+2002 (0.5 em wide)

Exactly half an em. The name comes from the traditional width of the letter "n". It can be used to indent the second line of a list item or to separate elements when an em space is too wide.

Three-Per-Em Space: U+2004 (⅓ em)

Divides the em into three equal parts. Used in mathematical typesetting when thin but non-trivial separation is needed between elements.

Four-Per-Em Space: U+2005 (¼ em)

Also called mid space. Quarter-em width.

Six-Per-Em Space: U+2006 (⅙ em)

Also called small space or sixth-em space. Often used between a number and a following percent sign in some European typographic traditions.
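
Because these widths are simple fractions of the em, the nominal size at any point size is straightforward arithmetic. The sketch below assumes the nominal fractions given in the character names; an actual font may define slightly different advance widths, and the helper name `nominal_width_pt` is invented for this example.

```python
from fractions import Fraction

# Nominal widths as fractions of one em, taken from the character names.
EM_FRACTIONS = {
    "U+2003 em space": Fraction(1),
    "U+2002 en space": Fraction(1, 2),
    "U+2004 three-per-em space": Fraction(1, 3),
    "U+2005 four-per-em space": Fraction(1, 4),
    "U+2006 six-per-em space": Fraction(1, 6),
}


def nominal_width_pt(em_fraction: Fraction, font_size_pt: float) -> float:
    """Nominal width in points; a real font may define a different advance."""
    return float(em_fraction) * font_size_pt


for name, frac in EM_FRACTIONS.items():
    print(f"{name}: {nominal_width_pt(frac, 12):.2f} pt at 12 pt")
```

At 12 pt this prints 12.00, 6.00, 4.00, 3.00, and 2.00 pt respectively.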

Figure Space: U+2007 (digit width, non-breaking)

The figure space has the same width as a digit (0–9) in the current font, which is useful because fonts with tabular figures give every digit the same width. Figure spaces let numbers with different digit counts align in tables and lists:

 1 apple
12 oranges
 3 bananas

Crucially, U+2007 is non-breaking — it will not cause a line break, which is correct behavior when it replaces a missing digit in an aligned column.
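
The alignment shown above can be produced programmatically. This is a small sketch that assumes the output is displayed in a font with tabular figures; `pad_number` is a hypothetical helper, not a standard function.

```python
FIGURE_SPACE = "\u2007"


def pad_number(value: int, width: int) -> str:
    """Right-align an integer by padding on the left with U+2007."""
    return str(value).rjust(width, FIGURE_SPACE)


rows = [(1, "apple"), (12, "oranges"), (3, "bananas")]
width = max(len(str(count)) for count, _ in rows)
for count, item in rows:
    print(f"{pad_number(count, width)} {item}")
```

Unlike padding with U+0020, the padded column keeps its alignment in proportional fonts, because each missing digit is replaced by a space of digit width.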

Punctuation Space: U+2008 (period width)

The width of a period or comma. Used in typesetting to replace punctuation for alignment purposes.


Thin and Hair Spaces

Thin Space: U+2009 (~1/5 em)

The thin space is the most commonly needed fixed-width space in typographic work. It is used in:

  • Thousands separators in French and SI notation: 1 234 567
  • Between a number and its unit per SI rules: 37 °C, 100 km
  • Around ellipses in some styles: word … word
  • Between quotation marks and content in French typography: « content »
  • In mathematical typesetting around operators and punctuation

1&thinsp;234&thinsp;567  <!-- French thousands separator -->
37&thinsp;°C             <!-- SI unit spacing -->

Hair Space: U+200A (very thin)

The hair space is thinner than the thin space and is the narrowest Unicode space that still has visible width. It is used:

  • Between a number and a following percent sign: 42 %
  • In some mathematical spacing conventions
  • In fine typography to add just a hint of air between adjacent characters

Both thin space and hair space permit line breaks, which is occasionally unwanted. Combine with a word joiner (U+2060) or use U+202F (narrow no-break space) instead if a break would be harmful.


Zero-Width Characters

These characters take up no horizontal space but carry typographic or text-processing semantics.

Zero Width Space: U+200B

The zero-width space (ZWSP) is invisible with no width, but it marks a legal line-break opportunity. It is invaluable in:

  • Languages without whitespace word boundaries (Thai, Tibetan, Khmer, CJK) — the renderer can break a long sequence at ZWSP positions
  • URLs displayed in prose — insert ZWSP before slashes to enable wrapping without adding visible characters
  • Long strings in narrow containers

<!-- Allow wrapping in a long URL without hyphenating -->
<a href="...">https://​example​.com​/very​/long​/path</a>
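
Inserting the break hints by hand is error-prone, so the display text can be generated. The sketch below is one possible approach under stated assumptions: `breakable_url_text` is a hypothetical helper, it places U+200B before each slash and dot after the scheme, and its result is meant only for the visible link text, never for the `href` value itself.

```python
ZWSP = "\u200B"


def breakable_url_text(url: str) -> str:
    """Return display text with U+200B before each "/" and "." after the scheme."""
    scheme, sep, rest = url.partition("://")
    if not sep:  # no scheme present; treat the whole string as the body
        scheme, rest = "", url
    body = rest.replace("/", ZWSP + "/").replace(".", ZWSP + ".")
    return scheme + sep + body


display = breakable_url_text("https://example.com/very/long/path")
print(repr(display))
# 'https://example\u200b.com\u200b/very\u200b/long\u200b/path'
```

One trade-off to keep in mind: a reader who copies the visible text also copies the hidden U+200B characters, so the pasted URL may not work until they are stripped.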

Zero Width Non-Joiner: U+200C

The ZWNJ separates characters that would otherwise ligate or form a cursive connection. In the Arabic script (used for Arabic, Persian, and other languages) and in Indic scripts such as Devanagari, some adjacent characters join by default. ZWNJ prevents that joining:

  • Persian: Breaking a cursive connection within a word without inserting visible space
  • Devanagari: Preventing a conjunct consonant from forming

Zero Width Joiner: U+200D

The ZWJ causes characters that could join (but would not by default) to join. Its most visible use is in emoji sequences:

  • 👨‍💻 = Man Technologist = U+1F468 + ZWJ + U+1F4BB
  • 👨‍👩‍👧 = Family = U+1F468 + ZWJ + U+1F469 + ZWJ + U+1F467
  • 🏳️‍🌈 = Rainbow Flag = U+1F3F3 + U+FE0F (variation selector-16) + ZWJ + U+1F308

ZWJ sequences are how most complex emoji are built from simpler base characters.
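
A ZWJ sequence is an ordinary string of code points, which a short sketch makes visible. The variable names below are chosen for illustration only; whether the result displays as a single glyph depends on the font and platform.

```python
ZWJ = "\u200D"

man = "\U0001F468"      # U+1F468 MAN
laptop = "\U0001F4BB"   # U+1F4BB PERSONAL COMPUTER
man_technologist = man + ZWJ + laptop

print(man_technologist)
print(len(man_technologist))  # 3 code points, typically drawn as one glyph
print([f"U+{ord(ch):04X}" for ch in man_technologist])
# ['U+1F468', 'U+200D', 'U+1F4BB']
```

Removing the ZWJ from such a string leaves two separate emoji, so code that strips zero-width characters should treat emoji sequences as a deliberate exception.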

Word Joiner: U+2060

The word joiner is a zero-width, non-breaking, non-printing character that prevents a line break at its position — similar to NBSP but without adding any visible space. It is preferred over the legacy U+FEFF (BOM) for this purpose in modern Unicode text.


Narrow No-Break Space: U+202F

The narrow no-break space (NNBSP) combines the properties of the thin space (narrow width) with the no-break property of NBSP. It is the ideal character for:

  • Thousands separators where a break would be confusing: 1 234 567
  • SI unit spacing in contexts where wrapping must be prevented: 37 °C
  • French typography before :, ;, !, ? — the Imprimerie nationale (French national printing office) style
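
The first two uses can be sketched with ordinary string formatting. Both helper names below are hypothetical, and the sketch assumes integer input for grouping; it does not attempt full locale-aware number formatting.

```python
NNBSP = "\u202F"


def group_thousands(value: int) -> str:
    """Format an integer with U+202F between groups of three digits."""
    return f"{value:,}".replace(",", NNBSP)


def with_unit(value: float, unit: str) -> str:
    """Join a value and its unit with U+202F so they cannot be split."""
    return f"{value:g}{NNBSP}{unit}"


print(repr(group_thousands(1234567)))  # '1\u202f234\u202f567'
print(repr(with_unit(37, "°C")))       # '37\u202f°C'
```

Using the comma format specifier and then swapping the separator avoids hand-written digit grouping logic.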

Bidirectional Marks

These zero-width characters affect text direction in bidirectional text (mixing left-to-right and right-to-left scripts).

Left-to-Right Mark: U+200E

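An invisible, zero-width character that the Unicode bidirectional algorithm treats as a strong left-to-right letter. It does not switch the algorithm into a mode; instead it influences how adjacent neutral characters such as punctuation and spaces are resolved, so they attach to the left-to-right run.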

Right-to-Left Mark: U+200F

The right-to-left counterpart: it behaves like an invisible strong right-to-left letter. It is used in Arabic, Hebrew, and other RTL text so that neighboring punctuation or numbers are ordered correctly.
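
The properties of both marks can be checked in the Unicode Character Database, which Python exposes through the standard `unicodedata` module. This short sketch only inspects the characters; it does not run the bidirectional algorithm.

```python
import unicodedata

LRM = "\u200E"
RLM = "\u200F"

for mark in (LRM, RLM):
    print(
        f"U+{ord(mark):04X}",
        unicodedata.name(mark),
        unicodedata.category(mark),       # general category
        unicodedata.bidirectional(mark),  # bidi class
    )
# U+200E LEFT-TO-RIGHT MARK Cf L
# U+200F RIGHT-TO-LEFT MARK Cf R
```

The bidi classes L and R are the same classes assigned to ordinary strong letters, which is why the marks can anchor neighboring neutral characters to a direction.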


Ideographic Space: U+3000

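The ideographic space is the full-width space used in CJK (Chinese, Japanese, Korean) typography. It is one em wide, matching the square cell of an ideograph. Chinese and Japanese do not normally separate words with spaces, so U+3000 serves for indentation and deliberate spacing rather than as a routine word separator. In Japanese, paragraph indentation is typically one ideographic space.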
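
When CJK input has to be compared or searched, it helps to know how U+3000 is classified and how normalization treats it. The sketch below uses only the standard `unicodedata` module; the sample string is an arbitrary illustration.

```python
import unicodedata

IDEOGRAPHIC_SPACE = "\u3000"

print(unicodedata.east_asian_width(IDEOGRAPHIC_SPACE))  # F (fullwidth)
print(unicodedata.category(IDEOGRAPHIC_SPACE))          # Zs

# NFKC compatibility normalization folds U+3000 to an ordinary U+0020.
folded = unicodedata.normalize("NFKC", "東京\u3000大阪")
print(repr(folded))  # '東京 大阪'
```

NFKC also changes other compatibility characters, such as fullwidth letters and digits, so it suits search keys and comparisons better than text that will be displayed again.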


Developer Reference

Detecting and Stripping Invisible Characters

import unicodedata

# Unicode categories for whitespace
# Zs = Space separator, Cc = Control, Cf = Format

ZERO_WIDTH_CHARS = {
    "\u200B",  # Zero Width Space
    "\u200C",  # Zero Width Non-Joiner
    "\u200D",  # Zero Width Joiner
    "\u200E",  # Left-to-Right Mark
    "\u200F",  # Right-to-Left Mark
    "\u2060",  # Word Joiner
    "\uFEFF",  # BOM / Zero Width No-Break Space (legacy)
}

def has_invisible_chars(text: str) -> bool:
    """Return True if text contains zero-width or invisible Unicode characters."""
    return any(char in ZERO_WIDTH_CHARS for char in text)

def strip_invisible_chars(text: str) -> str:
    """Remove all zero-width and invisible Unicode formatting characters."""
    return "".join(ch for ch in text if ch not in ZERO_WIDTH_CHARS)

def classify_spaces(text: str) -> list[tuple[str, str, str]]:
    """Return list of (char, code_point, name) for all space-like characters."""
    results = []
    for char in text:
        cat = unicodedata.category(char)
        if cat in ("Zs", "Cc", "Cf") or char in ZERO_WIDTH_CHARS:
            name = unicodedata.name(char, "UNKNOWN")
            cp   = f"U+{ord(char):04X}"
            results.append((repr(char), cp, name))
    return results

Normalizing Whitespace

import re
import unicodedata

# Collapse any run of space separators, tabs, and line breaks to a single ASCII space
def normalize_whitespace(text: str) -> str:
    # For str patterns, Python 3's \s already matches Unicode whitespace
    # (the characters for which str.isspace() is true). This explicit loop
    # instead limits the match to category Zs plus tab, newline, and CR.
    result = []
    i = 0
    while i < len(text):
        char = text[i]
        if unicodedata.category(char) == "Zs" or char in ("\t", "\n", "\r"):
            result.append(" ")
            # Skip consecutive whitespace
            while i + 1 < len(text) and (
                unicodedata.category(text[i + 1]) == "Zs"
                or text[i + 1] in ("\t", "\n", "\r")
            ):
                i += 1
        else:
            result.append(char)
        i += 1
    return "".join(result).strip()
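
For many inputs a regular expression is a shorter alternative. The sketch below relies on documented Python 3 behavior: for `str` patterns, `\s` matches Unicode whitespace, including U+00A0 and the U+2000 to U+200A spaces. It does not match zero-width format characters such as U+200B, which have to be handled separately. The function name `collapse_whitespace` is an assumption for this example.

```python
import re


def collapse_whitespace(text: str) -> str:
    """Collapse runs of regex whitespace to one ASCII space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


print(repr(collapse_whitespace(" a\u00A0\u2003b\t\nc ")))  # 'a b c'
print(repr(collapse_whitespace("a\u200Bb")))               # 'a\u200bb'
```

The second call shows the limitation: the zero-width space survives because it belongs to category Cf, not to the whitespace set.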

HTML Reference

<!-- Common named entities -->
&nbsp;     <!-- U+00A0 Non-breaking space -->
&ensp;     <!-- U+2002 En space -->
&emsp;     <!-- U+2003 Em space -->
&thinsp;   <!-- U+2009 Thin space -->
&hairsp;   <!-- U+200A Hair space (HTML5 only) -->
&#8203;    <!-- U+200B Zero-width space (HTML5 named form: &ZeroWidthSpace;) -->
&zwnj;     <!-- U+200C Zero-width non-joiner -->
&zwj;      <!-- U+200D Zero-width joiner -->
&lrm;      <!-- U+200E Left-to-right mark -->
&rlm;      <!-- U+200F Right-to-left mark -->
&#8239;    <!-- U+202F Narrow no-break space (no named entity) -->
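
The mappings in this list can be verified with Python's standard `html` module, which decodes HTML5 named and numeric character references. This is a quick check, not a full HTML parser, and the `ENTITIES` list is just a sample taken from the reference above.

```python
import html

ENTITIES = ["&nbsp;", "&ensp;", "&emsp;", "&thinsp;", "&hairsp;",
            "&zwnj;", "&zwj;", "&lrm;", "&rlm;", "&#8239;"]

for entity in ENTITIES:
    decoded = html.unescape(entity)
    print(f"{entity:10} -> U+{ord(decoded):04X}")
```

Each reference decodes to a single code point, so `ord` can report it directly.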

Security: Invisible Character Attacks

Invisible and zero-width characters have been exploited in various attacks:

  • Trojan Source (2021): Bidirectional control characters (overrides, embeddings, and isolates such as U+202E and U+2066–U+2069) placed in comments and string literals reorder how source code is displayed, so the code a reviewer sees differs from what the compiler processes.
  • Identifier spoofing: Invisible characters inserted into identifiers or usernames create strings that look identical but compare as different ("admin" versus "admin" followed by U+200B). Homoglyph attacks, which swap in look-alike letters from other scripts, are a related technique.
  • Text fingerprinting / steganography: Services have embedded unique combinations of zero-width characters in exported text to watermark and identify leaks.

def is_suspicious_text(text: str) -> bool:
    """Flag text containing unusual invisible Unicode characters."""
    suspicious = {
        "\u200B",  # ZWSP
        "\u200C",  # ZWNJ
        "\u200D",  # ZWJ
        "\u200E",  # LRM
        "\u200F",  # RLM
        "\u202A",  # Left-to-Right Embedding
        "\u202B",  # Right-to-Left Embedding
        "\u202C",  # Pop Directional Formatting
        "\u202D",  # Left-to-Right Override
        "\u202E",  # Right-to-Left Override  <- frequently used in attacks
        "\u2066",  # Left-to-Right Isolate
        "\u2067",  # Right-to-Left Isolate
        "\u2068",  # First Strong Isolate
        "\u2069",  # Pop Directional Isolate
    }
    return bool(set(text) & suspicious)
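
A fixed set can miss characters, so a complementary approach is to report every character in the Unicode "Format" category (Cf) along with its position. The sketch below is one way to do that; `find_format_chars` is a hypothetical helper, and the two sample strings are made up for illustration. Note that Cf also includes legitimate characters such as the soft hyphen (U+00AD), so the results are candidates for review, not proof of an attack.

```python
import unicodedata


def find_format_chars(text: str) -> list[tuple[int, str, str]]:
    """Return (index, code point, name) for every category-Cf character."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cf"
    ]


print(find_format_chars("admin\u200B"))
# [(5, 'U+200B', 'ZERO WIDTH SPACE')]
print(find_format_chars("user\u202Egpj.exe"))
# [(4, 'U+202E', 'RIGHT-TO-LEFT OVERRIDE')]
```

Reporting the index and name makes it easier to show a reviewer exactly where the hidden character sits.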

Space Width Summary Chart

From widest to narrowest:

Space                    Width              Code Point
Ideographic space        1 em (full-width)  U+3000
Em space                 1 em               U+2003
En space                 ½ em               U+2002
Three-per-em space       ⅓ em               U+2004
Four-per-em (mid) space  ¼ em               U+2005
Thin space               ~⅕ em              U+2009
Narrow no-break space    ~⅕ em (no-break)   U+202F
Six-per-em space         ⅙ em               U+2006
Hair space               < thin             U+200A
Zero-width space         0                  U+200B
Zero-width non-joiner    0                  U+200C
Zero-width joiner        0                  U+200D
Word joiner              0                  U+2060
