Unicode Whitespace Characters Guide

Unicode defines over two dozen whitespace characters beyond the ordinary space, including non-breaking spaces, thin spaces, and various-width spaces used in typography. This guide catalogs all Unicode whitespace characters, explains their purposes, and shows how to handle them safely in code.

Published 2021-12-06 · Updated 2024-06-11

When most developers think of whitespace they think of three characters: the space, the tab, and the newline. In reality, Unicode assigns the White_Space property to 25 characters and defines several more space-like characters, each with different widths, line-breaking behaviors, and semantic purposes. Using the wrong whitespace character can break parsers, confuse search engines, create invisible security vulnerabilities, and cause layouts to collapse. This guide catalogs every Unicode whitespace character, explains when each one is appropriate, and shows how to detect and normalize them in code.

The Complete Unicode Whitespace Table

The Unicode property White_Space=Yes identifies characters that function as whitespace in the Unicode standard. Here is every character with that property, plus several space-like characters that behave as visual spaces but are not classified as White_Space:

Characters with White_Space=Yes

Char	Code Point	Name	Width	Breaks Line?
	U+0009	CHARACTER TABULATION (Tab)	Variable	No
	U+000A	LINE FEED (LF)	0	Yes
	U+000B	LINE TABULATION (VT)	0	Yes
	U+000C	FORM FEED (FF)	0	Yes
	U+000D	CARRIAGE RETURN (CR)	0	Yes
	U+0020	SPACE	Normal	No
	U+0085	NEXT LINE (NEL)	0	Yes
\u00A0	U+00A0	NO-BREAK SPACE (NBSP)	Normal	No (non-breaking)
\u1680	U+1680	OGHAM SPACE MARK	Normal	No
\u2000	U+2000	EN QUAD	En-width	No
\u2001	U+2001	EM QUAD	Em-width	No
\u2002	U+2002	EN SPACE	En-width	No
\u2003	U+2003	EM SPACE	Em-width	No
\u2004	U+2004	THREE-PER-EM SPACE	1/3 em	No
\u2005	U+2005	FOUR-PER-EM SPACE	1/4 em	No
\u2006	U+2006	SIX-PER-EM SPACE	1/6 em	No
\u2007	U+2007	FIGURE SPACE	Digit-width	No (non-breaking)
\u2008	U+2008	PUNCTUATION SPACE	Narrow	No
\u2009	U+2009	THIN SPACE	1/5–1/6 em	No
\u200A	U+200A	HAIR SPACE	Thinnest	No
\u2028	U+2028	LINE SEPARATOR	0	Yes
\u2029	U+2029	PARAGRAPH SEPARATOR	0	Yes
\u202F	U+202F	NARROW NO-BREAK SPACE	Narrow	No (non-breaking)
\u205F	U+205F	MEDIUM MATHEMATICAL SPACE	4/18 em	No
\u3000	U+3000	IDEOGRAPHIC SPACE	Full-width	No

Space-Like Characters (White_Space=No)

These characters produce visual space or are zero-width, but Unicode does not classify them as White_Space:

Char	Code Point	Name	Notes
\u200B	U+200B	ZERO WIDTH SPACE (ZWSP)	Line break opportunity
\u200C	U+200C	ZERO WIDTH NON-JOINER	Prevents ligature
\u200D	U+200D	ZERO WIDTH JOINER	Forces ligature
\uFEFF	U+FEFF	ZERO WIDTH NO-BREAK SPACE (BOM)	Byte order mark
\u2060	U+2060	WORD JOINER	Non-breaking; took over the word-joining role of U+FEFF
\u180E	U+180E	MONGOLIAN VOWEL SEPARATOR	Removed from Zs in Unicode 6.3
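
The classifications in both tables above can be spot-checked with Python's standard unicodedata module. The following is a minimal sketch: the helper name describe is an assumption made for illustration, and the reported categories depend on the Unicode version bundled with your Python build.

```python
import unicodedata

def describe(ch: str) -> str:
    """Return 'U+XXXX  category  name' for one character (hypothetical helper)."""
    # Control characters such as TAB have no formal name, so supply a fallback.
    name = unicodedata.name(ch, "<unnamed control>")
    return f"U+{ord(ch):04X}  {unicodedata.category(ch)}  {name}"

# A few entries from both tables: White_Space characters first, then space-like ones.
for ch in "\u0009\u0020\u00A0\u2003\u2028\u2029\u200B\u2060":
    print(describe(ch))
```

The general category is a quick hint: Zs is a space separator, Zl and Zp are the line and paragraph separators, Cc is a control character, and Cf (reported for ZWSP and WORD JOINER) is a format character.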

The Spaces You Use Most

Regular Space — U+0020

The standard ASCII space. Width is determined by the font. Line-break algorithms treat it as a valid break opportunity — text can wrap to the next line at a regular space. This is the space you get from your spacebar and the space that virtually all software expects.

No-Break Space (NBSP) — U+00A0

Identical in width to U+0020 but tells renderers not to break the line here. Use it to keep two words together on the same line — for example, between a number and its unit ("100\u00A0km") or between a title and a name ("Dr.\u00A0Smith").

In HTML, the entity &nbsp; produces this character. It is also the character generated by Option+Space on macOS.

Common pitfall: NBSP looks identical to a regular space but fails string comparison. If a user pastes text containing NBSP, your if text == "hello world" check will fail because "hello\u00A0world" is not equal to "hello world".
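
The pitfall is easy to reproduce. The sketch below assumes that treating NBSP as an ordinary space is acceptable for the comparison at hand; the helper name same_ignoring_nbsp is made up for this example.

```python
def same_ignoring_nbsp(a: str, b: str) -> bool:
    """Compare two strings after mapping NO-BREAK SPACE to a regular space."""
    return a.replace("\u00A0", " ") == b.replace("\u00A0", " ")

pasted = "hello\u00A0world"  # looks like "hello world" on screen

print(pasted == "hello world")                    # False
print(same_ignoring_nbsp(pasted, "hello world"))  # True
```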

Em Space — U+2003

A space whose width equals the current font size (1 em). In 16px body text, an em space is 16px wide. Typographers use it for deep indentation and to create fixed-width gutters. In HTML, you can use &emsp; to insert one.

En Space — U+2002

A space whose width is half an em (0.5 em). In 16px text, an en space is 8px wide. It is the traditional typographic space used between numbers in tabular data. HTML entity: &ensp;.

Thin Space — U+2009

A narrow space, typically 1/5 to 1/6 of an em. French typography places a thin space before semicolons, question marks, and exclamation marks; because U+2009 permits a line break, the non-breaking U+202F is the safer choice there. Also used as a thousands separator in numbers following SI conventions: "1\u2009000\u2009000" instead of "1,000,000". HTML entity: &thinsp;.
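
As an illustration of the thousands-separator use, here is a small sketch that formats an integer with thin spaces. The function name group_thousands and its sep parameter are assumptions for this example, not part of any library.

```python
THIN_SPACE = "\u2009"

def group_thousands(n: int, sep: str = THIN_SPACE) -> str:
    """Group digits in threes, using `sep` instead of a comma."""
    # Let Python insert commas first, then swap in the chosen separator.
    return f"{n:,}".replace(",", sep)

print(group_thousands(1000000) == "1\u2009000\u2009000")  # True
```

Passing sep="\u202F" gives the same visual result with a separator that cannot be split across lines.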

Hair Space — U+200A

The thinnest visible space in Unicode, roughly half the width of a thin space. Used for fine-grained typographic adjustments — for instance, adding a sliver of space around an em dash or between nested quotation marks: "She said, 'He whispered,\u200A"Help."\u200A'"

Figure Space — U+2007

A non-breaking space whose width matches the width of a digit (0–9) in the current font. Use it to align columns of numbers without using a monospace font:

Total: $1,234.56
Tax:   $  123.46
               ^^ figure spaces keep digits aligned

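HTML5 defines the named entity &numsp; for this character, although the numeric reference &#x2007; is more commonly seen. For numeric alignment in CSS, text-align combined with font-variant-numeric: tabular-nums is usually the more robust approach.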
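
If you do generate figure spaces in code, padding is the typical use. This is a rough sketch: pad_amount and its default width of 9 are illustrative assumptions, and the columns only line up when the font gives every digit and the figure space the same advance width.

```python
FIGURE_SPACE = "\u2007"

def pad_amount(value: float, width: int = 9) -> str:
    """Right-align a two-decimal amount, padding on the left with figure spaces."""
    return f"{value:.2f}".rjust(width, FIGURE_SPACE)

for amount in (1234.56, 123.46):
    print("$" + pad_amount(amount))
```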

Ideographic Space — U+3000

A full-width space used in CJK (Chinese, Japanese, Korean) typography. Its width matches a single CJK character, which is one em. In Japanese text, paragraph indentation uses U+3000 rather than multiple ASCII spaces. If your application handles CJK input, be aware that users may enter ideographic spaces that look like double-width regular spaces.

Zero-Width Spaces

These characters occupy no visible width but carry semantic meaning:

Zero Width Space (ZWSP) — U+200B

Provides a line-break opportunity without visible space. Useful in languages like Thai and Khmer that do not use spaces between words — inserting ZWSP between words allows the text to wrap correctly at word boundaries without adding visible gaps.

Also used in long URLs and technical strings to allow wrapping:

<span>https://example.com/very/long/path<wbr>/that/needs/wrapping</span>
<!-- The <wbr> element marks a line-break opportunity, much like U+200B, but adds no character to the text -->
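
When the text is produced in code rather than in markup, the same effect can be approximated by inserting U+200B directly. The sketch below is for display text only; the function name add_break_opportunities is hypothetical, and the result should not be used as a link target because the inserted characters survive copy and paste.

```python
ZWSP = "\u200B"

def add_break_opportunities(url: str) -> str:
    """Insert ZWSP after each '/' so a long URL can wrap (display text only)."""
    return url.replace("/", "/" + ZWSP)

display = add_break_opportunities("example.com/very/long/path")

# Make the invisible characters visible to confirm where they landed.
print(display.replace(ZWSP, "|"))  # example.com/|very/|long/|path
```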

Word Joiner — U+2060

The opposite of ZWSP — it is a zero-width character that prevents a line break. Use it wherever you need two tokens to stay on the same line but don't want a visible space between them. It replaced the byte-order-mark character (U+FEFF) in this role as of Unicode 3.2.

For more on zero-width characters, see the Zero-Width Characters guide.

Detecting and Normalizing Whitespace

Python

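Python's str.isspace() method returns True for every character with the Unicode property White_Space=Yes. Its definition is based on general category Zs and the bidirectional classes WS, B, and S, so it also returns True for the information separators U+001C through U+001F: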

# Check if a character is Unicode whitespace
print("\u0020".isspace())   # True  (regular space)
print("\u00A0".isspace())   # True  (no-break space)
print("\u2003".isspace())   # True  (em space)
print("\u200B".isspace())   # False (ZWSP — not White_Space)
print("\u3000".isspace())   # True  (ideographic space)
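
Where an exact White_Space test matters, one option is an explicit set, since str.isspace() also accepts the information separators U+001C to U+001F. The sketch below hard-codes the 25 code points from the table earlier in this guide; the names WHITE_SPACE and is_white_space are assumptions for this example.

```python
# The 25 code points with White_Space=Yes, as listed in the table above.
WHITE_SPACE = frozenset(
    "\u0009\u000A\u000B\u000C\u000D\u0020\u0085\u00A0\u1680"
    "\u2000\u2001\u2002\u2003\u2004\u2005\u2006\u2007\u2008\u2009\u200A"
    "\u2028\u2029\u202F\u205F\u3000"
)

def is_white_space(ch: str) -> bool:
    """True only for characters that carry the White_Space property."""
    return ch in WHITE_SPACE

print(len(WHITE_SPACE))          # 25
print("\u001F".isspace())        # True  (UNIT SEPARATOR counts for isspace)
print(is_white_space("\u001F"))  # False (it is not White_Space)
```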

To normalize all Unicode whitespace to regular ASCII spaces:

import re

def normalize_whitespace(text: str) -> str:
    """Replace all Unicode whitespace with regular spaces, collapse runs."""
    # On str patterns, \s matches Unicode whitespace (the same set as str.isspace())
    return re.sub(r"\s+", " ", text).strip()

messy = "Hello\u00A0\u2003world\u2009!\u3000End"
clean = normalize_whitespace(messy)
print(clean)  # "Hello world ! End"

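Warning: Python's \s in regex matches the White_Space=Yes characters but does not match ZWSP (U+200B) or other zero-width characters. The pattern below strips those too; note that removing U+200C and U+200D can alter emoji sequences and text in scripts such as Persian, so apply it only where that is acceptable: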

import re

INVISIBLE_SPACES = re.compile(
    "[\u200B\u200C\u200D\u2060\uFEFF]"
)

def strip_invisible(text: str) -> str:
    """Remove zero-width space-like characters."""
    return INVISIBLE_SPACES.sub("", text)

def full_normalize(text: str) -> str:
    """Normalize all whitespace and strip invisible characters."""
    text = strip_invisible(text)
    return re.sub(r"\s+", " ", text).strip()
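
Another option worth knowing is Unicode compatibility normalization (NFKC), which folds most of the typographic spaces to U+0020. The sketch below uses the standard unicodedata.normalize call; the wrapper name nfkc_spaces is hypothetical. NFKC is not a whitespace-only transformation: it also rewrites ligatures, full-width letters, and similar compatibility characters, and it leaves OGHAM SPACE MARK, tabs, line breaks, and zero-width characters untouched.

```python
import unicodedata

def nfkc_spaces(text: str) -> str:
    """Apply NFKC, which maps most typographic spaces to U+0020."""
    return unicodedata.normalize("NFKC", text)

print(nfkc_spaces("100\u00A0km") == "100 km")  # True  (NBSP folds to a space)
print(nfkc_spaces("a\u3000b") == "a b")        # True  (ideographic space folds)
print(nfkc_spaces("a\u1680b") == "a b")        # False (U+1680 has no decomposition)
```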

JavaScript

JavaScript's \s in regex matches every White_Space character except NEXT LINE (U+0085), and it additionally matches U+FEFF. For coverage that mirrors the White_Space property exactly, use an explicit character class:

function normalizeWhitespace(text) {
  // Match all Unicode whitespace characters
  return text.replace(/[\u0009-\u000D\u0020\u0085\u00A0\u1680\u2000-\u200A\u2028\u2029\u202F\u205F\u3000]+/g, " ").trim();
}

function stripInvisible(text) {
  return text.replace(/[\u200B\u200C\u200D\u2060\uFEFF]/g, "");
}

HTML

In HTML, consecutive ASCII whitespace characters (spaces, tabs, and line breaks) are collapsed into a single space by default (in normal flow). However, NBSP (\u00A0) is not collapsed; it always renders as a space. This is why &nbsp; is used to create multiple visible spaces in HTML.

The CSS property white-space: pre preserves all whitespace; white-space: pre-wrap preserves it but allows line wrapping.

Security Implications

Exotic whitespace characters are a vector for several security attacks:

Homograph-Style Attacks

An attacker can insert IDEOGRAPHIC SPACE (U+3000), NBSP, or zero-width characters into display names, URLs, or form fields to create strings that look identical or nearly identical to legitimate ones but differ at the byte level. For example, "example.com" followed by U+00A0 renders like the real name yet compares unequal to it. Validation code that only trims ASCII spaces will miss these.
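
When reviewing suspicious input, it helps to make such characters visible. The following is a debugging sketch, not a sanitizer; the function name reveal and the <U+XXXX> output format are assumptions chosen for this example.

```python
import unicodedata

def reveal(text: str) -> str:
    """Replace every whitespace or format character except U+0020 with <U+XXXX>."""
    out = []
    for ch in text:
        # Cf (format) covers ZWSP, ZWNJ, ZWJ, WORD JOINER, and the BOM.
        hidden = ch != " " and (ch.isspace() or unicodedata.category(ch) == "Cf")
        out.append(f"<U+{ord(ch):04X}>" if hidden else ch)
    return "".join(out)

print(reveal("pay\u3000pal\u200B support"))
# pay<U+3000>pal<U+200B> support
```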

Code Injection

In programming languages and configuration files, unusual whitespace characters can bypass input validation. For example, U+00A0 inside a username might pass a "no spaces allowed" regex that only checks for U+0020:

# Vulnerable
username = "admin\u00A0"
if " " not in username:
    print("No spaces found!")  # Passes — but NBSP is there
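
A more defensive pattern is to allow only the characters you expect instead of rejecting the ones you thought of. The sketch below assumes a deliberately strict policy (ASCII letters, digits, and underscore, 3 to 32 characters); both the policy and the name is_valid_username are illustrative, and real systems may need to permit more.

```python
import re

# Assumed policy for illustration: ASCII letters, digits, underscore; 3 to 32 chars.
USERNAME_RE = re.compile(r"[A-Za-z0-9_]{3,32}")

def is_valid_username(name: str) -> bool:
    """Accept a name only if every character is on the allowlist."""
    return USERNAME_RE.fullmatch(name) is not None

print(is_valid_username("admin"))        # True
print(is_valid_username("admin\u00A0"))  # False (trailing NBSP)
print(is_valid_username("ad\u200Bmin"))  # False (embedded ZWSP)
```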

Bidi + Whitespace

Combining right-to-left override characters (U+202E) with unusual spaces can create strings that display differently than their logical order, potentially hiding malicious content in file names, URLs, or source code.
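
One way to flag such input is to look for the explicit bidirectional control characters by their bidi class. This is a minimal sketch using unicodedata.bidirectional from the standard library; the function name find_bidi_controls is an assumption, and whether to reject or merely warn is a policy decision left to the caller.

```python
import unicodedata

# Bidi classes of the explicit embedding, override, and isolate controls.
BIDI_CONTROL_CLASSES = {"LRE", "RLE", "LRO", "RLO", "PDF", "LRI", "RLI", "FSI", "PDI"}

def find_bidi_controls(text: str) -> list:
    """Return (index, code point) pairs for each explicit bidi control in text."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if unicodedata.bidirectional(ch) in BIDI_CONTROL_CLASSES
    ]

# U+202E makes the tail of this file name display in reverse order.
print(find_bidi_controls("invoice\u202Efdp.exe"))  # [(7, 'U+202E')]
```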

Defense: Always normalize whitespace on input. Strip zero-width characters unless you have a specific reason to preserve them. Use Unicode-aware validation rather than ASCII-only checks.

Whitespace in Typography

Choosing the right space character is a typographic decision:

Context	Recommended Space	Why
Number + unit (100 km)	U+00A0 (NBSP)	Prevent line break between value and unit
Thousands separator (1 000 000)	U+2009 (Thin Space)	SI convention, visually lighter than full space
French punctuation (Bonjour !)	U+202F (Narrow NBSP)	French typography requires thin non-breaking space before ;?!:
CJK paragraph indent	U+3000 (Ideographic)	Matches character width in CJK grid
Numeric alignment ($1,234)	U+2007 (Figure Space)	Keeps digits aligned in proportional fonts
Around em dash (word — word)	U+200A (Hair Space)	Adds breathing room without full space
Math formulas (a + b)	U+205F (Medium Math)	Standard math typesetting width
Prevent line break	U+2060 (Word Joiner)	Zero width, prevents break without adding space
Allow line break	U+200B (ZWSP)	Zero width, permits wrapping in long strings
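
Rules like these can be applied mechanically. As one example, the French punctuation row might be sketched as below, assuming the input already has a regular space before the mark; the function name french_spacing is hypothetical, and the sketch deliberately does not insert a space where none exists.

```python
import re

NNBSP = "\u202F"  # NARROW NO-BREAK SPACE

def french_spacing(text: str) -> str:
    """Replace a regular space before ; ? ! : with a narrow no-break space."""
    return re.sub(r" (?=[;?!:])", NNBSP, text)

print(french_spacing("Bonjour !") == "Bonjour\u202F!")  # True
```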

Testing for Whitespace Bugs

If your application accepts user input, test with these strings:

test_cases = [
    "normal spaces",
    "no-break\u00A0space",
    "em\u2003space",
    "thin\u2009space",
    "ideographic\u3000space",
    "zero-width\u200Bspace",
    "mixed\u00A0\u2003\u200Ball spaces",
    "\u00A0leading NBSP",
    "trailing NBSP\u00A0",
    "double\u00A0\u00A0NBSP",
    "tab\u0009separated",
    "crlf\u000D\u000Aline",
]

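def process(text: str) -> None:
    """Placeholder: swap in a call to your own input-handling code."""
    print(repr(text))

for case in test_cases:
    # Does your search index match this against "normal spaces"?
    # Does your trim function handle leading/trailing NBSP?
    # Does your CSV parser split on all whitespace types?
    process(case)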

Summary

Unicode's rich whitespace inventory exists because different writing systems and typographic traditions need different kinds of space. The regular ASCII space (U+0020) is sufficient for most English text, but multilingual applications, typographic software, and security-conscious systems must account for the full range. No-break spaces prevent unwanted line breaks. Thin spaces and hair spaces provide fine-grained typographic control. Zero-width spaces enable wrapping in languages without word-separating spaces. And the full-width ideographic space matches CJK character grids. For robust text handling, normalize whitespace on input using Unicode-aware functions, strip zero-width characters unless intentionally preserved, and test with exotic whitespace in your validation and search code.