
Unicode Whitespace Characters Guide

Unicode defines over two dozen whitespace characters beyond the ordinary space, including non-breaking spaces, thin spaces, and various-width spaces used in typography. This guide catalogs all Unicode whitespace characters, explains their purposes, and shows how to handle them safely in code.


When most developers think of whitespace, they think of three characters: the space, the tab, and the newline. In reality, Unicode defines over 25 distinct whitespace and space-like characters, each with different widths, line-breaking behaviors, and semantic purposes. Using the wrong whitespace character can break parsers, confuse search engines, create invisible security vulnerabilities, and cause layouts to collapse. This guide catalogs every Unicode whitespace character, explains when each one is appropriate, and shows how to detect and normalize them in code.

The Complete Unicode Whitespace Table

The Unicode property White_Space=Yes identifies characters that function as whitespace in the Unicode standard. Here is every character with that property, plus several space-like characters that behave as visual spaces but are not classified as White_Space:

Characters with White_Space=Yes

Char Code Point Name Width Forces Line Break?
\u0009 U+0009 CHARACTER TABULATION (Tab) Variable No
\u000A U+000A LINE FEED (LF) 0 Yes
\u000B U+000B LINE TABULATION (VT) 0 Yes
\u000C U+000C FORM FEED (FF) 0 Yes
\u000D U+000D CARRIAGE RETURN (CR) 0 Yes
\u0020 U+0020 SPACE Normal No
\u0085 U+0085 NEXT LINE (NEL) 0 Yes
\u00A0 U+00A0 NO-BREAK SPACE (NBSP) Normal No (non-breaking)
\u1680 U+1680 OGHAM SPACE MARK Normal No
\u2000 U+2000 EN QUAD En-width No
\u2001 U+2001 EM QUAD Em-width No
\u2002 U+2002 EN SPACE En-width No
\u2003 U+2003 EM SPACE Em-width No
\u2004 U+2004 THREE-PER-EM SPACE 1/3 em No
\u2005 U+2005 FOUR-PER-EM SPACE 1/4 em No
\u2006 U+2006 SIX-PER-EM SPACE 1/6 em No
\u2007 U+2007 FIGURE SPACE Digit-width No (non-breaking)
\u2008 U+2008 PUNCTUATION SPACE Narrow No
\u2009 U+2009 THIN SPACE 1/5–1/6 em No
\u200A U+200A HAIR SPACE Thinnest No
\u2028 U+2028 LINE SEPARATOR 0 Yes
\u2029 U+2029 PARAGRAPH SEPARATOR 0 Yes
\u202F U+202F NARROW NO-BREAK SPACE Narrow No (non-breaking)
\u205F U+205F MEDIUM MATHEMATICAL SPACE 4/18 em No
\u3000 U+3000 IDEOGRAPHIC SPACE Full-width No

Space-Like Characters (White_Space=No)

These characters produce visual space or are zero-width, but Unicode does not classify them as White_Space:

Char Code Point Name Width Notes
\u200B U+200B ZERO WIDTH SPACE (ZWSP) 0 Line break opportunity
\u200C U+200C ZERO WIDTH NON-JOINER 0 Prevents ligature
\u200D U+200D ZERO WIDTH JOINER 0 Forces ligature
\uFEFF U+FEFF ZERO WIDTH NO-BREAK SPACE (BOM) 0 Byte order mark
\u2060 U+2060 WORD JOINER 0 Non-breaking; took over the joiner role of U+FEFF
\u180E U+180E MONGOLIAN VOWEL SEPARATOR 0 Removed from Zs in Unicode 6.3

The Spaces You Use Most

Regular Space — U+0020

The standard ASCII space. Width is determined by the font. Line-break algorithms treat it as a valid break opportunity — text can wrap to the next line at a regular space. This is the space you get from your spacebar and the space that virtually all software expects.

No-Break Space (NBSP) — U+00A0

Identical in width to U+0020 but tells renderers not to break the line here. Use it to keep two words together on the same line — for example, between a number and its unit ("100\u00A0km") or between a title and a name ("Dr.\u00A0Smith").

In HTML, the entity &nbsp; produces this character. It is also the character generated by Option+Space on macOS.

Common pitfall: NBSP looks identical to a regular space but fails string comparison. If a user pastes text containing NBSP, your if text == "hello world" check will fail because "hello\u00A0world" is not equal to "hello world".
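
The failure is easy to reproduce. The sketch below assumes that NFKC normalization is an acceptable comparison policy for the field in question; NFKC also folds other compatibility characters (full-width letters, for example), so it may be too aggressive for some data. The helper name same_text is illustrative, not part of any library.

```python
import unicodedata

def same_text(a: str, b: str) -> bool:
    """Compare two strings after NFKC normalization (an assumed policy)."""
    return unicodedata.normalize("NFKC", a) == unicodedata.normalize("NFKC", b)

pasted = "hello\u00A0world"  # contains NBSP instead of U+0020

print(pasted == "hello world")           # False: U+00A0 is not U+0020
print(same_text(pasted, "hello world"))  # True: NFKC maps NBSP to a regular space
```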

Em Space — U+2003

A space whose width equals the current font size (1 em). In 16px body text, an em space is 16px wide. Typographers use it for deep indentation and to create fixed-width gutters. In HTML, you can use &emsp; to insert one.

En Space — U+2002

A space whose width is half an em (0.5 em). In 16px text, an en space is 8px wide. It is the traditional typographic space used between numbers in tabular data. HTML entity: &ensp;.

Thin Space — U+2009

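A narrow space, typically 1/5 to 1/6 of an em. Traditional French typography places a thin space before semicolons, question marks, and exclamation marks; in digital text the non-breaking variant U+202F is the safer choice there, because a plain thin space still permits a line break. The thin space is also used as a thousands separator in numbers following SI conventions: "1\u2009000\u2009000" instead of "1,000,000". HTML entity: &thinsp;.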
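
As a small illustration of the thousands-separator convention, the sketch below formats an integer with thin spaces. The function name group_thousands and its sep parameter are hypothetical; passing "\u202F" as sep would give a non-breaking variant.

```python
THIN_SPACE = "\u2009"

def group_thousands(n: int, sep: str = THIN_SPACE) -> str:
    """Format an integer with `sep` between groups of three digits."""
    # Let Python insert commas first, then swap them for the chosen space.
    return f"{n:,}".replace(",", sep)

print(repr(group_thousands(1000000)))  # '1\u2009000\u2009000'
```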

Hair Space — U+200A

The thinnest visible space in Unicode, roughly half the width of a thin space. Used for fine-grained typographic adjustments — for instance, adding a sliver of space around an em dash or between nested quotation marks: "She said, 'He whispered,\u200A"Help."\u200A'"

Figure Space — U+2007

A non-breaking space whose width matches the width of a digit (0–9) in the current font. Use it to align columns of numbers without using a monospace font:

Total: $1,234.56
Tax:   $  123.46
        ^^ figure spaces keep digits aligned

In HTML, use the numeric reference &#x2007; (HTML5 also defines the named entity &numsp;), or rely on the CSS text-align and font-variant-numeric: tabular-nums properties for proper numeric alignment.
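
The padding idea can be sketched in a few lines of Python. This is a minimal illustration, assuming the output is rendered in a font whose figure space really matches its digit width; the helper name pad_amount is hypothetical. Because a comma is usually narrower than a digit in proportional fonts, alignment across a thousands separator is only approximate.

```python
FIGURE_SPACE = "\u2007"

def pad_amount(amount: str, width: int) -> str:
    """Right-align a numeric string by left-padding with figure spaces."""
    return amount.rjust(width, FIGURE_SPACE)

rows = [("Total:", "1,234.56"), ("Tax:", "123.46")]
width = max(len(amount) for _, amount in rows)

for label, amount in rows:
    print(f"{label}\t${pad_amount(amount, width)}")
```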

Ideographic Space — U+3000

A full-width space used in CJK (Chinese, Japanese, Korean) typography. Its width matches a single CJK character, which is one em. In Japanese text, paragraph indentation uses U+3000 rather than multiple ASCII spaces. If your application handles CJK input, be aware that users may enter ideographic spaces that look like double-width regular spaces.

Zero-Width Spaces

These characters occupy no visible width but carry semantic meaning:

Zero Width Space (ZWSP) — U+200B

Provides a line-break opportunity without visible space. Useful in languages like Thai and Khmer that do not use spaces between words — inserting ZWSP between words allows the text to wrap correctly at word boundaries without adding visible gaps.

Also used in long URLs and technical strings to allow wrapping:

<span>https://example.com/very/long/path<wbr>/that/needs/wrapping</span>
<!-- The <wbr> element marks a line-break opportunity much like U+200B, without adding a character to the text -->
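
When the output is plain text rather than HTML, the same effect can be approximated by inserting U+200B directly. The sketch below is for display strings only and assumes the result is never used as an actual link target, since the hidden characters travel with the text when it is copied. Both function names are hypothetical.

```python
ZWSP = "\u200B"

def add_break_opportunities(url: str) -> str:
    """Insert ZWSP after every '/' so a renderer may wrap there (display only)."""
    return url.replace("/", "/" + ZWSP)

def remove_break_opportunities(text: str) -> str:
    """Undo add_break_opportunities by deleting every ZWSP."""
    return text.replace(ZWSP, "")

url = "https://example.com/very/long/path/that/needs/wrapping"
display = add_break_opportunities(url)

print(len(url), len(display))                     # display is longer, yet looks the same
print(remove_break_opportunities(display) == url)  # True
```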

Word Joiner — U+2060

The opposite of ZWSP — it is a zero-width character that prevents a line break. Use it wherever you need two tokens to stay on the same line but don't want a visible space between them. It replaced the byte-order-mark character (U+FEFF) in this role as of Unicode 3.2.

For more on zero-width characters, see the Zero-Width Characters guide.

Detecting and Normalizing Whitespace

Python

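Python's str.isspace() method returns True for every character with the Unicode property White_Space=Yes. It also accepts the four ASCII information separators U+001C through U+001F, which are not White_Space: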

# Check if a character is Unicode whitespace
print("\u0020".isspace())   # True  (regular space)
print("\u00A0".isspace())   # True  (no-break space)
print("\u2003".isspace())   # True  (em space)
print("\u200B".isspace())   # False (ZWSP — not White_Space)
print("\u3000".isspace())   # True  (ideographic space)
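
To see exactly where str.isspace() and White_Space differ on a given interpreter, the two sets can be compared directly. The sketch below transcribes the White_Space code points from the table in this guide (the set name WHITE_SPACE is an assumption of this example) and scans the whole code space, which takes a moment. On current CPython releases the extras are expected to be the separators U+001C through U+001F.

```python
import sys

# Code points with White_Space=Yes, transcribed from the table in this guide.
WHITE_SPACE = (
    set(range(0x0009, 0x000E))
    | {0x0020, 0x0085, 0x00A0, 0x1680}
    | set(range(0x2000, 0x200B))
    | {0x2028, 0x2029, 0x202F, 0x205F, 0x3000}
)

# Every code point that str.isspace() accepts in this interpreter.
isspace_true = {cp for cp in range(sys.maxunicode + 1) if chr(cp).isspace()}

missing = WHITE_SPACE - isspace_true  # White_Space characters isspace() rejects
extras = isspace_true - WHITE_SPACE   # other characters isspace() accepts

print("missing:", sorted(f"U+{cp:04X}" for cp in missing))
print("extras: ", sorted(f"U+{cp:04X}" for cp in extras))
```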

To normalize all Unicode whitespace to regular ASCII spaces:

import re

def normalize_whitespace(text: str) -> str:
    """Replace all Unicode whitespace with regular spaces, collapse runs."""
    # \s matches Unicode whitespace by default for str patterns in Python 3
    return re.sub(r"\s+", " ", text).strip()

messy = "Hello\u00A0\u2003world\u2009!\u3000End"
clean = normalize_whitespace(messy)
print(clean)  # "Hello world ! End"

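Warning: Python's \s in regex matches White_Space=Yes characters but does not match ZWSP (U+200B) or other zero-width characters. U+200C and U+200D are meaningful in scripts such as Persian and in emoji sequences, so remove them only where such text is not expected. To strip zero-width characters too: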

import re

INVISIBLE_SPACES = re.compile(
    "[\u200B\u200C\u200D\u2060\uFEFF]"
)

def strip_invisible(text: str) -> str:
    """Remove zero-width space-like characters."""
    return INVISIBLE_SPACES.sub("", text)

def full_normalize(text: str) -> str:
    """Normalize all whitespace and strip invisible characters."""
    text = strip_invisible(text)
    return re.sub(r"\s+", " ", text).strip()
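
Before normalizing, it is often useful to report which unusual characters a string contains. The following is a minimal sketch under the assumption that anything other than U+0020 counts as "unusual", so tabs and newlines are reported as well; the function name find_unusual_spaces and the ZERO_WIDTH set are illustrative. It requires Python 3.9 or later for the built-in generic type hints.

```python
import unicodedata

ZERO_WIDTH = {"\u200B", "\u200C", "\u200D", "\u2060", "\uFEFF"}

def find_unusual_spaces(text: str) -> list[tuple[int, str, str]]:
    """Return (index, code point, name) for each space-like character except U+0020."""
    findings = []
    for index, ch in enumerate(text):
        if ch == " ":
            continue
        if ch.isspace() or ch in ZERO_WIDTH:
            # Control characters such as tab have no Unicode name, hence the default.
            name = unicodedata.name(ch, "<control character>")
            findings.append((index, f"U+{ord(ch):04X}", name))
    return findings

for finding in find_unusual_spaces("Hello\u00A0world\u200B!"):
    print(finding)
# (5, 'U+00A0', 'NO-BREAK SPACE')
# (11, 'U+200B', 'ZERO WIDTH SPACE')
```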

JavaScript

JavaScript's \s in regex matches every White_Space character except U+0085 (NEL), and it also matches U+FEFF. For exact White_Space coverage, use an explicit character class:

function normalizeWhitespace(text) {
  // Match all Unicode whitespace characters
  return text.replace(/[\u0009-\u000D\u0020\u0085\u00A0\u1680\u2000-\u200A\u2028\u2029\u202F\u205F\u3000]+/g, " ").trim();
}

function stripInvisible(text) {
  return text.replace(/[\u200B\u200C\u200D\u2060\uFEFF]/g, "");
}

HTML

In HTML, runs of ASCII whitespace (spaces, tabs, and line breaks) are collapsed into a single space by default (in normal flow). However, NBSP (\u00A0) is not collapsed, so every occurrence renders as a space. This is why &nbsp; is used to create multiple visible spaces in HTML.

The CSS property white-space: pre preserves all whitespace; white-space: pre-wrap preserves it but allows line wrapping.

Security Implications

Exotic whitespace characters are a vector for several security attacks:

Homograph-Style Attacks

An attacker can insert IDEOGRAPHIC SPACE (U+3000) or invisible characters into display names, URLs, or form fields to create strings that look identical to legitimate ones (such as "example.com") but differ at the byte level. Validation code that only trims ASCII spaces will miss these.

Code Injection

In programming languages and configuration files, unusual whitespace characters can bypass input validation. For example, U+00A0 inside a username might pass a "no spaces allowed" regex that only checks for U+0020:

# Vulnerable
username = "admin\u00A0"
if " " not in username:
    print("No spaces found!")  # Passes — but NBSP is there
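
A stricter check might look like the sketch below. It assumes usernames should contain no whitespace, control, or format characters at all, which is a policy choice rather than a universal rule: the Cf category also covers ZWJ and ZWNJ, which some scripts legitimately need. The function name has_hidden_or_space is hypothetical.

```python
import unicodedata

# General categories rejected by this (assumed) username policy:
# Zs/Zl/Zp = space, line, and paragraph separators; Cc = controls; Cf = format characters.
REJECTED_CATEGORIES = {"Zs", "Zl", "Zp", "Cc", "Cf"}

def has_hidden_or_space(username: str) -> bool:
    """True if any character is whitespace, a control, or an invisible format character."""
    return any(
        ch.isspace() or unicodedata.category(ch) in REJECTED_CATEGORIES
        for ch in username
    )

print(has_hidden_or_space("admin"))        # False
print(has_hidden_or_space("admin\u00A0"))  # True: NBSP is category Zs
print(has_hidden_or_space("admin\u200B"))  # True: ZWSP is category Cf
```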

Bidi + Whitespace

Combining right-to-left override characters (U+202E) with unusual spaces can create strings that display differently than their logical order, potentially hiding malicious content in file names, URLs, or source code.

Defense: Always normalize whitespace on input. Strip zero-width characters unless you have a specific reason to preserve them. Use Unicode-aware validation rather than ASCII-only checks.
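
The bidi concern can be checked separately from whitespace. The sketch below flags the explicit directional formatting characters; the set name BIDI_CONTROLS, the function name, and the sample file name are assumptions of this example. Whether to reject or merely flag such input depends on whether the application expects right-to-left text.

```python
# Explicit bidirectional control characters.
BIDI_CONTROLS = {
    "\u061C",                                          # ARABIC LETTER MARK
    "\u200E", "\u200F",                                # LRM, RLM
    "\u202A", "\u202B", "\u202C", "\u202D", "\u202E",  # LRE, RLE, PDF, LRO, RLO
    "\u2066", "\u2067", "\u2068", "\u2069",            # LRI, RLI, FSI, PDI
}

def contains_bidi_control(text: str) -> bool:
    """True if the text contains any explicit bidi control character."""
    return any(ch in BIDI_CONTROLS for ch in text)

suspicious = "invoice\u202Efdp.exe"  # may display with the tail reversed
print(contains_bidi_control(suspicious))     # True
print(contains_bidi_control("invoice.pdf"))  # False
```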

Whitespace in Typography

Choosing the right space character is a typographic decision:

Context Recommended Space Why
Number + unit (100 km) U+00A0 (NBSP) Prevent line break between value and unit
Thousands separator (1 000 000) U+2009 (Thin Space) SI convention, visually lighter than full space
French punctuation (Bonjour !) U+202F (Narrow NBSP) French typography requires thin non-breaking space before ;?!:
CJK paragraph indent U+3000 (Ideographic) Matches character width in CJK grid
Numeric alignment ($1,234) U+2007 (Figure Space) Keeps digits aligned in proportional fonts
Around em dash (word — word) U+200A (Hair Space) Adds breathing room without full space
Math formulas (a + b) U+205F (Medium Math) Standard math typesetting width
Prevent line break U+2060 (Word Joiner) Zero width, prevents break without adding space
Allow line break U+200B (ZWSP) Zero width, permits wrapping in long strings
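
As one example of applying the table, the French punctuation rule can be sketched as a small substitution. This minimal version only upgrades an existing space (regular, no-break, or thin) that directly precedes one of the four marks; it does not insert spaces where none exist, and it is not a full French typography engine. The function name french_spacing is hypothetical.

```python
import re

NARROW_NBSP = "\u202F"

# A regular, no-break, or thin space immediately followed by ; : ? or !
_SPACE_BEFORE_PUNCT = re.compile(r"[ \u00A0\u2009](?=[;:?!])")

def french_spacing(text: str) -> str:
    """Replace the space before ; : ? ! with NARROW NO-BREAK SPACE."""
    return _SPACE_BEFORE_PUNCT.sub(NARROW_NBSP, text)

print(repr(french_spacing("Bonjour !")))  # 'Bonjour\u202f!'
```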

Testing for Whitespace Bugs

If your application accepts user input, test with these strings:

test_cases = [
    "normal spaces",
    "no-break\u00A0space",
    "em\u2003space",
    "thin\u2009space",
    "ideographic\u3000space",
    "zero-width\u200Bspace",
    "mixed\u00A0\u2003\u200Ball spaces",
    "\u00A0leading NBSP",
    "trailing NBSP\u00A0",
    "double\u00A0\u00A0NBSP",
    "tab\u0009separated",
    "crlf\u000D\u000Aline",
]

for case in test_cases:
    # Does your search index match this against "normal spaces"?
    # Does your trim function handle leading/trailing NBSP?
    # Does your CSV parser split on all whitespace types?
    process(case)  # process() stands in for your own input-handling code
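
For the trim and split questions, Python's own string methods behave differently depending on whether an explicit separator is passed. The short sketch below shows the difference; the variable names are illustrative.

```python
padded = "\u00A0leading and trailing NBSP\u00A0"

# With no argument, strip() and split() use Unicode whitespace.
print(repr(padded.strip()))         # 'leading and trailing NBSP'
print("a\u00A0b\u3000c".split())    # ['a', 'b', 'c']

# With an explicit ASCII space, exotic whitespace is left untouched.
print(padded.strip(" ") == padded)  # True: nothing was removed
print(len("a\u00A0b\u3000c".split(" ")))  # 1: no U+0020 to split on

# Zero-width characters are not whitespace, so strip() keeps them.
print(repr("\u200Bx".strip()))      # '\u200bx'
```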

Summary

Unicode's rich whitespace inventory exists because different writing systems and typographic traditions need different kinds of space. The regular ASCII space (U+0020) is sufficient for most English text, but multilingual applications, typographic software, and security-conscious systems must account for the full range. No-break spaces prevent unwanted line breaks. Thin spaces and hair spaces provide fine-grained typographic control. Zero-width spaces enable wrapping in languages without word-separating spaces. And the full-width ideographic space matches CJK character grids. For robust text handling, normalize whitespace on input using Unicode-aware functions, strip zero-width characters unless intentionally preserved, and test with exotic whitespace in your validation and search code.
