Unicode Whitespace Characters Guide
Unicode defines over two dozen whitespace characters beyond the ordinary space, including non-breaking spaces, thin spaces, and various-width spaces used in typography. This guide catalogs all Unicode whitespace characters, explains their purposes, and shows how to handle them safely in code.
When most developers think of whitespace they think of three characters: the space, the tab, and the newline. In reality, Unicode defines over 25 distinct whitespace and space-like characters, each with different widths, line-breaking behaviors, and semantic purposes. Using the wrong whitespace character can break parsers, confuse search engines, create invisible security vulnerabilities, and cause layouts to collapse. This guide catalogs every Unicode whitespace character, explains when each one is appropriate, and shows how to detect and normalize them in code.
The Complete Unicode Whitespace Table
The Unicode property White_Space=Yes identifies characters that function as whitespace
in the Unicode standard. Here is every character with that property, plus several
space-like characters that behave as visual spaces but are not classified as
White_Space:
Characters with White_Space=Yes
| Char | Code Point | Name | Width | Breaks Line? |
|---|---|---|---|---|
| \u0009 | U+0009 | CHARACTER TABULATION (Tab) | Variable | No |
| \u000A | U+000A | LINE FEED (LF) | 0 | Yes |
| \u000B | U+000B | LINE TABULATION (VT) | 0 | Yes |
| \u000C | U+000C | FORM FEED (FF) | 0 | Yes |
| \u000D | U+000D | CARRIAGE RETURN (CR) | 0 | Yes |
| \u0020 | U+0020 | SPACE | Normal | No |
| \u0085 | U+0085 | NEXT LINE (NEL) | 0 | Yes |
| \u00A0 | U+00A0 | NO-BREAK SPACE (NBSP) | Normal | No (non-breaking) |
| \u1680 | U+1680 | OGHAM SPACE MARK | Normal | No |
| \u2000 | U+2000 | EN QUAD | En-width | No |
| \u2001 | U+2001 | EM QUAD | Em-width | No |
| \u2002 | U+2002 | EN SPACE | En-width | No |
| \u2003 | U+2003 | EM SPACE | Em-width | No |
| \u2004 | U+2004 | THREE-PER-EM SPACE | 1/3 em | No |
| \u2005 | U+2005 | FOUR-PER-EM SPACE | 1/4 em | No |
| \u2006 | U+2006 | SIX-PER-EM SPACE | 1/6 em | No |
| \u2007 | U+2007 | FIGURE SPACE | Digit-width | No (non-breaking) |
| \u2008 | U+2008 | PUNCTUATION SPACE | Narrow | No |
| \u2009 | U+2009 | THIN SPACE | 1/5–1/6 em | No |
| \u200A | U+200A | HAIR SPACE | Thinnest | No |
| \u2028 | U+2028 | LINE SEPARATOR | 0 | Yes |
| \u2029 | U+2029 | PARAGRAPH SEPARATOR | 0 | Yes |
| \u202F | U+202F | NARROW NO-BREAK SPACE | Narrow | No (non-breaking) |
| \u205F | U+205F | MEDIUM MATHEMATICAL SPACE | 4/18 em | No |
| \u3000 | U+3000 | IDEOGRAPHIC SPACE | Full-width | No |
Space-Like Characters (White_Space=No)
These characters produce visual space or are zero-width, but Unicode does not
classify them as White_Space:
| Char | Code Point | Name | Width | Notes |
|---|---|---|---|---|
| \u200B | U+200B | ZERO WIDTH SPACE (ZWSP) | 0 | Line break opportunity |
| \u200C | U+200C | ZERO WIDTH NON-JOINER | 0 | Prevents ligature |
| \u200D | U+200D | ZERO WIDTH JOINER | 0 | Forces ligature |
| \uFEFF | U+FEFF | ZERO WIDTH NO-BREAK SPACE (BOM) | 0 | Byte order mark |
| \u2060 | U+2060 | WORD JOINER | 0 | Non-breaking, replaced BOM role |
| \u180E | U+180E | MONGOLIAN VOWEL SEPARATOR | 0 | Removed from Zs in Unicode 6.3 |
The Spaces You Use Most
Regular Space — U+0020
The standard ASCII space. Width is determined by the font. Line-break algorithms treat it as a valid break opportunity — text can wrap to the next line at a regular space. This is the space you get from your spacebar and the space that virtually all software expects.
No-Break Space (NBSP) — U+00A0
Identical in width to U+0020 but tells renderers not to break the line here. Use it to keep two words together on the same line — for example, between a number and its unit ("100\u00A0km") or between a title and a name ("Dr.\u00A0Smith").
In HTML, the entity &nbsp; produces this character. It is also the character generated
by Option+Space on macOS.
Common pitfall: NBSP looks identical to a regular space but fails string comparison.
If a user pastes text containing NBSP, your if text == "hello world" check will fail
because "hello\u00A0world" is not equal to "hello world".
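The pitfall is easy to reproduce. The sketch below uses a hypothetical helper, same_text, that treats NBSP as an ordinary space before comparing; whether that folding is appropriate depends on the application.

```python
def same_text(a: str, b: str) -> bool:
    """Compare two strings, treating NBSP (U+00A0) as a regular space.

    Hypothetical helper for illustration; not part of any library.
    """
    return a.replace("\u00A0", " ") == b.replace("\u00A0", " ")


pasted = "hello\u00A0world"               # renders like "hello world"
print(pasted == "hello world")            # False: U+00A0 is not U+0020
print(same_text(pasted, "hello world"))   # True
```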
Em Space — U+2003
A space whose width equals the current font size (1 em). In 16px body text, an em space
is 16px wide. Typographers use it for deep indentation and to create fixed-width
gutters. In HTML, you can use &emsp; to insert one.
En Space — U+2002
A space whose width is half an em (0.5 em). In 16px text, an en space is 8px wide. It
is the traditional typographic space used between numbers in tabular data. HTML entity:
&ensp;.
Thin Space — U+2009
A narrow space, typically 1/5 to 1/6 of an em. Used in French typography before
semicolons, question marks, and exclamation marks (where its non-breaking variant,
U+202F, is generally preferred). Also used as a thousands separator
in numbers following SI conventions: "1\u2009000\u2009000" instead of "1,000,000".
HTML entity: &thinsp;.
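As a rough sketch of the thousands-separator usage, the function below formats an integer with Python's built-in comma grouping and then swaps the commas for thin spaces. The name group_thousands and its sep parameter are illustrative choices, not an established API.

```python
THIN_SPACE = "\u2009"


def group_thousands(n: int, sep: str = THIN_SPACE) -> str:
    """Format an integer with `sep` between groups of three digits."""
    # The "," format spec groups digits by thousands; replace the commas with `sep`.
    return f"{n:,}".replace(",", sep)


print(group_thousands(1000000) == "1\u2009000\u2009000")  # True
```

Passing sep="\u202F" would produce the non-breaking variant, which keeps a long number from wrapping in the middle.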
Hair Space — U+200A
The thinnest visible space in Unicode, roughly half the width of a thin space. Used for fine-grained typographic adjustments — for instance, adding a sliver of space around an em dash or between nested quotation marks: "She said, 'He whispered,\u200A"Help."\u200A'"
Figure Space — U+2007
A non-breaking space whose width matches the width of a digit (0–9) in the current font. Use it to align columns of numbers without using a monospace font:
Total: $1,234.56
Tax:   $  123.46
        ^^ figure spaces keep digits aligned
HTML 4 has no named entity for it (HTML5 defines &numsp;); the numeric reference &#x2007; works everywhere. For tabular layouts, the CSS text-align and
font-variant-numeric: tabular-nums properties give proper numeric alignment.
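The padding idea can be sketched in Python as follows. The helper name pad_amount and the width of 8 are assumptions made for this example. Note that the comma in "1,234.56" is usually narrower than a digit in proportional fonts, so alignment with figure spaces alone is approximate.

```python
FIGURE_SPACE = "\u2007"


def pad_amount(amount: float, width: int) -> str:
    """Right-align a formatted amount by left-padding with figure spaces."""
    text = f"{amount:,.2f}"
    return text.rjust(width, FIGURE_SPACE)


for label, amount in [("Total", 1234.56), ("Tax", 123.46)]:
    print(f"{label}: ${pad_amount(amount, 8)}")
```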
Ideographic Space — U+3000
A full-width space used in CJK (Chinese, Japanese, Korean) typography. Its width matches a single CJK character, which is one em. In Japanese text, paragraph indentation uses U+3000 rather than multiple ASCII spaces. If your application handles CJK input, be aware that users may enter ideographic spaces that look like double-width regular spaces.
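A small illustration of why this matters for input handling: splitting on a literal ASCII space misses U+3000, while Python's argument-less str.split() treats it as whitespace. The sample string is made up for the example.

```python
line = "東京\u3000大阪"  # two words separated by IDEOGRAPHIC SPACE

print(line.split(" "))  # one item: U+3000 is not U+0020
print(line.split())     # two items: split() with no argument uses Unicode whitespace
```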
Zero-Width Spaces
These characters occupy no visible width but carry semantic meaning:
Zero Width Space (ZWSP) — U+200B
Provides a line-break opportunity without visible space. Useful in languages like Thai and Khmer that do not use spaces between words — inserting ZWSP between words allows the text to wrap correctly at word boundaries without adding visible gaps.
Also used in long URLs and technical strings to allow wrapping:
<span>https://example.com/very/long/path<wbr>/that/needs/wrapping</span>
<!-- The <wbr> element is equivalent to inserting U+200B -->
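When the output is plain text rather than HTML, the same effect can be approximated by inserting U+200B directly. The sketch below is one possible approach; the function names are hypothetical, and the result is meant for display only, since a URL containing ZWSP is no longer the original URL if copied.

```python
ZWSP = "\u200B"


def add_break_opportunities(url: str) -> str:
    """Insert ZWSP after each "/" so a long URL can wrap (display only)."""
    return url.replace("/", "/" + ZWSP)


def remove_break_opportunities(text: str) -> str:
    """Undo add_break_opportunities by deleting every ZWSP."""
    return text.replace(ZWSP, "")


url = "https://example.com/very/long/path/that/needs/wrapping"
shown = add_break_opportunities(url)
print(remove_break_opportunities(shown) == url)  # True
```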
Word Joiner — U+2060
The opposite of ZWSP — it is a zero-width character that prevents a line break. Use it wherever you need two tokens to stay on the same line but don't want a visible space between them. It replaced the byte-order-mark character (U+FEFF) in this role as of Unicode 3.2.
For more on zero-width characters, see the Zero-Width Characters guide.
Detecting and Normalizing Whitespace
Python
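Python's str.isspace() method tests whether a character has general category Zs or
bidirectional class WS, B, or S. That covers every character with White_Space=Yes
(plus the information separators U+001C–U+001F):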
# Check if a character is Unicode whitespace
print("\u0020".isspace()) # True (regular space)
print("\u00A0".isspace()) # True (no-break space)
print("\u2003".isspace()) # True (em space)
print("\u200B".isspace()) # False (ZWSP — not White_Space)
print("\u3000".isspace()) # True (ideographic space)
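To see why a given character is or is not treated as whitespace, the standard unicodedata module can report its name and general category. The describe helper below is an illustrative name, not a library function.

```python
import unicodedata


def describe(ch: str) -> str:
    """Summarize one character: code point, category, isspace(), and name."""
    name = unicodedata.name(ch, "<unnamed>")  # control characters have no name
    return f"U+{ord(ch):04X} {unicodedata.category(ch)} isspace={ch.isspace()} {name}"


for ch in "\u0020\u00A0\u2003\u200B\u2060\u3000":
    print(describe(ch))
```

The zero-width characters report category Cf (format) rather than Zs (space separator), which is why isspace() rejects them.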
To normalize all Unicode whitespace to regular ASCII spaces:
import re

def normalize_whitespace(text: str) -> str:
    """Replace all Unicode whitespace with regular spaces, collapse runs."""
    # For str patterns, \s matches every character that str.isspace() accepts
    return re.sub(r"\s+", " ", text).strip()

messy = "Hello\u00A0\u2003world\u2009!\u3000End"
clean = normalize_whitespace(messy)
print(clean)  # "Hello world ! End"
Warning: Python's \s in regex matches White_Space=Yes characters but does not
match ZWSP (U+200B) or other zero-width characters. To strip those too:
import re

INVISIBLE_SPACES = re.compile(
    "[\u200B\u200C\u200D\u2060\uFEFF]"
)

def strip_invisible(text: str) -> str:
    """Remove zero-width space-like characters."""
    return INVISIBLE_SPACES.sub("", text)

def full_normalize(text: str) -> str:
    """Normalize all whitespace and strip invisible characters."""
    text = strip_invisible(text)
    return re.sub(r"\s+", " ", text).strip()
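Some pipelines fold exotic spaces with Unicode normalization rather than a regex. The sketch below (fold_compat_spaces is a made-up name) applies NFKC, which maps compatibility spaces such as U+00A0, U+2003, and U+3000 to U+0020. It is not a drop-in replacement: NFKC also rewrites many non-space characters (ligatures, full-width letters) and leaves zero-width characters untouched.

```python
import unicodedata


def fold_compat_spaces(text: str) -> str:
    """Apply NFKC, which maps many compatibility spaces to U+0020."""
    return unicodedata.normalize("NFKC", text)


sample = "a\u00A0b\u2003c\u3000d\u200Be"
folded = fold_compat_spaces(sample)
print(folded == "a b c d\u200Be")  # True: ZWSP has no decomposition and survives
```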
JavaScript
JavaScript's \s in regex matches almost all Unicode whitespace: it omits U+0085 (NEL)
and additionally matches U+FEFF. For explicit, predictable coverage, use a character class:
function normalizeWhitespace(text) {
  // Match all Unicode whitespace characters
  return text.replace(/[\u0009-\u000D\u0020\u0085\u00A0\u1680\u2000-\u200A\u2028\u2029\u202F\u205F\u3000]+/g, " ").trim();
}

function stripInvisible(text) {
  return text.replace(/[\u200B\u200C\u200D\u2060\uFEFF]/g, "");
}
HTML
In HTML, consecutive whitespace characters are collapsed into a single space by default
(in normal flow). However, NBSP (\u00A0) is not collapsed; it always renders as a
space. This is why &nbsp; is used to create multiple visible spaces in HTML.
The CSS property white-space: pre preserves all whitespace; white-space: pre-wrap
preserves it but allows line wrapping.
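The collapsing behavior can be roughly imitated in Python to predict how a string will render. This is only an approximation under the assumption that ASCII space, tab, LF, CR, and FF are the collapsible characters; the real CSS white-space processing rules have more cases. The name collapse_like_html is hypothetical.

```python
import re

# Assumed set of collapsible characters: space, tab, LF, CR, FF.
_COLLAPSIBLE = re.compile(r"[ \t\n\r\f]+")


def collapse_like_html(text: str) -> str:
    """Collapse runs of ASCII whitespace to one space; leave NBSP alone."""
    return _COLLAPSIBLE.sub(" ", text)


print(collapse_like_html("a   \n  b") == "a b")                       # True
print(collapse_like_html("a\u00A0\u00A0b") == "a\u00A0\u00A0b")       # True: NBSP survives
```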
Security Implications
Exotic whitespace characters are a vector for several security attacks:
Homograph-Style Attacks
An attacker can insert IDEOGRAPHIC SPACE (U+3000), zero-width characters, or other unusual whitespace into display names, URLs, or form fields to create strings that look identical, or nearly identical, to legitimate ones but differ at the code-point level. Validation code that only trims ASCII spaces will miss these.
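One way to audit a suspicious string is to make the hidden characters visible. The reveal function below is a hypothetical debugging helper: it replaces every whitespace or format (Cf) character other than the ordinary space with a visible code-point tag.

```python
import unicodedata


def reveal(text: str) -> str:
    """Replace non-ASCII-space whitespace and format characters with <U+XXXX>."""
    out = []
    for ch in text:
        if ch != " " and (ch.isspace() or unicodedata.category(ch) == "Cf"):
            out.append(f"<U+{ord(ch):04X}>")
        else:
            out.append(ch)
    return "".join(out)


print(reveal("Pay\u3000Pal"))   # Pay<U+3000>Pal
print(reveal("Pay\u200BPal"))   # Pay<U+200B>Pal
```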
Code Injection
In programming languages and configuration files, unusual whitespace characters can bypass input validation. For example, U+00A0 inside a username might pass a "no spaces allowed" regex that only checks for U+0020:
# Vulnerable
username = "admin\u00A0"
if " " not in username:
    print("No spaces found!")  # Passes — but NBSP is there
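A Unicode-aware version of the check might look like the sketch below. The function name is an assumption, and rejecting every Cf (format) character is a deliberately strict policy: it also rejects ZWJ and ZWNJ, which some scripts and emoji sequences legitimately need, so it suits identifiers such as usernames better than free text.

```python
import unicodedata


def has_whitespace_or_invisible(text: str) -> bool:
    """Return True if text contains any Unicode whitespace or format character."""
    return any(
        ch.isspace() or unicodedata.category(ch) == "Cf"
        for ch in text
    )


for candidate in ["admin", "admin\u00A0", "ad\u200Bmin"]:
    print(repr(candidate), has_whitespace_or_invisible(candidate))
```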
Bidi + Whitespace
Combining right-to-left override characters (U+202E) with unusual spaces can create strings that display differently than their logical order, potentially hiding malicious content in file names, URLs, or source code.
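A simple scan can flag bidirectional control characters before a string is displayed or stored. The sketch below hard-codes the explicit embedding, override, and isolate controls plus the directional marks; the set and the function name are choices made for this example, and a production check may want a different policy.

```python
# Bidi controls: ALM, LRM, RLM, LRE, RLE, PDF, LRO, RLO, LRI, RLI, FSI, PDI
BIDI_CONTROLS = frozenset(
    "\u061C\u200E\u200F\u202A\u202B\u202C\u202D\u202E\u2066\u2067\u2068\u2069"
)


def find_bidi_controls(text: str) -> list[tuple[int, str]]:
    """Return (index, code point) pairs for every bidi control in text."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if ch in BIDI_CONTROLS
    ]


print(find_bidi_controls("invoice\u202Efdp.exe"))  # [(7, 'U+202E')]
```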
Defense: Always normalize whitespace on input. Strip zero-width characters unless you have a specific reason to preserve them. Use Unicode-aware validation rather than ASCII-only checks.
Whitespace in Typography
Choosing the right space character is a typographic decision:
| Context | Recommended Space | Why |
|---|---|---|
| Number + unit (100 km) | U+00A0 (NBSP) | Prevent line break between value and unit |
| Thousands separator (1 000 000) | U+2009 (Thin Space) | SI convention, visually lighter than full space |
| French punctuation (Bonjour !) | U+202F (Narrow NBSP) | French typography requires thin non-breaking space before ;?!: |
| CJK paragraph indent | U+3000 (Ideographic) | Matches character width in CJK grid |
| Numeric alignment ($1,234) | U+2007 (Figure Space) | Keeps digits aligned in proportional fonts |
| Around em dash (word — word) | U+200A (Hair Space) | Adds breathing room without full space |
| Math formulas (a + b) | U+205F (Medium Math) | Standard math typesetting width |
| Prevent line break | U+2060 (Word Joiner) | Zero width, prevents break without adding space |
| Allow line break | U+200B (ZWSP) | Zero width, permits wrapping in long strings |
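As an illustration of applying one row of this table programmatically, the sketch below rewrites any ordinary, no-break, or thin space that precedes French high punctuation as U+202F. The function name french_spacing is hypothetical, it only converts spaces that are already present, and house styles differ (some prefer a full NBSP before the colon), so treat it as a starting point.

```python
import re

NNBSP = "\u202F"  # NARROW NO-BREAK SPACE

# One or more ordinary/no-break/thin spaces directly before ; : ? or !
_BEFORE_PUNCT = re.compile(r"[ \u00A0\u2009]+(?=[;:?!])")


def french_spacing(text: str) -> str:
    """Replace the space before ; : ? ! with a narrow no-break space."""
    return _BEFORE_PUNCT.sub(NNBSP, text)


print(french_spacing("Bonjour !") == "Bonjour\u202F!")  # True
```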
Testing for Whitespace Bugs
If your application accepts user input, test with these strings:
test_cases = [
    "normal spaces",
    "no-break\u00A0space",
    "em\u2003space",
    "thin\u2009space",
    "ideographic\u3000space",
    "zero-width\u200Bspace",
    "mixed\u00A0\u2003\u200Ball spaces",
    "\u00A0leading NBSP",
    "trailing NBSP\u00A0",
    "double\u00A0\u00A0NBSP",
    "tab\u0009separated",
    "crlf\u000D\u000Aline",
]

for case in test_cases:
    # Does your search index match this against "normal spaces"?
    # Does your trim function handle leading/trailing NBSP?
    # Does your CSV parser split on all whitespace types?
    process(case)  # process() stands in for your application's input handler
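To turn a list like this into executable checks, each input can be paired with the output you expect after normalization. The sketch below is self-contained and assumes a normalizer that strips zero-width characters and collapses whitespace; the expected values reflect that assumed policy (note that stripping ZWSP joins the surrounding words), so adjust them to match your own.

```python
import re

_INVISIBLE = re.compile("[\u200B\u200C\u200D\u2060\uFEFF]")


def full_normalize(text: str) -> str:
    """Strip zero-width characters, then collapse Unicode whitespace runs."""
    return re.sub(r"\s+", " ", _INVISIBLE.sub("", text)).strip()


# Input -> expected output under the assumed normalization policy
expectations = {
    "no-break\u00A0space": "no-break space",
    "\u00A0leading NBSP": "leading NBSP",
    "trailing NBSP\u00A0": "trailing NBSP",
    "double\u00A0\u00A0NBSP": "double NBSP",
    "zero-width\u200Bspace": "zero-widthspace",
    "crlf\u000D\u000Aline": "crlf line",
}

for raw, expected in expectations.items():
    assert full_normalize(raw) == expected, (raw, expected)
print("all whitespace cases passed")
```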
Summary
Unicode's rich whitespace inventory exists because different writing systems and typographic traditions need different kinds of space. The regular ASCII space (U+0020) is sufficient for most English text, but multilingual applications, typographic software, and security-conscious systems must account for the full range. No-break spaces prevent unwanted line breaks. Thin spaces and hair spaces provide fine-grained typographic control. Zero-width spaces enable wrapping in languages without word-separating spaces. And the full-width ideographic space matches CJK character grids. For robust text handling, normalize whitespace on input using Unicode-aware functions, strip zero-width characters unless intentionally preserved, and test with exotic whitespace in your validation and search code.