空白字符
表示水平或垂直空间但无可见字形的字符,Unicode定义了17种以上具有不同宽度和换行行为的空白字符。
What is Whitespace?
Whitespace refers to any character that represents horizontal or vertical blank space without producing a visible mark. The term covers a wide family of characters — from the ordinary space (U+0020) you type with the spacebar, to tab, newline, carriage return, and a rich set of Unicode-specific space characters with precise typographic widths.
In text processing, whitespace characters are fundamental delimiters. In typography, different whitespace characters carry specific semantic meanings about the size and nature of the blank space they represent.
The Unicode Whitespace Family
Unicode defines approximately 25 characters with whitespace properties. The most important are:
| Character | Unicode | Name | Width |
|---|---|---|---|
|
U+0020 | Space | Standard word space |
| U+00A0 | No-Break Space | Standard (no line break) | |
|
U+2002 | En Space | 1 en (½ em) |
|
U+2003 | Em Space | 1 em |
|
U+2004 | Three-Per-Em Space | ⅓ em |
|
U+2005 | Four-Per-Em Space | ¼ em |
|
U+2006 | Six-Per-Em Space | ⅙ em |
|
U+2007 | Figure Space | Width of a digit |
|
U+2008 | Punctuation Space | Width of a period |
|
U+2009 | Thin Space | ⅕ em |
|
U+200A | Hair Space | 1/24 em (approx) |
\t |
U+0009 | Character Tabulation (Tab) | Variable (tab stop) |
\n |
U+000A | Line Feed | Vertical |
\r |
U+000D | Carriage Return | Vertical |
|
U+3000 | Ideographic Space | 1 full-width em (CJK) |
Control Characters vs. Space Characters
Whitespace splits into two categories: - Space characters: primarily horizontal, create horizontal gaps - Control characters: include line terminators (LF, CR, CRLF, LS, PS) that break lines and create vertical space
Line separator (U+2028) and paragraph separator (U+2029) are Unicode-specific line terminators that some parsers recognize. JavaScript template literals, for example, treat both as newlines.
Whitespace in Programming
Whitespace handling varies significantly across languages and contexts:
# Python: indentation is syntax
if True:
print("This indent matters")
# Python string whitespace
import re
text = "hello\u2003world" # em space
re.split(r'\s+', text) # ['hello', 'world'] — \s matches em space
text.split(' ') # ['hello\u2003world'] — only splits on U+0020
// JavaScript: \s in regex matches all Unicode whitespace
/\s/.test('\u2003') // true — em space
/\s/.test('\u3000') // true — ideographic space
/* CSS white-space property controls whitespace rendering */
p { white-space: normal; } /* collapse and wrap (default) */
p { white-space: pre; } /* preserve all whitespace */
p { white-space: nowrap; } /* no line wrapping */
p { white-space: pre-wrap; } /* preserve, but allow wrapping */
Ideographic Space in CJK Typography
The ideographic space (U+3000, ) is a full-width space used in Chinese, Japanese, and Korean text. Its width matches a full CJK character (one em in a monospace CJK context). It is visually distinct from a standard space and matters for alignment in vertically set or grid-aligned CJK typography.
Quick Facts
| Property | Value |
|---|---|
| Standard space | U+0020 |
| Non-breaking space | U+00A0 |
| Thin space | U+2009 (⅕ em) |
| Hair space | U+200A (1/24 em) |
| Em space | U+2003 |
| Ideographic space (CJK) | U+3000 |
| Unicode line separator | U+2028 |
| Unicode paragraph separator | U+2029 |
\s in most regex engines |
Matches all Unicode whitespace |
相关术语
排版印刷 中的更多内容
CSS @font-face descriptor specifying which Unicode code points a font should cover. …
The mechanism by which a rendering engine substitutes glyphs from a secondary …
Modern font format developed by Microsoft and Adobe supporting up to 65,535 …
字符从右向左流动的文本方向,用于阿拉伯语、希伯来语、塔阿纳等文字,正确显示需要双向算法。
Fonts downloaded by the browser to render text, declared via CSS @font-face. …
U+00A0,防止在该位置换行的空格。HTML中为 ,用于数字与单位之间(100 km)、专有名词(Mr. Smith)和缩写之后。
全角(Em):等于字号的宽度;半角(En):全角的一半,用于定义全角破折号宽度、全角空格、半角空格和CSS单位(1em、0.5em)。
附加在字母上以改变发音或意义的符号,可以是预组合形式(é U+00E9)或组合形式(e + ◌́ U+0065+U+0301),包括重音、变音符、软音符和波浪号等。
特定大小、字重和样式的字型实现,在数字排版中指包含字形定义和度量的字体文件(TTF、OTF、WOFF2)。
字体渲染的字符视觉表现形式。一个字符可有多个字形(连字、上下文形式),一个字形也可表示多个字符。