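String Length Ambiguity
The "length" of a Unicode string depends on the unit of measurement: code units (JavaScript .length), code points (Python len()), or grapheme clusters. 👨‍👩‍👧‍👦 = 7 code points, 1 grapheme.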
What Is String Length and Why Is It Complicated?
Asking "how long is this string?" seems simple but has multiple valid answers depending on what you mean by "length":
- Bytes: How many bytes does the encoded string occupy?
- Code units: How many code units in the string's internal encoding (UTF-16, UTF-8)?
- Code points: How many Unicode code points (scalar values) does the string contain?
- Grapheme clusters: How many user-perceived characters (glyphs the user would call "one letter") does the string contain?
These four counts can all be different for the same string.
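As a minimal sketch, the first three counts can be computed with the Python standard library alone. The helper name `measure` is hypothetical, and grapheme clusters are left out because the standard library has no segmenter for them.

```python
def measure(s: str) -> dict:
    """Return the length of s in three different units."""
    return {
        "utf8_bytes": len(s.encode("utf-8")),
        # Each UTF-16 code unit is 2 bytes; "utf-16-le" adds no BOM.
        "utf16_code_units": len(s.encode("utf-16-le")) // 2,
        "code_points": len(s),
    }


# U+1F30D (an earth globe emoji) is a single code point outside the BMP.
print(measure("\U0001F30D"))
# {'utf8_bytes': 4, 'utf16_code_units': 2, 'code_points': 1}
```

Even for a single emoji, the three numbers already disagree.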
A Concrete Example
Consider the string "café" followed by a family emoji "👨👩👧":
import unicodedata
s = "café"
# "café" can be spelled c-a-f-é (precomposed U+00E9) or c-a-f-e + U+0301 (combining acute)
# NFC composes to the first form; NFD decomposes to the second
nfc = unicodedata.normalize("NFC", "café") # 4 code points
nfd = unicodedata.normalize("NFD", "café") # 5 code points (e + combining acute)
len(nfc) # 4
len(nfd) # 5
# Family emoji: man + ZWJ + woman + ZWJ + girl = 5 code points, 1 grapheme
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467" # renders as 👨‍👩‍👧
len(family) # 5 (code points: 3 emoji + 2 ZWJ)
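The same string in JavaScript:
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}"; // renders as 👨‍👩‍👧
family.length; // 8 (UTF-16 code units: 3 surrogate pairs + 2 ZWJ)
[...family].length; // 5 (code points: the spread operator iterates by code point)
// A true grapheme count requires Intl.Segmenter (part of the ECMA-402 Internationalization API)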
const segmenter = new Intl.Segmenter();
[...segmenter.segment(family)].length; // 1 (one grapheme cluster!)
Byte Length
Byte length depends on the encoding:
s = "Hello, 世界 🌍"
len(s.encode("utf-8")) # 18 bytes
len(s.encode("utf-16-le")) # 24 bytes (26 with "utf-16", which prepends a 2-byte BOM)
len(s.encode("utf-32-le")) # 44 bytes (48 with "utf-32", which prepends a 4-byte BOM)
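For database column sizing, check which unit the limit uses. In PostgreSQL, VARCHAR(100) limits the number of characters (code points), not bytes. Other systems count bytes, for example Oracle VARCHAR2(100) under its default byte length semantics. Where the limit is in bytes, a 100-character CJK string needs 300 bytes in UTF-8 for BMP ideographs, and up to 400 bytes for supplementary-plane ideographs.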
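Where a storage limit is expressed in bytes, text has to be cut on a code point boundary or the result is invalid UTF-8. The following is a rough sketch of one way to do that; `truncate_utf8` is a hypothetical helper name, not part of any library.

```python
def truncate_utf8(s: str, max_bytes: int) -> str:
    """Cut s so its UTF-8 encoding fits in max_bytes without splitting a code point."""
    encoded = s.encode("utf-8")
    if len(encoded) <= max_bytes:
        return s
    # A byte-level cut can leave an incomplete sequence at the end;
    # errors="ignore" drops those trailing bytes when decoding.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


# Each of these ideographs takes 3 bytes, so only the first fits in 4 bytes.
print(truncate_utf8("世界", 4))  # 世
```

This keeps code points intact but can still split a grapheme cluster (for example, separating a base letter from its combining accent), so user-facing truncation may need a grapheme-aware cut instead.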
Grapheme Cluster Length
A grapheme cluster is what a user perceives as a single character. It may span multiple code points:
- Precomposed vs. decomposed: é can be one code point (U+00E9) or two (e + U+0301).
- Emoji sequences: 👨‍👩‍👧 = 3 base emoji + 2 ZWJ = 5 code points → 1 grapheme.
- Flag emoji: 🇺🇸 = 2 regional indicator symbols → 1 grapheme.
- Keycap sequences: 1️⃣ = digit + VS16 + combining enclosing keycap → 1 grapheme.
# Python: grapheme cluster length
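# The standard library has no built-in grapheme segmenter
# Use the third-party 'grapheme' package (pip install grapheme)
import grapheme
grapheme.length("\U0001F468\u200D\U0001F469\u200D\U0001F467") # 1 (family emoji 👨‍👩‍👧)
grapheme.length("café") # 4 (in both NFC and NFD: 4 user-perceived characters)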
grapheme.length("🇺🇸") # 1
grapheme.length("e\u0301") # 1 (e + combining accent = 1 grapheme)
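To see why one grapheme can hide several code points, it can help to list what a sequence is made of. This sketch uses only the standard library's `unicodedata` module; the helper name `code_point_names` is an assumption made for illustration.

```python
import unicodedata


def code_point_names(s: str) -> list[str]:
    """List each code point in s as 'U+XXXX NAME'."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in s]


# Keycap sequence: digit + VS16 + combining enclosing keycap.
keycap = "1\uFE0F\u20E3"
for line in code_point_names(keycap):
    print(line)
# U+0031 DIGIT ONE
# U+FE0F VARIATION SELECTOR-16
# U+20E3 COMBINING ENCLOSING KEYCAP
```

The same helper applied to a flag or a ZWJ family sequence shows the regional indicators or the ZERO WIDTH JOINER entries that a grapheme segmenter folds into one cluster.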
Database Implications
PostgreSQL char_length() counts code points; octet_length() counts bytes:
SELECT
char_length('café'), -- 4 (code points)
octet_length('café'), -- 5 (UTF-8 bytes: é = 2 bytes)
length('café'), -- 4 (same as char_length for text arguments)
char_length('👨‍👩‍👧'), -- 5 (code points: 3 emoji + 2 ZWJ)
octet_length('👨‍👩‍👧'); -- 18 (UTF-8 bytes: 3 × 4 + 2 × 3)
Quick Facts
| Measure | Python | JavaScript | Notes |
|---|---|---|---|
| Code points | len(s) | [...s].length | Unicode scalar values |
| UTF-16 units | len(s.encode("utf-16-le")) // 2 | s.length | JS native unit |
| UTF-8 bytes | len(s.encode("utf-8")) | new TextEncoder().encode(s).length | |
| Graphemes | grapheme.length(s) (package) | [...new Intl.Segmenter().segment(s)].length | User-perceived chars |