String Length Ambiguity
The length of a Unicode string depends on the unit you count: code units (JavaScript .length), code points (Python len()), or grapheme clusters. 👨‍👩‍👧‍👦 = 7 code points (4 emoji + 3 zero-width joiners), but a single grapheme cluster.
What Is String Length and Why Is It Complicated?
Asking "how long is this string?" seems simple but has multiple valid answers depending on what you mean by "length":
- Bytes: How many bytes does the encoded string occupy?
- Code units: How many code units in the string's internal encoding (UTF-16, UTF-8)?
- Code points: How many Unicode code points (scalar values) does the string contain?
- Grapheme clusters: How many user-perceived characters (glyphs the user would call "one letter") does the string contain?
These four counts can all be different for the same string.
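To make the difference concrete, here is a minimal sketch that reports three of these counts for one string using only the standard library. The function name `length_report` and the sample string are illustrative assumptions; grapheme clusters are left out because counting them needs a segmenter that the standard library does not provide.

```python
def length_report(s: str) -> dict:
    """Return several 'lengths' of the same string (hypothetical helper)."""
    return {
        # Bytes occupied when the string is encoded as UTF-8.
        "utf8_bytes": len(s.encode("utf-8")),
        # UTF-16 code units: 2 bytes each; the "-le" codec writes no BOM.
        "utf16_units": len(s.encode("utf-16-le")) // 2,
        # Python's len() counts code points.
        "code_points": len(s),
    }

# 'e' + combining acute accent (U+0301) + globe emoji (U+1F30D)
sample = "e\u0301\U0001F30D"
print(length_report(sample))
# {'utf8_bytes': 7, 'utf16_units': 4, 'code_points': 3}
```

A reader would see two characters here (an accented "e" and a globe), so none of the three numbers matches the user-perceived length.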
A Concrete Example
Consider the word "café" and the family emoji "👨‍👩‍👧" (man, woman, and girl joined by zero-width joiners, ZWJ):
import unicodedata
s = "café"
# "é" may be stored precomposed (U+00E9) or as "e" + combining acute accent (U+0301).
# NFC composes and NFD decomposes, so the normalization form changes the count.
nfc = unicodedata.normalize("NFC", s) # 4 code points
nfd = unicodedata.normalize("NFD", s) # 5 code points (e + combining acute)
len(nfc) # 4
len(nfd) # 5
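# Family emoji: man + ZWJ + woman + ZWJ + girl = 5 code points, 1 grapheme
# (written with escapes so the invisible ZWJ characters are explicit)
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"
len(family) # 5 (code points: 3 emoji + 2 ZWJ)
The same string in JavaScript, where .length counts UTF-16 code units:
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}";
family.length; // 8 (UTF-16 code units: 3 surrogate pairs + 2 ZWJ)
[...family].length; // 5 (code points, via string iteration)
// A true grapheme count requires Intl.Segmenter (ECMAScript Internationalization API, ES2022)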
const segmenter = new Intl.Segmenter();
[...segmenter.segment(family)].length; // 1 (one grapheme cluster!)
Byte Length
Byte length depends on the encoding:
s = "Hello, 世界 🌍"
len(s.encode("utf-8")) # 18 bytes
len(s.encode("utf-16-le")) # 24 bytes (the "utf-16" codec prepends a 2-byte BOM: 26)
len(s.encode("utf-32-le")) # 44 bytes (the "utf-32" codec prepends a 4-byte BOM: 48)
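For database column sizing, check which unit the limit uses. PostgreSQL's VARCHAR(100) limits the number of characters, not bytes, while some systems limit bytes (for example, Oracle VARCHAR2 with byte length semantics). Under a byte limit, a 100-character CJK string needs 300 bytes in UTF-8 when every character is in the Basic Multilingual Plane (3 bytes each), and up to 400 bytes for supplementary-plane ideographs (4 bytes each).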
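When a value has to fit a byte budget, cutting the encoded bytes at an arbitrary offset can split a multi-byte sequence. The sketch below shows one way to truncate safely at a code point boundary; `truncate_utf8` is a hypothetical helper name, not part of any library.

```python
def truncate_utf8(s: str, max_bytes: int) -> str:
    """Shorten s so its UTF-8 encoding fits in max_bytes (hypothetical helper)."""
    encoded = s.encode("utf-8")
    if len(encoded) <= max_bytes:
        return s
    # Slicing may leave an incomplete sequence at the end of the bytes;
    # errors="ignore" drops that partial tail instead of raising.
    return encoded[: max(max_bytes, 0)].decode("utf-8", errors="ignore")

print(len(("世" * 100).encode("utf-8")))  # 300
print(truncate_utf8("世界", 4))           # 世
```

This keeps code points intact but can still cut a grapheme cluster in half, for example by separating a base letter from its combining accent.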
Grapheme Cluster Length
A grapheme cluster is what a user perceives as a single character. It may span multiple code points:
- Precomposed vs. decomposed: é can be one code point (U+00E9) or two (e + U+0301).
- Emoji sequences: 👨‍👩‍👧 = 3 base emoji + 2 ZWJ = 5 code points → 1 grapheme.
- Flag emoji: 🇺🇸 = 2 regional indicator symbols → 1 grapheme.
- Keycap sequences: 1️⃣ = digit + VS16 + combining enclosing keycap → 1 grapheme.
# Python: grapheme cluster length
# The standard library has no built-in grapheme segmenter,
# so this uses the third-party 'grapheme' package (pip install grapheme)
import grapheme
grapheme.length("\U0001F468\u200D\U0001F469\u200D\U0001F467") # 1 (man + ZWJ + woman + ZWJ + girl)
grapheme.length("café") # 4 in both NFC and NFD
grapheme.length("🇺🇸") # 1
grapheme.length("e\u0301") # 1 (e + combining accent = 1 grapheme)
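If a third-party dependency is not an option, the cases listed above can be roughly approximated with the standard library. The following is a simplified sketch, not an implementation of the Unicode segmentation rules (UAX #29): it assumes that combining marks and ZWJ attach to the previous cluster, that whatever follows a ZWJ joins it, and that regional indicators pair up into flags. It ignores other rules such as emoji skin-tone modifiers and Hangul syllables. The name `approx_grapheme_count` is hypothetical.

```python
import unicodedata

ZWJ = "\u200d"

def approx_grapheme_count(s: str) -> int:
    """Rough grapheme count; an approximation, not UAX #29."""
    count = 0
    prev_was_zwj = False
    ri_run = 0  # consecutive regional indicators seen so far
    for ch in s:
        is_ri = 0x1F1E6 <= ord(ch) <= 0x1F1FF
        # Combining marks (including VS16 and the enclosing keycap) and ZWJ
        # extend the previous cluster.
        is_extender = ch == ZWJ or unicodedata.category(ch) in ("Mn", "Me", "Mc")
        # The second regional indicator of a pair completes a flag.
        pairs_with_prev_ri = is_ri and ri_run % 2 == 1
        attaches = count > 0 and (is_extender or prev_was_zwj or pairs_with_prev_ri)
        if not attaches:
            count += 1
        ri_run = ri_run + 1 if is_ri else 0
        prev_was_zwj = ch == ZWJ
    return count

print(approx_grapheme_count("e\u0301"))                                      # 1
print(approx_grapheme_count("\U0001F468\u200D\U0001F469\u200D\U0001F467"))  # 1
print(approx_grapheme_count("\U0001F1FA\U0001F1F8"))                         # 1
print(approx_grapheme_count("1\uFE0F\u20E3"))                                # 1
```

For anything user-facing, a maintained segmenter that tracks the current Unicode data is the safer choice; this sketch only illustrates why the four examples collapse to one cluster each.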
Database Implications
PostgreSQL char_length() counts code points; octet_length() counts bytes:
SELECT
char_length('café'), -- 4 (code points)
octet_length('café'), -- 5 (UTF-8 bytes: é = 2 bytes)
length('café'), -- 4 (same as char_length for text)
char_length('👨‍👩‍👧'), -- 5 (code points: 3 emoji + 2 ZWJ)
octet_length('👨‍👩‍👧'); -- 18 (UTF-8 bytes: 3 × 4 + 2 × 3)
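Because a column may be constrained in characters, in bytes, or both, it can help to check a value on the application side before sending it. This is a minimal sketch under the assumption that the database uses UTF-8; `limit_violations` and its parameters are hypothetical names for illustration.

```python
def limit_violations(value: str, max_chars=None, max_bytes=None) -> list:
    """Return the names of the limits that value exceeds (hypothetical helper)."""
    exceeded = []
    # len() on a Python str counts code points, like char_length() on UTF-8 text.
    if max_chars is not None and len(value) > max_chars:
        exceeded.append("max_chars")
    # Encoded size corresponds to octet_length() in a UTF-8 database.
    if max_bytes is not None and len(value.encode("utf-8")) > max_bytes:
        exceeded.append("max_bytes")
    return exceeded

print(limit_violations("café", max_chars=4, max_bytes=4))  # ['max_bytes']
```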
Quick Facts
| Measure | Python | JavaScript | Notes |
|---|---|---|---|
| Code points | len(s) | [...s].length | Unicode scalar values |
| UTF-16 units | len(s.encode("utf-16-le")) // 2 | s.length | JS native unit |
| UTF-8 bytes | len(s.encode("utf-8")) | new TextEncoder().encode(s).length | |
| Graphemes | grapheme.length(s) (package) | [...new Intl.Segmenter().segment(s)].length | User-perceived chars |