What is a code unit?

The smallest unit of an encoding: an 8-bit byte in UTF-8, a 16-bit word in UTF-16, a 32-bit word in UTF-32. A single character may need several code units.

What is a code point?

A numeric value in the Unicode code space (U+0000 to U+10FFFF), written as U+XXXX. Not every code point is assigned to a character.

What is a grapheme cluster?

A user-perceived character: something that feels like a single unit. It may consist of several code points (a base plus combining marks, or an emoji ZWJ sequence). 👩‍💻 (woman + ZWJ + laptop) = 3 code points, 1 grapheme.

Programming & Development

String Length Ambiguity

The "length" of a Unicode string depends on the unit: code units (JavaScript .length), code points (Python len()), or grapheme clusters. 👨‍👩‍👧‍👦 = 7 code points, 1 grapheme.

2024-04-29 · Updated 2024-11-21

What Is String Length and Why Is It Complicated?

Asking "how long is this string?" seems simple but has multiple valid answers depending on what you mean by "length":

Bytes: How many bytes does the encoded string occupy?
Code units: How many code units in the string's internal encoding (UTF-16, UTF-8)?
Code points: How many Unicode code points (scalar values) does the string contain?
Grapheme clusters: How many user-perceived characters (glyphs the user would call "one letter") does the string contain?

These four counts can all be different for the same string.

A Concrete Example

Consider the string "café" and the family emoji "👨‍👩‍👧":

import unicodedata

s = "café"
# Either c-a-f-é (precomposed é) or c-a-f-e + combining acute accent
# Composed (NFC) vs. decomposed (NFD) matters here

nfc = unicodedata.normalize("NFC", "café")   # 4 code points
nfd = unicodedata.normalize("NFD", "café")   # 5 code points (e + combining acute)

len(nfc)  # 4
len(nfd)  # 5

# Family emoji: man + ZWJ + woman + ZWJ + girl = 5 code points, 1 grapheme
family = "👨‍👩‍👧"
len(family)   # 5 (code points: 3 emoji + 2 ZWJ)

const family = "👨‍👩‍👧";
family.length;          // 8 (UTF-16 code units: surrogates + ZWJ)
[...family].length;     // 5 (code points: the spread operator iterates by code point)

// True grapheme count requires Intl.Segmenter (ECMA-402 Internationalization API)
const segmenter = new Intl.Segmenter();
[...segmenter.segment(family)].length;  // 1 (one grapheme cluster!)
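
The same counts can be collected in Python using only the standard library. This is a minimal sketch: the helper name length_measures is hypothetical, and it leaves out grapheme clusters because the standard library has no segmenter. The UTF-16 figure is the one that corresponds to JavaScript's .length.

```python
def length_measures(s: str) -> dict[str, int]:
    """Return several standard-library length measures for s."""
    return {
        # Python's len() counts code points
        "code_points": len(s),
        # "utf-16-le" writes no BOM, so bytes // 2 is the code unit count
        "utf16_code_units": len(s.encode("utf-16-le")) // 2,
        "utf8_bytes": len(s.encode("utf-8")),
    }


# man + ZWJ + woman + ZWJ + girl, written with escapes to make the ZWJs visible
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"

print(length_measures(family))
# {'code_points': 5, 'utf16_code_units': 8, 'utf8_bytes': 18}
```

Each emoji here is outside the BMP, so it takes one code point, two UTF-16 code units, and four UTF-8 bytes; each ZWJ (U+200D) takes one code point, one UTF-16 code unit, and three UTF-8 bytes.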

Byte Length

Byte length depends on the encoding:

s = "Hello, 世界 🌍"

len(s.encode("utf-8"))    # 18 bytes
len(s.encode("utf-16"))   # 26 bytes (2-byte BOM + 24; "utf-16-le" gives 24)
len(s.encode("utf-32"))   # 48 bytes (4-byte BOM + 44; "utf-32-le" gives 44)

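For database column sizing, check which unit the limit uses. In PostgreSQL, VARCHAR(100) limits the number of characters, not bytes; other systems (for example, Oracle VARCHAR2 with its default byte semantics) limit bytes. Under a byte limit, a 100-character CJK string needs 300 bytes in UTF-8 when every character is in the BMP, and up to 400 bytes if it contains supplementary-plane ideographs.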
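
When a limit is expressed in bytes, cutting the encoded string at an arbitrary offset can split a multi-byte sequence. The sketch below shows one way to truncate on a UTF-8 boundary; the function names fits_in_bytes and truncate_utf8 are hypothetical, and the approach assumes the input is a valid Python str.

```python
def fits_in_bytes(s: str, max_bytes: int) -> bool:
    """True if s occupies at most max_bytes when encoded as UTF-8."""
    return len(s.encode("utf-8")) <= max_bytes


def truncate_utf8(s: str, max_bytes: int) -> str:
    """Cut s so its UTF-8 encoding fits in max_bytes without splitting a code point."""
    encoded = s.encode("utf-8")
    if len(encoded) <= max_bytes:
        return s
    # errors="ignore" drops the incomplete trailing sequence left by the slice
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


cjk = "\u4e16\u754c" * 50          # 100 BMP characters, 3 bytes each in UTF-8
print(fits_in_bytes(cjk, 100))     # False (300 bytes)
print(len(truncate_utf8(cjk, 100)))  # 33 (99 bytes; the 100th byte was a partial character)
```

This keeps code points intact but can still cut a grapheme cluster in half (for example, separating a base letter from its combining accent), so user-facing truncation may need a grapheme segmenter as well.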

Grapheme Cluster Length

A grapheme cluster is what a user perceives as a single character. It may span multiple code points:

Precomposed vs. decomposed: é can be one code point (U+00E9) or two (e + U+0301).
Emoji sequences: 👨‍👩‍👧 = 3 base emoji + 2 ZWJ = 5 code points → 1 grapheme.
Flag emoji: 🇺🇸 = 2 regional indicator letters → 1 grapheme.
Keycap sequences: 1️⃣ = digit + VS16 + combining enclosing keycap → 1 grapheme.

# Python: grapheme cluster length
# Standard library has no built-in grapheme segmenter
# Use the 'grapheme' package
import grapheme

grapheme.length("👨‍👩‍👧")    # 1
grapheme.length("café")      # 4 in both NFC and NFD (e + combining accent is one grapheme)
grapheme.length("🇺🇸")       # 1
grapheme.length("e\u0301")   # 1 (e + combining accent = 1 grapheme)
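
To see which code points sit behind a single grapheme, the standard library's unicodedata module can name each one. This is an illustrative sketch; the helper name describe is hypothetical, and the sequences are written with escapes so the invisible characters are explicit.

```python
import unicodedata


def describe(s: str) -> list[str]:
    """List each code point in s as 'U+XXXX NAME'."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in s]


samples = {
    "decomposed e-acute": "e\u0301",
    "US flag": "\U0001F1FA\U0001F1F8",
    "keycap one": "1\uFE0F\u20E3",
    "family": "\U0001F468\u200D\U0001F469\u200D\U0001F467",
}

for label, text in samples.items():
    print(f"{label}: {len(text)} code points")
    for line in describe(text):
        print("   ", line)
```

Each sample renders as one user-perceived character, yet the listing shows two, two, three, and five code points respectively.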

Database Implications

PostgreSQL char_length() counts code points; octet_length() counts bytes:

SELECT
  char_length('café'),           -- 4 (code points)
  octet_length('café'),          -- 5 (UTF-8 bytes: é = 2 bytes)
  length('café'),                 -- 4 (same as char_length for text)
  char_length('👨‍👩‍👧'),     -- 5 (code points)
  octet_length('👨‍👩‍👧');    -- 18 (UTF-8 bytes: 3 × 4 + 2 × 3)

Quick Facts

Measure	Python	JavaScript	Notes
Code points	`len(s)`	`[...s].length`	Unicode scalar values
UTF-16 units	`len(s.encode("utf-16-le")) // 2`	`s.length`	JS native unit
UTF-8 bytes	`len(s.encode("utf-8"))`	`new TextEncoder().encode(s).length`	Storage and wire size
Graphemes	`grapheme.length(s)` (package)	`[...new Intl.Segmenter().segment(s)].length`	User-perceived chars

Related Terms

Code unit, code point, grapheme cluster

More in Programming & Development

Java Unicode

Java strings use UTF-16 internally. char is 16-bit (only BMP). For supplementary …

Mojibake

Text garbled by decoding bytes with the wrong encoding. The term is Japanese (文字化け). Example: 'café' stored as UTF-8 but read as Latin-1 → 'cafÃ©'
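
The 'café' case can be reproduced in a few lines of Python. This is a small sketch of that one scenario (UTF-8 bytes decoded as Latin-1); the variable names are illustrative, and the round-trip repair only works when no bytes were lost along the way.

```python
original = "caf\u00e9"  # 'café' with precomposed é (U+00E9)

# Encode as UTF-8, then decode with the wrong codec
garbled = original.encode("utf-8").decode("latin-1")
print(garbled)   # cafÃ©

# Reversing the wrong step recovers the text
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired == original)  # True
```

The é becomes the two bytes C3 A9 in UTF-8, and Latin-1 maps each byte to its own character, which is why one letter turns into two.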

Python Unicode

Python 3 uses Unicode strings by default (str = UTF-8 internally via …

Rust Unicode

Rust strings (str/String) are guaranteed valid UTF-8. char type represents a Unicode …

Encoding / Decoding

Encoding converts characters to bytes (str.encode('utf-8')); decoding converts bytes to characters (bytes.decode('utf-8')). Doing both correctly prevents mojibake.

Surrogate Pair

Two 16-bit code units (a high surrogate U+D800–U+DBFF followed by a low surrogate U+DC00–U+DFFF) that encode a supplementary character in UTF-16 …

Unicode Regular Expressions

Regex patterns that use Unicode properties: \p{L} (any letter), \p{Script=Greek} (Greek script), \p{Emoji}. Support varies by language and regex engine.

Unicode Escape Sequences

Syntax for representing Unicode characters in source code. It varies by language: \u2713 (Python/Java/JS), \u{2713} (JS/Ruby/Rust), \U00012345 (Python/C)

String

A sequence of characters in a programming language. Internal representations differ: UTF-8 (Go, Rust, newer Python builds), UTF-16 (Java, JavaScript, C#) or …

Replacement Character

U+FFFD (�), shown when a decoder encounters an invalid byte sequence. It is the universal symbol for "something went wrong with decoding."
