
String Length Ambiguity

The "length" of a Unicode string depends on the unit being counted: code units (JavaScript .length), code points (Python len()), or grapheme clusters. 👨‍👩‍👧‍👦 = 7 code points, 1 grapheme.


What Is String Length and Why Is It Complicated?

Asking "how long is this string?" seems simple but has multiple valid answers depending on what you mean by "length":

  1. Bytes: How many bytes does the encoded string occupy?
  2. Code units: How many code units in the string's internal encoding (UTF-16, UTF-8)?
  3. Code points: How many Unicode code points (scalar values) does the string contain?
  4. Grapheme clusters: How many user-perceived characters (glyphs the user would call "one letter") does the string contain?

These four counts can all be different for the same string.
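As a rough illustration, three of the four counts can be computed with the Python standard library alone. The helper name string_lengths below is made up for this sketch; grapheme clusters are left out because the standard library has no segmenter.

```python
def string_lengths(s: str) -> dict[str, int]:
    """Return byte, code unit, and code point counts for s."""
    return {
        "utf8_bytes": len(s.encode("utf-8")),
        # "utf-16-le" omits the BOM, so dividing by 2 gives the code unit count
        "utf16_units": len(s.encode("utf-16-le")) // 2,
        "code_points": len(s),
    }


# e + combining acute accent: 3 bytes, 2 code units, 2 code points
print(string_lengths("e\u0301"))

# One emoji outside the BMP: 4 bytes, 2 code units, 1 code point
print(string_lengths("\U0001F30D"))
```

Even for a single visible character, the three numbers already disagree.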

A Concrete Example

Consider two strings, "café" and the family emoji "👨‍👩‍👧":

import unicodedata

# "café" can be stored as c-a-f-é (precomposed é, U+00E9)
# or as c-a-f-e + combining acute accent (U+0301).
# NFC composes and NFD decomposes, so the code point count differs.

nfc = unicodedata.normalize("NFC", "café")   # 4 code points
nfd = unicodedata.normalize("NFD", "café")   # 5 code points (e + combining acute)

len(nfc)  # 4
len(nfd)  # 5

# Family emoji: man + ZWJ + woman + ZWJ + girl = 5 code points, 1 grapheme
family = "👨‍👩‍👧"
len(family)   # 5 (code points: 3 emoji + 2 ZWJ)

The same string in JavaScript:

const family = "👨‍👩‍👧";
family.length;          // 8 (UTF-16 code units: 3 surrogate pairs + 2 ZWJ)
[...family].length;     // 5 (code points: the spread operator iterates by code point)

// True grapheme count requires Intl.Segmenter (ECMA-402, the ECMAScript Internationalization API)
const segmenter = new Intl.Segmenter();
[...segmenter.segment(family)].length;  // 1 (one grapheme cluster!)
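To see where a code point count comes from, it can help to list the code points by name. The following is a minimal Python sketch using only the standard library; the helper name describe_code_points is made up for illustration.

```python
import unicodedata


def describe_code_points(s: str) -> list[str]:
    """List each code point in s as 'U+XXXX NAME'."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}" for ch in s]


# Written with escapes so the invisible joiners are explicit:
# man, ZWJ, woman, ZWJ, girl
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"

for line in describe_code_points(family):
    print(line)
# U+1F468 MAN
# U+200D ZERO WIDTH JOINER
# U+1F469 WOMAN
# U+200D ZERO WIDTH JOINER
# U+1F467 GIRL
```

The listing shows five code points: three emoji and the two zero width joiners that glue them into one grapheme cluster.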

Byte Length

Byte length depends on the encoding:

s = "Hello, 世界 🌍"

len(s.encode("utf-8"))       # 18 bytes
len(s.encode("utf-16-le"))   # 24 bytes (26 with "utf-16", which prepends a BOM)
len(s.encode("utf-32-le"))   # 44 bytes (48 with "utf-32", which prepends a BOM)

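For database column sizing, check which unit the limit uses. In PostgreSQL, VARCHAR(100) limits the number of characters (code points), not bytes. Other systems count bytes: Oracle's VARCHAR2(100), for example, uses byte semantics by default. When the limit is in bytes, a 100-character CJK string needs 300 bytes in UTF-8 (3 bytes per BMP character), or up to 400 bytes if it contains supplementary-plane characters.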
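When a limit is expressed in bytes, slicing the string by characters is not enough, and slicing the encoded bytes can cut a multi-byte sequence in half. The sketch below shows one common way to handle this in Python; fits_in_bytes and truncate_utf8 are hypothetical helper names, not part of any database driver.

```python
def fits_in_bytes(s: str, max_bytes: int) -> bool:
    """Return True if the UTF-8 encoding of s fits in max_bytes."""
    return len(s.encode("utf-8")) <= max_bytes


def truncate_utf8(s: str, max_bytes: int) -> str:
    """Cut s so its UTF-8 encoding fits in max_bytes without splitting a code point."""
    encoded = s.encode("utf-8")[:max_bytes]
    # errors="ignore" drops an incomplete multi-byte sequence left at the cut
    return encoded.decode("utf-8", errors="ignore")


print(fits_in_bytes("\u4e16" * 100, 300))   # True: 100 BMP CJK characters, 3 bytes each
print(truncate_utf8("\u4e16\u754c", 4))     # keeps only the first character (3 bytes)
```

This keeps code points intact but can still split a grapheme cluster, for example by keeping a base letter and dropping its combining accent.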

Grapheme Cluster Length

A grapheme cluster is what a user perceives as a single character. It may span multiple code points:

  • Precomposed vs. decomposed: é can be one code point (U+00E9) or two (e + U+0301).
  • Emoji sequences: 👨‍👩‍👧 = 3 base emoji + 2 ZWJ = 5 code points → 1 grapheme.
  • Flag emoji: 🇺🇸 = 2 regional indicator letters → 1 grapheme.
  • Keycap sequences: 1️⃣ = digit + VS16 + combining enclosing keycap → 1 grapheme.

# Python: grapheme cluster length
# The standard library has no built-in grapheme segmenter,
# so this uses the third-party 'grapheme' package (pip install grapheme)
import grapheme

grapheme.length("👨‍👩‍👧")    # 1
grapheme.length("café")      # 4 (same in NFC and NFD: e + combining accent is one grapheme)
grapheme.length("🇺🇸")       # 1
grapheme.length("e\u0301")   # 1 (e + combining accent = 1 grapheme)
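To make the grouping rules in the list above more concrete, here is a deliberately simplified, standard-library-only approximation. It is not the full Unicode segmentation algorithm (UAX #29): it only handles combining marks, variation selectors, ZWJ joins, and regional indicator pairs, and it will miscount other cases such as emoji skin-tone modifiers or Hangul jamo. All names in it are made up for this sketch.

```python
import unicodedata

ZWJ = "\u200d"


def is_regional_indicator(ch: str) -> bool:
    # Regional indicator symbols A..Z, used in pairs for flags
    return "\U0001F1E6" <= ch <= "\U0001F1FF"


def extends_previous(ch: str) -> bool:
    # Combining marks (this includes variation selectors and the
    # enclosing keycap) and ZWJ attach to the cluster before them.
    return ch == ZWJ or unicodedata.category(ch) in ("Mn", "Mc", "Me")


def approx_grapheme_count(s: str) -> int:
    """Approximate the number of grapheme clusters in s (simplified rules)."""
    count = 0
    prev = ""
    flag_half_open = False  # True after the first regional indicator of a pair
    for ch in s:
        if is_regional_indicator(ch):
            if flag_half_open:
                flag_half_open = False  # second half of a flag: same cluster
            else:
                flag_half_open = True
                count += 1
        else:
            flag_half_open = False
            if extends_previous(ch):
                count = max(count, 1)  # a leading mark still forms one cluster
            elif prev != ZWJ:
                count += 1  # anything not joined by a ZWJ starts a new cluster
        prev = ch
    return count


print(approx_grapheme_count("cafe\u0301"))                                     # 4
print(approx_grapheme_count("\U0001F468\u200D\U0001F469\u200D\U0001F467"))     # 1
print(approx_grapheme_count("\U0001F1FA\U0001F1F8"))                           # 1
print(approx_grapheme_count("1\uFE0F\u20E3"))                                  # 1
```

For production use, a maintained segmenter (such as the grapheme package shown above, or Intl.Segmenter in JavaScript) is the safer choice, since the real rules are data-driven and change between Unicode versions.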

Database Implications

PostgreSQL char_length() counts code points; octet_length() counts bytes:

SELECT
  char_length('café'),           -- 4 (code points, with precomposed é)
  octet_length('café'),          -- 5 (UTF-8 bytes: é = 2 bytes)
  length('café'),                -- 4 (same as char_length for text)
  char_length('👨‍👩‍👧'),     -- 5 (code points: 3 emoji + 2 ZWJ)
  octet_length('👨‍👩‍👧');    -- 18 (UTF-8 bytes: 3 × 4 + 2 × 3)

Quick Facts

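| Measure      | Python                           | JavaScript                                  | Notes                 |
|--------------|----------------------------------|---------------------------------------------|-----------------------|
| Code points  | len(s)                           | [...s].length                               | Unicode scalar values |
| UTF-16 units | len(s.encode("utf-16-le")) // 2  | s.length                                    | JS native unit        |
| UTF-8 bytes  | len(s.encode("utf-8"))           | new TextEncoder().encode(s).length          |                       |
| Graphemes    | grapheme.length(s) (package)     | [...new Intl.Segmenter().segment(s)].length | User-perceived chars  |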
