String Length Ambiguity

The "length" of a Unicode string depends on the unit being counted: code units (JavaScript .length), code points (Python len()), or grapheme clusters. 👨‍👩‍👧‍👦 = 7 code points, 1 grapheme.

What Is String Length and Why Is It Complicated?

Asking "how long is this string?" seems simple but has multiple valid answers depending on what you mean by "length":

  1. Bytes: How many bytes does the encoded string occupy?
  2. Code units: How many code units in the string's internal encoding (UTF-16, UTF-8)?
  3. Code points: How many Unicode code points (scalar values) does the string contain?
  4. Grapheme clusters: How many user-perceived characters (glyphs the user would call "one letter") does the string contain?

These four counts can all be different for the same string.
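As a rough illustration, the first three counts can be computed with nothing but the Python standard library. The helper name length_report is hypothetical (it is not part of any library), and grapheme clusters are left out because they need a dedicated segmenter.

```python
def length_report(s: str) -> dict:
    """Return three stdlib-only length measures for a string.

    length_report is a hypothetical helper used for illustration only.
    """
    return {
        # Bytes occupied when the string is encoded as UTF-8
        "utf8_bytes": len(s.encode("utf-8")),
        # UTF-16 code units (what JavaScript's .length reports);
        # "utf-16-le" avoids the 2-byte BOM that "utf-16" prepends
        "utf16_units": len(s.encode("utf-16-le")) // 2,
        # Python's len() counts code points
        "code_points": len(s),
    }


# Man + ZWJ + woman + ZWJ + girl, written with escapes so the
# invisible joiners are explicit
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"
print(length_report(family))
# {'utf8_bytes': 18, 'utf16_units': 8, 'code_points': 5}
```

The same emoji therefore has three different "lengths" before grapheme clusters are even considered.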

A Concrete Example

Consider the string "café", and then a family emoji "👨‍👩‍👧":

import unicodedata

s = "café"
# Either c-a-f-é (precomposed é, as in NFC) or c-a-f-e + combining acute accent (as in NFD)
# The normalization form (composed NFC vs. decomposed NFD) changes the code point count

nfc = unicodedata.normalize("NFC", "café")   # 4 code points
nfd = unicodedata.normalize("NFD", "café")   # 5 code points (e + combining acute)

len(nfc)  # 4
len(nfd)  # 5

# Family emoji: man + ZWJ + woman + ZWJ + girl = 5 code points, 1 grapheme
family = "👨‍👩‍👧"
len(family)   # 5 (code points: 3 emoji + 2 ZWJ)

// JavaScript: the same emoji
const family = "👨‍👩‍👧";
family.length;          // 8 (UTF-16 code units: 3 surrogate pairs + 2 ZWJ)
[...family].length;     // 5 (code points: the spread operator iterates by code point)

// True grapheme count requires Intl.Segmenter (ES2022)
const segmenter = new Intl.Segmenter();
[...segmenter.segment(family)].length;  // 1 (one grapheme cluster!)

Byte Length

Byte length depends on the encoding:

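s = "Hello, 世界 🌍"

# Python's "utf-16" and "utf-32" codecs prepend a BOM; the "-le" variants do not
len(s.encode("utf-8"))       # 18 bytes
len(s.encode("utf-16-le"))   # 24 bytes (26 with BOM, via "utf-16")
len(s.encode("utf-32-le"))   # 44 bytes (48 with BOM, via "utf-32")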

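Database column limits do not all use the same unit. In PostgreSQL, VARCHAR(100) limits the number of characters (code points), not bytes, so the byte footprint varies with content. Other systems differ: Oracle VARCHAR2(100) counts bytes under its default length semantics, and SQL Server varchar(100) is sized in bytes. Either way, a 100-character CJK string needs 300 bytes in UTF-8 when every character is in the BMP (3 bytes each), and up to 400 bytes if it contains supplementary-plane ideographs (4 bytes each).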
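Where a storage limit is expressed in bytes, a string has to be cut on a code point boundary rather than at an arbitrary byte offset. The following is a minimal sketch of one way to do that; truncate_utf8 is a hypothetical helper name, not a library function.

```python
def truncate_utf8(s: str, max_bytes: int) -> str:
    """Trim s so that its UTF-8 encoding fits in max_bytes.

    truncate_utf8 is a hypothetical helper used for illustration only.
    """
    encoded = s.encode("utf-8")
    if len(encoded) <= max_bytes:
        return s
    # Slicing bytes may cut a multi-byte sequence in half;
    # errors="ignore" drops the incomplete trailing sequence.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


print(len(("世" * 100).encode("utf-8")))  # 300
print(truncate_utf8("世界", 4))            # 世
```

This keeps every code point intact, but it can still split a grapheme cluster (for example, leaving a dangling ZWJ at the end of an emoji sequence), so user-facing truncation may need a grapheme-aware step as well.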

Grapheme Cluster Length

A grapheme cluster is what a user perceives as a single character. It may span multiple code points:

  • Precomposed vs. decomposed: é can be one code point (U+00E9) or two (e + U+0301).
  • Emoji sequences: 👨‍👩‍👧 = 3 base emoji + 2 ZWJ = 5 code points → 1 grapheme.
  • Flag emoji: 🇺🇸 = 2 regional indicator letters → 1 grapheme.
  • Keycap sequences: 1️⃣ = digit + VS16 + combining enclosing keycap → 1 grapheme.
# Python: grapheme cluster length
# Standard library has no built-in grapheme segmenter
# Use the 'grapheme' package
import grapheme

grapheme.length("👨‍👩‍👧")    # 1
grapheme.length("café")      # 4 in both NFC and NFD form (the combining accent joins the e)
grapheme.length("🇺🇸")       # 1
grapheme.length("e\u0301")   # 1 (e + combining accent = 1 grapheme)
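If a third-party package is not an option, the cases listed above (combining marks, ZWJ sequences, flags, keycaps) can be approximated with the standard library. The sketch below is a deliberate simplification and not an implementation of the full Unicode segmentation rules (UAX #29): it ignores Hangul jamo sequences, treats CR LF as two clusters, and glues any code point that follows a ZWJ. The name approx_grapheme_count is hypothetical.

```python
import unicodedata

ZWJ = 0x200D  # zero-width joiner


def approx_grapheme_count(s: str) -> int:
    """Rough grapheme cluster count; NOT a full UAX #29 implementation."""
    count = 0
    after_zwj = False  # previous code point was a ZWJ
    ri_run = 0         # consecutive regional indicators seen so far
    for ch in s:
        cp = ord(ch)
        extends = (
            # Combining marks; this also covers VS16 (Mn) and the keycap U+20E3 (Me)
            unicodedata.category(ch) in ("Mn", "Me", "Mc")
            or cp == ZWJ
            or 0x1F3FB <= cp <= 0x1F3FF  # emoji skin-tone modifiers
        )
        if extends:
            if count == 0:
                count = 1  # a leading mark still occupies one cluster
            after_zwj = cp == ZWJ
            ri_run = 0
            continue
        if after_zwj:
            # Simplification: anything right after a ZWJ joins the cluster
            after_zwj = False
            ri_run = 0
            continue
        if 0x1F1E6 <= cp <= 0x1F1FF:  # regional indicator symbols
            ri_run += 1
            if ri_run % 2 == 0:
                continue  # second indicator completes a flag
        else:
            ri_run = 0
        count += 1
    return count


print(approx_grapheme_count("e\u0301"))                                         # 1
print(approx_grapheme_count("\U0001F468\u200D\U0001F469\u200D\U0001F467"))      # 1
print(approx_grapheme_count("\U0001F1FA\U0001F1F8"))                            # 1
print(approx_grapheme_count("1\uFE0F\u20E3"))                                   # 1
```

For production use, a segmenter that tracks the current Unicode data is the safer choice, since the rules and the emoji repertoire change between Unicode versions.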

Database Implications

PostgreSQL char_length() counts code points; octet_length() counts bytes:

SELECT
  char_length('café'),           -- 4 (code points)
  octet_length('café'),          -- 5 (UTF-8 bytes: é = 2 bytes)
  length('café'),                -- 4 (equivalent to char_length for text)
  char_length('👨‍👩‍👧'),     -- 5 (code points: 3 emoji + 2 ZWJ)
  octet_length('👨‍👩‍👧');    -- 18 (UTF-8 bytes: 3 × 4 + 2 × 3)

Quick Facts

| Measure | Python | JavaScript | Notes |
| --- | --- | --- | --- |
| Code points | len(s) | [...s].length | Unicode scalar values |
| UTF-16 units | len(s.encode("utf-16-le")) // 2 | s.length | JS native unit |
| UTF-8 bytes | len(s.encode("utf-8")) | new TextEncoder().encode(s).length | |
| Graphemes | grapheme.length(s) (package) | [...new Intl.Segmenter().segment(s)].length | User-perceived chars |
