The Developer's Unicode Handbook · Chapter 1

String Length Is a Lie

Python's len(), JavaScript's .length, and Java's .length() all lie about string length. This chapter explains the difference between code units, code points, and grapheme clusters, with code examples for counting characters correctly.

~4,000 words · ~16 min read

Every programmer learns early that len(s) returns the length of a string. What they don't learn until a production bug bites them is that "length" has at least four completely different meanings in Unicode, and which one you get depends entirely on which language you're using — and sometimes on which version of that language.

The family emoji 👨‍👩‍👧‍👦 is a perfect teacher. It looks like one character. You can select it in one click. But ask different languages how long it is and you'll get wildly different answers.

The Four Lengths of a String

Before diving into language-specific behavior, we need to establish the vocabulary. A string's "length" can mean any of the following:

Bytes: How many bytes the encoded form occupies in memory or on disk. UTF-8 encodes ASCII in 1 byte but CJK characters in 3 bytes. UTF-16 uses 2 bytes for BMP characters and 4 bytes for supplementary characters.

Code units: The natural integer unit for a given encoding. UTF-8 code units are bytes; UTF-16 code units are 16-bit integers. Java and JavaScript strings are internally UTF-16, so their .length counts 16-bit code units.

Code points: The abstract Unicode scalar values, U+0000 through U+10FFFF. Python 3's len() counts these. An emoji like 😀 is one code point (U+1F600) but two UTF-16 code units (a surrogate pair: \uD83D\uDE00).

Grapheme clusters: What users perceive as a single "character." A grapheme cluster may span multiple code points. The family emoji 👨‍👩‍👧‍👦 is one grapheme cluster composed of 7 code points joined by ZWJ (U+200D).
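Three of these four counts can be read straight off a single string in Python's standard library; a quick sketch:

```python
# The family emoji, spelled out as escapes: 4 person emoji joined by 3 ZWJs
s = "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466"

print(len(s.encode("utf-8")))           # 25 (UTF-8 bytes)
print(len(s.encode("utf-16-le")) // 2)  # 11 (UTF-16 code units)
print(len(s))                           # 7  (code points)
# Grapheme clusters (1 here) need a segmentation library; see later sections.
```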

The Family Emoji Case Study

Let's measure 👨‍👩‍👧‍👦 in every major language:

# Python 3 — counts Unicode code points
s = "👨‍👩‍👧‍👦"
print(len(s))   # 7 (4 person emoji + 3 ZWJ joiners)
# 👨 U+1F468, ZWJ, 👩 U+1F469, ZWJ, 👧 U+1F467, ZWJ, 👦 U+1F466
print([hex(ord(c)) for c in s])
# ['0x1f468', '0x200d', '0x1f469', '0x200d', '0x1f467', '0x200d', '0x1f466']
# grapheme clusters are not counted at all
// JavaScript — counts UTF-16 code units
const s = "👨‍👩‍👧‍👦";
console.log(s.length);        // 11 (each emoji is 2 code units, each ZWJ is 1)
console.log([...s].length);   // 7 (spread iterates code points)
// But that's still not grapheme clusters
// Java — also UTF-16 code units
String s = "\uD83D\uDC68\u200D\uD83D\uDC69\u200D\uD83D\uDC67\u200D\uD83D\uDC66";
System.out.println(s.length());           // 11
System.out.println(s.codePointCount(0, s.length())); // 7 (code points)
// Rust — multiple metrics available
let s = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}";
println!("{}", s.len());           // 25 (bytes — Rust strings are UTF-8!)
println!("{}", s.chars().count()); // 7 (Unicode scalar values)
// grapheme clusters need the `unicode-segmentation` crate
// Swift — the best default behavior of any mainstream language
let s = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}"
print(s.count)                 // 1 (grapheme clusters — correct!)
print(s.unicodeScalars.count)  // 7 (code points)
print(s.utf16.count)           // 11 (UTF-16 code units)
print(s.utf8.count)            // 25 (UTF-8 bytes)

Swift is the only mainstream language where string.count gives you grapheme clusters by default. Everyone else makes you do extra work.

Python: Close but Not Quite

Python 3's len() counts Unicode code points, which is better than UTF-16 code units. But it's still not grapheme clusters:

# Python 3 code points vs grapheme clusters
import unicodedata

# Combining characters: "é" can be e + combining acute accent
e_composed = "\u00E9"       # NFC: single code point
e_decomposed = "e\u0301"    # NFD: two code points

print(len(e_composed))    # 1
print(len(e_decomposed))  # 2  ← surprise!

# They look identical but have different lengths
print(e_composed == e_decomposed)  # False (without normalization)

# For grapheme clusters, use the `grapheme` package
# pip install grapheme
import grapheme
print(grapheme.length(e_composed))    # 1
print(grapheme.length(e_decomposed))  # 1  ← correct
print(grapheme.length("👨‍👩‍👧‍👦"))        # 1  ← correct
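Normalization explains the equality surprise above: converting both spellings to a common normalization form (NFC or NFD) makes them compare equal. A minimal sketch using only the standard library:

```python
import unicodedata

e_composed = "\u00E9"     # é as a single precomposed code point
e_decomposed = "e\u0301"  # e followed by a combining acute accent

# Raw comparison fails, but comparing in a common normalization form succeeds
print(e_composed == e_decomposed)                                # False
print(unicodedata.normalize("NFC", e_decomposed) == e_composed)  # True
print(unicodedata.normalize("NFD", e_composed) == e_decomposed)  # True
```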

For slicing by grapheme cluster in Python, the grapheme package is the most straightforward option:

import grapheme

text = "Hello, \U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466!"

# Slice the first 8 "characters" (grapheme clusters)
clusters = list(grapheme.graphemes(text))
first_eight = "".join(clusters[:8])
print(first_eight)  # "Hello, 👨‍👩‍👧‍👦" — 8 visible characters, emoji intact

JavaScript: The .length Trap

JavaScript's String.prototype.length is almost certainly the most common source of Unicode bugs in web applications. It counts UTF-16 code units, so every code point above U+FFFF (which includes nearly all emoji) counts as 2.

// The spread operator gives code points
const emoji = "\u{1F600}";
console.log(emoji.length);       // 2 (UTF-16 surrogate pair)
console.log([...emoji].length);  // 1 (code points)

// But grapheme clusters still require Intl.Segmenter
const segmenter = new Intl.Segmenter();
const segments = [...segmenter.segment("👨‍👩‍👧‍👦")];
console.log(segments.length);  // 1 (correct!)

// Safe truncation without breaking emoji
function truncateToGraphemes(str, maxLength) {
  const segmenter = new Intl.Segmenter();
  const segments = [...segmenter.segment(str)];
  return segments.slice(0, maxLength).map(s => s.segment).join("");
}

const text = "Hello 👨‍👩‍👧‍👦 World";
console.log(truncateToGraphemes(text, 9));  // "Hello 👨‍👩‍👧‍👦 W"

Intl.Segmenter is standardized in ECMA-402 and is now available in all major browsers. For older environments, the graphemer npm package provides equivalent grapheme segmentation.

Java and Rust: The Explicit Approach

Java gives you the most explicit control, which is both its strength and its verbosity:

import java.text.BreakIterator;

public class GraphemeCount {
    public static int countGraphemes(String text) {
        BreakIterator it = BreakIterator.getCharacterInstance();
        it.setText(text);
        int count = 0;
        while (it.next() != BreakIterator.DONE) count++;
        return count;
    }

    public static void main(String[] args) {
        String family = "\uD83D\uDC68\u200D\uD83D\uDC69\u200D\uD83D\uDC67\u200D\uD83D\uDC66";
        System.out.println(countGraphemes(family));  // 1
    }
}

Rust, with the unicode-segmentation crate, provides an idiomatic iterator:

use unicode_segmentation::UnicodeSegmentation;

fn main() {
    let family = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}";

    let grapheme_count = family.graphemes(true).count();
    println!("{}", grapheme_count);  // 1

    // Truncate to first N grapheme clusters
    let truncated: String = family.graphemes(true).take(1).collect();
    println!("{}", truncated);  // 👨‍👩‍👧‍👦
}

The Regex \X Solution

In languages that support PCRE- or ICU-style regexes, \X matches a single grapheme cluster. This is extremely useful for character-counting operations:

import regex  # pip install regex (third-party module; the built-in re lacks \X)

text = "caf\u00E9 \U0001F468\u200D\U0001F469"

# Count grapheme clusters with \X
clusters = regex.findall(r"\X", text)
print(len(clusters))  # 6 (c, a, f, é, space, and the 👨‍👩 ZWJ sequence as one cluster)
# Perl has \X built into its regex engine
my $text = "caf\x{00E9} \x{1F468}\x{200D}\x{1F469}";
my @graphemes = ($text =~ /\X/g);
print scalar @graphemes;  # 6

Practical Guidelines

For display length / truncation: Always use grapheme clusters. A tweet with "20 characters" means 20 grapheme clusters, not 20 bytes or 20 code points.

For storage size: Use bytes. Your database column VARCHAR(255) means 255 bytes in some databases. Check whether your ORM is transparently handling this.
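To measure storage size in Python, encode first and count bytes; a small illustration:

```python
s = "naïve"
print(len(s))                  # 5 code points
print(len(s.encode("utf-8")))  # 6 bytes: ï (U+00EF) takes 2 bytes in UTF-8
```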

For string indexing: Be aware that Python and JavaScript indexing is by code point and code unit respectively. Slicing s[0:5] can produce broken text if you split in the middle of a grapheme cluster.
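The hazard is easy to reproduce in Python with a decomposed accent, where a code-point slice silently drops the combining mark:

```python
s = "cafe\u0301"   # "café" in decomposed (NFD) form: 5 code points
print(len(s))      # 5, even though only 4 characters are visible
print(s[:4])       # "cafe" — the combining accent was sliced off
```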

For random access: If you need random access to grapheme clusters, pre-segment the string into a list of grapheme clusters first, then index into the list.
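The pre-segment-then-index pattern can be sketched with a deliberately simplified splitter. This toy function only glues combining marks and ZWJ-joined sequences onto the previous cluster; it is not a full UAX #29 implementation, so real code should use the grapheme package, the regex module's \X, or an equivalent library instead:

```python
import unicodedata

def naive_graphemes(s):
    """Toy grapheme splitter: attach combining marks and ZWJ-joined
    characters to the preceding cluster. Not full UAX #29 segmentation."""
    clusters = []
    join_next = False
    for ch in s:
        glue = join_next or (clusters and unicodedata.combining(ch)) or ch == "\u200d"
        if glue and clusters:
            clusters[-1] += ch       # extend the previous cluster
        else:
            clusters.append(ch)      # start a new cluster
        join_next = ch == "\u200d"   # a ZWJ also glues the NEXT character
    return clusters

text = "ne\u0301e \U0001F469\u200D\U0001F52C"  # "née 👩‍🔬" (decomposed é)
clusters = naive_graphemes(text)
print(len(clusters))   # 5: n, é, e, space, 👩‍🔬
print(clusters[4])     # the whole ZWJ sequence, indexed as one unit
```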

The bottom line: whenever a user interacts with "characters" — typing, selecting, copying, counting — they're thinking in grapheme clusters. Your code should too.