💻 Unicode in Code

Unicode in Rust

Rust's str and String types are guaranteed to be valid UTF-8, making it one of the safest languages for Unicode text handling at the type level. This guide explains Rust's Unicode guarantees, how to iterate over characters and bytes, and the unicode-segmentation crate for grapheme cluster support.


Rust takes an uncompromising stance on Unicode: the String type and &str slices are guaranteed to contain valid UTF-8 at all times. The compiler and standard library enforce this invariant: a string literal containing invalid UTF-8 is rejected at compile time, and constructing a String from invalid bytes at runtime returns an error -- invalid data is never silently accepted. This makes Rust one of the safest languages for Unicode text processing, but it also means that string operations behave differently from languages like C or Python.

Strings Are Valid UTF-8

Rust has two primary string types:

Type     Ownership                           Guarantee
String   Owned, heap-allocated, growable     Valid UTF-8
&str     Borrowed reference (string slice)   Valid UTF-8

let s: String = String::from("Hello, 世界");
let slice: &str = "Hello, 世界";     // string literal → &str

// Both are guaranteed valid UTF-8
assert!(std::str::from_utf8(s.as_bytes()).is_ok());

There is no way to construct a String or &str containing invalid UTF-8 through safe Rust. If you have arbitrary bytes, you must explicitly validate or convert them:

let bytes: &[u8] = b"\xC3\xA9";                 // valid UTF-8 for é
let text = std::str::from_utf8(bytes).unwrap();   // Ok("é")

let bad: &[u8] = b"\xFF\xFE";
let result = std::str::from_utf8(bad);            // Err(Utf8Error)

The char Type: A Unicode Scalar Value

Rust's char is a 32-bit type representing a Unicode scalar value -- any code point except surrogates (U+D800 to U+DFFF):

let c: char = 'é';
println!("{}", c as u32);         // 233  (U+00E9)
println!("{:?}", c);              // 'é'

let emoji: char = '😀';
println!("U+{:04X}", emoji as u32);  // U+1F600

// char is always 4 bytes in memory
assert_eq!(std::mem::size_of::<char>(), 4);

This means char can represent any Unicode code point including supplementary characters. There is no surrogate-pair confusion as in Java or JavaScript.

char Methods

Rust's char has rich built-in methods:

'A'.is_alphabetic()          // true
'7'.is_numeric()             // true
'σ'.is_lowercase()           // true
'Σ'.is_uppercase()           // true
' '.is_whitespace()          // true
'\t'.is_control()            // true (U+0009 is a control character)

// Case conversion returns an iterator (because some conversions
// produce multiple characters, e.g., German ß → SS)
let upper: String = 'ß'.to_uppercase().collect();
assert_eq!(upper, "SS");

'A'.to_lowercase().collect::<String>()   // "a"

The fact that to_uppercase() returns an iterator instead of a single char reflects Rust's commitment to correctness: the German lowercase ß uppercases to the two-character sequence SS, and a single char cannot hold that.

String Length: Bytes vs. Chars vs. Graphemes

Rust offers multiple ways to measure a string, and they give different answers:

let s = "Hello, 世界! 👋";

// Byte length (O(1), always available)
println!("{}", s.len());             // 19 bytes

// Character (scalar value) count (O(n), must iterate)
println!("{}", s.chars().count());   // 12 characters

// For grapheme clusters, use the unicode-segmentation crate
Metric                 Method                Value for "Hello, 世界! 👋"
Bytes                  .len()                19
Chars (scalar values)  .chars().count()      12
Grapheme clusters      UnicodeSegmentation   12

For text containing combining characters, emoji sequences (ZWJ families, flags), or variation selectors, char count and grapheme count diverge significantly.

Iterating Over Strings

Rust provides three built-in iteration methods:

let s = "café";

// Iterate over characters (Unicode scalar values)
for c in s.chars() {
    print!("{} ", c);       // c a f é
}

// Iterate over bytes
for b in s.bytes() {
    print!("{:02X} ", b);   // 63 61 66 C3 A9
}

// Iterate over character indices (byte offsets)
for (i, c) in s.char_indices() {
    println!("byte {} → '{}'", i, c);
}
// byte 0 → 'c'
// byte 1 → 'a'
// byte 2 → 'f'
// byte 3 → 'é'   (byte offset 3, not 4)

String Slicing: The Byte Boundary Rule

You can slice a &str with byte ranges, but the slice must fall on UTF-8 character boundaries or Rust will panic at runtime:

let s = "café";

// OK: slicing at character boundaries
let caf = &s[0..3];       // "caf"  (bytes 0, 1, 2)
let e_accent = &s[3..5];  // "é"    (bytes 3, 4)

// PANIC: byte 4 is in the middle of the é sequence
// let bad = &s[0..4];    // thread panics: "byte index 4 is not a char boundary"

This runtime-checked design prevents the creation of invalid UTF-8 string slices but requires you to know your byte boundaries. Use char_indices() to find safe split points:

fn safe_truncate(s: &str, max_chars: usize) -> &str {
    match s.char_indices().nth(max_chars) {
        Some((byte_idx, _)) => &s[..byte_idx],
        None => s,
    }
}

safe_truncate("café", 3)   // "caf"
safe_truncate("日本語", 2)  // "日本"

Building Strings

// From a literal
let s = String::from("hello");

// Push characters
let mut s = String::new();
s.push('H');
s.push('é');
s.push_str("llo");           // append a &str
assert_eq!(s, "Héllo");

// From code point (integer → char → String)
let c = char::from_u32(0x1F600).unwrap();  // 😀
let s = c.to_string();

// Format macro
let s = format!("Hello {} U+{:04X}", '世', '世' as u32);

Working with Bytes: OsString, CString, and Vec<u8>

Not all text in the real world is valid UTF-8. Rust provides separate types for these cases:

Type                Use Case                      Guarantee
String / &str       General text                  Valid UTF-8
OsString / &OsStr   File paths, OS interfaces     Platform-dependent
CString / &CStr     C interop (null-terminated)   No interior NULs
Vec<u8> / &[u8]     Arbitrary bytes               None

use std::ffi::OsStr;
use std::path::Path;

// File paths may not be valid UTF-8 on Unix
let path = Path::new("/tmp/café");
let os_str: &OsStr = path.as_os_str();

// Lossy conversion (replaces invalid sequences with U+FFFD)
let lossy: String = os_str.to_string_lossy().into_owned();

The unicode-segmentation Crate

The standard library handles code points but not grapheme clusters. The unicode-segmentation crate (maintained by the unicode-rs project) provides UAX #29 grapheme cluster segmentation:

use unicode_segmentation::UnicodeSegmentation;

let s = "e\u{0301}";          // é as e + combining acute
let graphemes: Vec<&str> = s.graphemes(true).collect();
assert_eq!(graphemes, vec!["e\u{0301}"]);  // the slice keeps the original NFD bytes
assert_eq!(s.chars().count(), 2);       // 2 scalar values
assert_eq!(graphemes.len(), 1);         // 1 grapheme cluster

// Family emoji
let family = "👨\u{200D}👩\u{200D}👧\u{200D}👦";
let g: Vec<&str> = family.graphemes(true).collect();
assert_eq!(g.len(), 1);  // 1 visible character

Normalization

The unicode-normalization crate provides NFC, NFD, NFKC, and NFKD:

use unicode_normalization::UnicodeNormalization;

let nfd = "e\u{0301}";                     // NFD
let nfc: String = nfd.nfc().collect();      // NFC
assert_eq!(nfc, "\u{00E9}");

// Check normalization form
use unicode_normalization::is_nfc;
assert!(is_nfc("\u{00E9}"));
assert!(!is_nfc("e\u{0301}"));

Regular Expressions

The regex crate handles Unicode by default:

use regex::Regex;

let re = Regex::new(r"\p{L}+").unwrap();    // Unicode letters
let matches: Vec<&str> = re.find_iter("café 日本語 Москва")
    .map(|m| m.as_str())
    .collect();
// ["café", "日本語", "Москва"]

// Script-specific matching
Regex::new(r"\p{Greek}+").unwrap();
Regex::new(r"\p{Cyrillic}+").unwrap();
Regex::new(r"\p{Han}+").unwrap();

Common Pitfalls

1. Indexing by Integer

Rust does not allow s[0] to return a character. The Index<usize> trait is not implemented for String or &str:

let s = "hello";
// let c = s[0];   // ERROR: String cannot be indexed by usize
let c = s.chars().nth(0);   // Some('h')

This is deliberate: indexing should be O(1), but finding the nth character in a UTF-8 string is O(n).

2. Byte Length Surprises

"A".len()       // 1
"é".len()       // 2
"世".len()      // 3
"😀".len()      // 4

3. Case Conversion Produces Multiple Characters

// ß uppercases to SS (two characters)
let upper: String = "straße".to_uppercase();
assert_eq!(upper, "STRASSE");
assert_eq!("straße".chars().count(), 6);   // 6 characters
assert_eq!(upper.chars().count(), 7);      // 7 characters (one more)

Quick Reference

Task                  Code
Byte length           s.len()
Char count            s.chars().count()
Iterate chars         for c in s.chars() { }
Iterate with indices  for (i, c) in s.char_indices() { }
Safe truncate         s.char_indices().nth(n).map(|(i, _)| &s[..i])
Char to code point    c as u32
Code point to char    char::from_u32(0x1F600)
Normalize NFC         s.nfc().collect::<String>()
Grapheme clusters     s.graphemes(true).collect::<Vec<&str>>()
Regex Unicode         Regex::new(r"\p{L}+")
Lossy conversion      String::from_utf8_lossy(bytes)

Rust's type system turns Unicode correctness from a discipline into a guarantee. The compiler ensures that String and &str are always valid UTF-8, char is always a valid scalar value, and slicing always respects character boundaries (or panics). The trade-off is that you need external crates for normalization and grapheme segmentation -- but those crates are mature, well-maintained, and widely used throughout the Rust ecosystem.
