💻 Unicode in Code

Unicode in Rust

Rust's str and String types are guaranteed to be valid UTF-8, making it one of the safest languages for Unicode text handling at the type level. This guide explains Rust's Unicode guarantees, how to iterate over characters and bytes, and the unicode-segmentation crate for grapheme cluster support.


Rust takes an uncompromising stance on Unicode: the String type and &str slices are guaranteed to contain valid UTF-8 at all times. The compiler and standard library enforce this invariant: a string literal containing invalid UTF-8 is rejected at compile time, and constructing a String from invalid bytes at runtime returns an error -- invalid data is never silently accepted. This makes Rust one of the safest languages for Unicode text processing, but it also means that string operations behave differently from languages like C or Python.

Strings Are Valid UTF-8

Rust has two primary string types:

Type     Ownership                           Guarantee
String   Owned, heap-allocated, growable     Valid UTF-8
&str     Borrowed reference (string slice)   Valid UTF-8

let s: String = String::from("Hello, 世界");
let slice: &str = "Hello, 世界";     // string literal → &str

// Both are guaranteed valid UTF-8
assert!(std::str::from_utf8(s.as_bytes()).is_ok());

There is no way to construct a String or &str containing invalid UTF-8 through safe Rust. If you have arbitrary bytes, you must explicitly validate or convert them:

let bytes: &[u8] = b"\xC3\xA9";                 // valid UTF-8 for é
let text = std::str::from_utf8(bytes).unwrap();   // Ok("é")

let bad: &[u8] = b"\xFF\xFE";
let result = std::str::from_utf8(bad);            // Err(Utf8Error)

The char Type: A Unicode Scalar Value

Rust's char is a 32-bit type representing a Unicode scalar value -- any code point except surrogates (U+D800 to U+DFFF):

let c: char = 'é';
println!("{}", c as u32);         // 233  (U+00E9)
println!("{:?}", c);              // 'é'

let emoji: char = '😀';
println!("U+{:04X}", emoji as u32);  // U+1F600

// char is always 4 bytes in memory
assert_eq!(std::mem::size_of::<char>(), 4);

This means char can represent any Unicode code point including supplementary characters. There is no surrogate-pair confusion as in Java or JavaScript.

char Methods

Rust's char has rich built-in methods:

'A'.is_alphabetic()          // true
'7'.is_numeric()             // true
'σ'.is_lowercase()           // true
'Σ'.is_uppercase()           // true
' '.is_whitespace()          // true
'\t'.is_control()            // true (U+0009 is a control character)

// Case conversion returns an iterator (because some conversions
// produce multiple characters, e.g., German ß → SS)
let upper: String = 'ß'.to_uppercase().collect();
assert_eq!(upper, "SS");

'A'.to_lowercase().collect::<String>()   // "a"

The fact that to_uppercase() returns an iterator instead of a single char reflects Rust's commitment to correctness: the German lowercase ß uppercases to the two-character sequence SS, and a single char cannot hold that.

String Length: Bytes vs. Chars vs. Graphemes

Rust offers multiple ways to measure a string, and they give different answers:

let s = "Hello, 世界! 👋";

// Byte length (O(1), always available)
println!("{}", s.len());             // 19 bytes

// Character (scalar value) count (O(n), must iterate)
println!("{}", s.chars().count());   // 12 characters

// For grapheme clusters, use the unicode-segmentation crate
Metric                 Method                Value for "Hello, 世界! 👋"
Bytes                  .len()                19
Chars (scalar values)  .chars().count()      12
Grapheme clusters      UnicodeSegmentation   12

For text containing combining characters, emoji sequences (ZWJ families, flags), or variation selectors, char count and grapheme count diverge significantly.

Iterating Over Strings

Rust provides three built-in iteration methods:

let s = "café";

// Iterate over characters (Unicode scalar values)
for c in s.chars() {
    print!("{} ", c);       // c a f é
}

// Iterate over bytes
for b in s.bytes() {
    print!("{:02X} ", b);   // 63 61 66 C3 A9
}

// Iterate over character indices (byte offsets)
for (i, c) in s.char_indices() {
    println!("byte {} → '{}'", i, c);
}
// byte 0 → 'c'
// byte 1 → 'a'
// byte 2 → 'f'
// byte 3 → 'é'   (byte offset 3, not 4)

String Slicing: The Byte Boundary Rule

You can slice a &str with byte ranges, but the slice must fall on UTF-8 character boundaries or Rust will panic at runtime:

let s = "café";

// OK: slicing at character boundaries
let caf = &s[0..3];       // "caf"  (bytes 0, 1, 2)
let e_accent = &s[3..5];  // "é"    (bytes 3, 4)

// PANIC: byte 4 is in the middle of the é sequence
// let bad = &s[0..4];    // thread panics: "byte index 4 is not a char boundary"

This runtime-checked design prevents the creation of invalid UTF-8 string slices but requires you to know your byte boundaries. Use char_indices() to find safe split points:

fn safe_truncate(s: &str, max_chars: usize) -> &str {
    match s.char_indices().nth(max_chars) {
        Some((byte_idx, _)) => &s[..byte_idx],
        None => s,
    }
}

safe_truncate("café", 3)   // "caf"
safe_truncate("日本語", 2)  // "日本"

Building Strings

// From a literal
let s = String::from("hello");

// Push characters
let mut s = String::new();
s.push('H');
s.push('é');
s.push_str("llo");           // append a &str
assert_eq!(s, "Héllo");

// From code point (integer → char → String)
let c = char::from_u32(0x1F600).unwrap();  // 😀
let s = c.to_string();

// Format macro
let s = format!("Hello {} U+{:04X}", '世', '世' as u32);

Working with Bytes: OsString, CString, and Vec<u8>

Not all text in the real world is valid UTF-8. Rust provides separate types for these cases:

Type                Use Case                      Guarantee
String / &str       General text                  Valid UTF-8
OsString / &OsStr   File paths, OS interfaces     Platform-dependent
CString / &CStr     C interop (null-terminated)   No interior NULs
Vec<u8> / &[u8]     Arbitrary bytes               None

use std::ffi::OsStr;
use std::path::Path;

// File paths may not be valid UTF-8 on Unix
let path = Path::new("/tmp/café");
let os_str: &OsStr = path.as_os_str();

// Lossy conversion (replaces invalid sequences with U+FFFD)
let lossy: String = os_str.to_string_lossy().into_owned();

The unicode-segmentation Crate

The standard library handles code points but not grapheme clusters. The unicode-segmentation crate (maintained by the unicode-rs project) provides UAX #29 grapheme cluster segmentation:

use unicode_segmentation::UnicodeSegmentation;

let s = "e\u{0301}";          // é as e + combining acute
let graphemes: Vec<&str> = s.graphemes(true).collect();
assert_eq!(graphemes, vec!["e\u{0301}"]);  // the slice keeps the original NFD bytes
assert_eq!(s.chars().count(), 2);       // 2 scalar values
assert_eq!(graphemes.len(), 1);         // 1 grapheme cluster

// Family emoji
let family = "👨\u{200D}👩\u{200D}👧\u{200D}👦";
let g: Vec<&str> = family.graphemes(true).collect();
assert_eq!(g.len(), 1);  // 1 visible character

Normalization

The unicode-normalization crate provides NFC, NFD, NFKC, and NFKD:

use unicode_normalization::UnicodeNormalization;

let nfd = "e\u{0301}";                     // NFD
let nfc: String = nfd.nfc().collect();      // NFC
assert_eq!(nfc, "\u{00E9}");

// Check normalization form
use unicode_normalization::is_nfc;
assert!(is_nfc("\u{00E9}"));
assert!(!is_nfc("e\u{0301}"));

Regular Expressions

The regex crate handles Unicode by default:

use regex::Regex;

let re = Regex::new(r"\p{L}+").unwrap();    // Unicode letters
let matches: Vec<&str> = re.find_iter("café 日本語 Москва")
    .map(|m| m.as_str())
    .collect();
// ["café", "日本語", "Москва"]

// Script-specific matching
Regex::new(r"\p{Greek}+").unwrap();
Regex::new(r"\p{Cyrillic}+").unwrap();
Regex::new(r"\p{Han}+").unwrap();

Common Pitfalls

1. Indexing by Integer

Rust does not allow s[0] to return a character. The Index<usize> trait is not implemented for String or &str:

let s = "hello";
// let c = s[0];   // ERROR: String cannot be indexed by usize
let c = s.chars().nth(0);   // Some('h')

This is deliberate: indexing should be O(1), but finding the nth character in a UTF-8 string is O(n).

2. Byte Length Surprises

"A".len()       // 1
"é".len()       // 2
"世".len()      // 3
"😀".len()      // 4

3. Case Conversion Produces Multiple Characters

// ß uppercases to SS (two characters)
let upper: String = "straße".to_uppercase();
assert_eq!(upper, "STRASSE");
assert_eq!("straße".chars().count(), 6);   // 6 characters
assert_eq!(upper.chars().count(), 7);      // 7 characters (one more)

Quick Reference

Task                  Code
Byte length           s.len()
Char count            s.chars().count()
Iterate chars         for c in s.chars() { }
Iterate with indices  for (i, c) in s.char_indices() { }
Safe truncate         s.char_indices().nth(n).map(|(i, _)| &s[..i])
Char to code point    c as u32
Code point to char    char::from_u32(0x1F600)
Normalize NFC         s.nfc().collect::<String>()
Grapheme clusters     s.graphemes(true).collect::<Vec<&str>>()
Regex Unicode         Regex::new(r"\p{L}+")
Lossy conversion      String::from_utf8_lossy(bytes)

Rust's type system turns Unicode correctness from a discipline into a guarantee. The compiler ensures that String and &str are always valid UTF-8, char is always a valid scalar value, and slicing always respects character boundaries (or panics). The trade-off is that you need external crates for normalization and grapheme segmentation -- but those crates are mature, well-maintained, and widely used throughout the Rust ecosystem.
