Unicode in Rust
Rust's str and String types are guaranteed to be valid UTF-8, making it one of the safest languages for Unicode text handling at the type level. This guide explains Rust's Unicode guarantees, how to iterate over characters and bytes, and the unicode-segmentation crate for grapheme cluster support.
Rust takes an uncompromising stance on Unicode: the String type and &str
slices are guaranteed to contain valid UTF-8 at all times. The type system and
standard library enforce this invariant: every safe conversion from raw bytes is
checked, and invalid input produces an error (or a panic) -- it is never
silently accepted. This makes Rust one of
the safest languages for Unicode text processing, but it also means that string
operations behave differently from languages like C or Python.
Strings Are Valid UTF-8
Rust has two primary string types:
| Type | Ownership | Guarantee |
|---|---|---|
| String | Owned, heap-allocated, growable | Valid UTF-8 |
| &str | Borrowed reference (string slice) | Valid UTF-8 |
let s: String = String::from("Hello, 世界");
let slice: &str = "Hello, 世界"; // string literal → &str
// Both are guaranteed valid UTF-8
assert!(std::str::from_utf8(s.as_bytes()).is_ok());
There is no way to construct a String or &str containing invalid UTF-8
through safe Rust. If you have arbitrary bytes, you must explicitly validate
or convert them:
let bytes: &[u8] = b"\xC3\xA9"; // valid UTF-8 for é
let text = std::str::from_utf8(bytes).unwrap(); // Ok("é")
let bad: &[u8] = b"\xFF\xFE";
let result = std::str::from_utf8(bad); // Err(Utf8Error)
The char Type: A Unicode Scalar Value
Rust's char is a 32-bit type representing a Unicode scalar value -- any
code point except surrogates (U+D800 to U+DFFF):
let c: char = 'é';
println!("{}", c as u32); // 233 (U+00E9)
println!("{:?}", c); // 'é'
let emoji: char = '😀';
println!("U+{:04X}", emoji as u32); // U+1F600
// char is always 4 bytes in memory
assert_eq!(std::mem::size_of::<char>(), 4);
This means char can represent any Unicode code point including supplementary
characters. There is no surrogate-pair confusion as in Java or JavaScript.
char Methods
Rust's char has rich built-in methods:
'A'.is_alphabetic() // true
'7'.is_numeric() // true
'σ'.is_lowercase() // true
'Σ'.is_uppercase() // true
' '.is_whitespace() // true
'\t'.is_control() // true (U+0009 is a control character)
// Case conversion returns an iterator (because some conversions
// produce multiple characters, e.g., German ß → SS)
let upper: String = 'ß'.to_uppercase().collect();
assert_eq!(upper, "SS");
'A'.to_lowercase().collect::<String>() // "a"
The fact that to_uppercase() returns an iterator instead of a single char
reflects Rust's commitment to correctness: the German lowercase ß uppercases
to the two-character sequence SS, and a single char cannot hold that.
String Length: Bytes vs. Chars vs. Graphemes
Rust offers multiple ways to measure a string, and they give different answers:
let s = "Hello, 世界! 👋";
// Byte length (O(1), always available)
println!("{}", s.len()); // 19 bytes
// Character (scalar value) count (O(n), must iterate)
println!("{}", s.chars().count()); // 12 characters
// For grapheme clusters, use the unicode-segmentation crate
| Metric | Method | Value for "Hello, 世界! 👋" |
|---|---|---|
| Bytes | .len() | 19 |
| Chars (scalar values) | .chars().count() | 12 |
| Grapheme clusters | UnicodeSegmentation | 12 |
For text containing combining characters, emoji sequences (ZWJ families, flags), or variation selectors, char count and grapheme count diverge significantly.
Iterating Over Strings
Rust provides three built-in iteration methods:
let s = "café";
// Iterate over characters (Unicode scalar values)
for c in s.chars() {
print!("{} ", c); // c a f é
}
// Iterate over bytes
for b in s.bytes() {
print!("{:02X} ", b); // 63 61 66 C3 A9
}
// Iterate over character indices (byte offsets)
for (i, c) in s.char_indices() {
println!("byte {} → '{}'", i, c);
}
// byte 0 → 'c'
// byte 1 → 'a'
// byte 2 → 'f'
// byte 3 → 'é' (byte offset 3, not 4)
String Slicing: The Byte Boundary Rule
You can slice a &str with byte ranges, but the slice must fall on
UTF-8 character boundaries or Rust will panic at runtime:
let s = "café";
// OK: slicing at character boundaries
let caf = &s[0..3]; // "caf" (bytes 0, 1, 2)
let e_accent = &s[3..5]; // "é" (bytes 3, 4)
// PANIC: byte 4 is in the middle of the é sequence
// let bad = &s[0..4]; // thread panics: "byte index 4 is not a char boundary"
This runtime-checked design prevents the creation of invalid UTF-8 slices,
but it requires you to know your byte boundaries. Use
char_indices() to find safe split points:
fn safe_truncate(s: &str, max_chars: usize) -> &str {
match s.char_indices().nth(max_chars) {
Some((byte_idx, _)) => &s[..byte_idx],
None => s,
}
}
safe_truncate("café", 3) // "caf"
safe_truncate("日本語", 2) // "日本"
Building Strings
// From a literal
let s = String::from("hello");
// Push characters
let mut s = String::new();
s.push('H');
s.push('é');
s.push_str("llo"); // append a &str
assert_eq!(s, "Héllo");
// From code point (integer → char → String)
let c = char::from_u32(0x1F600).unwrap(); // 😀
let s = c.to_string();
// Format macro
let s = format!("Hello {} U+{:04X}", '世', '世' as u32);
Working with Bytes: OsString, CString, and Vec<u8>
Not all text in the real world is valid UTF-8. Rust provides separate types for these cases:
| Type | Use Case | Guarantee |
|---|---|---|
| String / &str | General text | Valid UTF-8 |
| OsString / &OsStr | File paths, OS interfaces | Platform-dependent |
| CString / &CStr | C interop (NUL-terminated) | No interior NULs |
| Vec<u8> / &[u8] | Arbitrary bytes | None |
use std::ffi::OsStr;
use std::path::Path;
// File paths may not be valid UTF-8 on Unix
let path = Path::new("/tmp/café");
let os_str: &OsStr = path.as_os_str();
// Lossy conversion (replaces invalid sequences with U+FFFD)
let lossy: String = os_str.to_string_lossy().into_owned();
The unicode-segmentation Crate
The standard library handles code points but not grapheme clusters. The
unicode-segmentation crate (maintained by the unicode-rs project) provides
UAX #29 grapheme cluster segmentation:
use unicode_segmentation::UnicodeSegmentation;
let s = "e\u{0301}"; // é as e + combining acute
let graphemes: Vec<&str> = s.graphemes(true).collect();
assert_eq!(graphemes, vec!["e\u{0301}"]); // one grapheme; the slice borrows the original (decomposed) bytes
assert_eq!(s.chars().count(), 2); // 2 scalar values
assert_eq!(graphemes.len(), 1); // 1 grapheme cluster
// Family emoji
let family = "👨\u{200D}👩\u{200D}👧\u{200D}👦";
let g: Vec<&str> = family.graphemes(true).collect();
assert_eq!(g.len(), 1); // 1 visible character
Normalization
The unicode-normalization crate provides NFC, NFD, NFKC, and NFKD:
use unicode_normalization::UnicodeNormalization;
let nfd = "e\u{0301}"; // NFD
let nfc: String = nfd.nfc().collect(); // NFC
assert_eq!(nfc, "\u{00E9}");
// Check normalization form
use unicode_normalization::is_nfc;
assert!(is_nfc("\u{00E9}"));
assert!(!is_nfc("e\u{0301}"));
Regular Expressions
The regex crate handles Unicode by default:
use regex::Regex;
let re = Regex::new(r"\p{L}+").unwrap(); // Unicode letters
let matches: Vec<&str> = re.find_iter("café 日本語 Москва")
.map(|m| m.as_str())
.collect();
// ["café", "日本語", "Москва"]
// Script-specific matching
Regex::new(r"\p{Greek}+").unwrap();
Regex::new(r"\p{Cyrillic}+").unwrap();
Regex::new(r"\p{Han}+").unwrap();
Common Pitfalls
1. Indexing by Integer
Rust does not allow s[0] to return a character. The Index<usize> trait is
not implemented for String or &str:
let s = "hello";
// let c = s[0]; // ERROR: String cannot be indexed by usize
let c = s.chars().nth(0); // Some('h')
This is deliberate: indexing should be O(1), but finding the nth character in a UTF-8 string is O(n).
2. Byte Length Surprises
"A".len() // 1
"é".len() // 2
"世".len() // 3
"😀".len() // 4
3. Case Conversion Produces Multiple Characters
// ß uppercases to SS (two characters)
let upper: String = "straße".to_uppercase();
assert_eq!(upper, "STRASSE");
assert_eq!("straße".chars().count(), 6); // 6 characters
assert_eq!(upper.chars().count(), 7); // 7 characters (ß became SS)
Quick Reference
| Task | Code |
|---|---|
| Byte length | s.len() |
| Char count | s.chars().count() |
| Iterate chars | for c in s.chars() { } |
| Iterate with indices | for (i, c) in s.char_indices() { } |
| Safe truncate | s.char_indices().nth(n).map(|(i,_)| &s[..i]) |
| Char to code point | c as u32 |
| Code point to char | char::from_u32(0x1F600) |
| Normalize NFC | s.nfc().collect::<String>() |
| Grapheme clusters | s.graphemes(true).collect::<Vec<&str>>() |
| Regex Unicode | Regex::new(r"\p{L}+") |
| Lossy conversion | String::from_utf8_lossy(bytes) |
Rust's type system turns Unicode correctness from a discipline into a guarantee.
The compiler ensures that String and &str are always valid UTF-8, char is
always a valid scalar value, and slicing always respects character boundaries (or
panics). The trade-off is that you need external crates for normalization and
grapheme segmentation -- but those crates are mature, well-maintained, and widely
used throughout the Rust ecosystem.