Rust Unicode
Rust strings (str and String) are guaranteed to be valid UTF-8. The char type represents a single Unicode scalar value and occupies 4 bytes. Iterating with .chars() yields scalar values; iterating with .graphemes() requires the unicode-segmentation crate.
What is Rust Unicode Handling?
Rust takes a principled stance on Unicode: every str and String value is guaranteed to be valid UTF-8. Safe constructors such as String::from_utf8 validate their input at runtime, and skipping that validation requires unsafe code. This eliminates the class of bugs where a program inadvertently operates on text that is not valid Unicode. The tradeoff is that indexing into a string by position requires care: a byte index is not the same as a code point index or a grapheme cluster index.
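As a minimal sketch of where that validation happens, the example below builds strings from raw bytes using the standard library's String::from_utf8 and String::from_utf8_lossy. The helper name decode_or_replace is hypothetical and exists only for illustration.

```rust
/// Hypothetical helper: decode bytes as UTF-8, replacing any invalid
/// sequence with U+FFFD (REPLACEMENT CHARACTER) instead of failing.
fn decode_or_replace(bytes: &[u8]) -> String {
    String::from_utf8_lossy(bytes).into_owned()
}

fn main() {
    // 0xC3 0xA9 is the UTF-8 encoding of U+00E9 (é).
    let good = vec![0xC3, 0xA9];
    assert_eq!(String::from_utf8(good).unwrap(), "\u{E9}");

    // 0xFF never occurs in well-formed UTF-8, so strict decoding fails.
    let bad = vec![b'h', b'i', 0xFF];
    assert!(String::from_utf8(bad.clone()).is_err());

    // Lossy decoding keeps the valid prefix and substitutes U+FFFD.
    assert_eq!(decode_or_replace(&bad), "hi\u{FFFD}");
}
```

Strict decoding suits input that must be rejected when malformed; the lossy variant suits logs and diagnostics where some output is better than none.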
str and String as Valid UTF-8
str is Rust's primitive string type — a slice of UTF-8-encoded bytes. String is the owned, heap-allocated version. Neither can contain invalid UTF-8:
let s: &str = "héllo"; // valid UTF-8, always
let owned: String = String::from("世界"); // valid UTF-8, always
// Compile error: str cannot be indexed by a single integer
// let c = s[1]; // ERROR: the type `str` cannot be indexed by `{integer}`
// Slicing by byte range works, but only at char boundaries
let c = &s[1..3]; // "é" occupies bytes 1 and 2; a range that splits it, such as 2..4, panics at runtime
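When byte offsets are not known in advance, a non-panicking approach may be safer. The sketch below uses only standard library methods (str::get, str::is_char_boundary, str::char_indices); the helper names slice_checked and nth_char are hypothetical.

```rust
/// Hypothetical helper: returns the substring for a byte range, or None
/// if either end is out of bounds or falls inside a multi-byte character.
fn slice_checked(s: &str, start: usize, end: usize) -> Option<&str> {
    s.get(start..end)
}

/// Hypothetical helper: returns the n-th scalar value (O(n), not O(1)).
fn nth_char(s: &str, n: usize) -> Option<char> {
    s.chars().nth(n)
}

fn main() {
    // é is written as an escape so it is certainly the precomposed U+00E9.
    let s = "h\u{E9}llo";

    assert_eq!(slice_checked(s, 1, 3), Some("\u{E9}"));
    assert_eq!(slice_checked(s, 0, 2), None); // byte 2 is inside é

    assert!(s.is_char_boundary(1));
    assert!(!s.is_char_boundary(2));

    assert_eq!(nth_char(s, 1), Some('\u{E9}'));

    // char_indices pairs every scalar value with its starting byte offset.
    let offsets: Vec<usize> = s.char_indices().map(|(i, _)| i).collect();
    assert_eq!(offsets, vec![0, 1, 3, 4, 5]);
}
```

Offsets produced by char_indices are always valid slice boundaries, so they can be fed back into slicing without risk of a panic.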
char as a Unicode Scalar Value
Rust's char type holds exactly one Unicode scalar value — any code point from U+0000 to U+10FFFF, excluding surrogate code points (U+D800 to U+DFFF). This is 21 bits, stored in 4 bytes (u32 internally):
let heart: char = '\u{2764}'; // ❤ HEAVY BLACK HEART
let emoji: char = '\u{1F600}'; // 😀 GRINNING FACE — valid char in Rust
heart as u32 // 10084
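The properties described above can be checked directly. This is a small sketch using standard library calls only (std::mem::size_of, char::from_u32, char::len_utf8, char::len_utf16); the helper name describe is hypothetical.

```rust
use std::mem::size_of;

/// Hypothetical helper: (scalar value, UTF-8 length in bytes,
/// UTF-16 length in code units) for one char.
fn describe(c: char) -> (u32, usize, usize) {
    (c as u32, c.len_utf8(), c.len_utf16())
}

fn main() {
    // A char always occupies 4 bytes in memory, whatever its value.
    assert_eq!(size_of::<char>(), 4);

    // Encoded length depends on the value: U+2764 is in the BMP,
    // U+1F600 is a supplementary character.
    assert_eq!(describe('\u{2764}'), (0x2764, 3, 1));
    assert_eq!(describe('\u{1F600}'), (0x1F600, 4, 2));

    // Surrogates and values above U+10FFFF are not scalar values,
    // so the checked conversion from u32 rejects them.
    assert_eq!(char::from_u32(0xD800), None);
    assert_eq!(char::from_u32(0x11_0000), None);
    assert_eq!(char::from_u32(0x41), Some('A'));
}
```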
.chars() vs .bytes() vs .graphemes()
Rust text can be iterated at three levels. The first two are built into str; the third comes from an external crate:
let s = "café";
// bytes() — raw UTF-8 byte values
s.bytes().count(); // 5 (c=1, a=1, f=1, é=2 bytes)
// chars() — Unicode scalar values (code points)
s.chars().count(); // 4 (c, a, f, é)
// graphemes() — user-perceived characters (requires unicode-segmentation crate)
use unicode_segmentation::UnicodeSegmentation;
s.graphemes(true).count(); // 4 (same here, but differs for combining sequences)
let family = "👨\u{200D}👩\u{200D}👧"; // ZWJ (U+200D) written as an escape so it survives copy and paste
family.chars().count(); // 5 code points (3 people + 2 ZWJ)
family.graphemes(true).count(); // 1 grapheme cluster
The unicode-segmentation crate implements UAX#29 grapheme cluster boundaries, word boundaries, and sentence boundaries. Add it to Cargo.toml:
[dependencies]
unicode-segmentation = "1.10"
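The byte and scalar-value levels need no external crate. The sketch below compares them for precomposed and decomposed forms of the same text; the helper name counts is hypothetical, and grapheme counting is left out because it depends on unicode-segmentation.

```rust
/// Hypothetical helper: (length in UTF-8 bytes, number of scalar values).
fn counts(s: &str) -> (usize, usize) {
    (s.len(), s.chars().count())
}

fn main() {
    // "café" with a precomposed é (U+00E9).
    let precomposed = "caf\u{E9}";
    // "café" as e followed by COMBINING ACUTE ACCENT (U+0301).
    let decomposed = "cafe\u{301}";

    assert_eq!(counts(precomposed), (5, 4));
    assert_eq!(counts(decomposed), (6, 5));

    // The two forms look identical but are different strings
    // until they are normalized.
    assert_ne!(precomposed, decomposed);

    // Man, ZWJ, woman, ZWJ, girl: 3 * 4 bytes + 2 * 3 bytes.
    let family = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}";
    assert_eq!(counts(family), (18, 5));
}
```

Neither count matches what a reader perceives as one character in the decomposed or ZWJ cases, which is the gap that grapheme segmentation fills.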
Pattern Matching on chars
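Many str methods (contains, find, split, trim_matches, and others) accept a pattern, which can be a char, a &str, or a closure that takes a char and returns bool. Closures make Unicode-aware searching straightforward: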
let s = "hello, 世界!";
s.contains(|c: char| c.is_alphabetic()); // true
s.chars().filter(|c| c.is_uppercase()).count(); // 0
// Unicode-aware char classification
'\u{0041}'.is_uppercase() // true (A)
'\u{0627}'.is_alphabetic() // true (Arabic Alef)
'\u{0661}'.is_numeric() // true (Arabic-Indic digit one)
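Closure patterns work with searching, splitting, and trimming alike. The following sketch uses only standard library methods; the helper names first_non_ascii and alphabetic_words are hypothetical.

```rust
/// Hypothetical helper: byte offset and value of the first non-ASCII char.
fn first_non_ascii(s: &str) -> Option<(usize, char)> {
    s.char_indices().find(|(_, c)| !c.is_ascii())
}

/// Hypothetical helper: splits on every non-alphabetic char and
/// drops the empty pieces left between adjacent separators.
fn alphabetic_words(s: &str) -> Vec<&str> {
    s.split(|c: char| !c.is_alphabetic())
        .filter(|w| !w.is_empty())
        .collect()
}

fn main() {
    let s = "hello, 世界!";

    // Offsets returned by find and char_indices are byte offsets.
    assert_eq!(first_non_ascii(s), Some((7, '世')));
    assert_eq!(s.find(|c: char| c.is_ascii_punctuation()), Some(5));

    assert_eq!(alphabetic_words(s), vec!["hello", "世界"]);

    // Trim trailing chars for as long as the closure returns true.
    assert_eq!(s.trim_end_matches(|c: char| !c.is_alphabetic()), "hello, 世界");
}
```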
Quick Facts
| Feature | Detail |
|---|---|
| str / String encoding | Valid UTF-8, enforced at creation |
| char type | Unicode scalar value (not code unit), 4 bytes |
| Surrogates | Not valid char values (compile error in a literal; char::from_u32 returns None) |
| Byte iteration | .bytes() — raw u8 values |
| Code point iteration | .chars() — Unicode scalar values |
| Grapheme iteration | .graphemes() — from unicode-segmentation crate |
| UAX#29 implementation | unicode-segmentation crate |
| Unicode properties | unicode-properties crate, or char methods |
| Normalization | unicode-normalization crate (NFC, NFD, NFKC, NFKD) |