What is Karakter dizisi?

Bir programlama dilinde karakter dizisi. Dahili gösterim değişir: UTF-8 (Go, Rust, yeni Python sürümleri), UTF-16 (Java, JavaScript, C#) veya UTF-32 (Python).

Karakter başına 1–4 byte kullanan değişken uzunluklu Unicode kodlaması. Web'in baskın kodlaması (%98+ web sitesi) ve tam ASCII geriye uyumluluğu sağlar.

What is Unicode skaler değeri?

Vekil kod noktaları (U+D800–U+DFFF) hariç herhangi bir kod noktası. Gerçek karakterleri temsil edebilen geçerli değerler kümesi, toplam 1.112.064 değer.

What is Grafem kümesi?

Kullanıcının algıladığı 'karakter' — tek bir birim gibi hissettiren öğe. Birden fazla kod noktasından oluşabilir (taban + birleştirme işaretleri veya emoji ZWJ dizileri). 👩💻 = 3 kod noktası, 1 grafem.

Programlama ve Geliştirme

Rust Unicode

Rust strings (str/String) are guaranteed valid UTF-8. char type represents a Unicode scalar value (4 bytes). Iteration via .chars() yields code points, .graphemes() requires the unicode-segmentation crate.

What is Rust Unicode Handling?

Rust takes a principled stance on Unicode: all str and String values are guaranteed to be valid UTF-8 at all times, enforced by the type system and runtime checks. This eliminates the class of bugs where a program inadvertently operates on text that is not valid Unicode. The tradeoff is that indexing into a string by position requires care — a byte index is not the same as a code point index or a grapheme cluster index.

str and String as Valid UTF-8

str is Rust's primitive string type — a slice of UTF-8-encoded bytes. String is the owned, heap-allocated version. Neither can contain invalid UTF-8:

let s: &str = "héllo";        // valid UTF-8, always
let owned: String = String::from("世界");  // valid UTF-8, always

// Compile error: cannot index by arbitrary byte position
// let c = s[1];  // ERROR: cannot index into a str

// Correct: get a character by byte offset (only at char boundaries)
let c = &s[2..4];  // byte slice — only safe at known boundaries

char as a Unicode Scalar Value

Rust's char type holds exactly one Unicode scalar value — any code point from U+0000 to U+10FFFF, excluding surrogate code points (U+D800 to U+DFFF). This is 21 bits, stored in 4 bytes (u32 internally):

let heart: char = '\u{2764}';  // ❤ HEAVY BLACK HEART
let emoji: char = '\u{1F600}'; // 😀 GRINNING FACE — valid char in Rust

heart as u32  // 10084

.chars() vs .bytes() vs .graphemes()

Rust's str exposes three levels of iteration:

let s = "café";

// bytes() — raw UTF-8 byte values
s.bytes().count();   // 5 (c=1, a=1, f=1, é=2 bytes)

// chars() — Unicode scalar values (code points)
s.chars().count();   // 4 (c, a, f, é)

// graphemes() — user-perceived characters (requires unicode-segmentation crate)
use unicode_segmentation::UnicodeSegmentation;
s.graphemes(true).count();  // 4 (same here, but differs for combined sequences)

let family = "👨‍👩‍👧";
family.chars().count();      // 5 code points (3 people + 2 ZWJ)
family.graphemes(true).count(); // 1 grapheme cluster

The unicode-segmentation crate implements UAX#29 grapheme cluster boundaries, word boundaries, and sentence boundaries. Add it to Cargo.toml:

[dependencies]
unicode-segmentation = "1.10"

Pattern Matching on chars

Rust's string methods accept closures on char for flexible searching:

let s = "hello, 世界!";

s.contains(|c: char| c.is_alphabetic());   // true
s.chars().filter(|c| c.is_uppercase()).count();  // 0

// Unicode-aware char classification
'\u{0041}'.is_uppercase()  // true (A)
'\u{0627}'.is_alphabetic() // true (Arabic Alef)
'\u{0661}'.is_numeric()    // true (Arabic-Indic digit one)

Quick Facts

Feature	Detail
`str` / `String` encoding	Valid UTF-8, enforced at creation
`char` type	Unicode scalar value (not code unit), 4 bytes
Surrogates	Not valid `char` values (compile error)
Byte iteration	`.bytes()` — raw u8 values
Code point iteration	`.chars()` — Unicode scalar values
Grapheme iteration	`.graphemes()` — from `unicode-segmentation` crate
UAX#29 implementation	`unicode-segmentation` crate
Unicode properties	`unicode-properties` crate, or `char` methods
Normalization	`unicode-normalization` crate (NFC, NFD, NFKC, NFKD)

İlgili Terimler

Karakter dizisi UTF-8 Unicode skaler değeri Grafem kümesi

Programlama ve Geliştirme içinde daha fazlası

Boş karakter

U+0000 (NUL). İlk Unicode/ASCII karakteri, C/C++'da dizi sonlandırıcı olarak kullanılır. Güvenlik riski: …

Dize uzunluğu belirsizliği

Bir Unicode dizisinin 'uzunluğu' birime bağlıdır: kod birimleri (JavaScript .length), kod noktaları …

Görünmez karakter

Görünür glifi olmayan herhangi bir karakter: boşluk, sıfır genişlikli karakterler, kontrol karakterleri …

Java Unicode

Java strings use UTF-16 internally. char is 16-bit (only BMP). For supplementary …

Karakter dizisi

Bir programlama dilinde karakter dizisi. Dahili gösterim değişir: UTF-8 (Go, Rust, yeni …

Kodlama / Kod çözme

Kodlama karakterleri baytlara dönüştürür (str.encode('utf-8')); kod çözme baytları karakterlere dönüştürür (bytes.decode('utf-8')). Bunu …

Mojibake

Baytların yanlış kodlama ile çözülmesinden kaynaklanan bozuk metin. Japonca terim (文字化け). Örnek: …

Python Unicode

Python 3 uses Unicode strings by default (str = UTF-8 internally via …

Unicode düzenli ifade

Unicode özelliklerini kullanan regex desenleri: \p{L} (herhangi bir harf), \p{Script=Greek} (Yunan alfabesi), …

Unicode kaçış dizisi

Kaynak kodda Unicode karakterlerini temsil etme sözdizimi. Dile göre değişir: \u2713 (Python/Java/JS), …

← Sözlüğe Geri Dön