Unicode in Swift
Swift's String type is designed with Unicode correctness as a first-class concern, representing characters as extended grapheme clusters rather than code units. This guide explains Swift's Character and String types, their views (UTF-8, UTF-16, Unicode scalars), and how to work with emoji and complex characters.
Swift is arguably the most Unicode-correct mainstream programming language. Its
String type is built from the ground up around Unicode — not as an afterthought
or extension, but as a core design principle. Where other languages expose code
units or code points and leave developers to handle the complexity, Swift's
String operates on extended grapheme clusters by default, matching what
humans perceive as characters. This guide explains Swift's Unicode model, its
multiple string views, and why it counts characters differently from most other
mainstream languages.
Extended Grapheme Clusters
The defining feature of Swift strings is that each Character value represents
an extended grapheme cluster — a sequence of one or more Unicode scalars
(code points) that together produce a single human-perceived character.
let cafe = "café"
print(cafe.count) // 4 — "é" counts as one Character
let flag = "🇺🇸"
print(flag.count) // 1 — the flag is one grapheme cluster
print(Array(flag.unicodeScalars).count) // 2 — two regional indicators
let family = "👨\u{200D}👩\u{200D}👧\u{200D}👦" // people joined by zero-width joiners (ZWJ)
print(family.count) // 1 — one grapheme cluster
print(Array(family.unicodeScalars).count) // 7 — 4 people + 3 ZWJ scalars
This design means String.count gives the result users intuitively expect: one
flag emoji is one character, one family emoji is one character, and an accented
letter is one character — regardless of how many code points compose it.
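Grapheme cluster boundaries also mean that appending a combining scalar can merge into the previous character rather than adding a new one — a quick sketch:

```swift
var word = "cafe"
print(word.count) // 4
word += "\u{301}"  // combining acute accent
print(word)        // café
print(word.count)  // still 4 — the accent merged into the final "e"
```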
Unicode Scalars vs. Characters
Swift makes a clear distinction between three levels of text representation:
| Level | Type | Description |
|---|---|---|
| Grapheme cluster | Character | What humans see as a character |
| Unicode scalar | Unicode.Scalar | A single code point (U+0000–U+D7FF, U+E000–U+10FFFF) |
| Code unit | UInt8 / UInt16 | Raw encoding unit (UTF-8 or UTF-16) |
A Unicode.Scalar is a single code point. It excludes the surrogate range
(U+D800–U+DFFF) because surrogates are encoding artifacts of UTF-16, not real
characters.
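A quick way to see the surrogate exclusion: Unicode.Scalar's failable initializer returns nil for values in the surrogate range:

```swift
// Unicode.Scalar(_:) is failable — it rejects surrogate code points
let valid = Unicode.Scalar(0x1F40D)     // 🐍 — a real scalar
let surrogate = Unicode.Scalar(0xD800)  // nil — surrogates are UTF-16 artifacts
print(valid != nil, surrogate == nil)   // true true
```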
let e_acute: Character = "\u{E9}" // U+00E9 — precomposed é
let e_combining: Character = "\u{65}\u{301}" // U+0065 + U+0301 — e + combining accent
print(e_acute == e_combining) // true — Swift compares by canonical equivalence
Swift performs canonical equivalence comparison by default. Two strings that look the same to humans are considered equal, even if their underlying scalar sequences differ. This is a major advantage over languages like Python or JavaScript, where you must manually normalize before comparing.
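A small sketch of what canonical equivalence hides: the two strings compare equal even though their scalar and byte sequences differ:

```swift
let precomposed = "caf\u{E9}"   // é as one scalar (U+00E9)
let decomposed = "cafe\u{301}"  // e + combining acute (U+0065 U+0301)

print(precomposed == decomposed)                       // true — canonical equivalence
print(precomposed.unicodeScalars.count,
      decomposed.unicodeScalars.count)                 // 4 5 — different scalar counts
print(precomposed.utf8.elementsEqual(decomposed.utf8)) // false — different bytes
```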
String Views
A String provides multiple views for accessing its contents at different levels
of abstraction:
let text = "café🐍"
// Character view (default) — extended grapheme clusters
print(text.count) // 5
for char in text { print(char) } // c, a, f, é, 🐍
// Unicode scalar view — code points
print(text.unicodeScalars.count) // 5 (if é is precomposed)
for scalar in text.unicodeScalars {
print(String(format: "U+%04X", scalar.value))
}
// U+0063, U+0061, U+0066, U+00E9, U+1F40D
// UTF-8 view — raw bytes
print(text.utf8.count) // 9 (c=1, a=1, f=1, é=2, 🐍=4)
print(Array(text.utf8))
// [99, 97, 102, 195, 169, 240, 159, 144, 141] — 9 bytes
// UTF-16 view — 16-bit code units
print(text.utf16.count) // 6 (🐍 needs a surrogate pair)
When to Use Each View
| View | Use Case |
|---|---|
| String (characters) | User-facing text: display, counting what users see |
| unicodeScalars | Working with individual code points, Unicode properties |
| utf8 | Network I/O, file I/O, interop with C APIs, byte-level processing |
| utf16 | Interop with Objective-C NSString, Windows APIs, Java |
String Indexing
Because characters in Swift have variable width (a grapheme cluster can span
multiple code points and multiple bytes), Swift does not support integer
subscript indexing. Instead, it uses opaque String.Index values:
let text = "Hello, 世界! 🌍"
// Start and end
let first = text[text.startIndex] // "H"
let last = text[text.index(before: text.endIndex)] // "🌍"
// Advance by character count
let fifth = text.index(text.startIndex, offsetBy: 4)
print(text[fifth]) // "o"
// Range-based slicing
let start = text.index(text.startIndex, offsetBy: 7)
let end = text.index(start, offsetBy: 2)
let substring = text[start..<end] // "世界"
This design is intentional: Swift refuses to offer an integer subscript that
would look O(1) but either cost O(n) per access or index raw code units and
silently split multi-byte characters. Advancing by n characters is O(n), but it
is always correct.
Why Not Integer Indexing?
In C, Java, or JavaScript, str[5] accesses the sixth code unit in constant
time — but this is only meaningful for single-byte or fixed-width encodings. For
UTF-8 text, the fifth character might start at byte 5, byte 7, or byte 20
depending on the preceding characters. Swift makes this cost explicit rather than
hiding it behind an O(n) subscript that looks O(1).
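If you want integer-offset ergonomics anyway, one common pattern is a small convenience extension that keeps the O(n) walk in one place. charAt is an illustrative name here, not a standard library API:

```swift
extension String {
    // O(n) by design: walks grapheme clusters from the start.
    // Illustrative helper, not part of the standard library.
    func charAt(_ offset: Int) -> Character? {
        guard offset >= 0,
              let idx = index(startIndex, offsetBy: offset, limitedBy: endIndex),
              idx < endIndex else { return nil }
        return self[idx]
    }
}

let text = "café🐍"
print(text.charAt(3)!)       // é
print(text.charAt(4)!)       // 🐍
print(text.charAt(9) as Any) // nil — out of bounds, no crash
```

The limitedBy: parameter makes out-of-range offsets return nil instead of trapping, which is usually what you want from an optional-returning helper.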
Creating Strings with Unicode
Swift supports several Unicode literal formats:
// Direct character embedding (source files are UTF-8)
let greeting = "こんにちは"
let arrow = "→"
// Unicode scalar escape: \u{XXXX}
let snowman = "\u{2603}" // ☃
let snake = "\u{1F40D}" // 🐍
let sigma = "\u{03C3}" // σ
// Multi-scalar characters
let flag_kr = "\u{1F1F0}\u{1F1F7}" // 🇰🇷
let accent_e = "\u{65}\u{301}" // é (combining)
String Comparison and Sorting
Swift's == operator performs canonical equivalence comparison:
let a = "caf\u{E9}" // café (precomposed)
let b = "cafe\u{301}" // café (decomposed)
print(a == b) // true
// Case-insensitive comparison
print(a.caseInsensitiveCompare(b) == .orderedSame) // true
For locale-aware sorting, use localizedStandardCompare:
let words = ["Äpfel", "Orangen", "Bananen"]
let sorted = words.sorted { $0.localizedStandardCompare($1) == .orderedAscending }
// ["Äpfel", "Bananen", "Orangen"] — locale-aware
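Foundation's comparison options can also fold case and diacritics together in one call — a minimal sketch:

```swift
import Foundation

let a = "Résumé"
let b = "resume"
// Fold both case and accents before comparing
let same = a.compare(b, options: [.caseInsensitive, .diacriticInsensitive]) == .orderedSame
print(same) // true
```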
Working with Unicode Properties
The Unicode.Scalar.Properties API (Swift 5.0+) gives access to the full Unicode
Character Database:
let scalar: Unicode.Scalar = "A"
print(scalar.properties.generalCategory) // .uppercaseLetter
print(scalar.properties.isAlphabetic) // true
print(scalar.properties.isUppercase) // true
print(scalar.properties.name!) // "LATIN CAPITAL LETTER A"
print(scalar.properties.numericType) // nil
let digit: Unicode.Scalar = "7"
print(digit.properties.numericValue!) // 7.0 — numericValue is Optional
let emoji: Unicode.Scalar = "\u{1F40D}"
print(emoji.properties.isEmoji) // true
print(emoji.properties.name!) // "SNAKE"
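These properties make transformations like accent stripping straightforward. The helper below is illustrative (not a standard API): it canonically decomposes the string with Foundation, then drops nonspacing marks:

```swift
import Foundation

// Strip accents: decompose canonically, then filter out nonspacing marks.
// strippingAccents is an illustrative helper, not a standard library function.
func strippingAccents(_ s: String) -> String {
    let decomposed = s.decomposedStringWithCanonicalMapping
    let scalars = decomposed.unicodeScalars.filter {
        $0.properties.generalCategory != .nonspacingMark
    }
    return String(String.UnicodeScalarView(scalars))
}

print(strippingAccents("café naïve")) // cafe naive
```

Foundation's folding(options: .diacriticInsensitive, locale:) does roughly the same job in a single call; the version above shows what is happening underneath.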
Regular Expressions (Swift 5.7+)
Swift 5.7 introduced native regex support with full Unicode awareness:
// Regex literal syntax
let pattern = /\p{Letter}+/
let match = "café".firstMatch(of: pattern)
print(match!.output) // "café"
// Unicode script matching
let hanPattern = /\p{Script=Han}+/
"漢字テスト".firstMatch(of: hanPattern)?.output // "漢字"
// Grapheme-cluster-aware . matching
let dotPattern = /^.$/
"🇺🇸".firstMatch(of: dotPattern) // matches — one grapheme cluster
// Named captures with Unicode
let namePattern = /(?<word>\p{Letter}+)/
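Named captures are reachable through the match's output. A small sketch using the pattern above (redeclared here so the snippet stands alone):

```swift
let namePattern = /(?<word>\p{Letter}+)/
if let match = "héllo world".firstMatch(of: namePattern) {
    print(match.output.word) // héllo — \p{Letter} matches accented letters too
}
```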
Interoperability
With Objective-C (NSString)
NSString is internally UTF-16. When bridging, Swift handles conversion
automatically, but NSString.length counts UTF-16 code units:
import Foundation
let text: NSString = "café🐍"
print(text.length) // 6 — UTF-16 code units (🐍 = surrogate pair)
print((text as String).count) // 5 — Swift characters
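Bridged APIs often hand back NSRange values measured in UTF-16 code units; Foundation's Range(_:in:) converts them safely into String ranges. A sketch:

```swift
import Foundation

let text = "café🐍!"
// NSRange counts UTF-16 code units
let nsRange = (text as NSString).range(of: "🐍")
print(nsRange.location, nsRange.length) // 4 2 — 🐍 is a surrogate pair
// Convert to a Swift range before slicing — never index with raw UTF-16 offsets
if let range = Range(nsRange, in: text) {
    print(text[range]) // 🐍
}
```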
With C APIs
Use withCString for null-terminated UTF-8 C strings:
let text = "Hello, 世界!"
text.withCString { ptr in
// ptr is UnsafePointer<CChar> — null-terminated UTF-8
print(strlen(ptr)) // 14 — byte count ("Hello, " = 7, 世 = 3, 界 = 3, ! = 1)
}
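If you need to keep the bytes around rather than borrow a pointer inside a closure, utf8CString copies out a null-terminated buffer (a sketch):

```swift
import Foundation

let text = "Hello, 世界!"
// Copy out a null-terminated UTF-8 buffer as an array of CChar
let cString: [CChar] = Array(text.utf8CString)
print(cString.count) // 15 — 14 UTF-8 bytes + null terminator
// Round-trip back to a Swift String
print(String(cString: cString)) // Hello, 世界!
```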
Common Pitfalls
1. Assuming Fixed Character Width
// Each of these is one Character in Swift
let a: Character = "A" // 1 scalar
let flag: Character = "🇺🇸" // 2 scalars
let family: Character = "👨\u{200D}👩\u{200D}👧\u{200D}👦" // 7 scalars (4 people + 3 ZWJ)
let accented: Character = "e\u{301}\u{327}" // 3 scalars (e + accent + cedilla)
2. Converting Between Index Types
Indices from different views are not interchangeable:
let text = "café🐍"
let utf8Index = text.utf8.index(text.utf8.startIndex, offsetBy: 3)
// This is a UTF-8 byte offset (the start of "é"), not a character offset;
// an offset into the middle of a character would make samePosition return nil
// Use String.Index conversions to translate between views
if let charIndex = utf8Index.samePosition(in: text) {
print(text[charIndex]) // "é"
}
3. Performance of count
String.count is O(n) because it must walk the entire string to count grapheme
clusters. If you need the count repeatedly, cache it.
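Two related habits worth sketching: prefer isEmpty over count == 0 (it only has to check for a first character), and compute count once if you need it repeatedly:

```swift
// 1,000 family emoji — each repetition is one grapheme cluster
let text = String(repeating: "👨\u{200D}👩\u{200D}👧\u{200D}👦", count: 1_000)

// Prefer isEmpty: O(1), unlike count == 0 which walks the whole string
print(text.isEmpty) // false

// Cache the count if you need it more than once
let characterCount = text.count // O(n): walks all grapheme clusters once
print(characterCount) // 1000
```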
Quick Reference
| Task | Code |
|---|---|
| Character count | str.count |
| Unicode scalar count | str.unicodeScalars.count |
| UTF-8 byte count | str.utf8.count |
| Get character by offset | str[str.index(str.startIndex, offsetBy: n)] |
| Unicode escape | "\u{1F40D}" → "🐍" |
| Canonical comparison | a == b (automatic) |
| Check properties | scalar.properties.isAlphabetic |
| Regex with Unicode | /\p{Letter}+/ |
| To UTF-8 bytes | Array(str.utf8) |
| From Unicode scalar | String(Unicode.Scalar(0x2192)!) → "→" |
Swift's approach to Unicode is arguably the most correct of any mainstream language. The tradeoffs — O(n) character counting, opaque indices, and no integer subscripting — are deliberate design choices that prevent the class of bugs that plague string handling in C, Java, JavaScript, and Python. If you are writing text-processing code, Swift's model is the one to study.