💻 Unicode in Code

Unicode in Swift

Swift's String type is designed with Unicode correctness as a first-class concern, representing characters as extended grapheme clusters rather than code units. This guide explains Swift's Character and String types, their views (UTF-8, UTF-16, Unicode scalars), and how to work with emoji and complex characters.

Swift is arguably the most Unicode-correct mainstream programming language. Its String type is built from the ground up around Unicode — not as an afterthought or extension, but as a core design principle. Where other languages expose code units or code points and leave developers to handle the complexity, Swift's String operates on extended grapheme clusters by default, matching what humans perceive as characters. This guide explains Swift's Unicode model, its multiple string views, and why it counts characters differently from every other language.

Extended Grapheme Clusters

The defining feature of Swift strings is that each Character value represents an extended grapheme cluster — a sequence of one or more Unicode scalars (code points) that together produce a single human-perceived character.

let cafe = "café"
print(cafe.count)           // 4 — "é" counts as one Character

let flag = "🇺🇸"
print(flag.count)           // 1 — the flag is one grapheme cluster
print(Array(flag.unicodeScalars).count) // 2 — two regional indicators

let family = "👨‍👩‍👧‍👦"
print(family.count)          // 1 — one grapheme cluster
print(Array(family.unicodeScalars).count) // 7 — 4 people + 3 ZWJ characters

This design means String.count gives the result users intuitively expect: one flag emoji is one character, one family emoji is one character, and an accented letter is one character — regardless of how many code points compose it.

Unicode Scalars vs. Characters

Swift makes a clear distinction between three levels of text representation:

Level              Type             Description
Grapheme cluster   Character        What humans see as a character
Unicode scalar     Unicode.Scalar   A single code point (U+0000–U+D7FF, U+E000–U+10FFFF)
Code unit          UInt8 / UInt16   Raw encoding unit (UTF-8 or UTF-16)

A Unicode.Scalar is a single code point. It excludes the surrogate range (U+D800–U+DFFF) because surrogates are encoding artifacts of UTF-16, not real characters.

let e_acute: Character = "\u{E9}"            // U+00E9 — precomposed é
let e_combining: Character = "\u{65}\u{301}" // U+0065 + U+0301 — e + combining accent

print(e_acute == e_combining)   // true — Swift compares by canonical equivalence

Swift performs canonical equivalence comparison by default. Two strings that look the same to humans are considered equal, even if their underlying scalar sequences differ. This is a major advantage over languages like Python or JavaScript, where you must manually normalize before comparing.
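Canonical equivalence applies to comparison, not to storage: the two forms still serialize to different bytes, which matters for hashing, persistence, and wire formats. When byte-identical output is required, Foundation exposes explicit normalization. A minimal sketch:

```swift
import Foundation

let precomposed = "caf\u{E9}"    // é as one scalar (U+00E9)
let decomposed = "cafe\u{301}"   // e plus combining acute (U+0301)

// Swift's == already treats the two forms as equal...
assert(precomposed == decomposed)
// ...but their serialized bytes differ.
assert(Array(precomposed.utf8) != Array(decomposed.utf8))

// Foundation normalizes to NFC (precomposed) or NFD (decomposed):
let nfc = decomposed.precomposedStringWithCanonicalMapping
let nfd = precomposed.decomposedStringWithCanonicalMapping
assert(Array(nfc.utf8) == Array(precomposed.utf8))
assert(Array(nfd.utf8) == Array(decomposed.utf8))
```

precomposedStringWithCanonicalMapping produces NFC and decomposedStringWithCanonicalMapping produces NFD; compatibility variants of both also exist.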

String Views

A String provides multiple views for accessing its contents at different levels of abstraction:

let text = "café🐍"

// Character view (default) — extended grapheme clusters
print(text.count)                        // 5
for char in text { print(char) }         // c, a, f, é, 🐍

// Unicode scalar view — code points
print(text.unicodeScalars.count)         // 5 (if é is precomposed)
for scalar in text.unicodeScalars {
    print(String(format: "U+%04X", scalar.value))
}
// U+0063, U+0061, U+0066, U+00E9, U+1F40D

// UTF-8 view — raw bytes
print(text.utf8.count)                   // 9 (c=1, a=1, f=1, é=2, 🐍=4)
print(Array(text.utf8))
// [99, 97, 102, 195, 169, 240, 159, 144, 141] — 9 bytes

// UTF-16 view — 16-bit code units
print(text.utf16.count)                  // 6 (🐍 needs a surrogate pair)

When to Use Each View

View                 Use Case
String (characters)  User-facing text: display, counting what users see
unicodeScalars       Working with individual code points, Unicode properties
utf8                 Network I/O, file I/O, interop with C APIs, byte-level processing
utf16                Interop with Objective-C NSString, Windows APIs, Java
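The utf8 view pairs naturally with byte-oriented APIs. Here is a small sketch of round-tripping a string through raw bytes with the standard library's String(decoding:as:), which never traps on bad input:

```swift
let original = "café🐍"

// Encode: the utf8 view is a collection of UInt8 code units.
let bytes = Array(original.utf8)   // 9 bytes for this string

// Decode: String(decoding:as:) never traps; invalid sequences are
// replaced with U+FFFD rather than crashing or returning nil.
let roundTripped = String(decoding: bytes, as: UTF8.self)
assert(roundTripped == original)

// An invalid byte becomes the replacement character.
let bad = String(decoding: [0x63, 0xFF], as: UTF8.self)
assert(bad == "c\u{FFFD}")
```

The lossy U+FFFD behavior makes this initializer a good default for untrusted input; use String(validating:as:) or similar failable paths when you need to reject malformed data instead.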

String Indexing

Because characters in Swift have variable width (a grapheme cluster can span multiple code points and multiple bytes), Swift does not support integer subscript indexing. Instead, it uses opaque String.Index values:

let text = "Hello, 世界! 🌍"

// Start and end
let first = text[text.startIndex]           // "H"
let last = text[text.index(before: text.endIndex)]  // "🌍"

// Advance by character count
let fifth = text.index(text.startIndex, offsetBy: 4)
print(text[fifth])                          // "o"

// Range-based slicing
let start = text.index(text.startIndex, offsetBy: 7)
let end = text.index(start, offsetBy: 2)
let substring = text[start..<end]           // "世界"

This design is intentional: constant-time integer indexing is impossible when characters have variable width, so Swift refuses to offer a subscript that looks O(1) but would either cost O(n) or silently operate on code units and break multi-byte characters. Advancing by n characters is O(n), but it is always correct.

Why Not Integer Indexing?

In C, Java, or JavaScript, str[5] accesses the sixth code unit in constant time — but this is only meaningful for single-byte or fixed-width encodings. For UTF-8 text, the fifth character might start at byte 5, byte 7, or byte 20 depending on the preceding characters. Swift makes this cost explicit rather than hiding it behind an O(n) subscript that looks O(1).
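If offset-based access is genuinely needed, a common pattern is a small extension that keeps the linear cost visible. A sketch, assuming a helper of my own naming (char(at:) is not a standard API):

```swift
extension StringProtocol {
    // Returns the Character at a given offset, or nil when out of bounds.
    // A method rather than a subscript, to keep the O(n) cost visible.
    func char(at offset: Int) -> Character? {
        guard offset >= 0,
              let idx = index(startIndex, offsetBy: offset, limitedBy: endIndex),
              idx != endIndex
        else { return nil }
        return self[idx]
    }
}

let text = "café🐍"
assert(text.char(at: 3) == "é")
assert(text.char(at: 4) == "🐍")
assert(text.char(at: 5) == nil)   // out of bounds returns nil, no trap
```

The limitedBy: variant of index(_:offsetBy:) is what makes the bounds check safe: it returns nil instead of trapping when the offset runs past endIndex.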

Creating Strings with Unicode

Swift supports several Unicode literal formats:

// Direct character embedding (source files are UTF-8)
let greeting = "こんにちは"
let arrow = "→"

// Unicode scalar escape: \u{XXXX}
let snowman = "\u{2603}"          // ☃
let snake = "\u{1F40D}"           // 🐍
let sigma = "\u{03C3}"            // σ

// Multi-scalar characters
let flag_kr = "\u{1F1F0}\u{1F1F7}"     // 🇰🇷
let accent_e = "\u{65}\u{301}"          // é (combining)

String Comparison and Sorting

Swift's == operator performs canonical equivalence comparison:

let a = "caf\u{E9}"            // café (precomposed)
let b = "cafe\u{301}"          // café (decomposed)
print(a == b)                   // true

// Case-insensitive comparison
print(a.caseInsensitiveCompare(b) == .orderedSame)  // true

For locale-aware sorting, use localizedStandardCompare:

let words = ["Äpfel", "Orangen", "Bananen"]
let sorted = words.sorted { $0.localizedStandardCompare($1) == .orderedAscending }
// ["Äpfel", "Bananen", "Orangen"] — locale-aware
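For looser matching, such as search fields or deduplication, Foundation's folding(options:locale:) strips case and diacritics in one pass. A sketch (the fold helper is my own):

```swift
import Foundation

// Fold away case and diacritics before comparing.
// nil locale means locale-independent folding rules.
func fold(_ s: String) -> String {
    s.folding(options: [.caseInsensitive, .diacriticInsensitive], locale: nil)
}

assert(fold("Cafe") == fold("café"))   // both become "cafe"

// localizedStandardContains applies similar folding for substring search.
assert("Le Café Noir".localizedStandardContains("cafe"))
```

localizedStandardContains is Apple's recommended API for user-initiated search, since it is case-, diacritic-, and width-insensitive by design.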

Working with Unicode Properties

The Unicode.Scalar.Properties API (Swift 5.0+) gives access to the full Unicode Character Database:

let scalar: Unicode.Scalar = "A"
print(scalar.properties.generalCategory)     // .uppercaseLetter
print(scalar.properties.isAlphabetic)        // true
print(scalar.properties.isUppercase)         // true
print(scalar.properties.name!)               // "LATIN CAPITAL LETTER A"
print(scalar.properties.numericType)         // nil

let digit: Unicode.Scalar = "7"
print(digit.properties.numericValue!)        // 7.0

let emoji: Unicode.Scalar = "\u{1F40D}"
print(emoji.properties.isEmoji)              // true
print(emoji.properties.name!)                // "SNAKE"
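These properties compose with the unicodeScalars view. A sketch of a rough emoji check (the containsEmoji helper is my own):

```swift
// isEmoji alone is too broad: ASCII digits report it because they *can*
// be rendered as emoji, so test for default emoji presentation instead.
func containsEmoji(_ s: String) -> Bool {
    s.unicodeScalars.contains { $0.properties.isEmojiPresentation }
}

assert(containsEmoji("snake 🐍"))
assert(!containsEmoji("plain text 7"))
```

A production-grade check would also handle variation selectors and ZWJ sequences, but the scalar-property approach above covers the common cases.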

Regular Expressions (Swift 5.7+)

Swift 5.7 introduced native regex support with full Unicode awareness:

// Regex literal syntax
let pattern = /\p{Letter}+/
if let match = "café".firstMatch(of: pattern) {
    print(match.output)   // "café"
}

// Unicode script matching
let hanPattern = /\p{Script=Han}+/
"漢字テスト".firstMatch(of: hanPattern)?.output  // "漢字"

// Grapheme-cluster-aware . matching
let dotPattern = /^.$/
"🇺🇸".firstMatch(of: dotPattern)   // matches — one grapheme cluster

// Named captures with Unicode
let namePattern = /(?<word>\p{Letter}+)/
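By default these regexes use grapheme-cluster semantics; you can opt into scalar-level matching with matchingSemantics(.unicodeScalar), which changes what . consumes. A sketch (using #/ /# extended delimiters, which compile without the bare-slash-regex flag):

```swift
// Default (grapheme-cluster) semantics: "." consumes one whole cluster.
let graphemeDot = #/^.$/#
assert("🇺🇸".wholeMatch(of: graphemeDot) != nil)   // the flag is one cluster

// Scalar semantics: "." consumes one Unicode scalar, so the two
// regional-indicator scalars of the flag need two dots.
let scalarDot = #/^.$/#.matchingSemantics(.unicodeScalar)
assert("🇺🇸".wholeMatch(of: scalarDot) == nil)

let twoScalars = #/^..$/#.matchingSemantics(.unicodeScalar)
assert("🇺🇸".wholeMatch(of: twoScalars) != nil)
```

Scalar semantics is occasionally useful for protocol-level matching, but grapheme semantics is the right default for user-facing text.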

Interoperability

With Objective-C (NSString)

NSString is internally UTF-16. When bridging, Swift handles conversion automatically, but NSString.length counts UTF-16 code units:

import Foundation

let text: NSString = "café🐍"
print(text.length)                 // 6 — UTF-16 code units (🐍 = surrogate pair)
print((text as String).count)      // 5 — Swift characters
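The same mismatch affects ranges: NSRange is measured in UTF-16 code units, so convert with Range(_:in:) instead of reusing raw offsets. A sketch:

```swift
import Foundation

let text = "café🐍!"

// range(of:) comes from NSString, so the result is in UTF-16 units:
// "café" is 4 units, the snake is a surrogate pair (2), then "!".
let nsRange = (text as NSString).range(of: "!")
assert(nsRange.location == 6)

// Convert before slicing; treating 6 as a character offset would be wrong.
if let range = Range(nsRange, in: text) {
    assert(text[range] == "!")
}
```

Range(_:in:) returns nil when the NSRange does not fall on character boundaries, which is exactly the failure you want surfaced rather than hidden.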

With C APIs

Use withCString for null-terminated UTF-8 C strings:

let text = "Hello, 世界!"
text.withCString { ptr in
    // ptr is UnsafePointer<CChar> — null-terminated UTF-8
    print(strlen(ptr))   // 14 — byte count ("Hello, " = 7, 世界 = 6, "!" = 1)
}
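When a copied buffer is more convenient than a closure scope, utf8CString returns a null-terminated array you can hold onto. A sketch:

```swift
let text = "Hello, 世界!"

// utf8CString copies the bytes into a null-terminated CChar array,
// handy when the data must outlive a withCString closure.
let cChars = text.utf8CString
assert(cChars.count == 15)   // 14 UTF-8 bytes plus the trailing NUL

// Round-trip through the C representation.
let back = cChars.withUnsafeBufferPointer { buf in
    String(cString: buf.baseAddress!)
}
assert(back == text)
```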

Common Pitfalls

1. Assuming Fixed Character Width

// Each of these is one Character in Swift
let a: Character = "A"                          // 1 scalar
let flag: Character = "🇺🇸"                    // 2 scalars
let family: Character = "👨‍👩‍👧‍👦"             // 7 scalars
let accented: Character = "e\u{301}\u{327}"     // 3 scalars (e + accent + cedilla)

2. Converting Between Index Types

Indices from different views are not interchangeable:

let text = "café🐍"
// Offset 3 in the UTF-8 view is the first byte of "é" (which spans bytes 3–4)
let utf8Index = text.utf8.index(text.utf8.startIndex, offsetBy: 3)
// This is a UTF-8 byte offset, not a character offset.
// samePosition(in:) returns nil if the index falls mid-character.
if let charIndex = utf8Index.samePosition(in: text) {
    print(text[charIndex])   // "é"
}
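Going the other direction, from a character position to a byte offset, works with distance(from:to:) on the relevant view, since String.Index values are shared across views. A sketch:

```swift
let text = "café🐍"

// Character position of the snake.
let snakeIdx = text.index(text.startIndex, offsetBy: 4)
assert(text[snakeIdx] == "🐍")

// Its byte offset: a character boundary is always a valid UTF-8 boundary,
// so the same index can be measured in the utf8 view.
let byteOffset = text.utf8.distance(from: text.utf8.startIndex, to: snakeIdx)
assert(byteOffset == 5)   // c + a + f + two bytes of é
```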

3. Performance of count

String.count is O(n) because it must walk the entire string to count grapheme clusters. If you need the count repeatedly, cache it.
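One practical consequence: to test for emptiness, prefer isEmpty, which is O(1), over count == 0, which walks the whole string:

```swift
let text = String(repeating: "🐍", count: 1_000)

// O(1): compares startIndex against endIndex.
assert(!text.isEmpty)

// O(n): must walk all grapheme clusters.
assert(text.count == 1_000)
```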

Quick Reference

Task                     Code
Character count          str.count
Unicode scalar count     str.unicodeScalars.count
UTF-8 byte count         str.utf8.count
Get character by offset  str[str.index(str.startIndex, offsetBy: n)]
Unicode escape           "\u{1F40D}" → "🐍"
Canonical comparison     a == b (automatic)
Check properties         scalar.properties.isAlphabetic
Regex with Unicode       /\p{Letter}+/
To UTF-8 bytes           Array(str.utf8)
From Unicode scalar      String(Unicode.Scalar(0x2192)!) → "→"

Swift's approach to Unicode is arguably the most correct of any mainstream language. The tradeoffs — O(n) character counting, opaque indices, and no integer subscripting — are deliberate design choices that prevent the class of bugs that plague string handling in C, Java, JavaScript, and Python. If you are writing text-processing code, Swift's model is the one to study.
