Unicode in Go
Go's string type is a sequence of bytes, and its rune type represents a single Unicode code point, making non-ASCII text easier to work with than in many older languages. This guide covers Go's unicode and unicode/utf8 packages, iterating over strings with range, and handling multibyte characters correctly.
Go was designed in 2007 by Rob Pike, Ken Thompson, and Robert Griesemer at
Google -- the same Rob Pike and Ken Thompson who co-created UTF-8 at Bell Labs
in 1992. It should come as no surprise, then, that Go's approach to Unicode is
elegant and pragmatic: all Go source code is UTF-8, all strings are byte slices
that conventionally hold UTF-8, and the language provides a dedicated rune
type for working with individual code points.
Strings Are UTF-8 Byte Slices
A Go string is a read-only slice of bytes. By convention (and by the behavior
of all standard-library functions), those bytes are UTF-8 encoded:
s := "Hello, 世界"
fmt.Println(len(s)) // 13 (bytes, not characters)
fmt.Println(utf8.RuneCountInString(s)) // 9 (code points)
The len() built-in returns the byte length, not the number of characters.
This distinction is the first thing every Go developer must internalize.
Why Bytes, Not Characters?
Go's design philosophy is that strings are fundamentally sequences of bytes that happen to be valid UTF-8 most of the time. This gives you:
- Zero-cost interop with C libraries and network protocols
- Efficient slicing and concatenation (no re-encoding)
- The ability to handle partially valid or binary data
The trade-off is that you must be explicit when you want to operate on characters (runes) rather than bytes.
The rune Type
Go defines rune as an alias for int32:
type rune = int32
A rune holds a single Unicode code point. The name "rune" was chosen to avoid
confusion with the overloaded term "character" and to give Go a distinctive
vocabulary:
var r rune = '世' // U+4E16
fmt.Printf("U+%04X\n", r) // U+4E16
fmt.Println(string(r)) // 世
Single-quoted literals in Go produce rune values (not byte values as in C):
r1 := 'A' // rune, value 65 (U+0041)
r2 := '🎵' // rune, value 127925 (U+1F3B5)
r3 := '\u00E9' // rune, value 233 (U+00E9, é)
r4 := '\U0001F600' // rune, value 128512 (U+1F600, 😀)
Iterating Over Strings
Go provides two distinct iteration patterns that expose the byte-vs-rune distinction:
Byte Iteration (index-based)
s := "café"
for i := 0; i < len(s); i++ {
fmt.Printf("byte[%d] = %02X\n", i, s[i])
}
// byte[0] = 63 (c)
// byte[1] = 61 (a)
// byte[2] = 66 (f)
// byte[3] = C3 (first byte of é)
// byte[4] = A9 (second byte of é)
Rune Iteration (range loop)
s := "café"
for i, r := range s {
fmt.Printf("index=%d rune=U+%04X char=%c\n", i, r, r)
}
// index=0 rune=U+0063 char=c
// index=1 rune=U+0061 char=a
// index=2 rune=U+0066 char=f
// index=3 rune=U+00E9 char=é
The range loop decodes one UTF-8 rune at each step and advances the index by
the number of bytes consumed. This is the idiomatic way to iterate over
characters in Go.
If the string contains invalid UTF-8, range yields U+FFFD
(REPLACEMENT CHARACTER) for each invalid byte and advances the index by one byte.
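A minimal demonstration of that behavior, using an arbitrary invalid byte (the decodeAll helper is ours, for illustration):

```go
package main

import "fmt"

// decodeAll collects the runes produced by ranging over s,
// including any U+FFFD replacements for invalid bytes.
func decodeAll(s string) []rune {
	var out []rune
	for _, r := range s {
		out = append(out, r)
	}
	return out
}

func main() {
	// \xFF can never appear in valid UTF-8.
	for _, r := range decodeAll("a\xFFb") {
		fmt.Printf("%U ", r) // U+0061 U+FFFD U+0062
	}
	fmt.Println()
}
```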
Converting Between Strings, Runes, and Bytes
s := "Hello, 世界"
// String → rune slice (decode all code points)
runes := []rune(s)
fmt.Println(len(runes)) // 9
fmt.Printf("%U\n", runes[8]) // U+754C (界)
// Rune slice → string
s2 := string(runes)
// String → byte slice
bytes := []byte(s)
fmt.Println(len(bytes)) // 13
// Byte slice → string
s3 := string(bytes)
// Single rune → string
s4 := string('世') // "世"
// Single rune → UTF-8 bytes
buf := make([]byte, 4)
n := utf8.EncodeRune(buf, '世')
fmt.Println(buf[:n]) // [228 184 150]
Converting to []rune allocates a new slice and decodes every rune. For large
strings, prefer range-based iteration over conversion.
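For example, fetching a single code point by position does not require decoding the whole string. This runeAt helper (our name, not a stdlib function) stops as soon as it reaches the target:

```go
package main

import "fmt"

// runeAt returns the n-th rune of s (0-based) without allocating
// a []rune for the whole string; ok is false if s has fewer runes.
func runeAt(s string, n int) (r rune, ok bool) {
	count := 0
	for _, r := range s {
		if count == n {
			return r, true
		}
		count++
	}
	return 0, false
}

func main() {
	r, ok := runeAt("Hello, 世界", 8)
	fmt.Printf("%U %v\n", r, ok) // U+754C true
}
```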
The unicode Package
The unicode standard-library package provides classification functions that
mirror the Unicode Character Database:
import "unicode"
unicode.IsLetter('A') // true
unicode.IsLetter('7') // false
unicode.IsDigit('٣') // true (Arabic-Indic digit 3)
unicode.IsSpace(' ') // true
unicode.IsSpace('\u00A0') // true (non-breaking space)
unicode.IsPunct('!') // true
unicode.IsUpper('Σ') // true
unicode.ToLower('Σ') // 'σ'
unicode.ToUpper('é') // 'É'
Script and Category Tables
The unicode package exports tables for every Unicode script and general
category:
unicode.Is(unicode.Greek, 'Σ') // true
unicode.Is(unicode.Cyrillic, 'Д') // true
unicode.Is(unicode.Han, '漢') // true
// Check if a rune is a Unicode symbol
unicode.Is(unicode.S, '€') // true (S = Symbol)
unicode.Is(unicode.Sm, '+') // true (Sm = Symbol, math)
The unicode/utf8 Package
The unicode/utf8 package provides low-level UTF-8 encoding and decoding:
import "unicode/utf8"
s := "Hello, 世界"
// Count runes
utf8.RuneCountInString(s) // 9
// Decode first rune
r, size := utf8.DecodeRuneInString(s)
// r = 'H', size = 1
// Decode last rune
r, size = utf8.DecodeLastRuneInString(s)
// r = '界', size = 3
// Check validity
utf8.ValidString(s) // true
utf8.ValidString("\xFF\xFE") // false
// Rune byte length
utf8.RuneLen('A') // 1
utf8.RuneLen('世') // 3
utf8.RuneLen('🎵') // 4
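Combining DecodeRuneInString with manual index arithmetic reproduces what a range loop does under the hood. The decode helper below (our name, for illustration) returns each rune alongside the number of bytes it occupied:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// decode walks s rune by rune with utf8.DecodeRuneInString,
// returning each rune and its encoded length in bytes.
func decode(s string) (runes []rune, sizes []int) {
	for i := 0; i < len(s); {
		r, size := utf8.DecodeRuneInString(s[i:])
		runes = append(runes, r)
		sizes = append(sizes, size)
		i += size // advance by the bytes just consumed
	}
	return
}

func main() {
	runes, sizes := decode("café")
	fmt.Println(string(runes), sizes) // café [1 1 1 2]
}
```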
UTF-8 Byte Lengths
| Code Point Range | UTF-8 Bytes | Example |
|---|---|---|
| U+0000 -- U+007F | 1 | A (0x41) |
| U+0080 -- U+07FF | 2 | é (0xC3 0xA9) |
| U+0800 -- U+FFFF | 3 | 世 (0xE4 0xB8 0x96) |
| U+10000 -- U+10FFFF | 4 | 🎵 (0xF0 0x9F 0x8E 0xB5) |
The strings Package
Go's strings package is byte-oriented but provides functions that work
correctly with UTF-8:
import "strings"
strings.ToUpper("café") // "CAFÉ"
strings.ToLower("ΣΕΛΉΝΗ") // "σελήνη"
strings.Contains("Hello 🌍", "🌍") // true
strings.Count("banana", "a") // 3
strings.TrimSpace(" hello\t") // "hello"
// Map transforms each rune
rot13 := strings.Map(func(r rune) rune {
if r >= 'a' && r <= 'z' {
return 'a' + (r-'a'+13)%26
}
return r
}, "hello")
// "uryyb"
String Slicing Gotchas
Because strings are byte slices, slicing at an arbitrary byte offset can split a multi-byte UTF-8 sequence:
s := "café"
// WRONG: slicing in the middle of é (bytes 3-4)
bad := s[:4] // "caf\xC3" ← invalid UTF-8
// CORRECT: find the right byte boundary
runes := []rune(s)
good := string(runes[:3]) // "caf"
If you know your text is ASCII-only, byte slicing is safe and efficient.
Otherwise, convert to []rune first, or use range to find code point
boundaries.
Normalization
The standard library does not include Unicode normalization, but the official
golang.org/x/text module provides it:
import "golang.org/x/text/unicode/norm"
composed := "e\u0301" // NFD: e + combining acute
normalized := norm.NFC.String(composed)
fmt.Println(normalized == "\u00E9") // true
// Stream-based normalization
reader := norm.NFC.Reader(strings.NewReader(composed))
The golang.org/x/text module also provides:
- transform -- streaming text transformations
- collate -- locale-aware sorting
- language -- BCP 47 language tag parsing
- cases -- Unicode-aware case conversion
- width -- East Asian width folding
Regular Expressions
Go's regexp package uses the RE2 engine, which handles UTF-8 natively:
import "regexp"
// \pL matches any Unicode letter
re := regexp.MustCompile(`\pL+`)
re.FindAllString("café 日本語 Москва", -1)
// ["café", "日本語", "Москва"]
// Named Unicode categories
regexp.MustCompile(`\p{Greek}+`) // Greek letters
regexp.MustCompile(`\p{Cyrillic}+`) // Cyrillic letters
regexp.MustCompile(`\p{Han}+`) // CJK ideographs
regexp.MustCompile(`\p{Nd}+`) // decimal digits (any script)
Grapheme Clusters
A single visible character can span multiple runes (e.g., flag emoji, skin-tone
modified emoji, combining diacritics). Go's standard library does not provide
grapheme cluster segmentation, but third-party packages such as
github.com/rivo/uniseg handle this:
import "github.com/rivo/uniseg"
s := "👨\u200D👩\u200D👧\u200D👦" // family emoji: 4 emoji joined by ZWJ (7 runes, 1 grapheme cluster)
fmt.Println(utf8.RuneCountInString(s)) // 7
fmt.Println(uniseg.GraphemeClusterCount(s)) // 1
Quick Reference
| Task | Code |
|---|---|
| Byte length | len(s) |
| Rune count | utf8.RuneCountInString(s) |
| Iterate runes | for i, r := range s { ... } |
| String to runes | []rune(s) |
| Rune to string | string(r) |
| Classify rune | unicode.IsLetter(r) |
| Check script | unicode.Is(unicode.Greek, r) |
| Validate UTF-8 | utf8.ValidString(s) |
| Normalize NFC | norm.NFC.String(s) |
| Regex Unicode | regexp.MustCompile(`\pL+`) |
Go's model -- UTF-8 strings by convention, explicit rune type for code
points, and powerful standard-library packages -- makes Unicode handling both
efficient and safe. The key discipline is to remember that len() counts bytes,
range iterates runes, and neither one counts grapheme clusters. When you need
grapheme-level accuracy, reach for golang.org/x/text or a segmentation
library.