💻 Unicode in Code

Unicode in Go

Go's string type is a sequence of bytes, and its rune type represents a single Unicode code point, making it easier to work with non-ASCII text than many older languages. This guide covers Go's unicode and unicode/utf8 packages, iterating over strings with range, and handling multibyte characters correctly.


Go was designed in 2007 by Rob Pike, Ken Thompson, and Robert Griesemer at Google -- the same Rob Pike and Ken Thompson who co-created UTF-8 at Bell Labs in 1992. It should come as no surprise, then, that Go's approach to Unicode is elegant and pragmatic: all Go source code is UTF-8, all strings are byte slices that conventionally hold UTF-8, and the language provides a dedicated rune type for working with individual code points.

Strings Are UTF-8 Byte Slices

A Go string is a read-only slice of bytes. By convention (and by the behavior of all standard-library functions), those bytes are UTF-8 encoded:

s := "Hello, 世界"
fmt.Println(len(s))           // 13  (bytes, not characters)
fmt.Println(utf8.RuneCountInString(s))  // 9  (code points)

The len() built-in returns the byte length, not the number of characters. This distinction is the first thing every Go developer must internalize.

Why Bytes, Not Characters?

Go's design philosophy is that strings are fundamentally sequences of bytes that happen to be valid UTF-8 most of the time. This gives you:

  • Zero-cost interop with C libraries and network protocols
  • Efficient slicing and concatenation (no re-encoding)
  • The ability to handle partially valid or binary data

The trade-off is that you must be explicit when you want to operate on characters (runes) rather than bytes.

The rune Type

Go defines rune as an alias for int32:

type rune = int32

A rune holds a single Unicode code point. The name "rune" was chosen to avoid confusion with the overloaded term "character" and to give Go a distinctive vocabulary:

var r rune = '世'             // U+4E16
fmt.Printf("U+%04X\n", r)    // U+4E16
fmt.Println(string(r))        // 世

Single-quoted literals in Go produce rune values (not byte values as in C):

r1 := 'A'        // rune, value 65 (U+0041)
r2 := '🎵'       // rune, value 127925 (U+1F3B5)
r3 := '\u00E9'   // rune, value 233 (U+00E9, é)
r4 := '\U0001F600' // rune, value 128512 (U+1F600, 😀)

Iterating Over Strings

Go provides two distinct iteration patterns that expose the byte-vs-rune distinction:

Byte Iteration (index-based)

s := "café"
for i := 0; i < len(s); i++ {
    fmt.Printf("byte[%d] = %02X\n", i, s[i])
}
// byte[0] = 63  (c)
// byte[1] = 61  (a)
// byte[2] = 66  (f)
// byte[3] = C3  (first byte of é)
// byte[4] = A9  (second byte of é)

Rune Iteration (range loop)

s := "café"
for i, r := range s {
    fmt.Printf("index=%d rune=U+%04X char=%c\n", i, r, r)
}
// index=0 rune=U+0063 char=c
// index=1 rune=U+0061 char=a
// index=2 rune=U+0066 char=f
// index=3 rune=U+00E9 char=é

The range loop decodes one UTF-8 rune at each step and advances the index by the number of bytes consumed. This is the idiomatic way to iterate over characters in Go.

If the string contains invalid UTF-8, range produces U+FFFD (REPLACEMENT CHARACTER) for each bad byte.
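A quick sketch of that behavior with a deliberately invalid byte:

```go
package main

import "fmt"

func main() {
	// "\xFF" can never appear in valid UTF-8.
	s := "a\xFFb"
	for i, r := range s {
		fmt.Printf("index=%d rune=%U\n", i, r)
	}
	// index=0 rune=U+0061
	// index=1 rune=U+FFFD  (replacement character for the bad byte)
	// index=2 rune=U+0062
}
```

Note that the index still advances by one byte past the bad position, so iteration never stalls on malformed input.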

Converting Between Strings, Runes, and Bytes

s := "Hello, 世界"

// String → rune slice (decode all code points)
runes := []rune(s)
fmt.Println(len(runes))        // 9
fmt.Printf("%U\n", runes[8])   // U+754C (界)

// Rune slice → string
s2 := string(runes)

// String → byte slice
bytes := []byte(s)
fmt.Println(len(bytes))        // 13

// Byte slice → string
s3 := string(bytes)

// Single rune → string
s4 := string('世')             // "世"

// Single rune → UTF-8 bytes
buf := make([]byte, 4)
n := utf8.EncodeRune(buf, '世')
fmt.Println(buf[:n])           // [228 184 150]

Converting to []rune allocates a new slice and decodes every rune. For large strings, prefer range-based iteration over conversion.
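For example, taking the first n code points of a string can be done with a range loop over byte indices, avoiding the allocation entirely. This helper (firstNRunes is an illustrative name, not a standard-library function) assumes the input is valid UTF-8:

```go
package main

import "fmt"

// firstNRunes returns the prefix of s holding at most n code points,
// without allocating a []rune for the whole string.
func firstNRunes(s string, n int) string {
	count := 0
	for i := range s { // i advances one whole rune at a time
		if count == n {
			return s[:i]
		}
		count++
	}
	return s // the string has fewer than n runes
}

func main() {
	fmt.Println(firstNRunes("Hello, 世界", 8)) // "Hello, 世"
}
```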

The unicode Package

The unicode standard-library package provides classification functions that mirror the Unicode Character Database:

import "unicode"

unicode.IsLetter('A')          // true
unicode.IsLetter('7')          // false
unicode.IsDigit('٣')           // true  (Arabic-Indic digit 3)
unicode.IsSpace(' ')           // true
unicode.IsSpace('\u00A0')      // true  (non-breaking space)
unicode.IsPunct('!')           // true
unicode.IsUpper('Σ')           // true
unicode.ToLower('Σ')           // 'σ'
unicode.ToUpper('é')           // 'É'

Script and Category Tables

The unicode package exports tables for every Unicode script and general category:

unicode.Is(unicode.Greek, 'Σ')       // true
unicode.Is(unicode.Cyrillic, 'Д')    // true
unicode.Is(unicode.Han, '漢')        // true

// Check if a rune is a Unicode symbol
unicode.Is(unicode.S, '€')          // true  (S = Symbol)
unicode.Is(unicode.Sm, '+')         // true  (Sm = Symbol, math)
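As an illustration, these tables compose naturally into small classifiers. The helper below (scriptOf is a hypothetical name, not part of the package) checks a rune against a few script tables and reports the first match; since scripts are disjoint, the map's iteration order does not matter:

```go
package main

import (
	"fmt"
	"unicode"
)

// scriptOf reports which of a few well-known scripts r belongs to.
func scriptOf(r rune) string {
	scripts := map[string]*unicode.RangeTable{
		"Latin":    unicode.Latin,
		"Greek":    unicode.Greek,
		"Cyrillic": unicode.Cyrillic,
		"Han":      unicode.Han,
	}
	for name, table := range scripts {
		if unicode.Is(table, r) {
			return name
		}
	}
	return "other"
}

func main() {
	fmt.Println(scriptOf('Σ')) // Greek
	fmt.Println(scriptOf('漢')) // Han
	fmt.Println(scriptOf('€')) // other (€ is a symbol, not a script letter)
}
```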

The unicode/utf8 Package

The unicode/utf8 package provides low-level UTF-8 encoding and decoding:

import "unicode/utf8"

s := "Hello, 世界"

// Count runes
utf8.RuneCountInString(s)            // 9

// Decode first rune
r, size := utf8.DecodeRuneInString(s)
// r = 'H', size = 1

// Decode last rune
r, size = utf8.DecodeLastRuneInString(s)
// r = '界', size = 3

// Check validity
utf8.ValidString(s)                   // true
utf8.ValidString("\xFF\xFE")          // false

// Rune byte length
utf8.RuneLen('A')                     // 1
utf8.RuneLen('世')                    // 3
utf8.RuneLen('🎵')                    // 4

UTF-8 Byte Lengths

Code Point Range       UTF-8 Bytes   Example
U+0000 -- U+007F       1             A (0x41)
U+0080 -- U+07FF       2             é (0xC3 0xA9)
U+0800 -- U+FFFF       3             世 (0xE4 0xB8 0x96)
U+10000 -- U+10FFFF    4             🎵 (0xF0 0x9F 0x8E 0xB5)

The strings Package

Go's strings package is byte-oriented but provides functions that work correctly with UTF-8:

import "strings"

strings.ToUpper("café")               // "CAFÉ"
strings.ToLower("ΣΕΛΉΝΗ")             // "σελήνη"
strings.Contains("Hello 🌍", "🌍")    // true
strings.Count("banana", "a")          // 3
strings.TrimSpace(" hello\t")         // "hello"

// Map transforms each rune
rot13 := strings.Map(func(r rune) rune {
    if r >= 'a' && r <= 'z' {
        return 'a' + (r-'a'+13)%26
    }
    return r
}, "hello")
// "uryyb"

String Slicing Gotchas

Because strings are byte slices, slicing at an arbitrary byte offset can split a multi-byte UTF-8 sequence:

s := "café"

// WRONG: slicing in the middle of é (bytes 3-4)
bad := s[:4]        // "caf\xC3"  ← invalid UTF-8

// CORRECT: find the right byte boundary
runes := []rune(s)
good := string(runes[:3])  // "caf"

If you know your text is ASCII-only, byte slicing is safe and efficient. Otherwise, convert to []rune first, or use range to find code point boundaries.
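One allocation-free approach, sketched here with utf8.RuneStart (which reports whether a byte can begin a UTF-8 sequence), is to back the cut position up until it lands on a rune boundary; truncateUTF8 is an illustrative name, not a standard-library function:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// truncateUTF8 cuts s to at most max bytes, backing up so the cut
// never lands inside a multi-byte sequence.
func truncateUTF8(s string, max int) string {
	if len(s) <= max {
		return s
	}
	// Continuation bytes have the form 10xxxxxx; back up past them.
	for max > 0 && !utf8.RuneStart(s[max]) {
		max--
	}
	return s[:max]
}

func main() {
	fmt.Println(truncateUTF8("café", 4)) // "caf" (not the invalid "caf\xC3")
}
```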

Normalization

The standard library does not include Unicode normalization, but the official golang.org/x/text module provides it:

import "golang.org/x/text/unicode/norm"

decomposed := "e\u0301"       // NFD: e + combining acute
normalized := norm.NFC.String(decomposed)
fmt.Println(normalized == "\u00E9")   // true

// Stream-based normalization
reader := norm.NFC.Reader(strings.NewReader(decomposed))

The golang.org/x/text module also provides:

  • transform -- streaming text transformations
  • collate -- locale-aware sorting
  • language -- BCP 47 language tag parsing
  • cases -- Unicode-aware case conversion
  • width -- East Asian width folding

Regular Expressions

Go's regexp package uses the RE2 engine, which handles UTF-8 natively:

import "regexp"

// \pL matches any Unicode letter
re := regexp.MustCompile(`\pL+`)
re.FindAllString("café 日本語 Москва", -1)
// ["café", "日本語", "Москва"]

// Named Unicode categories
regexp.MustCompile(`\p{Greek}+`)      // Greek letters
regexp.MustCompile(`\p{Cyrillic}+`)   // Cyrillic letters
regexp.MustCompile(`\p{Han}+`)        // CJK ideographs
regexp.MustCompile(`\p{Nd}+`)         // decimal digits (any script)

Grapheme Clusters

A single visible character can span multiple runes (e.g., flag emoji, skin-tone modified emoji, combined diacritics). Neither the standard library nor golang.org/x/text provides stable grapheme cluster segmentation; third-party packages such as github.com/rivo/uniseg handle this:

import "github.com/rivo/uniseg"

s := "👨‍👩‍👧‍👦"   // family emoji (7 runes, 1 grapheme cluster)
fmt.Println(utf8.RuneCountInString(s))    // 7
fmt.Println(uniseg.GraphemeClusterCount(s)) // 1

Quick Reference

Task                 Code
Byte length          len(s)
Rune count           utf8.RuneCountInString(s)
Iterate runes        for i, r := range s { ... }
String to runes      []rune(s)
Rune to string       string(r)
Classify rune        unicode.IsLetter(r)
Check script         unicode.Is(unicode.Greek, r)
Validate UTF-8       utf8.ValidString(s)
Normalize NFC        norm.NFC.String(s)
Regex Unicode        regexp.MustCompile(`\pL+`)

Go's model -- UTF-8 strings by convention, explicit rune type for code points, and powerful standard-library packages -- makes Unicode handling both efficient and safe. The key discipline is to remember that len() counts bytes, range iterates runes, and neither one counts grapheme clusters. When you need grapheme-level accuracy, reach for golang.org/x/text or a segmentation library.
