Unicode in Ruby
Ruby strings carry an explicit encoding, with UTF-8 being the default since Ruby 2.0, allowing developers to work with international text without most of the pitfalls found in older languages. This guide explains Ruby's Encoding class, encoding conversion, and how to handle Unicode normalization and grapheme clusters in Ruby.
Ruby's approach to Unicode is distinctive: every string object carries its encoding as metadata. Rather than declaring a global encoding for all strings (as Python 3 does with Unicode or Go does with UTF-8), Ruby lets individual strings have different encodings and provides methods to convert between them. Since Ruby 2.0, the default source encoding and string encoding is UTF-8, making Unicode handling straightforward for modern Ruby code.
Encoding-Aware Strings
Every Ruby string has an associated encoding, accessible via the encoding
method:
s = "Hello, 世界"
s.encoding # => #<Encoding:UTF-8>
s.bytes.to_a # => [72, 101, 108, 108, 111, 44, 32, 228, 184, 150, 231, 149, 140]
s.length # => 9 (characters, not bytes)
s.bytesize # => 13 (bytes)
Two critical observations:
length (aliased as size) counts characters (code points), not bytes; bytesize counts raw bytes
This is the opposite of Go (where len() counts bytes) and similar to Python 3
(where len() counts code points).
The Magic Comment (Pre-Ruby 2.0)
Before Ruby 2.0, the default source encoding was US-ASCII. To use Unicode in source files, you needed a magic comment:
# encoding: utf-8
# or
# -*- coding: utf-8 -*-
name = "日本語" # would raise SyntaxError without the comment in Ruby 1.9
Since Ruby 2.0, UTF-8 is the default. The magic comment is no longer needed for UTF-8 source files, but it is still supported for other encodings.
String Encoding Types
Ruby supports over 100 encodings. The most commonly used:
| Encoding | Identifier | Bytes per Character |
|---|---|---|
| UTF-8 | Encoding::UTF_8 | 1–4 |
| ASCII | Encoding::US_ASCII | 1 |
| ISO-8859-1 (Latin-1) | Encoding::ISO_8859_1 | 1 |
| Shift_JIS | Encoding::Shift_JIS | 1–2 |
| EUC-JP | Encoding::EUC_JP | 1–3 |
| UTF-16LE | Encoding::UTF_16LE | 2–4 |
| UTF-32LE | Encoding::UTF_32LE | 4 |
| ASCII-8BIT (binary) | Encoding::ASCII_8BIT | 1 |
ASCII-8BIT: The Binary Encoding
ASCII-8BIT (aliased as BINARY) is Ruby's way of saying "these are raw bytes
with no character semantics." It is used for binary data, network I/O, and
situations where you do not want Ruby to interpret the bytes as text:
data = "\xFF\xFE".b # .b forces ASCII-8BIT encoding
data.encoding # => #<Encoding:ASCII-8BIT>
data.valid_encoding? # => true (any byte sequence is valid in BINARY)
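One practical consequence, sketched below: a BINARY string can be combined only with ASCII-only text; mixing it with non-ASCII UTF-8 raises Encoding::CompatibilityError.

```ruby
# Array#pack("C*") produces an ASCII-8BIT (BINARY) string
header = [0xFF, 0xFE].pack("C*")
header.encoding    # => #<Encoding:ASCII-8BIT>

header + "abc"     # OK: ASCII-only strings are compatible with BINARY
# header + "日本語"  # => Encoding::CompatibilityError
```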
Creating and Converting Strings
Encoding Conversion
utf8_str = "café"
utf8_str.encoding # => #<Encoding:UTF-8>
# Convert to ISO-8859-1 (Latin-1)
latin1 = utf8_str.encode("ISO-8859-1")
latin1.encoding # => #<Encoding:ISO-8859-1>
latin1.bytesize # => 4 (é is 1 byte in Latin-1)
# Convert back
back = latin1.encode("UTF-8")
back == utf8_str # => true
Handling Invalid Bytes
# Replace invalid/undefined characters
bad = "\xFF".force_encoding("UTF-8")
bad.valid_encoding? # => false
# Scrub replaces invalid bytes with a replacement character
bad.scrub # => "�" (U+FFFD)
bad.scrub("?") # => "?"
bad.scrub { |bytes| "<#{bytes.unpack1('H*')}>" } # => "<ff>"
# encode with fallback options
"café".encode("ASCII", fallback: ->(_) { "?" })
# => "caf?"
# Using invalid/undef/replace options
"café".encode("ASCII", invalid: :replace, undef: :replace, replace: "?")
# => "caf?"
The scrub method (Ruby 2.1+) is the recommended way to sanitize strings with
potentially invalid byte sequences.
Unicode Code Points and Characters
Escape Sequences
Ruby supports Unicode escapes in double-quoted strings and regex:
# \u with 4 hex digits (BMP only)
"\u0041" # => "A"
"\u00E9" # => "é"
"\u4E16" # => "世"
# \u{} with 1-6 hex digits (full Unicode range)
"\u{1F600}" # => "😀"
"\u{41 42 43}" # => "ABC" (multiple code points)
"\u{1F1EF 1F1F5}" # => "🇯🇵" (flag: Japan)
The \u{} syntax is especially useful because it handles supplementary
characters without surrogate pairs and allows multiple code points in one
expression.
Accessing Code Points
s = "Hello 😀"
# Get code points as an array
s.codepoints # => [72, 101, 108, 108, 111, 32, 128512]
# Convert code point to character
128512.chr(Encoding::UTF_8) # => "😀"
0x1F600.chr(Encoding::UTF_8) # => "😀"
# ord — first character's code point
"A".ord # => 65
"😀".ord # => 128512
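The inverse direction, building a string from an array of code points, is handled by Array#pack with the "U*" directive:

```ruby
# "U*" packs integers as UTF-8 code points — the inverse of String#codepoints
cps = "héllo".codepoints   # => [104, 233, 108, 108, 111]
cps.pack("U*")             # => "héllo"
[0x1F600, 0x21].pack("U*") # => "😀!"
```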
Character Iteration
s = "café"
# Iterate over characters (code points)
s.each_char { |c| print "#{c} " } # c a f é
# Iterate over bytes
s.each_byte { |b| printf("%02X ", b) } # 63 61 66 C3 A9
# Iterate over code points
s.each_codepoint { |cp| printf("U+%04X ", cp) }
# U+0063 U+0061 U+0066 U+00E9
String Methods and Unicode
Ruby's string methods are encoding-aware and work correctly with multi-byte characters:
Length and Indexing
s = "日本語"
s.length # => 3 (characters)
s.bytesize # => 9 (bytes: 3 chars x 3 bytes each)
s[0] # => "日"
s[1] # => "本"
s[-1] # => "語"
s[0..1] # => "日本"
Unlike C and Go, Ruby's [] operator indexes by character position, not
byte offset. This is safe and intuitive for Unicode text.
Case Conversion
"café".upcase # => "CAFÉ"
"ΣΕΛΉΝΗ".downcase # => "σελήνη"
"straße".upcase # => "STRASSE" (ß → SS)
"hello world".capitalize # => "Hello world"
# Turkish locale requires explicit handling
"I".downcase(:turkic) # => "ı" (Ruby 2.4+)
"i".upcase(:turkic) # => "İ"
Ruby 2.4 introduced locale-aware case conversion options (:turkic,
:lithuanian).
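Case conversion pairs naturally with Unicode case folding for case-insensitive comparison; a small sketch using casecmp? and the :fold option (both Ruby 2.4+):

```ruby
# casecmp? compares after Unicode case folding
"straße".casecmp?("STRASSE") # => true (ß folds to ss)
"Σ".casecmp?("σ")            # => true
# :fold is a downcase option that applies full case folding
"ß".downcase(:fold)          # => "ss"
```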
Pattern Matching with Unicode
Ruby's regex engine (Onigmo) has excellent Unicode support:
# \p{} property escapes
"café 日本語".scan(/\p{L}+/) # => ["café", "日本語"]
"hello 123".scan(/\p{N}+/) # => ["123"]
"café".scan(/\p{Latin}+/) # => ["café"]
"Москва".scan(/\p{Cyrillic}+/) # => ["Москва"]
# \p{} general categories
/\p{Lu}/ # Uppercase letter
/\p{Ll}/ # Lowercase letter
/\p{Nd}/ # Decimal digit
/\p{Sm}/ # Math symbol
/\p{Sc}/ # Currency symbol
# Named scripts
/\p{Han}/ # CJK ideographs
/\p{Greek}/ # Greek letters
/\p{Arabic}/ # Arabic letters
/\p{Emoji}/ # Emoji characters (Ruby 2.5+)
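Onigmo also supports \X, which matches one extended grapheme cluster rather than one code point — useful when combining marks are involved:

```ruby
decomposed = "e\u0301"       # "é" as e + combining acute accent
decomposed.length            # => 2 code points
decomposed.scan(/./).length  # => 2 — /./ matches single code points
decomposed.scan(/\X/).length # => 1 — \X matches the whole cluster
```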
String Comparison and Normalization
Since Ruby 2.2, the core library includes a built-in unicode_normalize method. On
older Rubies, normalization required ActiveSupport (from Rails) or the
unicode_utils gem:
# Ruby 2.2+ built-in normalization
composed = "\u00E9" # é (NFC)
decomposed = "e\u0301" # é (NFD)
composed == decomposed # => false
composed == decomposed.unicode_normalize(:nfc) # => true
# Available forms
"text".unicode_normalize(:nfc) # NFC (recommended for storage)
"text".unicode_normalize(:nfd) # NFD
"text".unicode_normalize(:nfkc) # NFKC
"text".unicode_normalize(:nfkd) # NFKD
# Check normalization
"text".unicode_normalized?(:nfc) # => true/false
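A hypothetical helper, unicode_equal?, shows the typical pattern: normalize both sides to the same form before comparing:

```ruby
# Hypothetical helper: equality that ignores normalization differences
def unicode_equal?(a, b)
  a.unicode_normalize(:nfc) == b.unicode_normalize(:nfc)
end

unicode_equal?("\u00E9", "e\u0301") # => true (composed vs. decomposed é)
unicode_equal?("a", "b")            # => false
```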
File I/O and Encoding
Reading Files
# Default: UTF-8
text = File.read("data.txt")
text.encoding # => #<Encoding:UTF-8>
# Specify encoding explicitly
text = File.read("legacy.txt", encoding: "Shift_JIS")
text.encoding # => #<Encoding:Shift_JIS>
# Read as binary, then relabel and validate
raw = File.binread("data.bin") # ASCII-8BIT
raw.force_encoding("UTF-8") # relabel first — in BINARY, every byte sequence is "valid"
raw = raw.scrub unless raw.valid_encoding?
# Read with encoding conversion
text = File.read("legacy.txt", encoding: "Shift_JIS:UTF-8")
# Reads as Shift_JIS, converts to UTF-8
text.encoding # => #<Encoding:UTF-8>
Writing Files
File.write("output.txt", "日本語\n") # default UTF-8
File.write("output.txt", text, encoding: "UTF-16LE") # explicit encoding
External vs. Internal Encoding
Ruby distinguishes between external encoding (how data is stored on disk) and internal encoding (how it is represented in memory):
# Set default encodings
Encoding.default_external = Encoding::UTF_8
Encoding.default_internal = Encoding::UTF_8
# Per-file encoding
File.open("data.txt", "r:Shift_JIS:UTF-8") do |f|
line = f.gets
line.encoding # => #<Encoding:UTF-8> (auto-converted from Shift_JIS)
end
Common Pitfalls
1. Byte Position vs. Character Position
s = "café"
s.bytesize # => 5
s.length # => 4
s.byteslice(3) # => "\xC3" (raw byte, NOT a valid character)
s[3] # => "é" (correct character access)
Use [] for character access and byteslice only when you explicitly need
byte-level operations.
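A common real-world task is cutting a UTF-8 string down to a byte budget (say, for a database column) without splitting a character. A sketch, using a hypothetical truncate_bytes helper (ActiveSupport ships a similar String#truncate_bytes):

```ruby
# Hypothetical helper: cut to at most max_bytes without splitting a character
def truncate_bytes(str, max_bytes)
  sliced = str.byteslice(0, max_bytes)
  # Trailing bytes of a split character make the slice invalid; drop them
  sliced = sliced.byteslice(0, sliced.bytesize - 1) until sliced.valid_encoding?
  sliced
end

truncate_bytes("café", 4)  # => "caf" ("é" needs 2 bytes, so it is dropped whole)
truncate_bytes("日本語", 4) # => "日" (each character is 3 bytes)
```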
2. Encoding Mismatch Errors
utf8 = "hello"
latin1 = "world".encode("ISO-8859-1")
utf8 + latin1 # => Encoding::CompatibilityError!
# Fix: convert to the same encoding first
utf8 + latin1.encode("UTF-8") # => "helloworld"
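When strings arrive from mixed sources, Encoding.compatible? can check in advance whether two strings may be combined — it returns the resulting encoding, or nil if concatenation would raise:

```ruby
utf8   = "café"
latin1 = "caf\xE9".force_encoding("ISO-8859-1")

Encoding.compatible?(utf8, latin1)  # => nil — utf8 + latin1 would raise
Encoding.compatible?(utf8, "hello") # => #<Encoding:UTF-8>
```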
3. frozen_string_literal and Encoding
# frozen_string_literal: true
s = "café"
s.encoding # => #<Encoding:UTF-8>
# s << "!" # => FrozenError (cannot modify frozen string)
Frozen strings still carry encoding information and can be compared and searched, but not mutated.
4. Grapheme Clusters
Ruby's length counts code points, not grapheme clusters:
family = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}"
family.length # => 7 (code points)
# Visually: one family emoji 👨👩👧👦
# Ruby 2.5+ grapheme cluster support
family.grapheme_clusters.length # => 1
The grapheme_clusters method (Ruby 2.5+) provides UAX #29 grapheme
segmentation.
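Grapheme-aware processing matters for operations like reversal, where reversing by code points tears clusters apart; a sketch:

```ruby
s = "a\u0301bc"                  # "ábc" with a combining acute accent
s.reverse                        # reverses code points: the accent jumps to "b"
s.grapheme_clusters.reverse.join # => "cbá" — clusters stay intact
```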
Quick Reference
| Task | Code |
|---|---|
| Character count | s.length |
| Byte count | s.bytesize |
| Get encoding | s.encoding |
| Check validity | s.valid_encoding? |
| Sanitize | s.scrub |
| Code points | s.codepoints |
| Iterate chars | s.each_char { \|c\| ... } |
| Char to code point | c.ord |
| Code point to char | cp.chr(Encoding::UTF_8) |
| Convert encoding | s.encode("UTF-8") |
| Normalize NFC | s.unicode_normalize(:nfc) |
| Grapheme clusters | s.grapheme_clusters |
| Regex Unicode | /\p{L}+/ |
| Force encoding | s.force_encoding("UTF-8") |
Ruby's encoding-aware string model is one of the most flexible among mainstream
languages. The combination of per-string encoding metadata, safe character-based
indexing, built-in normalization, and powerful regex Unicode support makes Ruby
well-suited for applications that process text from multiple sources and
encodings. The key is to ensure all text is converted to UTF-8 as early as
possible in your pipeline, and to use valid_encoding? and scrub to handle
any malformed input gracefully.