💻 Unicode in Code

Unicode in Ruby

Ruby strings carry an explicit encoding, with UTF-8 being the default since Ruby 2.0, allowing developers to work with international text without most of the pitfalls found in older languages. This guide explains Ruby's Encoding class, encoding conversion, and how to handle Unicode normalization and grapheme clusters in Ruby.


Ruby's approach to Unicode is distinctive: every string object carries its encoding as metadata. Rather than fixing a single internal representation for all strings (as Python 3 does with its abstract Unicode strings, or Go does with UTF-8 byte slices), Ruby lets individual strings have different encodings and provides methods to convert between them. Since Ruby 2.0, the default source encoding and string literal encoding is UTF-8, making Unicode handling straightforward in modern Ruby code.

Encoding-Aware Strings

Every Ruby string has an associated encoding, accessible via the encoding method:

s = "Hello, 世界"
s.encoding          # => #<Encoding:UTF-8>
s.bytes.to_a        # => [72, 101, 108, 108, 111, 44, 32, 228, 184, 150, 231, 149, 140]
s.length            # => 9  (characters, not bytes)
s.bytesize          # => 13 (bytes)

Two critical observations:

  1. length (aliased as size) counts characters (code points), not bytes
  2. bytesize counts raw bytes

This is the opposite of Go (where len() counts bytes) and similar to Python 3 (where len() counts code points).

The Magic Comment (Pre-Ruby 2.0)

Before Ruby 2.0, the default source encoding was US-ASCII. To use Unicode in source files, you needed a magic comment:

# encoding: utf-8
# or
# -*- coding: utf-8 -*-

name = "日本語"    # would raise SyntaxError without the comment in Ruby 1.9

Since Ruby 2.0, UTF-8 is the default. The magic comment is no longer needed for UTF-8 source files, but it is still supported for other encodings.

String Encoding Types

Ruby supports over 100 encodings. The most commonly used:

| Encoding | Identifier | Bytes per character |
|----------|------------|---------------------|
| UTF-8 | Encoding::UTF_8 | 1–4 |
| ASCII | Encoding::US_ASCII | 1 |
| ISO-8859-1 (Latin-1) | Encoding::ISO_8859_1 | 1 |
| Shift_JIS | Encoding::Shift_JIS | 1–2 |
| EUC-JP | Encoding::EUC_JP | 1–3 |
| UTF-16LE | Encoding::UTF_16LE | 2–4 |
| UTF-32LE | Encoding::UTF_32LE | 4 |
| ASCII-8BIT (binary) | Encoding::ASCII_8BIT | 1 |

ASCII-8BIT: The Binary Encoding

ASCII-8BIT (aliased as BINARY) is Ruby's way of saying "these are raw bytes with no character semantics." It is used for binary data, network I/O, and situations where you do not want Ruby to interpret the bytes as text:

data = "\xFF\xFE".b         # .b forces ASCII-8BIT encoding
data.encoding                # => #<Encoding:ASCII-8BIT>
data.valid_encoding?         # => true (any byte sequence is valid in BINARY)

Creating and Converting Strings

Encoding Conversion

utf8_str = "café"
utf8_str.encoding               # => #<Encoding:UTF-8>

# Convert to ISO-8859-1 (Latin-1)
latin1 = utf8_str.encode("ISO-8859-1")
latin1.encoding                 # => #<Encoding:ISO-8859-1>
latin1.bytesize                 # => 4 (é is 1 byte in Latin-1)

# Convert back
back = latin1.encode("UTF-8")
back == utf8_str                # => true

Handling Invalid Bytes

# Replace invalid/undefined characters
bad = "\xFF".force_encoding("UTF-8")
bad.valid_encoding?             # => false

# Scrub replaces invalid bytes with a replacement character
bad.scrub                       # => "�"  (U+FFFD)
bad.scrub("?")                  # => "?"
bad.scrub { |bytes| "<#{bytes.unpack1('H*')}>" }  # => "<ff>"

# encode with fallback options
"café".encode("ASCII", fallback: ->(_) { "?" })
# => "caf?"

# Using invalid/undef/replace options
"café".encode("ASCII", invalid: :replace, undef: :replace, replace: "?")
# => "caf?"

The scrub method (Ruby 2.1+) is the recommended way to sanitize strings with potentially invalid byte sequences.

Unicode Code Points and Characters

Escape Sequences

Ruby supports Unicode escapes in double-quoted strings and regex:

# \u with 4 hex digits (BMP only)
"\u0041"                        # => "A"
"\u00E9"                        # => "é"
"\u4E16"                        # => "世"

# \u{} with 1-6 hex digits (full Unicode range)
"\u{1F600}"                     # => "😀"
"\u{41 42 43}"                  # => "ABC" (multiple code points)
"\u{1F1EF 1F1F5}"               # => "🇯🇵" (flag: Japan)

The \u{} syntax is especially useful because it handles supplementary characters without surrogate pairs and allows multiple code points in one expression.

Accessing Code Points

s = "Hello 😀"

# Get code points as an array
s.codepoints                    # => [72, 101, 108, 108, 111, 32, 128512]

# Convert code point to character
128512.chr(Encoding::UTF_8)     # => "😀"
0x1F600.chr(Encoding::UTF_8)    # => "😀"

# ord — first character's code point
"A".ord                         # => 65
"😀".ord                        # => 128512

Character Iteration

s = "café"

# Iterate over characters (code points)
s.each_char { |c| print "#{c} " }      # c a f é

# Iterate over bytes
s.each_byte { |b| printf("%02X ", b) }  # 63 61 66 C3 A9

# Iterate over code points
s.each_codepoint { |cp| printf("U+%04X ", cp) }
# U+0063 U+0061 U+0066 U+00E9

String Methods and Unicode

Ruby's string methods are encoding-aware and work correctly with multi-byte characters:

Length and Indexing

s = "日本語"
s.length          # => 3  (characters)
s.bytesize        # => 9  (bytes: 3 chars x 3 bytes each)
s[0]              # => "日"
s[1]              # => "本"
s[-1]             # => "語"
s[0..1]           # => "日本"

Unlike C and Go, Ruby's [] operator indexes by character position, not byte offset. This is safe and intuitive for Unicode text.

Case Conversion

"café".upcase           # => "CAFÉ"
"ΣΕΛΉΝΗ".downcase       # => "σελήνη"
"straße".upcase         # => "STRASSE"  (ß → SS)
"hello world".capitalize # => "Hello world"

# Turkish locale requires explicit handling
"I".downcase(:turkic)   # => "ı"  (Ruby 2.4+)
"i".upcase(:turkic)     # => "İ"

Ruby 2.4 introduced locale-aware case conversion options (:turkic, :lithuanian).

Pattern Matching with Unicode

Ruby's regex engine (Onigmo) has excellent Unicode support:

# \p{} property escapes
"café 日本語".scan(/\p{L}+/)        # => ["café", "日本語"]
"hello 123".scan(/\p{N}+/)          # => ["123"]
"café".scan(/\p{Latin}+/)           # => ["café"]
"Москва".scan(/\p{Cyrillic}+/)      # => ["Москва"]

# \p{} general categories
/\p{Lu}/       # Uppercase letter
/\p{Ll}/       # Lowercase letter
/\p{Nd}/       # Decimal digit
/\p{Sm}/       # Math symbol
/\p{Sc}/       # Currency symbol

# Named scripts
/\p{Han}/      # CJK ideographs
/\p{Greek}/    # Greek letters
/\p{Arabic}/   # Arabic letters
/\p{Emoji}/    # Emoji characters (Ruby 2.5+)

String Comparison and Normalization

Before Ruby 2.2, the core library had no normalization support; you had to reach for ActiveSupport (from Rails) or the unicode_utils gem. Since Ruby 2.2, String#unicode_normalize is built in:

# Ruby 2.2+ built-in normalization
composed   = "\u00E9"           # é (NFC)
decomposed = "e\u0301"          # é (NFD)

composed == decomposed                     # => false
composed == decomposed.unicode_normalize(:nfc)   # => true

# Available forms
"text".unicode_normalize(:nfc)     # NFC (recommended for storage)
"text".unicode_normalize(:nfd)     # NFD
"text".unicode_normalize(:nfkc)    # NFKC
"text".unicode_normalize(:nfkd)    # NFKD

# Check normalization
"text".unicode_normalized?(:nfc)   # => true/false

File I/O and Encoding

Reading Files

# Default: UTF-8
text = File.read("data.txt")
text.encoding                    # => #<Encoding:UTF-8>

# Specify encoding explicitly
text = File.read("legacy.txt", encoding: "Shift_JIS")
text.encoding                    # => #<Encoding:Shift_JIS>

# Read as binary, then reinterpret as UTF-8
raw = File.binread("data.bin")   # ASCII-8BIT
raw.force_encoding("UTF-8")      # reinterpret the bytes (no conversion)
raw = raw.scrub unless raw.valid_encoding?

# Read with encoding conversion
text = File.read("legacy.txt", encoding: "Shift_JIS:UTF-8")
# Reads as Shift_JIS, converts to UTF-8
text.encoding                    # => #<Encoding:UTF-8>

Writing Files

File.write("output.txt", "日本語\n")           # default UTF-8
File.write("output.txt", text, encoding: "UTF-16LE")  # explicit encoding

External and Internal Encodings

Ruby distinguishes between external encoding (how data is stored on disk) and internal encoding (how it is represented in memory):

# Set default encodings
Encoding.default_external = Encoding::UTF_8
Encoding.default_internal = Encoding::UTF_8

# Per-file encoding
File.open("data.txt", "r:Shift_JIS:UTF-8") do |f|
  line = f.gets
  line.encoding    # => #<Encoding:UTF-8>  (auto-converted from Shift_JIS)
end

Common Pitfalls

1. Byte Position vs. Character Position

s = "café"
s.bytesize       # => 5
s.length         # => 4
s.byteslice(3)   # => "\xC3"  (raw byte, NOT a valid character)
s[3]             # => "é"     (correct character access)

Use [] for character access and byteslice only when you explicitly need byte-level operations.
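When you genuinely need byte offsets (for example, to interoperate with a byte-oriented API), Ruby 3.1 added byteindex, which pairs naturally with byteslice. A small sketch, assuming Ruby 3.1+:

```ruby
s = "café"

# byteindex (Ruby 3.1+) returns the byte offset of a substring,
# which can then be handed to byteslice with the substring's bytesize.
offset = s.byteindex("é")           # => 3
s.byteslice(offset, "é".bytesize)   # => "é" (both bytes of the character)
```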

2. Encoding Mismatch Errors

utf8 = "hello"
latin1 = "world".encode("ISO-8859-1")
utf8 + latin1    # => Encoding::CompatibilityError!

# Fix: convert to the same encoding first
utf8 + latin1.encode("UTF-8")   # => "helloworld"

3. frozen_string_literal and Encoding

# frozen_string_literal: true

s = "café"
s.encoding       # => #<Encoding:UTF-8>
# s << "!"       # => FrozenError (cannot modify frozen string)

Frozen strings still carry encoding information and can be compared and searched, but not mutated.
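When you do need to mutate such a string, take an unfrozen copy first; unary plus (String#+@, Ruby 2.3+) is the idiomatic way:

```ruby
s = "café".freeze

copy = +s        # unary + returns an unfrozen copy of a frozen string
copy.frozen?     # => false
copy << "!"      # mutating the copy is fine
copy             # => "café!"
s                # => "café" (original untouched)
```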

4. Grapheme Clusters

Ruby's length counts code points, not grapheme clusters:

family = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}"
family.length           # => 7 (code points)
# Visually: one family emoji 👨‍👩‍👧‍👦

# Ruby 2.5+ grapheme cluster support
family.grapheme_clusters.length  # => 1

The grapheme_clusters method (Ruby 2.5+) provides UAX #29 grapheme segmentation.
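One concrete payoff: reversing a string code point by code point can detach combining marks, while grapheme-aware reversal keeps user-perceived characters intact. A minimal sketch:

```ruby
s = "e\u0301a"   # "éa" built from e + combining acute accent, then a

s.reverse                         # => "áe" — the accent jumped onto the a!
s.grapheme_clusters.reverse.join  # => "aé" — clusters stay intact
```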

Quick Reference

| Task | Code |
|------|------|
| Character count | s.length |
| Byte count | s.bytesize |
| Get encoding | s.encoding |
| Check validity | s.valid_encoding? |
| Sanitize | s.scrub |
| Code points | s.codepoints |
| Iterate chars | s.each_char { \|c\| ... } |
| Char to code point | c.ord |
| Code point to char | cp.chr(Encoding::UTF_8) |
| Convert encoding | s.encode("UTF-8") |
| Normalize NFC | s.unicode_normalize(:nfc) |
| Grapheme clusters | s.grapheme_clusters |
| Regex Unicode | /\p{L}+/ |
| Force encoding | s.force_encoding("UTF-8") |

Ruby's encoding-aware string model is one of the most flexible among mainstream languages. The combination of per-string encoding metadata, safe character-based indexing, built-in normalization, and powerful regex Unicode support makes Ruby well-suited for applications that process text from multiple sources and encodings. The key is to ensure all text is converted to UTF-8 as early as possible in your pipeline, and to use valid_encoding? and scrub to handle any malformed input gracefully.
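That "convert early, sanitize always" advice can be sketched as a single boundary helper; to_clean_utf8 below is a hypothetical function illustrating the pattern, not a standard API:

```ruby
# Hypothetical boundary helper: whatever encoding (or garbage) comes in,
# what comes out is valid UTF-8.
def to_clean_utf8(str, source_encoding: Encoding::UTF_8)
  str.dup
     .force_encoding(source_encoding)             # trust the declared source encoding
     .encode(Encoding::UTF_8,                     # convert to UTF-8...
             invalid: :replace, undef: :replace)  # ...replacing unmappable characters
     .scrub                                       # catch any remaining invalid bytes
end

to_clean_utf8("caf\xC3\xA9".b)   # => "café"
to_clean_utf8("\xFF hello".b)    # => "� hello"
```

The trailing scrub matters: encoding a string to its own encoding is a no-op in Ruby, so invalid bytes in already-UTF-8 input would otherwise slip through the encode step.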
