💻 Unicode in Code

Unicode in Ruby

Ruby strings carry an explicit encoding, with UTF-8 being the default since Ruby 2.0, allowing developers to work with international text without most of the pitfalls found in older languages. This guide explains Ruby's Encoding class, encoding conversion, and how to handle Unicode normalization and grapheme clusters in Ruby.


Ruby's approach to Unicode is distinctive: every string object carries its encoding as metadata. Rather than fixing a single internal representation for all strings (as Python 3 does with its abstract Unicode strings, or Go does with UTF-8 byte slices), Ruby lets individual strings have different encodings and provides methods to convert between them. Since Ruby 2.0, the default source encoding and string literal encoding is UTF-8, making Unicode handling straightforward in modern Ruby code.

Encoding-Aware Strings

Every Ruby string has an associated encoding, accessible via the encoding method:

s = "Hello, 世界"
s.encoding          # => #<Encoding:UTF-8>
s.bytes.to_a        # => [72, 101, 108, 108, 111, 44, 32, 228, 184, 150, 231, 149, 140]
s.length            # => 9  (characters, not bytes)
s.bytesize          # => 13 (bytes)

Two critical observations:

  1. length (aliased as size) counts characters (code points), not bytes
  2. bytesize counts raw bytes

This is the opposite of Go (where len() counts bytes) and similar to Python 3 (where len() counts code points).

The Magic Comment (Pre-Ruby 2.0)

Before Ruby 2.0, the default source encoding was US-ASCII. To use Unicode in source files, you needed a magic comment:

# encoding: utf-8
# or
# -*- coding: utf-8 -*-

name = "日本語"    # would raise SyntaxError without the comment in Ruby 1.9

Since Ruby 2.0, UTF-8 is the default. The magic comment is no longer needed for UTF-8 source files, but it is still supported for other encodings.

String Encoding Types

Ruby supports over 100 encodings. The most commonly used:

| Encoding | Identifier | Bytes per character |
|----------|------------|---------------------|
| UTF-8 | Encoding::UTF_8 | 1–4 |
| ASCII | Encoding::US_ASCII | 1 |
| ISO-8859-1 (Latin-1) | Encoding::ISO_8859_1 | 1 |
| Shift_JIS | Encoding::Shift_JIS | 1–2 |
| EUC-JP | Encoding::EUC_JP | 1–3 |
| UTF-16LE | Encoding::UTF_16LE | 2–4 |
| UTF-32LE | Encoding::UTF_32LE | 4 |
| ASCII-8BIT (binary) | Encoding::ASCII_8BIT | 1 |

ASCII-8BIT: The Binary Encoding

ASCII-8BIT (aliased as BINARY) is Ruby's way of saying "these are raw bytes with no character semantics." It is used for binary data, network I/O, and situations where you do not want Ruby to interpret the bytes as text:

data = "\xFF\xFE".b         # .b forces ASCII-8BIT encoding
data.encoding                # => #<Encoding:ASCII-8BIT>
data.valid_encoding?         # => true (any byte sequence is valid in BINARY)

Creating and Converting Strings

Encoding Conversion

utf8_str = "café"
utf8_str.encoding               # => #<Encoding:UTF-8>

# Convert to ISO-8859-1 (Latin-1)
latin1 = utf8_str.encode("ISO-8859-1")
latin1.encoding                 # => #<Encoding:ISO-8859-1>
latin1.bytesize                 # => 4 (é is 1 byte in Latin-1)

# Convert back
back = latin1.encode("UTF-8")
back == utf8_str                # => true

Handling Invalid Bytes

# Replace invalid/undefined characters
bad = "\xFF".force_encoding("UTF-8")
bad.valid_encoding?             # => false

# Scrub replaces invalid bytes with a replacement character
bad.scrub                       # => "�"  (U+FFFD)
bad.scrub("?")                  # => "?"
bad.scrub { |bytes| "<#{bytes.unpack1('H*')}>" }  # => "<ff>"

# encode with fallback options
"café".encode("ASCII", fallback: ->(_) { "?" })
# => "caf?"

# Using invalid/undef/replace options
"café".encode("ASCII", invalid: :replace, undef: :replace, replace: "?")
# => "caf?"

The scrub method (Ruby 2.1+) is the recommended way to sanitize strings with potentially invalid byte sequences.

Unicode Code Points and Characters

Escape Sequences

Ruby supports Unicode escapes in double-quoted strings and regex:

# \u with 4 hex digits (BMP only)
"\u0041"                        # => "A"
"\u00E9"                        # => "é"
"\u4E16"                        # => "世"

# \u{} with 1-6 hex digits (full Unicode range)
"\u{1F600}"                     # => "😀"
"\u{41 42 43}"                  # => "ABC" (multiple code points)
"\u{1F1EF 1F1F5}"               # => "🇯🇵" (flag: Japan)

The \u{} syntax is especially useful because it handles supplementary characters without surrogate pairs and allows multiple code points in one expression.

Accessing Code Points

s = "Hello 😀"

# Get code points as an array
s.codepoints                    # => [72, 101, 108, 108, 111, 32, 128512]

# Convert code point to character
128512.chr(Encoding::UTF_8)     # => "😀"
0x1F600.chr(Encoding::UTF_8)    # => "😀"

# ord — first character's code point
"A".ord                         # => 65
"😀".ord                        # => 128512

Character Iteration

s = "café"

# Iterate over characters (code points)
s.each_char { |c| print "#{c} " }      # c a f é

# Iterate over bytes
s.each_byte { |b| printf("%02X ", b) }  # 63 61 66 C3 A9

# Iterate over code points
s.each_codepoint { |cp| printf("U+%04X ", cp) }
# U+0063 U+0061 U+0066 U+00E9

String Methods and Unicode

Ruby's string methods are encoding-aware and work correctly with multi-byte characters:

Length and Indexing

s = "日本語"
s.length          # => 3  (characters)
s.bytesize        # => 9  (bytes: 3 chars x 3 bytes each)
s[0]              # => "日"
s[1]              # => "本"
s[-1]             # => "語"
s[0..1]           # => "日本"

Unlike C and Go, Ruby's [] operator indexes by character position, not byte offset. This is safe and intuitive for Unicode text.

Case Conversion

"café".upcase           # => "CAFÉ"
"ΣΕΛΉΝΗ".downcase       # => "σελήνη"
"straße".upcase         # => "STRASSE"  (ß → SS)
"hello world".capitalize # => "Hello world"

# Turkish locale requires explicit handling
"I".downcase(:turkic)   # => "ı"  (Ruby 2.4+)
"i".upcase(:turkic)     # => "İ"

Ruby 2.4 introduced locale-aware case conversion options (:turkic, :lithuanian).

Pattern Matching with Unicode

Ruby's regex engine (Onigmo) has excellent Unicode support:

# \p{} property escapes
"café 日本語".scan(/\p{L}+/)        # => ["café", "日本語"]
"hello 123".scan(/\p{N}+/)          # => ["123"]
"café".scan(/\p{Latin}+/)           # => ["café"]
"Москва".scan(/\p{Cyrillic}+/)      # => ["Москва"]

# \p{} general categories
/\p{Lu}/       # Uppercase letter
/\p{Ll}/       # Lowercase letter
/\p{Nd}/       # Decimal digit
/\p{Sm}/       # Math symbol
/\p{Sc}/       # Currency symbol

# Named scripts
/\p{Han}/      # CJK ideographs
/\p{Greek}/    # Greek letters
/\p{Arabic}/   # Arabic letters
/\p{Emoji}/    # Emoji characters (Ruby 2.5+)

String Comparison and Normalization

Before Ruby 2.2, the core library had no normalization support; you had to reach for ActiveSupport (from Rails) or the unicode_utils gem. Since Ruby 2.2, String#unicode_normalize is built in:

# Ruby 2.2+ built-in normalization
composed   = "\u00E9"           # é (NFC)
decomposed = "e\u0301"          # é (NFD)

composed == decomposed                     # => false
composed == decomposed.unicode_normalize(:nfc)   # => true

# Available forms
"text".unicode_normalize(:nfc)     # NFC (recommended for storage)
"text".unicode_normalize(:nfd)     # NFD
"text".unicode_normalize(:nfkc)    # NFKC
"text".unicode_normalize(:nfkd)    # NFKD

# Check normalization
"text".unicode_normalized?(:nfc)   # => true/false

File I/O and Encoding

Reading Files

# Default: UTF-8
text = File.read("data.txt")
text.encoding                    # => #<Encoding:UTF-8>

# Specify encoding explicitly
text = File.read("legacy.txt", encoding: "Shift_JIS")
text.encoding                    # => #<Encoding:Shift_JIS>

# Read as binary, then reinterpret as UTF-8
raw = File.binread("data.bin")   # ASCII-8BIT
raw.force_encoding("UTF-8")      # reinterpret the bytes (no conversion)
raw = raw.scrub unless raw.valid_encoding?

# Read with encoding conversion
text = File.read("legacy.txt", encoding: "Shift_JIS:UTF-8")
# Reads as Shift_JIS, converts to UTF-8
text.encoding                    # => #<Encoding:UTF-8>

Writing Files

File.write("output.txt", "日本語\n")           # default UTF-8
File.write("output.txt", text, encoding: "UTF-16LE")  # explicit encoding

External and Internal Encodings

Ruby distinguishes between external encoding (how data is stored on disk) and internal encoding (how it is represented in memory):

# Set default encodings
Encoding.default_external = Encoding::UTF_8
Encoding.default_internal = Encoding::UTF_8

# Per-file encoding
File.open("data.txt", "r:Shift_JIS:UTF-8") do |f|
  line = f.gets
  line.encoding    # => #<Encoding:UTF-8>  (auto-converted from Shift_JIS)
end

Common Pitfalls

1. Byte Position vs. Character Position

s = "café"
s.bytesize       # => 5
s.length         # => 4
s.byteslice(3)   # => "\xC3"  (raw byte, NOT a valid character)
s[3]             # => "é"     (correct character access)

Use [] for character access and byteslice only when you explicitly need byte-level operations.
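When you genuinely need byte offsets (for example, to interoperate with a byte-oriented API), Ruby 3.1 added byteindex, which pairs naturally with byteslice. A small sketch, assuming Ruby 3.1+:

```ruby
s = "café"

# byteindex (Ruby 3.1+) returns the byte offset of a substring,
# which can then be handed to byteslice with the substring's bytesize.
offset = s.byteindex("é")           # => 3
s.byteslice(offset, "é".bytesize)   # => "é" (both bytes of the character)
```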

2. Encoding Mismatch Errors

utf8 = "hello"
latin1 = "world".encode("ISO-8859-1")
utf8 + latin1    # => Encoding::CompatibilityError!

# Fix: convert to the same encoding first
utf8 + latin1.encode("UTF-8")   # => "helloworld"

3. frozen_string_literal and Encoding

# frozen_string_literal: true

s = "café"
s.encoding       # => #<Encoding:UTF-8>
# s << "!"       # => FrozenError (cannot modify frozen string)

Frozen strings still carry encoding information and can be compared and searched, but not mutated.
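When you do need to mutate such a string, take an unfrozen copy first; unary plus (String#+@, Ruby 2.3+) is the idiomatic way:

```ruby
s = "café".freeze

copy = +s        # unary + returns an unfrozen copy of a frozen string
copy.frozen?     # => false
copy << "!"      # mutating the copy is fine
copy             # => "café!"
s                # => "café" (original untouched)
```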

4. Grapheme Clusters

Ruby's length counts code points, not grapheme clusters:

family = "\u{1F468}\u{200D}\u{1F469}\u{200D}\u{1F467}\u{200D}\u{1F466}"
family.length           # => 7 (code points)
# Visually: one family emoji 👨‍👩‍👧‍👦

# Ruby 2.5+ grapheme cluster support
family.grapheme_clusters.length  # => 1

The grapheme_clusters method (Ruby 2.5+) provides UAX #29 grapheme segmentation.
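One concrete payoff: reversing a string code point by code point can detach combining marks, while grapheme-aware reversal keeps user-perceived characters intact. A minimal sketch:

```ruby
s = "e\u0301a"   # "éa" built from e + combining acute accent, then a

s.reverse                         # => "áe" — the accent jumped onto the a!
s.grapheme_clusters.reverse.join  # => "aé" — clusters stay intact
```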

Quick Reference

| Task | Code |
|------|------|
| Character count | s.length |
| Byte count | s.bytesize |
| Get encoding | s.encoding |
| Check validity | s.valid_encoding? |
| Sanitize | s.scrub |
| Code points | s.codepoints |
| Iterate chars | s.each_char { \|c\| ... } |
| Char to code point | c.ord |
| Code point to char | cp.chr(Encoding::UTF_8) |
| Convert encoding | s.encode("UTF-8") |
| Normalize NFC | s.unicode_normalize(:nfc) |
| Grapheme clusters | s.grapheme_clusters |
| Regex Unicode | /\p{L}+/ |
| Force encoding | s.force_encoding("UTF-8") |

Ruby's encoding-aware string model is one of the most flexible among mainstream languages. The combination of per-string encoding metadata, safe character-based indexing, built-in normalization, and powerful regex Unicode support makes Ruby well-suited for applications that process text from multiple sources and encodings. The key is to ensure all text is converted to UTF-8 as early as possible in your pipeline, and to use valid_encoding? and scrub to handle any malformed input gracefully.
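That "convert early, sanitize always" advice can be sketched as a single boundary helper; to_clean_utf8 below is a hypothetical function illustrating the pattern, not a standard API:

```ruby
# Hypothetical boundary helper: whatever encoding (or garbage) comes in,
# what comes out is valid UTF-8.
def to_clean_utf8(str, source_encoding: Encoding::UTF_8)
  str.dup
     .force_encoding(source_encoding)             # trust the declared source encoding
     .encode(Encoding::UTF_8,                     # convert to UTF-8...
             invalid: :replace, undef: :replace)  # ...replacing unmappable characters
     .scrub                                       # catch any remaining invalid bytes
end

to_clean_utf8("caf\xC3\xA9".b)   # => "café"
to_clean_utf8("\xFF hello".b)    # => "� hello"
```

The trailing scrub matters: encoding a string to its own encoding is a no-op in Ruby, so invalid bytes in already-UTF-8 input would otherwise slip through the encode step.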
