💻

Unicode in Code

Language-specific guides for working with Unicode

15 guides in this series

1
Unicode in Python

Python 3 uses Unicode strings by default, but correctly handling encoding, decoding, normalization, and grapheme clusters still requires careful attention. This guide covers everything developers need to know about Unicode in Python, from the str type to the unicodedata module and third-party libraries.
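As a small illustration of why normalization still matters with Python 3's str type, the sketch below compares two spellings of the same visible character. The variable names are chosen for this example only.

```python
import unicodedata

# "é" written two ways: one precomposed code point,
# or "e" followed by a combining acute accent.
composed = "\u00e9"
decomposed = "e\u0301"

# The strings look the same on screen but hold different code points.
same_before = composed == decomposed

# NFC normalization composes the pair into the single code point.
same_after = unicodedata.normalize("NFC", decomposed) == composed

# len() counts code points; encode() reveals the UTF-8 byte count.
code_points = len(decomposed)
utf8_bytes = len(composed.encode("utf-8"))

print(same_before, same_after, code_points, utf8_bytes)
```

Note that len() counts code points, not grapheme clusters, so the decomposed form reports a length of 2 even though a reader sees one character.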

2
Unicode in JavaScript

JavaScript uses UTF-16 internally, which means characters outside the Basic Multilingual Plane — including most emoji — are stored as surrogate pairs and cause unexpected string length and indexing behavior. This guide explains Unicode in JavaScript, covering ES6 improvements, the u flag in regex, and Intl APIs.
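The surrogate-pair arithmetic can be sketched in Python (used here only for illustration; the helper names are hypothetical). Counting UTF-16 code units reproduces the value that JavaScript's length property would report for the same text.

```python
def utf16_code_units(text: str) -> int:
    """Count the UTF-16 code units needed to store text."""
    return len(text.encode("utf-16-le")) // 2


def surrogate_pair(code_point: int) -> tuple:
    """Split a supplementary code point (above U+FFFF) into high and low surrogates."""
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


emoji = "\U0001F600"  # GRINNING FACE, outside the Basic Multilingual Plane

units_emoji = utf16_code_units(emoji)    # two code units for one code point
units_ascii = utf16_code_units("A")      # one code unit
high, low = surrogate_pair(ord(emoji))

print(units_emoji, units_ascii, hex(high), hex(low))
```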

3
Unicode in Java

Java's char type is a 16-bit UTF-16 code unit, not a full Unicode character, which creates subtle bugs when working with supplementary characters outside the BMP. This guide explains how Java handles Unicode strings, the difference between char and code points, and best practices for internationalized Java applications.

4
Unicode in Go

Go's string type is a sequence of bytes, and its rune type represents a single Unicode code point, making it easier to work with non-ASCII text than many older languages. This guide covers Go's unicode and unicode/utf8 packages, ranging over strings, and handling multibyte characters correctly.

5
Unicode in Rust

Rust's str and String types are guaranteed to be valid UTF-8, making it one of the safest languages for Unicode text handling at the type level. This guide explains Rust's Unicode guarantees, how to iterate over characters and bytes, and the unicode-segmentation crate for grapheme cluster support.

6
Unicode in C/C++

C and C++ have historically had poor Unicode support, with char being a single byte and wchar_t having platform-dependent width, leading to decades of encoding bugs. This guide covers modern approaches to Unicode in C/C++, including char8_t, char16_t, char32_t, and the ICU library.

7
Unicode in Ruby

Ruby strings carry an explicit encoding, with UTF-8 being the default since Ruby 2.0, allowing developers to work with international text without most of the pitfalls found in older languages. This guide explains Ruby's Encoding class, encoding conversion, and how to handle Unicode normalization and grapheme clusters in Ruby.

8
Unicode in PHP

PHP's built-in string functions operate on bytes rather than Unicode characters, which means strlen, substr, and similar functions produce wrong results for multibyte text without the mbstring extension. This guide explains how to handle Unicode correctly in PHP using mb_ functions, intl, and modern PHP practices.

9
Unicode in Swift

Swift's String type is designed with Unicode correctness as a first-class concern, representing characters as extended grapheme clusters rather than code units. This guide explains Swift's Character and String types, their views (UTF-8, UTF-16, Unicode scalars), and how to work with emoji and complex characters.

10
Unicode in HTML & CSS

HTML and CSS support Unicode characters directly and through escape sequences, allowing developers to embed any character in web pages as long as the document's encoding is declared correctly. This guide covers the charset meta tag, HTML entity references, CSS unicode-range, and how to insert special characters in markup and styles.

11
Unicode in Regular Expressions

Unicode-aware regular expressions let you match characters by script, category, or property rather than just explicit character ranges, making patterns far more robust for international text. This guide covers Unicode regex features across Python, JavaScript, PCRE, and Java, including \p{} properties and the u flag.
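A rough sketch of Unicode-aware matching in Python follows. The standard library's re module does not support \p{} properties, so this example relies on the Unicode-aware \w class and, as a stand-in for property matching, on unicodedata.category. The sample text is made up for the example.

```python
import re
import unicodedata

text = "na\u00efve caf\u00e9"  # "naïve café"

# For str patterns, \w matches Unicode word characters by default.
unicode_words = re.findall(r"\w+", text)

# re.ASCII restricts \w to [a-zA-Z0-9_], splitting words at accented letters.
ascii_words = re.findall(r"\w+", text, flags=re.ASCII)

# Filtering by general category: every category starting with "L" is a letter.
letters_only = "".join(ch for ch in text if unicodedata.category(ch).startswith("L"))

print(unicode_words, ascii_words, letters_only)
```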

12
Unicode in SQL

SQL databases store text in encodings and collations that determine how characters are saved, compared, and sorted, with UTF-8 and UTF-16 being the most common choices. This guide covers Unicode in MySQL, PostgreSQL, and SQLite, explaining how to choose the right charset, collation, and column type for international data.

13
Unicode in URLs

URLs are technically restricted to ASCII characters, so non-ASCII text must be percent-encoded as UTF-8 bytes before being included in a URL path or query string. This guide explains percent-encoding, Internationalized Resource Identifiers (IRIs), Punycode for domains, and how to encode and decode Unicode URLs correctly.
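A minimal sketch of both encodings using Python's standard library is shown below. The host name is a made-up example, and Python's built-in idna codec implements the older IDNA 2003 rules, so treat the domain part as illustrative rather than as a complete IDNA solution.

```python
from urllib.parse import quote, unquote

# Percent-encoding: non-ASCII text becomes UTF-8 bytes, each written as %XX.
path_segment = "caf\u00e9"          # "café"
encoded = quote(path_segment)       # 'caf%C3%A9'
decoded = unquote(encoded)

# Punycode for domains: each non-ASCII label gets an ASCII "xn--" form.
host = "m\u00fcnchen.example"       # hypothetical host name
ascii_host = host.encode("idna").decode("ascii")
roundtrip_host = ascii_host.encode("ascii").decode("idna")

print(encoded, decoded, ascii_host, roundtrip_host)
```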

14
Unicode Escape Sequences: Cross-Language Reference

Every major programming language has its own syntax for embedding Unicode characters as escape sequences in string literals, from \u0041 in Java to \N{LATIN SMALL LETTER A} in Python. This guide is a cross-language reference for Unicode escape sequence syntax, covering Python, JavaScript, Java, Go, Rust, C++, and more.
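For instance, Python alone accepts several escape forms in string literals. This short sketch shows four of them; the variable names are illustrative.

```python
# Four Python escape forms for embedding characters in a string literal.
capital_a = "\u0041"                    # 4 hex digits: U+0041
small_a = "\N{LATIN SMALL LETTER A}"    # by Unicode character name
hex_a = "\x41"                          # 2 hex digits
grinning = "\U0001F600"                 # 8 hex digits, needed above U+FFFF

print(capital_a, small_a, hex_a, hex(ord(grinning)))
```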

15
How to Handle Unicode in APIs and JSON

JSON is defined as Unicode text, and RFC 8259 requires UTF-8 for JSON exchanged between systems outside a closed ecosystem (earlier specifications also allowed UTF-16 and UTF-32), but many real-world APIs still produce encoding bugs, garbled characters, and incorrectly escaped sequences. This guide explains how to handle Unicode correctly in REST APIs and JSON, including proper escaping, content-type headers, and validation.
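The two common ways to serialize non-ASCII text in JSON can be sketched with Python's json module. The payload is invented for the example.

```python
import json

payload = {"name": "Zo\u00eb", "mood": "\U0001F600"}  # "Zoë" and an emoji

# Default: every non-ASCII character is escaped, so the output is pure ASCII.
# Characters above U+FFFF are written as a \uXXXX surrogate pair.
escaped = json.dumps(payload)

# ensure_ascii=False keeps the characters as-is; encode to UTF-8 for the wire.
raw = json.dumps(payload, ensure_ascii=False)
wire_bytes = raw.encode("utf-8")

# Both forms decode back to the same data.
restored_from_escaped = json.loads(escaped)
restored_from_bytes = json.loads(wire_bytes.decode("utf-8"))

print(escaped)
print(raw)
```

Either form is valid JSON. The escaped form survives transports that mishandle non-ASCII bytes, while the raw UTF-8 form is more compact and easier to read.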