Unicode 指南

Unicode is the universal character encoding standard that assigns a unique number to every character in every language. This complete …

UTF-8 Encoding Explained

UTF-8 vs UTF-16 vs UTF-32: When to Use Each

UTF-8 is the dominant character encoding on the web, capable of representing every Unicode character using one to four bytes. …

What is a Unicode Code Point?

UTF-8, UTF-16, and UTF-32 are three encodings of Unicode, each with different trade-offs in size, speed, and compatibility. Learn the …

Unicode Planes and the BMP

A Unicode code point is the unique number assigned to each character in the Unicode standard, written in the form …

Understanding Byte Order Mark (BOM)

Unicode is divided into 17 planes, each containing up to 65,536 code points, with Plane 0 known as the Basic …

Surrogate Pairs Explained

The Byte Order Mark (BOM) is a special Unicode character used at the start of a text stream to signal …

ASCII to Unicode: The Evolution of Character Encoding

Surrogate pairs are a mechanism in UTF-16 that allows code points outside the BMP to be represented using two 16-bit …

Unicode Normalization: NFC, NFD, NFKC, NFKD

ASCII defined 128 characters for the English alphabet and was the foundation of text computing, but it could not handle …

The Unicode Bidirectional Algorithm

The same visible character can be represented by multiple different byte sequences in Unicode, which causes silent bugs in string …

Unicode General Categories Explained

The Unicode Bidirectional Algorithm (UBA) determines how text containing a mix of left-to-right and right-to-left scripts is displayed on screen. …

Understanding Unicode Blocks

Every Unicode character belongs to a general category such as Letter, Number, Punctuation, or Symbol, which determines how it behaves …

Unicode Scripts: How Writing Systems are Organized

Unicode blocks are contiguous ranges of code points grouped by script or character type, making it easier to find and …

What are Combining Characters?

Unicode assigns every character to a script property that identifies the writing system it belongs to, such as Latin, Arabic, …

Grapheme Clusters vs Code Points

Combining characters are Unicode code points that attach to a preceding base character to create accented letters, diacritics, and other …

Unicode Confusables: A Security Guide

A single visible character on screen — called a grapheme cluster — can be made up of multiple Unicode code …

Zero Width Characters: What They Are and Why They Matter

Unicode confusables are characters that look identical or nearly identical to others, enabling homograph attacks where a malicious URL or …

Unicode Whitespace Characters Guide

Zero-width characters are invisible Unicode code points that affect text layout, joining, and direction without occupying any visible space. This …

Unicode defines over two dozen whitespace characters beyond the ordinary space, including non-breaking spaces, thin spaces, and various-width spaces used …

History of Unicode

Unicode Versions Timeline

Unicode began in 1987 as a collaboration between engineers at Apple and Xerox who wanted to replace hundreds of incompatible …

Unicode has released major versions regularly since 1.0 in 1991, with each release adding thousands of new characters, emoji, and …

💻

Unicode in Code

Language-specific guides for working with Unicode

Unicode in Python

Python 3 uses Unicode strings by default, but correctly handling encoding, decoding, normalization, and grapheme clusters still requires careful attention. …

Unicode in JavaScript

JavaScript uses UTF-16 internally, which means characters outside the Basic Multilingual Plane — including most emoji — are stored as …

Unicode in Java

Java's char type is a 16-bit UTF-16 code unit, not a full Unicode character, which creates subtle bugs when working …

Unicode in Go

Go's string type is a sequence of bytes, and its rune type represents a single Unicode code point, making it …

Unicode in Rust

Rust's str and String types are guaranteed to be valid UTF-8, making it one of the safest languages for Unicode …

Unicode in C/C++

C and C++ have historically poor Unicode support, with char being a single byte and wchar_t having platform-dependent width, leading …

Unicode in Ruby

Ruby strings carry an explicit encoding, with UTF-8 being the default since Ruby 2.0, allowing developers to work with international …

Unicode in PHP

PHP's built-in string functions operate on bytes rather than Unicode characters, which means strlen, substr, and similar functions produce wrong …

Unicode in Swift

Swift's String type is designed with Unicode correctness as a first-class concern, representing characters as extended grapheme clusters rather than …

Unicode in HTML & CSS

Unicode in Regular Expressions

HTML and CSS support Unicode characters directly and through escape sequences, allowing developers to embed any character in web pages …

Unicode-aware regular expressions let you match characters by script, category, or property rather than just explicit byte ranges, making patterns …

Unicode in SQL

SQL databases store text in encodings and collations that determine how characters are saved, compared, and sorted, with UTF-8 and …

Unicode in URLs

Unicode Escape Sequences: Cross-Language Reference

URLs are technically restricted to ASCII characters, so non-ASCII text must be percent-encoded as UTF-8 bytes before being included in …

How to Handle Unicode in APIs and JSON

Every major programming language has its own syntax for embedding Unicode characters as escape sequences in string literals, from \u0041 …

Complete Arrow Symbols List

JSON is defined as Unicode text and must be encoded in UTF-8, UTF-16, or UTF-32, but many real-world APIs still …

🔣

Symbol Reference

Comprehensive guides to specific types of Unicode symbols

All Check Mark and Tick Symbols

Unicode contains hundreds of arrow symbols spanning simple directional arrows, double arrows, curved arrows, and specialized mathematical arrows. This complete …

Star and Asterisk Symbols

Unicode provides multiple check mark and tick symbols ranging from the classic ✓ (U+2713) to heavy check marks, ballot boxes, …

Heart Symbols Complete Guide

Unicode includes a rich collection of star shapes — from the simple five-pointed star ★ to outlined stars, six-pointed stars, …

Currency Symbols Around the World

Unicode contains dozens of heart symbols including the classic ♥, black and white hearts, suit hearts, decorative hearts, and a …

Mathematical Symbols and Operators

Unicode's Currency Symbols block and surrounding areas contain dedicated characters for over 60 world currencies, from the Dollar $ and …

Bracket and Parenthesis Symbols

Unicode has dedicated blocks for mathematical operators, arrows, letterlike symbols, and alphanumeric characters covering virtually every symbol used in modern …

Beyond the ASCII parentheses and square brackets, Unicode includes angle brackets, curly brackets, fullwidth brackets, double brackets, and dozens of …

Bullet Point Symbols

Line and Box Drawing Characters

Unicode offers a wide variety of bullet point characters beyond the standard •, including triangular bullets, dashes, arrows, and decorative …

Unicode's Box Drawing block contains 128 characters for drawing lines, corners, intersections, and borders in monospaced text environments like terminals …

Musical Note Symbols

Unicode includes musical note symbols such as ♩♪♫♬ in the Miscellaneous Symbols block, as well as a dedicated Musical Symbols …

Fraction Symbols Guide

Superscript and Subscript Characters

Unicode includes precomposed fraction characters for common fractions like ½ ¼ ¾ ⅓ and many others, allowing fractions to be …

Unicode provides precomposed superscript and subscript digits and letters — such as ¹ ² ³ and ₁ ₂ ₃ — …

Circle Symbols

Square and Rectangle Symbols

Unicode contains dozens of circle symbols including filled circles, outlined circles, circles with dots, circled letters, and the various combining …

Unicode includes filled squares, outlined squares, small squares, medium squares, dashed squares, and rectangle variants spread across the Geometric Shapes …

Triangle Symbols

Unicode provides a comprehensive set of triangle symbols in all orientations — up, down, left, right — in both filled …

Diamond Symbols

Unicode includes filled and outline diamond shapes, lozenge characters, and playing card suit diamonds across several blocks including Geometric Shapes …

Cross and X Mark Symbols

Dash and Hyphen Symbols Guide

Unicode provides various cross and X mark characters including the heavy ballot X ✗, multiplication signs, plus signs, and decorative …

Quotation Mark Symbols Complete Guide

The hyphen-minus on your keyboard is just one of Unicode's many dash characters, which also includes the en dash, em …

Unicode defines typographic quotation marks — curly quotes — for dozens of languages, including single and double guillemets, low-9 quotation …

Degree and Temperature Symbols

Unicode includes dedicated characters for the copyright symbol ©, registered trademark ®, trademark ™, service mark ℠, and other legal …

Circled and Enclosed Number Symbols

The degree symbol ° (U+00B0) and dedicated Celsius ℃ and Fahrenheit ℉ characters are among the most commonly searched Unicode …

Unicode's Enclosed Alphanumerics block provides circled numbers ①②③, parenthesized numbers ⑴⑵⑶, and other enclosed digit forms used in lists, annotations, …

Roman Numeral Symbols

Greek Alphabet Symbols for Math and Science

Unicode includes a Number Forms block with precomposed Roman numeral characters such as Ⅰ Ⅱ Ⅲ Ⅳ, distinct from the …

Greek letters like α β γ δ π Σ Ω are widely used as mathematical and scientific symbols and are …

Decorative Dingbats

The Unicode Dingbats block (U+2700–U+27BF) contains 192 decorative symbols originally from the Zapf Dingbats typeface, including scissors, pencils, fleurons, and …

Playing Card Symbols

Unicode includes a Playing Cards block with characters for all 52 standard playing cards plus jokers, card backs, and the …

Chess Piece Symbols

Zodiac and Astrological Symbols

Unicode provides characters for all six chess piece types in both white and black variants — ♔♕♖♗♘♙ and ♚♛♜♝♞♟ — …

Braille Pattern Characters

Unicode's Miscellaneous Symbols block includes the 12 zodiac signs ♈♉♊♋♌♍♎♏♐♑♒♓, planetary symbols, and other astrological characters used in astrology and …

Geometric Shapes Complete Guide

Unicode's Braille Patterns block (U+2800–U+28FF) encodes all 256 possible combinations of the 8-dot braille cell, enabling braille text to be …

Unicode's Geometric Shapes block contains 96 characters covering circles, squares, triangles, diamonds, and other polygons in filled, outlined, and various …

Letterlike Symbols

The Unicode Letterlike Symbols block contains mathematical and technical symbols derived from letters, such as ℃ ℉ ℕ ℤ ℝ …

Technical Symbols Guide

Combining Characters and Diacritics Guide

Unicode's Miscellaneous Technical block contains symbols from computing, electronics, and engineering, including keyboard keys, optical character recognition characters, and control …

Whitespace and Invisible Characters Guide

Diacritics are accent marks and other marks that attach to letters to indicate pronunciation or meaning, represented in Unicode as …

Unicode defines dozens of invisible characters beyond the ordinary space, including zero-width spaces, word joiners, soft hyphens, and various format …

Warning and Hazard Signs

Unicode includes warning and hazard symbols such as the universal caution ⚠ sign, biohazard ☣, radiation ☢, and skull and …

Weather Symbols Guide

Religious Symbols in Unicode

Unicode's Miscellaneous Symbols block includes sun ☀, cloud ☁, rain ☂, snow ❄, lightning ⚡, and other weather symbols used …

Gender and Identity Symbols

Unicode includes symbols for many of the world's major religions including the cross ✝, crescent and star ☪, Star of …

Keyboard Shortcut Symbols Guide

Unicode includes the traditional male ♂ and female ♀ symbols from astronomy, as well as newer gender-related characters and the …

Symbols for Social Media Bios

Apple's macOS uses Unicode characters for keyboard modifier keys such as ⌘ Command, ⌥ Option, ⇧ Shift, and ⌃ Control, …

Basic Latin (ASCII) Block

Unicode symbols like ▶ ◀ ► ★ ✦ ⚡ ✈ and hundreds of others are widely used in social media …

🧱

Block Explorer

Deep dives into notable Unicode blocks

The Basic Latin block (U+0000–U+007F) is the first Unicode block and covers the 128 original ASCII characters, including the English …

Latin-1 Supplement Block

General Punctuation Block

The Latin-1 Supplement block (U+0080–U+00FF) extends ASCII with accented Latin characters for Western European languages and matches the original ISO …

Mathematical Operators Block

The General Punctuation block (U+2000–U+206F) contains typographic spaces, dashes, quotation marks, and various punctuation characters used in professional typography across …

The Mathematical Operators block (U+2200–U+22FF) contains 256 symbols covering set theory, logic, calculus, and other areas of mathematics used in …

Arrows Block

The Arrows block (U+2190–U+21FF) contains 112 arrow characters including simple directional arrows, double-headed arrows, curved arrows, and arrows with various …

Dingbats Block

Miscellaneous Symbols Block

The Dingbats block (U+2700–U+27BF) was created to encode the Zapf Dingbats typeface and contains 192 decorative ornaments, check marks, arrows, …

CJK Unified Ideographs Overview

The Miscellaneous Symbols block (U+2600–U+26FF) is one of Unicode's most eclectic, containing weather symbols, card suits, zodiac signs, religious symbols, …

The CJK Unified Ideographs block (U+4E00–U+9FFF) is one of the largest Unicode blocks, containing over 20,000 Chinese, Japanese, and Korean …

Hangul Block

The Hangul Syllables block (U+AC00–U+D7A3) contains 11,172 precomposed Korean syllable blocks algorithmically derived from 19 initial consonants, 21 vowels, and …

Emoji Blocks Overview

Emoji in Unicode span multiple blocks across the Supplementary Multilingual Plane, including Emoticons, Miscellaneous Symbols and Pictographs, and Transport and …

Currency Symbols Block

Box Drawing & Block Elements Blocks

The Currency Symbols block (U+20A0–U+20CF) contains dedicated Unicode characters for currencies that lack an ASCII representation, such as the Euro …

Enclosed Alphanumerics Block

The Box Drawing block (U+2500–U+257F) and Block Elements block (U+2580–U+259F) provide characters for drawing text-based user interfaces and terminal art …

The Enclosed Alphanumerics block (U+2460–U+24FF) contains circled numbers, parenthesized numbers and letters, and various other enclosed alphanumeric forms used in …

Geometric Shapes Blocks

The Geometric Shapes block (U+25A0–U+25FF) and related blocks contain squares, circles, triangles, and other polygons in filled, outlined, and partial …

Musical Symbols Block

The Musical Symbols block (U+1D100–U+1D1FF) is a Supplementary Multilingual Plane block containing 233 characters for full western music notation, including …

📜

Script Stories

Writing system deep dives with history and culture

Arabic Script Deep Dive

Devanagari Script Deep Dive

Arabic is the third most widely used writing system in the world, written right-to-left with letters that change shape depending …

Devanagari is an abugida script used to write Hindi, Sanskrit, Marathi, and many other South Asian languages, with complex conjunct …

Greek and Coptic

Greek is one of the oldest alphabetic writing systems and gave Unicode many of its mathematical symbols, with the Greek …

Cyrillic Script

Cyrillic is used to write Russian, Ukrainian, Bulgarian, Serbian, and over 50 other languages, making it one of the most …

Hebrew Script

Hebrew is an abjad script written right-to-left, used for Biblical Hebrew, Modern Hebrew, and Yiddish, with optional vowel diacritics called …

Thai Script

Thai is an abugida script with no spaces between words, complex vowel placement above and below consonants, and tone marks …

Japanese Writing Systems

Japanese is unique in using three scripts simultaneously — Hiragana, Katakana, and Kanji (CJK ideographs) — alongside Latin text, making …

Korean Hangul System

Hangul was invented in 1443 by King Sejong as a scientific alphabet where syllable blocks are algorithmically composed from individual …

Bengali Script

Bengali is an abugida script with over 300 million speakers, used for Bengali and Assamese, featuring complex conjunct consonant forms …

Tamil Script

Tamil is one of the oldest living writing systems, with a literary tradition spanning over 2,000 years, and its script …

Armenian Script

The Armenian alphabet was created in 405 AD by the monk Mesrop Mashtots and has remained largely unchanged for over …

Georgian Script

Georgian has three distinct historical scripts — Mkhedruli, Asomtavruli, and Nuskhuri — all encoded in Unicode, with modern Georgian using …

Ethiopic Script

The Ethiopic script (Ge'ez) is an abugida used to write Amharic, Tigrinya, Oromo, and many other languages of the Horn …

Dead Scripts in Unicode

Writing Systems of the World

Unicode encodes dozens of historic and extinct scripts — from Cuneiform and Egyptian Hieroglyphs to Linear B and Gothic — …

How to Type Special Characters on Windows

There are hundreds of writing systems in use around the world today, from alphabets and syllabaries to abjads and abugidas, …

🔧

Practical Unicode

How-to guides for real-world Unicode tasks

How to Type Special Characters on Mac

Windows provides several methods for typing special characters and Unicode symbols, including Alt codes, the Character Map app, keyboard shortcuts, …

How to Type Special Characters on Linux

macOS makes it easy to type special characters and Unicode symbols through the Character Viewer, Option key shortcuts, the emoji …

Special Characters on Mobile (iOS/Android)

Linux offers multiple ways to insert Unicode characters, including Ctrl+Shift+U followed by a hex code point, compose key sequences, IBus …

How to Fix Mojibake (Garbled Text)

Typing special Unicode characters on smartphones requires different techniques than on desktop — long-pressing keys, using third-party keyboards, or copy-pasting …

Mojibake is the garbled text you see when a file encoded in one character set is interpreted as another, producing …

Unicode in Databases

Storing Unicode text in a database requires choosing the right charset, collation, and column type — a wrong choice can …

Unicode in Filenames

Modern operating systems support Unicode filenames, but different filesystems use different encodings and normalization forms, causing cross-platform file sharing problems …

Unicode in Email

Unicode in Domain Names (IDN)

Email evolved from ASCII-only systems, and supporting Unicode in email subjects, bodies, and addresses requires encoding schemes like Quoted-Printable, Base64, …

Unicode for Accessibility

Internationalized Domain Names (IDNs) allow domain names to contain non-ASCII characters from any Unicode script, converted to ASCII-compatible encoding using …

Unicode Text Direction: LTR vs RTL

Using Unicode symbols, special characters, and emoji in web content has important accessibility implications for screen readers, which may announce …

Unicode Fonts: How Characters Get Rendered

Unicode supports both left-to-right and right-to-left text through the bidirectional algorithm and explicit directional control characters, enabling correct display of …

How to Find Any Unicode Character

A font file only contains glyphs for a subset of Unicode characters, which means characters outside that subset fall back …

Unicode Copy and Paste Best Practices

Finding the exact Unicode character you need can be challenging given over 140,000 characters spread across 150+ scripts and dozens …

How to Create Fancy Text with Unicode

Copying and pasting text between applications can introduce invisible characters, change normalization forms, strip or mangle Unicode characters, and cause …

Unicode in Microsoft Word

Unicode's Mathematical Alphanumeric Symbols block and other areas contain bold, italic, script, Fraktur, monospace, and double-struck variants of Latin letters …

🖥️

Platform Guides

Unicode usage in specific platforms and applications

Unicode in Google Docs & Sheets

Microsoft Word supports the full Unicode character set and provides several methods for inserting special characters, including Alt+X code point …

Unicode in Terminal / Command Line

Google Docs and Sheets use UTF-8 internally and provide a Special Characters panel for inserting Unicode symbols by drawing, searching …

Modern terminals support Unicode and UTF-8, but correctly displaying all Unicode characters requires a compatible terminal emulator, the right locale …

Unicode in PDF Documents

PDF supports Unicode text through embedded fonts and ToUnicode maps, but many PDFs created from scans or older tools produce …

Unicode in Excel

Microsoft Excel stores text in Unicode but has historically struggled with non-Latin characters in CSV imports, RTL text layout, and …

Unicode in Social Media

Social media platforms handle Unicode text with varying degrees of support, affecting how emoji, RTL text, special characters, and invisible …

Unicode in XML and JSON

Unicode in Data Science and NLP

Both XML and JSON are defined to use Unicode text, but each has its own rules for encoding characters, escaping …

Natural language processing and data science pipelines frequently encounter Unicode issues including encoding errors, normalization mismatches, invisible characters, and language …

Unicode in QR Codes

Unicode in Passwords: Security Implications

QR codes can encode Unicode text using UTF-8, but many QR code generators and scanners default to ISO 8859-1, causing …

The Birth of ASCII (1963)

Allowing Unicode characters in passwords increases the keyspace and can improve security, but it also introduces normalization ambiguity, where the …

📖

Unicode History & Culture

Historical and cultural stories about Unicode

EBCDIC: IBM's Alternative

ASCII was created in 1963 by the American Standards Association to standardize communication between different computers and telegraph systems using …

The Unicode Consortium: Who Decides?

EBCDIC (Extended Binary Coded Decimal Interchange Code) was IBM's character encoding used on its mainframe computers, incompatible with ASCII and …

How New Characters Get Added to Unicode

The Unicode Consortium is the non-profit organization responsible for developing and maintaining the Unicode standard, with members including Apple, Google, …

The Emoji Proposal Process

Adding a new character to Unicode requires submitting a detailed proposal to the Unicode Consortium that demonstrates the character's need, …

CJK Unification: Controversy and Compromise

Getting a new emoji into Unicode requires a formal proposal to the Emoji Subcommittee that must meet strict selection criteria …

The Mojibake Problem: A History

CJK unification was Unicode's decision to assign the same code points to Chinese, Japanese, and Korean characters with the same …

Mojibake — Japanese for 'character transformation' — is the garbled text that results from encoding mismatches, a problem that plagued …

Unicode Milestones

How Unicode Changed the Internet

From the first Unicode draft in 1988 to the addition of emoji, the surpassing of 100,000 characters, and UTF-8 becoming …

Fun Unicode Facts and Easter Eggs

Before Unicode became universal, the web was fragmented by incompatible national encodings that made international websites unreliable and cross-language communication …

Unicode Security Overview

Unicode is full of surprising, obscure, and occasionally humorous characters — from the interrobang ‽ to the snowman ☃ with …

🔒

Unicode Security

Security implications of Unicode

IDN Homograph Attack Detection

Unicode's vast character set introduces a range of security vulnerabilities including homograph attacks, bidirectional text spoofing, normalization exploits, and invisible …

Invisible Character Detection and Removal

IDN homograph attacks use look-alike Unicode characters to register domain names that appear identical to legitimate ones — for example, …

Unicode in Passwords and Authentication

Zero-width and other invisible Unicode characters can be used to fingerprint text for tracking, hide malicious payloads in code, or …

Preventing Unicode-based Phishing

Unicode passwords introduce normalization ambiguity that can cause authentication failures or allow password bypasses when different normalization forms produce different …

Unicode Collation: Sorting Text Correctly

Phishing attacks increasingly exploit Unicode confusables, bidirectional overrides, and invisible characters to create deceptive URLs, spoofed sender addresses, and misleading …

🎓

Advanced Topics

Deep technical topics for experienced developers

ICU Library: International Components for Unicode

Sorting text correctly across languages requires the Unicode Collation Algorithm (UCA), which defines a multi-level comparison scheme that handles accents, …

The Future of Unicode: What Comes After 16.0?

ICU (International Components for Unicode) is the reference implementation library for Unicode and internationalization, providing collation, date/number formatting, transliteration, and …

Unicode in Compilers and Programming Language Design

Unicode 16.0 added thousands of new characters, but there are still hundreds of scripts, historic languages, and symbols waiting for …

Unicode Normalization Performance: Benchmarks

Modern programming language designers must decide how to handle Unicode in identifiers, string literals, source file encoding, and operator symbols …