Unicode Fundamentals
Core concepts of Unicode and character encoding
20 guides in this series
Unicode is the universal character encoding standard that assigns a unique number to every character in every language. This complete guide explains what Unicode is, why it was created, and how it powers modern text on the internet.
UTF-8 is the dominant character encoding on the web, capable of representing every Unicode character using one to four bytes. This guide explains how UTF-8 works, why it became the default encoding, and how to use it correctly in your projects.
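The variable-width design is easy to see in Python (a quick sketch): each character below encodes to a different number of UTF-8 bytes.

```python
# One code point can take one to four bytes in UTF-8.
for ch in ["A", "é", "€", "𝄞"]:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {len(encoded)} byte(s): {encoded.hex(' ')}")

assert len("A".encode("utf-8")) == 1    # ASCII: single byte
assert len("é".encode("utf-8")) == 2    # Latin-1 range: two bytes
assert len("€".encode("utf-8")) == 3    # rest of the BMP: three bytes
assert len("𝄞".encode("utf-8")) == 4    # beyond the BMP: four bytes
```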
UTF-8, UTF-16, and UTF-32 are three encodings of Unicode, each with different trade-offs in size, speed, and compatibility. Learn the key differences and when to choose each encoding for your application.
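The size trade-off is visible by encoding the same string three ways (a minimal Python illustration):

```python
# Mixed ASCII and CJK text: UTF-8 wins on the ASCII part,
# UTF-16 is a flat 2 bytes per BMP character, UTF-32 a flat 4.
text = "Hello, 世界"
for enc in ("utf-8", "utf-16-le", "utf-32-le"):
    print(enc, len(text.encode(enc)), "bytes")

assert len(text.encode("utf-8")) == 13      # 7 ASCII bytes + 2 * 3 CJK bytes
assert len(text.encode("utf-16-le")) == 18  # 9 characters * 2 bytes
assert len(text.encode("utf-32-le")) == 36  # 9 characters * 4 bytes
```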
A Unicode code point is the unique number assigned to each character in the Unicode standard, written in the form U+0041. This guide explains what code points are, how they are structured, and how they relate to the bytes stored in a file.
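In Python, `ord()` and `chr()` convert between characters and their code points, which makes the U+0041 notation concrete:

```python
# ord() gives the code point; chr() goes the other way.
assert ord("A") == 0x41           # written U+0041
assert chr(0x1F600) == "😀"       # code points above U+FFFF work the same
print(f"U+{ord('A'):04X}")        # the conventional U+ hex notation
```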
Unicode is divided into 17 planes of 65,536 code points each, with Plane 0 known as the Basic Multilingual Plane (BMP). This guide explains the structure of Unicode planes, what lives in each one, and why the BMP matters to developers.
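Because each plane holds exactly 65,536 (0x10000) code points, a character's plane is just integer division (a small sketch):

```python
# The plane number is the code point divided by 0x10000.
def plane(cp: int) -> int:
    return cp // 0x10000

assert plane(ord("A")) == 0        # Plane 0, the BMP
assert plane(0x1F600) == 1         # emoji live in Plane 1, the SMP
assert plane(0x10FFFF) == 16       # the last valid code point, Plane 16
```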
The Byte Order Mark (BOM) is a special Unicode character used at the start of a text stream to signal its encoding and byte order. This guide explains what the BOM is, when it is necessary, and the common problems it can cause in modern applications.
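Python's standard library exposes the BOM directly, and its `utf-8-sig` codec shows the round trip (a short sketch):

```python
import codecs

# The UTF-8 BOM is U+FEFF encoded as three bytes.
assert "\ufeff".encode("utf-8") == codecs.BOM_UTF8 == b"\xef\xbb\xbf"

# "utf-8-sig" writes the BOM on encode and strips it on decode.
data = "hello".encode("utf-8-sig")
assert data.startswith(codecs.BOM_UTF8)
assert data.decode("utf-8-sig") == "hello"
```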
Surrogate pairs are a mechanism in UTF-16 that allows code points outside the BMP to be represented using two 16-bit code units. This guide explains how surrogate pairs work, why they exist, and the bugs they can cause in JavaScript and other languages.
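The pair arithmetic can be verified in Python by encoding a non-BMP character to UTF-16 (an illustrative sketch):

```python
import struct

# A non-BMP character becomes two 16-bit code units in UTF-16.
cp = ord("😀")                      # U+1F600, outside the BMP
units = struct.unpack("<2H", "😀".encode("utf-16-le"))
assert units == (0xD83D, 0xDE00)    # high surrogate, low surrogate

# The pair is derived from the code point like this:
offset = cp - 0x10000
high = 0xD800 + (offset >> 10)      # top 10 bits
low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
assert (high, low) == units
```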
ASCII defined 128 characters for the English alphabet and was the foundation of text computing, but it could not handle the world's languages. This guide traces the journey from ASCII through code pages to Unicode and explains why a universal standard was essential.
The same visible character can be represented by multiple different byte sequences in Unicode, which causes silent bugs in string comparison, hashing, and search. This guide explains the four normalization forms — NFC, NFD, NFKC, and NFKD — and when to apply each.
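The classic example is "é", which Python's `unicodedata` module normalizes between its one-code-point and two-code-point forms:

```python
import unicodedata

composed = "\u00e9"        # é as a single code point (NFC form)
decomposed = "e\u0301"     # e + combining acute accent (NFD form)

assert composed != decomposed                   # naive comparison fails
assert unicodedata.normalize("NFC", decomposed) == composed
assert unicodedata.normalize("NFD", composed) == decomposed

# NFKC additionally folds compatibility characters, e.g. the ligature ﬁ:
assert unicodedata.normalize("NFKC", "\ufb01") == "fi"
```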
The Unicode Bidirectional Algorithm (UBA) determines how text containing a mix of left-to-right and right-to-left scripts is displayed on screen. This guide explains how the algorithm works and the security risks that arise from abusing bidirectional control characters.
Every Unicode character belongs to a general category such as Letter, Number, Punctuation, or Symbol, which determines how it behaves in text processing. This guide walks through the Unicode general category system and shows how to use categories in code.
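The `unicodedata.category()` function returns the two-letter category code; the first letter alone gives the major class (a quick sketch):

```python
import unicodedata

assert unicodedata.category("A") == "Lu"   # Letter, uppercase
assert unicodedata.category("7") == "Nd"   # Number, decimal digit
assert unicodedata.category("!") == "Po"   # Punctuation, other
assert unicodedata.category("€") == "Sc"   # Symbol, currency

# Everything whose category starts with "L" is a letter of some kind.
def is_letter(ch: str) -> bool:
    return unicodedata.category(ch).startswith("L")

assert is_letter("ß") and not is_letter("3")
```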
Unicode blocks are contiguous ranges of code points grouped by script or character type, making it easier to find and work with related characters. This guide explains the block structure, lists the most important blocks, and shows how to look up block membership.
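Python's standard library has no block lookup, but since blocks are fixed ranges from the Unicode Character Database (Blocks.txt), a lookup is easy to sketch. The short table below is illustrative, not the full block list:

```python
# A few well-known block ranges, hardcoded for illustration.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0370, 0x03FF, "Greek and Coptic"),
    (0x0400, 0x04FF, "Cyrillic"),
    (0x1F600, 0x1F64F, "Emoticons"),
]

def block_of(ch: str) -> str:
    cp = ord(ch)
    for start, end, name in BLOCKS:
        if start <= cp <= end:
            return name
    return "(not in this short table)"

assert block_of("A") == "Basic Latin"
assert block_of("Ω") == "Greek and Coptic"
assert block_of("😀") == "Emoticons"
```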
Unicode assigns every character to a script property that identifies the writing system it belongs to, such as Latin, Arabic, or Han. This guide explains the Unicode script system, how it differs from blocks, and how scripts are used in internationalization and security.
Combining characters are Unicode code points that attach to a preceding base character to create accented letters, diacritics, and other modified forms. This guide explains how combining characters work, how they interact with normalization, and common pitfalls in string handling.
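A combining mark has a nonzero canonical combining class, which `unicodedata.combining()` reports (a small sketch):

```python
import unicodedata

base, accent = "e", "\u0301"   # U+0301 COMBINING ACUTE ACCENT

# combining() is nonzero for combining marks, 0 for base characters.
assert unicodedata.combining(accent) == 230
assert unicodedata.combining(base) == 0

# Two code points, but they render as the single glyph "é":
s = base + accent
assert len(s) == 2
assert unicodedata.normalize("NFC", s) == "\u00e9"
```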
A single visible character on screen — called a grapheme cluster — can be made up of multiple Unicode code points, which means string length in most programming languages gives misleading results. This guide explains the difference between grapheme clusters and code points and how to handle them correctly.
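Python's `len()` counts code points, so a single visible glyph can report a surprising length (a quick demonstration):

```python
flag = "🇺🇸"              # two regional-indicator code points, one glyph
family = "👩\u200d👧"     # woman + zero-width joiner + girl, one glyph

assert len(flag) == 2
assert len(family) == 3

# Splitting text into true grapheme clusters requires the UAX #29 rules;
# third-party packages such as "regex" or "grapheme" implement them
# (not shown here, to keep this sketch stdlib-only).
```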
Unicode confusables are characters that look identical or nearly identical to others, enabling homograph attacks where a malicious URL or username appears legitimate. This guide explains what confusables are, how attackers exploit them, and how to detect and prevent confusable-based spoofing.
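The classic case is Cyrillic "а" (U+0430) versus Latin "a" (U+0061). The toy skeleton map below is in the spirit of UTS #39 confusable detection; the real data file (confusables.txt) maps thousands of characters:

```python
# Visually identical, but different code points:
assert "a" != "\u0430"

# Illustrative skeleton mapping (tiny subset, Cyrillic-to-Latin only).
CONFUSABLE = {"\u0430": "a", "\u0435": "e", "\u043e": "o", "\u0440": "p"}

def skeleton(s: str) -> str:
    return "".join(CONFUSABLE.get(ch, ch) for ch in s)

# A mixed-script spoof of "paypal" collapses to the genuine string:
assert skeleton("р\u0430yp\u0430l") == "paypal"
```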
Zero-width characters are invisible Unicode code points that affect text layout, joining, and direction without occupying any visible space. This guide explains the most important zero-width characters, their legitimate uses, and how they are exploited for data exfiltration and invisible text watermarking.
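A short sketch of detecting and stripping the most common zero-width code points (the set below is illustrative, not exhaustive, and note that removing ZWJ also breaks legitimate emoji sequences):

```python
# ZWSP, ZWNJ, ZWJ, word joiner, and the BOM-as-ZWNBSP.
ZERO_WIDTH = {"\u200b", "\u200c", "\u200d", "\u2060", "\ufeff"}

def strip_zero_width(s: str) -> str:
    return "".join(ch for ch in s if ch not in ZERO_WIDTH)

tainted = "pass\u200bword"      # displays as "password" but compares unequal
assert tainted != "password"
assert strip_zero_width(tainted) == "password"
```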
Unicode defines over two dozen whitespace characters beyond the ordinary space, including non-breaking spaces, thin spaces, and various-width spaces used in typography. This guide catalogs all Unicode whitespace characters, explains their purposes, and shows how to handle them safely in code.
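Python's `str.isspace()` and argument-less `str.strip()` already recognize Unicode whitespace, while byte-oriented habits like splitting on `" "` do not (a quick sketch):

```python
assert "\u00a0".isspace()   # NO-BREAK SPACE
assert "\u2009".isspace()   # THIN SPACE
assert "\u3000".isspace()   # IDEOGRAPHIC SPACE

# strip() with no arguments handles all Unicode whitespace:
assert "\u00a0hello\u2009".strip() == "hello"

# But splitting on the ASCII space alone misses the rest:
assert "a\u00a0b".split(" ") == ["a\u00a0b"]
```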
Unicode began in 1987 as a collaboration between engineers at Apple and Xerox who wanted to replace hundreds of incompatible encoding systems with a single universal standard. This guide tells the story of how Unicode was created, standardized, and adopted to become the foundation of modern text.
Unicode has released major versions regularly since 1.0 in 1991, with each release adding thousands of new characters, emoji, and scripts from around the world. This timeline covers every Unicode version, its key additions, and how the standard has grown to cover over 140,000 characters.