📊 Encoding Visualizer
Visualize at the byte level how characters are encoded in UTF-8, UTF-16, and UTF-32, showing header bits, payload bits, and surrogate pairs.
UTF-8
UTF-16
UTF-32
Fixed 4 bytes: direct code point value
Compare
Enter a character to see how it is encoded in UTF-8, UTF-16, and UTF-32
Type or paste any character or short string into the input field. Single characters reveal the most detail; strings show how multiple code points chain together as byte sequences in each encoding form.
The tool renders each encoding as a color-coded byte diagram. UTF-8 shows the bit pattern that signals 1-, 2-, 3-, or 4-byte sequences; UTF-16 shows each 16-bit code unit and surrogate pairs; UTF-32 shows the fixed 4-byte representation. Hover over bytes to see the bit values and their structural role in the encoding.
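The split between header bits and payload bits in UTF-8 can be sketched in a few lines of Python. This is only an illustration of the bit layout, not the tool's own code; the function name `utf8_bit_layout` is hypothetical.

```python
def utf8_bit_layout(ch: str) -> list:
    """Return a (header_bits, payload_bits) pair for each UTF-8 byte of one character."""
    if len(ch) != 1:
        raise ValueError("expected exactly one code point")
    encoded = ch.encode("utf-8")
    layout = []
    for index, byte in enumerate(encoded):
        bits = format(byte, "08b")
        if index > 0:
            header_len = 2                 # continuation byte: 10xxxxxx
        elif len(encoded) == 1:
            header_len = 1                 # ASCII byte: 0xxxxxxx
        else:
            header_len = len(encoded) + 1  # lead byte: 110xxxxx, 1110xxxx, 11110xxx
        layout.append((bits[:header_len], bits[header_len:]))
    return layout


if __name__ == "__main__":
    for header, payload in utf8_bit_layout("€"):
        print(header, payload)
```

Concatenating the payload bits of all bytes gives back the code point: for "€" the payloads `0010`, `000010`, and `101100` join to `0x20AC`.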
Use the side-by-side comparison view to see how the same character differs in byte count, bit layout, and hexadecimal representation across UTF-8, UTF-16 LE/BE, UTF-32 LE/BE, Latin-1, and Windows-1252. This view is especially valuable for understanding why supplementary plane characters require more bytes.
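A rough equivalent of the comparison view can be produced with Python's built-in codecs. This is a sketch under the assumption that Python's `latin-1` and `cp1252` codecs match what the tool labels Latin-1 and Windows-1252; the helper name `compare_encodings` is hypothetical.

```python
# Codec names as understood by Python's standard library.
ENCODINGS = [
    "utf-8", "utf-16-le", "utf-16-be",
    "utf-32-le", "utf-32-be", "latin-1", "cp1252",
]


def compare_encodings(text: str) -> dict:
    """Map each encoding name to a spaced hex string, or None if it cannot encode the text."""
    result = {}
    for name in ENCODINGS:
        try:
            result[name] = text.encode(name).hex(" ")
        except UnicodeEncodeError:
            # Legacy single-byte encodings cover only a small repertoire.
            result[name] = None
    return result


if __name__ == "__main__":
    for name, hex_bytes in compare_encodings("€").items():
        print(f"{name:10} {hex_bytes}")
```

For "€" (U+20AC) the byte order difference between the LE and BE forms is visible directly, Latin-1 has no mapping at all, and Windows-1252 uses the single byte `80`.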
Character encoding is the mechanism by which abstract Unicode code points become concrete sequences of bytes suitable for storage and transmission. While Unicode defines the character inventory and assigns code points, encoding transforms those code points into binary data. The three main Unicode encoding forms — UTF-8, UTF-16, and UTF-32 — make different trade-offs between storage efficiency, processing simplicity, and compatibility with existing systems.
UTF-8 dominates the web (over 98% of web pages) because it uses only one byte for ASCII characters (U+0000–U+007F), making it compact for English and programming language text while still covering the entire Unicode code space of roughly 1.1 million code points (U+0000–U+10FFFF). It uses 2–4 bytes for non-ASCII characters, with the byte count determined by the code point's range: 2 bytes up to U+07FF, 3 bytes up to U+FFFF, and 4 bytes above that. UTF-16 uses 2 bytes for Basic Multilingual Plane characters and 4 bytes (a surrogate pair) for supplementary characters, making it efficient for texts heavy in CJK characters, which need 3 bytes each in UTF-8. UTF-32 uses a fixed 4 bytes per code point regardless of the character, which simplifies random access by code point at the cost of up to 4x the storage of UTF-8 (exactly 4x for pure ASCII text).
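The range rules and the surrogate pair arithmetic can be written out directly. The following is a minimal sketch; the function names `encoded_sizes` and `surrogate_pair` are hypothetical and chosen only for this example.

```python
def encoded_sizes(code_point: int) -> dict:
    """Return the byte count of one code point in UTF-8, UTF-16, and UTF-32."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point <= 0x7F:
        utf8 = 1          # ASCII
    elif code_point <= 0x7FF:
        utf8 = 2
    elif code_point <= 0xFFFF:
        utf8 = 3          # rest of the Basic Multilingual Plane
    else:
        utf8 = 4          # supplementary planes
    utf16 = 2 if code_point <= 0xFFFF else 4
    return {"utf-8": utf8, "utf-16": utf16, "utf-32": 4}


def surrogate_pair(code_point: int) -> tuple:
    """Split a supplementary code point into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only supplementary code points need a surrogate pair")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


if __name__ == "__main__":
    print(encoded_sizes(ord("中")))
    print([hex(unit) for unit in surrogate_pair(0x1F600)])
```

For U+1F600 the offset is `0xF600`, which splits into the surrogates `0xD83D` and `0xDE00`.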
Visualizing encodings at the bit and byte level demystifies the machinery behind internationalized software. Consider why an emoji that takes 4 bytes in UTF-8 occupies two JavaScript string indices, why a CJK character that takes 3 bytes in UTF-8 fits in a single 2-byte UTF-16 code unit, or why UTF-8's self-synchronizing property enables fast error recovery. These concepts are essential for engineers building text editors, database schemas, network protocols, and any system where strings cross language or platform boundaries. Encoding awareness prevents the class of bugs that arise from conflating bytes, code units, and code points.
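The difference between bytes, code units, and code points, and the self-synchronizing property of UTF-8, can both be checked with a short Python sketch. The helper names `text_lengths` and `sync_to_boundary` are hypothetical; the second assumes its input is well-formed UTF-8.

```python
def text_lengths(text: str) -> dict:
    """Count the same text in three different units."""
    return {
        "code_points": len(text),                               # Python str indexes code points
        "utf16_code_units": len(text.encode("utf-16-le")) // 2, # what JavaScript's .length counts
        "utf8_bytes": len(text.encode("utf-8")),
    }


def sync_to_boundary(data: bytes, offset: int) -> int:
    """Step back from an arbitrary byte offset to the start of its UTF-8 sequence."""
    # Continuation bytes always look like 10xxxxxx, so they are never a start byte.
    while offset > 0 and (data[offset] & 0xC0) == 0x80:
        offset -= 1
    return offset


if __name__ == "__main__":
    print(text_lengths("😀"))
    data = "a中b".encode("utf-8")
    print(sync_to_boundary(data, 3))
```

Because a decoder landing in the middle of a sequence needs to skip at most three continuation bytes to find the next boundary, a corrupted or truncated byte damages only the character it belongs to.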