How to Use

  1. Enter a character or string to visualize

    Type or paste any character or short string into the input field. Single characters reveal the most detail; strings show how multiple code points chain together as byte sequences in each encoding form.

  2. Examine the byte-level visualization

    The tool renders each encoding as a color-coded byte diagram. UTF-8 shows the bit pattern that signals 1-, 2-, 3-, or 4-byte sequences; UTF-16 shows each 16-bit code unit and surrogate pairs; UTF-32 shows the fixed 4-byte representation. Hover over bytes to see the bit values and their structural role in the encoding.

  3. Compare encodings side by side

    Use the side-by-side comparison view to see how the same character differs in byte count, bit layout, and hexadecimal representation across UTF-8, UTF-16 LE/BE, UTF-32 LE/BE, Latin-1, and Windows-1252. This view is especially valuable for understanding why supplementary plane characters require more bytes.
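The same comparison can be approximated outside the tool. The following is a minimal Python sketch, not the tool's own implementation: the helper name `encode_all` and the `ENCODINGS` list are assumptions made for illustration, and `cp1252` is Python's codec name for Windows-1252.

```python
# Hypothetical helper: encode one string in several encodings and report
# the bytes as hex, or None when the encoding cannot represent the text.
ENCODINGS = ["utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be",
             "latin-1", "cp1252"]

def encode_all(text):
    result = {}
    for name in ENCODINGS:
        try:
            # bytes.hex(" ") needs Python 3.8 or later
            result[name] = text.encode(name).hex(" ")
        except UnicodeEncodeError:
            result[name] = None  # not representable in this encoding
    return result

if __name__ == "__main__":
    for name, hex_bytes in encode_all("€").items():
        print(f"{name:10} {hex_bytes}")
```

For the euro sign (U+20AC), this reports three bytes in UTF-8 (`e2 82 ac`), a single byte in Windows-1252 (`80`), and `None` for Latin-1, which has no euro sign.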

About

Character encoding is the mechanism by which abstract Unicode code points become concrete sequences of bytes suitable for storage and transmission. While Unicode defines the character inventory and assigns code points, encoding transforms those code points into binary data. The three main Unicode encoding forms — UTF-8, UTF-16, and UTF-32 — make different trade-offs between storage efficiency, processing simplicity, and compatibility with existing systems.

UTF-8 dominates the web (over 98% of web pages) because it uses only one byte for ASCII characters (U+0000–U+007F), making it compact for English and programming language text while still covering the entire Unicode code space of roughly 1.1 million code points (U+0000–U+10FFFF). It uses 2–4 bytes for non-ASCII characters, with the byte count determined by the code point's range. UTF-16 uses 2 bytes for Basic Multilingual Plane characters and 4 bytes (a surrogate pair) for supplementary characters, which makes it efficient for texts heavy in CJK characters: most of them need 3 bytes in UTF-8 but only 2 in UTF-16. UTF-32 uses a fixed 4 bytes per code point regardless of its value, which simplifies random access at the cost of up to four times the storage of UTF-8 for ASCII-heavy text.
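The range-to-length rules above can be written down directly. This is a small Python sketch under stated assumptions: the function names are hypothetical, and the inputs are assumed to be valid Unicode scalar values (no surrogates, nothing above U+10FFFF).

```python
def utf8_length(cp):
    """Bytes needed for code point cp in UTF-8."""
    if cp < 0x80:
        return 1   # ASCII
    if cp < 0x800:
        return 2
    if cp < 0x10000:
        return 3   # rest of the Basic Multilingual Plane
    return 4       # supplementary planes

def utf16_length(cp):
    """Bytes needed in UTF-16: one code unit, or a surrogate pair."""
    return 2 if cp < 0x10000 else 4

def utf32_length(cp):
    """Bytes needed in UTF-32: always one 32-bit code unit."""
    return 4

if __name__ == "__main__":
    for ch in "Aé中😀":
        cp = ord(ch)
        print(f"U+{cp:04X}", utf8_length(cp), utf16_length(cp), utf32_length(cp))
```

The computed lengths can be cross-checked against Python's own codecs, for example `len("中".encode("utf-8"))`.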

Visualizing encodings at the bit and byte level demystifies the machinery behind internationalized software. It shows why an emoji that takes 4 bytes in UTF-8 occupies two indices in a JavaScript string (JavaScript indexes strings by UTF-16 code unit, and the emoji needs a surrogate pair), why a CJK character that takes 3 bytes in UTF-8 fits in a single 2-byte UTF-16 code unit, and why UTF-8's self-synchronizing property enables fast error recovery. These concepts are essential for engineers building text editors, database schemas, network protocols, and any system where strings cross language or platform boundaries. Encoding awareness prevents the class of bugs that arise from conflating bytes, code units, and code points.
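The three units are easy to tell apart in code. The sketch below uses Python, where `len()` on a string counts code points; the helper name `measure` is an assumption for illustration. The UTF-16 code unit count is what JavaScript's `String.length` would report for the same text.

```python
def measure(text):
    """Count the same text in three different units."""
    return {
        "code_points": len(text),                               # Python str length
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,  # 2 bytes per unit
        "utf8_bytes": len(text.encode("utf-8")),
    }

if __name__ == "__main__":
    for sample in ["A", "中", "😀"]:
        print(repr(sample), measure(sample))
```

For "😀" this gives 1 code point, 2 UTF-16 code units, and 4 UTF-8 bytes. For "中" it gives 1 code point, 1 UTF-16 code unit, and 3 UTF-8 bytes.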

FAQ

How does UTF-8 signal whether a byte is a single-byte character or part of a multi-byte sequence?
UTF-8 uses a precise prefix encoding defined in RFC 3629. Single-byte characters (U+0000–U+007F) have a leading 0 bit: 0xxxxxxx. Two-byte sequences start with 110xxxxx followed by one continuation byte of the form 10xxxxxx. Three-byte sequences start with 1110xxxx and have two continuation bytes. Four-byte sequences start with 11110xxx and have three continuation bytes. Any byte of the form 10xxxxxx is a continuation byte and must not appear at the start of a sequence. This design makes UTF-8 self-synchronizing: a decoder that lands in the middle of a stream can find the nearest sequence boundary by scanning backward until it reaches a byte that does not begin with 10.
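The prefix scheme can be reproduced by hand. The following Python sketch builds the bytes with bit operations and includes a backward scan for the start of a sequence; `utf8_encode` and `sequence_start` are hypothetical names, and this is an illustration rather than a replacement for a real codec.

```python
def utf8_encode(cp):
    """Encode one code point using the RFC 3629 bit layouts."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x80:                       # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                      # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:                    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

def sequence_start(data, index):
    """Scan backward from index to the nearest byte that is not 10xxxxxx."""
    while index > 0 and (data[index] & 0xC0) == 0x80:
        index -= 1
    return index

def bits(data):
    """Render bytes as space-separated 8-bit binary strings."""
    return " ".join(f"{b:08b}" for b in data)

if __name__ == "__main__":
    print(bits(utf8_encode(0x20AC)))  # 11100010 10000010 10101100
```

In the bytes of "a€b" (61 E2 82 AC 62), calling `sequence_start` at index 3, which is a continuation byte, returns 1, the position of the lead byte E2.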
What is the Byte Order Mark (BOM) and when does it cause problems?
The Byte Order Mark is the code point U+FEFF, which is placed at the start of a UTF-16 or UTF-32 file to signal byte order: the UTF-16 LE BOM is the bytes 0xFF 0xFE, while the UTF-16 BE BOM is 0xFE 0xFF. In UTF-8, the BOM (bytes 0xEF 0xBB 0xBF) is unnecessary because UTF-8 has no byte order ambiguity, yet some Windows tools (notably Notepad) write it anyway. Problems arise when software that does not expect a BOM reads such a file: the three extra bytes appear as garbage at the beginning of the parsed content (typically "ï»¿" when the bytes are misread as Windows-1252, or an invisible U+FEFF character when they are decoded as UTF-8), can break HTTP header parsing if they end up in a response, and trip up CSV and JSON parsers that do not handle the BOM.
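The effect is easy to reproduce. This Python sketch uses the standard `codecs` constants and the `utf-8-sig` codec, which strips a leading BOM if one is present; the sample CSV header is made up for illustration.

```python
import codecs

# A UTF-8 file as some Windows tools would write it: BOM, then content.
data = codecs.BOM_UTF8 + "id,name\n".encode("utf-8")

plain = data.decode("utf-8")          # BOM survives as a leading U+FEFF
stripped = data.decode("utf-8-sig")   # leading BOM removed if present
mojibake = data[:3].decode("cp1252")  # the BOM bytes misread as Windows-1252

if __name__ == "__main__":
    print(repr(plain))     # '\ufeffid,name\n'
    print(repr(stripped))  # 'id,name\n'
    print(mojibake)        # ï»¿
```

Decoding with `utf-8-sig` is a reasonable default when input may or may not carry a BOM, since it also accepts data without one.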
Why do UTF-16 and UTF-32 come in "little-endian" and "big-endian" variants?
Multi-byte integer representations differ between processor architectures: little-endian systems store the least significant byte first (x86, ARM in LE mode), while big-endian systems store the most significant byte first (network byte order, some RISC architectures). UTF-16 encodes most characters as 16-bit integers, which can be stored in either order. For example, U+0041 (A) is stored as 0x41 0x00 in UTF-16 LE but as 0x00 0x41 in UTF-16 BE. A Byte Order Mark at the start of a UTF-16 file signals which byte order to use. UTF-32 has the same issue with its 32-bit code units. UTF-8 has no byte order issue because all code units are single bytes.
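The byte-order difference can be observed with Python's built-in codecs. In this sketch, `detect_utf16_order` is a hypothetical helper that only inspects a leading UTF-16 BOM; it does not attempt to distinguish UTF-16 from UTF-32, whose little-endian BOM begins with the same two bytes.

```python
import codecs

little = "A".encode("utf-16-le")   # bytes 41 00
big = "A".encode("utf-16-be")      # bytes 00 41
with_bom = "A".encode("utf-16")    # BOM first, then the platform's native order

def detect_utf16_order(data):
    """Return 'little', 'big', or None based on a leading UTF-16 BOM."""
    if data.startswith(codecs.BOM_UTF16_LE):
        return "little"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "big"
    return None

# Reading little-endian bytes as big-endian yields a different character.
misread = little.decode("utf-16-be")  # U+4100 instead of U+0041

if __name__ == "__main__":
    print(little.hex(" "), big.hex(" "), detect_utf16_order(with_bom))
    print(f"U+{ord(misread):04X}")
```

Without a BOM or out-of-band information, the helper returns `None`, which is exactly the situation in which a decoder has to guess or be told the byte order.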
What happens when a character is outside the range of Latin-1 or Windows-1252?
Latin-1 (ISO 8859-1) covers only U+0000–U+00FF. Windows-1252 covers nearly the same range but assigns printable characters such as € and curly quotes to most of the bytes 0x80–0x9F, where Latin-1 has the C1 control codes. Any Unicode character outside these repertoires, including Chinese, Japanese, Korean, Arabic, Hindi (Devanagari), and emoji, cannot be represented in these encodings. Attempting to encode such a character results in either an error (in strict mode) or a substitute such as "?". This is why text containing characters from multiple scripts must use a Unicode encoding (UTF-8 or UTF-16). Legacy encodings like Latin-1 were designed for single-script environments and are insufficient for international text.
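Both failure modes can be seen with Python's codecs, shown here as a minimal sketch. The wrapper `encode_legacy` is a hypothetical helper; `errors="strict"` and `errors="replace"` are standard Python error handlers, and other languages expose similar options under different names.

```python
def encode_legacy(text, encoding, errors="strict"):
    """Encode text, returning None if strict encoding fails."""
    try:
        return text.encode(encoding, errors=errors)
    except UnicodeEncodeError:
        return None

if __name__ == "__main__":
    print(encode_legacy("é", "latin-1"))            # b'\xe9'
    print(encode_legacy("€", "latin-1"))            # None: no euro sign in Latin-1
    print(encode_legacy("€", "cp1252"))             # b'\x80'
    print(encode_legacy("中", "cp1252"))             # None in strict mode
    print(encode_legacy("中", "cp1252", "replace"))  # b'?'
```

Note that the `replace` handler loses information silently: once "中" has become "?", the original character cannot be recovered from the bytes.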
How does the visual bit pattern of UTF-8 help detect encoding errors?
UTF-8's structured bit patterns make many encoding errors detectable without a complete decode pass. In a valid UTF-8 stream, every lead byte announces the length of its sequence, so an n-byte sequence must contain exactly n-1 continuation bytes (10xxxxxx) after the lead byte; a missing or stray continuation byte is immediately visible. Overlong encodings (encoding a character using more bytes than necessary) violate the minimum length rule and are explicitly invalid per RFC 3629. Sequences encoding code points in the surrogate range (U+D800–U+DFFF) are also invalid in UTF-8. Tools that visualize the bit structure can highlight these violations instantly, making UTF-8 validation intuitive rather than requiring knowledge of lookup tables or complex state machines.
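Each of these error classes can be demonstrated with a strict decoder. The sketch below relies on Python 3's built-in UTF-8 codec, which rejects all of them; `is_valid_utf8` and the `SAMPLES` table are assumptions made for illustration.

```python
def is_valid_utf8(data):
    """Return True if data decodes under Python's strict UTF-8 codec."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# Hand-picked byte strings, one per error class described above.
SAMPLES = {
    "valid euro sign": b"\xe2\x82\xac",
    "overlong slash (2 bytes for U+002F)": b"\xc0\xaf",
    "encoded surrogate U+D800": b"\xed\xa0\x80",
    "stray continuation byte": b"\x80",
    "truncated 3-byte sequence": b"\xe2\x82",
}

if __name__ == "__main__":
    for label, raw in SAMPLES.items():
        print(f"{label:38} {raw.hex(' '):10} valid={is_valid_utf8(raw)}")
```

Only the first sample is accepted. The others fail for structural reasons that are visible in the bit patterns alone, without consulting any character table.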

📊 Encoding Visualizer

Visualize at the byte level how characters are encoded in UTF-8, UTF-16, and UTF-32, showing header bits, payload bits, and surrogate pairs.
