Basic Latin (ASCII) Block
The Basic Latin block (U+0000–U+007F) is the first Unicode block and covers the 128 original ASCII characters, including the English alphabet, digits, punctuation, and control characters. This deep dive explores every character in the Basic Latin block, its history, and its role as the foundation of all text encoding.
The Basic Latin block (U+0000–U+007F) is the foundation of Unicode. Its 128 code points map exactly to ASCII, which was first published in 1963 and reached its modern form, including lowercase letters, in the 1967 revision. That makes it the most widely supported range in computing. Every system that can handle Unicode can handle Basic Latin, and virtually every text in a language written in the Latin script draws on at least some characters from this block.
History of ASCII
ASCII, the American Standard Code for Information Interchange, was developed in the early 1960s by a committee of the American Standards Association (ASA), the body that later became the American National Standards Institute (ANSI). The first edition was published in 1963, with a major revision in 1967 and a final update in 1986. It was designed to let different computers and communication equipment exchange text reliably by standardizing 128 character assignments that fit in a 7-bit code.
When Unicode was designed in the late 1980s and early 1990s, the designers made a deliberate decision to make the first 128 Unicode code points identical to ASCII. This ensured complete backward compatibility: any ASCII document is already valid UTF-8 (since the high bit of every ASCII byte is zero and UTF-8 encodes U+0000–U+007F as single bytes with the same binary values).
Block Layout
| Range | Category | Description |
|---|---|---|
| U+0000–U+001F | Control characters (C0) | Non-printable control codes |
| U+0020 | Separator | Space character |
| U+0021–U+002F | Punctuation | ! " # $ % & ' ( ) * + , - . / |
| U+0030–U+0039 | Digits | Arabic numerals 0–9 |
| U+003A–U+0040 | Punctuation | : ; < = > ? @ |
| U+0041–U+005A | Uppercase letters | A–Z |
| U+005B–U+0060 | Punctuation | `` [ \ ] ^ _ ` `` |
| U+0061–U+007A | Lowercase letters | a–z |
| U+007B–U+007E | Punctuation | { \| } ~ |
| U+007F | Control character | DEL (delete) |
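The layout in the table can be expressed as a handful of range checks. The following is a minimal sketch in Python; the function name `classify` and the category labels are illustrative choices, not part of any standard API.

```python
def classify(code_point: int) -> str:
    """Return the table category for a Basic Latin code point (hypothetical helper)."""
    if not 0x00 <= code_point <= 0x7F:
        raise ValueError("code point is outside the Basic Latin block")
    if code_point <= 0x1F or code_point == 0x7F:
        return "control"
    if code_point == 0x20:
        return "space"
    if 0x30 <= code_point <= 0x39:
        return "digit"
    if 0x41 <= code_point <= 0x5A:
        return "uppercase"
    if 0x61 <= code_point <= 0x7A:
        return "lowercase"
    # Everything left in U+0021-U+007E is punctuation or a symbol.
    return "punctuation"


for ch in ["A", "z", "7", " ", "~", "\x1b"]:
    print(f"U+{ord(ch):04X} {classify(ord(ch))}")
```

Because the ranges are contiguous, the same checks could also be written as a lookup table of 128 entries; the chain of comparisons is simply easier to compare against the table by eye.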
Control Characters (U+0000–U+001F and U+007F)
The 33 control characters in this block are invisible and have specific semantic meanings inherited from the days of teletypes and terminal communication. Many are now obsolete, but several remain in active use:
- U+0009 CHARACTER TABULATION (HT, Tab): horizontal tab, used for indentation in code and as the field separator in TSV data
- U+000A LINE FEED (LF): the standard newline character on Unix, Linux, and macOS
- U+000D CARRIAGE RETURN (CR): used alone as the newline on classic Mac OS; paired with LF (CRLF) on Windows
- U+001B ESCAPE (ESC): introduces ANSI escape sequences for terminal color and cursor control
- U+0000 NULL (NUL): the string terminator in C and C-style APIs
- U+007F DELETE (DEL): originally punched as all seven holes on paper tape, so it could overwrite any mistyped character
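As a rough illustration of how these characters show up in practice, the sketch below finds the control characters in a string and normalizes the three newline conventions to LF. The helper names `is_control` and `normalize_newlines` and the sample string are made up for this example.

```python
def is_control(ch: str) -> bool:
    """True for the C0 controls (U+0000-U+001F) and DEL (U+007F)."""
    cp = ord(ch)
    return cp <= 0x1F or cp == 0x7F


def normalize_newlines(text: str) -> str:
    """Convert CRLF (Windows) and lone CR (classic Mac OS) to LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


sample = "col1\tcol2\r\nrow\x00end\x7f"
found = [f"U+{ord(c):04X}" for c in sample if is_control(c)]
print(found)  # ['U+0009', 'U+000D', 'U+000A', 'U+0000', 'U+007F']

print(repr(normalize_newlines("a\r\nb\rc\nd")))  # 'a\nb\nc\nd'
```

The CRLF replacement has to run before the lone-CR replacement; otherwise each CRLF pair would turn into two line feeds.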
Printable Characters
The 95 printable characters (U+0020–U+007E) cover the practical needs of English text: the 26 uppercase and 26 lowercase Latin letters, 10 digits, 32 punctuation and symbol characters, and the space. This set drives the vast majority of source code, configuration files, URLs, and English prose on the internet.
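The arithmetic behind the figure of 95 can be checked against the constants in Python's standard `string` module. This is only a sanity check of the counts quoted above; the variable names are arbitrary.

```python
import string

# All code points from U+0020 (space) through U+007E (tilde).
printable = [chr(cp) for cp in range(0x20, 0x7F)]

counts = {
    "uppercase": len(string.ascii_uppercase),
    "lowercase": len(string.ascii_lowercase),
    "digits": len(string.digits),
    "punctuation and symbols": len(string.punctuation),
    "space": 1,
}

for name, n in counts.items():
    print(f"{name}: {n}")
print("total:", sum(counts.values()), "printable:", len(printable))

# The five groups together are exactly the printable range.
groups = (string.ascii_uppercase + string.ascii_lowercase
          + string.digits + string.punctuation + " ")
assert set(groups) == set(printable)
```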
Notable characters and their common uses:
- U+0022 QUOTATION MARK `"`: string delimiter in most programming languages
- U+0023 NUMBER SIGN `#`: comments in Python, Ruby, and shell scripts; Markdown headings
- U+0026 AMPERSAND `&`: HTML entity prefix; bitwise AND (and, doubled, logical AND) in many languages
- U+002F SOLIDUS `/`: path separator on Unix; division operator
- U+003C LESS-THAN SIGN `<`: HTML/XML tag opener
- U+0040 COMMERCIAL AT `@`: email addresses; decorators in Python and annotations in Java
- U+005C REVERSE SOLIDUS `\`: path separator on Windows; escape character in strings
- U+007C VERTICAL LINE `|`: pipe in Unix shell; bitwise OR; Markdown table separator
Encoding in UTF-8
One of UTF-8's most elegant properties is its treatment of Basic Latin. Every code point from U+0000 to U+007F is encoded as a single byte with the identical value — the high bit is always 0. This means:
- Any valid ASCII byte sequence is valid UTF-8
- Code that scans for ASCII delimiters (newlines, slashes, null bytes) works correctly inside UTF-8 text without modification
- The encoding overhead for ASCII-heavy text (English prose, source code, JSON) is zero
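The properties listed above can be observed directly with Python's built-in codecs. The sample strings here are invented for illustration. The key point is that every byte of a multi-byte UTF-8 sequence has its high bit set, so a byte such as 0x2F (`/`) can only ever be a real slash.

```python
# Pure ASCII text produces identical bytes under both encodings.
ascii_text = "GET /index.html HTTP/1.1\r\n"
assert ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# Mixed text: scanning the raw bytes for an ASCII delimiter is still safe.
mixed = "naïve/café/日本語\n"
data = mixed.encode("utf-8")

slash_positions = [i for i, b in enumerate(data) if b == 0x2F]
parts = [p.decode("utf-8") for p in data.split(b"/")]
print(slash_positions)
print(parts)

# Every byte belonging to a non-ASCII character is >= 0x80.
high_bytes = [b for b in data if b >= 0x80]
print(len(data), "bytes total,", len(high_bytes), "with the high bit set")
```

Splitting on the byte level and decoding each piece afterwards works only because the delimiter is ASCII; the same trick is not safe in encodings such as UTF-16, where ASCII byte values can occur inside other characters.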
Why It Still Matters
Despite Unicode containing over 149,000 characters, Basic Latin remains disproportionately important:
- Source code: Nearly all programming language keywords, operators, and syntax use Basic Latin exclusively
- URLs: RFC 3986 limits URL characters to a subset of Basic Latin (with percent-encoding for others)
- Email addresses: Traditional email addresses are restricted to ASCII in both the local part and the domain; internationalized addresses require the SMTPUTF8 extension (RFC 6531)
- JSON keys: While JSON values can use any Unicode, keys in APIs are typically ASCII for interoperability
- Domain names: Traditional DNS hostnames use only `[A-Za-z0-9-]` from Basic Latin (IDN uses Punycode)
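To make the URL and domain-name points concrete, here is a small sketch using only the Python standard library. The function `is_ldh_hostname` is a hypothetical helper that checks the character set alone; it deliberately ignores the other hostname rules (label length, hyphen placement). The built-in `idna` codec and `urllib.parse.quote` show how non-ASCII input is mapped back into Basic Latin.

```python
import re
from urllib.parse import quote

# Character-set check only: letters, digits, and hyphen in each label.
LABEL = re.compile(r"[A-Za-z0-9-]+")


def is_ldh_hostname(name: str) -> bool:
    """Hypothetical helper: True if every dot-separated label is pure LDH."""
    return all(LABEL.fullmatch(label) is not None for label in name.split("."))


print(is_ldh_hostname("example.com"))     # True
print(is_ldh_hostname("bücher.example"))  # False

# IDN: the non-ASCII label is converted to an ASCII "xn--" Punycode form.
print("bücher".encode("idna"))            # b'xn--bcher-kva'

# URLs: characters outside the allowed ASCII subset are percent-encoded
# from their UTF-8 bytes.
print(quote("/café"))                     # /caf%C3%A9
```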
The block's tiny size belies its outsized role: in most English-language computing contexts, the 128 characters of Basic Latin constitute the overwhelming majority of bytes processed.