Control Character
Non-printing characters that control text processing. C0 (U+0000–U+001F): NUL, TAB, LF, CR, ESC. C1 (U+0080–U+009F): rarely used in modern Unicode text. General category: Cc.
What is a Control Character?
Control characters are Unicode (and ASCII) characters that do not represent printable glyphs but instead carry instructions that control the behavior of text-processing devices, terminals, communication protocols, and rendering systems. In Unicode they occupy two 32-character ranges plus one isolated code point:
- C0 controls: U+0000 through U+001F (32 characters), the original ASCII control characters
- DEL: U+007F (1 character), originally the "delete" character used with punched tape
- C1 controls: U+0080 through U+009F (32 characters), extensions for 8-bit systems; ISO 8859 reserves this range for them
Together these 65 code points make up the Unicode general category Cc (Control).
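The count can be checked against the Unicode Character Database. The following is a minimal sketch using Python's standard `unicodedata` module; the variable names are illustrative only.

```python
import sys
import unicodedata

# Collect every code point whose general category is Cc (Control).
cc = [cp for cp in range(sys.maxunicode + 1)
      if unicodedata.category(chr(cp)) == "Cc"]

c0 = [cp for cp in cc if cp <= 0x1F]            # C0 controls
c1 = [cp for cp in cc if 0x80 <= cp <= 0x9F]    # C1 controls

print(len(c0), len(c1), len(cc))   # 32 32 65
print(0x7F in cc)                  # True (DEL is the one remaining Cc code point)
```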
Origins: Teletype and Mainframe Computing
Control characters were designed for the era of teletypes, punched tape, and serial communication. The original 32 ASCII control characters (0–31) encoded physical machine operations:
- U+0007 BEL (Bell): Ring the physical bell on a teletype to alert the operator
- U+0008 BS (Backspace): Move the print head one character to the left
- U+0009 HT (Horizontal Tab): Advance to the next tab stop
- U+000A LF (Line Feed): Advance paper by one line
- U+000D CR (Carriage Return): Return the print head to the beginning of the line
- U+001B ESC (Escape): Signal the start of a control sequence (used in terminal escape codes)
- U+007F DEL: Originally erased a character on punched tape by punching all holes
Several of these remain in everyday use in modern computing:
- LF (U+000A): Unix newline
- CR+LF (U+000D U+000A): Windows newline
- TAB (U+0009): Indentation in code and data files
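Because the two newline conventions coexist, text that crosses platforms often needs to be inspected or normalized. The sketch below counts each line-ending style and converts everything to LF. The function names `detect_newlines` and `normalize_newlines` are hypothetical helpers introduced for illustration.

```python
def detect_newlines(text: str) -> dict[str, int]:
    """Count CRLF, bare LF, and bare CR line endings in text."""
    crlf = text.count("\r\n")
    return {
        "CRLF": crlf,
        "LF": text.count("\n") - crlf,   # LFs not preceded by CR
        "CR": text.count("\r") - crlf,   # CRs not followed by LF
    }


def normalize_newlines(text: str) -> str:
    """Convert CRLF and bare CR to LF (CRLF must be replaced first)."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


sample = "name\tvalue\r\nalpha\t1\nbeta\t2\r\n"
print(detect_newlines(sample))            # {'CRLF': 2, 'LF': 1, 'CR': 0}
print(repr(normalize_newlines(sample)))   # 'name\tvalue\nalpha\t1\nbeta\t2\n'
```

Note that when a file is opened in text mode, Python applies its own newline translation by default, so a check like this is most useful on data read in binary mode and decoded explicitly.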
The C1 Controls
The C1 range (U+0080–U+009F) extends control functionality into bytes 128–159 of 8-bit character sets such as ISO 8859. It includes characters like:
- U+0085 NEL (Next Line): A newline variant used in IBM mainframe EBCDIC environments
- U+008D RI (Reverse Line Feed): Move cursor up one line
- U+009B CSI (Control Sequence Introducer): Alternative to ESC+[ for ANSI terminal sequences
C1 controls are rarely used in modern text but appear in legacy data, mainframe transfers, and occasionally in security exploits.
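One way C1 controls surface in practice is through legacy 8-bit data. The sketch below, using a made-up byte string, shows byte 0x85 decoding to NEL under ISO 8859-1, how Python treats it as a line boundary, and how the same byte means something else under Windows-1252.

```python
import unicodedata

legacy = b"line one\x85line two"   # illustrative data; 0x85 is NEL in ISO 8859-1

text = legacy.decode("latin-1")
nel = text[8]

print(f"U+{ord(nel):04X}", unicodedata.category(nel))   # U+0085 Cc
print(text.splitlines())            # ['line one', 'line two']

# In UTF-8 the same code point is encoded as two bytes.
print(nel.encode("utf-8"))          # b'\xc2\x85'

# Windows-1252 maps byte 0x85 to a printable character instead.
print(unicodedata.name(legacy.decode("cp1252")[8]))     # HORIZONTAL ELLIPSIS
```

A wrong guess between these two encodings is a common reason for stray C1 controls turning up in otherwise ordinary text.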
Control Characters in Modern Computing
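Control characters have no visual representation of their own and are not normally rendered as glyphs. Unicode assigns all of them the general category Cc (Control). Most have the bidirectional class BN (Boundary Neutral), but those that act as separators differ: TAB is S (Segment Separator), LF, CR, and NEL are B (Paragraph Separator), and FF (U+000C) is WS (White Space).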
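These properties can be inspected directly. The following sketch prints the general category and bidirectional class for a handful of controls using Python's `unicodedata` module; the selection of characters is arbitrary.

```python
import unicodedata

samples = ["\x00", "\t", "\n", "\x0c", "\r", "\x1b", "\x7f", "\x85"]

# Map each character to (general category, bidi class, printable?).
props = {
    ch: (unicodedata.category(ch), unicodedata.bidirectional(ch), ch.isprintable())
    for ch in samples
}

for ch, (cat, bidi, printable) in props.items():
    print(f"U+{ord(ch):04X}  {cat}  {bidi:<2}  printable={printable}")

# U+0000  Cc  BN  printable=False
# U+0009  Cc  S   printable=False
# U+000A  Cc  B   printable=False
# U+000C  Cc  WS  printable=False
# U+000D  Cc  B   printable=False
# U+001B  Cc  BN  printable=False
# U+007F  Cc  BN  printable=False
# U+0085  Cc  B   printable=False
```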
The ones still actively used in modern software:
| Character | Code Point | Modern Use |
|---|---|---|
| NUL | U+0000 | C string terminator; null byte in binary protocols |
| TAB | U+0009 | Indentation, TSV data format |
| LF | U+000A | Unix line endings |
| CR | U+000D | Part of Windows CRLF line endings |
| ESC | U+001B | ANSI escape sequences for terminal color/cursor |
| DEL | U+007F | Terminal delete key signal |
Security Considerations
Control characters are a significant source of security vulnerabilities:
- NUL byte injection (U+0000): In languages and APIs with C-string roots, an embedded NUL terminates a string. A filename containing NUL can be truncated when it reaches such an API, so a check made on the full string (for example, on the file extension) can be bypassed, which has enabled path traversal attacks.
- CRLF injection (U+000D U+000A): Inserting CRLF sequences into HTTP headers, email headers, or log entries can split headers or forge entries. In HTTP this class of attack is known as response splitting; in logs it is usually called log injection.
- Escape sequence injection (U+001B): Embedding ANSI escape sequences in log files can corrupt terminal displays or, in some terminal emulators, execute arbitrary commands.
- C1 control obfuscation: C1 controls like NEL (U+0085) are sometimes used to bypass input validation that only strips C0 characters.
- Bidi controls (U+202A–U+202E, U+2066–U+2069): Technically Format characters (Cf), not control characters (Cc), but closely related in their invisibility and security impact.
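As an illustration of how the CRLF and escape-sequence cases might be mitigated, the sketch below makes control and format characters visible before untrusted text is written to a log, and rejects header values that contain line breaks. Both function names and the `<U+XXXX>` notation are assumptions made for this example, not an established API or standard format.

```python
import unicodedata


def escape_controls(value: str) -> str:
    """Replace each Cc or Cf character with a visible <U+XXXX> marker."""
    out = []
    for ch in value:
        if unicodedata.category(ch) in ("Cc", "Cf"):
            out.append(f"<U+{ord(ch):04X}>")
        else:
            out.append(ch)
    return "".join(out)


def is_safe_header_value(value: str) -> bool:
    """Hypothetical policy: reject any value containing CR, LF, or NUL."""
    return not any(ch in "\r\n\x00" for ch in value)


# Attacker-controlled text that tries to forge a log line and clear the screen.
user_input = "alice\r\n2024-01-01 INFO admin logged in\x1b[2J"

print(escape_controls(user_input))
# alice<U+000D><U+000A>2024-01-01 INFO admin logged in<U+001B>[2J

print(is_safe_header_value("text/html"))              # True
print(is_safe_header_value("x\r\nSet-Cookie: a=b"))   # False
```

Escaping rather than deleting keeps the evidence of the attempt in the log, which is often preferable for later analysis.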
Input validation and sanitization routines in web applications should explicitly handle the full range of Unicode control characters, not just ASCII characters 0–31.
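A minimal sketch of such a routine follows. It removes every Cc character except an allow-list, and optionally the Bidi format controls listed above. The name `strip_controls`, the choice to keep only TAB and LF, and the decision to normalize newlines first are all assumptions for illustration; a real application would pick its own policy.

```python
import unicodedata

# Assumption: TAB and LF are the only controls worth keeping.
ALLOWED_CONTROLS = {"\t", "\n"}

# Bidi format controls (category Cf): U+202A–U+202E and U+2066–U+2069.
BIDI_CONTROLS = {chr(cp) for cp in (*range(0x202A, 0x202F), *range(0x2066, 0x206A))}


def strip_controls(text: str, strip_bidi: bool = True) -> str:
    """Drop all Cc characters not in the allow-list, and optionally Bidi controls."""
    # Normalize CRLF and bare CR to LF so line breaks survive the filter.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    kept = []
    for ch in text:
        if unicodedata.category(ch) == "Cc" and ch not in ALLOWED_CONTROLS:
            continue   # covers C0, DEL, and C1 alike
        if strip_bidi and ch in BIDI_CONTROLS:
            continue
        kept.append(ch)
    return "".join(kept)


dirty = "a\x00b\x85c\x1b[31md\u202ee\tf\r\n"
print(repr(strip_controls(dirty)))   # 'abc[31mde\tf\n'
```

Filtering on the Cc category rather than on a hard-coded 0–31 range is what lets the same check catch DEL and the C1 controls.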
Quick Facts
| Property | Value |
|---|---|
| C0 range | U+0000 – U+001F (32 characters) |
| DEL | U+007F (1 character) |
| C1 range | U+0080 – U+009F (32 characters) |
| Unicode category | Cc (Control) |
| Total count | 65 code points |
| Still widely used | NUL, TAB, LF, CR, ESC |
| Security risks | NUL injection, CRLF injection, escape injection |
| Origin | Teletype and punched tape era (1960s ASCII) |