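Null Character
U+0000 (NUL), the first Unicode/ASCII character, used as the string terminator in C/C++. Security risk: null byte injection can truncate strings in vulnerable systems.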
What Is the Null Character?
The Null Character is U+0000, the code point at position zero in the Unicode standard. It is also known as NUL, NULL, or \0. It was inherited from ASCII (where it is defined as 000 in octal, 0x00 in hex) and has the lowest possible code point value.
In many systems and programming languages, the null character serves as a string terminator — a sentinel value that marks the end of a string. In C and related languages, strings are arrays of bytes terminated by a \0 byte. In higher-level languages like Python and JavaScript, strings are length-counted rather than null-terminated, so \0 is a valid character that can appear anywhere in a string.
In C and C-Style Languages
// C: null-terminated strings
char str[] = "Hello";
// Stored as: H e l l o \0
// str[5] == '\0' (null terminator)
// strlen() counts bytes until \0
strlen("Hello\0World"); // 5 — stops at first \0
printf("%s\n", "Hello\0World"); // prints "Hello" only
The null-terminator convention is the root of null byte injection attacks: if a high-level language allows \0 in strings but a lower-level system truncates at it, an attacker can craft input such as "admin\0.jpg" that the two layers interpret differently.
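The mismatch between a length-counted view and a C-string view can be observed from Python with the standard ctypes module. This is only a minimal sketch: the payload is the illustrative value from the text, and the variable names are hypothetical.

```python
import ctypes

payload = b"admin\x00.jpg"  # illustrative input from the text

# A C char buffer holding every byte of the payload plus a trailing NUL.
buf = ctypes.create_string_buffer(payload)

length_counted = buf.raw   # every byte in the buffer, nulls included
c_string_view = buf.value  # bytes up to the first NUL, as C string functions see them

print(len(payload))     # 10
print(c_string_view)    # b'admin'
print(length_counted)   # b'admin\x00.jpg\x00'
```

The high-level layer sees a 10-byte name ending in .jpg, while anything that treats the buffer as a C string sees only "admin".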
In Python
Python strings are length-counted; \0 is a valid string character:
s = "Hello\x00World"
len(s) # 11 — counts the null
s[5] # "\x00"
"\x00" in s # True
print(s) # writes all 11 characters; most terminals show nothing for the null
# Null bytes cause errors with C-extension interfaces
import os
try:
    os.stat("file\x00name") # raises ValueError
except ValueError as e:
    print(e) # e.g. "embedded null byte" (exact wording varies by Python version)
# Checking for null bytes
user_input = "admin\x00.jpg" # example untrusted input
has_null = "\x00" in user_input # True: security check for null injection
cleaned = user_input.replace("\x00", "") # "admin.jpg": strip null bytes
In JavaScript
// JavaScript strings can contain \0
const s = "Hello\u0000World";
s.length; // 11
s.charCodeAt(5); // 0
s.includes("\0"); // true
// alert() and DOM APIs may truncate or mishandle null characters
console.log(s); // "Hello World" (null is invisible in most consoles)
Security Implications
Null characters have been exploited in several vulnerability classes:
- Null byte injection in file paths: "../etc/passwd\0.jpg". A C-level fopen sees the path as "../etc/passwd", ignoring the .jpg suffix.
- SQL injection with nulls: Some SQL parsers or ORMs may mishandle null bytes in query parameters.
- LDAP injection: Null bytes can terminate LDAP filter strings prematurely.
# Secure input validation: reject null bytes
def validate_filename(name: str) -> str:
    if "\x00" in name:
        raise ValueError("Filename contains null byte")
    return name
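To see why the null check matters, the sketch below compares a naive extension check with one that rejects null bytes first. The allow-list and both function names are hypothetical, and the final line only simulates what a C-level API that stops at the first NUL would receive.

```python
import os

ALLOWED_EXTENSIONS = {".jpg", ".png"}  # hypothetical allow-list


def naive_check(name: str) -> bool:
    """Accept a name based on its extension only."""
    return os.path.splitext(name)[1].lower() in ALLOWED_EXTENSIONS


def strict_check(name: str) -> bool:
    """Reject any name containing a null byte, then check the extension."""
    if "\x00" in name:
        return False
    return naive_check(name)


payload = "../etc/passwd\x00.jpg"

print(naive_check(payload))   # True: the extension looks like .jpg
print(strict_check(payload))  # False: rejected because of the null byte

# Simulated view of a C-level API that stops at the first NUL
truncated = payload.split("\x00", 1)[0]
print(truncated)              # ../etc/passwd
```

Rejecting the input outright is usually safer than stripping the null byte, because stripping silently changes what the user submitted.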
In File Formats
Null bytes have specific roles in binary file formats: padding in fixed-width fields, record separators in some database formats, and terminators in null-padded string fields (common in C structs serialized to disk).
# Reading a fixed-width null-padded field from binary
raw_field = b"Alice\x00\x00\x00\x00\x00" # 10 bytes, null-padded
name = raw_field.rstrip(b"\x00").decode("utf-8") # "Alice"
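Null padding of fixed-width fields can also be reproduced with the standard struct module, whose "s" format pads short values with null bytes. The record layout below (a 10-byte name followed by a 16-bit age) is a hypothetical example, not a real file format.

```python
import struct

# Hypothetical record layout: 10-byte name field, then a little-endian uint16
RECORD = struct.Struct("<10sH")

packed = RECORD.pack(b"Alice", 30)
print(packed)  # b'Alice\x00\x00\x00\x00\x00\x1e\x00'

raw_name, age = RECORD.unpack(packed)

# Cut at the first null rather than stripping trailing nulls, so that any
# stale bytes after the terminator are ignored as well.
name = raw_name.split(b"\x00", 1)[0].decode("utf-8")
print(name, age)  # Alice 30
```

Splitting at the first null matches C-string semantics; rstrip gives the same result only when everything after the text is padding.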
Unicode Status
In Unicode, U+0000 is a valid code point but a restricted character in several contexts:
- XML forbids U+0000 in documents.
- UTF-8 encoding of U+0000 is the single byte 0x00 (not the modified UTF-8 0xC0 0x80). Java's Modified UTF-8 encodes it as 0xC0 0x80 to avoid embedded nulls.
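The difference between the two encodings of U+0000 can be checked in Python. The helper encode_nul_modified is a hypothetical name and covers only the NUL rule; full Modified UTF-8 also encodes supplementary characters differently, which this sketch does not attempt.

```python
def encode_nul_modified(text: str) -> bytes:
    """Encode as UTF-8 but write U+0000 as 0xC0 0x80 (NUL handling only)."""
    # In UTF-8 the byte 0x00 occurs only as the encoding of U+0000,
    # so a byte-level replacement is safe here.
    return text.encode("utf-8").replace(b"\x00", b"\xc0\x80")


print("\x00".encode("utf-8"))         # b'\x00'
print(encode_nul_modified("A\x00B"))  # b'A\xc0\x80B'

# A strict UTF-8 decoder rejects the two-byte form as an overlong encoding.
try:
    b"\xc0\x80".decode("utf-8")
    strict_decoder_rejects = False
except UnicodeDecodeError:
    strict_decoder_rejects = True

print(strict_decoder_rejects)         # True
```

The two-byte form keeps 0x00 out of the encoded data, so it can pass through null-terminated C APIs without being cut short.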
Quick Facts
| Property | Value |
|---|---|
| Code point | U+0000 |
| Name | NULL (NUL) |
| ASCII equivalent | NUL (0x00; written \0 in C) |
| C usage | String terminator |
| Python/JS | Valid string character (length-counted strings) |
| UTF-8 encoding | 0x00 (single byte) |
| XML | Forbidden in XML documents |
| Security risk | Null injection — always sanitize in security-sensitive contexts |