Byte Order Mark

U+FEFF placed at the start of a text stream to indicate its byte order and encoding. It is needed for UTF-16/32 when the byte order is not specified elsewhere, but is optional and not recommended for UTF-8.

2021-06-15 · Updated 2024-09-25

What is the Byte Order Mark?

The Byte Order Mark (BOM) is a special Unicode character, U+FEFF (ZERO WIDTH NO-BREAK SPACE in its original meaning, but repurposed as a BOM), placed at the very beginning of a text stream to signal the byte order (endianness) and encoding of the data that follows. When a BOM appears as the first bytes of a file, parsers use it to detect how to interpret the rest of the content.

The BOM is not a visible character. When used as intended (at the start of a file), it communicates metadata to parsers without affecting the rendered text. However, because many parsers do not handle it correctly, the BOM is a frequent source of mysterious bugs, most notoriously the three-character garble ï»¿ that appears at the top of a page when a UTF-8 file with a BOM passes through software that neither strips the BOM nor decodes it as UTF-8.

BOM by Encoding

Encoding	BOM bytes (hex)
UTF-8	`EF BB BF`
UTF-16 BE	`FE FF`
UTF-16 LE	`FF FE`
UTF-32 BE	`00 00 FE FF`
UTF-32 LE	`FF FE 00 00`

The UTF-32 LE BOM (FF FE 00 00) and UTF-16 LE BOM (FF FE) share the same first two bytes, so parsers must check all 4 bytes to distinguish UTF-32 LE from UTF-16 LE.
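The byte sequences in the table are simply U+FEFF serialized in each encoding form, which can be checked directly. The following is a minimal sketch using Python's standard codecs; the `BOMS` dictionary is an illustrative name, not something from the text above.

```python
import codecs

# U+FEFF encoded with an explicit-endianness codec yields that encoding's BOM.
BOMS = {
    "utf-8": "\ufeff".encode("utf-8"),
    "utf-16-be": "\ufeff".encode("utf-16-be"),
    "utf-16-le": "\ufeff".encode("utf-16-le"),
    "utf-32-be": "\ufeff".encode("utf-32-be"),
    "utf-32-le": "\ufeff".encode("utf-32-le"),
}

for name, bom in BOMS.items():
    # bytes.hex(sep) needs Python 3.8 or newer
    print(f"{name:10} {bom.hex(' ').upper()}")

# The standard library exposes the same byte strings as constants.
assert BOMS["utf-8"] == codecs.BOM_UTF8
assert BOMS["utf-16-le"] == codecs.BOM_UTF16_LE
assert BOMS["utf-32-le"] == codecs.BOM_UTF32_LE

# The UTF-32 LE BOM begins with the UTF-16 LE BOM, hence the 4-byte check.
assert BOMS["utf-32-le"].startswith(BOMS["utf-16-le"])
```

Note that a UTF-16 LE file whose first character after the BOM is U+0000 also begins with `FF FE 00 00`, so the 4-byte check is a strong heuristic rather than a proof.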

When the BOM is Meaningful vs. Optional

UTF-16 and UTF-32: The BOM is essentially mandatory when byte order is not externally specified. Without a BOM, the reader must guess or use a default. The BOM unambiguously resolves this.
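To make the guessing problem concrete, the sketch below decodes the same two characters with the wrong and the right byte order. It assumes Python's built-in `utf-16`, `utf-16-le`, and `utf-16-be` codecs; the variable names are illustrative.

```python
text = "hi"

# The same two characters, serialized in each byte order without a BOM.
be_bytes = text.encode("utf-16-be")   # b'\x00h\x00i'
le_bytes = text.encode("utf-16-le")   # b'h\x00i\x00'

# Guessing the wrong byte order does not fail; it silently yields other characters.
wrong = be_bytes.decode("utf-16-le")
print(wrong == "\u6800\u6900")        # True: two unrelated CJK code points

# With a BOM in front, the generic "utf-16" codec picks the right order
# and removes the BOM from the decoded string.
from_be = (b"\xfe\xff" + be_bytes).decode("utf-16")
from_le = (b"\xff\xfe" + le_bytes).decode("utf-16")
print(from_be, from_le)               # hi hi
```

The wrong guess raises no error, which is why a missing BOM in UTF-16 data tends to surface as corrupted text rather than as a decoding failure.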

UTF-8: The BOM is meaningless for byte-order purposes (UTF-8 is byte-order independent) but is sometimes used as a signature meaning "this file is UTF-8." The Unicode Standard permits this use but states that a BOM is neither required nor recommended for UTF-8, and RFC 3629, the IETF specification of UTF-8, advises protocols that always use UTF-8 to forbid it as a signature. In practice:

- Avoid UTF-8 BOM for web content, Unix text files, source code, and JSON.
- Windows Notepad historically added a UTF-8 BOM by default; since Windows 10 version 1903 it saves UTF-8 without a BOM by default.
- Some Windows tools (Visual Studio, Excel CSV export) still produce UTF-8 with BOM.
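The difference between UTF-8 with and without the signature is exactly three leading bytes. A small sketch, assuming Python's `utf-8` and `utf-8-sig` codecs; `strip_utf8_bom` is a hypothetical helper introduced here for illustration.

```python
import codecs

def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM, if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data

plain = "héllo".encode("utf-8")        # no signature
signed = "héllo".encode("utf-8-sig")   # EF BB BF prepended

print(signed[:3].hex(" ").upper())     # EF BB BF
print(strip_utf8_bom(signed) == plain) # True

# 'utf-8-sig' accepts both variants and decodes them to the same string.
print(plain.decode("utf-8-sig") == signed.decode("utf-8-sig"))  # True
```

Reading with `utf-8-sig` and writing with `utf-8` is one simple way to accept BOM-prefixed input without ever producing it.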

The `ï»¿` Problem

The three bytes EF BB BF (UTF-8 BOM) decoded as Windows-1252 or Latin-1 produce the characters ï»¿. This is the most common visible manifestation of BOM mishandling. It appears:

- At the top of PHP-generated web pages where the PHP file was saved with a BOM
- At the start of CSV files exported from Microsoft Excel and then parsed naively
- In the <title> or first text node of HTML files saved from Windows text editors
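The garble is easy to reproduce, which helps when confirming that a stray BOM is the cause. The sketch below builds a hypothetical page in memory (the `<title>Home</title>` content is made up) and decodes it three ways.

```python
bom = b"\xef\xbb\xbf"  # UTF-8 BOM
page = bom + "<title>Home</title>".encode("utf-8")

# A legacy single-byte codec turns the BOM into three visible characters.
print(page.decode("cp1252")[:3])    # ï»¿
print(page.decode("latin-1")[:3])   # ï»¿

# Plain UTF-8 keeps the BOM as one invisible U+FEFF character instead.
as_utf8 = page.decode("utf-8")
print(repr(as_utf8[0]))             # '\ufeff'

# 'utf-8-sig' removes it entirely.
print(page.decode("utf-8-sig"))     # <title>Home</title>
```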

# Python: BOM handling
with open('bomfile.txt', 'r', encoding='utf-8-sig') as f:
    # 'utf-8-sig' strips the BOM on read, adds it on write
    content = f.read()

# Without 'utf-8-sig', the BOM becomes character U+FEFF in the string
with open('bomfile.txt', 'r', encoding='utf-8') as f:
    content = f.read()
    if content.startswith('\ufeff'):
        content = content[1:]  # Manual BOM strip

# Detecting a BOM in binary data.
# The 4-byte UTF-32 BOMs are tested before the 2-byte UTF-16 BOMs,
# because FF FE 00 00 (UTF-32 LE) begins with FF FE (UTF-16 LE).
with open('unknown.txt', 'rb') as f:
    raw = f.read(4)
if raw.startswith(b'\xef\xbb\xbf'):
    print('UTF-8 with BOM')
elif raw.startswith(b'\xff\xfe\x00\x00'):
    print('UTF-32 LE')
elif raw.startswith(b'\x00\x00\xfe\xff'):
    print('UTF-32 BE')
elif raw.startswith(b'\xff\xfe'):
    print('UTF-16 LE')
elif raw.startswith(b'\xfe\xff'):
    print('UTF-16 BE')
else:
    print('No BOM found')

Quick Facts

Property	Value
Unicode code point	U+FEFF
Character name	ZERO WIDTH NO-BREAK SPACE (original) / BOM (as used)
UTF-8 bytes	EF BB BF
UTF-16 LE bytes	FF FE
UTF-16 BE bytes	FE FF
Rendered width	Zero (invisible when used as BOM)
Python codec	utf-8-sig strips/adds BOM automatically

Common Pitfalls

PHP and the BOM. Anything that precedes the <?php opening tag, including the three bytes of a BOM, is sent to the client as output, after which HTTP headers can no longer be set. The PHP error message "headers already sent" is often caused by a BOM in a PHP source file. Fix: save PHP files as UTF-8 without BOM.
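Finding the offending files by eye is hard because the BOM is invisible in most editors. The following sketch scans a source tree for files that start with the UTF-8 BOM; `find_bom_files` and `remove_bom` are hypothetical helper names, and the `*.php` pattern is just a default chosen for this pitfall.

```python
import codecs
from pathlib import Path

def find_bom_files(root, pattern="*.php"):
    """Return paths under root whose first three bytes are the UTF-8 BOM."""
    hits = []
    for path in sorted(Path(root).rglob(pattern)):
        if not path.is_file():
            continue
        with path.open("rb") as f:
            if f.read(3) == codecs.BOM_UTF8:
                hits.append(path)
    return hits

def remove_bom(path):
    """Rewrite the file in place without its leading UTF-8 BOM."""
    path = Path(path)
    data = path.read_bytes()
    if data.startswith(codecs.BOM_UTF8):
        path.write_bytes(data[len(codecs.BOM_UTF8):])
```

A typical use would be `for p in find_bom_files("src"): remove_bom(p)`, ideally run on files under version control so the change can be reviewed.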

JSON files with BOM. The JSON specification (RFC 8259, Section 8.1) prohibits adding a BOM to JSON that is exchanged over a network: "JSON text exchanged between systems that are not part of a closed ecosystem MUST be encoded using UTF-8. [...] Implementations MUST NOT add a byte order mark (U+FEFF) to the beginning of a networked-transmitted JSON text." The same section allows parsers to ignore a BOM rather than treat it as an error, but many JSON parsers reject BOM-prefixed JSON anyway.
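As an illustration of a parser that rejects the BOM, the sketch below uses Python's standard `json` module as implemented in CPython; the payload is made up for the example.

```python
import json

raw = b'\xef\xbb\xbf{"name": "BOM"}'   # UTF-8 JSON with a leading BOM

# Decoding as plain UTF-8 leaves U+FEFF in the string, which json.loads rejects.
try:
    json.loads(raw.decode("utf-8"))
    rejected = False
except json.JSONDecodeError:
    rejected = True
print(rejected)                        # True

# Decoding with 'utf-8-sig' drops the BOM first, so parsing succeeds.
data = json.loads(raw.decode("utf-8-sig"))
print(data["name"])                    # BOM

# When writing, the plain 'utf-8' codec never adds a BOM.
out = json.dumps(data).encode("utf-8")
```

Decoding input with `utf-8-sig` and encoding output with `utf-8` keeps a program tolerant of what it reads while never emitting a BOM itself.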

U+FEFF as ZWNBSP inside text. When U+FEFF appears in the middle of text rather than at the start, it is interpreted as ZERO WIDTH NO-BREAK SPACE, a use that is now deprecated. Modern Unicode uses U+2060 (WORD JOINER) for this purpose. A stray U+FEFF inside a string is an invisible formatting character that can cause unexpected word-joining behavior, and it often arrives there when BOM-prefixed files are concatenated.
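One possible cleanup policy is sketched below: drop a leading U+FEFF and replace interior occurrences with U+2060. This is an assumption about what a given pipeline wants (some would simply delete every U+FEFF), and `clean_feff` and `feff_positions` are hypothetical helper names.

```python
BOM = "\ufeff"
WORD_JOINER = "\u2060"

def feff_positions(text):
    """Return the indexes at which U+FEFF occurs in text."""
    return [i for i, ch in enumerate(text) if ch == BOM]

def clean_feff(text):
    """Drop a leading U+FEFF and replace interior ones with U+2060."""
    if text.startswith(BOM):
        text = text[1:]
    return text.replace(BOM, WORD_JOINER)

sample = "\ufeffnon\ufeffbreaking"
print(feff_positions(sample))                       # [0, 4]
print(clean_feff(sample) == "non\u2060breaking")    # True
```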

Related Terms

UTF-8, UTF-16, UTF-32, Character Encoding
