What is Byte Order Mark (BOM)?

U+FEFF placed at the start of a text stream to indicate byte order and encoding. Essential for UTF-16/32, optional and discouraged for UTF-8.

UTF-8: Variable-length Unicode encoding using 1–4 bytes per character. The dominant encoding of the web (98%+ of websites) with full ASCII backward compatibility.

UTF-16: Variable-length Unicode encoding using 2 or 4 bytes per character (one or two 16-bit code units). Used internally by Java, JavaScript, and Windows.

UTF-32: Fixed-length Unicode encoding using exactly 4 bytes per character. Simple but space-inefficient. CPython 3.3+ does not store every string this way: it chooses 1, 2, or 4 bytes per character depending on the widest character in the string.

What is Character Encoding?

A system that maps characters to byte sequences for digital storage and transmission. Every text file has an encoding — the question is whether it is declared correctly.

Byte Order Mark (BOM) — Unicode Glossary

What is the Byte Order Mark?

The Byte Order Mark (BOM) is a special Unicode character, U+FEFF (ZERO WIDTH NO-BREAK SPACE in its original meaning, but repurposed as a BOM), placed at the very beginning of a text stream to signal the byte order (endianness) and encoding of the data that follows. When a BOM appears as the first bytes of a file, parsers use it to detect how to interpret the rest of the content.

The BOM is not a visible character. When used as intended (at the start of a file), it communicates metadata to parsers without affecting the rendered text. However, because many parsers do not handle it correctly, the BOM is a frequent source of mysterious bugs — particularly the infamous three-character garble ï»¿ that appears at the top of pages generated from UTF-8 BOM files processed by software that doesn't strip it.

BOM by Encoding

Encoding	BOM bytes (hex)
UTF-8	EF BB BF
UTF-16 BE	FE FF
UTF-16 LE	FF FE
UTF-32 BE	00 00 FE FF
UTF-32 LE	FF FE 00 00

The UTF-32 LE BOM (FF FE 00 00) and UTF-16 LE BOM (FF FE) share the same first two bytes, so parsers must check all 4 bytes to distinguish UTF-32 LE from UTF-16 LE.

When the BOM is Meaningful vs. Optional

UTF-16 and UTF-32: The BOM is essentially mandatory when byte order is not externally specified. Without a BOM, the reader must guess or use a default. The BOM unambiguously resolves this.
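To make the byte-order ambiguity concrete, here is a minimal sketch using Python's built-in codecs. The sample string `"hi"` and the variable names are illustrative assumptions; the point is only that the same bytes decode to different text when the reader guesses the wrong endianness, and that a BOM removes the guess.

```python
import codecs

text = "hi"

# The generic "utf-16" codec writes a BOM in the machine's native byte order.
with_bom = text.encode("utf-16")
assert with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# The endian-specific codecs write no BOM at all.
le_bytes = text.encode("utf-16-le")  # 68 00 69 00
be_bytes = text.encode("utf-16-be")  # 00 68 00 69

# With a BOM, the generic codec picks the right byte order and drops the mark.
decoded_with_bom = (codecs.BOM_UTF16_BE + be_bytes).decode("utf-16")

# Without a BOM, a wrong guess does not fail; it silently yields other characters.
wrong_guess = be_bytes.decode("utf-16-le")
```

Here `decoded_with_bom` is the original text, while `wrong_guess` is two unrelated CJK characters (U+6800 and U+6900), which is why unlabeled UTF-16 data is risky.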

UTF-8: The BOM is meaningless for byte-order purposes (UTF-8 is byte-order independent) but is sometimes used as a signature meaning "this file is UTF-8." The IETF specification for UTF-8 (RFC 3629) discourages the BOM in protocols whose text is always UTF-8, and the Unicode Standard permits it but neither requires nor recommends it. In practice:

Avoid a UTF-8 BOM for web content, Unix text files, source code, and JSON.
Windows Notepad historically added a UTF-8 BOM; since Windows 10 version 1903 it saves UTF-8 without a BOM by default.
Some Windows tools (Visual Studio, Excel CSV export) still produce UTF-8 with a BOM.
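A common place where a UTF-8 BOM from a Windows tool causes trouble is CSV parsing. The sketch below uses made-up sample bytes (`name,qty` and one row) rather than a real export, and shows how the BOM can end up glued to the first column name unless the data is decoded with `utf-8-sig`.

```python
import csv
import io

# Hypothetical CSV bytes as a BOM-writing tool might produce them.
raw = b"\xef\xbb\xbfname,qty\r\nwidget,3\r\n"

# Plain UTF-8 decoding keeps U+FEFF, so it becomes part of the first header.
naive = next(csv.DictReader(io.StringIO(raw.decode("utf-8"), newline="")))

# utf-8-sig removes the BOM before the csv module sees the text.
clean = next(csv.DictReader(io.StringIO(raw.decode("utf-8-sig"), newline="")))

naive_keys = list(naive)  # first key starts with an invisible U+FEFF
clean_keys = list(clean)
```

With the naive decode, a lookup such as `naive["name"]` raises `KeyError` even though the header looks correct when printed.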

The `ï»¿` Problem

The three bytes EF BB BF (UTF-8 BOM) decoded as Windows-1252 or Latin-1 produce the characters ï»¿. This is the most common visible manifestation of BOM mishandling. It appears:

At the top of PHP-generated web pages where the PHP file was saved with a BOM
At the start of CSV files exported from Microsoft Excel and then parsed naively
In the <title> or first text node of HTML files saved from Windows text editors
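The garble can be reproduced in a few lines. This is a small illustration, with `"Title"` as placeholder content, of how the same three bytes turn into visible characters under a single-byte decoding and into one invisible character under UTF-8.

```python
bom = b"\xef\xbb\xbf"
payload = bom + "Title".encode("utf-8")

# Single-byte encodings map each BOM byte to its own visible character.
as_cp1252 = payload.decode("cp1252")
as_latin1 = payload.decode("latin-1")

# UTF-8 decodes the three bytes to one invisible code point, U+FEFF.
as_utf8 = payload.decode("utf-8")
```

Both single-byte decodes begin with the three characters U+00EF, U+00BB, U+00BF, while `as_utf8` is six characters long and starts with U+FEFF.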

# Python: BOM handling
with open('bomfile.txt', 'r', encoding='utf-8-sig') as f:
    # 'utf-8-sig' strips the BOM on read, adds it on write
    content = f.read()

# Without 'utf-8-sig', the BOM becomes character U+FEFF in the string
with open('bomfile.txt', 'r', encoding='utf-8') as f:
    content = f.read()
    if content.startswith('\ufeff'):
        content = content[1:]  # Manual BOM strip

# Detecting a BOM in binary data
# Check the 4-byte UTF-32 marks before the 2-byte UTF-16 marks,
# because FF FE 00 00 also starts with the UTF-16 LE mark FF FE.
with open('unknown.txt', 'rb') as f:
    raw = f.read(4)
if raw.startswith(b'\xef\xbb\xbf'):
    print('UTF-8 with BOM')
elif raw.startswith(b'\xff\xfe\x00\x00'):
    print('UTF-32 LE')
elif raw.startswith(b'\x00\x00\xfe\xff'):
    print('UTF-32 BE')
elif raw.startswith(b'\xff\xfe'):
    print('UTF-16 LE')
elif raw.startswith(b'\xfe\xff'):
    print('UTF-16 BE')
else:
    print('No BOM found')

Quick Facts

Property	Value
Unicode code point	U+FEFF
Character name	ZERO WIDTH NO-BREAK SPACE (original) / BOM (as used)
UTF-8 bytes	EF BB BF
UTF-16 LE bytes	FF FE
UTF-16 BE bytes	FE FF
Rendered width	Zero (invisible when used as BOM)
Python codec	utf-8-sig strips/adds BOM automatically

Common Pitfalls

PHP and the BOM. Any bytes before the <?php opening tag, including a BOM, are sent as output, which prevents HTTP headers from being set afterward. The PHP error message "headers already sent" is often caused by a BOM in a PHP source file. Fix: save PHP files as UTF-8 without a BOM.
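When the editor that added the BOM is not under your control, the mark can be removed after the fact. The helper below is a sketch, not part of any PHP tooling: `strip_utf8_bom` is a hypothetical name, and it assumes the file is small enough to read into memory.

```python
import codecs
from pathlib import Path


def strip_utf8_bom(path: Path) -> bool:
    """Rewrite the file without a leading UTF-8 BOM.

    Returns True if a BOM was found and removed, False otherwise.
    """
    data = path.read_bytes()
    if not data.startswith(codecs.BOM_UTF8):
        return False
    path.write_bytes(data[len(codecs.BOM_UTF8):])
    return True
```

A call such as `strip_utf8_bom(Path("index.php"))` leaves BOM-free files untouched, so it could be run over a whole source tree, for example with `Path("src").rglob("*.php")` (the directory name is an assumption).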

JSON files with BOM. The JSON specification (RFC 8259) requires UTF-8 for JSON text exchanged between systems that are not part of a closed ecosystem, and states: "Implementations MUST NOT add a byte order mark (U+FEFF) to the beginning of a networked-transmitted JSON text." The same section allows parsers to ignore a leading BOM instead of treating it as an error, but many JSON parsers still reject BOM-prefixed JSON.
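Python's standard `json` module is one parser that rejects a string starting with U+FEFF. The following sketch, using a made-up one-key document, shows the failure and the usual workaround of decoding with `utf-8-sig` first.

```python
import json

# Hypothetical JSON bytes with a leading UTF-8 BOM.
raw = b'\xef\xbb\xbf{"ok": true}'

# Decoding as plain UTF-8 keeps U+FEFF in the string, which json.loads rejects.
try:
    json.loads(raw.decode("utf-8"))
    rejected = False
except json.JSONDecodeError:
    rejected = True

# Decoding with utf-8-sig drops the BOM before parsing.
parsed = json.loads(raw.decode("utf-8-sig"))
```

Other parsers may behave differently, since the RFC lets them ignore the mark, so producers should not rely on it being tolerated.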

U+FEFF as ZWNBSP inside text. U+FEFF can also appear in the middle of text, where it is interpreted as ZERO WIDTH NO-BREAK SPACE rather than as a BOM. That use is deprecated; modern Unicode uses U+2060 (WORD JOINER) for this purpose. A U+FEFF in the middle of a string is an invisible formatting character that prevents a line break at that position.
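Because U+FEFF is a format character and not whitespace, ordinary trimming does not remove it. The sketch below uses an invented `key=value` string to show this; removing every occurrence with `replace` assumes that no interior U+FEFF is wanted, which may not hold for all text.

```python
import unicodedata

# Hypothetical string with a stray U+FEFF at the start and in the middle.
s = "\ufeffkey\ufeff=value"

name = unicodedata.name("\ufeff")          # formal Unicode character name
category = unicodedata.category("\ufeff")  # "Cf" means format character

# str.strip() only removes whitespace, so the marks survive.
stripped = s.strip()

# Explicit removal (assumes interior U+FEFF is never intentional).
cleaned = s.replace("\ufeff", "")
```

After this runs, `stripped` is identical to `s`, and `cleaned` is the plain `key=value` text.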
