📚 Unicode Fundamentals

Understanding Byte Order Mark (BOM)

The Byte Order Mark (BOM) is a special Unicode character used at the start of a text stream to signal its encoding and byte order. This guide explains what the BOM is, when it is necessary, and the common problems it can cause in modern applications.

If you have ever opened a CSV file and seen a mysterious  at the beginning, or had a PHP script output a blank line before any HTML, or struggled with JSON parsing failures on files that look perfectly clean — you have probably encountered the Byte Order Mark (BOM). It is one of the most misunderstood characters in Unicode, serving a genuine technical purpose while causing endless headaches when it appears where it shouldn't. This guide explains what the BOM is, when it helps, when it hurts, and how to handle it in every major programming context.

What Is the BOM?

The Byte Order Mark is the Unicode character U+FEFF, named ZERO WIDTH NO-BREAK SPACE (though this name is now considered a misnomer — its no-break-space function was deprecated in Unicode 3.2 in favor of U+2060 WORD JOINER).

When placed at the beginning of a text stream, U+FEFF serves as a signature that indicates:

  1. Byte order (endianness) for UTF-16 and UTF-32 encodings
  2. Encoding identification — its byte pattern differs across UTF-8, UTF-16, and UTF-32, so reading the first few bytes can reveal which encoding was used

BOM Byte Sequences

Encoding              BOM Bytes    Hex
UTF-8                 EF BB BF     0xEF 0xBB 0xBF
UTF-16 Big Endian     FE FF        0xFE 0xFF
UTF-16 Little Endian  FF FE        0xFF 0xFE
UTF-32 Big Endian     00 00 FE FF  0x00 0x00 0xFE 0xFF
UTF-32 Little Endian  FF FE 00 00  0xFF 0xFE 0x00 0x00
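You can reproduce the table above in Python: each BOM is simply the character U+FEFF encoded with the corresponding codec.

```python
# Each BOM is just U+FEFF encoded in the corresponding encoding.
for enc in ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le"):
    print(f"{enc:10}", "\ufeff".encode(enc).hex(" ").upper())
# utf-8      EF BB BF
# utf-16-be  FE FF
# utf-16-le  FF FE
# utf-32-be  00 00 FE FF
# utf-32-le  FF FE 00 00
```

(`bytes.hex(" ")` with a separator argument requires Python 3.8 or later.)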

The BOM works because U+FEFF has a specific byte pattern in each encoding, while the reverse byte order — U+FFFE — is a noncharacter permanently reserved by Unicode. If software reads 0xFF 0xFE at the start of a UTF-16 stream, it knows the byte order is little-endian. If it reads 0xFE 0xFF, it knows the byte order is big-endian.

Why Byte Order Matters

Byte order (endianness) is the arrangement of bytes within a multi-byte value. Modern CPUs fall into two camps:

Endianness          Byte Order                    U+0041 in UTF-16  Common CPUs
Big Endian (BE)     Most significant byte first   00 41             SPARC, older PowerPC, network protocols
Little Endian (LE)  Least significant byte first  41 00             x86, x86-64, ARM (default), Apple Silicon

UTF-8 does not have this problem because it defines its bytes in a fixed order — each byte in a multi-byte UTF-8 sequence has a role determined by its bit pattern, not by CPU architecture. But UTF-16 and UTF-32 use 16-bit and 32-bit code units respectively, and those multi-byte values can be stored in either byte order.

Without a BOM or an explicit encoding declaration, a reader seeing the bytes 00 41 cannot tell whether they represent U+0041 (A) in big-endian or U+4100 (a CJK ideograph) in little-endian.
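You can see the ambiguity directly in Python by decoding the same two bytes under each assumption:

```python
# The same two bytes mean different characters depending on assumed byte order.
data = bytes([0x00, 0x41])
assert data.decode("utf-16-be") == "A"       # U+0041, LATIN CAPITAL LETTER A
assert data.decode("utf-16-le") == "\u4100"  # U+4100, a CJK ideograph
```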

The UTF-8 BOM: Useful or Harmful?

UTF-8 has no byte-order ambiguity — its encoding is byte-order independent by design. So the UTF-8 BOM (EF BB BF) serves only as an encoding signature: "this file is UTF-8."

When the UTF-8 BOM Helps

  • Windows: Microsoft tools (Notepad, Excel, Visual Studio) historically use the BOM to distinguish UTF-8 from legacy ANSI encodings (Windows-1252, etc.). Without the BOM, older Windows software may misdetect UTF-8 files as ANSI and display garbled text.
  • CSV files opened in Excel: Excel on Windows often needs the BOM to correctly interpret UTF-8 CSV files. Without it, non-ASCII characters become mojibake.
  • Mixed-encoding environments: In workflows where files might be UTF-8 or Latin-1, the BOM provides an unambiguous marker.

When the UTF-8 BOM Hurts

  • Unix/Linux/macOS shells: The BOM is three invisible bytes at the start of a file. Shell scripts starting with #!/bin/bash will fail if the BOM precedes the shebang — the kernel reads EF BB BF 23 21 instead of 23 21 (#!) and does not recognize the interpreter directive.
  • PHP: If a PHP file has a BOM, those three bytes are sent to the browser before any output, breaking header() calls and session management (which require no prior output).
  • JSON: RFC 8259 says implementations "MUST NOT add a byte order mark" to the beginning of JSON text, though parsers are permitted to ignore one — and many do.
  • XML: The XML specification allows a BOM, but if a BOM is present and contradicts the encoding declared in the XML declaration, parsers may reject the document.
  • Concatenation: When you concatenate multiple UTF-8 files, each file's BOM ends up in the middle of the resulting file, appearing as a zero-width no-break space — invisible but semantically wrong.
  • diff/version control: BOMs at the start of source code files can cause spurious diffs and merge conflicts.
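The concatenation problem is easy to demonstrate: join two BOM-prefixed files and the second BOM survives as a stray zero-width character inside the data.

```python
# Two "files", each saved with its own UTF-8 BOM, then concatenated.
part1 = "\ufeffname,amount\n".encode("utf-8")
part2 = "\ufeffcafe,12.50\n".encode("utf-8")

combined = (part1 + part2).decode("utf-8")
# The second BOM is now an invisible U+FEFF in the middle of the text.
print(combined.count("\ufeff"))  # 2
```

Even utf-8-sig decoding only strips a BOM at the very start, so the interior one remains.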

The Industry Consensus

The Unicode Standard itself says: "Use of a BOM is neither required nor recommended for UTF-8." The W3C, WHATWG, and most modern specifications agree. The practical guideline:

Do not add a BOM to UTF-8 files unless you have a specific reason (typically Windows Excel compatibility).

UTF-16 and the BOM: Where It Matters Most

UTF-16 is where the BOM serves its original and most important purpose. Because UTF-16 uses 16-bit code units, byte order is ambiguous without a declaration. The BOM resolves this:

FE FF → Big Endian    (UTF-16BE)
FF FE → Little Endian (UTF-16LE)

When an explicit encoding label is present (e.g., an HTTP Content-Type: text/plain; charset=utf-16le header or a file format that specifies endianness), the BOM is optional. When there is no external metadata — for instance, a bare .txt file — the BOM is the primary mechanism for detecting byte order.

If a UTF-16 stream has no BOM and no external encoding declaration, the Unicode Standard says to assume big-endian. In practice, however, Windows software predominantly uses little-endian, which is why many implementations default to UTF-16LE.
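Python's codecs mirror this design: the plain "utf-16" codec writes a BOM (in native byte order) and consumes it on decode, while the explicit-endian codecs neither add nor expect one.

```python
# "utf-16" writes a BOM in the platform's native byte order...
encoded = "hi".encode("utf-16")
assert encoded[:2] in (b"\xff\xfe", b"\xfe\xff")  # starts with a BOM
# ...and the BOM tells the decoder the byte order, then disappears.
assert encoded.decode("utf-16") == "hi"

# Explicit-endian codecs add no BOM; endianness is fixed by the codec name.
assert "hi".encode("utf-16-be") == b"\x00h\x00i"
assert "hi".encode("utf-16-le") == b"h\x00i\x00"
```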

UTF-32 and the BOM

The same logic applies to UTF-32, using 32-bit code units:

00 00 FE FF → Big Endian    (UTF-32BE)
FF FE 00 00 → Little Endian (UTF-32LE)

UTF-32 is rare in the wild (files are four times the size of ASCII), so UTF-32 BOM issues seldom arise in practice.

Detecting and Handling BOMs in Code

Python

Python's codecs module and the built-in open() function handle BOMs through specific codec names:

# Reading a file with a UTF-8 BOM
with open("file.txt", encoding="utf-8-sig") as f:
    text = f.read()  # BOM is silently stripped

# Writing a file with a UTF-8 BOM (for Excel compatibility)
with open("report.csv", "w", encoding="utf-8-sig") as f:
    f.write("Name,Amount\n")
    f.write("Cafe,\u20ac12.50\n")  # € sign

# Reading UTF-16 (BOM auto-detected)
with open("file.txt", encoding="utf-16") as f:
    text = f.read()  # Python reads the BOM and determines endianness

# Manually checking for a BOM
with open("file.txt", "rb") as f:
    raw = f.read(4)
if raw.startswith(b"\xef\xbb\xbf"):
    print("UTF-8 BOM detected")
elif raw.startswith(b"\xff\xfe\x00\x00"):
    print("UTF-32LE BOM detected")
elif raw.startswith(b"\x00\x00\xfe\xff"):
    print("UTF-32BE BOM detected")
elif raw.startswith(b"\xff\xfe"):
    print("UTF-16LE BOM detected")
elif raw.startswith(b"\xfe\xff"):
    print("UTF-16BE BOM detected")
else:
    print("No BOM found")

Key Python codec names:

Codec                  BOM Handling
utf-8                  Does not strip or add a BOM
utf-8-sig              Strips BOM on read, adds BOM on write
utf-16                 Auto-detects byte order via BOM on read; adds BOM on write
utf-16-le / utf-16-be  Explicit endianness; no BOM handling
utf-32                 Auto-detects via BOM on read; adds BOM on write
utf-32-le / utf-32-be  Explicit endianness; no BOM handling

JavaScript / Node.js

Node.js does not automatically strip the UTF-8 BOM:

const fs = require("fs");

let text = fs.readFileSync("file.txt", "utf8");

// Strip UTF-8 BOM if present
if (text.charCodeAt(0) === 0xFEFF) {
    text = text.slice(1);
}

Shell (Bash/Zsh)

Detecting and removing a UTF-8 BOM from the command line:

# Check if a file has a UTF-8 BOM
hexdump -C file.txt | head -1
# If it starts with "ef bb bf", there's a BOM

# Remove BOM using sed (GNU sed syntax; BSD/macOS sed handles -i and \x escapes differently)
sed -i '1s/^\xEF\xBB\xBF//' file.txt

# Remove BOM from all .csv files in a directory
for f in *.csv; do
    sed -i '1s/^\xEF\xBB\xBF//' "$f"
done

C# / .NET

using System.IO;
using System.Text;

// Reading — StreamReader strips the UTF-8 BOM by default
using var reader = new StreamReader("file.txt", Encoding.UTF8);
string text = reader.ReadToEnd();

// Writing without BOM
using var writer = new StreamWriter("file.txt", false, new UTF8Encoding(false));
writer.Write("No BOM here");

// Writing with BOM
using var writer2 = new StreamWriter("file.txt", false, new UTF8Encoding(true));
writer2.Write("BOM included");

Common BOM Problems and Solutions

Problem                            Cause                                 Solution
 at start of file               UTF-8 BOM (EF BB BF) read as Latin-1  Open as UTF-8; strip the BOM
PHP "headers already sent"         BOM bytes output before <?php         Save PHP files without a BOM
JSON parse error                   Leading BOM violates RFC 8259         Write plain utf-8; decode with utf-8-sig to strip
Shell script "command not found"   BOM before #!/bin/bash                Remove the BOM from scripts
CSV garbled in Excel               Missing BOM for Windows Excel         Write with utf-8-sig encoding
Extra blank line in HTML           BOM rendered as invisible character   Remove the BOM from templates
Spurious diffs in version control  BOM added/removed inconsistently      Standardize via .gitattributes or .editorconfig
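The JSON row is worth seeing in action. Python's json.loads rejects a string that begins with U+FEFF, so a BOM-prefixed file fails unless you decode it with utf-8-sig first:

```python
import json

raw = b'\xef\xbb\xbf{"ok": true}'  # JSON file saved with a UTF-8 BOM

try:
    json.loads(raw.decode("utf-8"))  # BOM survives decoding as U+FEFF
except json.JSONDecodeError as e:
    print("parse failed:", e.msg)

# utf-8-sig strips the BOM, so the same bytes now parse cleanly.
data = json.loads(raw.decode("utf-8-sig"))
print(data)  # {'ok': True}
```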

Editor Configuration

Most modern editors let you control BOM behavior:

Editor           BOM Setting
VS Code          Click the encoding in the status bar → "Save with Encoding" → "UTF-8" (no BOM) or "UTF-8 with BOM"
Sublime Text     File → Save with Encoding → UTF-8 (default: no BOM)
Notepad++        Encoding menu → "UTF-8" (no BOM) or "UTF-8-BOM"
Vim              :set nobomb to remove, :set bomb to add; :set fileencoding=utf-8
JetBrains IDEs   File → File Properties → Remove BOM
Windows Notepad  Defaults to UTF-8 without BOM for new files since Windows 10 v1903

You can also enforce BOM policy project-wide with .editorconfig:

[*]
charset = utf-8

[*.csv]
charset = utf-8-bom

U+FFFE: The Anti-BOM

The character U+FFFE is a noncharacter — permanently reserved and never assigned. Its existence is deliberate: if a reader encounters FF FE at the start of a UTF-16 stream, it knows the byte order is little-endian (reading U+FEFF correctly). But if it encounters FE FF where FF FE was expected (or vice versa), it knows the bytes are swapped.

The pair U+FEFF / U+FFFE is the mechanism that makes BOM detection reliable. Because U+FFFE is guaranteed to never be a valid character, seeing it in the "character" position is an unambiguous signal that the byte order has been misread.
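A quick Python check shows what "misread byte order" looks like: decoding a big-endian BOM as little-endian produces the noncharacter U+FFFE instead of U+FEFF.

```python
# A big-endian BOM on disk...
bom_be = "\ufeff".encode("utf-16-be")            # b'\xfe\xff'

assert bom_be.decode("utf-16-be") == "\ufeff"    # correct byte order -> BOM
assert bom_be.decode("utf-16-le") == "\ufffe"    # swapped -> noncharacter U+FFFE
```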

Key Takeaways

  • The Byte Order Mark (U+FEFF) indicates byte order in UTF-16 and UTF-32, and serves as an encoding signature in UTF-8.
  • UTF-16 and UTF-32 genuinely need the BOM (or an external declaration) to resolve byte-order ambiguity.
  • UTF-8 does not need a BOM — it has no byte-order ambiguity. The Unicode Standard says the UTF-8 BOM is "neither required nor recommended."
  • Windows Excel is the primary reason UTF-8 BOMs persist — use utf-8-sig in Python when generating CSV files for Excel users.
  • The BOM causes real problems in shell scripts, PHP, JSON, and file concatenation. Avoid it unless you have a compelling reason to include it.
  • Use Python's utf-8-sig codec to transparently handle BOMs, and Node.js's charCodeAt(0) === 0xFEFF check to strip them manually.
  • Configure your editor and .editorconfig to enforce a consistent BOM policy across your project.
