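Byte Order Mark (BOM)
U+FEFF placed at the start of a text stream to indicate byte order and encoding. Expected in UTF-16/32 when the byte order is not specified externally; optional and not recommended in UTF-8.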
What is the Byte Order Mark?
The Byte Order Mark (BOM) is a special Unicode character, U+FEFF (ZERO WIDTH NO-BREAK SPACE in its original meaning, but repurposed as a BOM), placed at the very beginning of a text stream to signal the byte order (endianness) and encoding of the data that follows. When a BOM appears as the first bytes of a file, parsers use it to detect how to interpret the rest of the content.
The BOM is not a visible character. When used as intended (at the start of a file), it communicates metadata to parsers without affecting the rendered text. However, because many parsers do not handle it correctly, the BOM is a frequent source of mysterious bugs, particularly the infamous three-character garble ï»¿ that appears at the top of pages generated from UTF-8 BOM files processed by software that doesn't strip it.
BOM by Encoding
| Encoding | BOM bytes (hex) |
|---|---|
| UTF-8 | EF BB BF |
| UTF-16 BE | FE FF |
| UTF-16 LE | FF FE |
| UTF-32 BE | 00 00 FE FF |
| UTF-32 LE | FF FE 00 00 |
The UTF-32 LE BOM (FF FE 00 00) and UTF-16 LE BOM (FF FE) share the same first two bytes, so parsers must check all 4 bytes to distinguish UTF-32 LE from UTF-16 LE.
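The byte sequences in the table are simply U+FEFF encoded in each encoding form. The following is a minimal sketch using Python's standard codecs; the names `BOM_CHAR` and `bom_bytes` are illustrative and not part of any library.

```python
import codecs

# Encode U+FEFF with codecs that fix the byte order explicitly. These codecs
# do not add a BOM of their own, so the output is exactly the BOM sequence.
BOM_CHAR = "\ufeff"
bom_bytes = {
    "UTF-8": BOM_CHAR.encode("utf-8"),
    "UTF-16 BE": BOM_CHAR.encode("utf-16-be"),
    "UTF-16 LE": BOM_CHAR.encode("utf-16-le"),
    "UTF-32 BE": BOM_CHAR.encode("utf-32-be"),
    "UTF-32 LE": BOM_CHAR.encode("utf-32-le"),
}

for name, raw in bom_bytes.items():
    print(name, " ".join(f"{b:02X}" for b in raw))

# The standard library exposes the same sequences as constants.
assert bom_bytes["UTF-16 LE"] == codecs.BOM_UTF16_LE
assert bom_bytes["UTF-32 LE"] == codecs.BOM_UTF32_LE

# The UTF-32 LE BOM begins with the UTF-16 LE BOM, hence the 4-byte check.
assert codecs.BOM_UTF32_LE.startswith(codecs.BOM_UTF16_LE)
```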
When the BOM is Meaningful vs. Optional
UTF-16 and UTF-32: The BOM is essentially mandatory when byte order is not externally specified. Without a BOM, the reader must guess or use a default. The BOM unambiguously resolves this.
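To illustrate why the byte order matters, the sketch below decodes the same two bytes under both byte orders, then shows Python's generic `utf-16` codec using a BOM to choose the order. The variable names are illustrative only.

```python
# Two bytes with no BOM: the byte order has to come from somewhere else.
raw = b"\x41\x00"
as_le = raw.decode("utf-16-le")  # U+0041, the letter A
as_be = raw.decode("utf-16-be")  # U+4100, a different character entirely
print(repr(as_le), repr(as_be))

# With a BOM, the generic "utf-16" codec picks the byte order itself
# and removes the BOM from the decoded text.
with_le_bom = b"\xff\xfe" + "A".encode("utf-16-le")
with_be_bom = b"\xfe\xff" + "A".encode("utf-16-be")
assert with_le_bom.decode("utf-16") == "A"
assert with_be_bom.decode("utf-16") == "A"
```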
UTF-8: The BOM is meaningless for byte-order purposes (UTF-8 is byte-order independent) but is sometimes used as a signature meaning "this file is UTF-8." RFC 3629, the IETF specification for UTF-8, discourages the BOM in protocols that always use UTF-8. The Unicode Standard permits it but neither requires nor recommends it. In practice:
- Avoid UTF-8 BOM for web content, Unix text files, source code, and JSON.
- Windows Notepad historically added a UTF-8 BOM; newer versions (Windows 10 1903+) stopped this default.
- Some Windows tools (Visual Studio, Excel CSV export) still produce UTF-8 with BOM.
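In Python, whether a UTF-8 BOM is written depends on the codec chosen. A short sketch, with a made-up sample string:

```python
text = "name,value\n"

plain = text.encode("utf-8")       # no BOM
signed = text.encode("utf-8-sig")  # BOM prepended

assert plain == b"name,value\n"
assert signed == b"\xef\xbb\xbf" + plain

# When decoding, 'utf-8-sig' accepts input with or without a BOM.
assert plain.decode("utf-8-sig") == signed.decode("utf-8-sig") == text
```

Writing with plain `utf-8` is the safer default for the cases listed above; `utf-8-sig` is mainly useful when a consumer is known to expect the signature.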
The ï»¿ Problem
The three bytes EF BB BF (UTF-8 BOM) decoded as Windows-1252 or Latin-1 produce the characters ï»¿. This is the most common visible manifestation of BOM mishandling. It appears:
- At the top of PHP-generated web pages where the PHP file was saved with a BOM
- At the start of CSV files exported from Microsoft Excel and then parsed naively
- In the <title> or first text node of HTML files saved from Windows text editors
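The garble can be reproduced directly by decoding the BOM bytes with the wrong codec. This is a minimal sketch; the sample page content is made up for illustration.

```python
bom = b"\xef\xbb\xbf"

# Each BOM byte maps to one visible character in these single-byte encodings.
garbled_cp1252 = bom.decode("cp1252")
garbled_latin1 = bom.decode("latin-1")
print(garbled_cp1252)

# A UTF-8 file saved with a BOM, then misread as Windows-1252.
page = (bom + "<title>Home</title>".encode("utf-8")).decode("cp1252")
print(page)
```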
```python
# Python: BOM handling

# 'utf-8-sig' strips a leading BOM on read (and adds one on write)
with open('bomfile.txt', 'r', encoding='utf-8-sig') as f:
    content = f.read()

# Without 'utf-8-sig', the BOM becomes character U+FEFF in the string
with open('bomfile.txt', 'r', encoding='utf-8') as f:
    content = f.read()
if content.startswith('\ufeff'):
    content = content[1:]  # Manual BOM strip

# Detecting BOM in binary data
with open('unknown.txt', 'rb') as f:
    raw = f.read(4)

# Check the 4-byte UTF-32 BOMs before the 2-byte UTF-16 ones,
# because the UTF-32 LE BOM begins with the UTF-16 LE BOM.
if raw.startswith(b'\xef\xbb\xbf'):
    print('UTF-8 with BOM')
elif raw.startswith(b'\xff\xfe\x00\x00'):
    print('UTF-32 LE')
elif raw.startswith(b'\x00\x00\xfe\xff'):
    print('UTF-32 BE')
elif raw.startswith(b'\xff\xfe'):
    print('UTF-16 LE')
elif raw.startswith(b'\xfe\xff'):
    print('UTF-16 BE')
else:
    print('No BOM found')
```
Quick Facts
| Property | Value |
|---|---|
| Unicode code point | U+FEFF |
| Character name | ZERO WIDTH NO-BREAK SPACE (original) / BOM (as used) |
| UTF-8 bytes | EF BB BF |
| UTF-16 LE bytes | FF FE |
| UTF-16 BE bytes | FE FF |
| Rendered width | Zero (invisible when used as BOM) |
| Python codec | utf-8-sig strips/adds BOM automatically |
Common Pitfalls
PHP and the BOM. Any bytes before the <?php opening tag, including a BOM, are sent as output, which prevents HTTP headers from being set afterwards. PHP's "headers already sent" error is often caused by a BOM in a PHP source file. Fix: save PHP files as UTF-8 without BOM.
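If an editor cannot be configured to save without a BOM, the signature can be removed after the fact. The following is a sketch of one way to do that in Python; `strip_utf8_bom` is a hypothetical helper name, not an existing tool.

```python
import codecs
from pathlib import Path


def strip_utf8_bom(path: Path) -> bool:
    """Remove a leading UTF-8 BOM from the file at `path`.

    Returns True if a BOM was found and removed, False otherwise.
    """
    data = path.read_bytes()
    if not data.startswith(codecs.BOM_UTF8):
        return False
    # Rewrite the file with the three BOM bytes dropped.
    path.write_bytes(data[len(codecs.BOM_UTF8):])
    return True
```

It could be applied to a source tree with something like `for p in Path("src").rglob("*.php"): strip_utf8_bom(p)`, where the `src` directory is an assumption. Because the function rewrites files in place, it is worth running on a copy or under version control first.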
JSON files with BOM. The JSON specification (RFC 8259) requires JSON text exchanged between systems that are not part of a closed ecosystem to be encoded using UTF-8, and states that implementations MUST NOT add a byte order mark (U+FEFF) to the beginning of a networked-transmitted JSON text. Parsers MAY ignore a BOM rather than treat it as an error, but many reject BOM-prefixed JSON.
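Python's standard `json` module is one parser that rejects a decoded string beginning with U+FEFF. The sketch below shows the failure and one way around it; the payload is made up for illustration.

```python
import json

payload = b'\xef\xbb\xbf{"ok": true}'

# Decoding as plain UTF-8 leaves U+FEFF at the start of the string,
# and json.loads refuses such a string.
try:
    json.loads(payload.decode("utf-8"))
    rejected = False
except json.JSONDecodeError:
    rejected = True

# Decoding with 'utf-8-sig' drops the BOM first, so parsing succeeds.
data = json.loads(payload.decode("utf-8-sig"))
print(rejected, data)
```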
U+FEFF as ZWNBSP inside text. U+FEFF can also appear in the middle of text (not just at the start), where it is interpreted as ZERO WIDTH NO-BREAK SPACE, a use that is now deprecated. Modern Unicode uses U+2060 (WORD JOINER) for this purpose. A U+FEFF in the middle of a string is treated as a format character and may cause invisible word-joining behavior.
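One possible cleanup, sketched below, drops a leading U+FEFF and replaces interior occurrences with U+2060. Whether interior occurrences should be replaced or removed outright depends on the application, so treat that choice as an assumption; `normalize_feff` is a hypothetical helper name.

```python
BOM = "\ufeff"
WORD_JOINER = "\u2060"


def normalize_feff(text: str) -> str:
    """Drop a leading U+FEFF and turn interior ones into U+2060."""
    if text.startswith(BOM):
        text = text[1:]
    return text.replace(BOM, WORD_JOINER)


sample = "\ufeffkey\ufeffword"
cleaned = normalize_feff(sample)
print(ascii(cleaned))
```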