Unicode in Filenames
Modern operating systems support Unicode filenames, but filesystems differ in how they encode and normalize them. This causes cross-platform file sharing problems, especially between macOS (which tends to produce decomposed, NFD-style names) and Linux (which stores names byte-for-byte, usually in NFC). This guide covers how Unicode filenames work on Windows, macOS, and Linux, and how to handle them safely in Python, shell scripts, and other tools.
Unicode filenames are a source of subtle, maddening bugs. A file created on macOS may be invisible to a script on Linux. A ZIP archive from Windows may extract with garbled names on other platforms. Two files that look identical in a directory listing may actually have different byte sequences in their names. This guide explains why, and how to handle Unicode filenames safely across operating systems.
Filesystem Encoding Basics
Every filesystem stores filenames as byte sequences. The question is: what encoding maps those bytes to characters?
| Filesystem | OS | Encoding | Normalization |
|---|---|---|---|
| NTFS | Windows | UTF-16LE | None (preserves as-is) |
| APFS | macOS | UTF-8 | None stored (preserves as-is), but lookups are normalization-insensitive |
| HFS+ | macOS (legacy) | UTF-16 | Apple's variant of NFD (decomposed on write) |
| ext4 | Linux | Bytes (usually UTF-8) | None |
| Btrfs | Linux | Bytes (usually UTF-8) | None |
| ZFS | Linux/BSD | Bytes (usually UTF-8) | Optional (via normalization property) |
| FAT32 | All | OEM code page + UTF-16 LFN | None |
| exFAT | All | UTF-16LE | None |
The critical distinction is between filesystems that enforce an encoding (NTFS, APFS) and those that treat filenames as opaque byte sequences (ext4, Btrfs).
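One way to look at the stored bytes from Python is to list a directory with a bytes path, which makes `os.listdir` return the raw names without decoding them. The following is a minimal sketch; the helper name `raw_names` is made up for illustration.

```python
import os


def raw_names(directory: str) -> list[bytes]:
    """Return directory entries as raw bytes, exactly as the OS reports them."""
    # Passing a bytes path makes os.listdir return bytes instead of str,
    # so no decoding takes place on the way out.
    return sorted(os.listdir(os.fsencode(directory)))


if __name__ == "__main__":
    for raw in raw_names("."):
        # os.fsdecode applies Python's filesystem encoding to the raw name.
        print(raw.hex(" "), "->", os.fsdecode(raw))
```

On a filesystem that treats names as opaque bytes, this listing is the ground truth: two entries that print identically can still show different hex sequences.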
The NFC vs NFD Problem
The most common Unicode filename issue involves normalization forms. The character "é" (e with acute accent) can be represented two ways:
| Form | Code Points | Bytes (UTF-8) | Description |
|---|---|---|---|
| NFC (Composed) | U+00E9 | C3 A9 (2 bytes) | Single precomposed character |
| NFD (Decomposed) | U+0065 U+0301 | 65 CC 81 (3 bytes) | Base letter + combining accent |
Both look identical on screen, but they are different byte sequences. This matters because:
macOS (APFS / HFS+)
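macOS is where decomposed names usually come from, but its two filesystems behave differently.
On HFS+ (legacy macOS), filenames are converted to a decomposed form on write. When you create a file named caf\u00e9.txt (NFC, 2 bytes for é), HFS+ stores it as cafe\u0301.txt (3 bytes for e + combining accent). Apple's decomposition is close to NFD but not identical: it leaves certain character ranges undecomposed that strict NFD would decompose. The practical impact is small but can cause issues with edge cases.
On APFS (macOS 10.13+), names are stored as given, without normalization, but lookups are normalization-insensitive: caf\u00e9.txt and cafe\u0301.txt refer to the same file. Decomposed names remain common in practice because some macOS APIs and applications decompose names before they reach the filesystem.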
Linux (ext4)
Linux stores whatever bytes you give it. If you create caf\u00e9.txt (NFC), it stays NFC.
If you create cafe\u0301.txt (NFD), it stays NFD. Linux does not normalize, so both names can exist side by side in the same directory.
Windows (NTFS)
NTFS preserves filenames exactly as created, with no normalization. It stores filenames as UTF-16LE.
The Cross-Platform Problem
Consider this scenario:
- On macOS, you create caf\u00e9.txt. The name ends up on disk in NFD form: cafe\u0301.txt.
- You push it to a Git repository (with core.precomposeunicode disabled, so Git records the name as it appears on disk).
- On Linux, you clone the repo. The filename is still in NFD form: cafe\u0301.txt.
- A Python script calls os.path.exists("caf\u00e9.txt") using the NFC spelling. This returns False on Linux because the bytes do not match.
The same file, the same visible name, but different bytes -- and the comparison fails.
Detecting Normalization Issues
Python
import unicodedata
import os
filename = "cafe\u0301.txt"
# Check what form a string is in
nfc = unicodedata.normalize("NFC", filename)
nfd = unicodedata.normalize("NFD", filename)
print(f"Original bytes: {filename.encode('utf-8').hex()}")
print(f"NFC bytes: {nfc.encode('utf-8').hex()}")
print(f"NFD bytes: {nfd.encode('utf-8').hex()}")
print(f"NFC == NFD: {nfc == nfd}") # False!
print(f"Length NFC: {len(nfc)}, Length NFD: {len(nfd)}")
# Safe file existence check: normalize before comparing
def safe_exists(path: str) -> bool:
    directory = os.path.dirname(path) or "."
    target = unicodedata.normalize("NFC", os.path.basename(path))
    for entry in os.listdir(directory):
        if unicodedata.normalize("NFC", entry) == target:
            return True
    return False
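A script usually needs to open the file, not just test for it, so the on-disk spelling matters more than a yes/no answer. The sketch below applies the same normalize-then-compare idea but returns the path as actually stored; `find_on_disk` is a hypothetical helper name, and it assumes the parent directory exists.

```python
import os
import unicodedata
from typing import Optional


def find_on_disk(path: str) -> Optional[str]:
    """Return the path as actually spelled on disk, or None if nothing matches.

    Entries are compared after NFC normalization, so an NFC request
    finds an NFD-named file and vice versa.
    """
    directory = os.path.dirname(path) or "."
    target = unicodedata.normalize("NFC", os.path.basename(path))
    for entry in os.listdir(directory):
        if unicodedata.normalize("NFC", entry) == target:
            # Return the entry exactly as listed so open() receives
            # the byte sequence the filesystem expects.
            return os.path.join(directory, entry)
    return None
```

A caller would pass the name it expects, for example `find_on_disk("caf\u00e9.txt")`, and open the returned path if it is not `None`.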
Command Line
# Show the actual bytes of a filename
ls caf*.txt | xxd | head
# Python one-liner to find names that are not in NFC
# (single quotes outside, so an interactive shell does not expand the "!")
python3 -c '
import os, unicodedata
for f in os.listdir("."):
    nfc = unicodedata.normalize("NFC", f)
    if f != nfc:
        print(f"Not NFC: {f!r}")
'
Git and Unicode Filenames
Git has a configuration option that controls how it handles Unicode normalization in filenames:
# Check current setting
git config core.precomposeunicode
# macOS: true by default for new repositories
# Git then converts decomposed names reported by macOS back to
# precomposed (NFC) form before recording them
git config --global core.precomposeunicode true
# Linux: the setting has no effect (Git preserves whatever is in the repo)
The Git Normalization Problem
If a macOS user commits a file with an NFD filename and a Linux user clones the repo, the Linux user gets the NFD filename. This can cause:
- Build scripts that reference the NFC filename to fail
- import statements in Python to fail if the module filename is NFD
- Duplicate-looking files if someone creates a new file with the NFC name
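Duplicate-looking files can be found mechanically by grouping names on their NFC form. This is a minimal sketch over a plain list of names; the function name is hypothetical, and the list could come from any source, such as the output of `git ls-files -z` split on NUL bytes.

```python
import unicodedata
from collections import defaultdict


def find_normalization_collisions(names: list[str]) -> list[list[str]]:
    """Group names that differ only by Unicode normalization."""
    groups: dict[str, list[str]] = defaultdict(list)
    for name in names:
        groups[unicodedata.normalize("NFC", name)].append(name)
    # Keep only groups that contain more than one distinct spelling.
    return [sorted(set(g)) for g in groups.values() if len(set(g)) > 1]
```

An empty result means no two names in the list collapse to the same NFC form.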
Prevention
Normalize filenames before committing:
# Install convmv (filename encoding converter)
sudo apt install convmv # Debian/Ubuntu
brew install convmv # macOS
# Preview NFD-to-NFC conversion (dry run is the default; nothing is renamed)
convmv -r -f utf-8 -t utf-8 --nfc .
# Actually rename (--notest turns off the dry run)
convmv -r -f utf-8 -t utf-8 --nfc --notest .
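Where convmv is not available, the same rename can be approximated in Python. The sketch below is not a convmv replacement, only an illustration of the idea; `normalize_tree` and its `dry_run` parameter are names invented here, and it skips any entry whose NFC name already exists.

```python
import os
import unicodedata


def normalize_tree(root: str, dry_run: bool = True) -> list[tuple[str, str]]:
    """Rename entries under root to NFC and return the (old, new) pairs.

    With dry_run=True (the default) nothing is renamed.
    """
    planned = []
    # topdown=False visits children before their parent directory, so
    # renaming a directory never invalidates paths still to be visited.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in dirnames + filenames:
            nfc = unicodedata.normalize("NFC", name)
            if nfc == name:
                continue
            old = os.path.join(dirpath, name)
            new = os.path.join(dirpath, nfc)
            if os.path.lexists(new):
                # Either a real collision or a normalization-insensitive
                # filesystem; leave the entry alone in both cases.
                continue
            planned.append((old, new))
            if not dry_run:
                os.rename(old, new)
    return planned
```

Calling `normalize_tree(".")` first and reviewing the returned pairs mirrors the preview step; `normalize_tree(".", dry_run=False)` then performs the renames.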
Platform-Specific Filename Restrictions
Beyond encoding, each OS has different rules for valid characters in filenames:
Windows (NTFS)
| Restriction | Details |
|---|---|
| Forbidden characters | < > : " / \ \| ? * and control characters (0x00-0x1F) |
| Forbidden names | CON, PRN, AUX, NUL, COM1-9, LPT1-9 (with or without an extension) |
| Trailing dots/spaces | Silently stripped |
| Max path length | 260 characters (default), 32,767 with long path enabled |
| Case | Preserving but insensitive |
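The rules in this table can be checked before a name ever reaches a Windows machine. The following sketch reports which rules a single filename (not a full path) would break; the function name and the wording of the messages are illustrative, and control characters are included because Windows rejects them as well.

```python
import re

# Characters from the table, plus control characters 0x00-0x1F.
_FORBIDDEN = re.compile(r'[<>:"/\\|?*\x00-\x1f]')
_RESERVED = (
    {"CON", "PRN", "AUX", "NUL"}
    | {f"COM{i}" for i in range(1, 10)}
    | {f"LPT{i}" for i in range(1, 10)}
)


def windows_name_problems(name: str) -> list[str]:
    """Return the Windows naming rules that a filename violates."""
    problems = []
    if not name:
        problems.append("empty name")
    if _FORBIDDEN.search(name):
        problems.append("forbidden character")
    # Device names are reserved with or without an extension (AUX, aux.txt).
    if name.split(".")[0].upper() in _RESERVED:
        problems.append("reserved device name")
    if name != name.rstrip(". "):
        problems.append("trailing dot or space")
    return problems
```

An empty list means the name passes these checks; it does not cover path length, which depends on the full path.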
macOS (APFS)
| Restriction | Details |
|---|---|
| Forbidden characters | / and NUL at the POSIX layer; Finder swaps : and / (a stored : is displayed as /) |
| Max filename length | 255 UTF-8 bytes |
| Max path length | 1024 bytes |
| Case | Preserving, insensitive by default (configurable) |
Linux (ext4)
| Restriction | Details |
|---|---|
| Forbidden characters | / and NUL only |
| Max filename length | 255 bytes |
| Max path length | 4096 bytes |
| Case | Sensitive |
The 255-Byte Limit
The 255-byte limit on ext4 and APFS is measured in bytes, not characters. A filename using only ASCII characters can be 255 characters long. A filename using 4-byte emoji can only be 63 characters long before hitting the byte limit.
# Check if a filename is safe for ext4/APFS
def is_filename_safe(name: str) -> bool:
    encoded = name.encode("utf-8")
    return len(encoded) <= 255
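When a name is too long, cutting the byte string at an arbitrary offset can split a multi-byte character. The sketch below shortens a name to a byte budget while keeping the extension; `truncate_utf8` is a hypothetical helper, and the 255 default is the ext4/APFS limit discussed above.

```python
import os


def truncate_utf8(name: str, limit: int = 255) -> str:
    """Shorten name to at most `limit` UTF-8 bytes, keeping the extension."""
    if len(name.encode("utf-8")) <= limit:
        return name
    stem, ext = os.path.splitext(name)
    budget = limit - len(ext.encode("utf-8"))
    if budget <= 0:
        raise ValueError("extension alone exceeds the byte limit")
    # Cutting the byte string may split a multi-byte character;
    # errors="ignore" drops the incomplete tail instead of raising.
    stem = stem.encode("utf-8")[:budget].decode("utf-8", errors="ignore")
    return stem + ext
```

This keeps every remaining character intact, but it can still separate a base letter from a following combining mark, so decomposed names may lose an accent at the cut point.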
ZIP Archives and Unicode
ZIP archives have a long history of encoding problems because the original ZIP format (1989) predates Unicode adoption. Filename encoding depends on the ZIP creator:
| Creator | Filename Encoding |
|---|---|
| Windows Explorer (modern) | UTF-8 with language encoding flag (bit 11) |
| Windows Explorer (old) | OEM code page (CP437 or system locale) |
| macOS Archive Utility | UTF-8 |
| Info-ZIP (Linux) | UTF-8 (usually) |
| 7-Zip | UTF-8 with flag |
| WinRAR | UTF-8 with flag |
| Java ZipOutputStream | UTF-8 by default; another charset only if passed to the constructor |
Handling ZIP Encoding in Python
import zipfile

# Reading: zipfile decodes names as UTF-8 when flag bit 11 is set and as
# CP437 otherwise (Python 3.11+ also accepts a metadata_encoding argument)
with zipfile.ZipFile("archive.zip", "r") as zf:
    for info in zf.infolist():
        # Check if UTF-8 flag is set
        if info.flag_bits & 0x800:
            # Filename is UTF-8
            print(f"UTF-8: {info.filename}")
        else:
            # zipfile decoded the raw bytes as CP437; re-encode to recover them.
            # Try UTF-8 first, then fall back to a legacy code page
            # (CP949, Korean, is only an example; use the creator's locale)
            raw = info.filename.encode("cp437")
            try:
                name = raw.decode("utf-8")
            except UnicodeDecodeError:
                name = raw.decode("cp949", errors="replace")
            print(f"Recoded: {name}")
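The writing side can be checked the same way. The sketch below builds a ZIP in memory with Python's `zipfile` and reads back bit 11 for each member; the function name is made up for illustration, and the behaviour described is that of CPython's standard library rather than of ZIP tools in general.

```python
import io
import zipfile


def utf8_flag_by_name(names: list[str]) -> dict[str, bool]:
    """Write empty members to an in-memory ZIP and report bit 11 for each."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        for name in names:
            zf.writestr(name, b"")
    # Re-open the same buffer for reading and inspect each member's flags.
    with zipfile.ZipFile(buffer, "r") as zf:
        return {
            info.filename: bool(info.flag_bits & 0x800)
            for info in zf.infolist()
        }
```

With CPython, a non-ASCII name such as `caf\u00e9.txt` should come back with the flag set, while a pure ASCII name is written without it.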
Cross-Platform Safe Filenames
If you need filenames that work everywhere, follow these rules:
import re
import unicodedata

def safe_filename(name: str) -> str:
    # Normalize to NFC
    name = unicodedata.normalize("NFC", name)
    # Replace characters forbidden on Windows
    name = re.sub(r'[<>:"/\\|?*]', "_", name)
    # Remove control characters
    name = re.sub(r"[\x00-\x1f\x7f]", "", name)
    # Ensure the UTF-8 byte length is within limits
    encoded = name.encode("utf-8")
    if len(encoded) > 240:  # Leave margin below 255
        name = encoded[:240].decode("utf-8", errors="ignore")
    # Strip trailing dots and spaces (Windows issue); done after
    # truncation, which can expose a new trailing dot or space
    name = name.rstrip(". ")
    # Avoid Windows reserved names
    reserved = {"CON", "PRN", "AUX", "NUL"}
    reserved.update(f"COM{i}" for i in range(1, 10))
    reserved.update(f"LPT{i}" for i in range(1, 10))
    stem = name.split(".")[0].upper()
    if stem in reserved:
        name = f"_{name}"
    return name or "unnamed"
Best Practices Summary
| Practice | Why |
|---|---|
| Normalize to NFC on creation | Prevents NFC/NFD mismatch across platforms |
| Normalize to NFC on comparison | Ensures caf\u00e9 (NFC) matches cafe\u0301 (NFD) regardless of source |
| Limit to ASCII in automated pipelines | Avoids all encoding/normalization issues |
| Set core.precomposeunicode true in Git | Consistent filenames across macOS and Linux |
| Check byte length, not character length | The 255 limit on ext4/APFS counts UTF-8 bytes, not characters |
| Use UTF-8 everywhere | Consistent encoding across all platforms |
| Test with accented characters and emoji | Catches normalization and 4-byte issues early |