Unicode in Filenames

Modern operating systems support Unicode filenames, but filesystems differ in how they encode and normalize them. That causes problems when files move between platforms, especially between macOS, which has traditionally decomposed names (NFD), and Linux, which stores whatever bytes it is given (usually NFC). This guide explains how Unicode filenames work on Windows, macOS, and Linux, and how to handle them safely in Python, shell scripts, and other tools.

Published 2024-02-05 · Updated 2024-11-19

Unicode filenames are a source of subtle, maddening bugs. A file created on macOS may be invisible to a script on Linux. A ZIP archive from Windows may extract with garbled names on other platforms. Two files that look identical in a directory listing may actually have different byte sequences in their names. This guide explains why, and how to handle Unicode filenames safely across operating systems.

Filesystem Encoding Basics

Every filesystem stores filenames as byte sequences. The question is: what encoding maps those bytes to characters?

Filesystem	OS	Encoding	Normalization
NTFS	Windows	UTF-16LE	None (preserves as-is)
APFS	macOS	UTF-8	None on storage (preserves as-is; lookups are normalization-insensitive)
HFS+	macOS (legacy)	UTF-16	Apple's NFD variant (close to, but not strict, NFD)
ext4	Linux	Bytes (usually UTF-8)	None
Btrfs	Linux	Bytes (usually UTF-8)	None
ZFS	Linux/BSD	Bytes (usually UTF-8)	Optional (via `normalization` property)
FAT32	All	OEM code page + UTF-16 LFN	None
exFAT	All	UTF-16LE	None

The critical distinction is between filesystems that enforce an encoding (NTFS, APFS) and those that treat filenames as opaque byte sequences (ext4, Btrfs).
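
To make "opaque byte sequences" concrete, the following sketch shows how a name that is not valid UTF-8 can still survive a round trip through Python. The helper `roundtrip_name` is hypothetical and exists only for illustration; the `surrogateescape` error handler it relies on is the one Python uses for undecodable filenames on POSIX systems.

```python
def roundtrip_name(raw: bytes) -> tuple[str, bytes]:
    """Decode a raw filename the way Python does on POSIX, then re-encode it."""
    # Undecodable bytes become lone surrogates (U+DC80..U+DCFF) instead of
    # raising an error, so the original bytes can always be recovered.
    text = raw.decode("utf-8", errors="surrogateescape")
    return text, text.encode("utf-8", errors="surrogateescape")


# A Latin-1 encoded name: the byte E9 is not valid UTF-8 in this position
legacy = b"caf\xe9.txt"
text, back = roundtrip_name(legacy)
print(ascii(text))     # the invalid byte appears as an escaped lone surrogate
print(back == legacy)  # the original bytes come back unchanged
```

On ext4 or Btrfs such a name is perfectly legal, which is why tools that assume every filename is valid UTF-8 can fail on it.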

The NFC vs NFD Problem

The most common Unicode filename issue involves normalization forms. The character "é" (e with acute accent) can be represented in two ways:

Form	Code Points	Bytes (UTF-8)	Description
NFC (Composed)	U+00E9	`C3 A9` (2 bytes)	Single precomposed character
NFD (Decomposed)	U+0065 U+0301	`65 CC 81` (3 bytes)	Base letter + combining accent

Both look identical on screen, but they are different byte sequences. This matters because:

macOS (APFS / HFS+)

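On HFS+ (legacy macOS), filenames are converted to a decomposed form when they are stored. When you create a file named caf\u00e9.txt (NFC, 2 bytes for é), HFS+ stores it as cafe\u0301.txt (decomposed, 3 bytes for e + combining accent). Apple's decomposition is close to NFD but not strict NFD: it is based on an old, fixed version of the Unicode tables and leaves certain character ranges composed. The practical impact is small but can cause issues with edge cases.

On APFS (macOS 10.13+), the filesystem keeps the name in the form you gave it but treats the NFC and NFD spellings as the same name when looking files up. In other words, it is normalization-preserving and normalization-insensitive. Decomposed names still turn up on APFS volumes, because files migrated from HFS+ keep their decomposed names and some higher-level macOS APIs can produce decomposed names.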

Linux (ext4)

Linux stores whatever bytes you give it. If you create caf\u00e9.txt (NFC), it stays NFC. If you create cafe\u0301.txt (NFD), it stays NFD. Linux does not normalize, so both names can exist side by side in the same directory.
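
A quick way to observe this is to create both spellings in a scratch directory and count the entries. This is a minimal sketch that assumes a UTF-8 locale; `count_cafe_files` is a hypothetical helper name.

```python
import os
import tempfile
import unicodedata


def count_cafe_files() -> int:
    """Create NFC and NFD spellings of one name; return how many files exist."""
    nfc_name = unicodedata.normalize("NFC", "cafe\u0301.txt")  # caf\u00e9.txt
    nfd_name = unicodedata.normalize("NFD", nfc_name)          # cafe\u0301.txt
    with tempfile.TemporaryDirectory() as tmp:
        for name in (nfc_name, nfd_name):
            with open(os.path.join(tmp, name), "w", encoding="utf-8") as fh:
                fh.write(name)
        return len(os.listdir(tmp))


print(count_cafe_files())
```

On a filesystem that does not normalize, such as ext4, this should report two files. On a normalization-insensitive filesystem the second `open` reuses the first file, so only one entry should remain.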

Windows (NTFS)

NTFS preserves filenames exactly as created, with no normalization. It stores filenames as UTF-16LE.

The Cross-Platform Problem

Consider this scenario:

On macOS, you create café.txt, and the name ends up stored in NFD form (cafe\u0301.txt).
You push it to a Git repository.
On Linux, you clone the repo. The filename is still in NFD form: cafe\u0301.txt.
A Python script calls os.path.exists("caf\u00e9.txt"), using the NFC spelling. This returns False on Linux because the bytes do not match.

The same file, the same visible name, but different bytes -- and the comparison fails.

Detecting Normalization Issues

Python

import unicodedata
import os

filename = "cafe\u0301.txt"

# Compute both normalization forms of the name for comparison
nfc = unicodedata.normalize("NFC", filename)
nfd = unicodedata.normalize("NFD", filename)

print(f"Original bytes: {filename.encode('utf-8').hex()}")
print(f"NFC bytes:      {nfc.encode('utf-8').hex()}")
print(f"NFD bytes:      {nfd.encode('utf-8').hex()}")
print(f"NFC == NFD: {nfc == nfd}")  # False!
print(f"Length NFC: {len(nfc)}, Length NFD: {len(nfd)}")

# Safe file existence check: normalize before comparing
def safe_exists(path: str) -> bool:
    directory = os.path.dirname(path) or "."
    target = unicodedata.normalize("NFC", os.path.basename(path))
    for entry in os.listdir(directory):
        if unicodedata.normalize("NFC", entry) == target:
            return True
    return False

Command Line

# Show the actual bytes of a filename
# (glob on "caf*" so that both the NFC and the NFD spelling match)
ls caf*.txt | xxd | head

# Python one-liner to check normalization
python3 -c "
import os, unicodedata
for f in os.listdir('.'):
    nfc = unicodedata.normalize('NFC', f)
    if f != nfc:
        print('Not NFC:', ascii(f))
"

Git and Unicode Filenames

Git has a configuration option that controls how it handles Unicode normalization in filenames:

# Check current setting
git config core.precomposeunicode

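# macOS: set to true (the default in current Git versions on macOS)
# This tells Git to convert the decomposed names reported by macOS back to
# precomposed (NFC) form, so the repository records NFC filenames
git config --global core.precomposeunicode true

# Linux: the option has no effect (Git preserves whatever is in the repo)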

The Git Normalization Problem

If a macOS user commits a file with an NFD filename and a Linux user clones the repo, the Linux user gets the NFD filename. This can cause:

Build scripts that reference the NFC filename to fail
import statements in Python to fail if the module filename is NFD
Duplicate-looking files if someone creates a new file with the NFC name
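
The duplicate-looking case in particular is easy to miss in a directory listing. One possible check, sketched here with a hypothetical helper `find_normalization_collisions`, groups names that differ as stored but become equal after NFC normalization.

```python
import unicodedata
from collections import defaultdict


def find_normalization_collisions(names: list[str]) -> list[list[str]]:
    """Group names that differ as stored but are equal after NFC normalization."""
    groups: dict[str, list[str]] = defaultdict(list)
    for name in names:
        groups[unicodedata.normalize("NFC", name)].append(name)
    # Keep only groups that contain more than one distinct spelling
    return [group for group in groups.values() if len(set(group)) > 1]


listing = ["caf\u00e9.txt", "cafe\u0301.txt", "readme.md"]
for group in find_normalization_collisions(listing):
    print([ascii(name) for name in group])
```

In practice the list would come from `os.listdir()` or from the output of `git ls-files`; the literal list here only keeps the example self-contained.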

Prevention

Normalize filenames before committing:

# Install convmv (filename encoding converter)
sudo apt install convmv   # Debian/Ubuntu
brew install convmv        # macOS

# Preview NFD-to-NFC conversion (dry run is the default)
convmv -r -f utf-8 -t utf-8 --nfc .

# Actually rename
convmv -r -f utf-8 -t utf-8 --nfc --notest .
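
Where convmv is not available, the same renaming can be approximated in Python. The following is a sketch, not a replacement for convmv: `rename_to_nfc` is a hypothetical helper, it assumes a normalization-sensitive filesystem such as ext4, and it skips any entry whose NFC name already exists rather than overwriting it.

```python
import os
import unicodedata


def rename_to_nfc(root: str, dry_run: bool = True) -> list[tuple[str, str]]:
    """Rename entries under root whose names are not NFC; return (old, new) pairs."""
    planned = []
    # Walk bottom-up so children are handled before their parent directories
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in dirnames + filenames:
            nfc = unicodedata.normalize("NFC", name)
            if nfc == name:
                continue
            old = os.path.join(dirpath, name)
            new = os.path.join(dirpath, nfc)
            if os.path.lexists(new):
                continue  # an entry with the NFC name already exists; skip
            planned.append((old, new))
            if not dry_run:
                os.rename(old, new)
    return planned
```

Calling `rename_to_nfc(".")` only reports what would change, mirroring the dry-run-first workflow above; passing `dry_run=False` performs the renames.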

Platform-Specific Filename Restrictions

Beyond encoding, each OS has different rules for valid characters in filenames:

Windows (NTFS)

Restriction	Details
Forbidden characters	`< > : " / \ | ? *` and control characters (0-31)
Forbidden names	CON, PRN, AUX, NUL, COM1-9, LPT1-9
Trailing dots/spaces	Silently stripped
Max path length	260 characters (default), 32,767 with long path enabled
Case	Preserving but insensitive

macOS (APFS)

Restriction	Details
Forbidden characters	`/` and NUL at the POSIX layer; `:` in Finder (Finder displays a stored `:` as `/`)
Max filename length	255 UTF-8 bytes
Max path length	1024 bytes
Case	Preserving, insensitive by default (configurable)

Linux (ext4)

Restriction	Details
Forbidden characters	`/` and NUL only
Max filename length	255 bytes
Max path length	4096 bytes
Case	Sensitive

The 255-Byte Limit

The 255-byte limit on ext4 and APFS is measured in bytes, not characters. A filename using only ASCII characters can be 255 characters long. A filename using 4-byte emoji can only be 63 characters long before hitting the byte limit.

# Check whether a filename fits within the 255-byte limit of ext4/APFS
def is_filename_safe(name: str) -> bool:
    encoded = name.encode("utf-8")
    return len(encoded) <= 255
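
When a name is too long, it has to be shortened by bytes without cutting a multi-byte character in half. The sketch below shows one way to do that while keeping the extension; `truncate_to_bytes` is a hypothetical helper, and it works on code points, so it can still separate a base letter from a following combining accent.

```python
import os


def truncate_to_bytes(name: str, limit: int = 255) -> str:
    """Shorten name to at most `limit` UTF-8 bytes, keeping the extension."""
    if len(name.encode("utf-8")) <= limit:
        return name
    stem, ext = os.path.splitext(name)
    budget = limit - len(ext.encode("utf-8"))
    if budget <= 0:
        raise ValueError("extension alone exceeds the byte limit")
    # errors="ignore" drops a multi-byte character that was cut at the boundary
    stem = stem.encode("utf-8")[:budget].decode("utf-8", errors="ignore")
    return stem + ext


long_name = "\U0001F600" * 70 + ".txt"  # 70 four-byte emoji plus an extension
short = truncate_to_bytes(long_name)
print(len(short), len(short.encode("utf-8")))
```

For this input the result keeps 62 emoji plus the extension, which is 252 bytes: the last three bytes of the stem budget cannot hold a whole emoji and are dropped.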

ZIP Archives and Unicode

ZIP archives have a long history of encoding problems because the original ZIP format (1989) predates Unicode adoption. Filename encoding depends on the ZIP creator:

Creator	Filename Encoding
Windows Explorer (modern)	UTF-8 with language encoding flag (bit 11)
Windows Explorer (old)	OEM code page (CP437 or system locale)
macOS Archive Utility	UTF-8
Info-ZIP (Linux)	UTF-8 (usually)
7-Zip	UTF-8 with flag
WinRAR	UTF-8 with flag
Java `ZipOutputStream`	UTF-8 if explicitly set, otherwise platform default

Handling ZIP Encoding in Python

import zipfile

# Reading: zipfile decodes names as UTF-8 when the UTF-8 flag (bit 11) is set
# and as CP437 otherwise
with zipfile.ZipFile("archive.zip", "r") as zf:
    for info in zf.infolist():
        # Check if UTF-8 flag is set
        if info.flag_bits & 0x800:
            # Filename is UTF-8
            print(f"UTF-8: {info.filename}")
        else:
            # No flag: zipfile decoded the name as CP437, which may be wrong.
            # Recover the raw bytes, then try UTF-8 and fall back to a guess.
            raw = info.filename.encode("cp437")
            try:
                name = raw.decode("utf-8")
            except UnicodeDecodeError:
                # CP949 (Korean) is only an example; use the code page of
                # the system that created the archive
                name = raw.decode("cp949", errors="replace")
            print(f"Recoded: {name}")
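
The writing side can be checked in the same way. As a small sketch (the helper name `utf8_flags` is hypothetical), the code below writes a few names into an in-memory archive with Python's `zipfile` and reports which entries carry the UTF-8 flag when read back.

```python
import io
import zipfile


def utf8_flags(names: list[str]) -> dict[str, bool]:
    """Write names into an in-memory ZIP and report which got the UTF-8 flag."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        for name in names:
            zf.writestr(name, b"example")
    with zipfile.ZipFile(buffer, "r") as zf:
        return {
            info.filename: bool(info.flag_bits & 0x800)
            for info in zf.infolist()
        }


print(utf8_flags(["plain.txt", "caf\u00e9.txt"]))
```

Pure ASCII names are stored without the flag, since ASCII reads the same in CP437 and UTF-8, while names with non-ASCII characters are written as UTF-8 with bit 11 set.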

Cross-Platform Safe Filenames

If you need filenames that work everywhere, follow these rules:

import re
import unicodedata

def safe_filename(name: str) -> str:
    # Normalize to NFC
    name = unicodedata.normalize("NFC", name)
    # Remove characters forbidden on Windows
    name = re.sub(r'[<>:"/\\|?*]', "_", name)
    # Remove control characters
    name = re.sub(r"[\x00-\x1f\x7f]", "", name)
    # Strip trailing dots and spaces (Windows issue); leading dots are kept
    name = name.rstrip(". ")
    # Avoid Windows reserved names
    reserved = {"CON", "PRN", "AUX", "NUL"}
    reserved.update(f"COM{i}" for i in range(1, 10))
    reserved.update(f"LPT{i}" for i in range(1, 10))
    stem = name.split(".")[0].upper()
    if stem in reserved:
        name = f"_{name}"
    # Ensure the UTF-8 byte length is within limits
    encoded = name.encode("utf-8")
    if len(encoded) > 240:  # Leave margin below 255
        name = encoded[:240].decode("utf-8", errors="ignore")
    return name or "unnamed"

Best Practices Summary

Practice	Why
Normalize to NFC on creation	Prevents NFC/NFD mismatch across platforms
Normalize to NFC on comparison	Ensures `caf\u00e9` matches `cafe\u0301` regardless of source
Limit to ASCII in automated pipelines	Avoids all encoding/normalization issues
Set `core.precomposeunicode true` in Git on macOS	Consistent filenames across macOS and Linux
Check byte length, not character length	The 255 limit on ext4/APFS counts bytes, not characters
Use UTF-8 everywhere	Consistent encoding across all platforms
Test with accented characters and emoji	Catches normalization and 4-byte issues early
