🖥️ Platform Guides

Unicode in Data Science and NLP

Natural language processing and data science pipelines frequently encounter Unicode issues including encoding errors, normalization mismatches, invisible characters, and language detection challenges. This guide addresses Unicode challenges specific to data science and NLP, covering pandas, text preprocessing, tokenization, and multilingual datasets.


Data science workflows routinely process text from diverse sources — web scraping, user-generated content, multilingual datasets, social media feeds, and API responses. Unicode issues are among the most common sources of bugs and data quality problems in these pipelines. This guide covers how to handle Unicode correctly in pandas, NLP tokenization, web scraping, and data cleaning — the core operations of any data science workflow.

Unicode in pandas

pandas is the workhorse of data manipulation in Python. Its string handling has improved significantly, but Unicode pitfalls remain.

Reading data with correct encoding

The most common Unicode problem in pandas is reading CSV files with the wrong encoding:

import pandas as pd

# Default: assumes UTF-8
df = pd.read_csv("data.csv")

# Specify encoding explicitly
df = pd.read_csv("data.csv", encoding="utf-8")

# Common alternatives for non-UTF-8 files
df = pd.read_csv("data.csv", encoding="latin-1")      # ISO-8859-1
df = pd.read_csv("data.csv", encoding="cp1252")        # Windows-1252
df = pd.read_csv("data.csv", encoding="shift_jis")     # Japanese
df = pd.read_csv("data.csv", encoding="euc-kr")        # Korean
df = pd.read_csv("data.csv", encoding="gb2312")        # Chinese (Simplified)

Detecting encoding

When you do not know the encoding, use chardet or charset-normalizer:

import chardet

with open("data.csv", "rb") as f:
    raw = f.read(10000)
    result = chardet.detect(raw)
    print(result)
    # {'encoding': 'Shift_JIS', 'confidence': 0.99, 'language': 'Japanese'}

df = pd.read_csv("data.csv", encoding=result["encoding"])

String operations and Unicode

pandas string methods (.str accessor) work on Unicode strings, but some operations have Unicode-specific behavior:

| Operation | Unicode Consideration |
| --- | --- |
| str.len() | Counts code points, not grapheme clusters |
| str.upper() | Unicode-aware (handles accented characters) |
| str.lower() | Unicode-aware, but locale-dependent for some scripts |
| str.contains() | Regex is Unicode-aware by default in Python 3 |
| str.replace() | Operates on code points |
| str.normalize() | Applies NFC/NFD/NFKC/NFKD normalization |
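The code-point behavior above is easy to reproduce with plain Python strings — the same values pandas stores — in a minimal sketch:

```python
import unicodedata

# Two visually identical strings with different code-point counts
s_nfd = "cafe\u0301"                         # 'e' + U+0301 combining acute: 5 code points
s_nfc = unicodedata.normalize("NFC", s_nfd)  # precomposed 'é': 4 code points

print(len(s_nfd), len(s_nfc))  # 5 4
print(s_nfd == s_nfc)          # False until both sides are normalized
```

pandas `.str.len()` counts exactly the same way, which is why normalizing before comparing or deduplicating matters.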

Normalization in pandas

Unicode normalization is essential for consistent text comparison:

import unicodedata

# Normalize a column to NFC
df["name_normalized"] = df["name"].str.normalize("NFC")

# Or using a lambda for more control
df["clean_name"] = df["name"].apply(
    lambda x: unicodedata.normalize("NFKC", x) if isinstance(x, str) else x
)

Common normalization scenarios:

| Original | NFC | NFKC | Issue |
| --- | --- | --- | --- |
| e + combining accent (e U+0301) | é (precomposed) | é (precomposed) | Combining vs. precomposed |
| Ａ (fullwidth A) | Ａ | A | Fullwidth compatibility |
| ﬁ (ligature) | ﬁ | fi (two characters) | Ligature decomposition |
| ² (superscript 2) | ² | 2 | Compatibility decomposition |

For data science, NFKC is usually the best choice — it normalizes both canonical and compatibility differences, mapping fullwidth characters, ligatures, and superscripts to their standard equivalents.
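The compatibility foldings NFKC performs can be verified directly with the standard library:

```python
import unicodedata

# NFKC maps compatibility characters to their standard equivalents
print(unicodedata.normalize("NFKC", "Ｕｎｉｃｏｄｅ"))  # fullwidth letters -> Unicode
print(unicodedata.normalize("NFKC", "ﬁle"))             # 'fi' ligature -> file
print(unicodedata.normalize("NFKC", "x²"))              # superscript -> x2
```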

Unicode in NLP and Tokenization

Natural Language Processing (NLP) tasks are deeply affected by Unicode. Tokenization — splitting text into meaningful units — must account for the properties of different scripts.

Whitespace tokenization pitfalls

The simplest tokenizer (splitting on whitespace) fails for many scripts:

| Script | Whitespace Tokenization | Problem |
| --- | --- | --- |
| English | "Hello world" → ["Hello", "world"] | Works |
| Chinese | "我爱数据科学" → one token | No spaces between words |
| Japanese | "自然言語処理" → one token | No spaces between words |
| Thai | "ภาษาไทย" → one token | No spaces between words |
| German | "Donaudampfschifffahrt" → one token | Compound words |
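A two-line sketch makes the failure mode concrete (the Japanese sentence is an illustrative example meaning "natural language processing"):

```python
# str.split() works for space-delimited scripts but not for CJK or Thai
text_en = "Hello world"
text_ja = "自然言語処理"  # no word boundaries in the text itself

print(text_en.split())  # ['Hello', 'world']
print(text_ja.split())  # ['自然言語処理'] — the whole phrase is one "token"
```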

Subword tokenization (BPE, WordPiece, SentencePiece)

Modern NLP models use subword tokenizers that handle Unicode by operating on byte or character sequences:

| Tokenizer | Unicode Handling | Used By |
| --- | --- | --- |
| BPE (Byte-Pair Encoding) | Merges frequent character pairs | Early subword NMT models |
| BBPE (Byte-level BPE) | Operates on raw UTF-8 bytes | GPT-2 onward, RoBERTa |
| WordPiece | Merges frequent character sequences | BERT |
| SentencePiece | Language-agnostic, treats input as raw text | T5, mBART |
| Unigram | Probabilistic subword segmentation | XLNet |

Byte-level BPE (BBPE)

BBPE tokenizers (used by GPT models) treat input as a sequence of UTF-8 bytes rather than Unicode characters. This has important implications:

| Character | UTF-8 Bytes | BBPE Tokens |
| --- | --- | --- |
| A | 0x41 | 1 token |
| é | 0xC3 0xA9 | 1-2 tokens |
| 中 | 0xE4 0xB8 0xAD | 1-3 tokens |
| 😀 | 0xF0 0x9F 0x98 0x80 | 2-4 tokens |

This means non-ASCII characters consume more tokens than ASCII characters, which has cost implications for API usage and context window limits.
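The UTF-8 byte counts in the table can be checked directly; the byte length is a lower bound on what a byte-level tokenizer must encode (actual token counts depend on the tokenizer's learned merges):

```python
# UTF-8 byte length per character — a lower bound on BBPE token cost
for ch in ["A", "é", "中", "😀"]:
    print(f"{ch!r}: {len(ch.encode('utf-8'))} byte(s)")
```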

Practical tokenization example

# Using tiktoken (OpenAI's tokenizer)
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

text_en = "Hello, world!"
text_ja = "こんにちは、世界!"
text_emoji = "Hello! 😀🎉✨"

print(f"English: {len(enc.encode(text_en))} tokens")
print(f"Japanese: {len(enc.encode(text_ja))} tokens")
print(f"Emoji: {len(enc.encode(text_emoji))} tokens")
# English uses fewer tokens per character than CJK or emoji

Unicode in Web Scraping

Web scraping is one of the most common data collection methods in data science, and encoding issues are pervasive.

Detecting page encoding

Web pages declare their encoding in multiple places (in priority order):

| Source | Example | Priority |
| --- | --- | --- |
| BOM | EF BB BF at start of stream | Highest in modern browsers (per WHATWG, the BOM overrides even the HTTP header) |
| HTTP Content-Type header | Content-Type: text/html; charset=utf-8 | Next |
| HTML meta tag | <meta charset="UTF-8"> | After the header |
| XML declaration | <?xml version="1.0" encoding="UTF-8"?> | For XHTML |
| Auto-detection | chardet / charset-normalizer | Last resort |

Handling encoding with requests and BeautifulSoup

import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com")

# requests guesses encoding from HTTP headers
print(response.encoding)  # e.g., 'utf-8' or 'ISO-8859-1'

# apparent_encoding uses chardet for detection
print(response.apparent_encoding)  # e.g., 'utf-8'

# Fix encoding if requests guessed wrong
response.encoding = response.apparent_encoding

# BeautifulSoup handles encoding automatically
soup = BeautifulSoup(response.content, "html.parser")

Common scraping encoding issues

| Problem | Cause | Solution |
| --- | --- | --- |
| Mojibake (garbled text) | Wrong encoding assumed | Use apparent_encoding or detect with chardet |
| HTML entities not decoded | Using response.text as-is | Parse with BeautifulSoup |
| Mixed encodings on one page | Legacy pages with inconsistent encoding | Process sections separately |
| Emoji missing | Older encoding (ISO-8859-1) used | Ensure UTF-8 decoding |
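Mojibake has a precise mechanical cause, which a minimal round-trip illustrates — and when the wrong codec is known, the damage is reversible:

```python
# Classic mojibake: UTF-8 bytes mis-decoded as Latin-1
garbled = "café".encode("utf-8").decode("latin-1")
print(garbled)  # cafÃ©

# Re-encode with the wrong codec, then decode correctly
fixed = garbled.encode("latin-1").decode("utf-8")
print(fixed)    # café
```

This round-trip only works when no bytes were lost in the mis-decoding; for messier cases, detection libraries are the safer route.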

Data Cleaning for Unicode

Removing unwanted Unicode characters

import re
import unicodedata

def clean_text(text: str) -> str:
    # Remove control characters (except newline, tab)
    text = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\x9f]", "", text)

    # Remove zero-width characters
    text = text.replace("\u200b", "")  # zero-width space
    text = text.replace("\u200c", "")  # zero-width non-joiner
    text = text.replace("\u200d", "")  # zero-width joiner
    text = text.replace("\ufeff", "")  # BOM / zero-width no-break space

    # Normalize whitespace (various Unicode spaces to regular space)
    text = re.sub(r"[\u00a0\u2000-\u200a\u2028\u2029\u202f\u205f\u3000]", " ", text)

    # Normalize Unicode to NFKC
    text = unicodedata.normalize("NFKC", text)

    return text.strip()
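The zero-width characters this function strips are invisible in output but break equality checks and deduplication, which a short sketch demonstrates:

```python
# Zero-width characters are invisible but change the string
a = "data"
b = "da\u200bta"  # contains a zero-width space

print(a == b)                        # False, though both render as 'data'
print(len(a), len(b))                # 4 5
print(b.replace("\u200b", "") == a)  # True once cleaned
```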

Unicode category filtering

Python's unicodedata.category() returns a two-letter category for any character:

| Category | Description | Example |
| --- | --- | --- |
| Lu | Uppercase letter | A, B, C |
| Ll | Lowercase letter | a, b, c |
| Lo | Other letter | Chinese, Japanese, Arabic letters |
| Nd | Decimal digit | 0-9 |
| Zs | Space separator | Space, NBSP |
| Cc | Control character | Tab, null |
| So | Other symbol | Emoji, dingbats |
| Mn | Nonspacing mark | Combining accents |

import unicodedata

def keep_letters_and_digits(text: str) -> str:
    return "".join(
        ch for ch in text
        if unicodedata.category(ch)[0] in ("L", "N", "Z")
    )
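Applied to a mixed string, the filter behaves like this (the function is redefined here so the snippet is self-contained):

```python
import unicodedata

# Keep letters (L*), numbers (N*), and separators (Z*); drop everything else
def keep_letters_and_digits(text: str) -> str:
    return "".join(
        ch for ch in text
        if unicodedata.category(ch)[0] in ("L", "N", "Z")
    )

print(keep_letters_and_digits("Hello, world! 123 😀"))  # punctuation and emoji dropped
```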

Encoding-Aware Data Pipelines

Best practices for data pipelines

  1. Declare encoding at every boundary: Every file read, API call, and database query should explicitly specify UTF-8.
  2. Normalize early: Apply NFKC normalization as the first step after reading text.
  3. Validate encoding: After reading, check for mojibake indicators (e.g., sequences like \xc3\xa9 appearing as literal text instead of being decoded).
  4. Store as UTF-8: Use UTF-8 for all intermediate files (CSV with BOM, Parquet, JSON).
  5. Log encoding metadata: Record the detected encoding of source files so you can debug issues later.
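Best practice 3 can be sketched as a simple heuristic — `looks_like_mojibake` is a hypothetical helper, not a library function, and the suspect strings are what common accented characters become when UTF-8 bytes are mis-decoded as Latin-1/Windows-1252:

```python
# Hypothetical validation helper: flag text that looks mis-decoded
def looks_like_mojibake(text: str) -> bool:
    suspects = ("Ã©", "Ã¨", "Ã§", "â€™")
    return any(s in text for s in suspects)

print(looks_like_mojibake("café"))                                    # False
print(looks_like_mojibake("café".encode("utf-8").decode("latin-1")))  # True
```

A fixed suspect list is crude; in practice a detection library covers far more cases, but a check like this catches the most common UTF-8-as-Latin-1 mistake early in a pipeline.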

File format comparison for Unicode

| Format | Encoding | Unicode Support | Notes |
| --- | --- | --- | --- |
| CSV | User-defined | Fragile (no standard) | Use UTF-8 with BOM |
| Parquet | UTF-8 (built-in) | Excellent | Recommended for pipelines |
| JSON | UTF-8 (RFC 8259) | Excellent | Good for semi-structured data |
| HDF5 | UTF-8 | Good | For numerical + text |
| SQLite | UTF-8 or UTF-16 | Excellent | Embedded database |
| Feather | UTF-8 (Arrow) | Excellent | Fast columnar format |

Parquet is the recommended format for data science pipelines because it enforces UTF-8 encoding, compresses well, and supports efficient columnar operations.

Key Takeaways

  • Always specify encoding explicitly when reading files in pandas — the default UTF-8 assumption raises UnicodeDecodeError or misreads legacy files.
  • Apply NFKC normalization early in your pipeline to ensure consistent string comparison and deduplication.
  • Modern NLP tokenizers (BPE, BBPE) operate on bytes, so non-ASCII characters consume more tokens than ASCII — this affects API costs and context windows.
  • For web scraping, use response.apparent_encoding or chardet to detect encoding, and parse HTML with BeautifulSoup to handle entities and encoding automatically.
  • Store intermediate data in Parquet (not CSV) for reliable Unicode handling in data pipelines.
