🖥️ Platform Guides

Unicode in Data Science and NLP

Natural language processing and data science pipelines frequently encounter Unicode issues including encoding errors, normalization mismatches, invisible characters, and language detection challenges. This guide addresses Unicode challenges specific to data science and NLP, covering pandas, text preprocessing, tokenization, and multilingual datasets.


Data science workflows routinely process text from diverse sources — web scraping, user-generated content, multilingual datasets, social media feeds, and API responses. Unicode issues are among the most common sources of bugs and data quality problems in these pipelines. This guide covers how to handle Unicode correctly in pandas, NLP tokenization, web scraping, and data cleaning — the core operations of any data science workflow.

Unicode in pandas

pandas is the workhorse of data manipulation in Python. Its string handling has improved significantly, but Unicode pitfalls remain.

Reading data with correct encoding

The most common Unicode problem in pandas is reading CSV files with the wrong encoding:

import pandas as pd

# Default: assumes UTF-8
df = pd.read_csv("data.csv")

# Specify encoding explicitly
df = pd.read_csv("data.csv", encoding="utf-8")

# Common alternatives for non-UTF-8 files
df = pd.read_csv("data.csv", encoding="latin-1")      # ISO-8859-1
df = pd.read_csv("data.csv", encoding="cp1252")        # Windows-1252
df = pd.read_csv("data.csv", encoding="shift_jis")     # Japanese
df = pd.read_csv("data.csv", encoding="euc-kr")        # Korean
df = pd.read_csv("data.csv", encoding="gb2312")        # Chinese (Simplified)

Detecting encoding

When you do not know the encoding, use chardet or charset-normalizer:

import chardet

with open("data.csv", "rb") as f:
    raw = f.read(10000)
    result = chardet.detect(raw)
    print(result)
    # {'encoding': 'Shift_JIS', 'confidence': 0.99, 'language': 'Japanese'}

df = pd.read_csv("data.csv", encoding=result["encoding"])

String operations and Unicode

pandas string methods (.str accessor) work on Unicode strings, but some operations have Unicode-specific behavior:

| Operation | Unicode Consideration |
| --- | --- |
| str.len() | Counts code points, not grapheme clusters |
| str.upper() | Unicode-aware (handles accented characters) |
| str.lower() | Unicode-aware, but locale-dependent for some scripts |
| str.contains() | Regex is Unicode-aware by default in Python 3 |
| str.replace() | Operates on code points |
| str.normalize() | Applies NFC/NFD/NFKC/NFKD normalization |
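The code-point behavior above is easy to reproduce with plain Python strings — the same values pandas stores — in a minimal sketch:

```python
import unicodedata

# Two visually identical strings with different code-point counts
s_nfd = "cafe\u0301"                         # 'e' + U+0301 combining acute: 5 code points
s_nfc = unicodedata.normalize("NFC", s_nfd)  # precomposed 'é': 4 code points

print(len(s_nfd), len(s_nfc))  # 5 4
print(s_nfd == s_nfc)          # False until both sides are normalized
```

pandas `.str.len()` counts exactly the same way, which is why normalizing before comparing or deduplicating matters.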

Normalization in pandas

Unicode normalization is essential for consistent text comparison:

import unicodedata

# Normalize a column to NFC
df["name_normalized"] = df["name"].str.normalize("NFC")

# Or using a lambda for more control
df["clean_name"] = df["name"].apply(
    lambda x: unicodedata.normalize("NFKC", x) if isinstance(x, str) else x
)

Common normalization scenarios:

| Original | NFC | NFKC | Issue |
| --- | --- | --- | --- |
| e + combining accent (e U+0301) | é (precomposed) | é (precomposed) | Combining vs. precomposed |
| Ａ (fullwidth A) | Ａ | A | Fullwidth compatibility |
| ﬁ (ligature) | ﬁ | fi (two characters) | Ligature decomposition |
| ² (superscript 2) | ² | 2 | Compatibility decomposition |

For data science, NFKC is usually the best choice — it normalizes both canonical and compatibility differences, mapping fullwidth characters, ligatures, and superscripts to their standard equivalents.
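The compatibility foldings NFKC performs can be verified directly with the standard library:

```python
import unicodedata

# NFKC maps compatibility characters to their standard equivalents
print(unicodedata.normalize("NFKC", "Ｕｎｉｃｏｄｅ"))  # fullwidth letters -> Unicode
print(unicodedata.normalize("NFKC", "ﬁle"))             # 'fi' ligature -> file
print(unicodedata.normalize("NFKC", "x²"))              # superscript -> x2
```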

Unicode in NLP and Tokenization

Natural Language Processing (NLP) tasks are deeply affected by Unicode. Tokenization — splitting text into meaningful units — must account for the properties of different scripts.

Whitespace tokenization pitfalls

The simplest tokenizer (splitting on whitespace) fails for many scripts:

| Script | Whitespace Tokenization | Problem |
| --- | --- | --- |
| English | "Hello world" → ["Hello", "world"] | Works |
| Chinese | "我爱数据科学" → one token | No spaces between words |
| Japanese | "自然言語処理" → one token | No spaces between words |
| Thai | "ภาษาไทย" → one token | No spaces between words |
| German | "Donaudampfschifffahrt" → one token | Compound words |
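A two-line sketch makes the failure mode concrete (the Japanese sentence is an illustrative example meaning "natural language processing"):

```python
# str.split() works for space-delimited scripts but not for CJK or Thai
text_en = "Hello world"
text_ja = "自然言語処理"  # no word boundaries in the text itself

print(text_en.split())  # ['Hello', 'world']
print(text_ja.split())  # ['自然言語処理'] — the whole phrase is one "token"
```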

Subword tokenization (BPE, WordPiece, SentencePiece)

Modern NLP models use subword tokenizers that handle Unicode by operating on byte or character sequences:

| Tokenizer | Unicode Handling | Used By |
| --- | --- | --- |
| BPE (Byte-Pair Encoding) | Merges frequent character pairs | Early subword NMT models |
| BBPE (Byte-level BPE) | Operates on raw UTF-8 bytes | GPT-2 onward, RoBERTa |
| WordPiece | Merges frequent character sequences | BERT |
| SentencePiece | Language-agnostic, treats input as raw text | T5, mBART |
| Unigram | Probabilistic subword segmentation | XLNet |

Byte-level BPE (BBPE)

BBPE tokenizers (used by GPT models) treat input as a sequence of UTF-8 bytes rather than Unicode characters. This has important implications:

| Character | UTF-8 Bytes | BBPE Tokens |
| --- | --- | --- |
| A | 0x41 | 1 token |
| é | 0xC3 0xA9 | 1-2 tokens |
| 中 | 0xE4 0xB8 0xAD | 1-3 tokens |
| 😀 | 0xF0 0x9F 0x98 0x80 | 2-4 tokens |

This means non-ASCII characters consume more tokens than ASCII characters, which has cost implications for API usage and context window limits.
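The UTF-8 byte counts in the table can be checked directly; the byte length is a lower bound on what a byte-level tokenizer must encode (actual token counts depend on the tokenizer's learned merges):

```python
# UTF-8 byte length per character — a lower bound on BBPE token cost
for ch in ["A", "é", "中", "😀"]:
    print(f"{ch!r}: {len(ch.encode('utf-8'))} byte(s)")
```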

Practical tokenization example

# Using tiktoken (OpenAI's tokenizer)
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

text_en = "Hello, world!"
text_ja = "こんにちは、世界!"
text_emoji = "Hello! 😀🎉✨"

print(f"English: {len(enc.encode(text_en))} tokens")
print(f"Japanese: {len(enc.encode(text_ja))} tokens")
print(f"Emoji: {len(enc.encode(text_emoji))} tokens")
# English uses fewer tokens per character than CJK or emoji

Unicode in Web Scraping

Web scraping is one of the most common data collection methods in data science, and encoding issues are pervasive.

Detecting page encoding

Web pages declare their encoding in multiple places (in priority order):

| Source | Example | Priority |
| --- | --- | --- |
| BOM | EF BB BF at start of stream | Highest in modern browsers (per WHATWG, the BOM overrides even the HTTP header) |
| HTTP Content-Type header | Content-Type: text/html; charset=utf-8 | Next |
| HTML meta tag | <meta charset="UTF-8"> | After the header |
| XML declaration | <?xml version="1.0" encoding="UTF-8"?> | For XHTML |
| Auto-detection | chardet / charset-normalizer | Last resort |

Handling encoding with requests and BeautifulSoup

import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com")

# requests guesses encoding from HTTP headers
print(response.encoding)  # e.g., 'utf-8' or 'ISO-8859-1'

# apparent_encoding uses chardet for detection
print(response.apparent_encoding)  # e.g., 'utf-8'

# Fix encoding if requests guessed wrong
response.encoding = response.apparent_encoding

# BeautifulSoup handles encoding automatically
soup = BeautifulSoup(response.content, "html.parser")

Common scraping encoding issues

| Problem | Cause | Solution |
| --- | --- | --- |
| Mojibake (garbled text) | Wrong encoding assumed | Use apparent_encoding or detect with chardet |
| HTML entities not decoded | Using response.text as-is | Parse with BeautifulSoup |
| Mixed encodings on one page | Legacy pages with inconsistent encoding | Process sections separately |
| Emoji missing | Older encoding (ISO-8859-1) used | Ensure UTF-8 decoding |
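Mojibake has a precise mechanical cause, which a minimal round-trip illustrates — and when the wrong codec is known, the damage is reversible:

```python
# Classic mojibake: UTF-8 bytes mis-decoded as Latin-1
garbled = "café".encode("utf-8").decode("latin-1")
print(garbled)  # cafÃ©

# Re-encode with the wrong codec, then decode correctly
fixed = garbled.encode("latin-1").decode("utf-8")
print(fixed)    # café
```

This round-trip only works when no bytes were lost in the mis-decoding; for messier cases, detection libraries are the safer route.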

Data Cleaning for Unicode

Removing unwanted Unicode characters

import re
import unicodedata

def clean_text(text: str) -> str:
    # Remove control characters (except newline, tab)
    text = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\x9f]", "", text)

    # Remove zero-width characters
    text = text.replace("\u200b", "")  # zero-width space
    text = text.replace("\u200c", "")  # zero-width non-joiner
    text = text.replace("\u200d", "")  # zero-width joiner
    text = text.replace("\ufeff", "")  # BOM / zero-width no-break space

    # Normalize whitespace (various Unicode spaces to regular space)
    text = re.sub(r"[\u00a0\u2000-\u200a\u2028\u2029\u202f\u205f\u3000]", " ", text)

    # Normalize Unicode to NFKC
    text = unicodedata.normalize("NFKC", text)

    return text.strip()
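The zero-width characters this function strips are invisible in output but break equality checks and deduplication, which a short sketch demonstrates:

```python
# Zero-width characters are invisible but change the string
a = "data"
b = "da\u200bta"  # contains a zero-width space

print(a == b)                        # False, though both render as 'data'
print(len(a), len(b))                # 4 5
print(b.replace("\u200b", "") == a)  # True once cleaned
```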

Unicode category filtering

Python's unicodedata.category() returns a two-letter category for any character:

| Category | Description | Example |
| --- | --- | --- |
| Lu | Uppercase letter | A, B, C |
| Ll | Lowercase letter | a, b, c |
| Lo | Other letter | Chinese, Japanese, Arabic letters |
| Nd | Decimal digit | 0-9 |
| Zs | Space separator | Space, NBSP |
| Cc | Control character | Tab, null |
| So | Other symbol | Emoji, dingbats |
| Mn | Nonspacing mark | Combining accents |

import unicodedata

def keep_letters_and_digits(text: str) -> str:
    return "".join(
        ch for ch in text
        if unicodedata.category(ch)[0] in ("L", "N", "Z")
    )
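Applied to a mixed string, the filter behaves like this (the function is redefined here so the snippet is self-contained):

```python
import unicodedata

# Keep letters (L*), numbers (N*), and separators (Z*); drop everything else
def keep_letters_and_digits(text: str) -> str:
    return "".join(
        ch for ch in text
        if unicodedata.category(ch)[0] in ("L", "N", "Z")
    )

print(keep_letters_and_digits("Hello, world! 123 😀"))  # punctuation and emoji dropped
```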

Encoding-Aware Data Pipelines

Best practices for data pipelines

  1. Declare encoding at every boundary: Every file read, API call, and database query should explicitly specify UTF-8.
  2. Normalize early: Apply NFKC normalization as the first step after reading text.
  3. Validate encoding: After reading, check for mojibake indicators (e.g., sequences like \xc3\xa9 appearing as literal text instead of being decoded).
  4. Store as UTF-8: Use UTF-8 for all intermediate files (CSV with BOM, Parquet, JSON).
  5. Log encoding metadata: Record the detected encoding of source files so you can debug issues later.
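Best practice 3 can be sketched as a simple heuristic — `looks_like_mojibake` is a hypothetical helper, not a library function, and the suspect strings are what common accented characters become when UTF-8 bytes are mis-decoded as Latin-1/Windows-1252:

```python
# Hypothetical validation helper: flag text that looks mis-decoded
def looks_like_mojibake(text: str) -> bool:
    suspects = ("Ã©", "Ã¨", "Ã§", "â€™")
    return any(s in text for s in suspects)

print(looks_like_mojibake("café"))                                    # False
print(looks_like_mojibake("café".encode("utf-8").decode("latin-1")))  # True
```

A fixed suspect list is crude; in practice a detection library covers far more cases, but a check like this catches the most common UTF-8-as-Latin-1 mistake early in a pipeline.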

File format comparison for Unicode

| Format | Encoding | Unicode Support | Notes |
| --- | --- | --- | --- |
| CSV | User-defined | Fragile (no standard) | Use UTF-8 with BOM |
| Parquet | UTF-8 (built-in) | Excellent | Recommended for pipelines |
| JSON | UTF-8 (RFC 8259) | Excellent | Good for semi-structured data |
| HDF5 | UTF-8 | Good | For numerical + text |
| SQLite | UTF-8 or UTF-16 | Excellent | Embedded database |
| Feather | UTF-8 (Arrow) | Excellent | Fast columnar format |

Parquet is the recommended format for data science pipelines because it enforces UTF-8 encoding, compresses well, and supports efficient columnar operations.

Key Takeaways

  • Always specify encoding explicitly when reading files in pandas — the default UTF-8 assumption raises UnicodeDecodeError or misreads legacy files.
  • Apply NFKC normalization early in your pipeline to ensure consistent string comparison and deduplication.
  • Modern NLP tokenizers (BPE, BBPE) operate on bytes, so non-ASCII characters consume more tokens than ASCII — this affects API costs and context windows.
  • For web scraping, use response.apparent_encoding or chardet to detect encoding, and parse HTML with BeautifulSoup to handle entities and encoding automatically.
  • Store intermediate data in Parquet (not CSV) for reliable Unicode handling in data pipelines.
