Unicode in Email

Email began as an ASCII-only system. Supporting Unicode in subjects, bodies, and addresses relies on encoding schemes such as Quoted-Printable and Base64, and on the newer Email Address Internationalization (EAI) standards. This guide explains how Unicode works in email, covering MIME encoding, emoji in subjects, and international email addresses.

Email is one of the oldest Internet protocols, and its relationship with Unicode is a story of gradual evolution from pure ASCII to full international support. The original email standards (RFC 822, 1982) assumed all text would be 7-bit ASCII -- no accents, no CJK characters, no emoji. Over four decades, a series of RFCs has layered Unicode support on top of this ASCII foundation, and understanding these layers is essential for anyone building email systems today.

The ASCII Foundation

The original email system, defined by RFC 822 (1982) and later RFC 2822 (2001), was designed for 7-bit ASCII only. This means:

  • Headers (From, To, Subject) could only contain ASCII characters (code points 0-127)
  • Body could only contain lines of ASCII text, each no longer than 998 characters
  • Email addresses could only use ASCII in both the local part and domain

Any byte with the high bit set (values 128-255) was technically illegal in an email message. This was fine for English but completely inadequate for the rest of the world.
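
The constraints above can be checked mechanically. The following is a minimal sketch rather than a full RFC validator: the function name is_7bit_clean is hypothetical, and it assumes lines are separated by CRLF.

```python
MAX_LINE_LENGTH = 998  # per-line limit, not counting the trailing CRLF


def is_7bit_clean(raw_message: bytes) -> bool:
    """Return True if every byte is ASCII and no line exceeds 998 bytes."""
    if not raw_message.isascii():
        return False
    return all(len(line) <= MAX_LINE_LENGTH for line in raw_message.split(b"\r\n"))


ascii_message = b"Subject: Hello\r\n\r\nPlain text body\r\n"
utf8_message = "Subject: Café\r\n\r\nBody\r\n".encode("utf-8")

print(is_7bit_clean(ascii_message))  # True
print(is_7bit_clean(utf8_message))   # False: the UTF-8 bytes for é have the high bit set
```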

MIME: The First Unicode Layer (1996)

MIME (Multipurpose Internet Mail Extensions), defined in RFC 2045-2049, added support for non-ASCII content in the email body by introducing:

Content-Type and Charset

Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: base64

The charset parameter tells the email client which encoding to use when interpreting the body bytes. Common values include:

| Charset | Description | Status |
| --- | --- | --- |
| us-ascii | 7-bit ASCII only | Default if not specified |
| iso-8859-1 | Latin-1 (Western Europe) | Legacy, common in old emails |
| windows-1252 | Windows Latin (superset of Latin-1) | Legacy, very common |
| iso-2022-jp | Japanese (7-bit compatible) | Legacy, still used in Japan |
| utf-8 | Full Unicode | Recommended for all new email |
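
To see why the declared charset matters, the short sketch below decodes the same body bytes with the right and the wrong charset. The sample text is only an illustration.

```python
text = "Café"
wire_bytes = text.encode("utf-8")  # what actually travels in the message body

# Decoding with the charset the sender declared recovers the text.
correct = wire_bytes.decode("utf-8")

# Decoding with the wrong charset misreads each UTF-8 byte as one Latin-1 character.
mojibake = wire_bytes.decode("iso-8859-1")

print(correct)   # Café
print(mojibake)  # CafÃ©
```

A client that assumes us-ascii would fail outright on these bytes, because 0xC3 and 0xA9 are outside the 7-bit range.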

Content-Transfer-Encoding

Because SMTP was originally a 7-bit transport, MIME provides encoding schemes to transmit 8-bit content safely:

| Encoding | Description | Overhead | Best For |
| --- | --- | --- | --- |
| 7bit | No encoding (ASCII only) | 0% | Plain ASCII text |
| quoted-printable | Encodes non-ASCII as =XX | Low for mostly-ASCII | European languages |
| base64 | Encodes all bytes as ASCII chars | ~33% | CJK, binary, attachments |
| 8bit | Raw 8-bit (requires SMTP extension) | 0% | Modern servers only |

Quoted-Printable Example

Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: quoted-printable

Caf=C3=A9 is a French word.
Gr=C3=BC=C3=9Fe from Germany.

The sequence =C3=A9 represents the two UTF-8 bytes 0xC3 0xA9, which decode to é (U+00E9). Each non-ASCII byte is encoded as = followed by two hex digits.
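
The same transformation can be reproduced with Python's standard quopri module. This is a small sketch for experimenting with the encoding, not how a mail library would normally be driven.

```python
import quopri

# Decode the first line of the example body, then interpret the bytes as UTF-8.
encoded_line = b"Caf=C3=A9 is a French word."
decoded = quopri.decodestring(encoded_line).decode("utf-8")
print(decoded)  # Café is a French word.

# Encode in the other direction: text -> UTF-8 bytes -> quoted-printable.
encoded = quopri.encodestring("Grüße from Germany.".encode("utf-8"))
print(encoded)
```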

Base64 Example

Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: base64

5pel5pys6Kqe44Gu44Oh44O844Or

This base64 string decodes to the UTF-8 bytes for "日本語のメール" (Japanese for "Japanese email").
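
The decoding can be verified with the standard base64 module. The sketch also shows where the roughly 33% overhead comes from: every 3 bytes become 4 ASCII characters.

```python
import base64

payload = "5pel5pys6Kqe44Gu44Oh44O844Or"

raw = base64.b64decode(payload)  # back to the original UTF-8 bytes
text = raw.decode("utf-8")       # interpret them using the declared charset

print(text)                      # 日本語のメール
print(len(raw), len(payload))    # 21 28
```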

Encoded Words: Unicode in Headers (RFC 2047)

MIME solved the body encoding problem, but headers (Subject, From display name, etc.) remained ASCII-only. RFC 2047 (1996) introduced encoded words to embed non-ASCII text in headers.

Syntax

=?charset?encoding?encoded_text?=

Where encoding is either B (base64) or Q (similar to quoted-printable, with an underscore standing in for a space).

Examples

Subject: =?utf-8?B?5pel5pys6Kqe44Gu44Oh44O844Or?=

This decodes to: 日本語のメール (Japanese email)

Subject: =?utf-8?Q?Caf=C3=A9_Menu?=

This decodes to: Café Menu
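
To make the syntax concrete, here is a hand-rolled decoder for a single encoded word. It is a sketch for illustration: the helper name decode_encoded_word is hypothetical, and it ignores details such as the optional language tag. Real code should prefer email.header.decode_header.

```python
import base64
import quopri
import re

# =?charset?encoding?encoded_text?=
ENCODED_WORD = re.compile(r"=\?([^?]+)\?([bBqQ])\?([^?]*)\?=")


def decode_encoded_word(word: str) -> str:
    """Decode one RFC 2047 encoded word into a Python string."""
    match = ENCODED_WORD.fullmatch(word)
    if match is None:
        raise ValueError(f"not an encoded word: {word!r}")
    charset, encoding, payload = match.groups()
    if encoding.upper() == "B":
        raw = base64.b64decode(payload)
    else:
        # header=True makes quopri treat "_" as a space, as Q encoding requires.
        raw = quopri.decodestring(payload.encode("ascii"), header=True)
    return raw.decode(charset)


print(decode_encoded_word("=?utf-8?B?5pel5pys6Kqe44Gu44Oh44O844Or?="))  # 日本語のメール
print(decode_encoded_word("=?utf-8?Q?Caf=C3=A9_Menu?="))                # Café Menu
```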

RFC 2047 Rules

  • Encoded words can appear in unstructured headers such as Subject and in the display-name portion of address headers such as From and To.
  • They cannot appear in the address portion of From/To (the user@domain part).
  • Whitespace between adjacent encoded words is ignored, so they are concatenated after decoding.
  • An encoded word can be at most 75 characters long, including the charset, encoding, and delimiters. Longer strings must be split across multiple encoded words.
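
The length rule can be illustrated with a small splitter. This is a sketch under stated assumptions: encode_words is a hypothetical helper, it always uses UTF-8 with B encoding, and it only breaks between characters so that each encoded word decodes on its own. Python's email package performs this splitting for you.

```python
import base64
from email.header import decode_header, make_header

MAX_ENCODED_WORD = 75
PREFIX, SUFFIX = "=?utf-8?B?", "?="
# Base64 turns every 3 bytes into 4 characters, so the payload budget
# (75 - 12 = 63 characters) holds at most 15 groups, i.e. 45 bytes.
MAX_CHUNK_BYTES = (MAX_ENCODED_WORD - len(PREFIX) - len(SUFFIX)) // 4 * 3


def encode_words(text: str) -> list[str]:
    """Split text into B-encoded words of at most 75 characters each."""
    words, chunk = [], b""
    for char in text:
        char_bytes = char.encode("utf-8")
        if len(chunk) + len(char_bytes) > MAX_CHUNK_BYTES:
            words.append(PREFIX + base64.b64encode(chunk).decode("ascii") + SUFFIX)
            chunk = b""
        chunk += char_bytes
    if chunk:
        words.append(PREFIX + base64.b64encode(chunk).decode("ascii") + SUFFIX)
    return words


subject = "日本語のメール" * 5            # 35 characters, 105 UTF-8 bytes
words = encode_words(subject)
print(len(words))                        # 3
print(max(len(word) for word in words))  # 72

# Whitespace between adjacent encoded words is dropped when decoding.
decoded = str(make_header(decode_header(" ".join(words))))
print(decoded == subject)                # True
```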

Generating Encoded Headers in Code

from email.header import Header

# Python handles RFC 2047 encoding automatically
subject = Header("Caf\u00e9 Menu \u2014 \u65e5\u672c\u8a9e", "utf-8")
print(subject.encode())
# =?utf-8?b?Q2Fmw6kgTWVudSDigJQg5pel5pys6Kqe?=

from email.header import Header
from email.mime.text import MIMEText

# Build a full message with a UTF-8 body and an encoded Subject
msg = MIMEText("\u3053\u3093\u306b\u3061\u306f\u4e16\u754c", "plain", "utf-8")
msg["Subject"] = Header("\u65e5\u672c\u8a9e\u306e\u30e1\u30fc\u30eb", "utf-8")
# The display name is already an encoded word; the address itself stays ASCII
msg["From"] = "=?utf-8?b?5bGx55Sw5aSq6YOO?= <[email protected]>"
msg["To"] = "[email protected]"
print(msg.as_string())

UTF-8 Headers: The Modern Approach (RFC 6532)

RFC 2047 encoded words are ugly, hard to parse, and limited to 75 characters per encoded word. RFC 6532 (2012) introduced a cleaner approach: allow raw UTF-8 bytes directly in email headers.

Before (RFC 2047)

From: =?utf-8?b?5bGx55Sw5aSq6YOO?= <[email protected]>
Subject: =?utf-8?b?5pel5pys6Kqe44Gu44Oh44O844Or?=

After (RFC 6532)

From: 山田太郎 <[email protected]>
Subject: 日本語のメール

Requirements

RFC 6532 requires:

  1. The sending server supports the SMTPUTF8 SMTP extension (RFC 6531)
  2. The receiving server also supports SMTPUTF8
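  3. Both servers negotiate UTF-8 support during the SMTP handshake

In the handshake, the receiving server advertises SMTPUTF8 in its EHLO response, and the sender then adds the SMTPUTF8 parameter to MAIL FROM:
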
EHLO sender.example.com
250-receiver.example.com
250-SMTPUTF8          <-- UTF-8 support advertised
250 OK

MAIL FROM:<[email protected]> SMTPUTF8
250 OK
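
The negotiation step boils down to looking for the SMTPUTF8 keyword in the EHLO response. The sketch below parses a response by hand to show the idea; ehlo_keywords and the sample response are hypothetical. With smtplib, the equivalent check on a connected client is has_extn("smtputf8") after calling ehlo().

```python
def ehlo_keywords(response: str) -> set[str]:
    """Collect the extension keywords from a multi-line EHLO response."""
    keywords = set()
    lines = response.strip().splitlines()
    for line in lines[1:]:           # the first line is the server greeting
        text = line[4:].strip()      # drop the "250-" or "250 " prefix
        if text:
            keywords.add(text.split()[0].upper())
    return keywords


sample_response = (
    "250-receiver.example.com\r\n"
    "250-8BITMIME\r\n"
    "250-SMTPUTF8\r\n"
    "250 PIPELINING\r\n"
)

supported = ehlo_keywords(sample_response)
print("SMTPUTF8" in supported)  # True
```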

Adoption Status

As of 2025, SMTPUTF8 support is widespread among major providers:

| Provider or software | SMTPUTF8 Support |
| --- | --- |
| Gmail | Yes (sending and receiving) |
| Outlook.com | Yes (receiving), partial (sending) |
| Yahoo Mail | Yes (receiving) |
| Apple Mail (iCloud) | Yes |
| Fastmail | Yes |
| Postfix | Yes (since 3.0) |
| Exim | Yes (since 4.86) |

However, many smaller mail servers and corporate mail systems still do not support SMTPUTF8. Email software should fall back to RFC 2047 encoding when SMTPUTF8 is not available.
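
In Python, the choice between the two header styles is a serialization policy. The sketch below renders one message both ways using email.policy.SMTPUTF8 and email.policy.SMTP. The addresses are placeholders, and the helper subject_line is hypothetical.

```python
from email import policy
from email.message import EmailMessage

msg = EmailMessage()
msg["From"] = "sender@example.com"       # placeholder address
msg["To"] = "recipient@example.com"      # placeholder address
msg["Subject"] = "日本語のメール"
msg.set_content("Hello")

# Raw UTF-8 headers (RFC 6532), for servers that advertise SMTPUTF8.
utf8_wire = msg.as_bytes(policy=policy.SMTPUTF8)
# ASCII-only headers using RFC 2047 encoded words, for everything else.
ascii_wire = msg.as_bytes(policy=policy.SMTP)


def subject_line(wire: bytes) -> bytes:
    """Return the first Subject line of a serialized message."""
    return next(line for line in wire.split(b"\r\n") if line.startswith(b"Subject:"))


print(subject_line(utf8_wire).decode("utf-8"))
print(subject_line(ascii_wire).decode("ascii"))
```

The first line printed contains the subject as readable text; the second contains an encoded word that any RFC 2047 aware client can decode.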

Internationalized Email Addresses (EAI)

The most ambitious Unicode-in-email standard is EAI (Email Address Internationalization), defined in RFC 6530-6533. EAI allows Unicode characters in both the local part and the domain part of email addresses.

Before EAI

[email protected]          -- ASCII only
[email protected]         -- ASCII only

With EAI

太郎@日本.jp                   -- Japanese local part + IDN domain
почта@пример.рф             -- Russian local part + Russian domain
user@äöü.example         -- ASCII local part + Unicode domain

How It Works

| Part | Standard | Encoding |
| --- | --- | --- |
| Local part | RFC 6531 | Raw UTF-8 (requires SMTPUTF8) |
| Domain part | IDN (RFC 5891) | Punycode (xn--) in DNS, displayed as Unicode |

The domain part uses Internationalized Domain Names (IDN), which have been supported since 2003. IDN domains are stored in DNS as Punycode (ASCII-compatible encoding):

| Display Form | Punycode (DNS) |
| --- | --- |
| 日本.jp | xn--wgv71a.jp |
| пример.рф | xn--e1afmkfd.xn--p1ai |
| äöü.example | xn--4ca0bs.example |
| 한국.kr | xn--3e0b707e.kr |
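
The conversion can be tried with Python's built-in "idna" codec, as in the sketch below. Note that this codec implements the older IDNA 2003 rules; for production use with the current IDNA 2008 rules, a dedicated library may be a better fit.

```python
display_names = ["日本.jp", "пример.рф", "äöü.example", "한국.kr"]

# Unicode display form -> ASCII form stored in DNS
ascii_forms = {name: name.encode("idna").decode("ascii") for name in display_names}
for name, ascii_form in ascii_forms.items():
    print(name, "->", ascii_form)

# ASCII form -> Unicode display form
print(b"xn--wgv71a.jp".decode("idna"))  # 日本.jp
```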

Sending Email to EAI Addresses

import smtplib
from email.message import EmailMessage

msg = EmailMessage()
msg["From"] = "[email protected]"
msg["To"] = "\u592a\u90ce@\u65e5\u672c.jp"
msg["Subject"] = "Test EAI"
msg.set_content("Hello from EAI!")

with smtplib.SMTP("smtp.example.com", 587) as server:
    server.starttls()
    server.login("user", "password")
    # Use send_message which handles SMTPUTF8 automatically
    server.send_message(msg)

EAI Challenges

Despite the standards being in place, EAI adoption faces practical challenges:

  1. Fallback: If the receiving server does not support SMTPUTF8, the message bounces. There is no graceful ASCII fallback for the local part.
  2. Web forms: Many websites still validate email addresses with ASCII-only regex patterns like [a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}, rejecting valid internationalized addresses.
  3. Database storage: Email address columns must support UTF-8, and unique constraints should account for Unicode normalization.
  4. Display: Not all email clients render internationalized addresses correctly.
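
The storage point deserves an example: the same visible address can arrive as different code point sequences. The sketch below shows one possible canonical form for a unique key. The helper canonical_address is hypothetical, and its choices (NFC for the local part, Punycode for the domain) are assumptions rather than a requirement of the EAI RFCs.

```python
import unicodedata


def canonical_address(address: str) -> str:
    """Build a comparison key: NFC local part plus ASCII (Punycode) domain."""
    local, _, domain = address.rpartition("@")
    local = unicodedata.normalize("NFC", local)
    domain = domain.encode("idna").decode("ascii").lower()
    return f"{local}@{domain}"


composed = "jos\u00e9@example.com"     # é as one code point (U+00E9)
decomposed = "jose\u0301@example.com"  # e followed by a combining acute accent

print(composed == decomposed)                                        # False
print(canonical_address(composed) == canonical_address(decomposed))  # True
```

The local part is not lowercased here because local parts are formally case-sensitive; whether to fold case is a policy decision for the application.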

Emoji in Email

Emoji in email bodies work through standard MIME encoding (UTF-8 + base64 or quoted-printable). Emoji in subject lines work via RFC 2047 encoded words or RFC 6532 raw UTF-8.

Subject Line Emoji

from email.header import Header

# Emoji in subject (RFC 2047 fallback)
subject = Header("\U0001f680 Launch Alert!", "utf-8")
# Encodes to: =?utf-8?b?8J+agCBMYXVuY2ggQWxlcnQh?=

Rendering Differences

Emoji in email subjects render differently across clients:

| Client | Emoji Display |
| --- | --- |
| Gmail (web) | Google's emoji set (Noto Color Emoji) |
| Apple Mail | Apple's emoji set |
| Outlook (desktop) | Segoe UI Emoji (Windows), sometimes black and white |
| Outlook (web) | Platform-dependent |
| Thunderbird | Platform-dependent |

Some email clients render emoji as color images; others show them in black and white or as empty boxes. For marketing emails, test emoji display across target clients before relying on them.

HTML Email and Unicode

HTML emails add another encoding layer. The HTML document within the MIME part has its own charset declaration:

<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
</head>
<body>
<p>Bonjour de Paris! 🇫🇷</p>
<p>Price: €19.99</p>
</body>
</html>

The MIME header charset and the HTML meta charset should match. If they disagree, behavior is undefined and client-dependent. Always use UTF-8 for both.
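
One way to keep the two declarations in agreement is to let the mail library set the MIME charset and to write the matching meta tag in the HTML yourself. The sketch below builds a plain-text plus HTML message with EmailMessage; the content is only an illustration.

```python
from email.message import EmailMessage

html_body = """<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
</head>
<body>
<p>Price: €19.99</p>
</body>
</html>
"""

msg = EmailMessage()
msg["Subject"] = "Bonjour"
msg.set_content("Price: €19.99")               # text/plain part
msg.add_alternative(html_body, subtype="html")  # text/html part

# Collect the charset that each text part declares in its MIME header.
charsets = {
    part.get_content_type(): part.get_content_charset()
    for part in msg.walk()
    if part.get_content_maintype() == "text"
}
print(charsets)
```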

HTML Entities vs Direct Unicode

You can represent Unicode characters in HTML email three ways:

| Method | Example | Result |
| --- | --- | --- |
| Direct UTF-8 | é | é |
| Numeric entity | &#233; | é |
| Named entity | &eacute; | é |

Direct UTF-8 is preferred because:

  • It is more compact
  • It works in plain text parts as well
  • Some email clients have incomplete HTML entity support
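
The equivalence of the three forms, and the size difference, can be checked with the standard html module. This is a small sketch; the sample strings are illustrative.

```python
import html

# Both entity forms resolve to the same character as direct UTF-8.
numeric = html.unescape("&#233;")
named = html.unescape("&eacute;")
print(numeric == named == "é")  # True

# Size on the wire: direct UTF-8 versus a numeric entity.
print(len("é".encode("utf-8")), len("&#233;"))  # 2 6

# If an ASCII-only HTML part is unavoidable, Python can emit numeric entities.
ascii_safe = "Café €19.99".encode("ascii", "xmlcharrefreplace").decode("ascii")
print(ascii_safe)  # Caf&#233; &#8364;19.99
```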

Best Practices for Email Developers

1. Always Use UTF-8

from email.mime.text import MIMEText

body_text = "Gr\u00fc\u00dfe from Germany."  # example body; any str works

# Always specify UTF-8
msg = MIMEText(body_text, "plain", "utf-8")

2. Handle RFC 2047 Decoding

When reading email, decode RFC 2047 encoded words:

from email.header import decode_header

raw_subject = "=?utf-8?B?5pel5pys6Kqe44Gu44Oh44O844Or?="
parts = decode_header(raw_subject)
subject = ""
for data, charset in parts:
    if isinstance(data, bytes):
        subject += data.decode(charset or "utf-8")
    else:
        subject += data
print(subject)  # 日本語のメール

3. Validate Email Addresses Properly

Do not use ASCII-only regex for email validation. At minimum, allow UTF-8 in the domain part (IDN). For full EAI support, allow UTF-8 in the local part as well:

from email.utils import parseaddr

def validate_email(address: str) -> bool:
    # Extract the bare address, dropping any display name
    name, addr = parseaddr(address)
    if not addr or "@" not in addr:
        return False
    local, domain = addr.rsplit("@", 1)
    # Local part: at least 1 character (Unicode is allowed)
    if not local:
        return False
    # Domain: non-empty and without empty labels (Unicode is allowed)
    if not domain or ".." in domain:
        return False
    return True

4. Test with International Content

Always test your email system with:

  • Subject: Mix of ASCII, accented Latin, CJK, and emoji
  • Body: Multi-script content (Latin + CJK + Arabic)
  • Sender name: Non-ASCII display names
  • Recipient: IDN domain addresses at minimum
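
A round-trip check is one way to automate the list above: build a message, serialize it, parse it back, and compare. The sketch below assumes the standard email package; the roundtrip helper and the example.com sender are hypothetical, and the recipient uses the Punycode form of an IDN domain so that the message does not require SMTPUTF8.

```python
from email import message_from_bytes, policy
from email.headerregistry import Address
from email.message import EmailMessage


def roundtrip(subject: str, body: str, sender_name: str, recipient: str) -> dict:
    """Serialize a message to bytes, parse it back, and report what survived."""
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = Address(display_name=sender_name, username="sender", domain="example.com")
    msg["To"] = recipient
    msg.set_content(body)

    parsed = message_from_bytes(msg.as_bytes(), policy=policy.default)
    return {
        "subject": str(parsed["Subject"]),
        "body": parsed.get_content().rstrip("\n"),
        "sender_name": parsed["From"].addresses[0].display_name,
        "recipient": parsed["To"].addresses[0].addr_spec,
    }


result = roundtrip(
    subject="Café 日本語 🚀",
    body="Hello, مرحبا, 你好",
    sender_name="山田太郎",
    recipient="user@xn--wgv71a.jp",
)
print(result)
```

In a test suite, each field of the result would be asserted against the input, and the same check repeated for every script the system is expected to handle.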

5. Fall Back Gracefully

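import smtplib
from email.message import EmailMessage

def send_email(msg: EmailMessage) -> bool:
    """Send msg; return False if it needs SMTPUTF8 and the server lacks it."""
    with smtplib.SMTP("smtp.example.com", 587) as server:
        server.starttls()
        server.login("user", "password")
        try:
            # For an EmailMessage with the default policy, send_message()
            # writes a non-ASCII Subject or display name as RFC 2047 encoded
            # words, and requests SMTPUTF8 only when an address itself
            # contains non-ASCII characters.
            server.send_message(msg)
            return True
        except smtplib.SMTPNotSupportedError:
            # The server does not advertise SMTPUTF8. Encoded words cannot
            # represent an address, so there is nothing to re-encode here:
            # report the failure and let the caller retry with an ASCII
            # alternative address if one is known.
            return False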

Timeline of Unicode in Email Standards

| Year | RFC | What It Added |
| --- | --- | --- |
| 1982 | RFC 822 | Original email format (ASCII only) |
| 1996 | RFC 2045-2049 | MIME: charset, base64, quoted-printable |
| 1996 | RFC 2047 | Encoded words in headers |
| 2001 | RFC 2822 | Updated email format (still ASCII headers) |
| 2003 | RFC 3490 | Internationalized Domain Names (IDN) |
| 2008 | RFC 5321/5322 | Current SMTP and email format standards |
| 2012 | RFC 6530-6533 | EAI: full UTF-8 in SMTP, headers, addresses |
| 2013 | RFC 6855-6858 | UTF-8 support in IMAP and POP3 |

The journey from ASCII-only email to full Unicode support took 30 years. Today, the standards are complete, but adoption continues to be a work in progress -- especially for internationalized email addresses, which remain the final frontier of Unicode in email.
