Punctuation
Characters used to organize and clarify written language: periods, commas, dashes, quotation marks, and more. Unicode General Category P covers all punctuation.
What Is Unicode Punctuation?
Unicode organizes punctuation characters under the top-level General Category P (Punctuation), subdivided into seven specific categories: Pc, Pd, Ps, Pe, Pi, Pf, and Po. Punctuation encompasses marks used to structure written language: paired delimiters, dashes, connectors, and miscellaneous marks that vary widely across the world's writing systems.
Unlike Latin-centric definitions of punctuation, Unicode's coverage includes CJK ideographic punctuation, Arabic quotation marks, Ethiopic wordspace, and hundreds of other culturally specific marks.
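As a rough illustration of that breadth, the sketch below checks the General Category of a few marks from different scripts with Python's standard `unicodedata` module. The helper name `is_punctuation` and the choice of sample characters are assumptions made for this example. It also shows the reverse case: several ASCII characters that are casually called punctuation (and appear in `string.punctuation`) are classified by Unicode as Symbols, not Punctuation.

```python
import string
import unicodedata


def is_punctuation(char: str) -> bool:
    """Return True if the character's General Category starts with 'P'."""
    return unicodedata.category(char).startswith("P")


# Marks from several scripts; each has a P* General Category.
samples = {
    "\u3001": "IDEOGRAPHIC COMMA",
    "\u060C": "ARABIC COMMA",
    "\u1361": "ETHIOPIC WORDSPACE",
    "\u00BF": "INVERTED QUESTION MARK",
}
for char, name in samples.items():
    print(f"U+{ord(char):04X} {name}: {is_punctuation(char)}")

# ASCII characters in string.punctuation that Unicode places in S* (Symbol).
ascii_symbols = [c for c in string.punctuation if not is_punctuation(c)]
print("".join(ascii_symbols))
```

The last line prints the characters that a category-based check would not treat as punctuation, which matters when replacing `string.punctuation` with a Unicode-aware test.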
The Seven Punctuation Subcategories
| Code | Name | Examples |
|---|---|---|
| Pc | Connector Punctuation | _ (LOW LINE), ‿ (UNDERTIE) |
| Pd | Dash Punctuation | - (HYPHEN-MINUS), – (EN DASH), — (EM DASH), ― (HORIZONTAL BAR) |
| Ps | Open Punctuation | ( [ { ⟨ 「 《 |
| Pe | Close Punctuation | ) ] } ⟩ 」 》 |
| Pi | Initial Punctuation | “ (LEFT DOUBLE QUOTATION MARK) « (LEFT-POINTING DOUBLE ANGLE QUOTATION MARK) |
| Pf | Final Punctuation | ” (RIGHT DOUBLE QUOTATION MARK) » (RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK) |
| Po | Other Punctuation | ! . , : ; ? @ # % & * / \ (and many more) |
import unicodedata

# One (character, expected category and description) pair per sample.
punctuation_samples = [
    ("_", "Pc - connector"),
    ("-", "Pd - hyphen-minus"),
    ("\u2013", "Pd - en dash U+2013"),
    ("(", "Ps - open paren"),
    (")", "Pe - close paren"),
    ("\u300C", "Ps - CJK left corner bracket"),
    ("\u300D", "Pe - CJK right corner bracket"),
    ("\u201C", "Pi - left double quotation"),
    ("\u201D", "Pf - right double quotation"),
    ("\u00AB", "Pi - left-pointing angle quotation"),
    ("\u00BB", "Pf - right-pointing angle quotation"),
    ("\u3002", "Po - CJK ideographic full stop"),
    ("\uFF0C", "Po - fullwidth comma"),
    ("!", "Po - exclamation mark"),
]

# Print the code point, the character, and the category reported by unicodedata.
for char, label in punctuation_samples:
    gc = unicodedata.category(char)
    print(f" U+{ord(char):04X} {char} {gc} {label}")
CJK and Fullwidth Punctuation
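East Asian writing uses a distinct set of punctuation marks designed for fullwidth layout, in which each mark occupies the same square cell as an ideograph (the older term "double-byte" refers to legacy East Asian encodings, not to how Unicode stores these characters):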
- 。 (U+3002) IDEOGRAPHIC FULL STOP — sentence terminator in Chinese and Japanese
- 、 (U+3001) IDEOGRAPHIC COMMA — list separator
- 「」 (U+300C, U+300D) — Japanese corner brackets for quotation
- 『』 (U+300E, U+300F) — double corner brackets for nested quotation
- · (U+00B7) vs ・ (U+30FB) — Latin middle dot vs Katakana middle dot
Fullwidth versions of ASCII characters (U+FF01–U+FF5E), which include punctuation such as U+FF01 FULLWIDTH EXCLAMATION MARK and U+FF0C FULLWIDTH COMMA, are compatibility equivalents of their ASCII counterparts; NFKC normalization maps them back to ASCII.
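A minimal sketch of that mapping, using `unicodedata.normalize` from the Python standard library. The sample strings are chosen for illustration only. The second half shows that the ideographic marks listed earlier are not fullwidth variants of ASCII and are left unchanged by NFKC.

```python
import unicodedata

# Fullwidth ! ? , ( ) written as escapes so the code points are unambiguous.
fullwidth = "\uFF01\uFF1F\uFF0C\uFF08\uFF09"
folded = unicodedata.normalize("NFKC", fullwidth)
print(folded)  # !?,()

# Ideographic comma, full stop, and corner brackets have no compatibility
# decomposition, so NFKC returns them as-is.
ideographic = "\u3001\u3002\u300C\u300D"
print(unicodedata.normalize("NFKC", ideographic) == ideographic)  # True
```

NFKC is lossy and also folds many non-punctuation compatibility characters, so it is usually applied for matching or search keys and not to text that will be displayed back to the user.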
Distinguishing Similar Punctuation
Many punctuation characters look similar but have different code points, names, and semantic roles:
| Character | Code Point | Name |
|---|---|---|
| - | U+002D | HYPHEN-MINUS |
| ‐ | U+2010 | HYPHEN |
| – | U+2013 | EN DASH |
| — | U+2014 | EM DASH |
| ― | U+2015 | HORIZONTAL BAR |
| ' | U+0027 | APOSTROPHE |
| ‘ | U+2018 | LEFT SINGLE QUOTATION MARK |
| ’ | U+2019 | RIGHT SINGLE QUOTATION MARK |
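When two marks are hard to tell apart on screen, their code points can be inspected directly. The following sketch prints the General Category and formal name of each character in the table, using only `unicodedata.category` and `unicodedata.name`; the variable name `lookalikes` is an assumption for this example.

```python
import unicodedata

# The eight characters from the table, written as escapes to avoid confusion.
lookalikes = "\u002D\u2010\u2013\u2014\u2015\u0027\u2018\u2019"

for char in lookalikes:
    category = unicodedata.category(char)
    name = unicodedata.name(char)
    print(f"U+{ord(char):04X}  {category}  {name}")
```

All five dashes report Pd, while the three apostrophe-like marks fall into three different subcategories (Po, Pi, Pf).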
Smart quote substitution in word processors converts ASCII ' and " to their Pi/Pf counterparts. This can cause problems in code contexts where the apostrophe is syntactically significant.
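One possible way to undo that substitution in pasted code is a small translation table. This is a sketch only: `SMART_QUOTES` and `straighten_quotes` are hypothetical names, and the table covers just the four most common typographic quotes.

```python
# Hypothetical helper: map typographic quotes back to their ASCII forms.
SMART_QUOTES = str.maketrans({
    "\u2018": "'",   # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",   # RIGHT SINGLE QUOTATION MARK
    "\u201C": '"',   # LEFT DOUBLE QUOTATION MARK
    "\u201D": '"',   # RIGHT DOUBLE QUOTATION MARK
})


def straighten_quotes(text: str) -> str:
    """Replace curly single and double quotes with ASCII ' and "."""
    return text.translate(SMART_QUOTES)


pasted = "print(\u2018hello\u2019)"
print(straighten_quotes(pasted))  # print('hello')
```

A table like this is best limited to source code and configuration files, since in ordinary prose the curly forms are usually the intended characters.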
Quick Facts
| Property | Value |
|---|---|
| General Category group | P (Punctuation) |
| Subcategories | Pc, Pd, Ps, Pe, Pi, Pf, Po (7) |
| Python check | unicodedata.category(char).startswith("P") |
| Regex | \p{Punctuation} or \p{P} (PCRE and the third-party Python regex module; the standard-library re module does not support \p{...}) |
| CJK punctuation block | U+3000–U+303F (CJK Symbols and Punctuation) |
| Fullwidth punctuation | within U+FF01–U+FF60 (compatibility equivalents; the range also contains fullwidth digits, letters, and symbols) |
| Spec reference | Unicode Standard Chapter 6 |
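At the time of writing, Python's standard-library `re` module does not accept property escapes such as `\p{P}`, so a category-based filter is one alternative that needs no third-party package. The sketch below is an assumption about how such a filter might look; `strip_punctuation` is a hypothetical name.

```python
import unicodedata


def strip_punctuation(text: str) -> str:
    """Drop every character whose General Category starts with 'P'."""
    return "".join(
        char for char in text
        if not unicodedata.category(char).startswith("P")
    )


# ASCII and CJK punctuation are both removed; letters and spaces are kept.
sample = "Hello, world! \u300C\u3053\u3093\u306B\u3061\u306F\u300D\u3002"
print(strip_punctuation(sample))
```

Characters in the Symbol categories, such as `$` or `+`, are not removed by this filter because they are not in General Category P.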