
String

A sequence of characters in a programming language. Internal representation varies: UTF-8 (Go, Rust), UTF-16 (Java, JavaScript, C#), or a flexible-width array of code points (Python 3, per PEP 393).


What Is a String?

In programming, a string is a sequence of characters used to represent text. Strings are one of the most fundamental data types across all programming languages. The word "string" comes from the metaphor of threading characters together like beads on a string.

How a string is stored, indexed, and measured depends critically on the programming language — specifically on how that language represents characters internally.

Strings and Unicode

Before Unicode, strings were simple: each character was one byte, the encoding was fixed (ASCII, Latin-1, etc.), and string length equaled byte count. Unicode broke this assumption. A Unicode string contains characters from any script, and the same abstract string can be encoded as different byte sequences depending on the encoding (UTF-8, UTF-16, UTF-32).
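To make "same abstract string, different byte sequences" concrete, here is a minimal Python sketch encoding one four-character string under three encodings (the `-le` variants are used so no byte-order mark is prepended):

```python
# One abstract string, three different byte sequences depending on encoding.
s = "café"  # 4 code points; "é" is U+00E9

utf8  = s.encode("utf-8")     # "é" takes 2 bytes here
utf16 = s.encode("utf-16-le")  # 2 bytes per BMP code point
utf32 = s.encode("utf-32-le")  # always 4 bytes per code point

print(len(s), len(utf8), len(utf16), len(utf32))  # 4 5 8 16
assert utf8.decode("utf-8") == s  # round-trips losslessly
```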

Modern languages differ in their internal string representation:

Language     Internal encoding           Notes
Python 3     Flexible width (PEP 393)    str is a sequence of code points
JavaScript   UTF-16 code units           .length counts code units, not code points
Java         UTF-16 code units           String.length() returns code units
Swift        Grapheme clusters           .count returns user-perceived characters
Rust         UTF-8 bytes                 Indexing by byte, iteration by char
Go           UTF-8 bytes                 len() returns bytes; []rune() for code points
C#           UTF-16 code units           Same as Java
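The rows above can be approximated from a single language. This Python sketch measures one string the way several of the languages would, simulating a UTF-16 code-unit count (the JS/Java `.length`) by encoding with `utf-16-le` and dividing the byte count by two:

```python
# Approximating each language's notion of "length" for one string.
s = "Hello, 世界 🌍"

code_points = len(s)                           # Python len() / Go []rune: 11
utf8_bytes  = len(s.encode("utf-8"))           # Rust / Go len(): 18
utf16_units = len(s.encode("utf-16-le")) // 2  # JS / Java .length: 12 (🌍 = 2 units)

print(code_points, utf8_bytes, utf16_units)  # 11 18 12
```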

Python String Fundamentals

# Python 3 str = Unicode string (sequence of code points)
s = "Hello, 世界 🌍"

len(s)          # 11 code points
s[7]            # "世" — indexed by code point
s[-1]           # "🌍" — single emoji code point

# bytes vs str
b = s.encode("utf-8")   # bytes object
len(b)                  # 18 (UTF-8 bytes)
b.decode("utf-8") == s  # True

# Byte length varies by encoding
s.encode("utf-8")   # variable: ASCII=1, CJK=3, emoji=4 bytes
s.encode("utf-16")  # 2 bytes per BMP char, 4 for supplementary (plus a 2-byte BOM)
s.encode("utf-32")  # always 4 bytes per code point

# String methods work on code points
"café".upper()          # "CAFÉ"
"résumé".casefold()     # "résumé"
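Code-point counting also interacts with Unicode normalization: the same visible text can be one code point or a base character plus a combining mark. A short sketch using the standard unicodedata module:

```python
import unicodedata

# "é" can be one code point (NFC) or two (NFD: "e" + combining acute accent).
# Both render identically, but len() and == operate on code points.
nfc = unicodedata.normalize("NFC", "café")
nfd = unicodedata.normalize("NFD", "café")

print(len(nfc), len(nfd))  # 4 5 — same visible text, different code point counts
print(nfc == nfd)          # False — equality compares code points, not appearance
print(unicodedata.normalize("NFC", nfd) == nfc)  # True after normalizing
```

This is why text comparison in real applications usually normalizes both sides first.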

JavaScript String Quirks

// JS strings are UTF-16 — supplementary chars have length 2
const simple = "Hello";
simple.length;   // 5

const emoji = "🌍";
emoji.length;    // 2 (two UTF-16 code units)
emoji[0];        // "\uD83C" (high surrogate — not meaningful alone)

// Code points (correct count):
[...emoji].length;           // 1
emoji.codePointAt(0);        // 127757 (0x1F30D)

// Iterating by code point (ES6+)
for (const char of "😀abc") {
  console.log(char);  // "😀", "a", "b", "c"
}

Strings as Immutable Sequences

In Python, Java, and JavaScript, strings are immutable: you cannot change a character in place. All "modification" operations create new string objects.

s = "hello"
s[0] = "H"       # TypeError: 'str' object does not support item assignment
s = "H" + s[1:]  # Creates new string "Hello"
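Because every "modification" allocates a new object, building a string piece by piece with `+=` can be quadratic in the worst case. The common Python pattern is to collect the parts in a list and join once:

```python
# Collect fragments, then join once — one final allocation instead of
# a new intermediate string per concatenation.
parts = []
for i in range(5):
    parts.append(str(i))
result = "".join(parts)
print(result)  # "01234"
```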

String Interning

Many languages intern (cache and reuse) string objects for short or frequently used strings. In Python, string literals and identifiers are typically interned; in Java, string literals in the string pool are interned. This means two variables holding the same short string value may reference the same object in memory.

a = "hello"
b = "hello"
a is b  # True in CPython (interned) — an implementation detail, not a guarantee

c = "".join(["h","e","l","l","o"])
c is a  # May be False (dynamically created)
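When identity comparison of dynamically built strings is actually needed, interning can be forced explicitly; in Python this is `sys.intern`, which returns the canonical cached copy:

```python
import sys

a = "hello"
c = "".join(["h", "e", "l", "l", "o"])  # built at runtime — typically not interned

# sys.intern maps equal strings to one canonical object,
# so identity (is) comparison becomes reliable.
print(sys.intern(c) is sys.intern(a))  # True
```

This is occasionally used to speed up dictionary lookups on large sets of repeated keys.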

Quick Facts

Property                   Value
Python type                str (sequence of Unicode code points)
JavaScript type            String (UTF-16 code units)
Immutability               Immutable in Python, Java, JS, Swift
Python .length equivalent  len(s) returns code point count
JS .length                 Returns UTF-16 code unit count
Encoding to bytes          .encode("utf-8") in Python
Decoding from bytes        .decode("utf-8") on bytes in Python
