String
A sequence of characters in a programming language. Internal representations vary: UTF-8 (Go, Rust), UTF-16 (Java, JavaScript, C#), and a flexible 1, 2, or 4 bytes per code point (CPython 3.3 and later).
What Is a String?
In programming, a string is a sequence of characters used to represent text. Strings are one of the most fundamental data types across all programming languages. The word "string" comes from the metaphor of threading characters together like beads on a string.
How a string is stored, indexed, and measured depends critically on the programming language — specifically on how that language represents characters internally.
Strings and Unicode
Before Unicode, strings in single-byte encodings (ASCII, Latin-1, etc.) were simple: each character was one byte and string length equaled byte count. Unicode broke this assumption. A Unicode string can contain characters from any script, and the same abstract string can be encoded as different byte sequences depending on the encoding (UTF-8, UTF-16, UTF-32).
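To make the "same string, different bytes" point concrete, here is a minimal Python sketch. The sample character and the `encoded` dictionary are chosen purely for illustration, and the little-endian codec variants are used so that no byte order mark is prepended.

```python
text = "é"  # U+00E9, a single code point

# Encode the same one-character string three ways
encoded = {
    name: text.encode(name)
    for name in ("utf-8", "utf-16-le", "utf-32-le")
}

for name, data in encoded.items():
    # Show the byte count and the raw bytes in hexadecimal
    print(name, len(data), data.hex())
```

The string has length 1 in code points in every case; only the byte sequence and its length change with the encoding.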
Modern languages differ in their internal string representation:
| Language | Internal Encoding | Notes |
|---|---|---|
| Python 3 | Flexible width in CPython (1, 2, or 4 bytes per code point, PEP 393) | str is a sequence of code points |
| JavaScript | UTF-16 code units | .length counts code units, not code points |
| Java | UTF-16 code units | String.length() returns code units |
| Swift | UTF-8 (native strings since Swift 5) | .count returns grapheme clusters (user-perceived characters) |
| Rust | UTF-8 bytes | Indexing by byte, iteration by char |
| Go | UTF-8 bytes | len() returns bytes; []rune() for code points |
| C# | UTF-16 code units | Same as Java |
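The differences in the table can be approximated from Python alone. The following is a minimal sketch rather than a definitive tool: `length_report` is a hypothetical helper name, and it estimates what other languages would report by encoding the string and counting units. It does not count grapheme clusters, which needs Unicode segmentation data that the standard library does not provide.

```python
def length_report(s: str) -> dict:
    """Count the same string in three different units (illustrative helper)."""
    return {
        # What Python's len() reports: Unicode code points
        "code_points": len(s),
        # What JavaScript .length and Java length() report: UTF-16 code units
        # ("utf-16-le" avoids the 2-byte BOM that plain "utf-16" prepends)
        "utf16_code_units": len(s.encode("utf-16-le")) // 2,
        # What Go len() and Rust str::len() report: UTF-8 bytes
        "utf8_bytes": len(s.encode("utf-8")),
    }

report = length_report("Hello, 世界 🌍")
print(report)
```

For this sample the three counts differ: 11 code points, 12 UTF-16 code units (the emoji needs a surrogate pair), and 18 UTF-8 bytes.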
Python String Fundamentals
# Python 3 str = Unicode string (sequence of code points)
s = "Hello, 世界 🌍"
len(s) # 11 code points
s[7] # "世" (indexed by code point; s[8] is "界")
s[-1] # "🌍" — single emoji code point
# bytes vs str
b = s.encode("utf-8") # bytes object
len(b) # 18 (UTF-8 bytes: 8 ASCII + 2×3 CJK + 4 emoji)
b.decode("utf-8") == s # True
# Byte length varies by encoding
s.encode("utf-8") # variable: ASCII=1, CJK=3, emoji=4 bytes
s.encode("utf-16") # 2-byte BOM, then 2 bytes per BMP char, 4 for supplementary
s.encode("utf-32") # 4-byte BOM, then always 4 bytes per code point
# String methods work on code points
"café".upper() # "CAFÉ"
"résumé".casefold() # "résumé"
JavaScript String Quirks
// JS strings are UTF-16 — supplementary chars have length 2
const simple = "Hello";
simple.length; // 5
const emoji = "🌍";
emoji.length; // 2 (two UTF-16 code units)
emoji[0]; // "\uD83C" (high surrogate — not meaningful alone)
// Code points (correct count):
[...emoji].length; // 1
emoji.codePointAt(0); // 127757 (0x1F30D)
// Iterating by code point (ES6+)
for (const char of "😀abc") {
  console.log(char); // "😀", "a", "b", "c"
}
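The two code units that JavaScript reports for a supplementary character come from the UTF-16 surrogate pair calculation. As a rough sketch in Python (the function name `to_surrogate_pair` is hypothetical, introduced only for this illustration), the arithmetic looks like this:

```python
def to_surrogate_pair(code_point: int) -> tuple:
    """Split a supplementary code point into UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only supplementary code points need a surrogate pair")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low

high, low = to_surrogate_pair(ord("🌍"))
print(hex(high), hex(low))

# Cross-check against Python's own UTF-16 encoder (big-endian, no BOM)
encoded = "🌍".encode("utf-16-be")
print(encoded.hex())
```

The high surrogate computed here, 0xD83C, is the same value the JavaScript example shows for `emoji[0]`.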
Strings as Immutable Sequences
In Python, Java, and JavaScript, strings are immutable: you cannot change a character in place. All "modification" operations create new string objects.
s = "hello"
s[0] = "H" # TypeError: 'str' object does not support item assignment
s = "H" + s[1:] # Creates new string "Hello"
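Because every "modification" builds a new object, a small helper can make the copy explicit. This is only a sketch under that assumption; `replace_at` is a hypothetical name, not a standard library function.

```python
def replace_at(s: str, index: int, new_char: str) -> str:
    """Return a copy of s with the character at index replaced (illustrative)."""
    if not 0 <= index < len(s):
        raise IndexError("index out of range")
    return s[:index] + new_char + s[index + 1:]

original = "hello"
updated = replace_at(original, 0, "H")
print(original, updated)  # the original string is untouched

# For many edits, collect the pieces in a list and join once
parts = []
for word in ["strings", "are", "immutable"]:
    parts.append(word.upper())
sentence = " ".join(parts)
print(sentence)
```

Joining a list once is generally preferred over repeated `+` in a loop, since each concatenation may copy the whole string built so far.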
String Interning
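Many languages intern (cache and reuse) string objects for short or frequently used strings. In CPython, identifier-like string constants in source code are typically interned at compile time, although this is an implementation detail rather than a language guarantee; in Java, all string literals are interned in the string pool. Two variables holding the same literal may therefore reference the same object in memory, which is why string equality should be tested with == in Python (or .equals() in Java) rather than by identity.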
a = "hello"
b = "hello"
a is b # True in CPython (both literals are interned; implementation detail)
c = "".join(["h","e","l","l","o"])
c is a # May be False (dynamically created)
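When identity sharing is actually wanted, Python exposes interning explicitly through `sys.intern`. The snippet below is a small sketch of that call; the variable names are illustrative only.

```python
import sys

# Two equal strings built at run time; whether they share an object
# is up to the interpreter, so it should not be relied on
a = "".join(["h", "e", "l", "l", "o"])
b = "".join(["h", "e", "l", "l", "o"])
equal_by_value = (a == b)  # value comparison is always reliable

# sys.intern() returns the canonical copy, so both results share identity
ia = sys.intern(a)
ib = sys.intern(b)
same_object = ia is ib
print(equal_by_value, same_object)
```

Interning can save memory and speed up dictionary lookups for heavily repeated keys, but ordinary comparisons should still use `==`.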
Quick Facts
| Property | Value |
|---|---|
| Python type | str (sequence of Unicode code points) |
| JavaScript type | String (UTF-16 code units) |
| Immutability | Immutable in Python, Java, JS; Swift strings are value types and mutable when declared with var |
| Python .length equivalent | len(s) returns code point count |
| JS .length | Returns UTF-16 code unit count |
| Encoding to bytes | .encode("utf-8") in Python |
| Decoding from bytes | .decode("utf-8") on bytes in Python |