Programming and Development

String

A sequence of characters in a programming language. Internal representations vary: UTF-8 (Go, Rust), UTF-16 (Java, JavaScript, C#), and flexible-width code points in Python 3.


What Is a String?

In programming, a string is a sequence of characters used to represent text. Strings are one of the most fundamental data types across all programming languages. The word "string" comes from the metaphor of threading characters together like beads on a string.

How a string is stored, indexed, and measured depends critically on the programming language — specifically on how that language represents characters internally.

Strings and Unicode

Before Unicode, strings were simple: each character was one byte, the encoding was fixed (ASCII, Latin-1, etc.), and string length equaled byte count. Unicode broke this assumption. A Unicode string contains characters from any script, and the same abstract string can be encoded as different byte sequences depending on the encoding (UTF-8, UTF-16, UTF-32).
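This is easy to see in Python, which can encode the same str into each of these byte sequences (the big-endian codec variants are used here so no BOM is prepended):

```python
# One abstract string, three different byte sequences.
s = "é"  # U+00E9, a single code point

print(s.encode("utf-8"))      # b'\xc3\xa9'          (2 bytes)
print(s.encode("utf-16-be"))  # b'\x00\xe9'          (2 bytes)
print(s.encode("utf-32-be"))  # b'\x00\x00\x00\xe9'  (4 bytes)
```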

Modern languages differ in their internal string representation:

Language   | Internal encoding                              | Notes
Python 3   | Flexible width (PEP 393: 1/2/4 bytes per code point) | str is a sequence of code points
JavaScript | UTF-16 code units                              | .length counts code units, not code points
Java       | UTF-16 code units                              | String.length() returns code units
Swift      | Grapheme clusters                              | .count returns user-perceived characters
Rust       | UTF-8 bytes                                    | Indexing by byte, iteration by char
Go         | UTF-8 bytes                                    | len() returns bytes; []rune() for code points
C#         | UTF-16 code units                              | Same as Java
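These differences show up as different "lengths" for the same text. A quick Python sketch of what each row's unit would count for one emoji:

```python
s = "🌍"  # U+1F30D, outside the Basic Multilingual Plane

print(len(s))                        # 1 — Python counts code points
print(len(s.encode("utf-8")))        # 4 — what Go/Rust len() would see (bytes)
print(len(s.encode("utf-16-le")) // 2)  # 2 — JS/Java length (UTF-16 code units)
```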

Python String Fundamentals

# Python 3 str = Unicode string (sequence of code points)
s = "Hello, 世界 🌍"

len(s)          # 11 code points
s[7]            # "世" — indexed by code point
s[-1]           # "🌍" — single emoji code point

# bytes vs str
b = s.encode("utf-8")   # bytes object
len(b)                  # 18 (UTF-8 bytes)
b.decode("utf-8") == s  # True

# Byte length varies by encoding
s.encode("utf-8")   # variable: ASCII=1, CJK=3, emoji=4 bytes
s.encode("utf-16")  # 2 bytes per BMP char, 4 for supplementary
s.encode("utf-32")  # always 4 bytes per code point

# String methods work on code points
"café".upper()          # "CAFÉ"
"Straße".casefold()     # "strasse" — ß folds to "ss"

JavaScript String Quirks

// JS strings are UTF-16 — supplementary chars have length 2
const simple = "Hello";
simple.length;   // 5

const emoji = "🌍";
emoji.length;    // 2 (two UTF-16 code units)
emoji[0];        // "\uD83C" (high surrogate — not meaningful alone)

// Code points (correct count):
[...emoji].length;           // 1
emoji.codePointAt(0);        // 127757 (0x1F30D)

// Iterating by code point (ES6+)
for (const char of "😀abc") {
  console.log(char);  // "😀", "a", "b", "c"
}

Strings as Immutable Sequences

In Python, Java, and JavaScript, strings are immutable: you cannot change a character in place. All "modification" operations create new string objects.

s = "hello"
s[0] = "H"       # TypeError: 'str' object does not support item assignment
s = "H" + s[1:]  # Creates new string "Hello"
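A practical consequence of immutability: building a string with repeated + creates a new object at every step, so the idiomatic pattern is to collect the pieces and join them once. A minimal sketch:

```python
# Worst case O(n^2): each += can copy the whole string built so far
result = ""
for ch in ["H", "e", "l", "l", "o"]:
    result += ch

# Idiomatic: one pass, one final allocation
result = "".join(["H", "e", "l", "l", "o"])  # "Hello"
```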

String Interning

Many languages intern (cache and reuse) string objects for short or frequently used strings. In Python, string literals and identifiers are typically interned; in Java, string literals in the string pool are interned. This means two variables holding the same short string value may reference the same object in memory.

a = "hello"
b = "hello"
a is b  # True in CPython (both literals interned)

c = "".join(["h","e","l","l","o"])
c is a  # May be False (dynamically created)
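CPython exposes interning directly through sys.intern, which returns the canonical object for a given string value and makes identity comparison reliable even for dynamically built strings:

```python
import sys

a = "hello"
c = "".join(["h", "e", "l", "l", "o"])  # equal value, separately allocated

# Both calls return the same canonical object
sys.intern(c) is sys.intern(a)  # True
```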

Quick Facts

Property                 | Value
Python type              | str (a sequence of Unicode code points)
JavaScript type          | String (UTF-16 code units)
Immutability             | Immutable in Python, Java, JavaScript, C#
Python .length equivalent | len(s) returns the code point count
JS .length               | Returns the UTF-16 code unit count
Encoding to bytes        | s.encode("utf-8") in Python
Decoding from bytes      | b.decode("utf-8") on bytes in Python

Related Terms

More terms in Programming and Development

Java Unicode

Java strings use UTF-16 internally. char is 16-bit (only BMP). For supplementary …

Python Unicode

Python 3 uses Unicode strings by default (str = UTF-8 internally via …

Rust Unicode

Rust strings (str/String) are guaranteed valid UTF-8. char type represents a Unicode …

Unicode Escape Sequences

Syntax for representing Unicode characters in source code. Varies by language: \u2713 (Python/Java/JS), \u{2713} (JS/Ruby/Rust), \U00012345 (Python/C).

Unicode Regular Expressions

Regular expression patterns that use Unicode properties: \p{L} (any letter), \p{Script=Greek} (Greek script), \p{Emoji}. Support varies by language and regex engine.

Encoding / Decoding

Encoding converts characters to bytes (str.encode('utf-8')); decoding converts bytes back to characters (bytes.decode('utf-8')). Done correctly, this prevents mojibake (garbled text).

Surrogate Pairs

Two 16-bit code units used in UTF-16 to encode supplementary characters (high surrogate U+D800–U+DBFF + low surrogate U+DC00–U+DFFF). 😀 = D83D DE00.

Null Character

U+0000 (NUL). The first Unicode/ASCII character, used as the string terminator in C/C++. Security risk: null-byte injection can truncate strings in vulnerable systems.

Invisible Characters

Characters with no visible glyph: whitespace, zero-width characters, control characters, and format characters. Can cause security problems such as spoofing and text smuggling.

String Length Ambiguity

The "length" of a Unicode string depends on the unit: code units (JavaScript .length), code points (Python len()), or grapheme clusters. 👨‍👩‍👧‍👦 = 7 code points, 1 grapheme.