Have you ever opened a document and seen gibberish like "café" instead of "café"? That's mojibake—garbled text caused by character encoding mismatches. Understanding character encoding is essential for any developer working with text.
The Problem: Computers Only Understand Numbers
Computers don't understand letters, symbols, or emoji. They only work with numbers—specifically, ones and zeros. To store and transmit text, we need a system that maps characters to numbers. That's what character encoding does.
ASCII: The Original Character Set
ASCII (American Standard Code for Information Interchange) was created in 1963 and became the foundation of text encoding. It uses 7 bits to represent 128 characters:
| Range | Characters |
|---|---|
| 0-31 | Control characters (newline, tab, etc.) |
| 32-47 | Space, punctuation, and symbols |
| 48-57 | Digits 0-9 |
| 58-64 | More punctuation and symbols (: ; < = > ? @) |
| 65-90 | Uppercase A-Z |
| 91-96 | Brackets and symbols |
| 97-122 | Lowercase a-z |
| 123-126 | Braces, pipe, tilde |
| 127 | Delete |
Some commonly used ASCII codes:
'A' = 65 'a' = 97 '0' = 48
'B' = 66 'b' = 98 '1' = 49
' ' = 32 '\n' = 10 '\t' = 9
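These mappings can be checked directly in Python, where ord() converts a character to its numeric code and chr() goes the other way—a quick sanity check of the values above:

```python
# ord() maps a character to its code; chr() inverts the mapping.
assert ord('A') == 65 and ord('a') == 97 and ord('0') == 48
assert ord(' ') == 32 and ord('\n') == 10 and ord('\t') == 9

letter = chr(66)  # chr() turns the code back into a character
```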
The Limitations of ASCII
ASCII's 128 characters were enough for American English, but the world has far more than 128 characters:
- Accented letters: é, ñ, ü, ø
- Other alphabets: Cyrillic (Привет), Greek (Γειά), Arabic (مرحبا)
- Asian languages: Chinese (你好), Japanese (こんにちは), Korean (안녕)
- Symbols: €, £, ¥, ©, ®
- Emoji: 😀, 🎉, 🌍
Various "extended ASCII" encodings added characters 128-255, but each region created its own:
- ISO-8859-1 (Latin-1): Western European
- ISO-8859-5: Cyrillic
- Windows-1252: Microsoft's Western European
- GB2312: Simplified Chinese
This created chaos: the same byte value meant different characters in different encodings. A document created in one region became unreadable in another.
Unicode: One Character Set to Rule Them All
Unicode was created to solve this by assigning a unique number (called a code point) to every character in every language. The format is U+XXXX where XXXX is a hexadecimal number.
'A' = U+0041
'€' = U+20AC
'中' = U+4E2D
'😀' = U+1F600
Unicode currently defines over 150,000 characters covering:
- 161 modern and historic scripts
- Symbol sets (mathematical, musical, technical)
- Emoji (with more added regularly)
Code points range from U+0000 to U+10FFFF (over 1.1 million possible characters).
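In Python, ord() returns a character's code point as an integer; formatting it in hexadecimal reproduces the conventional U+XXXX notation. A small sketch (the codepoint helper is just for illustration):

```python
# Format a character's code point in the conventional U+XXXX style.
def codepoint(ch: str) -> str:
    return f"U+{ord(ch):04X}"

# Matches the examples above: A, €, 中, 😀
cps = [codepoint(c) for c in 'A€中😀']
```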
UTF-8: The Practical Implementation
Unicode defines what numbers represent what characters, but we still need to store those numbers as bytes. That's where UTF (Unicode Transformation Format) comes in.
UTF-8 is the dominant encoding on the web (used by over 98% of websites). It's a variable-length encoding:
| Code Point Range | Bytes | Bit Pattern |
|---|---|---|
| U+0000 - U+007F | 1 | 0xxxxxxx |
| U+0080 - U+07FF | 2 | 110xxxxx 10xxxxxx |
| U+0800 - U+FFFF | 3 | 1110xxxx 10xxxxxx 10xxxxxx |
| U+10000 - U+10FFFF | 4 | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
UTF-8's brilliance:
- ASCII compatible: ASCII characters use the same single byte, so old ASCII files are valid UTF-8
- Self-synchronizing: You can find character boundaries by looking at byte patterns
- No null bytes: Safe for C strings (except for U+0000 itself)
- Efficient for Latin text: English text is the same size as ASCII
Example encodings:
'A' (U+0041) → 0x41 (1 byte)
'é' (U+00E9) → 0xC3 0xA9 (2 bytes)
'中' (U+4E2D) → 0xE4 0xB8 0xAD (3 bytes)
'😀' (U+1F600) → 0xF0 0x9F 0x98 0x80 (4 bytes)
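Python's str.encode produces exactly these byte sequences, and the two-byte case can even be built by hand from the bit patterns in the table above—a sketch:

```python
# str.encode yields the byte sequences listed above.
assert 'A'.encode('utf-8') == b'\x41'
assert 'é'.encode('utf-8') == b'\xc3\xa9'
assert '中'.encode('utf-8') == b'\xe4\xb8\xad'
assert '😀'.encode('utf-8') == b'\xf0\x9f\x98\x80'

# The two-byte case assembled by hand: 110xxxxx 10xxxxxx
cp = ord('é')  # 0xE9, fits in 11 bits
manual = bytes([0b11000000 | (cp >> 6),    # leading byte: top 5 bits
                0b10000000 | (cp & 0x3F)]) # continuation: low 6 bits
```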
UTF-8 vs UTF-16 vs UTF-32
| Encoding | Unit Size | Best For |
|---|---|---|
| UTF-8 | 1-4 bytes | Web, files, ASCII-heavy text |
| UTF-16 | 2-4 bytes | Windows internals, Java, JavaScript strings |
| UTF-32 | 4 bytes | Fixed-width processing (rare) |
UTF-16 uses 2 bytes for most common characters and 4 bytes (surrogate pairs) for others. It's used internally by Windows, Java, and JavaScript.
UTF-32 uses 4 bytes for everything—simple but wasteful. It's rarely used for storage or transmission.
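The size trade-off is easy to see by encoding the same string three ways—here 'A' (ASCII) costs 1/2/4 bytes and '😀' (outside the BMP) costs 4 bytes in every scheme:

```python
# Byte counts for 'A' + '😀' under each encoding (little-endian, no BOM).
sample = 'A😀'
sizes = {enc: len(sample.encode(enc))
         for enc in ('utf-8', 'utf-16-le', 'utf-32-le')}
# 'A' is 1/2/4 bytes, '😀' is 4/4/4, so totals are 5, 6, and 8.
```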
Byte Order Mark (BOM)
Multi-byte encodings can store bytes in different orders:
- Big-endian: Most significant byte first
- Little-endian: Least significant byte first
The BOM is a special character (U+FEFF) at the start of a file that indicates the byte order:
| Encoding | BOM Bytes |
|---|---|
| UTF-8 | EF BB BF (optional, often problematic) |
| UTF-16 BE | FE FF |
| UTF-16 LE | FF FE |
UTF-8 BOM warning: While valid, the UTF-8 BOM can cause issues:
- Shell scripts may fail (interpreter can't read shebang)
- PHP may output characters before headers
- Concatenated files get BOM in the middle
Unless required (some Windows programs expect it), avoid UTF-8 BOM.
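Since the BOM is just U+FEFF run through the encoder, Python's codecs module exposes the byte sequences from the table as constants, and the utf-8-sig codec strips a leading BOM on decode:

```python
import codecs

# The BOM bytes from the table are just U+FEFF encoded each way.
assert '\ufeff'.encode('utf-8') == codecs.BOM_UTF8          # EF BB BF
assert '\ufeff'.encode('utf-16-be') == codecs.BOM_UTF16_BE  # FE FF
assert '\ufeff'.encode('utf-16-le') == codecs.BOM_UTF16_LE  # FF FE

# 'utf-8-sig' tolerates and removes a leading UTF-8 BOM.
text = (codecs.BOM_UTF8 + 'hello'.encode('utf-8')).decode('utf-8-sig')
```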
Common Encoding Problems and Fixes
Mojibake Examples
| You See | You Expected | Problem |
|---|---|---|
| café | café | UTF-8 read as Latin-1 |
| caf� | café | Latin-1 read as UTF-8 (invalid bytes become �) |
| ??? or □ | 你好 | Target encoding or font can't represent the characters |
| ÃƒÂ© | é | Double encoding |
How to Fix
- Identify the original encoding: Look for patterns, check file metadata
- Convert to UTF-8: Use iconv, Python's codecs module, or your language's encoding tools
- Set correct headers: Content-Type: text/html; charset=utf-8 (HTTP) and <meta charset="utf-8"> (HTML)
Prevention
```python
# Always specify encoding when opening files
with open('file.txt', 'r', encoding='utf-8') as f:
    content = f.read()
```

```javascript
// Node.js
const fs = require('fs');
const content = fs.readFileSync('file.txt', 'utf8');
```
Best Practices for Developers
- Use UTF-8 everywhere: Files, databases, APIs, HTML
- Declare encoding explicitly: Don't rely on defaults
- Validate input: Check for valid UTF-8 sequences
- Test with non-ASCII: Include accents, emoji, and various scripts in tests
- Store encoding metadata: Know what encoding your data uses
- Be careful with string length: "😀".length in JavaScript is 2 (UTF-16 code units), not 1
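The string-length pitfall depends on what your language counts. Python counts code points, so the same emoji gives three different "lengths" depending on how you measure:

```python
s = '😀'
py_len = len(s)                                # Python counts code points: 1
utf16_units = len(s.encode('utf-16-le')) // 2  # UTF-16 units, what JS .length reports
utf8_bytes = len(s.encode('utf-8'))            # bytes on the wire in UTF-8
```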
Database Configuration
```sql
-- MySQL
CREATE DATABASE mydb CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

-- PostgreSQL
CREATE DATABASE mydb ENCODING 'UTF8';
```
Note: MySQL's utf8 is actually 3-byte UTF-8 (no emoji). Use utf8mb4 for full UTF-8 support.
Summary
- ASCII: 128 characters, foundation of text encoding
- Unicode: Universal character set with 150,000+ characters
- UTF-8: Variable-length encoding, web standard, ASCII-compatible
- Use UTF-8 for everything unless you have a specific reason not to
- Declare encoding explicitly to avoid mojibake
Understanding character encoding helps you build robust applications that work correctly with text from around the world.
Need to convert between encodings? Try our Unicode Encoder tool!