Have you ever opened a document and seen gibberish like "café" instead of "café"? That's mojibake—garbled text caused by character encoding mismatches. Understanding character encoding is essential for any developer working with text.
The Problem: Computers Only Understand Numbers
Computers don't understand letters, symbols, or emoji. They only work with numbers—specifically, ones and zeros. To store and transmit text, we need a system that maps characters to numbers. That's what character encoding does.
ASCII: The Original Character Set
ASCII (American Standard Code for Information Interchange) was created in 1963 and became the foundation of text encoding. It uses 7 bits to represent 128 characters:
| Range | Characters |
|---|---|
| 0-31 | Control characters (newline, tab, etc.) |
| 32-47 | Space, punctuation, and symbols |
| 48-57 | Digits 0-9 |
| 58-64 | More punctuation and symbols (: ; < = > ? @) |
| 65-90 | Uppercase A-Z |
| 91-96 | Brackets and symbols |
| 97-122 | Lowercase a-z |
| 123-126 | Braces, pipe, tilde |
| 127 | Delete |
Some commonly used ASCII codes:
'A' = 65 'a' = 97 '0' = 48
'B' = 66 'b' = 98 '1' = 49
' ' = 32 '\n' = 10 '\t' = 9
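These mappings can be checked directly in Python, where ord() converts a character to its numeric code and chr() goes the other way—a quick sanity check of the values above:

```python
# ord() maps a character to its code; chr() inverts the mapping.
assert ord('A') == 65 and ord('a') == 97 and ord('0') == 48
assert ord(' ') == 32 and ord('\n') == 10 and ord('\t') == 9

letter = chr(66)  # chr() turns the code back into a character
```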
The Limitations of ASCII
ASCII's 128 characters were enough for American English, but the world has far more than 128 characters:
- Accented letters: é, ñ, ü, ø
- Other alphabets: Cyrillic (Привет), Greek (Γειά), Arabic (مرحبا)
- Asian languages: Chinese (你好), Japanese (こんにちは), Korean (안녕)
- Symbols: €, £, ¥, ©, ®
- Emoji: 😀, 🎉, 🌍
Various "extended ASCII" encodings added characters 128-255, but each region created its own:
- ISO-8859-1 (Latin-1): Western European
- ISO-8859-5: Cyrillic
- Windows-1252: Microsoft's Western European
- GB2312: Simplified Chinese
This created chaos: the same byte value meant different characters in different encodings. A document created in one region became unreadable in another.
Unicode: One Character Set to Rule Them All
Unicode was created to solve this by assigning a unique number (called a code point) to every character in every language. The format is U+XXXX where XXXX is a hexadecimal number.
'A' = U+0041
'€' = U+20AC
'中' = U+4E2D
'😀' = U+1F600
Unicode currently defines over 150,000 characters covering:
- 161 modern and historic scripts
- Symbol sets (mathematical, musical, technical)
- Emoji (with more added regularly)
Code points range from U+0000 to U+10FFFF (over 1.1 million possible characters).
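In Python, ord() returns a character's code point as an integer; formatting it in hexadecimal reproduces the conventional U+XXXX notation. A small sketch (the codepoint helper is just for illustration):

```python
# Format a character's code point in the conventional U+XXXX style.
def codepoint(ch: str) -> str:
    return f"U+{ord(ch):04X}"

# Matches the examples above: A, €, 中, 😀
cps = [codepoint(c) for c in 'A€中😀']
```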
UTF-8: The Practical Implementation
Unicode defines what numbers represent what characters, but we still need to store those numbers as bytes. That's where UTF (Unicode Transformation Format) comes in.
UTF-8 is the dominant encoding on the web (used by over 98% of websites). It's a variable-length encoding:
| Code Point Range | Bytes | Bit Pattern |
|---|---|---|
| U+0000 - U+007F | 1 | 0xxxxxxx |
| U+0080 - U+07FF | 2 | 110xxxxx 10xxxxxx |
| U+0800 - U+FFFF | 3 | 1110xxxx 10xxxxxx 10xxxxxx |
| U+10000 - U+10FFFF | 4 | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx |
UTF-8's brilliance:
- ASCII compatible: ASCII characters use the same single byte, so old ASCII files are valid UTF-8
- Self-synchronizing: You can find character boundaries by looking at byte patterns
- No null bytes: Safe for C strings (except for U+0000 itself)
- Efficient for Latin text: English text is the same size as ASCII
Example encodings:
'A' (U+0041) → 0x41 (1 byte)
'é' (U+00E9) → 0xC3 0xA9 (2 bytes)
'中' (U+4E2D) → 0xE4 0xB8 0xAD (3 bytes)
'😀' (U+1F600) → 0xF0 0x9F 0x98 0x80 (4 bytes)
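Python's str.encode produces exactly these byte sequences, and the two-byte case can even be built by hand from the bit patterns in the table above—a sketch:

```python
# str.encode yields the byte sequences listed above.
assert 'A'.encode('utf-8') == b'\x41'
assert 'é'.encode('utf-8') == b'\xc3\xa9'
assert '中'.encode('utf-8') == b'\xe4\xb8\xad'
assert '😀'.encode('utf-8') == b'\xf0\x9f\x98\x80'

# The two-byte case assembled by hand: 110xxxxx 10xxxxxx
cp = ord('é')  # 0xE9, fits in 11 bits
manual = bytes([0b11000000 | (cp >> 6),    # leading byte: top 5 bits
                0b10000000 | (cp & 0x3F)]) # continuation: low 6 bits
```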
UTF-8 vs UTF-16 vs UTF-32
| Encoding | Unit Size | Best For |
|---|---|---|
| UTF-8 | 1-4 bytes | Web, files, ASCII-heavy text |
| UTF-16 | 2-4 bytes | Windows internals, Java, JavaScript strings |
| UTF-32 | 4 bytes | Fixed-width processing (rare) |
UTF-16 uses 2 bytes for most common characters and 4 bytes (surrogate pairs) for others. It's used internally by Windows, Java, and JavaScript.
UTF-32 uses 4 bytes for everything—simple but wasteful. It's rarely used for storage or transmission.
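The size trade-off is easy to see by encoding the same string three ways—here 'A' (ASCII) costs 1/2/4 bytes and '😀' (outside the BMP) costs 4 bytes in every scheme:

```python
# Byte counts for 'A' + '😀' under each encoding (little-endian, no BOM).
sample = 'A😀'
sizes = {enc: len(sample.encode(enc))
         for enc in ('utf-8', 'utf-16-le', 'utf-32-le')}
# 'A' is 1/2/4 bytes, '😀' is 4/4/4, so totals are 5, 6, and 8.
```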
Byte Order Mark (BOM)
Multi-byte encodings can store bytes in different orders:
- Big-endian: Most significant byte first
- Little-endian: Least significant byte first
The BOM is a special character (U+FEFF) at the start of a file that indicates the byte order:
| Encoding | BOM Bytes |
|---|---|
| UTF-8 | EF BB BF (optional, often problematic) |
| UTF-16 BE | FE FF |
| UTF-16 LE | FF FE |
UTF-8 BOM warning: While valid, the UTF-8 BOM can cause issues:
- Shell scripts may fail (interpreter can't read shebang)
- PHP may output characters before headers
- Concatenated files get BOM in the middle
Unless required (some Windows programs expect it), avoid UTF-8 BOM.
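Since the BOM is just U+FEFF run through the encoder, Python's codecs module exposes the byte sequences from the table as constants, and the utf-8-sig codec strips a leading BOM on decode:

```python
import codecs

# The BOM bytes from the table are just U+FEFF encoded each way.
assert '\ufeff'.encode('utf-8') == codecs.BOM_UTF8          # EF BB BF
assert '\ufeff'.encode('utf-16-be') == codecs.BOM_UTF16_BE  # FE FF
assert '\ufeff'.encode('utf-16-le') == codecs.BOM_UTF16_LE  # FF FE

# 'utf-8-sig' tolerates and removes a leading UTF-8 BOM.
text = (codecs.BOM_UTF8 + 'hello'.encode('utf-8')).decode('utf-8-sig')
```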
Common Encoding Problems and Fixes
Mojibake Examples
| You See | You Expected | Problem |
|---|---|---|
| café | café | UTF-8 read as Latin-1 |
| caf� | café | Latin-1 read as UTF-8 (invalid bytes become �) |
| ??? or □ | 你好 | Target encoding or font can't represent the characters |
| ÃƒÂ© | é | Double encoding |
How to Fix
- Identify the original encoding: Look for patterns, check file metadata
- Convert to UTF-8: Use iconv, Python's codecs module, or your language's encoding tools
- Set correct headers: Content-Type: text/html; charset=utf-8 (HTTP) and <meta charset="utf-8"> (HTML)
Prevention
```python
# Always specify encoding when opening files
with open('file.txt', 'r', encoding='utf-8') as f:
    content = f.read()
```

```javascript
// Node.js
const fs = require('fs');
const content = fs.readFileSync('file.txt', 'utf8');
```
Best Practices for Developers
- Use UTF-8 everywhere: Files, databases, APIs, HTML
- Declare encoding explicitly: Don't rely on defaults
- Validate input: Check for valid UTF-8 sequences
- Test with non-ASCII: Include accents, emoji, and various scripts in tests
- Store encoding metadata: Know what encoding your data uses
- Be careful with string length: "😀".length in JavaScript is 2 (UTF-16 code units), not 1
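The string-length pitfall depends on what your language counts. Python counts code points, so the same emoji gives three different "lengths" depending on how you measure:

```python
s = '😀'
py_len = len(s)                                # Python counts code points: 1
utf16_units = len(s.encode('utf-16-le')) // 2  # UTF-16 units, what JS .length reports
utf8_bytes = len(s.encode('utf-8'))            # bytes on the wire in UTF-8
```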
Database Configuration
```sql
-- MySQL
CREATE DATABASE mydb CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

-- PostgreSQL
CREATE DATABASE mydb ENCODING 'UTF8';
```
Note: MySQL's utf8 is actually 3-byte UTF-8 (no emoji). Use utf8mb4 for full UTF-8 support.
Summary
- ASCII: 128 characters, foundation of text encoding
- Unicode: Universal character set with 150,000+ characters
- UTF-8: Variable-length encoding, web standard, ASCII-compatible
- Use UTF-8 for everything unless you have a specific reason not to
- Declare encoding explicitly to avoid mojibake
Understanding character encoding helps you build robust applications that work correctly with text from around the world.
Need to convert between encodings? Try our Unicode Encoder tool!