Understanding Character Encoding: ASCII, Unicode, and UTF-8

From ASCII to Unicode to UTF-8: how computers represent text, why encoding matters, and how to avoid the dreaded mojibake (garbled text).

HandyUtils January 13, 2026 6 min read

Have you ever opened a document and seen gibberish like "cafÃ©" instead of "café"? That's mojibake: garbled text caused by character encoding mismatches. Understanding character encoding is essential for any developer working with text.

The Problem: Computers Only Understand Numbers

Computers don't understand letters, symbols, or emoji. They only work with numbers—specifically, ones and zeros. To store and transmit text, we need a system that maps characters to numbers. That's what character encoding does.

ASCII: The Original Character Set

ASCII (American Standard Code for Information Interchange) was created in 1963 and became the foundation of text encoding. It uses 7 bits to represent 128 characters:

Range    Characters
0-31     Control characters (newline, tab, etc.)
32-47    Space, punctuation, and symbols
48-57    Digits 0-9
58-64    More punctuation and symbols (: ; < = > ? @)
65-90    Uppercase A-Z
91-96    Brackets and symbols ([ \ ] ^ _ `)
97-122   Lowercase a-z
123-126  Braces and symbols ({ | } ~)
127      Delete

Some commonly used ASCII codes:

'A' = 65    'a' = 97    '0' = 48
'B' = 66    'b' = 98    '1' = 49
' ' = 32    '\n' = 10   '\t' = 9
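You can check these mappings yourself in Python, where the built-in ord() and chr() functions convert between characters and their numeric codes:

```python
# ord() maps a character to its numeric code; chr() is the inverse
print(ord('A'))         # 65
print(ord(' '))         # 32
print(chr(48))          # '0'
print(chr(10) == '\n')  # True
```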

The Limitations of ASCII

ASCII's 128 characters were enough for American English, but the world has far more than 128 characters:

  • Accented letters: é, ñ, ü, ø
  • Other alphabets: Cyrillic (Привет), Greek (Γειά), Arabic (مرحبا)
  • Asian languages: Chinese (你好), Japanese (こんにちは), Korean (안녕)
  • Symbols: €, £, ¥, ©, ®
  • Emoji: 😀, 🎉, 🌍

Various "extended ASCII" encodings added characters 128-255, but each region created its own:

  • ISO-8859-1 (Latin-1): Western European
  • ISO-8859-5: Cyrillic
  • Windows-1252: Microsoft's Western European
  • GB2312: Simplified Chinese

This created chaos: the same byte value meant different characters in different encodings. A document created in one region became unreadable in another.

Unicode: One Encoding to Rule Them All

Unicode was created to solve this by assigning a unique number (called a code point) to every character in every language. The format is U+XXXX where XXXX is a hexadecimal number.

'A' = U+0041
'€' = U+20AC
'中' = U+4E2D
'😀' = U+1F600
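In Python, ord() returns a character's code point, which can be formatted in the U+XXXX style shown above:

```python
# Print each character's Unicode code point in U+XXXX form
for ch in ('A', '€', '中', '😀'):
    print(f"'{ch}' = U+{ord(ch):04X}")
# 'A' = U+0041, '€' = U+20AC, '中' = U+4E2D, '😀' = U+1F600
```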

Unicode currently defines over 150,000 characters covering:

  • 161 modern and historic scripts
  • Symbol sets (mathematical, musical, technical)
  • Emoji (with more added regularly)

Code points range from U+0000 to U+10FFFF (over 1.1 million possible characters).

UTF-8: The Practical Implementation

Unicode defines what numbers represent what characters, but we still need to store those numbers as bytes. That's where UTF (Unicode Transformation Format) comes in.

UTF-8 is the dominant encoding on the web (used by over 98% of websites). It's a variable-length encoding:

Code Point Range Bytes Bit Pattern
U+0000 - U+007F 1 0xxxxxxx
U+0080 - U+07FF 2 110xxxxx 10xxxxxx
U+0800 - U+FFFF 3 1110xxxx 10xxxxxx 10xxxxxx
U+10000 - U+10FFFF 4 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
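As a sketch of how those bit patterns work, here is a minimal encoder that packs a code point into UTF-8 bytes by hand (no validation of surrogates or out-of-range values; in practice you would use Python's built-in str.encode):

```python
def utf8_encode(cp: int) -> bytes:
    """Pack a single code point into UTF-8 bytes using the patterns above."""
    if cp < 0x80:          # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:         # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:       # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

# Matches the built-in encoder:
assert utf8_encode(ord('€')) == '€'.encode('utf-8')  # b'\xe2\x82\xac'
```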

UTF-8's brilliance:

  1. ASCII compatible: ASCII characters use the same single byte, so old ASCII files are valid UTF-8
  2. Self-synchronizing: You can find character boundaries by looking at byte patterns
  3. No null bytes: Safe for C strings (except for U+0000 itself)
  4. Efficient for Latin text: English text is the same size as ASCII

Example encodings:

'A' (U+0041) → 0x41 (1 byte)
'é' (U+00E9) → 0xC3 0xA9 (2 bytes)
'中' (U+4E2D) → 0xE4 0xB8 0xAD (3 bytes)
'😀' (U+1F600) → 0xF0 0x9F 0x98 0x80 (4 bytes)

UTF-8 vs UTF-16 vs UTF-32

Encoding Unit Size Best For
UTF-8 1-4 bytes Web, files, ASCII-heavy text
UTF-16 2-4 bytes Windows internals, Java, JavaScript strings
UTF-32 4 bytes Fixed-width processing (rare)

UTF-16 uses 2 bytes for most common characters and 4 bytes (surrogate pairs) for others. It's used internally by Windows, Java, and JavaScript.

UTF-32 uses 4 bytes for everything—simple but wasteful. It's rarely used for storage or transmission.
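The size differences are easy to see by encoding the same string with each format (Python example using a 1-byte, a 3-byte, and a 4-byte character):

```python
s = "A中😀"  # ASCII letter, CJK character, emoji
for enc in ("utf-8", "utf-16-le", "utf-32-le"):
    print(f"{enc}: {len(s.encode(enc))} bytes")
# utf-8: 8 bytes, utf-16-le: 8 bytes, utf-32-le: 12 bytes
```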

Byte Order Mark (BOM)

Multi-byte encodings can store bytes in different orders:

  • Big-endian: Most significant byte first
  • Little-endian: Least significant byte first

The BOM is a special character (U+FEFF) at the start of a file that indicates the byte order (for UTF-8, where byte order doesn't matter, it merely signals the encoding):

Encoding BOM Bytes
UTF-8 EF BB BF (optional, often problematic)
UTF-16 BE FE FF
UTF-16 LE FF FE

UTF-8 BOM warning: While valid, the UTF-8 BOM can cause issues:

  • Shell scripts may fail (interpreter can't read shebang)
  • PHP may output characters before headers
  • Concatenated files get BOM in the middle

Unless required (some Windows programs expect it), avoid UTF-8 BOM.
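In Python, the utf-8-sig codec handles the BOM transparently: it strips a leading BOM when decoding, while plain utf-8 leaves it in the text as a character:

```python
raw = b'\xef\xbb\xbfhello'      # UTF-8 BOM followed by "hello"
print(raw.decode('utf-8-sig'))  # 'hello' (BOM stripped)
print(raw.decode('utf-8'))      # '\ufeffhello' (BOM kept as a character)
```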

Common Encoding Problems and Fixes

Mojibake Examples

You See     You Expected  Problem
cafÃ©       café          UTF-8 read as Latin-1
caf�        café          Latin-1 read as UTF-8
??? or □    你好          Encoding not available
cafÃƒÂ©     café          Double encoding
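The first row of the table is easy to reproduce: encode text as UTF-8 and decode the bytes with the wrong codec (Python example):

```python
# 'é' is 0xC3 0xA9 in UTF-8; read as Latin-1, those two bytes become 'Ã' and '©'
data = 'café'.encode('utf-8')
print(data.decode('latin-1'))  # 'cafÃ©'
```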

How to Fix

  1. Identify the original encoding: Look for patterns, check file metadata
  2. Convert to UTF-8: Use iconv, Python's codecs, or your language's encoding tools
  3. Set correct headers:
Content-Type: text/html; charset=utf-8
<meta charset="utf-8">
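When UTF-8 text was mis-decoded as Latin-1 and the mojibake saved, the round trip can often be reversed (a sketch; it only works if no bytes were lost or replaced along the way):

```python
broken = 'cafÃ©'  # UTF-8 bytes that were wrongly decoded as Latin-1
# Re-encode with the wrong codec to recover the original bytes, then decode correctly
fixed = broken.encode('latin-1').decode('utf-8')
print(fixed)  # 'café'
```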

Prevention

# Python: always specify the encoding when opening files
with open('file.txt', 'r', encoding='utf-8') as f:
    content = f.read()

// Node.js: pass the encoding explicitly
const fs = require('fs');
const content = fs.readFileSync('file.txt', 'utf8');

Best Practices for Developers

  1. Use UTF-8 everywhere: Files, databases, APIs, HTML
  2. Declare encoding explicitly: Don't rely on defaults
  3. Validate input: Check for valid UTF-8 sequences
  4. Test with non-ASCII: Include accents, emoji, and various scripts in tests
  5. Store encoding metadata: Know what encoding your data uses
  6. Be careful with string length: "😀".length in JavaScript is 2 (UTF-16 code units), not 1
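Point 6 cuts across languages: "length" can mean code points, code units, or bytes. Python's len() counts code points, so the same emoji measures differently depending on what you ask for:

```python
s = '😀'
print(len(s))                           # 1 code point (Python strings)
print(len(s.encode('utf-16-le')) // 2)  # 2 UTF-16 code units (what JS .length counts)
print(len(s.encode('utf-8')))           # 4 UTF-8 bytes
```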

Database Configuration

-- MySQL
CREATE DATABASE mydb CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

-- PostgreSQL
CREATE DATABASE mydb ENCODING 'UTF8';

Note: MySQL's utf8 is actually 3-byte UTF-8 (no emoji). Use utf8mb4 for full UTF-8 support.

Summary

  • ASCII: 128 characters, foundation of text encoding
  • Unicode: Universal character set with 150,000+ characters
  • UTF-8: Variable-length encoding, web standard, ASCII-compatible
  • Use UTF-8 for everything unless you have a specific reason not to
  • Declare encoding explicitly to avoid mojibake

Understanding character encoding helps you build robust applications that work correctly with text from around the world.

Need to convert between encodings? Try our Unicode Encoder tool!
