Subtitle Encoding Guide: UTF-8, Character Sets, and Special Characters
Understanding subtitle file encoding. How to handle special characters, fix encoding issues, and ensure cross-platform compatibility.
Introduction
Character encoding is the invisible foundation of every subtitle file. When encoding is correct, viewers see proper text regardless of language, script, or platform. When encoding is wrong, subtitles display as question marks, empty boxes, or gibberish — commonly known as mojibake. These encoding problems are among the most frustrating issues in subtitle creation because the source file appears correct on your machine but broken everywhere else.
This guide explains everything about subtitle encoding: how character encoding works technically, why UTF-8 is the universal standard, detailed BOM explanation, common encoding problems and their symptoms, platform-specific issues (Windows vs Mac vs Linux), a comprehensive fix workflow, detection methods, an encoding conversion reference table, and a thorough testing checklist. Whether you are managing subtitles in 20 languages or simply trying to get accented characters to display correctly, this guide has you covered.
How Character Encoding Works
At its core, character encoding is a mapping system that assigns numeric codes to characters. When you type the letter "A", your computer stores it as the number 65. When the file is opened, the system reads the number 65 and displays "A". The mapping between numbers and characters is defined by the encoding standard.
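As a small illustration of this mapping, Python's built-in `ord` and `chr` convert between a character and its numeric code. This is only a sketch of the idea; the variable names are arbitrary.

```python
# Map a character to its numeric code and back again.
code = ord("A")            # character -> number
char = chr(code)           # number -> character

# The same number is what ends up in the file as a single byte.
raw = "A".encode("ascii")

print(code, char, list(raw))   # 65 A [65]
```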
ASCII: The Foundation
ASCII (American Standard Code for Information Interchange) was developed in the 1960s and defines 128 characters: English letters (uppercase and lowercase), digits 0-9, punctuation marks, and control characters. Each character uses 7 bits, fitting comfortably in a single byte (8 bits) and leaving the 8th bit unused.
ASCII's limitation is clear: it only supports English. Characters like é, ñ, ü, and entire writing systems like Cyrillic, Arabic, Chinese, Japanese, and Korean are completely absent. Any character not in the ASCII table cannot be represented in an ASCII-encoded file.
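The limitation is easy to observe in practice. The sketch below uses a hypothetical helper, `fits_in_ascii`, which simply attempts a strict ASCII encode and reports whether it succeeded.

```python
def fits_in_ascii(text: str) -> bool:
    """Return True if every character in text has an ASCII code (0-127)."""
    try:
        text.encode("ascii")      # strict by default: raises on any non-ASCII character
        return True
    except UnicodeEncodeError:
        return False


print(fits_in_ascii("cafe"))   # True
print(fits_in_ascii("café"))   # False
```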
The 8-bit Extensions
To support more languages, various extensions of ASCII were created that use all 8 bits of a byte, supporting up to 256 characters:
Each 8-bit encoding supports a specific group of languages, but none supports all languages. This fundamental limitation is why multilingual subtitle files require Unicode.
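To see why no single 8-bit encoding is enough, decode the same byte under three of the legacy encodings covered later in this guide. The codec names are Python's standard aliases for Latin-1, Windows-1251, and ISO-8859-7; the choice of byte 0xE9 is arbitrary.

```python
# One byte, three legacy 8-bit encodings, three different characters.
raw = bytes([0xE9])

decoded = {codec: raw.decode(codec) for codec in ("latin-1", "cp1251", "iso8859_7")}

for codec, char in decoded.items():
    print(f"{codec}: {char} (U+{ord(char):04X})")
```

The byte comes out as é under Latin-1, й under Windows-1251, and ι under ISO-8859-7, which is why a player that guesses the wrong code page shows the wrong letters.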
Unicode and UTF-8: The Universal Solution
Unicode is a universal character set that assigns a unique number (code point) to every character in every writing system — over 140,000 characters covering more than 150 scripts. Unicode itself is not an encoding; it is a character set. The encoding of Unicode into bytes is handled by UTF-8, UTF-16, and UTF-32.
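The distinction between a code point and its encoded bytes can be sketched as follows. The helper name `encoded_forms` is made up for this example; the little-endian codec variants are used so that no BOM is added to the output.

```python
def encoded_forms(ch: str) -> dict:
    """Return the bytes (as hex) that one character occupies in three encodings."""
    codecs_to_try = ("utf-8", "utf-16-le", "utf-32-le")
    return {codec: ch.encode(codec).hex(" ") for codec in codecs_to_try}


print(hex(ord("é")))          # 0xe9 (the code point is the same in every encoding)
print(encoded_forms("é"))     # the bytes differ from encoding to encoding
```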
UTF-8 is the dominant encoding for the web and modern computing:
How UTF-8 encodes characters:
| Code Point Range | Bytes | Byte 1 | Byte 2 | Byte 3 | Byte 4 | Example |
|-----------------|-------|--------|--------|--------|--------|---------|
| U+0000 to U+007F | 1 | 0xxxxxxx | — | — | — | "A" (U+0041) = 0x41 |
| U+0080 to U+07FF | 2 | 110xxxxx | 10xxxxxx | — | — | "é" (U+00E9) = 0xC3 0xA9 |
| U+0800 to U+FFFF | 3 | 1110xxxx | 10xxxxxx | 10xxxxxx | — | "中" (U+4E2D) = 0xE4 0xB8 0xAD |
| U+10000 to U+10FFFF | 4 | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx | Emoji (U+1F600) = 0xF0 0x9F 0x98 0x80 |
For example, the character "é" (e with acute accent, used in words like "café") encodes as two bytes in UTF-8: `11000011 10101001` (0xC3 0xA9). The character "中" (the Chinese character for "middle") encodes as three bytes: `11100100 10111000 10101101` (0xE4 0xB8 0xAD).
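The bit layout in the table can be reproduced by hand. The function below is an illustrative sketch, not a replacement for a real encoder: `utf8_encode` is a hypothetical name, and for brevity it does not reject surrogate code points (U+D800 to U+DFFF) the way a conforming encoder would.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point by hand, following the table above."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("code point out of Unicode range")
    if code_point <= 0x7F:                       # 1 byte: 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:                      # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:                     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    return bytes([0xF0 | (code_point >> 18),     # 4 bytes: 11110xxx + three 10xxxxxx
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


# Compare the hand-rolled result with Python's built-in encoder.
for cp in (0x41, 0xE9, 0x4E2D, 0x1F600):
    print(f"U+{cp:04X}", utf8_encode(cp).hex(" "), utf8_encode(cp) == chr(cp).encode("utf-8"))
```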
UTF-16
UTF-16 uses 2 bytes (16 bits) per character for most common characters, and 4 bytes for less common ones (using surrogate pairs). It is commonly used by Windows internally and by the Java and .NET runtimes. However, UTF-16 has significant disadvantages for subtitles:
Detailed BOM Explanation
The Byte Order Mark (BOM) is a special Unicode character (U+FEFF) placed at the beginning of a text file. Its purposes are:
BOM in Different Encodings
| Encoding | BOM Bytes | BOM Hex | Typical Use |
|----------|-----------|---------|-------------|
| UTF-8 | EF BB BF | 0xEF, 0xBB, 0xBF | Added by older versions of Windows Notepad and some other Windows tools; optional, and neither required nor recommended by the Unicode Standard |
| UTF-16 LE | FF FE | 0xFF, 0xFE | Standard for UTF-16 little-endian on Windows |
| UTF-16 BE | FE FF | 0xFE, 0xFF | Standard for UTF-16 big-endian on some Unix systems |
| UTF-32 LE | FF FE 00 00 | 0xFF, 0xFE, 0x00, 0x00 | Rare — used by some IBM systems |
| UTF-32 BE | 00 00 FE FF | 0x00, 0x00, 0xFE, 0xFF | Rare — used by some Unix systems |
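The table above translates directly into a prefix check. The sketch below relies on the BOM constants in Python's standard `codecs` module; the function name `detect_bom` and the returned labels are choices made for this example. The UTF-32 signatures are tested before UTF-16 because the UTF-32 LE BOM begins with the same two bytes as the UTF-16 LE BOM.

```python
import codecs

# Longest signatures first: FF FE 00 00 (UTF-32 LE) starts with FF FE (UTF-16 LE).
BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32 LE"),
    (codecs.BOM_UTF32_BE, "UTF-32 BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16 LE"),
    (codecs.BOM_UTF16_BE, "UTF-16 BE"),
]


def detect_bom(data: bytes):
    """Return the encoding name implied by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


print(detect_bom(b"\xef\xbb\xbf1\n00:00:01,000 --> 00:00:02,000\n"))
```

A result of `None` only means that no BOM is present; the file may still be UTF-8 without BOM or a legacy 8-bit encoding.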
The UTF-8 BOM Problem
The UTF-8 BOM (bytes EF BB BF) is particularly problematic for subtitles:
Best practice for subtitles: Always save as UTF-8 without BOM. This ensures maximum compatibility across all platforms, players, and devices.
If you receive a UTF-8 file with BOM, use our Online Editor to detect and remove it automatically.
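For scripted workflows, removing the BOM amounts to dropping three bytes from the start of the file. This is a minimal sketch that works on raw bytes; `strip_utf8_bom` is a hypothetical helper name, and the file name in the usage comment is only an example.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (EF BB BF), if one is present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# Example usage on a file (the path is an assumption):
#   from pathlib import Path
#   path = Path("example.srt")
#   path.write_bytes(strip_utf8_bom(path.read_bytes()))
print(strip_utf8_bom(b"\xef\xbb\xbf1\n"))
```

Working on bytes rather than decoded text leaves line endings and every other byte in the file untouched.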
Platform-Specific Encoding Issues
Windows
macOS
Linux
Smart TVs and Embedded Devices
Common Encoding Problems and Their Symptoms
| Symptom | What You See | Likely Cause | Solution |
|---------|-------------|--------------|----------|
| Question marks (?) | "caf?" instead of "café" | File saved in an encoding that cannot represent the character (e.g. ASCII), so it was replaced with "?" | Re-export from the original source as UTF-8; replaced characters cannot be recovered from the damaged file |
| Empty boxes (□) | "□□□" instead of Chinese characters | Player font has no glyph for the characters, or the file was decoded with the wrong encoding | Confirm the file is UTF-8, then use a font that covers the script |
| Mojibake: Ã© for é | "café" appears as "cafÃ©" | UTF-8 file opened as Latin-1/Windows-1252 | Re-interpret as UTF-8 |
| Mojibake: double corruption | "café" appears as "cafÃƒÂ©" | Double encoding: UTF-8 bytes misread as Latin-1/Windows-1252 and then saved again as UTF-8 | Fix at source, re-encode once to UTF-8 |
| Invisible first character | First subtitle entry has extra blank line or character | UTF-8 BOM (EF BB BF) | Remove BOM |
| Entire file is blank | File opens as empty | UTF-16 file opened as UTF-8 — null bytes appear as empty | Open as UTF-16 and re-save as UTF-8 |
| â€” for em dash (—) | "â€”" instead of "—" | UTF-8 file displayed as Windows-1252 | Re-interpret as UTF-8 |
| â€™ for apostrophe (’) | "donâ€™t" instead of "don’t" | UTF-8 smart quotes displayed as Windows-1252 | Re-interpret as UTF-8 |
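The mojibake rows in this table can be reproduced, and in the simple case reversed, with a round trip through the wrong encoding. The sketch below assumes the damage came from UTF-8 bytes being read as Windows-1252 (Python codec `cp1252`); both function names are made up for the example.

```python
def make_mojibake(text: str) -> str:
    """Simulate UTF-8 bytes being displayed as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


def repair_mojibake(garbled: str) -> str:
    """Undo one round of UTF-8-read-as-Windows-1252 corruption."""
    return garbled.encode("cp1252").decode("utf-8")


once = make_mojibake("café")
twice = make_mojibake(once)

print(once)                    # cafÃ©
print(twice)                   # cafÃƒÂ©
print(repair_mojibake(once))   # café
```

This only works while the garbled text still contains every original byte. Five byte values (0x81, 0x8D, 0x8F, 0x90, 0x9D) are undefined in Python's `cp1252` codec, so characters whose UTF-8 form contains one of them make the round trip raise an error instead.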
Comprehensive Fix Workflow
Step 1: Detect the Current Encoding
Before fixing, you need to know what you are dealing with:
Method A: Use our Online Editor. Upload the file and the tool automatically detects its encoding and reports it.
Method B: Notepad++ — the status bar shows the encoding (e.g., "UTF-8", "ANSI", "UTF-16 LE BOM").
Method C: VS Code — the bottom-right corner shows the encoding. Click it to see details or change.
Method D: PowerShell BOM detection:
```powershell
# Read the first four bytes of the file.
# -Encoding Byte works in Windows PowerShell 5.1; in PowerShell 6+ use -AsByteStream instead.
$bytes = Get-Content -Path "file.srt" -Encoding Byte -TotalCount 4
# Check the 4-byte UTF-32 BOMs before UTF-16: FF FE 00 00 also begins with FF FE.
if ($bytes[0] -eq 0xEF -and $bytes[1] -eq 0xBB -and $bytes[2] -eq 0xBF) {
    "UTF-8 with BOM"
} elseif ($bytes[0] -eq 0xFF -and $bytes[1] -eq 0xFE -and $bytes[2] -eq 0 -and $bytes[3] -eq 0) {
    "UTF-32 LE"
} elseif ($bytes[0] -eq 0 -and $bytes[1] -eq 0 -and $bytes[2] -eq 0xFE -and $bytes[3] -eq 0xFF) {
    "UTF-32 BE"
} elseif ($bytes[0] -eq 0xFF -and $bytes[1] -eq 0xFE) {
    "UTF-16 LE"
} elseif ($bytes[0] -eq 0xFE -and $bytes[1] -eq 0xFF) {
    "UTF-16 BE"
} else {
    "No BOM detected: likely UTF-8 without BOM or ANSI/Latin-1"
}
```
Method E: Visual inspection. Open the file in a text editor. If you see sequences like Ã© (instead of é), Ã¼ (instead of ü), or â€™ (instead of ’), the file is UTF-8 content being interpreted as Latin-1/Windows-1252.
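When a file has no BOM, a strict UTF-8 decode is a useful first test, because bytes from legacy encodings rarely form valid UTF-8 by accident. The sketch below is a rough heuristic, not a full detector; the function name and the returned labels are assumptions made for this example.

```python
def classify_without_bom(data: bytes) -> str:
    """Rough classification of a BOM-less file based on its raw bytes."""
    if b"\x00" in data:
        # NUL bytes almost never occur in 8-bit or UTF-8 subtitle text.
        return "contains NUL bytes (possibly UTF-16/UTF-32 without BOM)"
    try:
        text = data.decode("utf-8")    # strict: raises on any invalid sequence
    except UnicodeDecodeError:
        return "not UTF-8 (legacy 8-bit or multi-byte encoding)"
    if text.isascii():
        return "ASCII (also valid UTF-8)"
    return "UTF-8"


print(classify_without_bom("café".encode("utf-8")))
print(classify_without_bom("café".encode("cp1252")))
```

A "not UTF-8" result does not say which legacy encoding was used; that still requires knowledge of the file's language or a visual check.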
Step 2: Identify the Problem
Use the symptom table in the previous section to match what you see with the likely cause. The key patterns to recognize:
Step 3: Convert to UTF-8 Without BOM
Option 1: Use our Online Editor — upload, select "UTF-8 without BOM" as target, preview, download.
Option 2: Notepad++:
Option 3: VS Code:
Option 4: PowerShell:
```powershell
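# Read with source encoding, write as UTF-8 without BOM.
# Set-Content -Encoding UTF8 adds a BOM in Windows PowerShell 5.1, so write through .NET instead.
$content = Get-Content -Path "input.srt" -Encoding UTF8 -Raw
$utf8NoBom = New-Object System.Text.UTF8Encoding($false)
[System.IO.File]::WriteAllText((Join-Path $PWD "output.srt"), $content, $utf8NoBom)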
```
For UTF-16 or ANSI source files, specify the encoding explicitly:
```powershell
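# $false = do not emit a BOM (Set-Content -Encoding UTF8 adds one in Windows PowerShell 5.1)
$utf8NoBom = New-Object System.Text.UTF8Encoding($false)
# UTF-16 LE (common Windows export)
$content = Get-Content -Path "input.srt" -Encoding Unicode -Raw
[System.IO.File]::WriteAllText((Join-Path $PWD "output.srt"), $content, $utf8NoBom)
# ANSI / Windows-1252
# (-Encoding Default is the system ANSI code page in Windows PowerShell 5.1, but UTF-8 in PowerShell 6+)
$content = Get-Content -Path "input.srt" -Encoding Default -Raw
[System.IO.File]::WriteAllText((Join-Path $PWD "output.srt"), $content, $utf8NoBom)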
```
Option 5: Linux/Mac command line with iconv:
```bash
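# Detect encoding (macOS: file -I; Linux: file -i)
file -I input.srt
# Convert from the detected encoding (Windows-1252 in this example) to UTF-8
iconv -f WINDOWS-1252 -t UTF-8 input.srt > output.srt
# Remove a UTF-8 BOM if present
# (GNU sed syntax; BSD/macOS sed needs -i '' and does not understand \x escapes)
sed -i '1s/^\xEF\xBB\xBF//' output.srt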
```
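The same conversion can be scripted in a platform-independent way. The sketch below is one possible approach, assuming you already know the source encoding; `convert_to_utf8` is a hypothetical helper, and `source_encoding` takes a Python codec name such as `cp1252`, `utf-16`, or `utf-8-sig`.

```python
from pathlib import Path


def convert_to_utf8(src: str, dst: str, source_encoding: str) -> None:
    """Re-encode a subtitle file as UTF-8 without BOM."""
    # Work on bytes so that line endings are preserved exactly.
    text = Path(src).read_bytes().decode(source_encoding)
    # Some codecs (e.g. "utf-16-le") leave the BOM in the text as U+FEFF; drop it.
    text = text.lstrip("\ufeff")
    Path(dst).write_bytes(text.encode("utf-8"))


# Example call (file names are assumptions):
#   convert_to_utf8("input.srt", "output.srt", "cp1252")
```

Decoding is strict, so a wrong `source_encoding` often fails with an error rather than writing garbage. Single-byte codecs such as `latin-1` accept any byte, however, so a visual check of the result is still needed.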
Step 4: Validate the Conversion
Step 5: Update Your Workflow
Encoding Conversion Reference Table
| Source Encoding | Characters Added vs ASCII | Common in | Conversion to UTF-8 |
|-----------------|---------------------------|-----------|---------------------|
| ASCII | None (0-127 only) | Legacy text | Direct, no changes needed |
| Latin-1 (ISO-8859-1) | Western European accents | Legacy Linux, older web | Direct mapping, compatible |
| Windows-1252 | Western + smart quotes, euro | Legacy Windows apps | Direct mapping; differs from Latin-1 only at 0x80-0x9F, where it has printable characters instead of control codes |
| Mac Roman | Western + Apple symbols | Legacy macOS | Partial — some Apple symbols have no Unicode equivalent |
| UTF-16 LE | All Unicode (2 or 4 bytes/char) | Windows internal, .NET | Convert and remove BOM |
| UTF-16 BE | All Unicode (2 or 4 bytes/char) | Some Unix, Java | Convert and remove BOM |
| Shift-JIS | Japanese | Legacy Japanese subs | Convert and verify each character |
| GB2312 | Simplified Chinese | Legacy Chinese subs | Convert and verify each character |
| Big5 | Traditional Chinese | Legacy Chinese subs (Taiwan/HK) | Convert and verify each character |
| KOI8-R | Russian/Cyrillic | Legacy Russian subs | Convert; verify Cyrillic characters |
| Windows-1251 | Russian/Cyrillic | Legacy Windows Russian apps | Convert; verify Cyrillic characters |
| ISO-8859-7 | Greek | Legacy Greek subs | Convert; verify Greek characters |
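Choosing the right row in this table matters even for closely related encodings. As a hedged illustration, the snippet below decodes the same bytes as Windows-1252 and as Latin-1; the sample bytes are invented for the example.

```python
# Curly-quoted "Hi" as a Windows-1252 editor would save it: 0x93 H i 0x94
raw = bytes([0x93, 0x48, 0x69, 0x94])

as_cp1252 = raw.decode("cp1252")    # curly quotes U+201C and U+201D
as_latin1 = raw.decode("latin-1")   # C1 control characters U+0093 and U+0094

print(repr(as_cp1252))
print(repr(as_latin1))
```

Neither decode raises an error, so treating a Windows-1252 file as Latin-1 silently turns smart quotes into invisible control characters in the UTF-8 output.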
Testing Checklist
Before finalizing any subtitle file, verify all items on this checklist:
| # | Check | How to Test | Pass Criteria |
|---|-------|-------------|---------------|
| 1 | UTF-8 encoding | Open in hex editor or check editor status bar | No BOM (EF BB BF) at start; declared as UTF-8 |
| 2 | All characters display | Open in VLC and review visually | No question marks, empty boxes, or mojibake characters |
| 3 | Accented characters | Check common ones: é, ü, ñ, ç | Display as intended characters |
| 4 | CJK characters (if used) | Check first and last subtitle; test several in middle | Chinese, Japanese, or Korean characters display correctly |
| 5 | Special symbols | Check ©, ®, ™, — (em dash), ’ (apostrophe) | Display as intended symbols |
| 6 | Right-to-left text (if used) | Check Arabic, Hebrew, or Farsi subtitles | Text displays right-aligned and reads right-to-left |
| 7 | Platform render test | Test on target platform (YouTube, VLC, TV) | All characters display correctly in target environment |
| 8 | BOM check | Open in hex editor or use PowerShell | No invisible character at position 0 |
| 9 | File size sanity | Compare with original after conversion | Size changed predictably (from Latin-1, each non-ASCII character grows from 1 byte to 2 and ASCII bytes are unchanged) |
| 10 | Full file scan | Use our Online Editor | Editor reports valid UTF-8 with no encoding errors |
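Checks 1, 2, and 8 can be partly automated. The sketch below is a minimal example under stated assumptions: `check_subtitle_bytes` and the `SUSPECT` list are invented here, and the markers only catch the common UTF-8-read-as-Windows-1252 patterns described earlier, so it complements visual review rather than replacing it.

```python
import codecs

# U+FFFD (replacement character) plus two common mojibake fragments.
SUSPECT = ("\ufffd", "Ã©", "â€")


def check_subtitle_bytes(data: bytes) -> list:
    """Return a list of problems found; an empty list means these checks passed."""
    problems = []
    if data.startswith(codecs.BOM_UTF8):
        problems.append("starts with UTF-8 BOM")
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError as exc:
        problems.append(f"invalid UTF-8 at byte {exc.start}")
        return problems
    for marker in SUSPECT:
        if marker in text:
            problems.append(f"suspicious sequence {marker!r} (possible mojibake)")
    return problems


print(check_subtitle_bytes("1\ncafé\n".encode("utf-8")))   # []
```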
Best Practices Summary
Related Tools
Conclusion
Character encoding is the invisible infrastructure that ensures your subtitles display correctly regardless of language, platform, or device. UTF-8 without BOM is the gold standard — universal, efficient, and compatible with all modern systems. By understanding how encoding works, knowing the common pitfalls on each platform, and following a systematic fix workflow, you can eliminate encoding problems from your subtitle workflow entirely.
Use our Online Editor to detect, fix, and convert encoding automatically, or browse our subtitle tools for more specialized utilities.