Introduction

Character encoding is the invisible foundation of every subtitle file. When encoding is correct, viewers see proper text regardless of language, script, or platform. When encoding is wrong, subtitles display as question marks, empty boxes, or gibberish — commonly known as mojibake. These encoding problems are among the most frustrating issues in subtitle creation because the source file appears correct on your machine but broken everywhere else.

This guide covers subtitle encoding end to end: how character encoding works, why UTF-8 is the universal standard, what the BOM is and why it causes trouble, common encoding problems and their symptoms, platform-specific issues (Windows, macOS, Linux, and smart TVs), a fix workflow, detection methods, an encoding conversion reference table, and a testing checklist. Whether you are managing subtitles in 20 languages or simply trying to get accented characters to display correctly, the same principles apply.

How Character Encoding Works

At its core, character encoding is a mapping system that assigns numeric codes to characters. When you type the letter "A", your computer stores it as the number 65. When the file is opened, the system reads the number 65 and displays "A". The mapping between numbers and characters is defined by the encoding standard.
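The mapping described above can be inspected directly in Python with the built-in `ord` and `chr` functions. This is only a small illustration; the sample character is arbitrary.

```python
# Character -> number and back again.
code = ord("A")                 # 65
char = chr(65)                  # "A"

# Encoding turns that number into the byte stored on disk.
stored = "A".encode("ascii")    # one byte with the value 65

print(code, char, list(stored))  # 65 A [65]
```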

ASCII: The Foundation

ASCII (American Standard Code for Information Interchange) was developed in the 1960s and defines 128 characters: English letters (uppercase and lowercase), digits 0-9, punctuation marks, and control characters. Each character uses 7 bits, fitting comfortably in a single byte (8 bits) and leaving the 8th bit unused.

ASCII's limitation is clear: it only supports English. Characters like é, ñ, ü, and entire writing systems like Cyrillic, Arabic, Chinese, Japanese, and Korean are completely absent. Any character not in the ASCII table cannot be represented in an ASCII-encoded file.
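A quick way to see this limitation is to ask Python to encode an accented subtitle line as ASCII. The sample line below is made up for illustration.

```python
line = "Un café, por favor"

try:
    line.encode("ascii")
except UnicodeEncodeError as err:
    # err.start is the index of the first character ASCII cannot represent
    bad_char = line[err.start]
    print(f"Cannot encode {bad_char!r} (U+{ord(bad_char):04X}) as ASCII")

# Pure English text encodes without trouble, one byte per character
ascii_bytes = "Hello, world".encode("ascii")
print(len(ascii_bytes))
```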

The 8-bit Extensions

To support more languages, extensions of ASCII were created that use all 8 bits of a byte, supporting up to 256 characters. East Asian scripts need thousands of characters, so their legacy encodings (Shift-JIS, GBK, Big5) use two bytes for most characters instead:

Latin-1 (ISO-8859-1): Supports Western European languages, including accented characters like é, ü, ñ and special symbols like © and ®

Latin-2 (ISO-8859-2): Supports Central and Eastern European languages (Czech, Polish, Hungarian, Romanian)

Windows-1252: Microsoft's superset of Latin-1 with additional characters like smart quotes (“, ”) and the euro sign (€)

Shift-JIS: Japanese multi-byte encoding (commonly used for legacy Japanese subtitles)

GB2312 / GBK: Simplified Chinese multi-byte encoding

Big5: Traditional Chinese multi-byte encoding

KOI8-R: Russian Cyrillic encoding (common in older Russian subtitles)

ISO-8859-7: Greek encoding

Each legacy encoding supports a specific group of languages, but none supports all languages. This fundamental limitation is why multilingual subtitle files require Unicode.
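To make the limitation concrete, the sketch below decodes the same single byte with four of the encodings listed above. The byte value 0xF1 is an arbitrary choice, and the codec names are Python's spellings of those encodings.

```python
raw = bytes([0xF1])  # one byte, four different interpretations

for codec in ("latin-1", "iso8859-2", "koi8-r", "iso8859-7"):
    print(f"{codec:10} -> {raw.decode(codec)}")

# The East Asian encodings in the list are multi-byte:
# most of their characters take two bytes, not one.
for codec, ch in (("gbk", "中"), ("big5", "中"), ("shift_jis", "あ")):
    print(f"{codec:10} {ch} uses {len(ch.encode(codec))} bytes")
```

Nothing in the file itself says which interpretation is intended, which is why a player that guesses the wrong legacy encoding shows the wrong letters.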

Unicode and UTF-8: The Universal Solution

Unicode is a universal character set that assigns a unique number (code point) to every character in every writing system — over 140,000 characters covering more than 150 scripts. Unicode itself is not an encoding; it is a character set. The encoding of Unicode into bytes is handled by UTF-8, UTF-16, and UTF-32.

UTF-8 is the dominant encoding for the web and modern computing:

Variable width: Characters use 1 to 4 bytes depending on the code point

ASCII compatible: The first 128 characters match ASCII exactly, using 1 byte each — meaning any ASCII file is already a valid UTF-8 file

Efficient: English text is compact (1 byte per character), while less common characters use more bytes only when needed

No endianness: Unlike UTF-16, UTF-8 has no byte order issues — the byte sequence is unambiguous

Universal: Supports all Unicode characters, so every language, script, and symbol defined in Unicode can appear in the same file

How UTF-8 encodes characters:

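| Code point range | Bytes | Byte 1 | Byte 2 | Byte 3 | Byte 4 | Example |
|-----------------|-------|--------|--------|--------|--------|---------|
| U+0000 to U+007F | 1 | 0xxxxxxx | - | - | - | "A" (U+0041) = 0x41 |
| U+0080 to U+07FF | 2 | 110xxxxx | 10xxxxxx | - | - | "é" (U+00E9) = 0xC3 0xA9 |
| U+0800 to U+FFFF | 3 | 1110xxxx | 10xxxxxx | 10xxxxxx | - | "中" (U+4E2D) = 0xE4 0xB8 0xAD |
| U+10000 to U+10FFFF | 4 | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx | "😀" (U+1F600) = 0xF0 0x9F 0x98 0x80 |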

For example, the character "é" (e with acute accent, used in words like "café") encodes as two bytes in UTF-8: `11000011 10101001` (0xC3 0xA9). The character "中" (the Chinese character for "middle") encodes as three bytes: `11100100 10111000 10101101` (0xE4 0xB8 0xAD).
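The bit patterns can be reproduced with a few shifts and masks. The function below is a teaching sketch (the name `utf8_encode` is made up here); in real code, `str.encode("utf-8")` does the same job, and the loop checks the sketch against it.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point using the UTF-8 bit patterns."""
    if 0xD800 <= code_point <= 0xDFFF or not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not an encodable Unicode code point")
    if code_point <= 0x7F:        # 1 byte:  0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:      # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


for ch in ("A", "é", "中"):
    encoded = utf8_encode(ord(ch))
    assert encoded == ch.encode("utf-8")   # matches Python's own encoder
    print(ch, encoded.hex(" "), " ".join(f"{b:08b}" for b in encoded))
```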

UTF-16

UTF-16 uses 2 bytes (16 bits) per character for most common characters, and 4 bytes for less common ones (using surrogate pairs). It is commonly used by Windows internally and by the Java and .NET runtimes. However, UTF-16 has significant disadvantages for subtitles:

Larger file sizes: English text takes 2 bytes per character instead of 1

Byte order issues: Big-endian vs little-endian must be distinguished, requiring a BOM

Compatibility: Many subtitle parsers on Linux and embedded devices do not handle UTF-16 correctly

Not ASCII compatible: UTF-16 files cannot be read as ASCII text — they contain null bytes between ASCII characters

BOM dependency: Without a BOM, the byte order cannot be determined, leading to garbled text
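These disadvantages are easy to observe in Python. The snippet below is a small demonstration using the two-letter string "Hi"; the codec names are Python's.

```python
text = "Hi"

utf8 = text.encode("utf-8")          # b'Hi'
utf16_le = text.encode("utf-16-le")  # b'H\x00i\x00' (null byte after each letter)
utf16_be = text.encode("utf-16-be")  # b'\x00H\x00i'
print(len(utf8), len(utf16_le))      # 2 4

# The plain "utf-16" codec prepends a BOM so a reader can tell the byte order
with_bom = text.encode("utf-16")
print(with_bom[:2].hex(" "))         # "ff fe" on little-endian machines

# Decoding with the wrong byte order gives unrelated characters, not an error
wrong = utf16_le.decode("utf-16-be")
print([f"U+{ord(c):04X}" for c in wrong])   # ['U+4800', 'U+6900']
```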

Detailed BOM Explanation

The Byte Order Mark (BOM) is a special Unicode character (U+FEFF) placed at the beginning of a text file. Its purposes are:

Signaling that the file is Unicode

Indicating the byte order (endianness) for UTF-16 and UTF-32

Identifying which Unicode encoding is used

BOM in Different Encodings

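| Encoding | BOM bytes (hex) | Length | Notes |
|----------|-----------|---------|-------------|
| UTF-8 | EF BB BF | 3 bytes | Optional; avoid for subtitles |
| UTF-16 LE | FF FE | 2 bytes | Little-endian byte order |
| UTF-16 BE | FE FF | 2 bytes | Big-endian byte order |
| UTF-32 LE | FF FE 00 00 | 4 bytes | Begins with the same two bytes as the UTF-16 LE BOM |
| UTF-32 BE | 00 00 FE FF | 4 bytes | Big-endian byte order |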
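Every BOM is the same code point, U+FEFF, written in a different encoding. The short Python check below prints the byte sequence for each one and compares a few against the constants in the standard `codecs` module.

```python
import codecs

bom = "\ufeff"   # the BOM is just this one code point

for codec in ("utf-8", "utf-16-le", "utf-16-be", "utf-32-le", "utf-32-be"):
    print(f"{codec:10} {bom.encode(codec).hex(' ').upper()}")

# The standard library exposes the same byte sequences as constants
assert bom.encode("utf-8") == codecs.BOM_UTF8
assert bom.encode("utf-16-le") == codecs.BOM_UTF16_LE
assert bom.encode("utf-32-be") == codecs.BOM_UTF32_BE
```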

The UTF-8 BOM Problem

The UTF-8 BOM (bytes EF BB BF) is particularly problematic for subtitles:

Technically optional: The Unicode standard neither requires nor recommends a BOM for UTF-8

No byte order ambiguity: UTF-8 has no endianness, so the BOM carries no byte-order information and acts only as an encoding signature

Causes parsing errors: Many subtitle parsers (especially on Linux and macOS) treat the BOM as part of the text. The three bytes EF BB BF are invisible in most editors, but a parser sees a zero-width no-break space (U+FEFF) in front of the first cue number, so the first entry can fail to parse or display a stray character

Platform fragmentation: Older Windows tools, including Notepad before Windows 10 version 1903, add a BOM when saving as "UTF-8". Mac and Linux tools typically do not

Best practice for subtitles: Always save as UTF-8 without BOM. This ensures maximum compatibility across all platforms, players, and devices.
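Removing a UTF-8 BOM only means dropping the first three bytes when they match. The following Python sketch shows one way to do it; `strip_utf8_bom` is an illustrative helper name and the sample cue is invented.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return the data without a leading UTF-8 BOM (unchanged if there is none)."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


cue = "1\n00:00:01,000 --> 00:00:02,000\nCafé\n".encode("utf-8")
sample = codecs.BOM_UTF8 + cue

# A naive parser sees U+FEFF glued to the first cue number
first_line = sample.decode("utf-8").splitlines()[0]
print(repr(first_line))                 # '\ufeff1'

clean = strip_utf8_bom(sample)
print(clean == cue)                     # True

# When only reading, the "utf-8-sig" codec accepts files with or without a BOM
print(sample.decode("utf-8-sig") == clean.decode("utf-8-sig"))   # True
```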

If you receive a UTF-8 file with BOM, use our Online Editor to detect and remove it automatically.

Platform-Specific Encoding Issues

Windows

Notepad: Defaults to ANSI (system locale encoding). Must explicitly select "UTF-8" in the Save As dialog. When you do select UTF-8, it saves **with** BOM by default.

Notepad++: Shows encoding in the status bar. Supports "UTF-8 without BOM" — always select this option.

Windows PowerShell: `Out-File` and the `>` operator default to UTF-16LE. Use `Set-Content -Encoding UTF8` for UTF-8 without BOM, or `-Encoding UTF8NoBOM` in PowerShell 6+.

Legacy applications: Many old Windows subtitle editors save as ANSI (Windows-1252 on English systems, Windows-1251 on Russian systems, etc.).

Typical symptoms: Subtitles saved as ANSI display correctly on Windows but show garbled characters (\u00e2\u20ac\u201c for em dashes, \u00c3\u00a9 for \u00e9) on Mac, Linux, and smart TVs.
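Both directions of a Windows encoding mismatch can be reproduced in a few lines of Python. The helper name `misread` is made up for this sketch, and `errors="replace"` stands in for a lenient player that substitutes a placeholder instead of failing.

```python
def misread(text: str, written_as: str, read_as: str) -> str:
    """Encode text one way and decode it another, as a mismatched player would."""
    return text.encode(written_as).decode(read_as, errors="replace")


# UTF-8 file opened as Windows-1252: each accented letter becomes two symbols
utf8_as_ansi = misread("café", "utf-8", "cp1252")
print(utf8_as_ansi)           # cafÃ©

# ANSI (Windows-1252) file opened as UTF-8: the lone 0xE9 byte is invalid,
# so it is shown as the replacement character U+FFFD
ansi_as_utf8 = misread("café", "cp1252", "utf-8")
print(ansi_as_utf8)           # caf + U+FFFD
```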

macOS

TextEdit: Defaults to UTF-8 but may add a BOM depending on settings. Can also save as "Mac Roman" (an older Mac encoding).

Terminal: Uses UTF-8 by default. Most Mac command-line tools handle UTF-8 without issues.

Legacy issues: Older Mac software may use Mac OS Roman encoding, which assigns different characters than Windows-1252 to most bytes in the 0x80-0xFF range (for example, é is 0x8E in Mac OS Roman but 0xE9 in Windows-1252).

Typical symptoms: Files from Windows with BOM cause an invisible character or blank line at the start of the first subtitle entry. Files from old Mac software encoded as Mac Roman show garbled characters when opened on Windows.
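The Mac OS Roman mismatch can be checked with Python's `mac_roman` and `cp1252` codecs. The three letters are arbitrary samples; the loop prints each letter's byte in both encodings and what a Windows-1252 reader would display for the Mac byte.

```python
for ch in ("é", "ü", "ñ"):
    mac = ch.encode("mac_roman")
    win = ch.encode("cp1252")
    seen_on_windows = mac.decode("cp1252", errors="replace")
    print(f"{ch}: Mac Roman 0x{mac.hex().upper()}, "
          f"Windows-1252 0x{win.hex().upper()}, "
          f"Mac byte read as Windows-1252 -> {seen_on_windows}")
```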

Linux

System encoding: UTF-8 is the standard on all modern Linux distributions.

Subtitle parsers: Most Linux video players (VLC, mpv) expect UTF-8 without BOM and may fail silently with other encodings.

Command-line tools: Automation scripts using `sed`, `awk`, `grep`, or Python assume UTF-8 input. Non-UTF-8 files may cause script failures.

Nautilus / GNOME: File manager and text editor default to UTF-8.

Typical symptoms: UTF-16 files from Windows are often rejected entirely or appear as garbled text. BOM characters cause parsing issues in subtitle tools.
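The script failures mentioned above usually look like the following in Python, which decodes strictly by default. The sample word and the three encodings are chosen only to illustrate the behavior.

```python
samples = {
    "utf-8": "Grüße".encode("utf-8"),
    "cp1252": "Grüße".encode("cp1252"),
    "utf-16": "Grüße".encode("utf-16"),
}

results = {}
for label, data in samples.items():
    try:
        data.decode("utf-8")          # strict decoding, as a UTF-8 script would do
        results[label] = "ok"
    except UnicodeDecodeError as err:
        results[label] = f"fails at byte {err.start}"

print(results)
```

The Windows-1252 file fails at the first accented letter, and the UTF-16 file fails immediately because its BOM is not valid UTF-8.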

Smart TVs and Embedded Devices

Limited parsers: Many smart TVs have minimal subtitle parsers that only handle basic UTF-8 without BOM.

Character set issues: Some TVs only support Latin-1 and basic Cyrillic, failing on CJK characters entirely (displaying empty boxes).

Recommended: Test with UTF-8 without BOM and limit to characters supported by the target device. For maximum compatibility, stick to common scripts.
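One rough way to pre-screen a file for a limited device is to list the characters that fall outside a chosen set of legacy code pages. The sketch below assumes the device's repertoire is approximated by Latin-1 plus Windows-1251 (basic Cyrillic); real devices depend on their installed fonts, so treat the result as a hint only. The function name is hypothetical.

```python
def unsupported_chars(text: str, device_codecs=("latin-1", "cp1251")) -> set:
    """Characters that none of the given legacy code pages can represent."""
    missing = set()
    for ch in text:
        for codec in device_codecs:
            try:
                ch.encode(codec)
                break                 # at least one code page covers it
            except UnicodeEncodeError:
                continue
        else:
            missing.add(ch)           # no code page in the list covers it
    return missing


print(unsupported_chars("Café Привет 中文"))   # the two CJK characters
```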

Common Encoding Problems and Their Symptoms

|---------|-------------|--------------|----------|

Comprehensive Fix Workflow

Step 1: Detect the Current Encoding

Before fixing, you need to know what you are dealing with:

Method A: Use our Online Editor. Upload the file and the tool automatically detects its encoding and reports it.

Method B: Notepad++ — the status bar shows the encoding (e.g., "UTF-8", "ANSI", "UTF-16 LE BOM").

Method C: VS Code — the bottom-right corner shows the encoding. Click it to see details or change.

Method D: PowerShell BOM detection:

```powershell
# Windows PowerShell 5.1 syntax; in PowerShell 6+ use -AsByteStream instead of -Encoding Byte
$bytes = Get-Content -Path "file.srt" -Encoding Byte -TotalCount 4

# Check the 4-byte UTF-32 marks first: the UTF-32 LE BOM begins with the UTF-16 LE BOM
if ($bytes[0] -eq 0xFF -and $bytes[1] -eq 0xFE -and $bytes[2] -eq 0 -and $bytes[3] -eq 0) {
    "UTF-32 LE"
} elseif ($bytes[0] -eq 0 -and $bytes[1] -eq 0 -and $bytes[2] -eq 0xFE -and $bytes[3] -eq 0xFF) {
    "UTF-32 BE"
} elseif ($bytes[0] -eq 0xEF -and $bytes[1] -eq 0xBB -and $bytes[2] -eq 0xBF) {
    "UTF-8 with BOM"
} elseif ($bytes[0] -eq 0xFF -and $bytes[1] -eq 0xFE) {
    "UTF-16 LE"
} elseif ($bytes[0] -eq 0xFE -and $bytes[1] -eq 0xFF) {
    "UTF-16 BE"
} else {
    "No BOM detected: likely UTF-8 without BOM or ANSI/Latin-1"
}
```

Method E: Visual inspection. Open the file in a text editor. If you see sequences like Ã© (instead of é), Ã¼ (instead of ü), or â€™ (instead of ’), the file is UTF-8 content being interpreted as Latin-1/Windows-1252.
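The same checks can be scripted. The Python sketch below looks for a BOM first, then tries strict UTF-8, and otherwise returns a fallback. The function name and the Windows-1252 fallback are assumptions for illustration; a legacy file may just as well be Windows-1251, Shift-JIS, or another code page, which this heuristic cannot tell apart.

```python
import codecs


def guess_encoding(data: bytes, fallback: str = "cp1252") -> str:
    """Guess a Python codec name: BOM first, then strict UTF-8, then a fallback."""
    # UTF-32 before UTF-16: FF FE 00 00 also starts with FF FE
    if data.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return fallback   # assumption: legacy files are Windows-1252


for codec in ("utf-8", "utf-8-sig", "utf-16", "utf-32", "cp1252"):
    print(f"written as {codec:10} guessed as {guess_encoding('Café'.encode(codec))}")
```

The returned names are chosen so that `data.decode(guess_encoding(data))` consumes the BOM where one exists.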

Step 2: Identify the Problem

Use the symptom table in the previous section to match what you see with the likely cause. The key patterns to recognize:

Mojibake in accented letters: Ã© for é, Ã¼ for ü, Ã± for ñ. This is almost always UTF-8 read as Windows-1252 or Latin-1

Ã followed by a second symbol: The telltale sign of UTF-8/Latin-1 confusion, because every character from U+00C0 to U+00FF is encoded in UTF-8 with the lead byte 0xC3, which displays as Ã

â€“ and â€™: An en dash and a curly apostrophe in UTF-8 read as Windows-1252

Empty boxes or question marks: Characters that the assumed encoding or the display font cannot represent
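When text has already been mangled by a UTF-8 versus Windows-1252 mix-up, the damage can often be reversed by running the misreading backwards. The helper below is a minimal sketch with a made-up name; it handles one round of this specific mistake and leaves anything else untouched.

```python
def fix_mojibake(text: str) -> str:
    """Undo one round of 'UTF-8 bytes decoded as Windows-1252'."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text   # not this kind of damage; leave it alone


print(fix_mojibake("cafÃ©"))     # café
print(fix_mojibake("donâ€™t"))   # don’t
print(fix_mojibake("café"))      # already correct, returned unchanged
```

Some damaged characters cannot be recovered this way, because a few byte values (such as 0x81 and 0x8D) have no Windows-1252 character assigned. Going back to the original file is more reliable whenever it is available.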

Step 3: Convert to UTF-8 Without BOM

Option 1: Use our Online Editor — upload, select "UTF-8 without BOM" as target, preview, download.

Option 2: Notepad++:

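Open the file

Check the current encoding in the status bar or the Encoding menu

If the file is not UTF-8: Encoding → Convert to UTF-8

If a BOM is present (the status bar shows "UTF-8-BOM"): Encoding → Convert to UTF-8, which drops the BOM (older versions label this "Convert to UTF-8 without BOM")

Save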

Option 3: VS Code:

Open the file

Click encoding in bottom bar

Select "Save with Encoding" → "UTF-8"

Save (VS Code does not add BOM by default)

Option 4: PowerShell:

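```powershell
# Read with the source encoding, then write UTF-8 without BOM.
# In Windows PowerShell 5.1, Set-Content -Encoding UTF8 adds a BOM, so write through .NET.
# (In PowerShell 6+, Set-Content -Encoding utf8NoBOM does the same job.)
[string[]]$content = Get-Content -Path "input.srt" -Encoding UTF8
$utf8NoBom = New-Object System.Text.UTF8Encoding $false
[System.IO.File]::WriteAllLines((Join-Path $PWD "output.srt"), $content, $utf8NoBom)
```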

For UTF-16 or ANSI source files, specify the encoding explicitly:

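```powershell
# UTF8Encoding($false) writes UTF-8 without a BOM
$utf8NoBom = New-Object System.Text.UTF8Encoding $false

# UTF-16 LE (common Windows export)
[string[]]$content = Get-Content -Path "input.srt" -Encoding Unicode
[System.IO.File]::WriteAllLines((Join-Path $PWD "output.srt"), $content, $utf8NoBom)

# ANSI / Windows-1252 (in Windows PowerShell 5.1, Default means the system ANSI code page)
[string[]]$content = Get-Content -Path "input.srt" -Encoding Default
[System.IO.File]::WriteAllLines((Join-Path $PWD "output.srt"), $content, $utf8NoBom)
```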

Option 5: Linux/Mac command line with iconv:

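```bash
# Detect encoding (-i on Linux; on macOS use: file -I input.srt)
file -i input.srt

# Convert from the detected encoding to UTF-8
iconv -f WINDOWS-1252 -t UTF-8 input.srt > output.srt

# Remove a UTF-8 BOM if present (GNU sed; BSD sed on macOS lacks \x escapes)
sed -i '1s/^\xEF\xBB\xBF//' output.srt

# Portable alternative that also works on macOS
perl -i -pe 's/^\xEF\xBB\xBF// if $. == 1' output.srt
```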
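For a cross-platform alternative, the conversion can also be sketched in Python. The function names below are illustrative, and the caller is assumed to know the source encoding (for example from Step 1). Working on raw bytes keeps the original line endings intact.

```python
from pathlib import Path


def reencode(data: bytes, source_encoding: str) -> bytes:
    """Decode bytes from the source encoding and return UTF-8 without BOM."""
    text = data.decode(source_encoding)     # raises if the encoding does not fit
    text = text.lstrip("\ufeff")            # drop a BOM that survived decoding
    return text.encode("utf-8")


def convert_file(src: str, dst: str, source_encoding: str) -> None:
    """File wrapper around reencode(); src and dst are paths chosen by the caller."""
    Path(dst).write_bytes(reencode(Path(src).read_bytes(), source_encoding))


legacy = "1\r\n00:00:01,000 --> 00:00:02,000\r\nCafé\r\n".encode("cp1252")
print(reencode(legacy, "cp1252"))
```

A call such as `convert_file("input.srt", "output.srt", "cp1252")` would then mirror the `iconv` command above.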

Step 4: Validate the Conversion

Open the converted file in the [Online Editor](/tools/editing/online-editor) — verify all characters render correctly

Test in VLC Media Player — the most reliable test for subtitle rendering

Test on the target platform (YouTube, smart TV, etc.)

Check for any remaining mojibake, question marks, or empty boxes

Verify the file size is appropriate (UTF-8 files are typically slightly larger than Latin-1 equivalents for text with many accents)
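Part of this validation can be automated. The sketch below checks raw file bytes for a BOM, for invalid UTF-8, and for a few character sequences that often indicate mojibake. The marker list and function name are assumptions made for illustration, and the markers are heuristics only: "Ã" is a legitimate letter in Portuguese, for example.

```python
import codecs

MOJIBAKE_MARKERS = ("Ã", "â€", "\ufffd")   # heuristic hints, not proof


def check_subtitle_bytes(data: bytes) -> list:
    """Return a list of human-readable problems found in the file bytes."""
    problems = []
    if data.startswith(codecs.BOM_UTF8):
        problems.append("starts with a UTF-8 BOM")
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError as err:
        problems.append(f"invalid UTF-8 at byte {err.start}")
        return problems
    for marker in MOJIBAKE_MARKERS:
        if marker in text:
            problems.append(f"possible mojibake: contains {marker!r}")
    return problems


print(check_subtitle_bytes("Café".encode("utf-8")))       # []
print(check_subtitle_bytes("Café".encode("utf-8-sig")))   # BOM reported
print(check_subtitle_bytes("Café".encode("cp1252")))      # invalid UTF-8 reported
```

An empty list means none of these checks fired; it does not replace testing on the target player.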

Step 5: Update Your Workflow

Configure your text editor to always save as UTF-8 without BOM by default

Use our [Online Editor](/tools/editing/online-editor) as your primary subtitle editing tool

Include encoding validation in your quality assurance process

Maintain a master copy in UTF-8 without BOM and convert to other encodings only when necessary for legacy platforms

Encoding Conversion Reference Table

|-----------------|---------------------------|-----------|---------------------|

Testing Checklist

Before finalizing any subtitle file, verify all items on this checklist:

|---|-------|-------------|---------------|

Best Practices Summary

Always use UTF-8 without BOM — This is the universal standard supported by all modern platforms, players, and devices

Configure your editor — Set your text editor to default to UTF-8 without BOM so you never accidentally save in another encoding

Test on VLC first — VLC is the most reliable test environment. If subtitles display correctly in VLC, they will work on most desktop platforms

Validate after every conversion — Never assume a conversion preserved encoding correctly. Always open and verify the output file

Keep original files — Before changing encoding, keep a backup of the original file in case the conversion has issues

Use dedicated tools — Our [Online Editor](/tools/editing/online-editor) handles encoding detection, conversion, and validation automatically

Be consistent — Use the same encoding for all files in a project. Mixing encodings causes confusion and errors

Avoid UTF-16 for subtitles — Despite being common on Windows, UTF-16 causes compatibility issues with many subtitle parsers outside the Windows ecosystem

Beware of double encoding: Converting a file that is already UTF-8 as if it were Latin-1 or Windows-1252 encodes every non-ASCII character a second time (é becomes Ã©). Confirm the real source encoding before converting

Document your encoding choices — If working in a team, document which encoding to use for source files, translations, and delivery
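The double-encoding item in the list above is easiest to understand at the byte level. This short Python demonstration uses a single "é" and simulates a converter that wrongly assumes the input is Windows-1252.

```python
original = "é"
once = original.encode("utf-8")                  # correct UTF-8: C3 A9

# Mistake: treat those UTF-8 bytes as Windows-1252 and "convert" to UTF-8 again
twice = once.decode("cp1252").encode("utf-8")    # double-encoded: C3 83 C2 A9

print(once.hex(" "), "->", twice.hex(" "))
print(twice.decode("utf-8"))                     # Ã©, even in a correct UTF-8 reader
```

The damaged file is still valid UTF-8, so validation tools will not flag it as broken; only the displayed text reveals the problem.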

Related Tools

[Online Editor](/tools/editing/online-editor) — detect, fix, and convert encoding

[Batch Converter](/tools/conversion) — convert encoding for multiple files simultaneously

[Remove Duplicates](/tools/cleanup/remove-duplicates) — clean up encoding artifacts in subtitle text

[SRT to VTT Converter](/tools/conversion/srt-to-vtt) — convert formats while preserving UTF-8 encoding

[Remove SDH](/tools/cleanup/remove-sdh) — clean accessibility markers (works best with proper encoding)

Conclusion

Character encoding is the invisible infrastructure that ensures your subtitles display correctly regardless of language, platform, or device. UTF-8 without BOM is the gold standard — universal, efficient, and compatible with all modern systems. By understanding how encoding works, knowing the common pitfalls on each platform, and following a systematic fix workflow, you can eliminate encoding problems from your subtitle workflow entirely.

Use our Online Editor to detect, fix, and convert encoding automatically, or browse our subtitle tools for more specialized utilities.