Skip to main content

How to Fix Mojibake and Garbled Text in Files, Forms, CSVs, and Copied Content

A repair playbook for mojibake: reverse é-style corruption, rescue CSVs that Excel mangled, stop forms from double-encoding, and know when the data is truly gone.

Published By Li Lei
#mojibake #encoding #utf-8 #csv #debugging

How to Fix Mojibake and Garbled Text in Files, Forms, CSVs, and Copied Content

Mojibake is what you get when bytes written under one character encoding are read under another: café turns into café, “smart quotes” turn into “smart quotesâ€, and 张伟 turns into å¼ ä¼Ÿ. The good news is that most mojibake is a reversible transformation, not data loss. This playbook covers the four places it shows up most — saved files, web forms, CSV round-trips through Excel, and text copied between apps — and gives you a concrete repair for each.

Before touching anything, one rule: never "fix" garbled text by hand-editing the visible characters. The corruption happened at the byte level, and the byte-level inverse is what repairs it. Hand edits destroy the evidence you need.

Read the damage pattern first

The shape of the garbage tells you which mistake produced it, and whether it is reversible.

  • é, ü, ñ — one weird pair per accented letter. UTF-8 bytes were decoded as Windows-1252 or Latin-1. é is the two UTF-8 bytes C3 A9; read one-byte-at-a-time under Windows-1252, C3 renders as à and A9 as ©. Fully reversible.
  • “, â€, ’ — three-character clusters around quotes. Same bug, but the victim is a curly quote or dash from the U+2018–U+201D range, which UTF-8 encodes as three bytes ( is E2 80 9C). Fully reversible.
  • é — the pairs have pairs. Double mojibake: the text was corrupted, the corrupted text was re-encoded as UTF-8, and then misread again. Reversible, but you have to unwind it twice.
  • (U+FFFD) or runs of ?. A decoder hit bytes it could not interpret and substituted a replacement character, or a converter dropped everything outside its target set. Not reversible — the original bytes are gone. Restore from a backup or the upstream source.

You can confirm a diagnosis instead of guessing. Paste a broken character like à into the Unicode code point explorer and you will see it is U+00C3, whose Windows-1252 byte value C3 is exactly the UTF-8 lead byte of é — the smoking gun for the classic misread. The UTF-8 byte counter is useful the same way: if a string you believed was plain ASCII reports more bytes than characters, multi-byte sequences are hiding in it.

Fixing files: reopen, don't convert

When a text file displays mojibake in your editor, the file is usually fine — the editor picked the wrong decoder. In VS Code, click the encoding indicator in the status bar and choose Reopen with Encoding, then try UTF-8 first and Windows-1252 second. Between them these two cover the overwhelming majority of real files: per W3Techs (2024), UTF-8 alone is the declared encoding of over 98% of websites, and Windows-1252 is the dominant legacy encoding in files produced by older Windows tooling.

Once the file displays correctly, use Save with Encoding → UTF-8 to make it permanent. Only reach for a converter when a file genuinely is in a legacy encoding and you need UTF-8 output:

iconv -f WINDOWS-1252 -t UTF-8 legacy.txt > fixed.txt

If the damage is already baked into stored text (say, a database column full of café), reverse the exact transformation that caused it. In Python:

>>> "café".encode("windows-1252").decode("utf-8")
'café'

Input: café. Output: café. The encode("windows-1252") step recovers the original bytes 63 61 66 C3 A9, and decoding those as UTF-8 restores the text. For double mojibake, run the same round-trip twice. Test on a handful of rows before running it across a table, because any row that was correct to begin with will be corrupted by the repair.

Fixing forms: pin the charset at every hop

Form mojibake means the browser and the server disagreed about encoding somewhere between the keyboard and the database. Work through the hops in order:

  1. The page. Serve Content-Type: text/html; charset=utf-8 and include <meta charset="utf-8"> in the first 1,024 bytes of the document. Browsers submit forms in the encoding of the page that contains them.
  2. The connection. For MySQL, set the client connection charset explicitly (charset=utf8mb4 in the DSN). A UTF-8 app writing through a Latin-1 connection is the single most common source of stored é I have seen.
  3. The column. Use utf8mb4, not MySQL's legacy utf8, which stores at most 3 bytes per character. Any emoji or rarer CJK character needs 4 and will either throw Incorrect string value or be truncated at the first such character.

If names like José arrive broken only from some users, suspect hop 1 on a cached or third-party page; if everything is broken, suspect hop 2.

Fixing CSVs: Excel is the crime scene

CSV mojibake almost always involves Excel on Windows, which ignores the fact that a .csv is UTF-8 unless the file starts with a byte-order mark (BOM, the bytes EF BB BF). Open a BOM-less UTF-8 export containing 张伟 by double-clicking and you get å¼ ä¼Ÿ.

I hit this personally while testing a customer-name export: the file was valid UTF-8, rendered perfectly in VS Code and in less, and still came up garbled in Excel 2021 on a Windows 11 machine. Nothing about the file was wrong. Two fixes work:

  • On the producing side, write the BOM: in Python, open the output file with encoding="utf-8-sig"; in pandas, pass encoding="utf-8-sig" to to_csv. Excel then decodes correctly on double-click.
  • On the consuming side, never double-click. In Excel use Data → From Text/CSV and pick 65001: Unicode (UTF-8) as the file origin.

After rescuing the encoding, check the header row too — exports that have been through Excel often pick up stray spaces, BOM remnants glued to the first column name, or invisible characters that break downstream imports. The CSV header normalizer cleans those in one pass.

Fixing copied content: the invisible-character tax

Text copied from Word, Google Docs, Slack, or a web page carries characters that look identical to their plain cousins but are not: curly quotes ( U+2019) where code expects ', non-breaking spaces (U+00A0) that break shell commands and YAML, zero-width spaces (U+200B) that make two identical-looking strings compare unequal, and accented letters in decomposed form — é as e + U+0301 instead of the single U+00E9, which fails equality checks and duplicates database keys.

When a config value "looks right" but the parser rejects it, inspect it character by character with the Unicode character inspector — an invisible U+00A0 or U+200B shows up immediately. For the composed-vs-decomposed problem, run the text through the Unicode normalizer to NFC form before comparing or storing; NFC is what the W3C recommends for the web, and it makes both spellings of é byte-identical.

The five-minute repair checklist

  1. Classify the pattern: é-style pairs are reversible; and ? are not.
  2. For files, reopen with the right encoding before converting anything.
  3. For stored text, reverse the exact byte transformation (encode("windows-1252").decode("utf-8")), verified on samples first.
  4. For forms, pin UTF-8 at page, connection, and column — and use utf8mb4 in MySQL.
  5. For CSVs, add a BOM for Excel or import via From Text/CSV; for pasted text, inspect and normalize before trusting it.

Mojibake looks like chaos, but it is deterministic chaos: the same two encodings collide in the same handful of ways, and every reversible case yields to the same round-trip repair.


Made by Toolora · Updated 2026-07-02