Base64 Encoding Explained for Developers: UTF-8 Bytes, Padding Rules, and the Decode Errors They Cause

Base64 is one of those formats every developer touches weekly — JWTs, data URIs, API payloads, email attachments — yet the error messages it produces are consistently cryptic. InvalidCharacterError. Incorrect padding. Text that decodes into cafÃ© instead of café. None of these errors mention what actually went wrong, and all of them trace back to two things: Base64 operates on bytes, not strings, and its 4-character block structure makes length a hard constraint.

This guide walks through the byte mechanics, then the four decode errors I see most often, each with the exact input that triggers it and the fix. Every example here can be reproduced in Toolora's Base64 Encoder/Decoder, which shows the intermediate byte view most encoders hide.

The 3-Bytes-In, 4-Characters-Out Rule

Base64, defined in RFC 4648, takes each group of 3 input bytes (24 bits), splits them into four 6-bit values, and maps each value to one of 64 ASCII characters (A–Z, a–z, 0–9, +, /).

Concrete example — the string Hi:

Input:   "Hi"           → bytes 0x48 0x69
Binary:  01001000 01101001
6-bit:   010010 000110 1001(00)   ← last group padded with zero bits
Values:  18, 6, 36
Output:  "SGk="

Two bytes (16 bits) fill only 2⅔ of the 6-bit groups, so the encoder pads the third group with zero bits and appends one = to complete the 4-character block and signal "the last block carried 2 real bytes." One input byte produces two characters plus == (two padding chars); three input bytes produce no padding at all: Hi! encodes to SGkh, exactly 4 characters, no =.
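The bit-splitting above can be sketched by hand in Python. This is an illustration only, not how production encoders are written; the names ALPHABET and encode_base64 are made up for this example, and the result is checked against the standard library's base64.b64encode.

```python
import base64

# The standard RFC 4648 alphabet, indexed by 6-bit value.
ALPHABET = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "0123456789+/"
)

def encode_base64(data: bytes) -> str:
    """Hand-rolled Base64 encoder (hypothetical helper, for illustration)."""
    out = []
    for i in range(0, len(data), 3):
        chunk = data[i:i + 3]
        # Left-align the chunk in a 24-bit integer; missing bytes become zero bits.
        bits = int.from_bytes(chunk, "big") << (8 * (3 - len(chunk)))
        # A chunk of n bytes carries n + 1 meaningful 6-bit groups.
        for group in range(len(chunk) + 1):
            shift = 18 - 6 * group
            out.append(ALPHABET[(bits >> shift) & 0x3F])
        # Fill the rest of the 4-character block with '='.
        out.append("=" * (3 - len(chunk)))
    return "".join(out)

for text in ("H", "Hi", "Hi!"):
    raw = text.encode("ascii")
    # The sketch should agree with the standard library on every input.
    assert encode_base64(raw) == base64.b64encode(raw).decode("ascii")
    print(text, "->", encode_base64(raw))
```

Running it prints the 1-byte, 2-byte, and 3-byte cases side by side, which makes the ==, =, and no-padding endings easy to compare.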

This ratio is fixed: every 3 bytes become 4 characters, so the output is about 33.3% larger than the input (marginally more when padding rounds up the final block). The count itself is exact: a 1 MiB file (1,048,576 bytes) becomes ⌈1,048,576 ÷ 3⌉ × 4 = 1,398,104 Base64 characters. That overhead is why inlining large images as data URIs backfires: you pay 33% more bytes and lose separate caching for the asset.
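The length formula is easy to check in code. A minimal sketch, assuming padded output with no line wrapping; base64_length is a name invented for this example.

```python
import base64

def base64_length(n_bytes: int) -> int:
    """Padded Base64 length for n_bytes of input: ceil(n / 3) * 4."""
    return 4 * ((n_bytes + 2) // 3)  # integer form of the ceiling division

print(base64_length(2))          # the 2-byte "Hi" example
print(base64_length(1_048_576))  # the 1 MiB example

# Cross-check the formula against the real encoder for a range of sizes.
for n in range(0, 50):
    assert base64_length(n) == len(base64.b64encode(bytes(n)))
```

Line-wrapped variants such as MIME add newline characters on top of this figure, so treat the result as a lower bound for those formats.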

Where UTF-8 Enters: btoa() Is a Trap

Here is the bug that generates the most confusion. In browsers, btoa() does not encode strings as UTF-8. It treats each JavaScript character as a single Latin-1 byte — and that produces two distinct failure modes.

Failure mode 1: silent corruption. Every character in café is below U+0100, so btoa("café") doesn't throw. It happily produces:

btoa("café")   // "Y2Fm6Q=="  ← é stored as single byte 0xE9 (Latin-1)

But nearly every consumer on the other end — an API, a JSON parser, a database — expects UTF-8, where é is the two-byte sequence 0xC3 0xA9. The correct UTF-8 encoding is different:

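// Encode the string to UTF-8 bytes first, then Base64 those bytes.
const bytes = new TextEncoder().encode("café");  // [99, 97, 102, 195, 169]
// Spreading is fine for short inputs; very large arrays can exceed the engine's argument limit.
btoa(String.fromCharCode(...bytes))              // "Y2Fmw6k="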

Y2Fm6Q== vs Y2Fmw6k=: both Base64-decode without any error, but the first one renders as garbage (caf�) the moment a lenient UTF-8 decoder reads it. The Base64 layer never throws, and a text decoder complains only if it is running in strict mode. This is the classic mojibake pipeline.
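The same two byte sequences can be reproduced outside the browser. The sketch below uses Python's latin-1 codec as a stand-in for what btoa() does to each character (an analogy, not the browser's implementation), then decodes each result with the wrong charset to show both flavors of mojibake.

```python
import base64

text = "café"

# Same string, two different byte sequences, two different Base64 outputs.
latin1_b64 = base64.b64encode(text.encode("latin-1")).decode("ascii")
utf8_b64 = base64.b64encode(text.encode("utf-8")).decode("ascii")
print(latin1_b64)  # Y2Fm6Q==
print(utf8_b64)    # Y2Fmw6k=

# Latin-1 bytes read as UTF-8: the lone 0xE9 byte is invalid, so a
# lenient decoder substitutes U+FFFD (the replacement character).
as_utf8 = base64.b64decode(latin1_b64).decode("utf-8", errors="replace")
assert as_utf8 == "caf\ufffd"

# UTF-8 bytes read as Latin-1: 0xC3 0xA9 become two separate characters.
as_latin1 = base64.b64decode(utf8_b64).decode("latin-1")
assert as_latin1 == "caf\u00c3\u00a9"  # displays as cafÃ©
```

Dropping errors="replace" makes the first decode raise UnicodeDecodeError instead, which is the strict behavior and usually the better choice when corrupted text must not pass silently.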

Failure mode 2: a hard throw. Any character above U+00FF kills btoa outright:

btoa("你好")
// Uncaught DOMException: Failed to execute 'btoa' on 'Window':
// The string to be encoded contains characters outside of the Latin1 range.

The fix is the same TextEncoder step: convert the string to UTF-8 bytes first, then Base64 those bytes. If you're unsure how many bytes a string occupies in UTF-8 (most CJK characters take 3 bytes each, emoji typically 4), Toolora's UTF-8 Byte Counter shows the per-character byte breakdown, which is exactly the view you need before predicting Base64 output length.
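For a quick check without a tool, the per-character byte counts can be computed directly. A small sketch; utf8_breakdown and predicted_base64_length are names made up for this example, and "character" here means a Unicode code point, so an emoji built from several code points will show up as several entries.

```python
import base64

def utf8_breakdown(text: str) -> list:
    """Return (character, UTF-8 byte count) pairs, one per code point."""
    return [(ch, len(ch.encode("utf-8"))) for ch in text]

def predicted_base64_length(text: str) -> int:
    """Padded Base64 length of the text's UTF-8 bytes."""
    n_bytes = len(text.encode("utf-8"))
    return 4 * ((n_bytes + 2) // 3)

# é takes 2 bytes, each CJK character here takes 3, and the emoji takes 4.
assert utf8_breakdown("é") == [("é", 2)]
assert utf8_breakdown("你好") == [("你", 3), ("好", 3)]
assert utf8_breakdown("😀") == [("😀", 4)]

# The prediction matches what the encoder actually produces.
for sample in ("café", "你好", "😀"):
    encoded = base64.b64encode(sample.encode("utf-8")).decode("ascii")
    assert len(encoded) == predicted_base64_length(sample)
```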

Padding Errors: Why Python Rejects What JavaScript Accepts

Padding is where cross-language behavior genuinely diverges, and it's why the same token decodes fine in one service and crashes another.

Take SGk — the encoding of Hi with its = stripped (JWTs strip padding by spec, and many APIs follow suit):

// JavaScript (browser)
atob("SGk")                        // "Hi" — works

# Python
import base64
base64.b64decode("SGk")
# binascii.Error: Incorrect padding

Browsers implement the WHATWG "forgiving-base64" algorithm: strip ASCII whitespace, accept missing padding, and fail only when a character falls outside the standard alphabet or the length is impossible (length mod 4 equal to 1 after stripping). Python's b64decode requires the padding to be present, although by default it silently discards characters outside the alphabet.

The fix in Python is a one-liner that pads the string back to a multiple of 4:

s = "SGk"
base64.b64decode(s + "=" * (-len(s) % 4))   # b'Hi'
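One case the one-liner cannot repair is a length that leaves a remainder of 1 after dividing by 4: a single leftover character carries only 6 bits, which is less than one byte, so no amount of padding makes it valid. A slightly more defensive sketch follows; pad_base64 is a hypothetical helper name, not part of the standard library.

```python
import base64

def pad_base64(s: str) -> str:
    """Restore stripped '=' padding, rejecting lengths that can never be valid."""
    if len(s) % 4 == 1:
        # One dangling character holds 6 bits: not enough for a full byte.
        raise ValueError("impossible Base64 length: %d" % len(s))
    return s + "=" * (-len(s) % 4)

print(base64.b64decode(pad_base64("SGk")))   # b'Hi'
print(base64.b64decode(pad_base64("SGkh")))  # b'Hi!'
```

Raising early gives a clearer message than the generic padding error and points at truncation rather than at the decoder.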

The reverse pitfall exists too: atob() throws InvalidCharacterError on Base64URL input, because - and _ (which replace + and / in the URL-safe alphabet) are not in the standard table. If you're decoding JWT segments in the browser, translate the alphabet first or use a dedicated Base64URL encoder/decoder that handles both variants.
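On the Python side, both steps (alphabet translation and padding) fit in a short helper. A minimal sketch, assuming the input is an unpadded Base64URL segment such as a JWT header; decode_base64url is a name chosen for this example, while base64.urlsafe_b64decode is the standard-library function that maps - and _ back to + and /.

```python
import base64

def decode_base64url(segment: str) -> bytes:
    """Decode a Base64URL string whose '=' padding may have been stripped."""
    padded = segment + "=" * (-len(segment) % 4)
    return base64.urlsafe_b64decode(padded)

# "-_8" is the URL-safe, unpadded form of the standard encoding "+/8=".
print(decode_base64url("-_8"))  # b'\xfb\xff'

# A typical JWT header segment (HS256), already a multiple of 4 characters.
print(decode_base64url("eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9"))
```

This only decodes a segment; it does not verify a JWT signature, which needs a proper JWT library.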

Files: The Errors Change, the Causes Don't

File workflows hit the same two root causes — byte/string confusion and length constraints — in different clothing.

When I tested a PDF round-trip for this article, I pasted a Base64 string copied from a JSON API response into a decoder and got a corrupt file. The cause took a few minutes to spot: the JSON value was a data URI, so the string began with data:application/pdf;base64,JVBERi0x... — and the data:..., prefix is not Base64. Feeding the whole thing to a strict decoder fails on the : character; feeding it to a lenient decoder silently produces garbage bytes that break the PDF's %PDF-1. magic-number header. Stripping everything up to and including the first comma fixed it instantly. Toolora's Base64 to File Converter strips that prefix automatically and validates the alphabet before writing bytes, which is exactly the guard I wish that API's docs had mentioned.
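The prefix-stripping fix described above can be sketched as a small guard in front of a strict decoder. This is an illustration under stated assumptions, not the logic of any particular tool: decode_maybe_data_uri is a hypothetical helper, and the sample payload is a 9-byte stand-in rather than a real PDF.

```python
import base64
import binascii

def decode_maybe_data_uri(value: str) -> bytes:
    """Strip a 'data:...;base64,' prefix if present, then decode strictly."""
    if value.startswith("data:"):
        header, sep, payload = value.partition(",")
        if not sep or not header.endswith(";base64"):
            raise ValueError("not a Base64 data URI")
        value = payload
    # validate=True rejects any character outside the Base64 alphabet.
    return base64.b64decode(value, validate=True)

# Build a stand-in data URI from the first bytes of a PDF header.
payload = base64.b64encode(b"%PDF-1.7\n").decode("ascii")
uri = "data:application/pdf;base64," + payload

try:
    base64.b64decode(uri, validate=True)  # whole string, prefix included
except binascii.Error as exc:
    print("strict decoder rejected the prefix:", exc)

data = decode_maybe_data_uri(uri)
assert data.startswith(b"%PDF-")  # magic number survives the round trip
```

Checking the magic number after decoding is a cheap sanity test that catches a leftover prefix before a corrupt file is written to disk.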

The other file-specific error is whitespace. Base64 copied from email (MIME wraps at 76 columns per RFC 2045) or from PEM certificates (64 columns) contains a newline at the end of every line. Python's b64decode with validate=True rejects those newlines (its default mode silently discards them); JavaScript's atob strips them. Neither behavior is wrong, since they implement different specs, but knowing which one your decoder follows turns a mystery failure into a five-second fix.
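The strict-versus-lenient difference is easy to reproduce in Python alone. The sketch below uses base64.encodebytes, which emits MIME-style output with a newline after every 76 characters, as a stand-in for text copied from an email; strip_whitespace is a helper name invented for this example.

```python
import base64
import binascii

raw = bytes(range(90))  # 90 bytes -> 120 Base64 characters -> two wrapped lines
wrapped = base64.encodebytes(raw).decode("ascii")
assert "\n" in wrapped

def strip_whitespace(s: str) -> str:
    """Remove all whitespace so a strict decoder accepts the input."""
    return "".join(s.split())

try:
    base64.b64decode(wrapped, validate=True)
except binascii.Error as exc:
    print("strict mode rejected the newlines:", exc)

# Default mode discards the newlines; strict mode works once they are removed.
assert base64.b64decode(wrapped) == raw
assert base64.b64decode(strip_whitespace(wrapped), validate=True) == raw
```

Stripping whitespace and then decoding with validate=True keeps the strictness for genuinely bad characters while tolerating line wrapping.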

A Decode-Error Cheat Sheet

| Error | Language | Real cause | Fix |
|---|---|---|---|
| InvalidCharacterError | JS atob | Base64URL chars (-, _) or a data-URI prefix | Translate alphabet / strip prefix |
| Incorrect padding | Python | Padding stripped (JWT-style) | s + "=" * (-len(s) % 4) |
| Latin1 range DOMException | JS btoa | Characters above U+00FF | TextEncoder → bytes → encode |
| Mojibake (cafÃ©, caf�) | any | Encoded as Latin-1, decoded as UTF-8 (or vice versa) | Encode UTF-8 bytes on both ends |

The pattern across all four: Base64 itself almost never fails. What fails is the byte-to-string boundary on either side of it. Check what bytes went in, check what charset reads them out, and pad to a multiple of 4 — that resolves the overwhelming majority of decode errors before you ever need a debugger.

Made by Toolora · Updated 2026-07-02