How to Extract Email Addresses From Any Block of Text
A practical guide to pulling every email address out of logs, exports, and copied pages with regex, then deduping and cleaning the list locally.
Every contact list I have ever inherited arrived as a mess. A support transcript with forwarded headers. A CSV export where someone pasted three columns into one. An HTML page copied straight out of the browser, angle brackets and all. Somewhere inside that noise are the email addresses I actually need, and the job is to fish them out without dragging the surrounding junk along with them.
This post walks through how that extraction works, why a regex is the right tool for it, and how to turn raw text into a clean, deduplicated list you can hand to a CRM or a script.
What an email address actually looks like to a parser
The thing that makes extraction possible is that an email address has a strict, recognizable shape: user@domain. A local part (letters, digits, and a handful of symbols like dots, plus signs, and hyphens), a single @, then a domain made of labels separated by dots, ending in a top-level domain of at least two letters.
A matcher built around that shape, roughly [a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,} applied case-insensitively, scans the text from left to right and reports each span that fits the user@domain form. It does not care what surrounds the match. A line like Contact: jane.doe+sales@acme.co (primary) yields exactly jane.doe+sales@acme.co. The label "Contact:" and the trailing "(primary)" are left behind because they never matched the pattern. That is the whole trick: the regex describes the email, not the sentence, so context falls away on its own.
The same logic pulls addresses out of HTML (<a href="mailto:bob@host.io">), out of log lines, out of Markdown, and out of CSV cells, because all of those formats still contain the literal user@domain substring somewhere inside their markup.
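As a minimal sketch of that idea in Python (not the tool's own code), the rough pattern above can be compiled once and run over plain text, HTML, and a log line. The sample strings and the name find_emails are made up for illustration.

```python
import re

# The rough pattern from the text. re.IGNORECASE lets [a-z] cover uppercase too.
EMAIL_RE = re.compile(r"[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}", re.IGNORECASE)


def find_emails(text: str) -> list[str]:
    """Return every substring shaped like user@domain, in order of appearance."""
    return EMAIL_RE.findall(text)


# Hypothetical inputs in three different formats.
samples = [
    "Contact: jane.doe+sales@acme.co (primary)",
    '<a href="mailto:bob@host.io">Bob</a>',
    "2024-01-01 12:00:00 login ok user=carol@example.org ip=10.0.0.1",
]

for line in samples:
    print(find_emails(line))
```

Each call returns only the address, because the surrounding label, markup, or log fields never satisfy the pattern.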
A worked example: messy paste in, clean list out
Suppose I paste this into the Email Address Extractor:
From: "Dana Lee" <dana.lee@northwind.example>
Cc: ops@northwind.example, ops@northwind.example
Reply to support@northwind.example or sales+eu@northwind.example.
Bad one: typo@@northwind.example
Footer mailto:newsletter@northwind.example?subject=Hi
The parser scans every line and returns a table. With "keep unique only" turned on, the normalized list comes back as:
dana.lee@northwind.example
ops@northwind.example
support@northwind.example
sales+eu@northwind.example
newsletter@northwind.example
Five clean addresses out of seven raw hits. The duplicate ops@northwind.example collapsed to one row. The ?subject=Hi query string attached to the mailto: link was trimmed off. And typo@@northwind.example, with its double @, is flagged as invalid rather than silently dropped — so you can see the catch and decide what to do with it instead of trusting a filter you cannot inspect.
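The same result can be approximated in a few lines of Python. This is a sketch under assumptions, not the extractor's actual implementation: it assumes a loose first pass that grabs any token containing an @, a cleanup step for mailto: prefixes and query strings, and a strict check with the rough pattern from earlier. The names TOKEN_RE, clean, and extract are hypothetical.

```python
import re

STRICT_RE = re.compile(r"[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}", re.IGNORECASE)
# Loose tokenizer (an assumption): split on whitespace, quotes, angle brackets,
# commas, semicolons, and parentheses, then keep tokens that contain "@".
TOKEN_RE = re.compile(r"""[^\s<>"',;()]+""")

SAMPLE = """From: "Dana Lee" <dana.lee@northwind.example>
Cc: ops@northwind.example, ops@northwind.example
Reply to support@northwind.example or sales+eu@northwind.example.
Bad one: typo@@northwind.example
Footer mailto:newsletter@northwind.example?subject=Hi
"""


def clean(token: str) -> str:
    """Drop a mailto: prefix, any ?query suffix, and trailing sentence dots."""
    token = re.sub(r"^mailto:", "", token, flags=re.IGNORECASE)
    token = token.split("?", 1)[0]
    return token.rstrip(".").lower()


def extract(text: str):
    """Return (raw hit count, unique valid addresses, invalid candidates)."""
    raw_hits, valid, invalid, seen = 0, [], [], set()
    for token in TOKEN_RE.findall(text):
        if "@" not in token:
            continue
        raw_hits += 1
        address = clean(token)
        if not STRICT_RE.fullmatch(address):
            invalid.append(address)  # kept visible instead of silently dropped
        elif address not in seen:
            seen.add(address)
            valid.append(address)
    return raw_hits, valid, invalid


raw_hits, valid, invalid = extract(SAMPLE)
print(raw_hits, valid, invalid)
```

The two-pass design matters for the double-@ case: the strict pattern alone would never report typo@@northwind.example, so a looser candidate scan is what makes it possible to flag the bad entry rather than lose it.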
Cleaning scraped or exported data
Extraction is only the first half. Real source data is dirty in ways that bite you later if you import it as-is.
- Hidden whitespace. Text copied from a web page often carries non-breaking spaces or trailing tabs glued to an address. Normalize before you compare anything, or bob@host.io with a trailing non-breaking space and a plain bob@host.io will read as two different people.
- Case noise. Bob@Host.io and bob@host.io are the same mailbox in practice. Lowercasing the domain (and usually the local part) before deduping keeps your count honest.
- Near-duplicates from multiple exports. When you merge two CSVs, the same address shows up with different surrounding formatting. Dedup on the normalized value, not the raw string.
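The cleanup steps in the list above can be sketched in Python as a normalize function followed by an order-preserving dedup. Lowercasing the whole address, local part included, is a simplifying assumption here, and the function names are illustrative.

```python
def normalize(raw: str) -> str:
    # strip() with no arguments removes Unicode whitespace at both ends,
    # which covers tabs and the non-breaking space (U+00A0).
    return raw.strip().lower()


def dedupe(addresses):
    # dict.fromkeys keeps the first occurrence of each key, in input order.
    return list(dict.fromkeys(normalize(a) for a in addresses))


# The same mailbox four ways, plus one different address.
raw = ["bob@host.io\u00a0", "Bob@Host.io", "\tbob@host.io", "amy@host.io "]
print(dedupe(raw))  # ['bob@host.io', 'amy@host.io']
```

Comparing on the normalized value rather than the raw string is what collapses the first three entries into one.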
The extractor produces an audit table with the original line number, the normalized value, a validity flag, and a reason. That line number matters: when a borderline address shows up, you can jump back to the exact spot in the source and see whether it was a real contact or a parser artifact from some stray @ in a file path.
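One way to picture such an audit table is the Python sketch below. The row fields mirror the four columns described above, but the AuditRow type, the tokenizer, and the wording of the reason strings are assumptions made for illustration, not the tool's real output.

```python
import re
from dataclasses import dataclass

STRICT_RE = re.compile(r"[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}", re.IGNORECASE)
TOKEN_RE = re.compile(r"""[^\s<>"',;()]+""")  # loose tokenizer (assumption)


@dataclass
class AuditRow:
    line: int     # 1-based line number in the source text
    value: str    # normalized candidate
    valid: bool   # True when the candidate matches user@domain
    reason: str   # short explanation (hypothetical wording)


def audit(text: str) -> list[AuditRow]:
    rows = []
    for number, line in enumerate(text.splitlines(), start=1):
        for token in TOKEN_RE.findall(line):
            if "@" not in token:
                continue
            value = token.rstrip(".").lower()
            if STRICT_RE.fullmatch(value):
                rows.append(AuditRow(number, value, True, "ok"))
            elif value.count("@") != 1:
                rows.append(AuditRow(number, value, False, "expected exactly one @"))
            else:
                rows.append(AuditRow(number, value, False, "does not match user@domain"))
    return rows


source = "owner: amy@host.io\npath: /var/mail/@tmp/cache\nbad: typo@@host.io"
for row in audit(source):
    print(row)
```

The second row is the file-path case: the stray @ produces a candidate, the strict check rejects it, and the line number points back to where it came from.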
Why local processing is the right default
The addresses you extract are almost always somebody's personal data — customers, leads, internal staff. The moment you paste that into a web form that ships it to a server, you have created a copy of that data outside your control, and you usually cannot prove what happened to it afterward.
Everything in this tool runs in your browser. The parsing, the validation, the dedup, the copy, the download — all of it executes on the page, and uploaded text files are read locally through the File API rather than sent anywhere. For a marketer cleaning a lead list or an operations engineer scrubbing a support log, that is the difference between a routine cleanup task and a privacy incident waiting to be explained.
It is also just faster. There is no round trip. Paste a few megabytes of a copied inbox thread, and the list appears immediately. For a giant mbox archive, split it into pieces locally first rather than fighting one enormous paste.
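If you would rather script the large-archive case than split the file by hand, a line-by-line pass is one option. This Python sketch is an assumption about how you might do it outside the browser tool; it reads any iterable of lines, so the full text never has to sit in one string.

```python
import io
import re

EMAIL_RE = re.compile(r"[a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}", re.IGNORECASE)


def unique_emails(lines):
    """Collect unique lowercased addresses from any iterable of text lines."""
    seen = {}  # a dict preserves first-seen order
    for line in lines:
        for match in EMAIL_RE.findall(line):
            seen.setdefault(match.lower(), None)
    return list(seen)


# io.StringIO stands in for a real file handle here; an opened text file
# iterates line by line in the same way.
fake_archive = io.StringIO(
    "From ops@northwind.example Mon Jan  1 00:00:00 2024\n"
    "To: Dana.Lee@northwind.example\n"
    "Cc: ops@northwind.example\n"
)
print(unique_emails(fake_archive))
```

Because an address cannot contain a line break, processing one line at a time does not risk cutting a match in half.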
Exporting into the shape you need
Once the list is clean, the format you want depends on where it is going. The extractor switches between plain lines, CSV, JSON, Markdown, a SQL IN (...) clause, and a TypeScript union type. That last pair saves real time: dropping a clean address list into IN ('a@x.com', 'b@y.com') by hand means adding quotes and commas to every row and hoping you did not miss one. Generating it removes the chance of a typo.
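To show what generating those shapes involves, here is a small Python sketch of four of the formats. The function names and the default type name Email are illustrative assumptions, and the output layout is a guess at something reasonable rather than a copy of what the extractor emits.

```python
import csv
import io
import json


def to_csv(addresses) -> str:
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["email"])  # header row (assumed column name)
    writer.writerows([a] for a in addresses)
    return buf.getvalue()


def to_sql_in(addresses) -> str:
    # Double any single quote so each literal stays valid SQL.
    quoted = ", ".join("'" + a.replace("'", "''") + "'" for a in addresses)
    return f"IN ({quoted})"


def to_ts_union(addresses, name: str = "Email") -> str:
    # json.dumps yields a double-quoted string literal, which TypeScript accepts.
    members = " | ".join(json.dumps(a) for a in addresses)
    return f"type {name} = {members};"


emails = ["a@x.com", "b@y.com"]
print(to_csv(emails))
print(json.dumps(emails))
print(to_sql_in(emails))
print(to_ts_union(emails))
```

An empty list needs separate handling in real use, since IN () and an empty union are not valid output. For values that reach a live database, bound query parameters are usually safer than a pasted literal list.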
One caveat: format validity is not proof of existence. A regex confirms user@domain is well-formed, not that the mailbox accepts mail or that the domain resolves. Treat the output as a clean, deduplicated candidate list, then verify deliverability separately if that matters for your send.
When you only need part of the job
If you already have a list and only need to collapse duplicates — say two merged exports with overlapping contacts — the dedicated Email Address Deduplicator does that one step without the full extraction pass. Reach for the full extractor when your starting point is raw, mixed text; reach for the narrower tool when the addresses are already isolated and you just need the duplicates gone.
The core idea holds across all of it: an email address is a pattern, and a pattern is something you can match, normalize, and count with confidence. Once you stop reading the text and start matching the shape, a wall of noise turns into a list you can actually use.
Made by Toolora · Updated 2026-06-13