A Fast CSV Statistics Summary for Sanity-Checking Exported Data

Most data problems are not subtle. A price column has a value of -1. An export that should hold 4,000 rows holds 3,812. A "country" field that should contain a handful of distinct values somehow has 900. None of these need a model or a dashboard to catch. They need someone to look at the shape of the file for ten seconds before it moves to the next step. The trouble is that opening a CSV in a spreadsheet to do that is slow, and on a large file it is genuinely painful.

A statistics summary closes that gap. The CSV Stats Summary tool reads every column and reports the things you would otherwise compute by hand: how many cells are filled, how many are empty, how many distinct values exist, and for any column that looks numeric, the minimum, maximum, mean, median, and sum. You paste the file or load it locally, and the profile comes back as Markdown you can read or drop straight into a ticket.

What the summary actually tells you

The point of a per-column profile is that each number is a small assertion about your data, and a broken assertion is obvious the moment you see it.

For every column, the tool reports a filled count and an empty count. If a "customer_id" column shows 200 empty cells, that is a join key with holes in it, and you want to know before the import fails halfway through. It reports a distinct count, which is how you catch cardinality surprises: an "is_active" flag with three distinct values instead of two means someone wrote Yes, yes, and TRUE into the same column.
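To make those three counts concrete, here is a minimal Python sketch of how they could be computed. It is not the tool's actual implementation: the function name column_counts is hypothetical, and two rules are assumptions (a whitespace-only cell counts as empty, and distinct values are compared case-sensitively after trimming).

```python
import csv
import io


def column_counts(csv_text):
    """Return {column: {"filled", "empty", "distinct"}} for a CSV string."""
    reader = csv.DictReader(io.StringIO(csv_text))
    names = reader.fieldnames or []
    filled = {name: 0 for name in names}
    empty = {name: 0 for name in names}
    values = {name: set() for name in names}
    for row in reader:
        for name in names:
            # Assumption: missing or whitespace-only cells are "empty".
            cell = (row.get(name) or "").strip()
            if cell:
                filled[name] += 1
                values[name].add(cell)
            else:
                empty[name] += 1
    return {
        name: {
            "filled": filled[name],
            "empty": empty[name],
            "distinct": len(values[name]),
        }
        for name in names
    }


# Hypothetical sample: one missing join key, one flag spelled three ways.
sample = "customer_id,is_active\n1,Yes\n2,yes\n,TRUE\n4,Yes\n"
profile = column_counts(sample)
print(profile["customer_id"])  # {'filled': 3, 'empty': 1, 'distinct': 3}
print(profile["is_active"])    # {'filled': 4, 'empty': 0, 'distinct': 3}
```

Keeping the comparison case-sensitive is deliberate here: it is what lets Yes, yes, and TRUE show up as three distinct values instead of being folded together silently.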

For numeric columns it adds count, min, max, mean, median, and sum. This is where outliers announce themselves. A min of -1 on a quantity column, a max of 9999999 on an age column, a sum that is an order of magnitude off from what finance expected: each of these is a single glance away. Because it reports each column independently, you do not have to know in advance which one is broken. You scan the min, max, and empty count for every column and the wrong one stands out.
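As a rough illustration of the numeric side, the sketch below computes the same six figures with Python's standard library. The helper name numeric_stats and the sample ages are made up, and the rule for what "looks numeric" is an assumption: here any cell that float() can parse to a finite number is included and everything else is skipped. The tool may decide differently.

```python
import math
import statistics


def numeric_stats(cells):
    """Summarize the cells that parse as finite numbers; None if none do."""
    numbers = []
    for cell in cells:
        try:
            value = float(cell)
        except ValueError:
            continue  # blank or non-numeric cell: skipped, not counted as 0
        if math.isfinite(value):
            numbers.append(value)
    if not numbers:
        return None
    return {
        "count": len(numbers),
        "min": min(numbers),
        "max": max(numbers),
        "mean": statistics.mean(numbers),
        "median": statistics.median(numbers),
        "sum": sum(numbers),
    }


# Hypothetical "age" column with one blank, one placeholder, one bad entry.
ages = ["34", "29", "9999999", "", "41", "n/a"]
stats = numeric_stats(ages)
print(stats["min"], stats["max"], stats["median"])  # 29.0 9999999.0 37.5
```

Skipping blanks rather than treating them as zero matters: a zero would pull the min and the mean down and hide the fact that the value is simply missing.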

A worked example

Here is a small orders export. Imagine the file orders.csv:

order_id,region,quantity,unit_price,total
1001,US,2,19.99,39.98
1002,EU,1,24.50,24.50
1003,US,3,19.99,59.97
1004,APAC,1,24.50,24.50
1005,US,,19.99,19.99
1006,EU,250,24.50,6125.00

Six rows, nothing screaming for attention at a glance. Now read the summary:

order_id: filled 6, empty 0, distinct 6, numeric. Good: a unique key with no gaps.
region: filled 6, empty 0, distinct 3 (US, EU, APAC). Reasonable.
quantity: filled 5, empty 1, distinct 4, numeric, min 1, max 250, mean 51.4, median 2, sum 257.
unit_price: filled 6, empty 0, distinct 2, numeric, min 19.99, max 24.50.
total: filled 6, empty 0, distinct 5, numeric, min 19.99, max 6125.00, sum 6293.94.

Two problems jump out, and neither required staring at rows. First, quantity is missing one value (filled 5 of 6); the blank cell turns out to be in row 1005. Second, the gap between the mean (51.4) and the median (2) on quantity is enormous, a textbook outlier signature. One row drags the average up while three of the five filled values sit at 1 or 2. Tracing it back, row 1006 has a quantity of 250 and a total of 6125.00, which is either a genuine bulk order or a fat-fingered entry. The summary did not decide for you, but it pointed at the exact row in seconds.
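The same two findings can be reproduced by hand on the sample above. This is only a sketch of the check, not how the tool works internally, and the variable names are illustrative. It finds the row with the blank quantity, compares mean to median, and then looks up which row holds the largest value.

```python
import csv
import io
import statistics

ORDERS = """order_id,region,quantity,unit_price,total
1001,US,2,19.99,39.98
1002,EU,1,24.50,24.50
1003,US,3,19.99,59.97
1004,APAC,1,24.50,24.50
1005,US,,19.99,19.99
1006,EU,250,24.50,6125.00
"""

rows = list(csv.DictReader(io.StringIO(ORDERS)))

# Rows where quantity is blank (the "empty 1" in the summary).
blank_ids = [r["order_id"] for r in rows if not r["quantity"].strip()]

# (order_id, quantity) pairs for the filled cells only.
filled = [(r["order_id"], float(r["quantity"]))
          for r in rows if r["quantity"].strip()]
quantities = [q for _, q in filled]

mean = statistics.mean(quantities)
median = statistics.median(quantities)
largest_id, largest = max(filled, key=lambda pair: pair[1])

print(blank_ids)               # ['1005']
print(round(mean, 1), median)  # 51.4 2.0
print(largest_id, largest)     # 1006 250.0
```

A mean that is many times the median does not prove an error. It only says that a few rows dominate the total, which is exactly the prompt to go and look at them.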

Why I run it before anything else

I used to skip this step and pay for it later. The pattern was always the same: I would load an export into a downstream tool, build a transform, and only then discover that a "date" column had 40 blanks or a "price" field included a $ that broke my parser three steps in. By then the failure was buried under everything I had built on top of it.

Now the first thing I do with any unfamiliar CSV is run the summary. It takes longer to describe than to do. I look at three things — the empty counts, the distinct counts, and the numeric min/max pairs — and most of the time everything is clean and I move on with confidence. When something is off, I have caught it before writing a single line of transform logic, which is the cheapest possible place to catch it.

It runs locally, in your browser

This matters more than it sounds. Profiling data often means looking at files you should not upload: customer lists, internal financials, anything with personal information. This tool does all of its parsing and counting in the browser. Nothing leaves your machine, so you can profile a sensitive export without routing it through a server you do not control.

The one caution that comes with local processing is the output itself. The Markdown summary can include column names and aggregate values that are themselves sensitive: a salary column header and its max, for instance. The computation is private, but the summary you copy out is not automatically safe to paste into a public channel. Read it before you share it.

Where it fits in a cleanup workflow

A profile is a starting point, not a destination. It tells you what is wrong; fixing it is a separate move. Once the summary flags an issue, the natural next steps are quick and specific.

If a key column's filled count is higher than its distinct count because of duplicate rows, run the file through the CSV Deduplicator and re-profile to confirm the filled count dropped to what you expected. If the problem is messy headers (mixed casing, stray spaces, inconsistent naming), normalize them first so the rest of your pipeline reads clean column names. If you only need the numbers from one column for a back-of-envelope check, the basic statistics calculator gives you the same min/max/mean/median on a single list without the surrounding columns.
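For readers who would rather script the first two cleanup moves, here is a small Python sketch. It does not describe how the CSV Deduplicator or any header tool behaves. Both function names are hypothetical, "duplicate" is assumed to mean an exact match on every cell, and the header rule (trim, lowercase, underscores) is just one reasonable convention.

```python
import re


def normalize_header(name):
    """Trim, lowercase, and turn runs of other characters into underscores."""
    return re.sub(r"[^a-z0-9]+", "_", name.strip().lower()).strip("_")


def drop_duplicate_rows(rows):
    """Keep the first occurrence of each row, preserving the original order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row)  # assumption: rows match only if every cell matches
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


headers = [normalize_header(h) for h in ["  Customer ID ", "Unit-Price", "region"]]
print(headers)  # ['customer_id', 'unit_price', 'region']

rows = [["1", "US"], ["2", "EU"], ["1", "US"]]
print(drop_duplicate_rows(rows))  # [['1', 'US'], ['2', 'EU']]
```

After a pass like this, re-running the profile is the confirmation step: on a key column the filled count should now equal the distinct count.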

The summary is also useful as a record. Because the output is Markdown, it copies cleanly into a pull request, an incident note, or a data-handoff doc. A reviewer who sees "filled 5,812 / 6,000, distinct 3, no negative values" understands the state of the file without opening it, and the profile becomes a small piece of evidence that the data was checked rather than assumed.
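If you keep your own profile numbers in a script, rendering them as Markdown for a ticket takes only a few lines. The table layout below is an assumption chosen for illustration, since the tool's exact output format is not reproduced here, and to_markdown is a hypothetical helper.

```python
def to_markdown(profile):
    """Render {column: {"filled", "empty", "distinct"}} as a Markdown table."""
    lines = [
        "| column | filled | empty | distinct |",
        "| --- | --- | --- | --- |",
    ]
    for name, counts in profile.items():
        lines.append(
            f"| {name} | {counts['filled']} | {counts['empty']} | {counts['distinct']} |"
        )
    return "\n".join(lines)


# Counts taken from the orders example earlier in the article.
profile = {
    "order_id": {"filled": 6, "empty": 0, "distinct": 6},
    "region": {"filled": 6, "empty": 0, "distinct": 3},
}
table = to_markdown(profile)
print(table)
```

A plain pipe table is used because most trackers and pull request descriptions that accept Markdown will render it, and it stays readable as raw text where they do not.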

A statistical summary will not clean your data and it will not replace a real analysis when you need one. What it does is turn "I think this export is fine" into something you actually verified, in the time it takes to paste a file. For most CSVs, that ten-second check is the difference between a smooth import and a debugging session three steps later.

Made by Toolora · Updated 2026-06-13