
Abrar ahmed

What I Learned Cleaning 1 Million Rows of CSV Data Without Pandas

Cleaning a small CSV? Pandas is perfect.
Cleaning a million rows on a memory-limited machine or inside a serverless function? That's when Pandas starts to struggle.

That’s exactly the problem I faced.

In this post, I’ll share:

  • Why I avoided Pandas
  • My Node.js pipeline with csv-parser
  • How I handled common data issues: dates, phone numbers, missing fields
  • What I’d do differently next time
Let’s dive in.

Why Not Pandas?

Pandas is fantastic, but it has one big downside for this job: it loads the entire file into memory by default. With files larger than about 500MB, that can mean:

  • Memory errors
  • Slow performance
  • Crashes in limited environments (e.g., cloud functions, small servers)

In my case, I had:

  • 1 million+ rows
  • Dirty data from multiple sources
  • A need to stream and clean data row by row

My Setup: Streaming CSV Processing in Node.js

Here’s the core pipeline using csv-parser and Node streams:

const fs = require('fs');
const csv = require('csv-parser');

let rowCount = 0;
let errorCount = 0;

fs.createReadStream('bigfile.csv')
  .pipe(csv())
  .on('data', (row) => {
    rowCount++;

    // Clean data
    row.email = cleanEmail(row.email);
    row.phone = cleanPhone(row.phone);
    row.date = parseDate(row.date);

    // Validate required fields
    if (!row.email || !row.date) {
      errorCount++;
      logError(row);
      return;
    }

    // Save row to DB, another file, or API...
  })
  .on('end', () => {
    console.log(`✅ Processed ${rowCount} rows`);
    console.log(`⚠️ Found ${errorCount} bad rows`);
  });

function cleanEmail(email) {
  return email?.trim().toLowerCase() || null;
}

function cleanPhone(phone) {
  const digits = phone?.replace(/\D/g, '');
  return digits?.length === 10 ? digits : null;
}

function parseDate(date) {
  const parsed = Date.parse(date);
  return isNaN(parsed) ? null : new Date(parsed).toISOString();
}

function logError(row) {
  fs.appendFileSync('errors.log', JSON.stringify(row) + '\n');
}

Common Data Issues I Ran Into (and How I Fixed Them)

  • Inconsistent date formats (MM-DD-YYYY vs DD/MM/YYYY) → Used Date.parse() plus fallback logic (see the sketch after this list)
  • Phone numbers in weird formats → Removed non-digits, validated length
  • Missing fields → Set defaults or marked them as null
  • Extra columns → Stripped rows down to schema fields
  • Encoding problems → Re-saved CSVs as UTF-8
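
For the date and extra-column fixes above, here's a minimal sketch of what that fallback and stripping logic can look like. The DD/MM/YYYY assumption in the fallback and the SCHEMA_FIELDS list are illustrative, not the exact code from my pipeline:

const SCHEMA_FIELDS = ['email', 'phone', 'date']; // hypothetical schema

function parseDateWithFallback(value) {
  if (!value) return null;

  // Try the native parser first (handles ISO strings and most US-style dates).
  const native = Date.parse(value);
  if (!isNaN(native)) return new Date(native).toISOString();

  // Fallback: assume DD/MM/YYYY or DD-MM-YYYY.
  const match = value.match(/^(\d{1,2})[\/-](\d{1,2})[\/-](\d{4})$/);
  if (!match) return null;

  const [, day, month, year] = match;
  const date = new Date(Number(year), Number(month) - 1, Number(day));
  return isNaN(date.getTime()) ? null : date.toISOString();
}

function stripToSchema(row) {
  // Keep only the columns the schema knows about; drop everything else.
  const clean = {};
  for (const field of SCHEMA_FIELDS) {
    clean[field] = row[field] ?? null;
  }
  return clean;
}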

Pro Tips for Large CSV Cleaning

  • Stream, don’t load → Avoid memory issues by processing row by row
  • Validate early → Catch bad data before it pollutes your system
  • Log errors → Keep a separate file of rejected rows for review
  • Test on a small sample → Always test your logic on a few thousand rows before a full-scale run (see the sketch after this list)
  • Handle edge cases → Empty cells, extra commas, inconsistent headers—these will happen!
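
For the small-sample tip, one way to do it without creating a separate sample file is to pull just the first N rows through the same csv-parser stream. The previewRows helper below is a sketch I'm adding for illustration, not part of the original pipeline:

const fs = require('fs');
const csv = require('csv-parser');

function previewRows(path, limit = 1000) {
  // Reads at most `limit` rows, then tears the streams down.
  return new Promise((resolve, reject) => {
    const rows = [];
    const readStream = fs.createReadStream(path);
    const parser = readStream.pipe(csv());

    parser
      .on('data', (row) => {
        rows.push(row);
        if (rows.length >= limit) {
          readStream.destroy(); // stop reading the file
          parser.destroy();     // and stop parsing
          resolve(rows);
        }
      })
      .on('end', () => resolve(rows))
      .on('error', reject);

    readStream.on('error', reject);
  });
}

// Usage: sanity-check the cleaning functions on a 1,000-row sample first.
// previewRows('bigfile.csv', 1000).then((sample) => console.log(sample.length));

Running the cleaners against a sample like this catches format surprises before you commit to a multi-hour run.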

What I’d Do Differently Next Time

  • Use a schema definition (like JSON Schema or Zod) to validate and transform rows automatically (a rough sketch follows this list)
  • Build a mapping layer for multi-source CSVs (e.g., different column names)
  • Consider tools like DuckDB or Polars if I need more advanced queries
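
For the schema idea, here's roughly what Zod-based validation plus a header-mapping layer could look like (assuming Zod 3; the field rules and COLUMN_MAP values are hypothetical, not taken from my actual sources):

const { z } = require('zod');

const RowSchema = z.object({
  email: z.string().trim().email(),
  phone: z.string().regex(/^\d{10}$/, 'expected 10 digits'),
  date: z.coerce.date(),
});

// Map source-specific headers onto the canonical field names.
const COLUMN_MAP = { e_mail: 'email', phone_number: 'phone', created: 'date' };

function normalizeRow(rawRow) {
  const mapped = {};
  for (const [key, value] of Object.entries(rawRow)) {
    mapped[COLUMN_MAP[key] || key] = value;
  }
  return RowSchema.safeParse(mapped); // { success: true, data } or { success: false, error }
}

With safeParse, a bad row comes back as { success: false } instead of throwing, so it can go straight to the error log without interrupting the stream.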

Final Thoughts

Handling big data files is less about clever code and more about building durable pipelines that survive the messiness of real-world data.

If you’re working with CSVs, remember:

  • Validate early
  • Clean thoughtfully
  • Log everything

And when in doubt, stream it, don’t load it all at once.

Have you ever tackled the challenge of cleaning up a huge dataset? What tools or tips have you found to be the most helpful? I’d love to hear your thoughts!
