Do NOT Store Your Data as a CSV File

This was a hard lesson to learn that seems obvious in hindsight.

When I was a less-experienced developer, I was working on a web scraper where I took took data from an online source and appended it to a Pandas DataFrame. When I was done scraping I then converted the DataFrame into a CSV file which felt like a natural way to store data.

If you’re unfamiliar, a CSV file is simply a text document separated by commas (or a specific character) where the first row is the name of each column like so:

Year,Make,Model,Description,Price
1997,Ford,E350,"ac, abs, moon",3000.00
1999,Chevy,"Venture ""Extended Edition""","",4900.00
1999,Chevy,"Venture ""Extended Edition, Very Large""","",5000.00
1996,Jeep,Grand Cherokee,"MUST SELL!

My process became:
Load stored data -> check if the rows already exist -> scrape the data -> store the data. Repeat.

This intuition behind using a CSV wasn’t unfounded: if you’ve ever used Kaggle, you know that data is often presented as a CSV file. This is because it’s a universal (aka basic) format that all software can read in and you can quickly load the data into memory and start doing statistics or machine learning with.

I started to notice as my CSV file grew in size, sections of my data started mysteriously disappearing with no error message. This is incredibly scary, but I was eventually able to figure out what was happening. During the writing of the data to a CSV sometimes the data would be loaded (to check what data already existed in the database) and the this would cause only part of the table to be loaded.

Databases have built in protocols that easily allow for multiple users to access them at the same time (called concurrency). Databases exist for a reason and they are awesome because they provide a lot of built-in features you don’t have to code from scratch. Use them for non-trivial projects!

Share this:

Related

Leave a comment Cancel reply