[Cross-posted from ICanHazDataScience]
Google should add Lolcatz to translate.google.com…
But seriously, if you want to do data science, you’re going to need data. Which means being able to access data: data in streams (like Twitter), data online (like websites) and data in files. We’ll start with the files.
Development agencies, bless them, have a major love affair going on with Excel files. Open any development data site (or crawl it for datafiles) and you’ll see overwhelming numbers of reports (we’ll get to those later), .xls and .csv files.
Let’s start with the easier filetype: CSV. CSV (Comma-Separated Values) is as it sounds: a file containing readable text, usually separated with commas, but sometimes instead with semicolons “;” tabs, colons “:”, pipes “|” or anything else that the person writing the file thought was appropriate (data.un.org, for example, gives you the choice of comma, semicolon or pipe). Here’s an example, seen from Microsoft Excel:
|CSV file as seen in Microsoft Excel|
If you open this file in a text editor (I’m using Notepad++ here – which is heartily recommended for looking at webpage files too), it looks like this:
|CSV file as seen in Notepad++|
There are two things to note here: first, that the columns in the Excel view correspond to the commas in the Notepad++ one. And second, that the comma in Excel (field C3) doesn’t create a new column in the CSV file because Excel wraps the whole field in punctuation marks (“This is a comment, that includes a comma.”).
Yes, it’s that simple a file format. And there’s a friendly Python library to make reading and writing CSV files even easier. It’s called “CSV“. Here are some simple bits of code that use it to read a CSV file, print out its contents, and write them out to another CSV file.
fin = open(“infile.csv”, “rb”)
fout = open(“outfile.csv”, “wb”)
csvin = csv.reader(fin)
csvout = csv.writer(fout, quoting=csv.QUOTE_NONNUMERIC)
#Read in header row
headers = csvin.next()
for header in headers:
#Read in non-header rows
for row in csvin:
for col in range(0,len(row)):
csvout.writerow([“header1”, “header2”, “header3”])
for row in csvin:
Okay, a few things happened here. First, the setup:
- Before you can use a library (e.g. CSV), you have to tell your code about it. That’s what the “import CSV” line is about.
- Second, you have to open a file before you can use the CSV library on it. The line “fin=open(“infile.csv”, “rb”)” opens a file called “infile.csv” for reading on my windows system (that’s the “rb” part), and points to the open file with a variable called “fin”. The line “fout-open(“outfile.csv”. ‘wb”) does a similar thing, opening a file called “outfile.csv” for writing (that’s the “wb” part).
- csv.reader and csv.writer are where the CSV library gets involved. “csvin = csv.reader(fin)” gives you an object, csvin, with lots of handy ways to read in CSV files. “csvout = csv.writer(fout, quoting=csv.QUOTE_NONNUMERIC)” gives you an object, csvout, with lots of handy ways to write to CSV files. That “quoting=csv.QUOTE_NONNUMERIC” is important by the way – it wraps your text fields in quotes (“) so you never get a comma in the text being read in as a new column in the CSV file, and CSV readers *always* read in each field as a text string (this is important because you don’t have to write extra code that worries about which type each input is).
Then the read:
- CSV files are organised in rows, so it’s not too surprising that CSV readers do the same thing. “headers = csvin.next()” picks the next row out of the CSV file, and puts it into an array variable called “headers”. Arrays are lists of things that look like this: [“Id”, “Title”, “Comments”, “Date”] in Python – the next lines “for header in headers:” and “print(header)” select each item in the array headers and prints it to the screen (this is where QUOTE_NONNUMERIC wins: if you didn’t use it, then any numbers in the CSV file will crash this piece of code. Unless you print(str(header)) here).
- “for row in csvin:” is a loop – it picks the next row from the CSV file, and applies the following code to it (note the lack of brackets here: the only way you know what a loop applies to is by looking at the indents in the line), then repeats this with the next row in the CSV file until it gets to the end of the file (at which point it stops and your program goes on to the next piece of code). Note that “for header in headers:” is also a loop, but it’s working on a different type of object (an array rather than a file), so it works on each item in the array instead.
- row is an array, e.g. [“1”, “Title 1′, “This is a comment on title 1. “, “3/10/2013”], which means that you can access any column in that row by typing “row[col]” where col is any number from 0 to the number of columns in the file, minus 1. I’ve used this to print out the row contents in order, but you could use it to pick out any data field that you want.
And the write:
- Writing is pretty easy: once you’ve created a csv.writer object, e.g. csvout, you can use “csvout.writerow(array)” to write a row of data to the file (e.g. “outfile.csv”) associated with it.
And tidy up:
- Closing your files is good practice – which is why “fin.close()” and “fout.close()” are here. You’ll sometimes see CSV readers and writers created in code using lines like “csvin = csv.reader(open(“infile.csv”, “rb”))” – this makes it difficult to close the files when you finish, and could get messy over time.
Well, that’s a small short look at reading and writing CSV files. Nothing much happened in the example code, but use a little imagination and you could be reading the files into your processing code, swapping around file columns, reformatting column contents and cross-matching them with the contents of other interesting files. That part… is up to you.