Data Science

Readin and Writin Excel Files

Download PDF

[Cross-posted from ICanHazDataScience]

And so to Excel.  Many people think Excel must be difficult to read into and write from a program.  Nah!  Again, there are libraries to help you: this time, it’s xlrd and xlwt (for reading Excel and writing Excel, respectively).

There’s a nice tutorial on using Python on Excel files here, so I’ll just give some example code in this post. Here’s the file I’m running it on:

Example Excel file

And here’s the code:

import xlrd

import xlwt

 

#set up

wbkin = xlrd.open_workbook(“infile.xls”)

wbkout = xlwt.Workbook()

 

#Read in data from Excel file

numsheets = wbkin.nsheets

sh=wbkin.sheet_by_index(0):

print(sh.nrows)

print(sh.ncols)

print(sh.cell_value(0,1))

merges = sh.mergedcells()

 

#Print out contents of first worksheet

for r in range(0, sh.nrows):

for c in range(0, sh.ncols):

print(“(“+str(r)+”,”+str(c)+”)”+str(sh.cell_value(r,c)))

 

#Write data out to Excel file

sheet = wbkout.add_sheet(“sheetname”, cell_overwrite_ok=True)

sheet.write(0,0, “cell contents”)

sheet.write(0,1, “more in the first row”)

sheet.write(1,0, “contents in the second row”)

wbkout.save(“outfile.xls”)

So what’s going on in here?   First, the setup.

  • wbkin = xlrd.open_workbook(“infile.xls”) creates a readable workbook object (the thing you use to access the input file) linked to the file “infile.xls”.  Note that this is a “‘xls” Excel file: xlrd can handle Excel 2010 files (“.xlsx”) but so far, they’ve given me errors (this might be something to do with me working on a 64-bit Windows machine though).
  • “wbkout = xlwt.Workbook()” creates a writable workbook object.  Note that this isn’t attached to any file at the moment – and that if you don’t use the command “wbkout.save(“outfile.xls”)” after you’ve written data into the workbook, it won’t ever make it into a file.
  • xlrd has other cool commands like “sh.col(1)”, which puts an entire column of your workbook into an array for you.

And the read:

  • “numsheets = wbkin.nsheets” gives you the number of worksheets in the input file.  In this case, there are 3 (“sheet 1”, “sheet 2” and “sheet 3”).
  • “sh = wbkin.sheet_by_index(0)” gives you a pointer to the worksheet numbered 0 (in this case “sheet 1” – note that Python counts from 0 rather than 1).  If you know what your worksheets are called, then you could use sh = wbkin.sheet_by_name(“Sheet1”) instead.
  • “sh.nrows” and “sh.ncols” gives you the number of rows and columns in the current worksheet; in this case, 5 and 3 respectively.
  • “sh.cell_value(row,col)” gives you the contents of a worksheet cell. “sh.cell_value(0,1)” gives you the contents of the first row, and the second column in that row (remember that Python counts from 0); in this case, “Title”.
  • “merges = sh.merged_cells” should give a list of the merged cells in the worksheet – which is really useful because these appear lots in development data and are a pain to unmerge.  But this command doesn’t work for me: “sh.merged_cells” gives me an empty set for the Excel file above.

Print out contents of first worksheet:

(0,0) Id

(0,1) Title

(0,2) Notes

(1,0) 1.0

(1,1) title1

(1,2) This is a note.

(2,0) 2.0

(2,1) title 2

(2,2) This is a merged cell

(3,0) 4.0

(3,1) Fourth title

(3,2)

(4,0) 5.0

(4,1) Title 5

(4,2) This is a note on line 5.

  • This is what Python thinks the contents of this worksheet are.Note the defensive piece of programming in the code: “str(sh.cell_value(r,c))” – this takes the value in the cell and converts it from whatever format xlrd thinks it is (notice that all the Ids have been turned into floating-point numbers, for instance) into a string. Which stops the print() statement from breaking because it’s been fed a number.
  • Note that xlrd happily read in all the rows and columns, but didn’t notice that one of the cells was merged.  So far, the contents of every merged cell that I’ve read in has been assigned to the row and column in its top left-hand corner (this might be helpful to you later).

And write:

  • sheet = wbkout.add_sheet(“sheetname”, cell_overwrite_ok=True)” adds a sheet named “sheetname” to the output workbook.  “cell_overwrite_ok=True” is a useful command because it lets you add cells to the workbook in a nonlinear way (e.g. you don’t have to write [0,0] then [0,1] then [0,2] then [1,0] etc etc). 
  • “sheet.write(0,1, “more in the first row”)” writes the text “more in the first row” to the cell in the first row and second column of the worksheet. 
  • “wbkout.save(“outfile.xls”)” saves the workbook to file “outfile.xls”.  Without this command, you’ll have a workbook object but it won’t be written to any files. 

That’s it for now.  As usual, if you get stuck, try searching StackOverflow and the rest of the internet for the error message you got – almost always, someone’s had the same problem before.  Onwards!