[Cross-posted from ICanHazDataScience]
Okay, that last post was a bit long for Emily… she fell asleep on my desk long before I’d finished typing. So today we’re back to short and practical.
Data is not just numbers. Numbers are just one of the basic types of data that appear again and again in data science. Two other types are words (as in written text, like this blogpost) and networks (as in objects connected with links – like a diagram of your Twitter friends and your friends’ friends etc). Today we’re looking at words.
In the last post, I was looking at a set of online job descriptions. We’ll leave the basics of webpage scraping till later (but if you’re curious, ScraperWiki’s notes are good) and assume that what we have is a set of text files, and that we’ve used the code from the “Processing all the teh Files in Directory” post, with the commands
fin = open(infile_fullname, "rb")
bigstring += " " + fin.read()
to add the text from each file into one big string, which we’ve (rather imaginatively) called bigstring.
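If you don't have that earlier post to hand, here's a minimal sketch of the same idea, written for Python 3. The function name read_all_text and the directory name in the usage comment are made up for illustration, and decoding the bytes as latin-1 is an assumption on my part (it maps every byte to a character, so it never fails, but your files may really be in a different encoding).

```python
import os

def read_all_text(dirname):
    """Join the text of every file in dirname into one big string."""
    bigstring = ""
    for filename in sorted(os.listdir(dirname)):
        infile_fullname = os.path.join(dirname, filename)
        if not os.path.isfile(infile_fullname):
            continue  # skip subdirectories
        with open(infile_fullname, "rb") as fin:
            # Assumption: latin-1, which turns each byte into one character
            bigstring += " " + fin.read().decode("latin-1")
    return bigstring

# Usage ("jobfiles" is a hypothetical directory name):
# bigstring = read_all_text("jobfiles")
```

Reading in "rb" mode and decoding afterwards keeps the "\r\n" line endings exactly as they are in the files.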
For Dan’s jobsite data, the first 200 characters of bigstring look like this to a human:
Senior Director of Development
SPG/Premium Outlets – NJ
• Responsible for overseeing domestic
And like this to a Python program:
'Title\r\n Senior Director of Development\r\nDepartment/Mall\r\n SPG/Premium Outlets – NJ\r\nLocation\r\n Roseland, NJ\r\n\r\nDescription\r\n\r\n PRIMARY PURPOSE: \r\n\r\n \x95 Responsible for overseeing domestic '
That’s what Python sees: a long sequence of characters, some of which are letters (both uppercase and lowercase), some punctuation (spaces, commas, dashes, full stops etc), character sequences (“\r\n”) to show line endings and special codes (“\x95”) for other characters like “•”.
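If you'd like to see this for yourself, here's a small sketch in Python 3. The sample string is a shortened, made-up stand-in for the jobsite text, not the real data. repr() shows the string the way Python stores it, and ord() gives the numeric code behind each character.

```python
sample = "Title\r\n Senior Director\r\n \x95 Responsible"

# repr() spells out the escape codes instead of acting on them
print(repr(sample))

# Every character has a numeric code; ord() shows it
codes = [(repr(ch), ord(ch)) for ch in sample[:8]]
print(codes)
```

The "\r" and "\n" each show up as a single character with its own code, which is why they count as characters when you slice the first 200 out of bigstring.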
Your program needs to split that sequence of characters into words. We could do this the hard way – look at each character in bigstring, adding it to a word if it’s alphabetical, and creating a new word if it’s not alphabetical etc., but Python (and many other languages) has a really great shortcut for text processing, known as regular expressions.
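For comparison, the hard way might look something like this sketch (Python 3). The function name split_words_by_hand is my own invention, and it only treats letters as part of a word, so it's an illustration of the idea rather than a drop-in replacement for anything later in this post.

```python
def split_words_by_hand(text):
    """Build words one character at a time, breaking on non-letters."""
    words = []
    current = ""
    for ch in text:
        if ch.isalpha():
            current += ch          # still inside a word
        elif current:
            words.append(current)  # a non-letter ends the current word
            current = ""
    if current:
        words.append(current)      # don't lose the last word
    return words

print(split_words_by_hand("SPG/Premium Outlets - NJ\r\n"))
```

It works, but it's a lot of bookkeeping for something we'll want to do all the time.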
Regular expressions – available in Python through the “re” library – are a fast way of searching for text patterns in strings (including very very large strings). I’m not going to pretend that regular expressions are easy, but I am going to insist that you’ll find them very useful, and it’s worth the pain of learning about them because that allows you to do much more powerful things to your text. For today, I’m going to show you the regular expression that I used to convert bigstring into a list of all the words in it (within limits: for this application, I ignored things like hyphenated words and words with numbers and non-alphabetical characters in them).
The regular expression I used for the jobsites was:
words = re.sub(r'[\W_]+', ' ', bigstring.lower()).split()
This combined the re library function “sub” with some standard Python string functions (“lower” and “split”). First, I wanted the words returned to be all-lowercase (there’s nothing more annoying than getting separate frequencies for “Follow”, “FOLLOW” and “follow” in your results). For this, I used the string expression “bigstring.lower()”. You can do this to any string in Python, and it will lowercase all your text. The result was
lowerstring = 'title\r\n senior director of development\r\ndepartment/mall\r\n spg/premium outlets – nj\r\nlocation\r\n roseland, nj\r\n\r\ndescription\r\n\r\n primary purpose: \r\n\r\n \x95 responsible for overseeing domestic '
(I’ve called this lowerstring so you can see what happens next). Next, I used re.sub(r'[\W_]+', ' ', lowerstring) to convert any run of characters in the text that AREN’T letters or digits (so punctuation, whitespace, line endings, underscores and special codes) into a single space. The result of this is
cleantext = 'title senior director of development department mall spg premium outlets nj location roseland nj description primary purpose responsible for overseeing domestic '
Which just leaves the final step of using the spaces to split the whole text into words. This is what the “split()” function does: if you give it a separator as a parameter, e.g. split(","), it will convert the text string into a list of all the text between each occurrence of that separator, but if you leave the parameter blank, e.g. split(), it creates a list of all the text between runs of whitespace. For the jobsite example, that list starts like this:
['title', 'senior', 'director', 'of', 'development', 'department', 'mall', 'spg', 'premium', 'outlets', 'nj', 'location', 'roseland', 'nj', 'description', 'primary', 'purpose', 'responsible', 'for', 'overseeing', 'domestic' …]
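Here are the three steps pulled apart so you can poke at each one, as a sketch in Python 3. The bigstring below is a shortened stand-in for the jobsite text (with a plain hyphen in place of the dash), not the real data.

```python
import re

# Shortened stand-in for the jobsite text
bigstring = ("Title\r\n Senior Director of Development\r\n"
             "Department/Mall\r\n SPG/Premium Outlets - NJ\r\n")

lowerstring = bigstring.lower()                  # step 1: lowercase everything
cleantext = re.sub(r"[\W_]+", " ", lowerstring)  # step 2: non-word runs become one space
words = cleantext.split()                        # step 3: split on whitespace

print(words)

# split() with no argument swallows runs of whitespace;
# split(" ") treats every single space as a separator
print("a  b ".split())     # ['a', 'b']
print("a  b ".split(" "))  # ['a', '', 'b', '']
```

One thing to watch: in a regular expression, \W means "not a letter, digit or underscore", so digits survive the cleaning step. If you want to drop those too, you'd need a different pattern.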
Next post, we’re going to look at the code needed to do useful things with this simple list of words, and at some of the issues (like word stems) that simple lists of words can have.