A couple of friends out there are valiantly teaching themselves the Python programming language in their free time. Who are they? Hack reporters like me, picking up computer skills in a continuing quest to better sift, organize and analyze information. And, in the process, maybe keep our jobs.
There are a couple of great books available free online, but it's pretty tough to start stringing all the fundamentals into a problem-solving script all on your own. So why not write up some simple recipes that attack problems common to our particular tribe?
One of the ways computer programming can be of great use to a reporter is as a text parser. We all have more documents than we have time. So a common challenge is training your computer to read through a big blob of text and return any hits on terms you're interested in (e.g. the name of the mayor, a popular pesticide, a roster of local police officers).
If it's a one-off effort, you can probably get this done quickly using the search tools included in any quality text editor (e.g. Ultraedit, Notepad++, TextMate). But if you've got a steady stream of files, like a weekly dump of court filings, or a really big bad file, sometimes it's preferable to train your computer to do the work for you.
In that spirit, the following instructions are designed to show you how to use Python to search through a text file (The Sonnets of William Shakespeare), find any lines that contain our sample search term ("love"), and then print out the hits into a new file we can keep as a memento.
We'll be dealing with a source file that's probably cleaner than most documents you'll get from the government, and certainly a lot tidier than anything you've converted from a PDF file using an OCR application. But if you're a total newbie, my hope is that this can help you get a grip on how the hell all the pieces described in the textbooks fit together into something almost useful.
Since I'm now a full-time geek, I do most of my work on computers that run some flavor of Linux. The step-by-step instructions that follow will walk you through each keystroke on the command line in Ubuntu, which is what I run at home. But since most people who might be interested in this are probably running Windows XP or Mac OS X, I'll try to include translations as we go.
The one prerequisite for the whole endeavor is that you already have a working installation of Python. If you're working in Windows and you don't, I'd recommend visiting ActiveState and downloading the installer for their ActivePython distribution. If you're rocking a Macbook, you can find out whether you're rolling with Python by opening your terminal and entering a version check such as python -V.
If you've got it properly installed, it should return the word "Python" followed by a version number.
If it's not working out, I'd recommend the installation instructions in Mark Pilgrim's excellent book, Dive Into Python.
Alright, with all that out of the way, let's get to the recipe.
cd $HOME
mkdir Documents/py-search
cd Documents/py-search
The three commands above, which should work just as easily on a Mac as in Linux, will move you to your home directory, create a new subdirectory in your Documents folder, and relocate you to the new folder.
If you're working in Windows, you'll be on the "C:\" file structure, rather than the Unix-style structure above. So you might "mkdir" a new working directory in your "C:\TEMP" folder or wherever else you'd like to work. Or just make a folder wherever you like through Windows Explorer and "cd" there after the fact through the command line.
curl -O http://www.gutenberg.org/dirs/etext97/wssnt10.txt
The line above uses the curl command line utility to download a copy of Shakespeare's work from the Project Gutenberg Web site. Mac users with curl installed should be able to issue the same command. Windows users, or anyone without curl, will probably find it easiest to snatch the file by visiting the link in a web browser and saving it to the working directory created in step one.
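If you don't have curl and would rather not leave the command line, Python can fetch the file too. What follows is only a sketch, written for Python 3 and its standard urllib.request module. The download() function and the SONNETS_URL name are my own inventions for illustration, and the URL is simply the one from the curl command above, so it may have moved by the time you read this.

```python
import shutil
import urllib.request

# Copied from the curl command above; the address may have changed since.
SONNETS_URL = "http://www.gutenberg.org/dirs/etext97/wssnt10.txt"


def download(url, destination):
    """Fetch url and save the raw bytes to the file named by destination."""
    with urllib.request.urlopen(url) as response, open(destination, "wb") as out:
        # Stream the response straight to disk without loading it all in memory.
        shutil.copyfileobj(response, out)
    return destination


# Example call (hypothetical usage, needs a network connection):
# download(SONNETS_URL, "wssnt10.txt")
```

The call at the bottom is commented out so nothing hits the network until you ask it to. Uncomment it, run the script from your working directory, and the file should land right next to it.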
The line above, which again should work for Linux or Mac, will open a new file in vim, the command-line text editor that I prefer. You can follow along, or feel free to make your own file in the application you prefer. If you're a newbie Windows user, Notepad should work great.
If you're following along in vim, you'll need to enter "insert mode" so you can start entering text. Do that by hitting the i key. Then type in the following:
#!/usr/bin/env python
import re

shakes = open("wssnt10.txt", "r")

for line in shakes:
    if re.match("(.*)(L|l)ove(.*)", line):
        print line,
If, like my friends, you've been working through some common Python tutorials, I'm guessing a lot of that looks familiar to you.
The first line is a "shebang" that, on execution of the file, instructs the computer to process the script using the python interpreter. The "import re" pulls in Python's standard regular expression module so we can use it later to search the text. The open() command grabs the Shakespeare file we've just downloaded and opens it up. The "r" is for read mode.
The three staggered statements that follow are a loop that runs through each line in the document, as dictated by the first statement. The second statement uses the re.match() function we imported at the top to evaluate the latest line on each iteration through the loop by testing it against that scary looking mess in its first parameter.
So, what is that thing? "(.*)(L|l)ove(.*)", say what?
That's a regular expression I designed specifically to catch any instances of the search term I'm after. If you're not familiar with regular expressions, they're a powerful language for matching strings of text. When you first get started, they can be a bit intimidating, but once you learn a couple of tricks, you'll quickly see how useful they can be. One of my favorite geek jokes is this cartoon on the utility of a well-timed "regex."
So how does it work? There are two tricks to learn. Remember, our goal is to find any line in Shakespeare's sonnets that includes the word love. But, when we think about it, we can't just search for "love" because our loop is evaluating the text line by line, not word by word, and re.match() only tests for a match at the very beginning of each line. So if we just ask for "love," we'd only get lines that start with "love." Plus the word could appear in any number of common grammatical variations (e.g. "Love," "lover," "lovesick," "self-love") that we'd also like to capture.
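To see why a bare "love" falls short, here's a small sketch. It's written for Python 3, where print is a function, and the sample lines are ones I made up for illustration rather than quotes pulled from the file. It also shows re.search(), a standard-library cousin of re.match() that scans the whole line. The recipe doesn't use it, but it's handy to know it exists.

```python
import re

# Made-up sample lines for illustration; not taken from wssnt10.txt.
lines = [
    "love is my sin",
    "So true a fool is love",
    "My glass shall not persuade me",
]

# re.match() only tests the very start of each string.
starts_with = [line for line in lines if re.match("love", line)]

# re.search() scans the whole string for the pattern.
contains = [line for line in lines if re.search("love", line)]

print(starts_with)
print(contains)
```

Only the first sample survives the re.match() test, while re.search() also picks up the line where "love" comes last.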
That's where the regular expressions come in. You'll notice that the expression is bracketed by two "(.*)" statements. In regular expression language, the "." command matches any single character (other than a line break) and the "*" repeats whatever command precedes it zero or more times, so together they will match any string of any length. When bracketed around a search term, like "love," it should return a match on a line of text regardless of where in the line "love" appears. In other words, it would match "She loves you," "love is a many splendored thing" or "ain't talking 'bout love."
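If you'd like to check that claim for yourself, here's a quick sketch in Python 3 that runs the lowercase-only pattern against those three titles. The variable names are mine, picked just for this example.

```python
import re

pattern = "(.*)love(.*)"

titles = [
    "She loves you",
    "love is a many splendored thing",
    "ain't talking 'bout love",
]

for title in titles:
    # re.match() returns a match object on success and None on failure.
    matched = re.match(pattern, title) is not None
    print(title, "->", matched)

# True only if every title matched the pattern.
all_matched = all(re.match(pattern, title) for title in titles)

# The two "(.*)" groups capture whatever surrounds the search term.
before, after = re.match(pattern, "She loves you").groups()
```

Because each "(.*)" is happy to match nothing at all, the pattern works whether "love" sits at the start, the middle or the end of the line.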
But, all by itself, "(.*)love(.*)" wouldn't match "America: What Time is Love?" or "Love Is Only A Feeling." Why not? Because those songs have an uppercase L and we're just asking for lowercase. Bummer, right?
One way to fix that would be to add an option that gives the regular expression variations on the term to look for. You can do that by adding another parenthesis set and separating the options with a "|" pipe. That's where the "(L|l)" above came from. Combine that with the (.*) commands and we should have a quick and dirty regex to catch the lines we're after.
Quick studies will catch a flaw in the design, though. As we'll see in our result set later, this sort of dragnet approach will also yield hits on things we might not want to catch. Words like "glove" and "lovely" will match just as easily as "lovesick" or "lover." Feel free to tweak the statement and try to fine-tune your results. There's a ton more you can do with regular expressions than what I've described. So don't take my example too seriously. I just wanted to show off a couple of the most common regex commands.
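If you want a head start on the tweaking, here's one possible direction, sketched in Python 3. None of this is part of the recipe itself: the sample phrases are made up, and the "tight" and "exact" patterns are just my suggestions. They lean on two standard features of the re module, the \b word-boundary marker and the re.IGNORECASE flag.

```python
import re

# Made-up sample phrases for illustration.
samples = [
    "Love Is Only A Feeling",
    "a lovely day",
    "hand in glove",
    "self-love",
    "a lover's quarrel",
]

# The dragnet pattern from the recipe.
loose = "(.*)(L|l)ove(.*)"

# A suggested tweak: \b demands a word boundary right before "love",
# and re.IGNORECASE takes care of upper and lower case in one go.
tight = re.compile(r"\blove", re.IGNORECASE)

# Stricter still: a boundary on both sides keeps only the bare word.
exact = re.compile(r"\blove\b", re.IGNORECASE)

loose_hits = [s for s in samples if re.match(loose, s)]
tight_hits = [s for s in samples if tight.search(s)]
exact_hits = [s for s in samples if exact.search(s)]

print(loose_hits)
print(tight_hits)
print(exact_hits)
```

The loose pattern catches all five phrases. The tight one drops "hand in glove" but still keeps "a lovely day," and the exact one keeps only the first phrase and "self-love." Which of those you want depends on the story you're chasing.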
If you're working along with me in vim, you'll need to save your work before exiting. The easiest way to do that is to exit insert mode by hitting the ESC key and then hold SHIFT and hit the Z key twice in a row. If you're working in your own text editor, just save it however you're comfortable.
Now jump back onto the command line resting in your working directory and tell python to fire that mother off.
Voila. Flying across your screen is every line in Shakespeare's sonnets containing the word love. And if you wanted to print them out to a new text file, rather than just dump them on the screen, jump back into your script and try something more like this.
#!/usr/bin/env python
import re

shakes = open("wssnt10.txt", "r")
love = open("love.txt", "w")

for line in shakes:
    if re.match("(.*)(L|l)ove(.*)", line):
        print >> love, line,
Now just open love.txt and you should find the same results as before.
The only difference in this script is that we're now opening an outfile called love (notice that it's "w" mode, for write, rather than "r" mode like the source) and modifying our print line to kick the results there, instead of the console.
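One caveat: the print syntax in the scripts above belongs to Python 2. If your installation is Python 3, something like the sketch below should do the same job. The save_matches() function and its parameter names are hypothetical, made up for this example, and the encoding settings are an assumption on my part about the source file rather than something I've verified.

```python
import re


def save_matches(source_path, output_path, pattern="(.*)(L|l)ove(.*)"):
    """Copy every line of source_path that matches pattern into output_path."""
    count = 0
    # The "with" statement closes both files for us when the block ends.
    # Assumption: the text decodes as UTF-8; errors="replace" swaps in a
    # placeholder for any stray bytes instead of crashing.
    with open(source_path, "r", encoding="utf-8", errors="replace") as source, \
            open(output_path, "w", encoding="utf-8") as output:
        for line in source:
            if re.match(pattern, line):
                # Each line already ends with a newline, so write it as-is.
                output.write(line)
                count += 1
    return count


# Example call, assuming wssnt10.txt sits in the working directory:
# save_matches("wssnt10.txt", "love.txt")
```

The function hands back the number of lines it wrote, which makes for a quick sanity check before you open love.txt.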
That's all folks. Per usual, if I've screwed something up, or I'm not being clear, just shoot me an email or drop a comment and we'll sort it out. Hope this is helpful to somebody.