Python Recipe: Open a file, read it, print matching lines

By Ben Welsh • April 5, 2008

A couple of friends out there are valiantly teaching themselves the Python programming language in their free time. Who are they? Hack reporters like me, picking up computer skills in a continuing quest to better sift, organize and analyze information. And, in the process, maybe keep our jobs.

There are a couple great books available free online but it's pretty tough to start stringing all the fundamentals into a problem-solving script all on your own. So why not write up some simple recipes that attack problems common to our particular tribe?

One of the ways computer programming can be of great use to a reporter is as a text parser. We all have more documents than we have time. So a common challenge is training your computer to read through a big blob of text and return any hits on terms you're interested in (e.g. the name of the mayor, a popular pesticide, a roster of local police officers).

If it's a one-off effort, you can probably get this done quickly using search tools included in common quality text editors (ex. Ultraedit, Notepad++, TextMate). But if you've got a steady stream of files, like a weekly dump of court filings, or a really big bad file, sometimes it's preferable to train your computer to do the work for you.

In that spirit, the following instructions are designed to show you how to use Python to search through a text file (The Sonnets of William Shakespeare), find any lines that contain our sample search term ("love"), and then print out the hits into a new file we can keep as a memento.

We'll be dealing with a source file that's probably cleaner than most documents you'll get from the government, and certainly a lot tidier than anything you've converted from a PDF file using an OCR application, but if you're a total newbie, my hope is that this can help you get a grip on how the hell all the pieces described in the textbooks fit together into something almost useful.

Since I'm now a full-time geek, I do most of my work on computers that run some flavor of Linux. The step-by-step instructions that follow will walk you through each keystroke on the command line in Ubuntu, which is what I run at home. But since most people who might be interested in this are probably running Windows XP or Mac OS X, I'll try to include translations as we go.

The one prerequisite for the whole endeavor is that you already have a working installation of Python. If you're working in Windows and you don't, I'd recommend visiting ActiveState and downloading the installer for their ActivePython distribution. If you're rocking a MacBook, you can find out whether you're rolling with Python by opening your terminal and entering the following:

which python

If you've got it properly installed, it should return something like

/usr/bin/python

If it's not working out, I'd recommend the installation instructions in Mark Pilgrim's excellent book, Dive Into Python.

Alright, with all that out of the way, let's get to the recipe.

1. Open the command line, create a working directory, move there.

cd $HOME
mkdir Documents/py-search
cd Documents/py-search

The three commands above, which should work just as easily in Mac as in Linux, will move us to our home directory, create a new subdirectory in your Documents folder, and relocate to the new folder.

If you're working in Windows, you'll be on the "C:/" file structure, rather than the Unix-style structure above. So you might "mkdir" a new working directory in your "C:/TEMP" folder or wherever else you'd like to work. Or just make a folder wherever through Windows Explorer and "cd" there after the fact through the command line.

2. Download our source file, The Sonnets of William Shakespeare.

curl -O http://www.gutenberg.org/dirs/etext97/wssnt10.txt

The line above uses the curl command line utility to download a copy of Shakespeare's work from the Project Gutenberg Web site. Mac users with curl installed should be able to issue the same command. Windows users, or anyone without curl, can most easily grab the file by visiting the link in a web browser and saving it to the working directory created in step one.
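If you'd rather let Python do the fetching, here's a minimal sketch using the standard library's urllib.request module in Python 3 syntax. The function name fetch and its default arguments are my own inventions for illustration, and I'm assuming the Gutenberg URL from the curl command still resolves.

```python
import urllib.request

# URL copied from the curl command; Project Gutenberg may have moved the file.
SONNETS_URL = "http://www.gutenberg.org/dirs/etext97/wssnt10.txt"


def fetch(url=SONNETS_URL, filename="wssnt10.txt"):
    """Download url and save it as filename (hypothetical helper)."""
    # urlretrieve() returns a (local_path, headers) pair.
    path, _headers = urllib.request.urlretrieve(url, filename)
    return path
```

Calling fetch() from inside your working directory should leave a wssnt10.txt there, same as the curl command would.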

3. Create our python script in the text editor of your choice.

vim py-search.py

The line above, which again should work for Linux or Mac, will open a new file in vim, the command-line text editor that I prefer. You can follow along, or feel free to make your own file in the application you prefer. If you're a newbie Windows user, Notepad should work great.

If you're following along in vim, you'll need to enter "insert mode" so you can start entering text. Do that by hitting the "i" key.

4. Write the code!

#!/usr/bin/env python

import re

shakes = open("wssnt10.txt", "r")

for line in shakes:
    if re.match("(.*)(L|l)ove(.*)", line):
        print line,

If, like my friends, you've been working through some common Python tutorials, I'm guessing a lot of that looks familiar to you.

The first line is a "shebang" that, on execution of the file, instructs the computer to process the script using the python interpreter. The "import re" pulls in Python's standard regular expression module so we can use it later to search the text. The open() command grabs the Shakespeare file we've just downloaded and opens it up. The "r" is for read mode.

The three staggered statements that follow are a loop that runs through each line in the document, as dictated by the first statement. The second statement uses the re.match() function we imported at the top to evaluate the latest line on each iteration through the loop by testing it against that scary looking mess in its first parameter.
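For anyone following along on Python 3, where print became a function, here's a rough equivalent of the loop. It's a sketch rather than the script above: io.StringIO and the three made-up lines are stand-ins for the real file, so it runs without the download.

```python
import io
import re

# io.StringIO stands in for the opened file (an assumption for this sketch).
shakes = io.StringIO("Shall I compare thee\nFor love of you\nLove is my sin\n")

hits = []
for line in shakes:  # a file-like object hands back one line per pass
    if re.match("(.*)(L|l)ove(.*)", line):
        hits.append(line)
        # Python 3 spelling of Python 2's "print line,": the line already
        # ends in a newline, so we tell print() not to add another.
        print(line, end="")
```

Swap the io.StringIO(...) call for open("wssnt10.txt", "r") and the loop should behave the same way against the real sonnets.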

So, what is that thing? "(.*)(L|l)ove(.*)", say what?

That's a regular expression I designed specifically to catch any instances of the search term I'm after. If you're not familiar with regular expressions, they're a powerful language for matching strings of text. When you first get started, they can be a bit intimidating, but once you learn a couple tricks, you'll quickly see how useful they can be. One of my favorite geek jokes is this cartoon on the utility of a well-timed "regex."

So how does it work? There are two tricks to learn. Remember, our goal is to find any line in Shakespeare's sonnets that includes the word love. But, when we think about it, we can't just search for "love" because our loop is evaluating the text line by line, not word by word, and re.match() only tests a pattern against the beginning of the string it's given. So if we just ask for "love," we'd only get lines that start with "love." Plus the word could appear in any number of common grammatical variations (ex. "Love," "lover," "lovesick," "self-love") that we'd also like to capture.

That's where the regular expressions come in. You'll notice that the expression is bracketed by two "(.*)" statements. In regular expression language, the "." command matches any single character (other than a line break) and the "*" repeats whatever command precedes it zero or more times, so together they will match any string of any length. When bracketed around a search term, like "love," it should return a match on a line of text regardless of where in the line "love" appears. In other words, it would match "She loves you," "love is a many splendored thing" or "ain't talking 'bout love."
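To see the wrapper earn its keep, here's a small sketch in Python 3 syntax that runs the three sample phrases through re.match() with and without the "(.*)" bookends. It also tries re.search(), a standard library function the recipe doesn't use, which scans the whole line on its own. The variable names are just for illustration.

```python
import re

# The three sample phrases, standing in for lines of a file.
samples = [
    "She loves you",
    "love is a many splendored thing",
    "ain't talking 'bout love",
]

# re.match() only tests the pattern at the very start of each line.
bare = [s for s in samples if re.match("love", s)]

# Bracketing the term with (.*) lets the match land anywhere in the line.
wrapped = [s for s in samples if re.match("(.*)love(.*)", s)]

# re.search() looks through the whole line, so it needs no bookends.
searched = [s for s in samples if re.search("love", s)]

print(len(bare), len(wrapped), len(searched))
```

Only the middle phrase survives the bare pattern, because it's the only one that opens with "love." The other two approaches catch all three.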

But, all by itself, "(.*)love(.*)" wouldn't match "America: What Time is Love?" or "Love Is Only A Feeling." Why not? Because those songs have an uppercase L and we're just asking for lowercase. Bummer, right?

One way to fix that would be to add an option that gives the regular expression variations on the term to look for. You can do that by adding another parenthesis set and separating the options with a "|" pipe. That's where the "(L|l)" above came from. Combine that with the (.*) commands and we should have a quick and dirty regex to catch the lines we're after. Quick studies will catch a flaw in the design, though. As we'll see in our result set later, this sort of dragnet approach will also yield hits on things we might not want to catch: words like "glove" and "lovely" will match just as easily as "lovesick" or "lover." Feel free to tweak the statement and try to fine-tune your results. There's a ton more you can do with regular expressions than what I've described. So don't take my example too seriously. I just wanted to show off a couple of the most common regex commands.
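If you want a head start on the tweaking, here's one possible tightening, sketched in Python 3 syntax. It's my own guess at a stricter pattern, not the one the recipe uses: "\b" marks a word boundary, the optional group allows a few endings, and the re.IGNORECASE flag replaces the "(L|l)" trick. The sample lines are made up.

```python
import re

# Made-up sample lines, for illustration only.
samples = [
    "Love Is Only A Feeling",
    "She loves you",
    "a glove upon that hand",
    "my lovely boy",
    "self-love",
]

# The recipe's dragnet: any line containing "Love" or "love" anywhere.
dragnet = re.compile("(.*)(L|l)ove(.*)")

# A stricter variant (an assumption, not the recipe's pattern):
# \b is a word boundary, (s|r|rs)? allows "loves", "lover" and "lovers",
# and IGNORECASE accepts any mix of upper and lower case.
strict = re.compile(r"\blove(s|r|rs)?\b", re.IGNORECASE)

dragnet_hits = [s for s in samples if dragnet.match(s)]
strict_hits = [s for s in samples if strict.search(s)]

print(len(dragnet_hits), len(strict_hits))
```

The dragnet keeps all five lines, while the stricter pattern drops the "glove" and "lovely" ones. Whether "lovely" or "lovesick" should count is a judgment call, so adjust the list of endings to taste.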

5. Save your script and run it.

If you're working along with me in vim, you'll need to save your work before exiting. The easiest way to do that is to exit insert mode by hitting the ESC key and then hold SHIFT and hit the Z key twice in a row. If you're working in your own text editor, just save it however you're comfortable.

Now jump back onto the command line resting in your working directory and tell python to fire that mother off.

python py-search.py

Voila. There they are, flying across your screen: every line in Shakespeare's sonnets containing the word love. And if you wanted to print them out to a new text file, rather than just dump them on the screen, jump back into your script and try something more like this.

#!/usr/bin/env python
import re

shakes = open("wssnt10.txt", "r")
love = open("love.txt", "w")

for line in shakes:
    if re.match("(.*)(L|l)ove(.*)", line):
        print >> love, line,

Now just open love.txt and you should find the same results as before.

The only difference in this script is that we're now opening an outfile called love (notice that it's "w" mode, for write, rather than "r" mode like the source) and modifying our print line to kick the results there, instead of the console.
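The "print >>" syntax is Python 2 only, so here's a rough Python 3 take on the same idea. It's a sketch under my own assumptions: the function name save_matches and its parameters are hypothetical, and I've used the with statement, which closes both files for you when the block ends.

```python
import re


def save_matches(source_path, out_path, pattern="(.*)(L|l)ove(.*)"):
    """Copy each line of source_path that matches pattern into out_path.

    Hypothetical helper for illustration; returns the number of hits.
    """
    hits = 0
    # "r" opens the source for reading, "w" opens the outfile for writing.
    with open(source_path, "r") as source, open(out_path, "w") as out:
        for line in source:
            if re.match(pattern, line):
                out.write(line)  # each line keeps its own newline
                hits += 1
    return hits
```

With the sonnets downloaded, save_matches("wssnt10.txt", "love.txt") should write the matching lines to love.txt and hand back a count of how many it found.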

That's all folks. Per usual, if I've screwed something up, or I'm not being clear, just shoot me an email or drop a comment and we'll sort it out. Hope this is helpful to somebody.

Comments

palewire on 2008.04.12

Good catch. The encoding thing you're talking about is well documented here. This is an early experiment with putting code snippets on the blog and I think we've bumped into a bug. Since Python defaults to ASCII encoding, I suppose I should try to find a way to make this easier on folks. I hadn't thought of it, since I personally prefer to retype code from tutorials rather than cut-and-paste, because I have some superstitious belief that it will help me memorize the commands, or at least think a little harder about what the hell they might be doing. I have no real evidence that actually works, but it makes me feel good.

tom on 2008.04.15

ok, the view code thing kicks ass, looks like javascript only? Tried it twice on new recipe. First, cut-and-pasted formatted text from the blog page, got non-ascii error, halted. Then ran code cut-and-pasted from popup plain text "view code" window, ran great. When it comes to copying-and-pasting, does Bash (in Terminal) also care if the text is ascii or utf or iso or whatever? The whole topper thing brings up a novice recurring question for me: why is python's declaration path: $ /usr/bin/env python and not just $ /usr/bin/ which has the python files. is env some kind of pointer file that every system has, sort of a preferences file?

palewire on 2008.04.15

Okay. I'm not sure I follow, but it sounds like maybe the new code formatting is working for you. I assume the Mac terminal must have some sort of encoding scheme (Isn't it in the display preferences or something?), but I really don't know how it works. These guys claim to be on top of it, for whatever that's worth. As far as python's "shebang" goes, you're right, "env" is the home of all your environment variables, which includes the PATH variable that would show the way to your "python" command and interpreter. Try typing "env" into your bash sometime and you'll see what I mean. It also looks like Wikipedia has a helpful entry on the topic. I'm a long way from an expert on this stuff, but I'm sure that on many systems you could happily substitute...

#!/usr/bin/python

... or whatever the direct path is to your python interpreter's binary file and it would still work fine. In fact, I bet that, depending on how you execute the file...

#!python

...might even work. Frankly, the only reason I use /usr/bin/env is superstition, it's the convention I was taught when I started. My guess for why it's better: I think it would probably still work if my python installation was not at /usr/bin/python, but still somewhere else on the PATH. If you explicitly state the location of python, then I suppose it would probably stop working if you moved to a system where it was someplace else. For example, what if you're running Mac OSX and you install Python or Ruby or whatever using Fink or MacPorts or whatever and it puts it in some weird location.