Python Recipe: Read file, find pattern, print matches

By Ben Welsh • April 14, 2008

Our first two recipes focused primarily on how to open one or more files and loop through them line by line. While we paid a little attention to how we could search for patterns using regular expressions, we didn't try to do a whole lot with what we caught. Hell, we didn't even try very hard to write a good regex.

But when you start to get serious about searching for patterns in text, one of the obvious goals is to single out and collect your matches. Maybe you want to pull all the phone numbers out of big blobs of text. Or email addresses. Or anything enclosed in quotation marks. Whatever.

Here's one way to try it.

But before we get going, let me just say that I'm going to assume you read the first couple of recipes, so I won't be working too hard to explain the stuff covered there. And keep in mind that my keystrokes are coming right off my home computer, which runs Ubuntu Linux. I'll try to provide Mac and Windows translations as we go, but I might muck a phoneme here and there. If anything is screwed up and doesn't work on your end, just shoot me an email or drop a comment. We'll iron it out.

Formalities aside, here's the example task I've selected to achieve our mission.

Download the King James Version of the Holy Bible.
Read through each line of text.
Capture each four-letter word.
Print them out.

Let's do it.

1. Open the command line, create a working directory, move there.

We're going to start the same way we did in the first two lessons, creating a working folder for all our files and moving in with our command line.

cd Documents/
mkdir py-search-and-capture
cd py-search-and-capture/

The commands should work just as easily on a Mac as in Linux. If you're working in Windows, you'll be on the "C:/" file structure, rather than the Unix-style structure above. So you might "mkdir" a new working directory in your "C:/TEMP" folder or wherever else you'd like to work. Or just make a folder wherever you like through Windows Explorer and "cd" there after the fact through the command line.

2. Download our source file, The King James Version of the Holy Bible

We're going to use the text file provided by Project Gutenberg as our source. As in the earlier lessons, I'm going to use the curl command-line utility to retrieve the file, but you should feel free to download it to our working directory using your web browser, if you prefer. Either way, save it as "kjv10.txt", since that's the name our script will look for. With curl, the -o option sets the output filename.

curl -o kjv10.txt http://www.gutenberg.org/dirs/etext92/bible11.txt

3. Create our Python script in the text editor of your choice.

vim search.py

The line above, which again should work for Linux or Mac, will open a new file in vim, the command-line text editor that I prefer. You can follow along, or feel free to make your own file in the application you prefer. If you're a newbie Windows user, Notepad should work great.

If you're following along in vim, you'll need to enter "insert mode" so you can start entering text. Do that by hitting the "i" key.

4. Write the code!

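#!/usr/bin/env python
import re

# Open the Bible text for reading
bible = open("kjv10.txt", "r")
# Compile the pattern once, outside the loop
regex = re.compile(r'\b\w{4}\b')

for line in bible:
    # findall() returns every match in this line as a list
    four_letter_words = regex.findall(line)
    for word in four_letter_words:
        print word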

Our file opens by importing the re module, which will allow us to call upon Python's regular expression library. We then open our Bible into a variable of the same name and, as in previous recipes, open a loop that will iterate through each line in the file.

The new stuff comes next. The first statement above our loop uses re's compile method to store our regular expression pattern into a variable called "regex." (As commenter Paddy suggested below, it's a good idea to put it above the loop so it doesn't have to be repeated on each iteration.) Remember, our goal is to match any four-letter words. There are three regex symbols I squished together to give it a hack. They are defined as follows. I've drawn the definitions from this Python reference, which can probably help you crack most nuts.

\b - Word boundary. This is a zero-width assertion that matches only at the beginning or end of a word.
\w - Matches any alphanumeric character, plus the underscore.
{m,n} - There must be at least m repetitions, and at most n. A single number, as in {4}, means exactly that many repetitions.

So when you piece them together like so, "\b\w{4}\b", what you're asking for is any stretch of four alphanumeric characters between two word boundaries. Make sense?
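To see the three pieces working together, here's a minimal sketch run against a made-up sample sentence. The sample text and variable names are my own, not part of the recipe, and it's written for Python 3, where print is a function rather than the statement used in the script above. It also checks that {4} and {4,4} behave the same way.

```python
import re

# Hypothetical sample text, chosen only for illustration.
sample = "Thou shalt not kill; love thy neighbour as thyself."

exact = re.compile(r'\b\w{4}\b')      # exactly four word characters
ranged = re.compile(r'\b\w{4,4}\b')   # at least four and at most four

four_letter_words = exact.findall(sample)
print(four_letter_words)  # ['Thou', 'kill', 'love']

# Both spellings of the quantifier should agree.
print(four_letter_words == ranged.findall(sample))  # True
```

Keep in mind that \w also covers digits and the underscore, so a string like "2008" would count as a four-letter word here too.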

Next, equipped with our regex, we create another variable called "four_letter_words." In it we see our regex variable pressed against a re method we haven't used before. In the previous lessons we used the kludgy match() function to make our hits. Here we're using something more elegant. It's findall(), which will return all the matches within our line as a list. And by connecting it to our pre-compiled "regex" variable, we're setting that as the pattern it should look for.
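If the difference between the methods is fuzzy, this rough Python 3 sketch compares match(), search() and findall() on a single line. The verse is typed in by hand as sample input, and the variable names are just for illustration.

```python
import re

regex = re.compile(r'\b\w{4}\b')

# Sample line typed in by hand for illustration.
line = "And God saw the light, that it was good: and God divided the light from the darkness."

# match() only tests the very start of the string, so it returns None here
# because the line begins with the three-letter word "And".
first = regex.match(line)
print(first)  # None

# search() scans forward but stops at the first hit.
hit = regex.search(line)
print(hit.group())  # that

# findall() returns every non-overlapping match as a list of strings.
print(regex.findall(line))  # ['that', 'good', 'from']
```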

We can expect plenty of lines with more than one match, so we'll set up another loop to run through "four_letter_words" and print out all the hits. And then we're done. Save and quit out of your script (ESC, SHIFT+ZZ in vim) and fire it up from the command-line:

python search.py

And, voila, there you have it. All the four-letter words in the KJV. Every f*cking one.
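If you're on Python 3, where print is a function, the same loop might look something like the sketch below. The helper name find_four_letter_words and the two-line in-memory sample are hypothetical stand-ins so the example runs without the download. Swap in the real file as shown in the trailing comment, assuming it's saved as kjv10.txt.

```python
import io
import re

FOUR_LETTERS = re.compile(r'\b\w{4}\b')

def find_four_letter_words(lines):
    """Yield every four-letter word found in an iterable of lines."""
    for line in lines:
        for word in FOUR_LETTERS.findall(line):
            yield word

# Stand-in for the downloaded file; io.StringIO iterates line by line like a file.
fake_bible = io.StringIO("Thou shalt not kill.\nThou shalt not steal.\n")
words = list(find_four_letter_words(fake_bible))
for word in words:
    print(word)

# With the real file (assumed to be saved as kjv10.txt):
# with open("kjv10.txt", "r") as bible:
#     for word in find_four_letter_words(bible):
#         print(word)
```

Using "with" closes the file for you when the loop finishes, which the original script leaves to the interpreter.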

Unless I messed something up, of course. Per usual, if you spot a screw up, or I'm not being clear, just shoot me an email or drop a comment and we'll sort it out. Hope this is helpful to somebody.

Comments

the gun on 2008.04.15

This definitely demonstrates the magnitude of regular expressions. My usual bombardment of QUESTIONS (these may be pretty rudimentary):
1. If you don't use \w in the regex, will it find anything, including whitespace chars? I noticed we didn't use it in the Shakespeare recipes.
2. "{m,n} - There must be at least m repetitions, and at most n." By just putting the 4 inside {}, does this mean 4 is both m and n? In other words, same as {4,4}?
3. Still not sure I understand the line variable. Is that some standard keyword or our own created variable? I guess I can't figure out if it knows something is a line since we didn't specify that.
4. Why do we open the regex with an r? According to that link you gave (great resource!), is it to avoid escape (\) characters?
5. We don't need a closing/lingering comma?


palewire on 2008.04.15

1. What happens if you don't include anything between the word boundaries? Like "\b\b"? Hmm. I can't think of a situation where you'd find two word boundaries in a row. But I'll never say never when it comes to this stuff.
2. Yep, you're right.
3. I don't know all the details on this one, but my understanding is that "for line in" is a tuned-up version of earlier Python methods for line-reading, like xreadlines(). As a low-level user, I really don't know what's happening behind the scenes. Here are the basic input and output docs.
4. You're right, the purpose of prefixing a regular expression with an "r" is to avoid the "backslash plague" by using r to switch to "raw string notation." That's necessary here because we're using the "\b" word boundary. As the guide I linked to says:

In Python's string literals, "\b" is the backspace character, ASCII value 8. If you're not using raw strings, then Python will convert the "\b" to a backspace, and your RE won't match as you expect it to. The following example looks the same as our previous RE, but omits the "r" in front of the RE string.

>>> p = re.compile('\bclass\b')
>>> print p.search('no class at all')
None
>>> print p.search('\b' + 'class' + '\b')

5. I'm not sure what you mean about a "closing/lingering comma." That's a convention I'm unfamiliar with, unless you mean the "print hits," style comma I might have used in past recipes. By default, Python's print command will follow whatever it spits out with a newline character, "\n", that will kick you down to the next line. But if you stick a comma at the end of your print statements, it will stop doing that.
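For anyone who wants to check the raw-string point from item 4 for themselves, here's a small sketch. It's written for Python 3, and the variable names are made up for the example.

```python
import re

plain = '\bclass\b'   # here '\b' is the backspace character (ASCII 8)
raw = r'\bclass\b'    # r'' keeps the backslash, so the regex engine sees \b

print(len(plain), len(raw))          # 7 9: each raw \b is two characters
print(plain[0] == '\x08')            # True: the first character is a backspace

text = 'no class at all'
print(re.search(plain, text))        # None: there is no backspace in the text
print(re.search(raw, text).group())  # class
```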


ben on 2008.07.03

Hey, I just thought of something. I've just followed through the recipes for Python you have here about searching through text, and I've been downloading all the source text from gutenberg.org. I just went over to the site to see if there were any other source texts I could play around with, and I realised there was a zipped text version of all the files we have been using. Whenever it's available I try to download compressed versions of files, especially on 'free' projects like Gutenberg, as it saves them bandwidth and thus cost. I'm not sure how much traffic your site gets, but even if it's just a little bit you could have cut down on bandwidth used by getting the zip files and unzipping them locally. It would also give you an excuse to show people how to batch unzip files from the CLI.


ben on 2008.07.03

I extended the program to count unique hits only if anyone is interested, or could tell me a more efficient way. (maybe python has a binary tree module or something?)

[code]
#!/usr/bin/env python
import re

bible = open("kjv10.txt","r")
regex = re.compile(r"\b\w{4}\b")
count = 0
words = []
for line in bible:
    fourletterwords = regex.findall(line) # returns all matches as a list
    for word in fourletterwords:
        for w in words:
            if w == word:
                break
        else:
            words.append(word)
            count = count + 1
            print count, word
print count, "unique words"
[/code]
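On the efficiency question: the list scan above compares each new word against every word seen so far. Python's collections.Counter (like the built-in set) uses hashing instead, so each lookup takes roughly constant time on average. Here's a rough Python 3 sketch, with a made-up in-memory sample standing in for the file so it runs on its own.

```python
import io
import re
from collections import Counter

regex = re.compile(r"\b\w{4}\b")

# Stand-in for open("kjv10.txt"); the sample text is made up for illustration.
bible = io.StringIO("Thou shalt not kill.\nThou shalt love thy neighbour.\n")

counts = Counter()
for line in bible:
    # update() tallies each word in the list returned by findall()
    counts.update(regex.findall(line))

print(len(counts), "unique words")
for word, n in counts.most_common():
    print(n, word)
```

If you only need the unique words and not the tallies, a plain set with words.add(word) would do the same job.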
