Who is the Daily Muckiest?


One of the hits I make each day as part of my morning reading is The Daily Muck. It's a quick and easy digest of investigative news stories published by one of the hotter political blogs, TPMmuckraker.com.

I doubt it's anywhere near as well-read as other tip sheets like the Drudge Report or The Note, but I find it to be a good way to keep tabs on the pulse of Washington, particularly what issues are resonating with the young, left-of-center types embodied by the site's leader, Josh Marshall.

Marshall's greatest claim to fame is his site's role in driving the scandal currently nagging at Attorney General Alberto Gonzales. There is a conventional wisdom already taking shape in Washington. It goes like this: By collecting, digesting and analyzing a wide menu of news stories from across the country, Marshall and his fellow bloggers were the first journalists to connect all of the dots and press for greater inquiry into the Department of Justice's firing of a number of high-level prosecutors.

As the story goes, TPM's virtue wasn't in discovering any new information (though they have done that). Instead, they're credited for assiduously surveying the information available and pressing their conclusions.

Thinking about this got me interested in knowing a little bit more about where they get their information and what sources they favor.

So this morning I fired off a quick experiment. I set out to find out which Web sites and news organizations the Daily Muck cites most. Who, in other words, is the muckiest?

For anyone who's interested, here is a spreadsheet with the results.

What you can see is that Daily Muck citations have been dominated by the print news outlets that most closely cover Washington politics. The Washington Post, The NY Times, Roll Call, and the wire services provided by Boston.com and Yahoo News dominate the list. You'll note that the McClatchy team at RealCities.com (formerly Knight-Ridder), so widely celebrated for its critical coverage of the buildup to the Iraq war, has won frequent citation. As you move down the list, you can see that the remainder is filled out by mid-sized newspapers, Washington blogs, and a few alternative news sources, like my employer, The Center for Public Integrity.

When you're looking at the ranking, there are a couple of things to keep in mind.

One: One link does not equal one citation. In many cases, a single post contains multiple links.

Two: All those links to Boston.com do not mean that the Globe had the most stories featured. I haven't exhaustively studied the postings, but my cursory analysis this morning suggests that the Daily Muck's authors often use Boston.com as their source for stories written by the Associated Press and other wire services. You'd have to examine the records more closely to figure out just where the Globe, or the AP, figures in the reckoning.

Three: I have not thoroughly standardized the records. Sites that have variations in their domain name (thehill.com vs. www.thehill.com or www.washingtonpost.com vs. blogs.washingtonpost.com) are reported as separate entities. The data could certainly use some more scrubbing, so don't treat it as gospel.
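If you wanted to fold those domain variants together before counting, a one-line pass with sed would handle the most common case. This is just a sketch: it assumes the muckurls.txt file produced by the script below, only strips a leading 'www.', and wouldn't resolve subdomain cases like blogs.washingtonpost.com, which still need a judgment call.

```shell
# Strip a leading "www." so www.thehill.com and thehill.com count as one site.
# muckurls.txt is the domain list written by the Perl script; the cleaned
# output filename is just a suggestion.
sed 's/^www\.//' muckurls.txt > muckurls_clean.txt
```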

Four: I ground this out very quickly. The archive page here provides a scroll-like archive of all the Muck posts reaching back to February 2006. I snatched the posts out of the HTML source code, dumped them into a text file and then parsed out all of the domain names using a quick script. Here it is for any CAR heads or fellow Perl hacks in the crowd.

#!/usr/bin/perl -w

use strict;
use HTML::TokeParser;

my $folder = 'C:/temp';

my $Muckfile = $folder . '/muckfile.txt';
my $Muckurls = $folder . '/muckurls.txt';

open( my $muckurls, ">", $Muckurls ) or die "Can't open $Muckurls: $!";

# HTML::TokeParser opens the source file itself when given a filename.
my $p = HTML::TokeParser->new($Muckfile) or die "Can't parse $Muckfile: $!";

while ( my $token = $p->get_tag("a") ) {
    my $url = $token->[1]{href} || "-";
    # Filter one: keep only full 'http' links, dropping internal references.
    if ( $url =~ m/http/ ) {
        # Filter two: pull out the domain name between 'http://' and the
        # first '/'.
        if ( $url =~ m/\bhttp:\/\/(.*?)\b\/\b/ ) {
            my $cleanurl = $1;
            # Filter three: skip links back to TPM's own pages.
            if ( $cleanurl !~ m/www\.tpmmuckraker\.com/ ) {
                print $muckurls "$cleanurl\n";
            }
        }
    }
}

close $muckurls;

I used the HTML::TokeParser module to extract all of the links. You'll note that there are three tiers in my loop, each with its own regular expression, and each intended to act as a filter. The first limits the results to links whose URLs open with the standard 'http,' thereby eliminating a number of internal references. The second extracts the string found between a URL's 'http' stem and the '/' following its domain name, which standardizes links from different pages published by the same site under their common root (e.g. washingtonpost.com). The third removes any links to pages on the TPM site, since we're only interested here in the links they make to other sites.
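For anyone wondering how the flat file of domain names becomes a ranking, the tally is the classic command-line pipeline. This sketch assumes the muckurls.txt file the script writes:

```shell
# Count how many times each domain appears, most-cited first.
sort muckurls.txt | uniq -c | sort -rn
```

From there the counts paste straight into a spreadsheet.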

It's quick and it's dirty, but it gives us a rough idea.