9. Totals

In some ways, your database is no different from a human source. Getting a good story requires careful, thorough questioning.

In this section we will use pandas to interview our data as we continue our quest to find out the biggest donors for and against Proposition 64.

9.1. Sum a column

Using tricks we learned as far back as chapter two, we can start off by answering a simple question: What is the total sum of Proposition 64 contributions that have been reported?

To answer that let’s start by getting our hands on amount, the column from the contributions DataFrame with numbers in it. We can do that just as we did with other columns earlier.

import pandas as pd
committee_list = pd.read_csv("https://raw.githubusercontent.com/california-civic-data-coalition/first-python-notebook/master/docs/src/_static/committees.csv")
contrib_list = pd.read_csv("https://raw.githubusercontent.com/california-civic-data-coalition/first-python-notebook/master/docs/src/_static/contributions.csv")
my_prop = 'PROPOSITION 064- MARIJUANA LEGALIZATION. INITIATIVE STATUTE.'
merged_everything = pd.merge(committee_list, contrib_list, on="calaccess_committee_id")
merged_prop = merged_everything[merged_everything.prop_name == my_prop]
merged_prop.amount
2102      550000.00
2103      420000.00
2104     1000000.00
2105     1970000.00
2106     1403000.00
            ...    
83884      30000.00
83885       4185.00
83886        817.00
83887        408.64
83888        871.00
Name: amount, Length: 860, dtype: float64

Now we can add up the column’s total using the pandas method sum, just as we did when we were first getting started.

merged_prop.amount.sum()
35177017.64

We’ve completed our first piece of analysis and discovered the total amount spent on this proposition. Time to run off to Twitter and publish our results to the world, right?

Wrong.

9.2. How to not be wrong

The total we generated is not the overall total raised in the campaign, and it is guaranteed to be lower than the totals reported in the media and by the campaigns.

Why?

In California, campaigns are only required to disclose the names of donors who give over $100, so our data is missing all of the donors who gave less than that amount.

The cutoff varies, and there are some exceptions, but the same thing is true in other states and also at the federal level in races for Congress and the White House.

The overall totals are instead reported on cover sheets included with disclosure reports that lump together all the smaller contributions as part of a grand total. Those are the records most commonly cited to total up a campaign’s fundraising.

The result is that an itemized list of contributions, like the one we have, cannot be used to calculate a grand total. That’s true in California and virtually anywhere else you work with campaign data. Overlooking that limitation is a rookie mistake routinely made by analysts new to this field.

But that doesn’t mean our data are worthless. We just have to use our list responsibly. In many cases, professional campaign reporters will refer to an analysis like ours as applying only to “large donors.”

Since large donors typically account for most of the money, the results are still significant. And the high level of detail included in each record — like the donor’s name, employer and occupation — makes the limitations worth working through.

9.3. Which side raised more?

Adding up a big total is all well and good. But we’re aiming for something more nuanced.

We want to separate the money spent supporting the proposition from the money opposing it. Then we want to find out which side raised more.

To answer that question, let’s return to the filtering technique we learned in chapter seven. Let’s look at the column we’re going to filter by, committee_position.

merged_prop.committee_position.value_counts()
SUPPORT    762
OPPOSE      98
Name: committee_position, dtype: int64

Filter our merged_prop table down using that column and the pandas filtering method that combines a column, an operator and the value we want to filter by. Let’s stick the result in a variable.

support = merged_prop[merged_prop.committee_position == 'SUPPORT']

Repeat all that for opposing contributions. First the filter into a new variable.

oppose = merged_prop[merged_prop.committee_position == 'OPPOSE']

Sum up the total disclosed contributions to each for comparison. First the opposition.

oppose.amount.sum()
2501211.64

Then the supporters.

support.amount.sum()
32675806.0

The support is clearly larger. But what percent is it of the overall disclosed total? We can find out by combining two sum calculations using Python’s built-in division operator.

support.amount.sum() / merged_prop.amount.sum()
0.9288964270479867