11. Group

To take the next steps in our investigation, we’ll need to learn a new trick.

It’s called groupby, a pandas method that allows you to group a DataFrame by a column and then calculate a sum, or any other statistic, for each unique value.

11.1. Group by one field

As we’ve been digging through the data, I’m sure a few questions have popped into your mind. One interesting field in the contributions list is the home state of the contributor. A natural question follows: How much of the money came from outside of California?

If you scroll back up and look carefully at the info command we ran after merging out data, you will noticed it includes a column named contributor_state.

That’s the field we want to group with here. Here’s how you do it.

import pandas as pd
committee_list = pd.read_csv("https://raw.githubusercontent.com/california-civic-data-coalition/first-python-notebook/master/docs/src/_static/committees.csv")
contrib_list = pd.read_csv("https://raw.githubusercontent.com/california-civic-data-coalition/first-python-notebook/master/docs/src/_static/contributions.csv")
my_prop = 'PROPOSITION 064- MARIJUANA LEGALIZATION. INITIATIVE STATUTE.'
my_committees = committee_list[committee_list.prop_name == my_prop]
merged_everything = pd.merge(committee_list, contrib_list, on="calaccess_committee_id")
merged_prop = merged_everything[merged_everything.prop_name == my_prop]
support = merged_prop[merged_prop.committee_position == 'SUPPORT']
oppose = merged_prop[merged_prop.committee_position == 'OPPOSE']
merged_prop.groupby("contributor_state")
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f096860c8e0>

A nice start but you’ll notice you don’t get much back. The data’s been grouped by state, but we haven’t chosen what to do with it yet. We want totals by state, so we can tack on code that will sum the amount field the same way we did earlier for the entire DataFrame.

merged_prop.groupby("contributor_state").amount.sum()
contributor_state
AB           0.00
AL         100.00
AZ         500.00
BC           0.00
CA    17491193.00
CO       39300.00
CT         100.00
DC     2618600.00
DE     6140100.00
FL     1250850.00
GA         100.00
ID        1000.00
IL        7000.00
LA         100.00
MA         600.00
MD        2971.00
MI         400.00
MO         550.00
NC         200.00
NV        1800.00
NY     5446985.00
OH        2950.00
OK         200.00
OR        1400.00
PA     1215475.00
RI        6700.00
TX        1800.00
VA      895243.64
WA       50800.00
Name: amount, dtype: float64

Again our data has come back as an ugly Series. To reformat it as a pretty DataFrame use the reset_index method again.

merged_prop.groupby("contributor_state").amount.sum().reset_index()
contributor_state amount
0 AB 0.00
1 AL 100.00
2 AZ 500.00
3 BC 0.00
4 CA 17491193.00
5 CO 39300.00
6 CT 100.00
7 DC 2618600.00
8 DE 6140100.00
9 FL 1250850.00
10 GA 100.00
11 ID 1000.00
12 IL 7000.00
13 LA 100.00
14 MA 600.00
15 MD 2971.00
16 MI 400.00
17 MO 550.00
18 NC 200.00
19 NV 1800.00
20 NY 5446985.00
21 OH 2950.00
22 OK 200.00
23 OR 1400.00
24 PA 1215475.00
25 RI 6700.00
26 TX 1800.00
27 VA 895243.64
28 WA 50800.00

Next re-sort totals from highest to lowest. Remember the sort_values trick we learned earlier? That’ll do it.

merged_prop.groupby("contributor_state").amount.sum().reset_index().sort_values("amount", ascending=False)
contributor_state amount
4 CA 17491193.00
8 DE 6140100.00
20 NY 5446985.00
7 DC 2618600.00
9 FL 1250850.00
24 PA 1215475.00
27 VA 895243.64
28 WA 50800.00
5 CO 39300.00
12 IL 7000.00
25 RI 6700.00
15 MD 2971.00
21 OH 2950.00
26 TX 1800.00
19 NV 1800.00
23 OR 1400.00
11 ID 1000.00
14 MA 600.00
17 MO 550.00
2 AZ 500.00
16 MI 400.00
18 NC 200.00
22 OK 200.00
1 AL 100.00
13 LA 100.00
10 GA 100.00
6 CT 100.00
3 BC 0.00
0 AB 0.00

11.2. Group by multiple fields

Finding the top contributors is almost as easy, but since the first and last names are split between two fields we’ll need to submit them to groupby as a list. Copy the last line above, and replace contributor_state with a list.

merged_prop.groupby(["contributor_firstname", "contributor_lastname"]).amount.sum().reset_index().sort_values("amount", ascending=False)
contributor_firstname contributor_lastname amount
483 SEAN PARKER AND AFFILIATED ENTITIES 7250000.0
482 SEAN PARKER 1500000.0
281 JULIE SCHAUER 1364000.0
108 DANIEL LEWIS 1250000.0
194 HENRY VAN AMERINGEN 1000000.0
... ... ... ...
430 RICH GOLDBERG 25.0
358 MARSTON MEDLOCK 0.0
497 STEPHANIE WARREN-SAMARIPA 0.0
16 AMIRGHASEM SAMIMI-ARDESTAN 0.0
136 DAVID WOOD 0.0

549 rows × 3 columns

Note

You may notice that several of the top contributors appear to be the same person with their name entered in slightly different ways.

This is another important lesson of campaign contributions data. Virtually none of the data is standardized by the campaigns or the government.

The onus is on the analyst to show caution and responsibly combine records where the name fields refer to the same person.

To find out if each contributor supported or opposed the measure, you simply add the committee_position field to our groupby method.

merged_prop.groupby(["contributor_firstname", "contributor_lastname", "committee_position"]).amount.sum().reset_index().sort_values("amount", ascending=False)
contributor_firstname contributor_lastname committee_position amount
483 SEAN PARKER AND AFFILIATED ENTITIES SUPPORT 7250000.0
482 SEAN PARKER SUPPORT 1500000.0
281 JULIE SCHAUER OPPOSE 1364000.0
108 DANIEL LEWIS SUPPORT 1250000.0
194 HENRY VAN AMERINGEN SUPPORT 1000000.0
... ... ... ... ...
430 RICH GOLDBERG SUPPORT 25.0
358 MARSTON MEDLOCK SUPPORT 0.0
497 STEPHANIE WARREN-SAMARIPA SUPPORT 0.0
16 AMIRGHASEM SAMIMI-ARDESTAN SUPPORT 0.0
136 DAVID WOOD SUPPORT 0.0

549 rows × 4 columns

If you want the top supporters or opponents alone, run those same commands with the support and oppose datasets we filtered earlier. Everything else about the commands would be the same as the first one above.

For the supporters:

support.groupby(["contributor_firstname", "contributor_lastname"]).amount.sum().reset_index().sort_values("amount", ascending=False)
contributor_firstname contributor_lastname amount
444 SEAN PARKER AND AFFILIATED ENTITIES 7250000.0
443 SEAN PARKER 1500000.0
102 DANIEL LEWIS 1250000.0
180 HENRY VAN AMERINGEN 1000000.0
374 NICHOLAS PRITZKER 900000.0
... ... ... ...
328 MARK SLABAUGH 25.0
128 DAVID WOOD 0.0
16 AMIRGHASEM SAMIMI-ARDESTAN 0.0
458 STEPHANIE WARREN-SAMARIPA 0.0
331 MARSTON MEDLOCK 0.0

507 rows × 3 columns

For the opponents:

oppose.groupby(["contributor_firstname", "contributor_lastname"]).amount.sum().reset_index().sort_values("amount", ascending=False)
contributor_firstname contributor_lastname amount
20 JULIE SCHAUER 1364000.0
31 ROBERT ADAMS 6500.0
29 MICHAEL DEAN 5000.0
13 GEORGE BELL 1550.0
12 FARSHAD FARDAD 1000.0
17 JESSICA MACIEL 1000.0
9 EDITH REUVENI 1000.0
3 CHRISTINE MILLER 871.0
25 LORI ROBINSON 817.0
8 DWIGHT BELL 730.0
23 KOLALEH TABIBZADEH 500.0
24 KOUROSH MADDAHI 500.0
0 ANDREW GRZYWA 500.0
39 THOMAS DAWSON 500.0
36 RONALD LAWRENCE 300.0
22 KEN CORNEY 300.0
27 MARTIN RYAN 250.0
14 ISRAEL CARRERO 250.0
30 MIGUEL COLON JR 200.0
37 SAM SPIEGEL 200.0
41 WILLIAM WINTERS 200.0
6 DAVID DARRIN 200.0
19 JOHN ROBERTS 200.0
2 CECILIA GALVIN 150.0
15 JAMES DAVIS 150.0
26 MAJID TALEBI 150.0
5 DAN DESMIDT 150.0
33 ROBERT HANDY 100.0
40 WILLIAM MADDEN 100.0
38 SCOTT SEAMAN 100.0
4 CLARENCE STEWART 100.0
35 RONALD JOHNSON 100.0
34 ROBERT SANDERSON 100.0
1 BILL KOUSENS 100.0
32 ROBERT ELSBERG 100.0
7 DAVID MCCARTNEY 100.0
10 EILEEN JANDRISEVITS,MS,' BY CADC-I, 100.0
28 MIA CHUNG 100.0
11 ERIC NUNEZ 100.0
16 JEREMY FORBES 100.0
18 JOHN CALLAS 100.0
21 KAREN AKAGI 100.0

11.3. How to not be wrong

You’ve done it. Our brief interview is complete and you’ve answered the big question that started our inquiry.

Or so you think! Look again at our rankings above. Now compare them against the ranking we looked at earlier in our sorting lesson.

Study it closely and you’ll see an important difference. All of the contributors without a first name are dropped from our groupby lists. And some of them gave a lot of money.

This is happening because of another pandas quirk. Empty fields are read in by pandas as null values, which is a mathematical representation of nothing. In pandas a null is called a NaN, an abbreviation for “not a number” commonly used in computer programming.

And, bad news, pandas’ groupby method will drop any rows with nulls in the grouping fields. So all those records without a first name were silently excluded from our analysis. Yikes!

Whatever our opinion of the default pandas behavior, it’s something we need to account for, and a reminder that we should never assume we know what computer programming tools are doing under the hood. As with human sources, everything your code tells you should be viewed skeptically and verified.

The solution to this problem is easy. We need to instruct pandas not to drop the null values by adding an extra option to groupby. It’s known as dropna.

merged_prop.groupby(["contributor_firstname", "contributor_lastname"], dropna=False).amount.sum().reset_index().sort_values("amount", ascending=False)
contributor_firstname contributor_lastname amount
483 SEAN PARKER AND AFFILIATED ENTITIES 7250000.0
580 NaN FUND FOR POLICY REFORM 6140000.0
581 NaN FUND FOR POLICY REFORM (NONPROFIT 501 (C)(4)) 3940000.0
573 NaN DRUG POLICY ACTION - NON PROFIT 501C4, YES ON ... 3000000.0
598 NaN NEW APPROACH PAC (MPO) 2615000.0
... ... ... ...
448 ROBERT LOCKART 25.0
16 AMIRGHASEM SAMIMI-ARDESTAN 0.0
358 MARSTON MEDLOCK 0.0
497 STEPHANIE WARREN-SAMARIPA 0.0
136 DAVID WOOD 0.0

620 rows × 3 columns

Now we’ve finally got a ranking we can work with.