11. Groupby¶
To take the next step towards ranking the top contributors, we’ll need to learn a new trick. It’s called groupby.
It’s a pandas method that allows you to group a DataFrame by a column and then calculate a sum, or any other statistic, for each unique value. This is necessary when you want to rack up statistics on a long list of values, or on a combination of fields.
Note
If you’re into databases and SQL, groupby may sound familiar.
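Before applying it to our contributions data, here’s the idea in miniature, using a tiny DataFrame invented purely for illustration:

```python
import pandas as pd

# A toy DataFrame, invented for illustration
df = pd.DataFrame({
    "state": ["CA", "CA", "NY"],
    "amount": [100.0, 50.0, 25.0],
})

# Group by the state column, then total the amount within each group
totals = df.groupby("state").amount.sum()
print(totals)
# CA totals 150.0; NY totals 25.0
```

The same pattern — group by a column, then pick a statistic — is what we’ll run against the real data below.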
11.1. Grouping by one field¶
As we’ve been digging through the data, I’m sure a few questions have popped into mind. One interesting field in the contributions list is the home state of the contributor. A natural question follows: How much of the money came from outside of California?
If you scroll back up and look carefully at the info command we ran after merging our data, you will notice it includes a column named contributor_state.
That’s the field we want to group with here. Here’s how you do it.
import pandas as pd
committee_list = pd.read_csv("https://raw.githubusercontent.com/california-civic-data-coalition/first-python-notebook/master/docs/src/_static/committees.csv")
contrib_list = pd.read_csv("https://raw.githubusercontent.com/california-civic-data-coalition/first-python-notebook/master/docs/src/_static/contributions.csv")
my_prop = 'PROPOSITION 064- MARIJUANA LEGALIZATION. INITIATIVE STATUTE.'
my_committees = committee_list[committee_list.prop_name == my_prop]
merged_everything = pd.merge(committee_list, contrib_list, on="calaccess_committee_id")
merged_prop = merged_everything[merged_everything.prop_name == my_prop]
support = merged_prop[merged_prop.committee_position == 'SUPPORT']
oppose = merged_prop[merged_prop.committee_position == 'OPPOSE']
merged_prop.groupby("contributor_state")
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7feb4ca245e0>
A nice start. But you’ll notice you don’t get much back. The data’s been grouped by state, but we haven’t chosen what to do with it yet. We want totals by state, so we can sum the amount field the same way we did earlier for the entire DataFrame.
merged_prop.groupby("contributor_state").amount.sum()
contributor_state
AB 0.00
AL 100.00
AZ 500.00
BC 0.00
CA 17491193.00
CO 39300.00
CT 100.00
DC 2618600.00
DE 6140100.00
FL 1250850.00
GA 100.00
ID 1000.00
IL 7000.00
LA 100.00
MA 600.00
MD 2971.00
MI 400.00
MO 550.00
NC 200.00
NV 1800.00
NY 5446985.00
OH 2950.00
OK 200.00
OR 1400.00
PA 1215475.00
RI 6700.00
TX 1800.00
VA 895243.64
WA 50800.00
Name: amount, dtype: float64
Again, our data has come back as an ugly Series. To reformat it as a pretty DataFrame, use the reset_index method again.
merged_prop.groupby("contributor_state").amount.sum().reset_index()
contributor_state | amount | |
---|---|---|
0 | AB | 0.00 |
1 | AL | 100.00 |
2 | AZ | 500.00 |
3 | BC | 0.00 |
4 | CA | 17491193.00 |
5 | CO | 39300.00 |
6 | CT | 100.00 |
7 | DC | 2618600.00 |
8 | DE | 6140100.00 |
9 | FL | 1250850.00 |
10 | GA | 100.00 |
11 | ID | 1000.00 |
12 | IL | 7000.00 |
13 | LA | 100.00 |
14 | MA | 600.00 |
15 | MD | 2971.00 |
16 | MI | 400.00 |
17 | MO | 550.00 |
18 | NC | 200.00 |
19 | NV | 1800.00 |
20 | NY | 5446985.00 |
21 | OH | 2950.00 |
22 | OK | 200.00 |
23 | OR | 1400.00 |
24 | PA | 1215475.00 |
25 | RI | 6700.00 |
26 | TX | 1800.00 |
27 | VA | 895243.64 |
28 | WA | 50800.00 |
Sorting totals from highest to lowest is easy. Remember the sort_values trick we learned earlier? Voila! Here’s our answer:
merged_prop.groupby("contributor_state").amount.sum().reset_index().sort_values("amount", ascending=False)
contributor_state | amount | |
---|---|---|
4 | CA | 17491193.00 |
8 | DE | 6140100.00 |
20 | NY | 5446985.00 |
7 | DC | 2618600.00 |
9 | FL | 1250850.00 |
24 | PA | 1215475.00 |
27 | VA | 895243.64 |
28 | WA | 50800.00 |
5 | CO | 39300.00 |
12 | IL | 7000.00 |
25 | RI | 6700.00 |
15 | MD | 2971.00 |
21 | OH | 2950.00 |
26 | TX | 1800.00 |
19 | NV | 1800.00 |
23 | OR | 1400.00 |
11 | ID | 1000.00 |
14 | MA | 600.00 |
17 | MO | 550.00 |
2 | AZ | 500.00 |
16 | MI | 400.00 |
18 | NC | 200.00 |
22 | OK | 200.00 |
1 | AL | 100.00 |
13 | LA | 100.00 |
10 | GA | 100.00 |
6 | CT | 100.00 |
3 | BC | 0.00 |
0 | AB | 0.00 |
11.2. Grouping by multiple fields¶
Finding the top contributors is almost as easy, but since the first and last names are spread between two fields we’ll need to submit them to groupby as a list. Copy the last line above, and replace “contributor_state” with a list like the one here:
merged_prop.groupby(["contributor_firstname", "contributor_lastname"]).amount.sum().reset_index().sort_values("amount", ascending=False)
contributor_firstname | contributor_lastname | amount | |
---|---|---|---|
483 | SEAN | PARKER AND AFFILIATED ENTITIES | 7250000.0 |
482 | SEAN | PARKER | 1500000.0 |
281 | JULIE | SCHAUER | 1364000.0 |
108 | DANIEL | LEWIS | 1250000.0 |
194 | HENRY | VAN AMERINGEN | 1000000.0 |
... | ... | ... | ... |
430 | RICH | GOLDBERG | 25.0 |
358 | MARSTON | MEDLOCK | 0.0 |
497 | STEPHANIE | WARREN-SAMARIPA | 0.0 |
16 | AMIRGHASEM | SAMIMI-ARDESTAN | 0.0 |
136 | DAVID | WOOD | 0.0 |
549 rows × 3 columns
You should be noticing that several of the top contributors appear to be the same person with their name entered in slightly different ways. This is another important lesson of campaign contributions data. Virtually none of the data is standardized by the campaigns or the government. The onus is on the analyst to show caution and responsibly combine records where the name fields refer to the same person.
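A first cleanup pass often starts with mechanical fixes. Here’s a minimal sketch, on invented example names, of stripping stray whitespace and uppercasing before grouping — it catches the easy near-duplicates, though true standardization still requires eyeballing the results:

```python
import pandas as pd

# Invented example rows showing the same person entered two ways
contribs = pd.DataFrame({
    "contributor_lastname": [" parker", "PARKER ", "SCHAUER"],
    "amount": [100.0, 200.0, 50.0],
})

# Strip whitespace and uppercase so near-duplicates group together
contribs["contributor_lastname"] = (
    contribs["contributor_lastname"].str.strip().str.upper()
)
print(contribs.groupby("contributor_lastname").amount.sum())
# The two PARKER rows now combine into a single 300.0 total
```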
To find out if each contributor supported or opposed the measure, simply add that field to the groupby method.
merged_prop.groupby(["contributor_firstname", "contributor_lastname", "committee_position"]).amount.sum().reset_index().sort_values("amount", ascending=False)
contributor_firstname | contributor_lastname | committee_position | amount | |
---|---|---|---|---|
483 | SEAN | PARKER AND AFFILIATED ENTITIES | SUPPORT | 7250000.0 |
482 | SEAN | PARKER | SUPPORT | 1500000.0 |
281 | JULIE | SCHAUER | OPPOSE | 1364000.0 |
108 | DANIEL | LEWIS | SUPPORT | 1250000.0 |
194 | HENRY | VAN AMERINGEN | SUPPORT | 1000000.0 |
... | ... | ... | ... | ... |
430 | RICH | GOLDBERG | SUPPORT | 25.0 |
358 | MARSTON | MEDLOCK | SUPPORT | 0.0 |
497 | STEPHANIE | WARREN-SAMARIPA | SUPPORT | 0.0 |
16 | AMIRGHASEM | SAMIMI-ARDESTAN | SUPPORT | 0.0 |
136 | DAVID | WOOD | SUPPORT | 0.0 |
549 rows × 4 columns
If you want the top supporters or opponents alone, run those same commands with the support and oppose datasets we filtered down to earlier. Everything else about the commands would be the same as the first one above.
For the supporters:
support.groupby(["contributor_firstname", "contributor_lastname"]).amount.sum().reset_index().sort_values("amount", ascending=False)
contributor_firstname | contributor_lastname | amount | |
---|---|---|---|
444 | SEAN | PARKER AND AFFILIATED ENTITIES | 7250000.0 |
443 | SEAN | PARKER | 1500000.0 |
102 | DANIEL | LEWIS | 1250000.0 |
180 | HENRY | VAN AMERINGEN | 1000000.0 |
374 | NICHOLAS | PRITZKER | 900000.0 |
... | ... | ... | ... |
328 | MARK | SLABAUGH | 25.0 |
128 | DAVID | WOOD | 0.0 |
16 | AMIRGHASEM | SAMIMI-ARDESTAN | 0.0 |
458 | STEPHANIE | WARREN-SAMARIPA | 0.0 |
331 | MARSTON | MEDLOCK | 0.0 |
507 rows × 3 columns
For the opponents:
oppose.groupby(["contributor_firstname", "contributor_lastname"]).amount.sum().reset_index().sort_values("amount", ascending=False)
contributor_firstname | contributor_lastname | amount | |
---|---|---|---|
20 | JULIE | SCHAUER | 1364000.0 |
31 | ROBERT | ADAMS | 6500.0 |
29 | MICHAEL | DEAN | 5000.0 |
13 | GEORGE | BELL | 1550.0 |
12 | FARSHAD | FARDAD | 1000.0 |
17 | JESSICA | MACIEL | 1000.0 |
9 | EDITH | REUVENI | 1000.0 |
3 | CHRISTINE | MILLER | 871.0 |
25 | LORI | ROBINSON | 817.0 |
8 | DWIGHT | BELL | 730.0 |
23 | KOLALEH | TABIBZADEH | 500.0 |
24 | KOUROSH | MADDAHI | 500.0 |
0 | ANDREW | GRZYWA | 500.0 |
39 | THOMAS | DAWSON | 500.0 |
36 | RONALD | LAWRENCE | 300.0 |
22 | KEN | CORNEY | 300.0 |
27 | MARTIN | RYAN | 250.0 |
14 | ISRAEL | CARRERO | 250.0 |
30 | MIGUEL | COLON JR | 200.0 |
37 | SAM | SPIEGEL | 200.0 |
41 | WILLIAM | WINTERS | 200.0 |
6 | DAVID | DARRIN | 200.0 |
19 | JOHN | ROBERTS | 200.0 |
2 | CECILIA | GALVIN | 150.0 |
15 | JAMES | DAVIS | 150.0 |
26 | MAJID | TALEBI | 150.0 |
5 | DAN | DESMIDT | 150.0 |
33 | ROBERT | HANDY | 100.0 |
40 | WILLIAM | MADDEN | 100.0 |
38 | SCOTT | SEAMAN | 100.0 |
4 | CLARENCE | STEWART | 100.0 |
35 | RONALD | JOHNSON | 100.0 |
34 | ROBERT | SANDERSON | 100.0 |
1 | BILL | KOUSENS | 100.0 |
32 | ROBERT | ELSBERG | 100.0 |
7 | DAVID | MCCARTNEY | 100.0 |
10 | EILEEN | JANDRISEVITS,MS,' BY CADC-I, | 100.0 |
28 | MIA | CHUNG | 100.0 |
11 | ERIC | NUNEZ | 100.0 |
16 | JEREMY | FORBES | 100.0 |
18 | JOHN | CALLAS | 100.0 |
21 | KAREN | AKAGI | 100.0 |
11.3. How to not be wrong¶
You’ve done it. Our brief interview is complete and you’ve answered the big question that started our inquiry.
Or so you think! Look again at our rankings above. Now compare them against the ranking we looked at earlier in our sorting lesson.
Study it closely and you’ll see an important difference. All of the contributors without a first name are dropped from our groupby lists. And some of them gave a lot of money.
This is happening because of another pandas quirk. Empty fields are read in by pandas as null values, which is a mathematical representation of nothing. In pandas a null is called a NaN, an abbreviation for “not a number” commonly used in computer programming.
And, bad news, pandas’ groupby method will drop any rows with nulls in the grouping fields. So all those records without a first name were silently excluded from our analysis. Yikes!
Whatever our opinion of pandas’ default behavior, it’s something we need to account for, and a reminder that we should never assume we know what computer programming tools are doing under the hood. As with human sources, everything your code tells you should be viewed skeptically and verified.
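The behavior is easy to demonstrate on a toy DataFrame, invented here for illustration:

```python
import numpy as np
import pandas as pd

# Toy data: the second contributor has no first name
df = pd.DataFrame({
    "first": ["SEAN", np.nan],
    "amount": [100.0, 500.0],
})

# By default, the NaN row silently vanishes from the grouping
print(df.groupby("first").amount.sum())                # only SEAN's 100.0
print(df.groupby("first", dropna=False).amount.sum())  # both rows appear
```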
The solution to this problem is easy. We need to instruct pandas not to drop the null values by adding an extra option to groupby.
merged_prop.groupby(["contributor_firstname", "contributor_lastname"], dropna=False).amount.sum().reset_index().sort_values("amount", ascending=False)
contributor_firstname | contributor_lastname | amount | |
---|---|---|---|
483 | SEAN | PARKER AND AFFILIATED ENTITIES | 7250000.0 |
580 | NaN | FUND FOR POLICY REFORM | 6140000.0 |
581 | NaN | FUND FOR POLICY REFORM (NONPROFIT 501 (C)(4)) | 3940000.0 |
573 | NaN | DRUG POLICY ACTION - NON PROFIT 501C4, YES ON ... | 3000000.0 |
598 | NaN | NEW APPROACH PAC (MPO) | 2615000.0 |
... | ... | ... | ... |
448 | ROBERT | LOCKART | 25.0 |
16 | AMIRGHASEM | SAMIMI-ARDESTAN | 0.0 |
358 | MARSTON | MEDLOCK | 0.0 |
497 | STEPHANIE | WARREN-SAMARIPA | 0.0 |
136 | DAVID | WOOD | 0.0 |
620 rows × 3 columns
Now we’ve finally got a ranking we can work with. Congratulations, you’ve finished our analysis.
11.4. Creating a New Column¶
Let’s say we wanted to take an extra step to learn whether these top contributors donate from inside the state or outside of it.
We could just add “contributor_state” to the groupby statement.
merged_prop.groupby(["contributor_firstname", "contributor_lastname","contributor_state"], dropna=False).amount.sum().reset_index().sort_values("amount", ascending=False)
contributor_firstname | contributor_lastname | contributor_state | amount | |
---|---|---|---|---|
484 | SEAN | PARKER AND AFFILIATED ENTITIES | CA | 7250000.0 |
582 | NaN | FUND FOR POLICY REFORM | DE | 6140000.0 |
583 | NaN | FUND FOR POLICY REFORM (NONPROFIT 501 (C)(4)) | NY | 3940000.0 |
575 | NaN | DRUG POLICY ACTION - NON PROFIT 501C4, YES ON ... | CA | 3000000.0 |
600 | NaN | NEW APPROACH PAC (MPO) | DC | 2615000.0 |
... | ... | ... | ... | ... |
541 | WILLIAM | SANTSCHI | CA | 25.0 |
498 | STEPHANIE | WARREN-SAMARIPA | CA | 0.0 |
16 | AMIRGHASEM | SAMIMI-ARDESTAN | BC | 0.0 |
136 | DAVID | WOOD | AB | 0.0 |
359 | MARSTON | MEDLOCK | CA | 0.0 |
622 rows × 4 columns
But we might want to quickly sort in-state vs. out-of-state donors, or perform similar calculations, and just adding the contributor state doesn’t allow us to quickly evaluate trends. Plus, adding the “state” adds two more rows to our ranking list, probably because a donor within the same organization reported their location differently across multiple donations.
We could try grouping by state alone instead, to get a better sense of it.
merged_prop.groupby(["contributor_state"], dropna=False).amount.sum().reset_index().sort_values("amount", ascending=False)
contributor_state | amount | |
---|---|---|
4 | CA | 17491193.00 |
8 | DE | 6140100.00 |
20 | NY | 5446985.00 |
7 | DC | 2618600.00 |
9 | FL | 1250850.00 |
24 | PA | 1215475.00 |
27 | VA | 895243.64 |
28 | WA | 50800.00 |
5 | CO | 39300.00 |
12 | IL | 7000.00 |
25 | RI | 6700.00 |
15 | MD | 2971.00 |
21 | OH | 2950.00 |
26 | TX | 1800.00 |
19 | NV | 1800.00 |
23 | OR | 1400.00 |
11 | ID | 1000.00 |
14 | MA | 600.00 |
17 | MO | 550.00 |
2 | AZ | 500.00 |
16 | MI | 400.00 |
18 | NC | 200.00 |
22 | OK | 200.00 |
1 | AL | 100.00 |
13 | LA | 100.00 |
10 | GA | 100.00 |
6 | CT | 100.00 |
3 | BC | 0.00 |
0 | AB | 0.00 |
Or we could filter down to just California, and then filter again to everything but California. The filter checks whether a statement is true for each row (in this case, does contributor_state equal 'CA'). Then it performs an operation on the results.
merged_prop[merged_prop["contributor_state"] == "CA"]["amount"].sum()
17491193.0
merged_prop[merged_prop["contributor_state"] != "CA"]["amount"].sum()
17685824.64
But what if we want a quick way to group by “California” vs. “not-California,” and we want to be able to refer to that later? Instead, we can use conditionals to create a new column based on whether or not a contributor is in-state. Then we can group by that column.
There are a few ways to achieve this. We’re going to use a filter to create a True/False flag, which is a Boolean data type.
merged_prop["in_state"] = merged_prop["contributor_state"] == "CA"
/tmp/ipykernel_134/14140203.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
merged_prop["in_state"] = merged_prop["contributor_state"] == "CA"
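That SettingWithCopyWarning appears because merged_prop was itself created by filtering another DataFrame, so pandas can’t be sure our new column lands where we expect. One common way to quiet it, sketched here on a toy stand-in for our merged data, is to make an explicit copy before adding columns:

```python
import pandas as pd

# A toy stand-in for the merge-and-filter steps earlier in this chapter
everything = pd.DataFrame({
    "contributor_state": ["CA", "NY", "CA"],
    "amount": [100.0, 200.0, 50.0],
})
filtered = everything[everything.amount > 60]

# .copy() makes the filtered DataFrame independent of the original,
# so adding a column no longer triggers the warning
filtered = filtered.copy()
filtered["in_state"] = filtered["contributor_state"] == "CA"
print(filtered)
```

In our session the warning is harmless, so we can safely carry on.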
This basically says, “create a new column in merged_prop called in_state. Use contributor_state as the basis. When a row in contributor_state equals the string CA, that means in_state should be set to equal True. In all other circumstances, in_state will equal False.”
Now we can see our new column in the DataFrame. It will show up on the far right if you don’t specify a location.
merged_prop.head()
ocd_prop_id | calaccess_prop_id | ccdc_prop_id | prop_name | ccdc_committee_id | calaccess_committee_id | committee_name_x | committee_position | committee_name_y | calaccess_filing_id | ... | contributor_lastname | contributor_firstname | contributor_city | contributor_state | contributor_zip | contributor_employer | contributor_occupation | contributor_is_self_employed | amount | in_state | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2102 | ocd-contest/232cb72c-1b11-4293-b4e2-a181e80d172e | 1381868 | 83 | PROPOSITION 064- MARIJUANA LEGALIZATION. INITI... | 465 | 1385745 | FUND FOR POLICY REFORM (NONPROFIT 501(C)(4)) | SUPPORT | FUND FOR POLICY REFORM (NONPROFIT 501(C)(4)) | 2057298 | ... | FUND FOR POLICY REFORM | NaN | WILMINGTON | DE | 19809 | NaN | NaN | False | 550000.0 | False |
2103 | ocd-contest/232cb72c-1b11-4293-b4e2-a181e80d172e | 1381868 | 83 | PROPOSITION 064- MARIJUANA LEGALIZATION. INITI... | 465 | 1385745 | FUND FOR POLICY REFORM (NONPROFIT 501(C)(4)) | SUPPORT | FUND FOR POLICY REFORM (NONPROFIT 501(C)(4)) | 2057298 | ... | FUND FOR POLICY REFORM | NaN | WILMINGTON | DE | 19809 | NaN | NaN | False | 420000.0 | False |
2104 | ocd-contest/232cb72c-1b11-4293-b4e2-a181e80d172e | 1381868 | 83 | PROPOSITION 064- MARIJUANA LEGALIZATION. INITI... | 465 | 1385745 | FUND FOR POLICY REFORM (NONPROFIT 501(C)(4)) | SUPPORT | FUND FOR POLICY REFORM (NONPROFIT 501(C)(4)) | 2082414 | ... | FUND FOR POLICY REFORM | NaN | WILMINGTON | DE | 19809 | NaN | NaN | False | 1000000.0 | False |
2105 | ocd-contest/232cb72c-1b11-4293-b4e2-a181e80d172e | 1381868 | 83 | PROPOSITION 064- MARIJUANA LEGALIZATION. INITI... | 465 | 1385745 | FUND FOR POLICY REFORM (NONPROFIT 501(C)(4)) | SUPPORT | FUND FOR POLICY REFORM (NONPROFIT 501(C)(4)) | 2082414 | ... | FUND FOR POLICY REFORM | NaN | WILMINGTON | DE | 19809 | NaN | NaN | False | 1970000.0 | False |
2106 | ocd-contest/232cb72c-1b11-4293-b4e2-a181e80d172e | 1381868 | 83 | PROPOSITION 064- MARIJUANA LEGALIZATION. INITI... | 465 | 1385745 | FUND FOR POLICY REFORM (NONPROFIT 501(C)(4)) | SUPPORT | FUND FOR POLICY REFORM (NONPROFIT 501(C)(4)) | 2096507 | ... | FUND FOR POLICY REFORM | NaN | WILMINGTON | DE | 19809 | NaN | NaN | False | 1403000.0 | False |
5 rows × 21 columns
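A True/False column works fine for grouping, but if you’d rather have readable labels, numpy’s where function can assign one of two strings based on the same condition. A minimal sketch, on invented rows that reuse the chapter’s column name:

```python
import numpy as np
import pandas as pd

# Toy data; the column name matches the chapter, the values are invented
df = pd.DataFrame({"contributor_state": ["CA", "NY", "CA"]})

# np.where picks the first label where the condition is True,
# and the second label everywhere else
df["origin"] = np.where(
    df["contributor_state"] == "CA", "in-state", "out-of-state"
)
print(df)
```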
Let’s use our earlier groupby and sum code, but group by the in_state flag instead of by the contributor’s state.
merged_prop.groupby(["in_state"], dropna=False).amount.sum().reset_index().sort_values("amount", ascending=False)
in_state | amount | |
---|---|---|
0 | False | 17685824.64 |
1 | True | 17491193.00 |
Notice that these totals match our “California” vs. “not-California” sum totals that we calculated with the filtered calculations up above. That’s good! This is one way to QA your new column. If your totals didn’t match, it means you should go back and double-check the logic in your conditional statement that’s creating the new column.
We can also create a new DataFrame for just in-state donors.
in_state = merged_prop[merged_prop.in_state == True]
out_state = merged_prop[merged_prop.in_state == False]
And check what proportion of the funding came from in-state, overall.
in_state.amount.sum() / merged_prop.amount.sum()
0.49723353977884294
We can also easily create ranked lists of the top donors from within the state and outside of the state.
in_state.groupby(["contributor_firstname", "contributor_lastname"], dropna=False).amount.sum().reset_index().sort_values("amount", ascending=False)
contributor_firstname | contributor_lastname | amount | |
---|---|---|---|
386 | SEAN | PARKER AND AFFILIATED ENTITIES | 7250000.0 |
467 | NaN | DRUG POLICY ACTION - NON PROFIT 501C4, YES ON ... | 3000000.0 |
385 | SEAN | PARKER | 1500000.0 |
324 | NICHOLAS | PRITZKER | 900000.0 |
474 | NaN | GHOST MANAGEMENT GROUP, LLC DBA WEEDMAPS | 850000.0 |
... | ... | ... | ... |
357 | ROBERT | LOCKART | 25.0 |
142 | GEORGE | OKEEFE | 25.0 |
436 | WILLIAM | SANTSCHI | 25.0 |
286 | MARSTON | MEDLOCK | 0.0 |
400 | STEPHANIE | WARREN-SAMARIPA | 0.0 |
499 rows × 3 columns
out_state.groupby(["contributor_firstname", "contributor_lastname"], dropna=False).amount.sum().reset_index().sort_values("amount", ascending=False)
contributor_firstname | contributor_lastname | amount | |
---|---|---|---|
108 | NaN | FUND FOR POLICY REFORM | 6140000.0 |
109 | NaN | FUND FOR POLICY REFORM (NONPROFIT 501 (C)(4)) | 3940000.0 |
115 | NaN | NEW APPROACH PAC (MPO) | 2615000.0 |
28 | DANIEL | LEWIS | 1250000.0 |
61 | JULIE | SCHAUER | 1214000.0 |
... | ... | ... | ... |
86 | PETER | SCHORSCH | 100.0 |
87 | RALPH | RISCH | 100.0 |
0 | ADAM | ORENS | 100.0 |
32 | DAVID | WOOD | 0.0 |
4 | AMIRGHASEM | SAMIMI-ARDESTAN | 0.0 |
123 rows × 3 columns
You can use conditionals to create any number of similar flags, which will let you slice and dice your contributor lists to your heart’s content. This can be a powerful tool to look at data from different angles, narrow an existing analysis, or answer specific reporting questions.
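For instance, you could pair the in-state flag with a second, hypothetical flag for large donations, then group by both at once. A sketch with invented sample rows and an arbitrary $100,000 threshold:

```python
import pandas as pd

# Invented sample rows to sketch the pattern
df = pd.DataFrame({
    "contributor_state": ["CA", "NY", "CA"],
    "amount": [1000000.0, 500.0, 250.0],
})

# Any true/false test can become a flag column...
df["in_state"] = df["contributor_state"] == "CA"
df["big_donor"] = df["amount"] >= 100000

# ...and flags can be combined in a single groupby
print(df.groupby(["in_state", "big_donor"]).amount.sum())
```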