12. Compute

This chapter will show how you can create a new column based on the data in other columns, a process sometimes known as “computing.”

import warnings
warnings.simplefilter("ignore")
import pandas as pd
committee_list = pd.read_csv("https://raw.githubusercontent.com/california-civic-data-coalition/first-python-notebook/master/docs/src/_static/committees.csv")
contrib_list = pd.read_csv("https://raw.githubusercontent.com/california-civic-data-coalition/first-python-notebook/master/docs/src/_static/contributions.csv")
my_prop = 'PROPOSITION 064- MARIJUANA LEGALIZATION. INITIATIVE STATUTE.'
my_committees = committee_list[committee_list.prop_name == my_prop]
merged_everything = pd.merge(committee_list, contrib_list, on="calaccess_committee_id")
merged_prop = merged_everything[merged_everything.prop_name == my_prop]
support = merged_prop[merged_prop.committee_position == 'SUPPORT']
oppose = merged_prop[merged_prop.committee_position == 'OPPOSE']

12.1. Create a column

Let’s say we wanted to take an extra step beyond last chapter to learn which side got more money from outside of California.

As before, we could start by adding the contributor_state column to the groupby statement.

merged_prop.groupby(["contributor_firstname", "contributor_lastname", "contributor_state"], dropna=False).amount.sum().reset_index().sort_values("amount", ascending=False)
contributor_firstname contributor_lastname contributor_state amount
484 SEAN PARKER AND AFFILIATED ENTITIES CA 7250000.0
582 NaN FUND FOR POLICY REFORM DE 6140000.0
583 NaN FUND FOR POLICY REFORM (NONPROFIT 501 (C)(4)) NY 3940000.0
575 NaN DRUG POLICY ACTION - NON PROFIT 501C4, YES ON ... CA 3000000.0
600 NaN NEW APPROACH PAC (MPO) DC 2615000.0
... ... ... ... ...
541 WILLIAM SANTSCHI CA 25.0
498 STEPHANIE WARREN-SAMARIPA CA 0.0
16 AMIRGHASEM SAMIMI-ARDESTAN BC 0.0
136 DAVID WOOD AB 0.0
359 MARSTON MEDLOCK CA 0.0

622 rows × 4 columns

We could try grouping by state alone instead, to get a better sense of it.

merged_prop.groupby("contributor_state", dropna=False).amount.sum().reset_index().sort_values("amount", ascending=False)
contributor_state amount
4 CA 17491193.00
8 DE 6140100.00
20 NY 5446985.00
7 DC 2618600.00
9 FL 1250850.00
24 PA 1215475.00
27 VA 895243.64
28 WA 50800.00
5 CO 39300.00
12 IL 7000.00
25 RI 6700.00
15 MD 2971.00
21 OH 2950.00
26 TX 1800.00
19 NV 1800.00
23 OR 1400.00
11 ID 1000.00
14 MA 600.00
17 MO 550.00
2 AZ 500.00
16 MI 400.00
18 NC 200.00
22 OK 200.00
1 AL 100.00
13 LA 100.00
10 GA 100.00
6 CT 100.00
3 BC 0.00
0 AB 0.00

Or we could filter to just California donors.

merged_prop[merged_prop["contributor_state"] == "CA"]["amount"].sum()
17491193.0

And then filter again to those outside of California.

merged_prop[merged_prop["contributor_state"] != "CA"]["amount"].sum()
17685824.64

Each one of these methods has its place. But to advance to another level of sophistication, and to simplify our code, it’s often helpful to create a new column that stores values calculated off other fields. Then we can group by the new column to get the answers we’re after.

There are a few ways to achieve this. We’re going to start with an expression that tests the contributor_state field and returns true or false, much like the ones we’ve used before in filters.

merged_prop["in_state"] = merged_prop.contributor_state == "CA"

This basically says, “Create a new column name in_state using contributor_state as the basis. When a row in contributor_state equals CA, that means in_state should be True. In all other circumstances, in_state will equal False.”

Now, we can see our new column in the DataFrame. It will show up on the far right of the table.

merged_prop.head()
ocd_prop_id calaccess_prop_id ccdc_prop_id prop_name ccdc_committee_id calaccess_committee_id committee_name_x committee_position committee_name_y calaccess_filing_id ... contributor_lastname contributor_firstname contributor_city contributor_state contributor_zip contributor_employer contributor_occupation contributor_is_self_employed amount in_state
2102 ocd-contest/232cb72c-1b11-4293-b4e2-a181e80d172e 1381868 83 PROPOSITION 064- MARIJUANA LEGALIZATION. INITI... 465 1385745 FUND FOR POLICY REFORM (NONPROFIT 501(C)(4)) SUPPORT FUND FOR POLICY REFORM (NONPROFIT 501(C)(4)) 2057298 ... FUND FOR POLICY REFORM NaN WILMINGTON DE 19809 NaN NaN False 550000.0 False
2103 ocd-contest/232cb72c-1b11-4293-b4e2-a181e80d172e 1381868 83 PROPOSITION 064- MARIJUANA LEGALIZATION. INITI... 465 1385745 FUND FOR POLICY REFORM (NONPROFIT 501(C)(4)) SUPPORT FUND FOR POLICY REFORM (NONPROFIT 501(C)(4)) 2057298 ... FUND FOR POLICY REFORM NaN WILMINGTON DE 19809 NaN NaN False 420000.0 False
2104 ocd-contest/232cb72c-1b11-4293-b4e2-a181e80d172e 1381868 83 PROPOSITION 064- MARIJUANA LEGALIZATION. INITI... 465 1385745 FUND FOR POLICY REFORM (NONPROFIT 501(C)(4)) SUPPORT FUND FOR POLICY REFORM (NONPROFIT 501(C)(4)) 2082414 ... FUND FOR POLICY REFORM NaN WILMINGTON DE 19809 NaN NaN False 1000000.0 False
2105 ocd-contest/232cb72c-1b11-4293-b4e2-a181e80d172e 1381868 83 PROPOSITION 064- MARIJUANA LEGALIZATION. INITI... 465 1385745 FUND FOR POLICY REFORM (NONPROFIT 501(C)(4)) SUPPORT FUND FOR POLICY REFORM (NONPROFIT 501(C)(4)) 2082414 ... FUND FOR POLICY REFORM NaN WILMINGTON DE 19809 NaN NaN False 1970000.0 False
2106 ocd-contest/232cb72c-1b11-4293-b4e2-a181e80d172e 1381868 83 PROPOSITION 064- MARIJUANA LEGALIZATION. INITI... 465 1385745 FUND FOR POLICY REFORM (NONPROFIT 501(C)(4)) SUPPORT FUND FOR POLICY REFORM (NONPROFIT 501(C)(4)) 2096507 ... FUND FOR POLICY REFORM NaN WILMINGTON DE 19809 NaN NaN False 1403000.0 False

5 rows × 21 columns

12.2. Analyze with groupby

Let’s use our groupby and sum method on the in_state flag.

merged_prop.groupby("in_state", dropna=False).amount.sum().reset_index().sort_values("amount", ascending=False)
in_state amount
0 False 17685824.64
1 True 17491193.00

Note

Notice that these totals match the totals that we calculated with the filtered calculations above. That’s good! This is one way to verify your new column. If your totals don’t match, it means you should go back and doublecheck your conditional statement that’s creating the new column.

Let’s do a little more. We can now create a new DataFrame for just in-state donors.

in_state = merged_prop[merged_prop.in_state == True]

And check the overall proportion of funding that came from inside the state.

in_state.amount.sum() / merged_prop.amount.sum()
0.49723353977884294

We can also easily create ranked lists of the top donors from within the state.

in_state.groupby(["contributor_firstname", "contributor_lastname"], dropna=False).amount.sum().reset_index().sort_values("amount", ascending=False)
contributor_firstname contributor_lastname amount
386 SEAN PARKER AND AFFILIATED ENTITIES 7250000.0
467 NaN DRUG POLICY ACTION - NON PROFIT 501C4, YES ON ... 3000000.0
385 SEAN PARKER 1500000.0
324 NICHOLAS PRITZKER 900000.0
474 NaN GHOST MANAGEMENT GROUP, LLC DBA WEEDMAPS 850000.0
... ... ... ...
357 ROBERT LOCKART 25.0
142 GEORGE OKEEFE 25.0
436 WILLIAM SANTSCHI 25.0
286 MARSTON MEDLOCK 0.0
400 STEPHANIE WARREN-SAMARIPA 0.0

499 rows × 3 columns

And do the same the for those outside the state. First by making a DataFrame.

out_state = merged_prop[merged_prop.in_state == False]

Then by swapping our new variable into the line of code above.

out_state.groupby(["contributor_firstname", "contributor_lastname"], dropna=False).amount.sum().reset_index().sort_values("amount", ascending=False)
contributor_firstname contributor_lastname amount
108 NaN FUND FOR POLICY REFORM 6140000.0
109 NaN FUND FOR POLICY REFORM (NONPROFIT 501 (C)(4)) 3940000.0
115 NaN NEW APPROACH PAC (MPO) 2615000.0
28 DANIEL LEWIS 1250000.0
61 JULIE SCHAUER 1214000.0
... ... ... ...
86 PETER SCHORSCH 100.0
87 RALPH RISCH 100.0
0 ADAM ORENS 100.0
32 DAVID WOOD 0.0
4 AMIRGHASEM SAMIMI-ARDESTAN 0.0

123 rows × 3 columns