12. Compute¶
This chapter will show how you can create a new column based on the data in other columns, a process sometimes known as “computing.”
Sections
import warnings
warnings.simplefilter("ignore")
import pandas as pd
committee_list = pd.read_csv("https://raw.githubusercontent.com/california-civic-data-coalition/first-python-notebook/master/docs/src/_static/committees.csv")
contrib_list = pd.read_csv("https://raw.githubusercontent.com/california-civic-data-coalition/first-python-notebook/master/docs/src/_static/contributions.csv")
my_prop = 'PROPOSITION 064- MARIJUANA LEGALIZATION. INITIATIVE STATUTE.'
my_committees = committee_list[committee_list.prop_name == my_prop]
merged_everything = pd.merge(committee_list, contrib_list, on="calaccess_committee_id")
merged_prop = merged_everything[merged_everything.prop_name == my_prop]
support = merged_prop[merged_prop.committee_position == 'SUPPORT']
oppose = merged_prop[merged_prop.committee_position == 'OPPOSE']
12.1. Create a column¶
Let’s say we wanted to take an extra step beyond last chapter to learn which side got more money from outside of California.
As before, we could start by adding the contributor_state
column to the groupby
statement.
merged_prop.groupby(["contributor_firstname", "contributor_lastname", "contributor_state"], dropna=False).amount.sum().reset_index().sort_values("amount", ascending=False)
contributor_firstname | contributor_lastname | contributor_state | amount | |
---|---|---|---|---|
484 | SEAN | PARKER AND AFFILIATED ENTITIES | CA | 7250000.0 |
582 | NaN | FUND FOR POLICY REFORM | DE | 6140000.0 |
583 | NaN | FUND FOR POLICY REFORM (NONPROFIT 501 (C)(4)) | NY | 3940000.0 |
575 | NaN | DRUG POLICY ACTION - NON PROFIT 501C4, YES ON ... | CA | 3000000.0 |
600 | NaN | NEW APPROACH PAC (MPO) | DC | 2615000.0 |
... | ... | ... | ... | ... |
541 | WILLIAM | SANTSCHI | CA | 25.0 |
498 | STEPHANIE | WARREN-SAMARIPA | CA | 0.0 |
16 | AMIRGHASEM | SAMIMI-ARDESTAN | BC | 0.0 |
136 | DAVID | WOOD | AB | 0.0 |
359 | MARSTON | MEDLOCK | CA | 0.0 |
622 rows × 4 columns
We could try grouping by state alone instead, to get a better sense of it.
merged_prop.groupby("contributor_state", dropna=False).amount.sum().reset_index().sort_values("amount", ascending=False)
contributor_state | amount | |
---|---|---|
4 | CA | 17491193.00 |
8 | DE | 6140100.00 |
20 | NY | 5446985.00 |
7 | DC | 2618600.00 |
9 | FL | 1250850.00 |
24 | PA | 1215475.00 |
27 | VA | 895243.64 |
28 | WA | 50800.00 |
5 | CO | 39300.00 |
12 | IL | 7000.00 |
25 | RI | 6700.00 |
15 | MD | 2971.00 |
21 | OH | 2950.00 |
26 | TX | 1800.00 |
19 | NV | 1800.00 |
23 | OR | 1400.00 |
11 | ID | 1000.00 |
14 | MA | 600.00 |
17 | MO | 550.00 |
2 | AZ | 500.00 |
16 | MI | 400.00 |
18 | NC | 200.00 |
22 | OK | 200.00 |
1 | AL | 100.00 |
13 | LA | 100.00 |
10 | GA | 100.00 |
6 | CT | 100.00 |
3 | BC | 0.00 |
0 | AB | 0.00 |
Or we could filter to just California donors.
merged_prop[merged_prop["contributor_state"] == "CA"]["amount"].sum()
17491193.0
And then filter again to those outside of California.
merged_prop[merged_prop["contributor_state"] != "CA"]["amount"].sum()
17685824.64
Each one of these methods has its place. But to advance to another level of sophistication, and to simplify our code, it’s often helpful to create a new column that stores values calculated off other fields. Then we can group by the new column to get the answers we’re after.
There are a few ways to achieve this. We’re going to start with an expression that tests the contributor_state
field and returns true or false, much like the ones we’ve used before in filters.
merged_prop["in_state"] = merged_prop.contributor_state == "CA"
This basically says, “Create a new column name in_state
using contributor_state
as the basis. When a row in contributor_state
equals CA
, that means in_state
should be True
. In all other circumstances, in_state
will equal False
.”
Now, we can see our new column in the DataFrame. It will show up on the far right of the table.
merged_prop.head()
ocd_prop_id | calaccess_prop_id | ccdc_prop_id | prop_name | ccdc_committee_id | calaccess_committee_id | committee_name_x | committee_position | committee_name_y | calaccess_filing_id | ... | contributor_lastname | contributor_firstname | contributor_city | contributor_state | contributor_zip | contributor_employer | contributor_occupation | contributor_is_self_employed | amount | in_state | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2102 | ocd-contest/232cb72c-1b11-4293-b4e2-a181e80d172e | 1381868 | 83 | PROPOSITION 064- MARIJUANA LEGALIZATION. INITI... | 465 | 1385745 | FUND FOR POLICY REFORM (NONPROFIT 501(C)(4)) | SUPPORT | FUND FOR POLICY REFORM (NONPROFIT 501(C)(4)) | 2057298 | ... | FUND FOR POLICY REFORM | NaN | WILMINGTON | DE | 19809 | NaN | NaN | False | 550000.0 | False |
2103 | ocd-contest/232cb72c-1b11-4293-b4e2-a181e80d172e | 1381868 | 83 | PROPOSITION 064- MARIJUANA LEGALIZATION. INITI... | 465 | 1385745 | FUND FOR POLICY REFORM (NONPROFIT 501(C)(4)) | SUPPORT | FUND FOR POLICY REFORM (NONPROFIT 501(C)(4)) | 2057298 | ... | FUND FOR POLICY REFORM | NaN | WILMINGTON | DE | 19809 | NaN | NaN | False | 420000.0 | False |
2104 | ocd-contest/232cb72c-1b11-4293-b4e2-a181e80d172e | 1381868 | 83 | PROPOSITION 064- MARIJUANA LEGALIZATION. INITI... | 465 | 1385745 | FUND FOR POLICY REFORM (NONPROFIT 501(C)(4)) | SUPPORT | FUND FOR POLICY REFORM (NONPROFIT 501(C)(4)) | 2082414 | ... | FUND FOR POLICY REFORM | NaN | WILMINGTON | DE | 19809 | NaN | NaN | False | 1000000.0 | False |
2105 | ocd-contest/232cb72c-1b11-4293-b4e2-a181e80d172e | 1381868 | 83 | PROPOSITION 064- MARIJUANA LEGALIZATION. INITI... | 465 | 1385745 | FUND FOR POLICY REFORM (NONPROFIT 501(C)(4)) | SUPPORT | FUND FOR POLICY REFORM (NONPROFIT 501(C)(4)) | 2082414 | ... | FUND FOR POLICY REFORM | NaN | WILMINGTON | DE | 19809 | NaN | NaN | False | 1970000.0 | False |
2106 | ocd-contest/232cb72c-1b11-4293-b4e2-a181e80d172e | 1381868 | 83 | PROPOSITION 064- MARIJUANA LEGALIZATION. INITI... | 465 | 1385745 | FUND FOR POLICY REFORM (NONPROFIT 501(C)(4)) | SUPPORT | FUND FOR POLICY REFORM (NONPROFIT 501(C)(4)) | 2096507 | ... | FUND FOR POLICY REFORM | NaN | WILMINGTON | DE | 19809 | NaN | NaN | False | 1403000.0 | False |
5 rows × 21 columns
12.2. Analyze with groupby
¶
Let’s use our groupby
and sum
method on the in_state
flag.
merged_prop.groupby("in_state", dropna=False).amount.sum().reset_index().sort_values("amount", ascending=False)
in_state | amount | |
---|---|---|
0 | False | 17685824.64 |
1 | True | 17491193.00 |
Note
Notice that these totals match the totals that we calculated with the filtered calculations above. That’s good! This is one way to verify your new column. If your totals don’t match, it means you should go back and doublecheck your conditional statement that’s creating the new column.
Let’s do a little more. We can now create a new DataFrame for just in-state donors.
in_state = merged_prop[merged_prop.in_state == True]
And check the overall proportion of funding that came from inside the state.
in_state.amount.sum() / merged_prop.amount.sum()
0.49723353977884294
We can also easily create ranked lists of the top donors from within the state.
in_state.groupby(["contributor_firstname", "contributor_lastname"], dropna=False).amount.sum().reset_index().sort_values("amount", ascending=False)
contributor_firstname | contributor_lastname | amount | |
---|---|---|---|
386 | SEAN | PARKER AND AFFILIATED ENTITIES | 7250000.0 |
467 | NaN | DRUG POLICY ACTION - NON PROFIT 501C4, YES ON ... | 3000000.0 |
385 | SEAN | PARKER | 1500000.0 |
324 | NICHOLAS | PRITZKER | 900000.0 |
474 | NaN | GHOST MANAGEMENT GROUP, LLC DBA WEEDMAPS | 850000.0 |
... | ... | ... | ... |
357 | ROBERT | LOCKART | 25.0 |
142 | GEORGE | OKEEFE | 25.0 |
436 | WILLIAM | SANTSCHI | 25.0 |
286 | MARSTON | MEDLOCK | 0.0 |
400 | STEPHANIE | WARREN-SAMARIPA | 0.0 |
499 rows × 3 columns
And do the same the for those outside the state. First by making a DataFrame.
out_state = merged_prop[merged_prop.in_state == False]
Then by swapping our new variable into the line of code above.
out_state.groupby(["contributor_firstname", "contributor_lastname"], dropna=False).amount.sum().reset_index().sort_values("amount", ascending=False)
contributor_firstname | contributor_lastname | amount | |
---|---|---|---|
108 | NaN | FUND FOR POLICY REFORM | 6140000.0 |
109 | NaN | FUND FOR POLICY REFORM (NONPROFIT 501 (C)(4)) | 3940000.0 |
115 | NaN | NEW APPROACH PAC (MPO) | 2615000.0 |
28 | DANIEL | LEWIS | 1250000.0 |
61 | JULIE | SCHAUER | 1214000.0 |
... | ... | ... | ... |
86 | PETER | SCHORSCH | 100.0 |
87 | RALPH | RISCH | 100.0 |
0 | ADAM | ORENS | 100.0 |
32 | DAVID | WOOD | 0.0 |
4 | AMIRGHASEM | SAMIMI-ARDESTAN | 0.0 |
123 rows × 3 columns