8. Merge¶
Our next job is to filter down the contributions list, which includes all disclosed contributions to all proposition campaigns, to just those linked to Proposition 64.
When joining two tables together, the first step is to look carefully at the columns in each table. We can do that with the info
command we learned earlier.
import pandas as pd
committee_list = pd.read_csv("https://raw.githubusercontent.com/california-civic-data-coalition/first-python-notebook/master/docs/src/_static/committees.csv")
contrib_list = pd.read_csv("https://raw.githubusercontent.com/california-civic-data-coalition/first-python-notebook/master/docs/src/_static/contributions.csv")
my_prop = 'PROPOSITION 064- MARIJUANA LEGALIZATION. INITIATIVE STATUTE.'
First the contributions.
contrib_list.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 56379 entries, 0 to 56378
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 calaccess_committee_id 56379 non-null int64
1 committee_name 56379 non-null object
2 calaccess_filing_id 56379 non-null int64
3 date_received 56379 non-null object
4 contributor_lastname 56375 non-null object
5 contributor_firstname 53411 non-null object
6 contributor_city 56369 non-null object
7 contributor_state 56363 non-null object
8 contributor_zip 56366 non-null object
9 contributor_employer 48572 non-null object
10 contributor_occupation 53273 non-null object
11 contributor_is_self_employed 56379 non-null bool
12 amount 56379 non-null float64
dtypes: bool(1), float64(1), int64(2), object(9)
memory usage: 5.2+ MB
Now compare that to the committee file.
committee_list.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 102 entries, 0 to 101
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ocd_prop_id 102 non-null object
1 calaccess_prop_id 102 non-null int64
2 ccdc_prop_id 102 non-null int64
3 prop_name 102 non-null object
4 ccdc_committee_id 102 non-null int64
5 calaccess_committee_id 102 non-null int64
6 committee_name 102 non-null object
7 committee_position 102 non-null object
dtypes: int64(4), object(4)
memory usage: 6.5+ KB
You will notice that each file contains a field called calaccess_committee_id
. That’s because these two files are drawn from a “relational database” that stores data in an array of tables linked together by common identifiers. In this case, the unique identifying codes of committees in one table can be expected to match those found in another.
We can therefore safely join the two files using the pandas merge method.
Note
If you are familar with traditional databases, you may recognize that the merge method in pandas is similar to SQL’s JOIN
statement. If you dig into merge’s documentation you will see it has many of the same options.
8.1. Merging DataFrames¶
That’s exactly what we want to do. So let’s try it.
Merging two DataFrames is as simple as passing both to pandas built-in merge method and specifying which field we’d like to use to connect them together. We will save the result into another new variable, merge_everything
.
merged_everything = pd.merge(committee_list, contrib_list, on="calaccess_committee_id")
That new DataFrame variable can be inspected like any other.
merged_everything.head()
ocd_prop_id | calaccess_prop_id | ccdc_prop_id | prop_name | ccdc_committee_id | calaccess_committee_id | committee_name_x | committee_position | committee_name_y | calaccess_filing_id | date_received | contributor_lastname | contributor_firstname | contributor_city | contributor_state | contributor_zip | contributor_employer | contributor_occupation | contributor_is_self_employed | amount | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | ocd-contest/b51dc64d-3562-4913-a190-69f5088c22a6 | 1376258 | 70 | PROPOSITION 051 - SCHOOL BONDS. FUNDING FOR K-... | 382 | 1374469 | YES ON PROPOSITION 51 - CALIFORNIANS FOR QUALI... | SUPPORT | YES ON PROPOSITION 51 - CALIFORNIANS FOR QUALI... | 1965180 | 2015-01-15 | CALIFORNIA BUILDING INDUSTRY ASSOCIATION ISSUE... | NaN | SACRAMENTO | CA | 95814 | NaN | NaN | False | 50000.0 |
1 | ocd-contest/b51dc64d-3562-4913-a190-69f5088c22a6 | 1376258 | 70 | PROPOSITION 051 - SCHOOL BONDS. FUNDING FOR K-... | 382 | 1374469 | YES ON PROPOSITION 51 - CALIFORNIANS FOR QUALI... | SUPPORT | YES ON PROPOSITION 51 - CALIFORNIANS FOR QUALI... | 1965180 | 2015-01-16 | COALITION FOR ADEQUATE SCHOOL HOUSING ISSUES C... | NaN | SAN RAFAEL | CA | 94901 | NaN | NaN | False | 55000.0 |
2 | ocd-contest/b51dc64d-3562-4913-a190-69f5088c22a6 | 1376258 | 70 | PROPOSITION 051 - SCHOOL BONDS. FUNDING FOR K-... | 382 | 1374469 | YES ON PROPOSITION 51 - CALIFORNIANS FOR QUALI... | SUPPORT | YES ON PROPOSITION 51 - CALIFORNIANS FOR QUALI... | 1965180 | 2015-01-28 | CALIFORNIA BUILDING INDUSTRY ASSOCIATION ISSUE... | NaN | SACRAMENTO | CA | 95814 | NaN | NaN | False | 140000.0 |
3 | ocd-contest/b51dc64d-3562-4913-a190-69f5088c22a6 | 1376258 | 70 | PROPOSITION 051 - SCHOOL BONDS. FUNDING FOR K-... | 382 | 1374469 | YES ON PROPOSITION 51 - CALIFORNIANS FOR QUALI... | SUPPORT | YES ON PROPOSITION 51 - CALIFORNIANS FOR QUALI... | 1965180 | 2015-01-28 | COALITION FOR ADEQUATE SCHOOL HOUSING ISSUES C... | NaN | SAN RAFAEL | CA | 94901 | NaN | NaN | False | 140000.0 |
4 | ocd-contest/b51dc64d-3562-4913-a190-69f5088c22a6 | 1376258 | 70 | PROPOSITION 051 - SCHOOL BONDS. FUNDING FOR K-... | 382 | 1374469 | YES ON PROPOSITION 51 - CALIFORNIANS FOR QUALI... | SUPPORT | YES ON PROPOSITION 51 - CALIFORNIANS FOR QUALI... | 1965180 | 2015-03-23 | CALIFORNIA BUILDING INDUSTRY ASSOCIATION ISSUE... | NaN | SACRAMENTO | CA | 95814 | NaN | NaN | False | 50000.0 |
By looking at the columns you can check how many rows survived the merge.
merged_everything.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 90264 entries, 0 to 90263
Data columns (total 20 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ocd_prop_id 90264 non-null object
1 calaccess_prop_id 90264 non-null int64
2 ccdc_prop_id 90264 non-null int64
3 prop_name 90264 non-null object
4 ccdc_committee_id 90264 non-null int64
5 calaccess_committee_id 90264 non-null int64
6 committee_name_x 90264 non-null object
7 committee_position 90264 non-null object
8 committee_name_y 90264 non-null object
9 calaccess_filing_id 90264 non-null int64
10 date_received 90264 non-null object
11 contributor_lastname 90256 non-null object
12 contributor_firstname 86685 non-null object
13 contributor_city 90251 non-null object
14 contributor_state 90244 non-null object
15 contributor_zip 90248 non-null object
16 contributor_employer 81684 non-null object
17 contributor_occupation 86545 non-null object
18 contributor_is_self_employed 90264 non-null bool
19 amount 90264 non-null float64
dtypes: bool(1), float64(1), int64(5), object(13)
memory usage: 13.9+ MB
You can also see that the DataFrame now contains all of the columns in both tables. Columns with the same name have had a suffix automatically appended to indicate whether they came from the first or second DataFrame submitted to the merge.
8.2. Filtering to a single proposition¶
The combined table now joins all contributions to all committees. To zero on just the contributions to committees in the contest over Proposition 64, we’ll need to filter out data, much like we did in the last chapter. Only this time, we’ll filter our new merged
DataFrame instead.
merged_prop = merged_everything[merged_everything.prop_name == my_prop]
We have now created a new dataset that includes only contributions supporting and opposing Proposition 64.
merged_prop.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 860 entries, 2102 to 83888
Data columns (total 20 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ocd_prop_id 860 non-null object
1 calaccess_prop_id 860 non-null int64
2 ccdc_prop_id 860 non-null int64
3 prop_name 860 non-null object
4 ccdc_committee_id 860 non-null int64
5 calaccess_committee_id 860 non-null int64
6 committee_name_x 860 non-null object
7 committee_position 860 non-null object
8 committee_name_y 860 non-null object
9 calaccess_filing_id 860 non-null int64
10 date_received 860 non-null object
11 contributor_lastname 860 non-null object
12 contributor_firstname 750 non-null object
13 contributor_city 860 non-null object
14 contributor_state 860 non-null object
15 contributor_zip 860 non-null object
16 contributor_employer 657 non-null object
17 contributor_occupation 742 non-null object
18 contributor_is_self_employed 860 non-null bool
19 amount 860 non-null float64
dtypes: bool(1), float64(1), int64(5), object(13)
memory usage: 135.2+ KB
We’re ready to move on from preparing our data. It’s time to interview it.