8. Merge

Our next job is to filter down the contributions list, which includes all disclosed contributions to all proposition campaigns, to just those linked to Proposition 64.

When joining two tables together, the first step is to look carefully at the columns in each table. We can do that with the info command we learned earlier.

import pandas as pd
committee_list = pd.read_csv("https://raw.githubusercontent.com/california-civic-data-coalition/first-python-notebook/master/docs/src/_static/committees.csv")
contrib_list = pd.read_csv("https://raw.githubusercontent.com/california-civic-data-coalition/first-python-notebook/master/docs/src/_static/contributions.csv")
my_prop = 'PROPOSITION 064- MARIJUANA LEGALIZATION. INITIATIVE STATUTE.'

First the contributions.

contrib_list.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 56379 entries, 0 to 56378
Data columns (total 13 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   calaccess_committee_id        56379 non-null  int64  
 1   committee_name                56379 non-null  object 
 2   calaccess_filing_id           56379 non-null  int64  
 3   date_received                 56379 non-null  object 
 4   contributor_lastname          56375 non-null  object 
 5   contributor_firstname         53411 non-null  object 
 6   contributor_city              56369 non-null  object 
 7   contributor_state             56363 non-null  object 
 8   contributor_zip               56366 non-null  object 
 9   contributor_employer          48572 non-null  object 
 10  contributor_occupation        53273 non-null  object 
 11  contributor_is_self_employed  56379 non-null  bool   
 12  amount                        56379 non-null  float64
dtypes: bool(1), float64(1), int64(2), object(9)
memory usage: 5.2+ MB

Now compare that to the committee file.

committee_list.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 102 entries, 0 to 101
Data columns (total 8 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   ocd_prop_id             102 non-null    object
 1   calaccess_prop_id       102 non-null    int64 
 2   ccdc_prop_id            102 non-null    int64 
 3   prop_name               102 non-null    object
 4   ccdc_committee_id       102 non-null    int64 
 5   calaccess_committee_id  102 non-null    int64 
 6   committee_name          102 non-null    object
 7   committee_position      102 non-null    object
dtypes: int64(4), object(4)
memory usage: 6.5+ KB

You will notice that each file contains a field called calaccess_committee_id . That’s because these two files are drawn from a “relational database” that stores data in an array of tables linked together by common identifiers. In this case, the unique identifying codes of committees in one table can be expected to match those found in another.

We can therefore safely join the two files using the pandas merge method.

Note

If you are familar with traditional databases, you may recognize that the merge method in pandas is similar to SQL’s JOIN statement. If you dig into merge’s documentation you will see it has many of the same options.

8.1. Merging DataFrames

That’s exactly what we want to do. So let’s try it.

Merging two DataFrames is as simple as passing both to pandas built-in merge method and specifying which field we’d like to use to connect them together. We will save the result into another new variable, merge_everything.

merged_everything = pd.merge(committee_list, contrib_list, on="calaccess_committee_id")

That new DataFrame variable can be inspected like any other.

merged_everything.head()
ocd_prop_id calaccess_prop_id ccdc_prop_id prop_name ccdc_committee_id calaccess_committee_id committee_name_x committee_position committee_name_y calaccess_filing_id date_received contributor_lastname contributor_firstname contributor_city contributor_state contributor_zip contributor_employer contributor_occupation contributor_is_self_employed amount
0 ocd-contest/b51dc64d-3562-4913-a190-69f5088c22a6 1376258 70 PROPOSITION 051 - SCHOOL BONDS. FUNDING FOR K-... 382 1374469 YES ON PROPOSITION 51 - CALIFORNIANS FOR QUALI... SUPPORT YES ON PROPOSITION 51 - CALIFORNIANS FOR QUALI... 1965180 2015-01-15 CALIFORNIA BUILDING INDUSTRY ASSOCIATION ISSUE... NaN SACRAMENTO CA 95814 NaN NaN False 50000.0
1 ocd-contest/b51dc64d-3562-4913-a190-69f5088c22a6 1376258 70 PROPOSITION 051 - SCHOOL BONDS. FUNDING FOR K-... 382 1374469 YES ON PROPOSITION 51 - CALIFORNIANS FOR QUALI... SUPPORT YES ON PROPOSITION 51 - CALIFORNIANS FOR QUALI... 1965180 2015-01-16 COALITION FOR ADEQUATE SCHOOL HOUSING ISSUES C... NaN SAN RAFAEL CA 94901 NaN NaN False 55000.0
2 ocd-contest/b51dc64d-3562-4913-a190-69f5088c22a6 1376258 70 PROPOSITION 051 - SCHOOL BONDS. FUNDING FOR K-... 382 1374469 YES ON PROPOSITION 51 - CALIFORNIANS FOR QUALI... SUPPORT YES ON PROPOSITION 51 - CALIFORNIANS FOR QUALI... 1965180 2015-01-28 CALIFORNIA BUILDING INDUSTRY ASSOCIATION ISSUE... NaN SACRAMENTO CA 95814 NaN NaN False 140000.0
3 ocd-contest/b51dc64d-3562-4913-a190-69f5088c22a6 1376258 70 PROPOSITION 051 - SCHOOL BONDS. FUNDING FOR K-... 382 1374469 YES ON PROPOSITION 51 - CALIFORNIANS FOR QUALI... SUPPORT YES ON PROPOSITION 51 - CALIFORNIANS FOR QUALI... 1965180 2015-01-28 COALITION FOR ADEQUATE SCHOOL HOUSING ISSUES C... NaN SAN RAFAEL CA 94901 NaN NaN False 140000.0
4 ocd-contest/b51dc64d-3562-4913-a190-69f5088c22a6 1376258 70 PROPOSITION 051 - SCHOOL BONDS. FUNDING FOR K-... 382 1374469 YES ON PROPOSITION 51 - CALIFORNIANS FOR QUALI... SUPPORT YES ON PROPOSITION 51 - CALIFORNIANS FOR QUALI... 1965180 2015-03-23 CALIFORNIA BUILDING INDUSTRY ASSOCIATION ISSUE... NaN SACRAMENTO CA 95814 NaN NaN False 50000.0

By looking at the columns you can check how many rows survived the merge.

merged_everything.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 90264 entries, 0 to 90263
Data columns (total 20 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   ocd_prop_id                   90264 non-null  object 
 1   calaccess_prop_id             90264 non-null  int64  
 2   ccdc_prop_id                  90264 non-null  int64  
 3   prop_name                     90264 non-null  object 
 4   ccdc_committee_id             90264 non-null  int64  
 5   calaccess_committee_id        90264 non-null  int64  
 6   committee_name_x              90264 non-null  object 
 7   committee_position            90264 non-null  object 
 8   committee_name_y              90264 non-null  object 
 9   calaccess_filing_id           90264 non-null  int64  
 10  date_received                 90264 non-null  object 
 11  contributor_lastname          90256 non-null  object 
 12  contributor_firstname         86685 non-null  object 
 13  contributor_city              90251 non-null  object 
 14  contributor_state             90244 non-null  object 
 15  contributor_zip               90248 non-null  object 
 16  contributor_employer          81684 non-null  object 
 17  contributor_occupation        86545 non-null  object 
 18  contributor_is_self_employed  90264 non-null  bool   
 19  amount                        90264 non-null  float64
dtypes: bool(1), float64(1), int64(5), object(13)
memory usage: 13.9+ MB

You can also see that the DataFrame now contains all of the columns in both tables. Columns with the same name have had a suffix automatically appended to indicate whether they came from the first or second DataFrame submitted to the merge.

8.2. Filtering to a single proposition

The combined table now joins all contributions to all committees. To zero on just the contributions to committees in the contest over Proposition 64, we’ll need to filter out data, much like we did in the last chapter. Only this time, we’ll filter our new merged DataFrame instead.

merged_prop = merged_everything[merged_everything.prop_name == my_prop]

We have now created a new dataset that includes only contributions supporting and opposing Proposition 64.

merged_prop.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 860 entries, 2102 to 83888
Data columns (total 20 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   ocd_prop_id                   860 non-null    object 
 1   calaccess_prop_id             860 non-null    int64  
 2   ccdc_prop_id                  860 non-null    int64  
 3   prop_name                     860 non-null    object 
 4   ccdc_committee_id             860 non-null    int64  
 5   calaccess_committee_id        860 non-null    int64  
 6   committee_name_x              860 non-null    object 
 7   committee_position            860 non-null    object 
 8   committee_name_y              860 non-null    object 
 9   calaccess_filing_id           860 non-null    int64  
 10  date_received                 860 non-null    object 
 11  contributor_lastname          860 non-null    object 
 12  contributor_firstname         750 non-null    object 
 13  contributor_city              860 non-null    object 
 14  contributor_state             860 non-null    object 
 15  contributor_zip               860 non-null    object 
 16  contributor_employer          657 non-null    object 
 17  contributor_occupation        742 non-null    object 
 18  contributor_is_self_employed  860 non-null    bool   
 19  amount                        860 non-null    float64
dtypes: bool(1), float64(1), int64(5), object(13)
memory usage: 135.2+ KB

We’re ready to move on from preparing our data. It’s time to interview it.