8. Merge¶

Our next job is to filter down the contributions list, which includes all disclosed contributions to all proposition campaigns, to just those linked to Proposition 64.

When joining two tables together, the first step is to look carefully at the columns in each table. We can do that with the info command we learned earlier.

import pandas as pd
committee_list = pd.read_csv("https://raw.githubusercontent.com/california-civic-data-coalition/first-python-notebook/master/docs/src/_static/committees.csv")
contrib_list = pd.read_csv("https://raw.githubusercontent.com/california-civic-data-coalition/first-python-notebook/master/docs/src/_static/contributions.csv")
my_prop = 'PROPOSITION 064- MARIJUANA LEGALIZATION. INITIATIVE STATUTE.'

First the contributions.

contrib_list.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 56379 entries, 0 to 56378
Data columns (total 13 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   calaccess_committee_id        56379 non-null  int64  
 1   committee_name                56379 non-null  object 
 2   calaccess_filing_id           56379 non-null  int64  
 3   date_received                 56379 non-null  object 
 4   contributor_lastname          56375 non-null  object 
 5   contributor_firstname         53411 non-null  object 
 6   contributor_city              56369 non-null  object 
 7   contributor_state             56363 non-null  object 
 8   contributor_zip               56366 non-null  object 
 9   contributor_employer          48572 non-null  object 
 10  contributor_occupation        53273 non-null  object 
 11  contributor_is_self_employed  56379 non-null  bool   
 12  amount                        56379 non-null  float64
dtypes: bool(1), float64(1), int64(2), object(9)
memory usage: 5.2+ MB

Now compare that to the committee file.

committee_list.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 102 entries, 0 to 101
Data columns (total 8 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   ocd_prop_id             102 non-null    object
 1   calaccess_prop_id       102 non-null    int64 
 2   ccdc_prop_id            102 non-null    int64 
 3   prop_name               102 non-null    object
 4   ccdc_committee_id       102 non-null    int64 
 5   calaccess_committee_id  102 non-null    int64 
 6   committee_name          102 non-null    object
 7   committee_position      102 non-null    object
dtypes: int64(4), object(4)
memory usage: 6.5+ KB

You will notice that each file contains a field called calaccess_committee_id . That’s because these two files are drawn from a “relational database” that stores data in an array of tables linked together by common identifiers. In this case, the unique identifying codes of committees in one table can be expected to match those found in another.

We can therefore safely join the two files using the pandas merge method.

Note

If you are familar with traditional databases, you may recognize that the merge method in pandas is similar to SQL’s JOIN statement. If you dig into merge’s documentation you will see it has many of the same options.

8.1. Merging DataFrames¶

That’s exactly what we want to do. So let’s try it.

Merging two DataFrames is as simple as passing both to pandas built-in merge method and specifying which field we’d like to use to connect them together. We will save the result into another new variable, merge_everything.

merged_everything = pd.merge(committee_list, contrib_list, on="calaccess_committee_id")

That new DataFrame variable can be inspected like any other.

merged_everything.head()

	ocd_prop_id	calaccess_prop_id	ccdc_prop_id	prop_name	ccdc_committee_id	calaccess_committee_id	committee_name_x	committee_position	committee_name_y	calaccess_filing_id	date_received	contributor_lastname	contributor_firstname	contributor_city	contributor_state	contributor_zip	contributor_employer	contributor_occupation	contributor_is_self_employed	amount
0	ocd-contest/b51dc64d-3562-4913-a190-69f5088c22a6	1376258	70	PROPOSITION 051 - SCHOOL BONDS. FUNDING FOR K-...	382	1374469	YES ON PROPOSITION 51 - CALIFORNIANS FOR QUALI...	SUPPORT	YES ON PROPOSITION 51 - CALIFORNIANS FOR QUALI...	1965180	2015-01-15	CALIFORNIA BUILDING INDUSTRY ASSOCIATION ISSUE...	NaN	SACRAMENTO	CA	95814	NaN	NaN	False	50000.0
1	ocd-contest/b51dc64d-3562-4913-a190-69f5088c22a6	1376258	70	PROPOSITION 051 - SCHOOL BONDS. FUNDING FOR K-...	382	1374469	YES ON PROPOSITION 51 - CALIFORNIANS FOR QUALI...	SUPPORT	YES ON PROPOSITION 51 - CALIFORNIANS FOR QUALI...	1965180	2015-01-16	COALITION FOR ADEQUATE SCHOOL HOUSING ISSUES C...	NaN	SAN RAFAEL	CA	94901	NaN	NaN	False	55000.0
2	ocd-contest/b51dc64d-3562-4913-a190-69f5088c22a6	1376258	70	PROPOSITION 051 - SCHOOL BONDS. FUNDING FOR K-...	382	1374469	YES ON PROPOSITION 51 - CALIFORNIANS FOR QUALI...	SUPPORT	YES ON PROPOSITION 51 - CALIFORNIANS FOR QUALI...	1965180	2015-01-28	CALIFORNIA BUILDING INDUSTRY ASSOCIATION ISSUE...	NaN	SACRAMENTO	CA	95814	NaN	NaN	False	140000.0
3	ocd-contest/b51dc64d-3562-4913-a190-69f5088c22a6	1376258	70	PROPOSITION 051 - SCHOOL BONDS. FUNDING FOR K-...	382	1374469	YES ON PROPOSITION 51 - CALIFORNIANS FOR QUALI...	SUPPORT	YES ON PROPOSITION 51 - CALIFORNIANS FOR QUALI...	1965180	2015-01-28	COALITION FOR ADEQUATE SCHOOL HOUSING ISSUES C...	NaN	SAN RAFAEL	CA	94901	NaN	NaN	False	140000.0
4	ocd-contest/b51dc64d-3562-4913-a190-69f5088c22a6	1376258	70	PROPOSITION 051 - SCHOOL BONDS. FUNDING FOR K-...	382	1374469	YES ON PROPOSITION 51 - CALIFORNIANS FOR QUALI...	SUPPORT	YES ON PROPOSITION 51 - CALIFORNIANS FOR QUALI...	1965180	2015-03-23	CALIFORNIA BUILDING INDUSTRY ASSOCIATION ISSUE...	NaN	SACRAMENTO	CA	95814	NaN	NaN	False	50000.0

By looking at the columns you can check how many rows survived the merge.

merged_everything.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 90264 entries, 0 to 90263
Data columns (total 20 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 ocd_prop_id                   90264 non-null  object 
 calaccess_prop_id             90264 non-null  int64  
 ccdc_prop_id                  90264 non-null  int64  
 prop_name                     90264 non-null  object 
 ccdc_committee_id             90264 non-null  int64  
 calaccess_committee_id        90264 non-null  int64  
 committee_name_x              90264 non-null  object 
 committee_position            90264 non-null  object 
 committee_name_y              90264 non-null  object 
 calaccess_filing_id           90264 non-null  int64  
date_received                 90264 non-null  object 
contributor_lastname          90256 non-null  object 
contributor_firstname         86685 non-null  object 
contributor_city              90251 non-null  object 
contributor_state             90244 non-null  object 
contributor_zip               90248 non-null  object 
contributor_employer          81684 non-null  object 
contributor_occupation        86545 non-null  object 
contributor_is_self_employed  90264 non-null  bool   
amount                        90264 non-null  float64
dtypes: bool(1), float64(1), int64(5), object(13)
memory usage: 13.9+ MB

You can also see that the DataFrame now contains all of the columns in both tables. Columns with the same name have had a suffix automatically appended to indicate whether they came from the first or second DataFrame submitted to the merge.

8.2. Filtering to a single proposition¶

The combined table now joins all contributions to all committees. To zero on just the contributions to committees in the contest over Proposition 64, we’ll need to filter out data, much like we did in the last chapter. Only this time, we’ll filter our new merged DataFrame instead.

merged_prop = merged_everything[merged_everything.prop_name == my_prop]

We have now created a new dataset that includes only contributions supporting and opposing Proposition 64.

merged_prop.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 860 entries, 2102 to 83888
Data columns (total 20 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 ocd_prop_id                   860 non-null    object 
 calaccess_prop_id             860 non-null    int64  
 ccdc_prop_id                  860 non-null    int64  
 prop_name                     860 non-null    object 
 ccdc_committee_id             860 non-null    int64  
 calaccess_committee_id        860 non-null    int64  
 committee_name_x              860 non-null    object 
 committee_position            860 non-null    object 
 committee_name_y              860 non-null    object 
 calaccess_filing_id           860 non-null    int64  
date_received                 860 non-null    object 
contributor_lastname          860 non-null    object 
contributor_firstname         750 non-null    object 
contributor_city              860 non-null    object 
contributor_state             860 non-null    object 
contributor_zip               860 non-null    object 
contributor_employer          657 non-null    object 
contributor_occupation        742 non-null    object 
contributor_is_self_employed  860 non-null    bool   
amount                        860 non-null    float64
dtypes: bool(1), float64(1), int64(5), object(13)
memory usage: 135.2+ KB

We’re ready to move on from preparing our data. It’s time to interview it.

8. Merge¶

8.1. Merging DataFrames¶

8.2. Filtering to a single proposition¶

First Python Notebook

Navigation

Related Topics

Donate/support