5. Data

Now it’s time to get our hands on some real data.

The coalition has created simplified data files containing the disclosure forms that committees campaigning either for against one of the 17 propositions on the ballot in November 2016 filed with the state of California.

They are:

Name

Description

committees.csv

Committees active in the election linked to propositions supported or opposed

contributions.csv

Donors reported by each of the committees

The data are structured in rows of comma-separated values. This is known as a CSV file. It is the most common way you will find data published online.

5.1. Creating a DataFrame

The pandas library is able to read in files from a variety formats, including CSV.

If it’s not currently running start up your Jupyter Notebook as described in chapter two.

Scroll down to the first open cell. There we will import the first CSV file listed above using the read_csv function included with pandas.

import pandas as pd
pd.read_csv("https://raw.githubusercontent.com/california-civic-data-coalition/first-python-notebook/master/docs/src/_static/committees.csv")
ocd_prop_id calaccess_prop_id ccdc_prop_id prop_name ccdc_committee_id calaccess_committee_id committee_name committee_position
0 ocd-contest/b51dc64d-3562-4913-a190-69f5088c22a6 1376258 70 PROPOSITION 051 - SCHOOL BONDS. FUNDING FOR K-... 382 1374469 YES ON PROPOSITION 51 - CALIFORNIANS FOR QUALI... SUPPORT
1 ocd-contest/b51dc64d-3562-4913-a190-69f5088c22a6 1376258 70 PROPOSITION 051 - SCHOOL BONDS. FUNDING FOR K-... 383 1220380 COMMUNITY COLLEGE FACILITY COALITION ISSUES CO... SUPPORT
2 ocd-contest/b51dc64d-3562-4913-a190-69f5088c22a6 1376258 70 PROPOSITION 051 - SCHOOL BONDS. FUNDING FOR K-... 384 1282321 TORLAKSON'S INVEST IN CALIFORNIA A BALLOT MEAS... SUPPORT
3 ocd-contest/b51dc64d-3562-4913-a190-69f5088c22a6 1376258 70 PROPOSITION 051 - SCHOOL BONDS. FUNDING FOR K-... 385 1382843 CALIFORNIA TAXPAYERS AND EDUCATORS OPPOSED TO ... OPPOSE
4 ocd-contest/85990193-9d6f-4600-b8e7-bf1317841d82 1362198 71 PROPOSITION 052 - STATE FEES ON HOSPITALS. FED... 386 1362973 YES ON PROPOSITION 52 - A COALITION OF CALIFOR... SUPPORT
... ... ... ... ... ... ... ... ...
97 ocd-contest/7495cdbe-1aa7-4c26-9a55-aa4130347b95 1372638 86 PROPOSITION 067- REFERENDUM TO OVERTURN BAN ON... 479 1329332 SAVE THE BAG BAN, YES ON 67, SPONSORED BY ENVI... SUPPORT
98 ocd-contest/7495cdbe-1aa7-4c26-9a55-aa4130347b95 1372638 86 PROPOSITION 067- REFERENDUM TO OVERTURN BAN ON... 480 1374885 YES ON 67 - CALIFORNIANS AGAINST WASTE - PROTE... SUPPORT
99 ocd-contest/7495cdbe-1aa7-4c26-9a55-aa4130347b95 1372638 86 PROPOSITION 067- REFERENDUM TO OVERTURN BAN ON... 481 1346973 CALIFORNIA GROCERS ASSOCIATION ISSUES COMMITTE... SUPPORT
100 ocd-contest/7495cdbe-1aa7-4c26-9a55-aa4130347b95 1372638 86 PROPOSITION 067- REFERENDUM TO OVERTURN BAN ON... 482 1381938 SAVE THE BAY ACTION FUND PAC - YES ON PROP 67 SUPPORT
101 ocd-contest/7495cdbe-1aa7-4c26-9a55-aa4130347b95 1372638 86 PROPOSITION 067- REFERENDUM TO OVERTURN BAN ON... 483 1389817 SAVE THE BAY ACTION FUND COMMITTEE TO SUPPORT ... SUPPORT

102 rows × 8 columns

Warning

You will need to precisely type in the URL to the file. Feel free to copy and paste it from the example above into your notebook.

After you run the cell, you should see a big table printed below.

It is a DataFrame where pandas has structured the CSV data into rows and columns, just like Excel or other spreadsheet software might.

The advantage here is that rather than manipulating the data through a haphazard series of clicks and keypunches we will be gradually grinding down the data using a computer programming script that is 100% transparent and reproducible.

5.2. Creating a variable

In order to do that, we need to store our DataFrame so it can be reused in subsequent cells. We can do this by saving in a “variable”, which is a fancy computer programming word for a named shortcut where we save our work as we go.

Go back to your initial cell and change it to this. Then rerun it.

committee_list = pd.read_csv("https://raw.githubusercontent.com/california-civic-data-coalition/first-python-notebook/master/docs/src/_static/committees.csv")

After you run it, you shouldn’t see anything. That’s a good thing. It means our DataFrame has been saved under the name committee_list, which we can now begin interacting with in the cells that follow.

We can do this by calling “methods” that pandas has made available to all DataFrames.

You may not have known it at the time, but read_csv was one of these methods. There are dozens more that can do all sorts of interesting things. Let’s start with some easy ones that analysts use all the time.

5.3. Using the head method

First, to preview the first few rows of the dataset, try the head method. Hit the + button in the toolbar to add a new cell below the first one. Type this in it and hit the run button again.

committee_list.head()
ocd_prop_id calaccess_prop_id ccdc_prop_id prop_name ccdc_committee_id calaccess_committee_id committee_name committee_position
0 ocd-contest/b51dc64d-3562-4913-a190-69f5088c22a6 1376258 70 PROPOSITION 051 - SCHOOL BONDS. FUNDING FOR K-... 382 1374469 YES ON PROPOSITION 51 - CALIFORNIANS FOR QUALI... SUPPORT
1 ocd-contest/b51dc64d-3562-4913-a190-69f5088c22a6 1376258 70 PROPOSITION 051 - SCHOOL BONDS. FUNDING FOR K-... 383 1220380 COMMUNITY COLLEGE FACILITY COALITION ISSUES CO... SUPPORT
2 ocd-contest/b51dc64d-3562-4913-a190-69f5088c22a6 1376258 70 PROPOSITION 051 - SCHOOL BONDS. FUNDING FOR K-... 384 1282321 TORLAKSON'S INVEST IN CALIFORNIA A BALLOT MEAS... SUPPORT
3 ocd-contest/b51dc64d-3562-4913-a190-69f5088c22a6 1376258 70 PROPOSITION 051 - SCHOOL BONDS. FUNDING FOR K-... 385 1382843 CALIFORNIA TAXPAYERS AND EDUCATORS OPPOSED TO ... OPPOSE
4 ocd-contest/85990193-9d6f-4600-b8e7-bf1317841d82 1362198 71 PROPOSITION 052 - STATE FEES ON HOSPITALS. FED... 386 1362973 YES ON PROPOSITION 52 - A COALITION OF CALIFOR... SUPPORT

5.4. Using the info method

To get a look at all of the columns and what type of data they store, add another cell and try info.

committee_list.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 102 entries, 0 to 101
Data columns (total 8 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   ocd_prop_id             102 non-null    object
 1   calaccess_prop_id       102 non-null    int64 
 2   ccdc_prop_id            102 non-null    int64 
 3   prop_name               102 non-null    object
 4   ccdc_committee_id       102 non-null    int64 
 5   calaccess_committee_id  102 non-null    int64 
 6   committee_name          102 non-null    object
 7   committee_position      102 non-null    object
dtypes: int64(4), object(4)
memory usage: 6.5+ KB

Look carefully at those results and you see we have more than 100 links between committees and propositions.

5.5. Creating another DataFrame

With that we’re ready to move on to a related, similar task: Importing all of the individual contributions reported to last year’s 17 ballot measures.

We’ll start by using the read_csv method to import the second CSV file linked above. Save it as a new variable just as we did before. Let’s call this one contrib_list.

contrib_list = pd.read_csv("https://raw.githubusercontent.com/california-civic-data-coalition/first-python-notebook/master/docs/src/_static/contributions.csv")

Just as we did earlier, you can inspect the contents of this new file with the head method.

contrib_list.head()
calaccess_committee_id committee_name calaccess_filing_id date_received contributor_lastname contributor_firstname contributor_city contributor_state contributor_zip contributor_employer contributor_occupation contributor_is_self_employed amount
0 1386560 ADULT USE CAMPAIGN FOR PROPOSITION 64; THE 2083796 2016-09-18 BERGMAN GRETCHEN SPRING VALLEY CA 91978 A NEW PATH EXECUTIVE DIRECTOR False 84.0
1 1386560 ADULT USE CAMPAIGN FOR PROPOSITION 64; THE 2083796 2016-09-18 KAHLE MYRNA SAN DIEGO CA 92109 NATIONAL SCHOOL DISTRICT TEACHER False 35.0
2 1386560 ADULT USE CAMPAIGN FOR PROPOSITION 64; THE 2083796 2016-07-15 MCDEVITT LEO ESCONDIDO CA 92025 LIFE IONIZERS SEO/CONTENT MANAGER False 198.0
3 1386560 ADULT USE CAMPAIGN FOR PROPOSITION 64; THE 2083796 2016-08-10 WARREN-SAMARIPA STEPHANIE SAN DIEGO CA 92116 STEPHANIE WARREN SAMARIPA ENTREPRENEUR False -50.0
4 1386560 ADULT USE CAMPAIGN FOR PROPOSITION 64; THE 2083796 2016-07-26 WARREN-SAMARIPA STEPHANIE SAN DIEGO CA 92116 STEPHANIE WARREN SAMARIPA ENTREPRENEUR True 50.0

You should also inspect the columns using the info method. Running these two tricks whenever you open a new file is a good habit to develop so that you can carefully examine the data you’re about to work with.

contrib_list.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 56379 entries, 0 to 56378
Data columns (total 13 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   calaccess_committee_id        56379 non-null  int64  
 1   committee_name                56379 non-null  object 
 2   calaccess_filing_id           56379 non-null  int64  
 3   date_received                 56379 non-null  object 
 4   contributor_lastname          56375 non-null  object 
 5   contributor_firstname         53411 non-null  object 
 6   contributor_city              56369 non-null  object 
 7   contributor_state             56363 non-null  object 
 8   contributor_zip               56366 non-null  object 
 9   contributor_employer          48572 non-null  object 
 10  contributor_occupation        53273 non-null  object 
 11  contributor_is_self_employed  56379 non-null  bool   
 12  amount                        56379 non-null  float64
dtypes: bool(1), float64(1), int64(2), object(9)
memory usage: 5.2+ MB

Now that you’ve got some data imported, we’re ready to begin our analysis.