5. Data¶
Now it’s time to get our hands on some real data.
The coalition has created simplified data files containing the disclosure forms that committees campaigning either for against one of the 17 propositions on the ballot in November 2016 filed with the state of California.
They are:
Name |
Description |
---|---|
Committees active in the election linked to propositions supported or opposed |
|
Donors reported by each of the committees |
The data are structured in rows of comma-separated values. This is known as a CSV file. It is the most common way you will find data published online.
5.1. Creating a DataFrame¶
The pandas library is able to read in files from a variety formats, including CSV.
If it’s not currently running start up your Jupyter Notebook as described in chapter two.
Scroll down to the first open cell. There we will import the first CSV file listed above using the read_csv function included with pandas.
import pandas as pd
pd.read_csv("https://raw.githubusercontent.com/california-civic-data-coalition/first-python-notebook/master/docs/src/_static/committees.csv")
ocd_prop_id | calaccess_prop_id | ccdc_prop_id | prop_name | ccdc_committee_id | calaccess_committee_id | committee_name | committee_position | |
---|---|---|---|---|---|---|---|---|
0 | ocd-contest/b51dc64d-3562-4913-a190-69f5088c22a6 | 1376258 | 70 | PROPOSITION 051 - SCHOOL BONDS. FUNDING FOR K-... | 382 | 1374469 | YES ON PROPOSITION 51 - CALIFORNIANS FOR QUALI... | SUPPORT |
1 | ocd-contest/b51dc64d-3562-4913-a190-69f5088c22a6 | 1376258 | 70 | PROPOSITION 051 - SCHOOL BONDS. FUNDING FOR K-... | 383 | 1220380 | COMMUNITY COLLEGE FACILITY COALITION ISSUES CO... | SUPPORT |
2 | ocd-contest/b51dc64d-3562-4913-a190-69f5088c22a6 | 1376258 | 70 | PROPOSITION 051 - SCHOOL BONDS. FUNDING FOR K-... | 384 | 1282321 | TORLAKSON'S INVEST IN CALIFORNIA A BALLOT MEAS... | SUPPORT |
3 | ocd-contest/b51dc64d-3562-4913-a190-69f5088c22a6 | 1376258 | 70 | PROPOSITION 051 - SCHOOL BONDS. FUNDING FOR K-... | 385 | 1382843 | CALIFORNIA TAXPAYERS AND EDUCATORS OPPOSED TO ... | OPPOSE |
4 | ocd-contest/85990193-9d6f-4600-b8e7-bf1317841d82 | 1362198 | 71 | PROPOSITION 052 - STATE FEES ON HOSPITALS. FED... | 386 | 1362973 | YES ON PROPOSITION 52 - A COALITION OF CALIFOR... | SUPPORT |
... | ... | ... | ... | ... | ... | ... | ... | ... |
97 | ocd-contest/7495cdbe-1aa7-4c26-9a55-aa4130347b95 | 1372638 | 86 | PROPOSITION 067- REFERENDUM TO OVERTURN BAN ON... | 479 | 1329332 | SAVE THE BAG BAN, YES ON 67, SPONSORED BY ENVI... | SUPPORT |
98 | ocd-contest/7495cdbe-1aa7-4c26-9a55-aa4130347b95 | 1372638 | 86 | PROPOSITION 067- REFERENDUM TO OVERTURN BAN ON... | 480 | 1374885 | YES ON 67 - CALIFORNIANS AGAINST WASTE - PROTE... | SUPPORT |
99 | ocd-contest/7495cdbe-1aa7-4c26-9a55-aa4130347b95 | 1372638 | 86 | PROPOSITION 067- REFERENDUM TO OVERTURN BAN ON... | 481 | 1346973 | CALIFORNIA GROCERS ASSOCIATION ISSUES COMMITTE... | SUPPORT |
100 | ocd-contest/7495cdbe-1aa7-4c26-9a55-aa4130347b95 | 1372638 | 86 | PROPOSITION 067- REFERENDUM TO OVERTURN BAN ON... | 482 | 1381938 | SAVE THE BAY ACTION FUND PAC - YES ON PROP 67 | SUPPORT |
101 | ocd-contest/7495cdbe-1aa7-4c26-9a55-aa4130347b95 | 1372638 | 86 | PROPOSITION 067- REFERENDUM TO OVERTURN BAN ON... | 483 | 1389817 | SAVE THE BAY ACTION FUND COMMITTEE TO SUPPORT ... | SUPPORT |
102 rows × 8 columns
Warning
You will need to precisely type in the URL to the file. Feel free to copy and paste it from the example above into your notebook.
After you run the cell, you should see a big table printed below.
It is a DataFrame where pandas has structured the CSV data into rows and columns, just like Excel or other spreadsheet software might.
The advantage here is that rather than manipulating the data through a haphazard series of clicks and keypunches we will be gradually grinding down the data using a computer programming script that is 100% transparent and reproducible.
5.2. Creating a variable¶
In order to do that, we need to store our DataFrame so it can be reused in subsequent cells. We can do this by saving in a “variable”, which is a fancy computer programming word for a named shortcut where we save our work as we go.
Go back to your initial cell and change it to this. Then rerun it.
committee_list = pd.read_csv("https://raw.githubusercontent.com/california-civic-data-coalition/first-python-notebook/master/docs/src/_static/committees.csv")
After you run it, you shouldn’t see anything. That’s a good thing. It means our DataFrame has been saved under the name committee_list
, which we can now begin interacting with in the cells that follow.
We can do this by calling “methods” that pandas has made available to all DataFrames.
You may not have known it at the time, but read_csv
was one of these methods. There are dozens more that can do all sorts of interesting things. Let’s start with some easy ones that analysts use all the time.
5.3. Using the head method¶
First, to preview the first few rows of the dataset, try the head method. Hit the +
button in the toolbar to add a new cell below the first one. Type this in it and hit the run button again.
committee_list.head()
ocd_prop_id | calaccess_prop_id | ccdc_prop_id | prop_name | ccdc_committee_id | calaccess_committee_id | committee_name | committee_position | |
---|---|---|---|---|---|---|---|---|
0 | ocd-contest/b51dc64d-3562-4913-a190-69f5088c22a6 | 1376258 | 70 | PROPOSITION 051 - SCHOOL BONDS. FUNDING FOR K-... | 382 | 1374469 | YES ON PROPOSITION 51 - CALIFORNIANS FOR QUALI... | SUPPORT |
1 | ocd-contest/b51dc64d-3562-4913-a190-69f5088c22a6 | 1376258 | 70 | PROPOSITION 051 - SCHOOL BONDS. FUNDING FOR K-... | 383 | 1220380 | COMMUNITY COLLEGE FACILITY COALITION ISSUES CO... | SUPPORT |
2 | ocd-contest/b51dc64d-3562-4913-a190-69f5088c22a6 | 1376258 | 70 | PROPOSITION 051 - SCHOOL BONDS. FUNDING FOR K-... | 384 | 1282321 | TORLAKSON'S INVEST IN CALIFORNIA A BALLOT MEAS... | SUPPORT |
3 | ocd-contest/b51dc64d-3562-4913-a190-69f5088c22a6 | 1376258 | 70 | PROPOSITION 051 - SCHOOL BONDS. FUNDING FOR K-... | 385 | 1382843 | CALIFORNIA TAXPAYERS AND EDUCATORS OPPOSED TO ... | OPPOSE |
4 | ocd-contest/85990193-9d6f-4600-b8e7-bf1317841d82 | 1362198 | 71 | PROPOSITION 052 - STATE FEES ON HOSPITALS. FED... | 386 | 1362973 | YES ON PROPOSITION 52 - A COALITION OF CALIFOR... | SUPPORT |
5.4. Using the info method¶
To get a look at all of the columns and what type of data they store, add another cell and try info.
committee_list.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 102 entries, 0 to 101
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 ocd_prop_id 102 non-null object
1 calaccess_prop_id 102 non-null int64
2 ccdc_prop_id 102 non-null int64
3 prop_name 102 non-null object
4 ccdc_committee_id 102 non-null int64
5 calaccess_committee_id 102 non-null int64
6 committee_name 102 non-null object
7 committee_position 102 non-null object
dtypes: int64(4), object(4)
memory usage: 6.5+ KB
Look carefully at those results and you see we have more than 100 links between committees and propositions.
5.5. Creating another DataFrame¶
With that we’re ready to move on to a related, similar task: Importing all of the individual contributions reported to last year’s 17 ballot measures.
We’ll start by using the read_csv
method to import the second CSV file linked above. Save it as a new variable just as we did before. Let’s call this one contrib_list
.
contrib_list = pd.read_csv("https://raw.githubusercontent.com/california-civic-data-coalition/first-python-notebook/master/docs/src/_static/contributions.csv")
Just as we did earlier, you can inspect the contents of this new file with the head method.
contrib_list.head()
calaccess_committee_id | committee_name | calaccess_filing_id | date_received | contributor_lastname | contributor_firstname | contributor_city | contributor_state | contributor_zip | contributor_employer | contributor_occupation | contributor_is_self_employed | amount | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1386560 | ADULT USE CAMPAIGN FOR PROPOSITION 64; THE | 2083796 | 2016-09-18 | BERGMAN | GRETCHEN | SPRING VALLEY | CA | 91978 | A NEW PATH | EXECUTIVE DIRECTOR | False | 84.0 |
1 | 1386560 | ADULT USE CAMPAIGN FOR PROPOSITION 64; THE | 2083796 | 2016-09-18 | KAHLE | MYRNA | SAN DIEGO | CA | 92109 | NATIONAL SCHOOL DISTRICT | TEACHER | False | 35.0 |
2 | 1386560 | ADULT USE CAMPAIGN FOR PROPOSITION 64; THE | 2083796 | 2016-07-15 | MCDEVITT | LEO | ESCONDIDO | CA | 92025 | LIFE IONIZERS | SEO/CONTENT MANAGER | False | 198.0 |
3 | 1386560 | ADULT USE CAMPAIGN FOR PROPOSITION 64; THE | 2083796 | 2016-08-10 | WARREN-SAMARIPA | STEPHANIE | SAN DIEGO | CA | 92116 | STEPHANIE WARREN SAMARIPA | ENTREPRENEUR | False | -50.0 |
4 | 1386560 | ADULT USE CAMPAIGN FOR PROPOSITION 64; THE | 2083796 | 2016-07-26 | WARREN-SAMARIPA | STEPHANIE | SAN DIEGO | CA | 92116 | STEPHANIE WARREN SAMARIPA | ENTREPRENEUR | True | 50.0 |
You should also inspect the columns using the info method. Running these two tricks whenever you open a new file is a good habit to develop so that you can carefully examine the data you’re about to work with.
contrib_list.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 56379 entries, 0 to 56378
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 calaccess_committee_id 56379 non-null int64
1 committee_name 56379 non-null object
2 calaccess_filing_id 56379 non-null int64
3 date_received 56379 non-null object
4 contributor_lastname 56375 non-null object
5 contributor_firstname 53411 non-null object
6 contributor_city 56369 non-null object
7 contributor_state 56363 non-null object
8 contributor_zip 56366 non-null object
9 contributor_employer 48572 non-null object
10 contributor_occupation 53273 non-null object
11 contributor_is_self_employed 56379 non-null bool
12 amount 56379 non-null float64
dtypes: bool(1), float64(1), int64(2), object(9)
memory usage: 5.2+ MB
Now that you’ve got some data imported, we’re ready to begin our analysis.