6. Filter¶
The most common way to filter a DataFrame is to pass an expression as an “index” that can be used to decide which records to keep and which to discard. You write the expression by combining a column of your DataFrame with an “operator” like ==
or >
or <
and a value to compare each row against.
Note
If you are familiar with writing SQL to manipulate databases, pandas’ filtering system is somewhat similar to a WHERE query. The official pandas documentation offers direct translations between the two.
Let’s try filtering against the state
field. Save a state’s postal code into a variable. This will allow us to reuse it later.
Show code cell content
import pandas as pd
accident_list = pd.read_csv("https://raw.githubusercontent.com/palewire/first-python-notebook/main/docs/src/_static/ntsb-accidents.csv")
accident_list["latimes_make_and_model"] = accident_list["latimes_make_and_model"].str.upper()
my_state = "IA"
In the next cell we will ask pandas to narrow down our list of accidents to just those in our state of interest. We will create a filter expression and place it between two square brackets following the DataFrame we wish to filter.
accident_list[accident_list["state"] == my_state]
event_id | ntsb_make | ntsb_model | ntsb_number | year | date | city | state | country | total_fatalities | latimes_make | latimes_model | latimes_make_and_model | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
12 | 20060705X00868 | BELL | 206B | CHI06FA173 | 2006 | 06/30/2006 00:00:00 | WALFORD | IA | USA | 1 | BELL | 206 | BELL 206 |
118 | 20130102X35708 | BELL HELICOPTER | 407 | CEN13FA122 | 2013 | 01/02/2013 00:00:00 | CLEAR LAKE | IA | USA | 3 | BELL | 407 | BELL 407 |
140 | 20140908X10448 | ROBINSON HELICOPTER COMPANY | R44 II | CEN14LA487 | 2014 | 09/06/2014 00:00:00 | MACEDONIA | IA | USA | 1 | ROBINSON | R44 | ROBINSON R44 |
Now we should save the results of that filter into a new variable separate from the full list we imported from the CSV file. Since it includes only accidents in our chosen state, let’s call it my_accidents
.
my_accidents = accident_list[accident_list["state"] == my_state]
To check our work and find out how many records are left after the filter, let’s run the DataFrame inspection commands we learned earlier.
First head
.
my_accidents.head()
event_id | ntsb_make | ntsb_model | ntsb_number | year | date | city | state | country | total_fatalities | latimes_make | latimes_model | latimes_make_and_model | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
12 | 20060705X00868 | BELL | 206B | CHI06FA173 | 2006 | 06/30/2006 00:00:00 | WALFORD | IA | USA | 1 | BELL | 206 | BELL 206 |
118 | 20130102X35708 | BELL HELICOPTER | 407 | CEN13FA122 | 2013 | 01/02/2013 00:00:00 | CLEAR LAKE | IA | USA | 3 | BELL | 407 | BELL 407 |
140 | 20140908X10448 | ROBINSON HELICOPTER COMPANY | R44 II | CEN14LA487 | 2014 | 09/06/2014 00:00:00 | MACEDONIA | IA | USA | 1 | ROBINSON | R44 | ROBINSON R44 |
Then info
.
my_accidents.info()
<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, 12 to 140
Data columns (total 13 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 event_id 3 non-null object
1 ntsb_make 3 non-null object
2 ntsb_model 3 non-null object
3 ntsb_number 3 non-null object
4 year 3 non-null int64
5 date 3 non-null object
6 city 3 non-null object
7 state 3 non-null object
8 country 3 non-null object
9 total_fatalities 3 non-null int64
10 latimes_make 3 non-null object
11 latimes_model 3 non-null object
12 latimes_make_and_model 3 non-null object
dtypes: int64(2), object(11)
memory usage: 336.0+ bytes
Now pick another state and try running the code again. See if you can write filters that will answer the following questions:
Which state recorded more accidents: Iowa or Missouri?
How many accidents recorded more than one fatality?
How many accidents happened in California in 2015?
What percentage of the total fatalities occured in California?
Once you’ve written code that generates the answers, you’re ready to move on to the next chapter.