6. Filter¶
The most common way to filter a DataFrame is to pass an expression as an “index” that can be used to decide which records to keep and which to discard. You write the expression by combining a column of your DataFrame with an “operator” like == or > or < and a value to compare each row against.
Note
If you are familiar with writing SQL to manipulate databases, pandas’ filtering system is somewhat similar to a WHERE query. The official pandas documentation offers direct translations between the two.
Let’s try filtering against the state field. Save a state’s postal code into a variable. This will allow us to reuse it later.
Show code cell content
import pandas as pd
accident_list = pd.read_csv("https://raw.githubusercontent.com/palewire/first-python-notebook/main/docs/src/_static/ntsb-accidents.csv")
accident_list["latimes_make_and_model"] = accident_list["latimes_make_and_model"].str.upper()
my_state = "IA"
In the next cell we will ask pandas to narrow down our list of accidents to just those in our state of interest. We will create a filter expression and place it between two square brackets following the DataFrame we wish to filter.
accident_list[accident_list["state"] == my_state]
| event_id | ntsb_make | ntsb_model | ntsb_number | year | date | city | state | country | total_fatalities | latimes_make | latimes_model | latimes_make_and_model | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 12 | 20060705X00868 | BELL | 206B | CHI06FA173 | 2006 | 06/30/2006 00:00:00 | WALFORD | IA | USA | 1 | BELL | 206 | BELL 206 | 
| 118 | 20130102X35708 | BELL HELICOPTER | 407 | CEN13FA122 | 2013 | 01/02/2013 00:00:00 | CLEAR LAKE | IA | USA | 3 | BELL | 407 | BELL 407 | 
| 140 | 20140908X10448 | ROBINSON HELICOPTER COMPANY | R44 II | CEN14LA487 | 2014 | 09/06/2014 00:00:00 | MACEDONIA | IA | USA | 1 | ROBINSON | R44 | ROBINSON R44 | 
Now we should save the results of that filter into a new variable separate from the full list we imported from the CSV file. Since it includes only accidents in our chosen state, let’s call it my_accidents.
my_accidents = accident_list[accident_list["state"] == my_state]
To check our work and find out how many records are left after the filter, let’s run the DataFrame inspection commands we learned earlier.
First head.
my_accidents.head()
| event_id | ntsb_make | ntsb_model | ntsb_number | year | date | city | state | country | total_fatalities | latimes_make | latimes_model | latimes_make_and_model | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 12 | 20060705X00868 | BELL | 206B | CHI06FA173 | 2006 | 06/30/2006 00:00:00 | WALFORD | IA | USA | 1 | BELL | 206 | BELL 206 | 
| 118 | 20130102X35708 | BELL HELICOPTER | 407 | CEN13FA122 | 2013 | 01/02/2013 00:00:00 | CLEAR LAKE | IA | USA | 3 | BELL | 407 | BELL 407 | 
| 140 | 20140908X10448 | ROBINSON HELICOPTER COMPANY | R44 II | CEN14LA487 | 2014 | 09/06/2014 00:00:00 | MACEDONIA | IA | USA | 1 | ROBINSON | R44 | ROBINSON R44 | 
Then info.
my_accidents.info()
<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, 12 to 140
Data columns (total 13 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   event_id                3 non-null      object
 1   ntsb_make               3 non-null      object
 2   ntsb_model              3 non-null      object
 3   ntsb_number             3 non-null      object
 4   year                    3 non-null      int64 
 5   date                    3 non-null      object
 6   city                    3 non-null      object
 7   state                   3 non-null      object
 8   country                 3 non-null      object
 9   total_fatalities        3 non-null      int64 
 10  latimes_make            3 non-null      object
 11  latimes_model           3 non-null      object
 12  latimes_make_and_model  3 non-null      object
dtypes: int64(2), object(11)
memory usage: 336.0+ bytes
Now pick another state and try running the code again. See if you can write filters that will answer the following questions:
- Which state recorded more accidents: Iowa or Missouri? 
- How many accidents recorded more than one fatality? 
- How many accidents happened in California in 2015? 
- What percentage of the total fatalities occured in California? 
Once you’ve written code that generates the answers, you’re ready to move on to the next chapter.