6. Filter

The most common way to filter a DataFrame is to pass an expression as an “index” that can be used to decide which records should be kept and which discarded. You write the expression by combining a column on your DataFrame with an “operator” like == or > or < and a value to compare against each row.

Note

If you are familiar with writing SQL to manipulate databases, pandas’ filtering system is somewhat similar to a WHERE query. The official pandas documentation offers direct translations between the two.

Let’s try filtering against the state field. Save a postal code into a variable. This will allow us to reuse it later.

Hide code cell content
import pandas as pd
accident_list = pd.read_csv("https://raw.githubusercontent.com/palewire/first-python-notebook/main/docs/src/_static/ntsb-accidents.csv")
accident_list['latimes_make_and_model'] = accident_list['latimes_make_and_model'].str.upper()
my_state = "IA"

In the next cell we will ask pandas to narrow down our list of accidents to just those in the state we’re interested in. We will create a filter expression and place it between two flat brackets following the DataFrame we wish to filter.

accident_list[accident_list['state'] == my_state]
event_id ntsb_make ntsb_model ntsb_number year date city state country total_fatalities latimes_make latimes_model latimes_make_and_model
12 20060705X00868 BELL 206B CHI06FA173 2006 06/30/2006 00:00:00 WALFORD IA USA 1 BELL 206 BELL 206
118 20130102X35708 BELL HELICOPTER 407 CEN13FA122 2013 01/02/2013 00:00:00 CLEAR LAKE IA USA 3 BELL 407 BELL 407
140 20140908X10448 ROBINSON HELICOPTER COMPANY R44 II CEN14LA487 2014 09/06/2014 00:00:00 MACEDONIA IA USA 1 ROBINSON R44 ROBINSON R44

Now we should save the results of that filter into a new variable separate from the full list we imported from the CSV file. Since it includes only the sites for the state we want, let’s call it my_accidents.

my_accidents = accident_list[accident_list['state'] == my_state]

To check our work and find out how many records are left after the filter, let’s run the DataFrame inspection commands we learned earlier.

First head.

my_accidents.head()
event_id ntsb_make ntsb_model ntsb_number year date city state country total_fatalities latimes_make latimes_model latimes_make_and_model
12 20060705X00868 BELL 206B CHI06FA173 2006 06/30/2006 00:00:00 WALFORD IA USA 1 BELL 206 BELL 206
118 20130102X35708 BELL HELICOPTER 407 CEN13FA122 2013 01/02/2013 00:00:00 CLEAR LAKE IA USA 3 BELL 407 BELL 407
140 20140908X10448 ROBINSON HELICOPTER COMPANY R44 II CEN14LA487 2014 09/06/2014 00:00:00 MACEDONIA IA USA 1 ROBINSON R44 ROBINSON R44

Then info.

my_accidents.info()
<class 'pandas.core.frame.DataFrame'>
Index: 3 entries, 12 to 140
Data columns (total 13 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   event_id                3 non-null      object
 1   ntsb_make               3 non-null      object
 2   ntsb_model              3 non-null      object
 3   ntsb_number             3 non-null      object
 4   year                    3 non-null      int64 
 5   date                    3 non-null      object
 6   city                    3 non-null      object
 7   state                   3 non-null      object
 8   country                 3 non-null      object
 9   total_fatalities        3 non-null      int64 
 10  latimes_make            3 non-null      object
 11  latimes_model           3 non-null      object
 12  latimes_make_and_model  3 non-null      object
dtypes: int64(2), object(11)
memory usage: 336.0+ bytes

Now pick another state and try running the code again. See if you can write filters that will answer the following questions:

  1. Which state recorded more accidents, Iowa or Missouri?

  2. How many accidents recorded more than one fatality?

  3. How many accidents happened in California in 2015?

  4. What percentage of the total fatalities occured in California?