8. Merge

Next we’ll cover how to merge two DataFrames together into a combined table. Before we can do that, we need to read in a second file. We’ll pull faa-survey.csv, which contains annual estimates of how many hours each type of helicopter was in the air. If we merge it with our accident totals, we will be able to calculate an accident rate.

We can read it in the same way as the NTSB accident list, with read_csv.

Hide code cell content
import pandas as pd
accident_list = pd.read_csv("https://raw.githubusercontent.com/palewire/first-python-notebook/main/docs/src/_static/ntsb-accidents.csv")
accident_list['latimes_make_and_model'] = accident_list['latimes_make_and_model'].str.upper()
accident_counts = accident_list.groupby("latimes_make_and_model").size().reset_index().rename(columns={0: "accidents"})
survey = pd.read_csv("https://raw.githubusercontent.com/palewire/first-python-notebook/main/docs/src/_static/faa-survey.csv")

Before you do anything, take a peek at it with with the head.

survey.head()
latimes_make_and_model total_hours
0 AGUSTA 109 362172
1 AIRBUS 130 1053786
2 AIRBUS 135 884596
3 AIRBUS 350 3883490
4 BELL 206 5501308

When joining two tables together, the first step is to look carefully at the columns in each table to find a common column that can be joined. We can do that with the info command we learned earlier.

accident_counts.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 2 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   latimes_make_and_model  12 non-null     object
 1   accidents               12 non-null     int64 
dtypes: int64(1), object(1)
memory usage: 324.0+ bytes
survey.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 2 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   latimes_make_and_model  12 non-null     object
 1   total_hours             12 non-null     int64 
dtypes: int64(1), object(1)
memory usage: 324.0+ bytes

You can see that each table contains the latimes_make_and_model column. We can therefore join the two files using that column with the pandas merge method.

Note

If you are familar with traditional databases, you may recognize that the merge method in pandas is similar to SQL’s JOIN statement. If you dig into merge’s documentation you will see it has many of the same options.

Merging two DataFrames is as simple as passing both to pandas built-in merge method and specifying which field we’d like to use to connect them together. We will save the result into another new variable, which I’m going to call merged_list.

merged_list = pd.merge(accident_counts, survey, on="latimes_make_and_model")

Note

You may notice something new with the on="latimes_make_and_model" bit above. It is what Python calls a keyword argument. Keyword arguments are inputs passed to a function or method after explicitly specifying the name of the argument, followed by an equals sign.

Keyword arguments can be passed in any order, as long as the name of the argument is specified. When creating a function, they can be used to specify a default value for a parameter. For this reason, they are commonly to provide overrides of a method’s out-of-the-box behavior.

The pandas documentation for merge reveals all of the keyword options available, as well as their defaults.

That new DataFrame can be inspected like any other.

merged_list.head()
latimes_make_and_model accidents total_hours
0 AGUSTA 109 2 362172
1 AIRBUS 130 1 1053786
2 AIRBUS 135 4 884596
3 AIRBUS 350 29 3883490
4 BELL 206 30 5501308

By looking at the columns you can check how many rows survived the merge, a precaution you should take every time you join two tables.

merged_list.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 3 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   latimes_make_and_model  12 non-null     object
 1   accidents               12 non-null     int64 
 2   total_hours             12 non-null     int64 
dtypes: int64(2), object(1)
memory usage: 420.0+ bytes

You can also verify that the DataFrame has the same number of records as there are values in accident_totals column. That’s good; If there are no null values, that means that every record in each DataFrame found a match in the other.