9. Compute

Hide code cell content
import pandas as pd
accident_list = pd.read_csv("https://raw.githubusercontent.com/palewire/first-python-notebook/main/docs/src/_static/ntsb-accidents.csv")
accident_list['latimes_make_and_model'] = accident_list['latimes_make_and_model'].str.upper()
accident_counts = accident_list.groupby("latimes_make_and_model").size().reset_index().rename(columns={0: "accidents"})
survey = pd.read_csv("https://raw.githubusercontent.com/palewire/first-python-notebook/main/docs/src/_static/faa-survey.csv")
survey['latimes_make_and_model'] = survey['latimes_make_and_model'].str.upper()
merged_list = pd.merge(accident_counts, survey, on="latimes_make_and_model")

To calculate an accident rate, we’ll need to create a new column based on the data in other columns, a process sometimes known as “computing.”

In many cases, it’s no more complicated than combining two series using a mathematical operator. That’s true in this case, where our goal is to divide the total number of accidents in each row into the total hours. That can accomplished with the following:

merged_list['accidents'] / merged_list['total_hours']
0     5.522238e-06
1     9.489593e-07
2     4.521838e-06
3     7.467510e-06
4     5.453249e-06
5     6.150096e-06
6     1.081812e-05
7     1.089544e-05
8     6.732180e-06
9     1.610354e-05
10    4.388560e-06
11    2.184563e-06
dtype: float64

The resulting series can be added to your dataframe by assigning it to a new column. You name your column by providing it as a quoted string inside of flat brackets. Let’s call this column something brief and clear like per_hour.

merged_list['per_hour'] = merged_list['accidents'] / merged_list['total_hours']

Which, like everything else, you can inspect with the head command.

merged_list.head()
latimes_make_and_model accidents total_hours per_hour
0 AGUSTA 109 2 362172 5.522238e-06
1 AIRBUS 130 1 1053786 9.489593e-07
2 AIRBUS 135 4 884596 4.521838e-06
3 AIRBUS 350 29 3883490 7.467510e-06
4 BELL 206 30 5501308 5.453249e-06

You can see that the result is in scientific notation. As is common when calculating per capita statistics, you can multiple all results by a common number to make the numbers more legible. That’s as easy as tacking on the multiplication at the end of a computation. Here we’ll use 100,000.

merged_list['per_100k_hours'] = merged_list['per_hour'] * 100_000

Have a look at the result with head again.

merged_list.head()
latimes_make_and_model accidents total_hours per_hour per_100k_hours
0 AGUSTA 109 2 362172 5.522238e-06 0.552224
1 AIRBUS 130 1 1053786 9.489593e-07 0.094896
2 AIRBUS 135 4 884596 4.521838e-06 0.452184
3 AIRBUS 350 29 3883490 7.467510e-06 0.746751
4 BELL 206 30 5501308 5.453249e-06 0.545325

Much better! Now lets move on to the next step, sorting our data into a ranking.