9. Compute¶
Show code cell content
import pandas as pd
accident_list = pd.read_csv("https://raw.githubusercontent.com/palewire/first-python-notebook/main/docs/src/_static/ntsb-accidents.csv")
accident_list["latimes_make_and_model"] = accident_list["latimes_make_and_model"].str.upper()
accident_counts = accident_list.groupby("latimes_make_and_model").size().reset_index().rename(columns={0: "accidents"})
survey = pd.read_csv("https://raw.githubusercontent.com/palewire/first-python-notebook/main/docs/src/_static/faa-survey.csv")
survey["latimes_make_and_model"] = survey["latimes_make_and_model"].str.upper()
merged_list = pd.merge(accident_counts, survey, on="latimes_make_and_model")
To calculate an accident rate, we’ll need to create a new column based on the data in other columns, a process sometimes known as “computing.”
In many cases, it’s no more complicated than combining two Series using a mathematical operator. That’s true in this case, where our goal is to divide the total number of accidents in each row by the total hours. That can accomplished with the following:
merged_list["accidents"] / merged_list["total_hours"]
0 5.522238e-06
1 9.489593e-07
2 4.521838e-06
3 7.467510e-06
4 5.453249e-06
5 6.150096e-06
6 1.081812e-05
7 1.089544e-05
8 6.732180e-06
9 1.610354e-05
10 4.388560e-06
11 2.184563e-06
dtype: float64
The resulting Series can be added to your DataFrame by assigning it to a new column. Name your column by providing it as a quoted string inside of square brackets. Let’s call this column something brief and clear like per_hour
.
merged_list["per_hour"] = merged_list["accidents"] / merged_list["total_hours"]
Like everything else, you can inspect with the head
command.
merged_list.head()
latimes_make_and_model | accidents | total_hours | per_hour | |
---|---|---|---|---|
0 | AGUSTA 109 | 2 | 362172 | 5.522238e-06 |
1 | AIRBUS 130 | 1 | 1053786 | 9.489593e-07 |
2 | AIRBUS 135 | 4 | 884596 | 4.521838e-06 |
3 | AIRBUS 350 | 29 | 3883490 | 7.467510e-06 |
4 | BELL 206 | 30 | 5501308 | 5.453249e-06 |
You can see that the result is in scientific notation. As is common when calculating per capita statistics, let’s multiple the per-hour results by a common number to make the figures more legible. That’s as easy as tacking some multiplication at the end of a computation. Here we’ll multiply by 100,000 hours.
merged_list["per_100k_hours"] = merged_list["per_hour"] * 100_000
Have a look at the result with head
again.
merged_list.head()
latimes_make_and_model | accidents | total_hours | per_hour | per_100k_hours | |
---|---|---|---|---|---|
0 | AGUSTA 109 | 2 | 362172 | 5.522238e-06 | 0.552224 |
1 | AIRBUS 130 | 1 | 1053786 | 9.489593e-07 | 0.094896 |
2 | AIRBUS 135 | 4 | 884596 | 4.521838e-06 | 0.452184 |
3 | AIRBUS 350 | 29 | 3883490 | 7.467510e-06 | 0.746751 |
4 | BELL 206 | 30 | 5501308 | 5.453249e-06 | 0.545325 |
Much better! Now lets move on to the next step, sorting our data into a ranking.