10. Sort

Another simple but common technique for analyzing data is sorting. This can be useful for ranking the DataFrame to show the first and last members of the table according to a particular column.

Hide code cell content
import pandas as pd
accident_list = pd.read_csv("https://raw.githubusercontent.com/palewire/first-python-notebook/main/docs/src/_static/ntsb-accidents.csv")
accident_list["latimes_make_and_model"] = accident_list["latimes_make_and_model"].str.upper()
accident_counts = accident_list.groupby("latimes_make_and_model").size().reset_index().rename(columns={0: "accidents"})
survey = pd.read_csv("https://raw.githubusercontent.com/palewire/first-python-notebook/main/docs/src/_static/faa-survey.csv")
survey["latimes_make_and_model"] = survey["latimes_make_and_model"].str.upper()
merged_list = pd.merge(accident_counts, survey, on="latimes_make_and_model")
merged_list["per_hour"] = merged_list.accidents / merged_list.total_hours
merged_list["per_100k_hours"] = (merged_list.accidents / merged_list.total_hours) * 100_000

The sort_values method is how pandas does it. It expects you to provide it with the name of the column to sort by in quotes. Try sorting by our computed field.

merged_list.sort_values("per_100k_hours")
latimes_make_and_model accidents total_hours per_hour per_100k_hours
1 AIRBUS 130 1 1053786 9.489593e-07 0.094896
11 SIKORSKY 76 2 915515 2.184563e-06 0.218456
10 SCHWEIZER 269 5 1139326 4.388560e-06 0.438856
2 AIRBUS 135 4 884596 4.521838e-06 0.452184
4 BELL 206 30 5501308 5.453249e-06 0.545325
0 AGUSTA 109 2 362172 5.522238e-06 0.552224
5 BELL 407 13 2113788 6.150096e-06 0.615010
8 ROBINSON R22 20 2970806 6.732180e-06 0.673218
3 AIRBUS 350 29 3883490 7.467510e-06 0.746751
6 HUGHES 369 13 1201688 1.081812e-05 1.081812
7 MCDONNELL DOUGLAS 369 6 550689 1.089544e-05 1.089544
9 ROBINSON R44 38 2359729 1.610354e-05 1.610354

Note that by default sort_values returns the DataFrame sorted in ascending order from lowest to highest. You can show the largest values first by passing in an optional keyword argument called ascending. When it is set to False, the DataFrame is sorted in descending order.

merged_list.sort_values("per_100k_hours", ascending=False)
latimes_make_and_model accidents total_hours per_hour per_100k_hours
9 ROBINSON R44 38 2359729 1.610354e-05 1.610354
7 MCDONNELL DOUGLAS 369 6 550689 1.089544e-05 1.089544
6 HUGHES 369 13 1201688 1.081812e-05 1.081812
3 AIRBUS 350 29 3883490 7.467510e-06 0.746751
8 ROBINSON R22 20 2970806 6.732180e-06 0.673218
5 BELL 407 13 2113788 6.150096e-06 0.615010
0 AGUSTA 109 2 362172 5.522238e-06 0.552224
4 BELL 206 30 5501308 5.453249e-06 0.545325
2 AIRBUS 135 4 884596 4.521838e-06 0.452184
10 SCHWEIZER 269 5 1139326 4.388560e-06 0.438856
11 SIKORSKY 76 2 915515 2.184563e-06 0.218456
1 AIRBUS 130 1 1053786 9.489593e-07 0.094896

Congratulations. With that, you’ve re-created the heart of the analysis published in the Los Angeles Times and covered most of the basic skills necessary to access and analyze data with pandas.

Before we move on, here’s another quiz for you. You can answer all of these questions using only tricks we’ve learned thus far.

  1. What’s the date of the most recent fatal helicopter accident in Texas?

  2. How many fatalities occurred in Texas accidents?

  3. What helicopter model logged the most flight hours?

  4. Where did the accident with the NTSB number ERA13LA057 occur?