10. Sort

Another simple but common technique for analyzing data is sorting. This can be useful for ranking the DataFrame to show the first and last members of the table according to a particular column.

Hide code cell content
import pandas as pd
accident_list = pd.read_csv("https://raw.githubusercontent.com/palewire/first-python-notebook/main/docs/src/_static/ntsb-accidents.csv")
accident_list['latimes_make_and_model'] = accident_list['latimes_make_and_model'].str.upper()
accident_counts = accident_list.groupby("latimes_make_and_model").size().reset_index().rename(columns={0: "accidents"})
survey = pd.read_csv("https://raw.githubusercontent.com/palewire/first-python-notebook/main/docs/src/_static/faa-survey.csv")
merged_list = pd.merge(accident_counts, survey, on="latimes_make_and_model")
merged_list['per_hour'] = merged_list.accidents / merged_list.total_hours
merged_list['per_100k_hours'] = (merged_list.accidents / merged_list.total_hours) * 100_000

The sort_values method is how pandas does it. It expects that you provide it with the name of the column to sort by in quotes. Try sorting by our computed field.

merged_list.sort_values("per_100k_hours")
latimes_make_and_model accidents total_hours per_hour per_100k_hours
1 AIRBUS 130 1 1053786 9.489593e-07 0.094896
11 SIKORSKY 76 2 915515 2.184563e-06 0.218456
10 SCHWEIZER 269 5 1139326 4.388560e-06 0.438856
2 AIRBUS 135 4 884596 4.521838e-06 0.452184
4 BELL 206 30 5501308 5.453249e-06 0.545325
0 AGUSTA 109 2 362172 5.522238e-06 0.552224
5 BELL 407 13 2113788 6.150096e-06 0.615010
8 ROBINSON R22 20 2970806 6.732180e-06 0.673218
3 AIRBUS 350 29 3883490 7.467510e-06 0.746751
6 HUGHES 369 13 1201688 1.081812e-05 1.081812
7 MCDONNELL DOUGLAS 369 6 550689 1.089544e-05 1.089544
9 ROBINSON R44 38 2359729 1.610354e-05 1.610354

Note that returns the DataFrame resorted in ascending order from lowest to highest. That is pandas’ default way of sorting. You reverse it to show the largest values first by passing in an optional keyword argument called ascending. When it is set to False, the DataFrame is sorted in descending order.

merged_list.sort_values("per_100k_hours", ascending=False)
latimes_make_and_model accidents total_hours per_hour per_100k_hours
9 ROBINSON R44 38 2359729 1.610354e-05 1.610354
7 MCDONNELL DOUGLAS 369 6 550689 1.089544e-05 1.089544
6 HUGHES 369 13 1201688 1.081812e-05 1.081812
3 AIRBUS 350 29 3883490 7.467510e-06 0.746751
8 ROBINSON R22 20 2970806 6.732180e-06 0.673218
5 BELL 407 13 2113788 6.150096e-06 0.615010
0 AGUSTA 109 2 362172 5.522238e-06 0.552224
4 BELL 206 30 5501308 5.453249e-06 0.545325
2 AIRBUS 135 4 884596 4.521838e-06 0.452184
10 SCHWEIZER 269 5 1139326 4.388560e-06 0.438856
11 SIKORSKY 76 2 915515 2.184563e-06 0.218456
1 AIRBUS 130 1 1053786 9.489593e-07 0.094896

Congratulations. With that, you’ve recreated the heart of the analysis published in the Los Angeles Times and covered most of the basic skills necessary to access and analyze data with pandas. In our next chapter, we will show another pandas trick that’s sure to come in handy.