10. Sort¶
Another simple but common technique for analyzing data is sorting. This can be useful for ranking the DataFrame to show the first and last members of the table according to a particular column.
Show code cell content
import pandas as pd
accident_list = pd.read_csv("https://raw.githubusercontent.com/palewire/first-python-notebook/main/docs/src/_static/ntsb-accidents.csv")
accident_counts = accident_list.groupby("latimes_make_and_model").size().reset_index().rename(columns={0: "accidents"})
survey = pd.read_csv("https://raw.githubusercontent.com/palewire/first-python-notebook/main/docs/src/_static/faa-survey.csv")
merged_list = pd.merge(accident_counts, survey, on="latimes_make_and_model")
merged_list['per_hour'] = merged_list.accidents / merged_list.total_hours
merged_list['per_100k_hours'] = (merged_list.accidents / merged_list.total_hours) * 100_000
The sort_values
method is how pandas does it. It expects that you provide it with the name of the column to sort by in quotes. Try sorting by our computed field.
merged_list.sort_values("per_100k_hours")
latimes_make_and_model | accidents | total_hours | per_hour | per_100k_hours | |
---|---|---|---|---|---|
1 | AIRBUS 130 | 1 | 1053786 | 9.489593e-07 | 0.094896 |
11 | SIKORSKY 76 | 2 | 915515 | 2.184563e-06 | 0.218456 |
10 | SCHWEIZER 269 | 5 | 1139326 | 4.388560e-06 | 0.438856 |
2 | AIRBUS 135 | 4 | 884596 | 4.521838e-06 | 0.452184 |
4 | BELL 206 | 30 | 5501308 | 5.453249e-06 | 0.545325 |
0 | AGUSTA 109 | 2 | 362172 | 5.522238e-06 | 0.552224 |
5 | BELL 407 | 13 | 2113788 | 6.150096e-06 | 0.615010 |
8 | ROBINSON R22 | 20 | 2970806 | 6.732180e-06 | 0.673218 |
3 | AIRBUS 350 | 29 | 3883490 | 7.467510e-06 | 0.746751 |
6 | HUGHES 369 | 13 | 1201688 | 1.081812e-05 | 1.081812 |
7 | MCDONNELL DOUGLAS 369 | 6 | 550689 | 1.089544e-05 | 1.089544 |
9 | ROBINSON R44 | 38 | 2359729 | 1.610354e-05 | 1.610354 |
Note that returns the DataFrame resorted in ascending order from lowest to highest. That is pandas’ default way of sorting. You reverse it to show the largest values first by passing in an optional keyword argument called ascending
. When it is set to False
, the DataFrame is sorted in descending order.
merged_list.sort_values("per_100k_hours", ascending=False)
latimes_make_and_model | accidents | total_hours | per_hour | per_100k_hours | |
---|---|---|---|---|---|
9 | ROBINSON R44 | 38 | 2359729 | 1.610354e-05 | 1.610354 |
7 | MCDONNELL DOUGLAS 369 | 6 | 550689 | 1.089544e-05 | 1.089544 |
6 | HUGHES 369 | 13 | 1201688 | 1.081812e-05 | 1.081812 |
3 | AIRBUS 350 | 29 | 3883490 | 7.467510e-06 | 0.746751 |
8 | ROBINSON R22 | 20 | 2970806 | 6.732180e-06 | 0.673218 |
5 | BELL 407 | 13 | 2113788 | 6.150096e-06 | 0.615010 |
0 | AGUSTA 109 | 2 | 362172 | 5.522238e-06 | 0.552224 |
4 | BELL 206 | 30 | 5501308 | 5.453249e-06 | 0.545325 |
2 | AIRBUS 135 | 4 | 884596 | 4.521838e-06 | 0.452184 |
10 | SCHWEIZER 269 | 5 | 1139326 | 4.388560e-06 | 0.438856 |
11 | SIKORSKY 76 | 2 | 915515 | 2.184563e-06 | 0.218456 |
1 | AIRBUS 130 | 1 | 1053786 | 9.489593e-07 | 0.094896 |
Congratulations. With that, you’ve recreated the heart of the analysis published in the Los Angeles Times and covered most of the basic skills necessary to access and analyze data with pandas. In our next chapter, we will show how you can chart the data with Altair.