10. Sort¶
Another simple but common technique for analyzing data is sorting. This can be useful for ranking the DataFrame to show the first and last members of the table according to a particular column.
Show code cell content
import pandas as pd
accident_list = pd.read_csv("https://raw.githubusercontent.com/palewire/first-python-notebook/main/docs/src/_static/ntsb-accidents.csv")
accident_list["latimes_make_and_model"] = accident_list["latimes_make_and_model"].str.upper()
accident_counts = accident_list.groupby("latimes_make_and_model").size().reset_index().rename(columns={0: "accidents"})
survey = pd.read_csv("https://raw.githubusercontent.com/palewire/first-python-notebook/main/docs/src/_static/faa-survey.csv")
survey["latimes_make_and_model"] = survey["latimes_make_and_model"].str.upper()
merged_list = pd.merge(accident_counts, survey, on="latimes_make_and_model")
merged_list["per_hour"] = merged_list.accidents / merged_list.total_hours
merged_list["per_100k_hours"] = (merged_list.accidents / merged_list.total_hours) * 100_000
The sort_values
method is how pandas does it. It expects you to provide it with the name of the column to sort by in quotes. Try sorting by our computed field.
merged_list.sort_values("per_100k_hours")
latimes_make_and_model | accidents | total_hours | per_hour | per_100k_hours | |
---|---|---|---|---|---|
1 | AIRBUS 130 | 1 | 1053786 | 9.489593e-07 | 0.094896 |
11 | SIKORSKY 76 | 2 | 915515 | 2.184563e-06 | 0.218456 |
10 | SCHWEIZER 269 | 5 | 1139326 | 4.388560e-06 | 0.438856 |
2 | AIRBUS 135 | 4 | 884596 | 4.521838e-06 | 0.452184 |
4 | BELL 206 | 30 | 5501308 | 5.453249e-06 | 0.545325 |
0 | AGUSTA 109 | 2 | 362172 | 5.522238e-06 | 0.552224 |
5 | BELL 407 | 13 | 2113788 | 6.150096e-06 | 0.615010 |
8 | ROBINSON R22 | 20 | 2970806 | 6.732180e-06 | 0.673218 |
3 | AIRBUS 350 | 29 | 3883490 | 7.467510e-06 | 0.746751 |
6 | HUGHES 369 | 13 | 1201688 | 1.081812e-05 | 1.081812 |
7 | MCDONNELL DOUGLAS 369 | 6 | 550689 | 1.089544e-05 | 1.089544 |
9 | ROBINSON R44 | 38 | 2359729 | 1.610354e-05 | 1.610354 |
Note that by default sort_values
returns the DataFrame sorted in ascending order from lowest to highest. You can show the largest values first by passing in an optional keyword argument called ascending
. When it is set to False
, the DataFrame is sorted in descending order.
merged_list.sort_values("per_100k_hours", ascending=False)
latimes_make_and_model | accidents | total_hours | per_hour | per_100k_hours | |
---|---|---|---|---|---|
9 | ROBINSON R44 | 38 | 2359729 | 1.610354e-05 | 1.610354 |
7 | MCDONNELL DOUGLAS 369 | 6 | 550689 | 1.089544e-05 | 1.089544 |
6 | HUGHES 369 | 13 | 1201688 | 1.081812e-05 | 1.081812 |
3 | AIRBUS 350 | 29 | 3883490 | 7.467510e-06 | 0.746751 |
8 | ROBINSON R22 | 20 | 2970806 | 6.732180e-06 | 0.673218 |
5 | BELL 407 | 13 | 2113788 | 6.150096e-06 | 0.615010 |
0 | AGUSTA 109 | 2 | 362172 | 5.522238e-06 | 0.552224 |
4 | BELL 206 | 30 | 5501308 | 5.453249e-06 | 0.545325 |
2 | AIRBUS 135 | 4 | 884596 | 4.521838e-06 | 0.452184 |
10 | SCHWEIZER 269 | 5 | 1139326 | 4.388560e-06 | 0.438856 |
11 | SIKORSKY 76 | 2 | 915515 | 2.184563e-06 | 0.218456 |
1 | AIRBUS 130 | 1 | 1053786 | 9.489593e-07 | 0.094896 |
Congratulations. With that, you’ve re-created the heart of the analysis published in the Los Angeles Times and covered most of the basic skills necessary to access and analyze data with pandas.
Before we move on, here’s another quiz for you. You can answer all of these questions using only tricks we’ve learned thus far.
What’s the date of the most recent fatal helicopter accident in Texas?
How many fatalities occurred in Texas accidents?
What helicopter model logged the most flight hours?
Where did the accident with the NTSB number
ERA13LA057
occur?