7. Group¶
The groupby
method allows you to group a DataFrame by a column and then calculate a sum, or any other statistic, for each unique value. This functions much like the “pivot table” feature found in most spreadsheets programs.
Let’s use it to total up the accidents by helicopter make and model. You start by passing the field you want to group on to the function.
Show code cell content
import pandas as pd
accident_list = pd.read_csv("https://raw.githubusercontent.com/palewire/first-python-notebook/main/docs/src/_static/ntsb-accidents.csv")
accident_list["latimes_make_and_model"] = accident_list["latimes_make_and_model"].str.upper()
accident_list.groupby("latimes_make_and_model")
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f8608512610>
That’s a nice start, but you’ll notice you don’t get much back. The data has been grouped, but we haven’t chosen what to do with it yet. If we want the total by model, we can use the size
method.
accident_list.groupby("latimes_make_and_model").size()
latimes_make_and_model
AGUSTA 109 2
AIRBUS 130 1
AIRBUS 135 4
AIRBUS 350 29
BELL 206 30
BELL 407 13
HUGHES 369 13
MCDONNELL DOUGLAS 369 6
ROBINSON R22 20
ROBINSON R44 38
SCHWEIZER 269 5
SIKORSKY 76 2
dtype: int64
The result is much like value_counts
, but we’re allowed run to all kinds of statistical operations on the group, like sum
, mean
and std
. For instance, we could sum up the number of fatalities for each maker by stringing that field on the end followed by the statistical method.
accident_list.groupby("latimes_make_and_model")["total_fatalities"].sum()
latimes_make_and_model
AGUSTA 109 5
AIRBUS 130 1
AIRBUS 135 11
AIRBUS 350 81
BELL 206 61
BELL 407 35
HUGHES 369 19
MCDONNELL DOUGLAS 369 7
ROBINSON R22 27
ROBINSON R44 71
SCHWEIZER 269 7
SIKORSKY 76 11
Name: total_fatalities, dtype: int64
7.1. Reset a DataFrame¶
You may notice that even though the result of a groupby
has two columns, pandas does not return a clean-looking table the same way other operations like head
do. In most instances, you can convert ugly tables like the ones above into a pretty DataFrame by tacking the reset_index
method onto the end of your code.
accident_list.groupby("latimes_make_and_model").size().reset_index()
latimes_make_and_model | 0 | |
---|---|---|
0 | AGUSTA 109 | 2 |
1 | AIRBUS 130 | 1 |
2 | AIRBUS 135 | 4 |
3 | AIRBUS 350 | 29 |
4 | BELL 206 | 30 |
5 | BELL 407 | 13 |
6 | HUGHES 369 | 13 |
7 | MCDONNELL DOUGLAS 369 | 6 |
8 | ROBINSON R22 | 20 |
9 | ROBINSON R44 | 38 |
10 | SCHWEIZER 269 | 5 |
11 | SIKORSKY 76 | 2 |
Why doesn’t groupby
return a DataFrame? Why does reset_index
have such a weird name?
Like so much in computer programming, the answer is simply, “because the people who created the library said so.” It’s important to learn that all open-source programming tools are made by humans, and humans have their quirks. Over time you’ll see pandas has more than a few.
As a beginner, you should just accept the oddities and keep moving. As you get more advanced, if there’s something about the system you think could be improved, you should consider contributing to the Python code that operates the library.
You can clean up the 0
column name assigned by pandas with the rename
method.
accident_list.groupby("latimes_make_and_model").size().rename("accidents").reset_index()
latimes_make_and_model | accidents | |
---|---|---|
0 | AGUSTA 109 | 2 |
1 | AIRBUS 130 | 1 |
2 | AIRBUS 135 | 4 |
3 | AIRBUS 350 | 29 |
4 | BELL 206 | 30 |
5 | BELL 407 | 13 |
6 | HUGHES 369 | 13 |
7 | MCDONNELL DOUGLAS 369 | 6 |
8 | ROBINSON R22 | 20 |
9 | ROBINSON R44 | 38 |
10 | SCHWEIZER 269 | 5 |
11 | SIKORSKY 76 | 2 |
Now save that as a variable.
accident_counts = accident_list.groupby("latimes_make_and_model").size().rename("accidents").reset_index()
That will return a DataFrame with the accident totals we need to calculate a rate. Inspect it with head
.
accident_counts.head()
latimes_make_and_model | accidents | |
---|---|---|
0 | AGUSTA 109 | 2 |
1 | AIRBUS 130 | 1 |
2 | AIRBUS 135 | 4 |
3 | AIRBUS 350 | 29 |
4 | BELL 206 | 30 |
Now we‘ve got a ranking we can work with.