7. Group¶
The groupby
method allows you to group a DataFrame by a column and then calculate a sum, or any other statistic, for each unique value. This functions much like the “pivot table” feature found in most spreadsheets.
Let’s use it to total up the accidents by make and model. You start by passing the field you want to group on to the function.
Show code cell content
import pandas as pd
accident_list = pd.read_csv("https://raw.githubusercontent.com/palewire/first-python-notebook/main/docs/src/_static/ntsb-accidents.csv")
accident_list.groupby("latimes_make_and_model")
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f296c6d85e0>
A nice start but you’ll notice you don’t get much back. The data’s been grouped, but we haven’t chosen what to do with it yet. If we wanted the total by model, we would use the size
method.
accident_list.groupby("latimes_make_and_model").size()
latimes_make_and_model
AGUSTA 109 2
AIRBUS 130 1
AIRBUS 135 4
AIRBUS 350 29
BELL 206 30
BELL 407 13
HUGHES 369 13
MCDONNELL DOUGLAS 369 6
ROBINSON R22 20
ROBINSON R44 38
SCHWEIZER 269 5
SIKORSKY 76 2
dtype: int64
The result is much like value_counts
, but we’re allowed run to all kinds of statistical operations on the group, like sum
, mean
and std
. For instance, we could sum the total number of fatalities for each maker by stringing that field on the end followed by the statistical method.
accident_list.groupby("latimes_make_and_model")['total_fatalities'].sum()
latimes_make_and_model
AGUSTA 109 5
AIRBUS 130 1
AIRBUS 135 11
AIRBUS 350 81
BELL 206 61
BELL 407 35
HUGHES 369 19
MCDONNELL DOUGLAS 369 7
ROBINSON R22 27
ROBINSON R44 71
SCHWEIZER 269 7
SIKORSKY 76 11
Name: total_fatalities, dtype: int64
Again our data has come back as an ugly Series. To reformat it as a pretty DataFrame use the reset_index
method again.
accident_list.groupby("latimes_make_and_model").size().reset_index()
latimes_make_and_model | 0 | |
---|---|---|
0 | AGUSTA 109 | 2 |
1 | AIRBUS 130 | 1 |
2 | AIRBUS 135 | 4 |
3 | AIRBUS 350 | 29 |
4 | BELL 206 | 30 |
5 | BELL 407 | 13 |
6 | HUGHES 369 | 13 |
7 | MCDONNELL DOUGLAS 369 | 6 |
8 | ROBINSON R22 | 20 |
9 | ROBINSON R44 | 38 |
10 | SCHWEIZER 269 | 5 |
11 | SIKORSKY 76 | 2 |
You can clean up the 0
column name assigned by pandas with the rename
method.
accident_list.groupby("latimes_make_and_model").size().rename("accidents").reset_index()
latimes_make_and_model | accidents | |
---|---|---|
0 | AGUSTA 109 | 2 |
1 | AIRBUS 130 | 1 |
2 | AIRBUS 135 | 4 |
3 | AIRBUS 350 | 29 |
4 | BELL 206 | 30 |
5 | BELL 407 | 13 |
6 | HUGHES 369 | 13 |
7 | MCDONNELL DOUGLAS 369 | 6 |
8 | ROBINSON R22 | 20 |
9 | ROBINSON R44 | 38 |
10 | SCHWEIZER 269 | 5 |
11 | SIKORSKY 76 | 2 |
Now save that as a variable.
accident_counts = accident_list.groupby("latimes_make_and_model").size().rename("accidents").reset_index()
The result is a DataFrame with the accident totals we’ll want to merge with the FAA survey data to calculate rates.
Note
You may notice that we’ve configured rename
differently than other methods so far. These are what Python calls keyword arguments. They are inputs that are passed to a function or method by explicitly specifying the name of the argument, followed by an equal sign and the value being passed. Here we used columns
and inplace
.
Keyword arguments are different from positional arguments, which are passed to a function in the order that they are defined. Keyword arguments can be useful because they make it clear which argument is being passed and also they can be passed in any order, as long as the name of the argument is specified.
Also, keyword arguments are often used to specify default values for a function’s parameters, this way if an argument is not passed, the default value will be used. For this reason, they are often used in pandas to override the default behavior and provide customizations beyond the out-of-the-box behavior of a method. Visiting the pandas documentation for any method, here’s the page for rename
, will reveal what options are available.
To see the result, inspect the DataFrame with head
.
accident_counts.head()
latimes_make_and_model | accidents | |
---|---|---|
0 | AGUSTA 109 | 2 |
1 | AIRBUS 130 | 1 |
2 | AIRBUS 135 | 4 |
3 | AIRBUS 350 | 29 |
4 | BELL 206 | 30 |
Now we‘ve got a ranking we can work with.