7. Group

The groupby method allows you to group a DataFrame by a column and then calculate a sum, or any other statistic, for each unique value. This functions much like the “pivot table” feature found in most spreadsheets.

Let’s use it to total up the accidents by make and model. You start by passing the field you want to group on to the function.

Hide code cell content
import pandas as pd
accident_list = pd.read_csv("https://raw.githubusercontent.com/palewire/first-python-notebook/main/docs/src/_static/ntsb-accidents.csv")
accident_list.groupby("latimes_make_and_model")
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f958b12a490>

A nice start but you’ll notice you don’t get much back. The data’s been grouped, but we haven’t chosen what to do with it yet. If we wanted the total by model, we would use the size method.

accident_list.groupby("latimes_make_and_model").size()
latimes_make_and_model
AGUSTA 109                2
AIRBUS 130                1
AIRBUS 135                4
AIRBUS 350               29
BELL 206                 30
BELL 407                 13
HUGHES 369               13
MCDONNELL DOUGLAS 369     6
ROBINSON R22             20
ROBINSON R44             38
SCHWEIZER 269             5
SIKORSKY 76               2
dtype: int64

The result is much like value_counts, but we’re allowed run to all kinds of statistical operations on the group, like sum, mean and std. For instance, we could sum the total number of fatalities for each maker by stringing that field on the end followed by the statistical method.

accident_list.groupby("latimes_make_and_model").total_fatalities.sum()
latimes_make_and_model
AGUSTA 109                5
AIRBUS 130                1
AIRBUS 135               11
AIRBUS 350               81
BELL 206                 61
BELL 407                 35
HUGHES 369               19
MCDONNELL DOUGLAS 369     7
ROBINSON R22             27
ROBINSON R44             71
SCHWEIZER 269             7
SIKORSKY 76              11
Name: total_fatalities, dtype: int64

Again our data has come back as an ugly Series. To reformat it as a pretty DataFrame use the reset_index method again.

accident_list.groupby("latimes_make_and_model").size().reset_index()
latimes_make_and_model 0
0 AGUSTA 109 2
1 AIRBUS 130 1
2 AIRBUS 135 4
3 AIRBUS 350 29
4 BELL 206 30
5 BELL 407 13
6 HUGHES 369 13
7 MCDONNELL DOUGLAS 369 6
8 ROBINSON R22 20
9 ROBINSON R44 38
10 SCHWEIZER 269 5
11 SIKORSKY 76 2

Now save that as a variable.

accident_counts = accident_list.groupby("latimes_make_and_model").size().reset_index()

You can clean up the 0 column name assigned by pandas with the rename method. The inplace option, found on many pandas methods, will save the change to your variable automatically. When you execute a cell, the code or text inside it will be processed and the output will be displayed below the cell. If the cell contains code, it will be executed by the kernel (the Jupyter notebook’s underlying programming language, such as Python) and any output generated by the code will be displayed.

accident_counts.rename(columns={0: "accidents"}, inplace=True)

The result is a DataFrame with the accident totals we’ll want to merge with the FAA survey data to calculate rates.

Note

You may notice that we’ve configured rename differently than other methods so far. These are what Python calls keyword arguments. They are inputs that are passed to a function or method by explicitly specifying the name of the argument, followed by an equal sign and the value being passed. Here we used columns and inplace.

Keyword arguments are different from positional arguments, which are passed to a function in the order that they are defined. Keyword arguments can be useful because they make it clear which argument is being passed and also they can be passed in any order, as long as the name of the argument is specified.

Also, keyword arguments are often used to specify default values for a function’s parameters, this way if an argument is not passed, the default value will be used. For this reason, they are often used in pandas to override the default behavior and provide customizations beyond the out-of-the-box behavior of a method. Visiting the pandas documentation for any method, here’s the page for rename, will reveal what options are available.

To see the result, inspect the DataFrame with head.

accident_counts.head()
latimes_make_and_model accidents
0 AGUSTA 109 2
1 AIRBUS 130 1
2 AIRBUS 135 4
3 AIRBUS 350 29
4 BELL 206 30

Now we‘ve got a ranking we can work with.