7. Group

The groupby method allows you to group a DataFrame by a column and then calculate a sum, or any other statistic, for each unique value. This functions much like the “pivot table” feature found in most spreadsheets programs.

Let’s use it to total up the accidents by helicopter make and model. You start by passing the field you want to group on to the function.

Hide code cell content
import pandas as pd
accident_list = pd.read_csv("https://raw.githubusercontent.com/palewire/first-python-notebook/main/docs/src/_static/ntsb-accidents.csv")
accident_list["latimes_make_and_model"] = accident_list["latimes_make_and_model"].str.upper()
accident_list.groupby("latimes_make_and_model")
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f8608512610>

That’s a nice start, but you’ll notice you don’t get much back. The data has been grouped, but we haven’t chosen what to do with it yet. If we want the total by model, we can use the size method.

accident_list.groupby("latimes_make_and_model").size()
latimes_make_and_model
AGUSTA 109                2
AIRBUS 130                1
AIRBUS 135                4
AIRBUS 350               29
BELL 206                 30
BELL 407                 13
HUGHES 369               13
MCDONNELL DOUGLAS 369     6
ROBINSON R22             20
ROBINSON R44             38
SCHWEIZER 269             5
SIKORSKY 76               2
dtype: int64

The result is much like value_counts, but we’re allowed run to all kinds of statistical operations on the group, like sum, mean and std. For instance, we could sum up the number of fatalities for each maker by stringing that field on the end followed by the statistical method.

accident_list.groupby("latimes_make_and_model")["total_fatalities"].sum()
latimes_make_and_model
AGUSTA 109                5
AIRBUS 130                1
AIRBUS 135               11
AIRBUS 350               81
BELL 206                 61
BELL 407                 35
HUGHES 369               19
MCDONNELL DOUGLAS 369     7
ROBINSON R22             27
ROBINSON R44             71
SCHWEIZER 269             7
SIKORSKY 76              11
Name: total_fatalities, dtype: int64

7.1. Reset a DataFrame

You may notice that even though the result of a groupby has two columns, pandas does not return a clean-looking table the same way other operations like head do. In most instances, you can convert ugly tables like the ones above into a pretty DataFrame by tacking the reset_index method onto the end of your code.

accident_list.groupby("latimes_make_and_model").size().reset_index()
latimes_make_and_model 0
0 AGUSTA 109 2
1 AIRBUS 130 1
2 AIRBUS 135 4
3 AIRBUS 350 29
4 BELL 206 30
5 BELL 407 13
6 HUGHES 369 13
7 MCDONNELL DOUGLAS 369 6
8 ROBINSON R22 20
9 ROBINSON R44 38
10 SCHWEIZER 269 5
11 SIKORSKY 76 2

Why doesn’t groupby return a DataFrame? Why does reset_index have such a weird name?

Like so much in computer programming, the answer is simply, “because the people who created the library said so.” It’s important to learn that all open-source programming tools are made by humans, and humans have their quirks. Over time you’ll see pandas has more than a few.

As a beginner, you should just accept the oddities and keep moving. As you get more advanced, if there’s something about the system you think could be improved, you should consider contributing to the Python code that operates the library.

You can clean up the 0 column name assigned by pandas with the rename method.

accident_list.groupby("latimes_make_and_model").size().rename("accidents").reset_index()
latimes_make_and_model accidents
0 AGUSTA 109 2
1 AIRBUS 130 1
2 AIRBUS 135 4
3 AIRBUS 350 29
4 BELL 206 30
5 BELL 407 13
6 HUGHES 369 13
7 MCDONNELL DOUGLAS 369 6
8 ROBINSON R22 20
9 ROBINSON R44 38
10 SCHWEIZER 269 5
11 SIKORSKY 76 2

Now save that as a variable.

accident_counts = accident_list.groupby("latimes_make_and_model").size().rename("accidents").reset_index()

That will return a DataFrame with the accident totals we need to calculate a rate. Inspect it with head.

accident_counts.head()
latimes_make_and_model accidents
0 AGUSTA 109 2
1 AIRBUS 130 1
2 AIRBUS 135 4
3 AIRBUS 350 29
4 BELL 206 30

Now we‘ve got a ranking we can work with.