5. Columns¶
We’ll begin with the latimes_make_and_model
column, which records the standardized name of each helicopter that crashed. To access its contents separate from the rest of the DataFrame, append a pair of flat brackets with the column’s name in quotes inside.
Show code cell content
import pandas as pd
accident_list = pd.read_csv("https://raw.githubusercontent.com/palewire/first-python-notebook/main/docs/src/_static/ntsb-accidents.csv")
accident_list['latimes_make_and_model']
0 BELL 407
1 ROBINSON R22
2 ROBINSON R44
3 ROBINSON R44
4 ROBINSON R44
...
158 BELL 407
159 SCHWEIZER 269
160 BELL 206
161 AIRBUS 350
162 ROBINSON R44
Name: latimes_make_and_model, Length: 163, dtype: object
That will list the column out as a Series
, just like the ones we created from scratch earlier. Just as we did then, you can now start tacking on additional methods that will analyze the contents of the column.
Note
You can also access columns a second way, like this: accident_list.latimes_make_and_model
. This method is quicker to type, but it won’t work if your column has a space in its name. So we’re teaching the universal bracket method instead.
5.1. Count a column’s values¶
In this case, the column is filled with characters. So we don’t want to calculate statistics like the median and average, as we did before.
There’s another built-in pandas tool that will total up the frequency of values in a column. The method is called value_counts
and it’s just as easy to use as sum
, min
or max
. All you need to do it is add a period after the column name and chain it on the tail end of your cell.
accident_list['latimes_make_and_model'].value_counts()
ROBINSON R44 38
BELL 206 30
AIRBUS 350 29
ROBINSON R22 20
BELL 407 13
HUGHES 369 13
MCDONNELL DOUGLAS 369 6
SCHWEIZER 269 5
AIRBUS 135 4
SIKORSKY 76 2
AGUSTA 109 2
AIRBUS 130 1
Name: latimes_make_and_model, dtype: int64
Congratulations, you’ve made your first finding. With that little line of code, you’ve calculated an important fact: During the period being studied, the Robinson R44 had more fatal accidents than any other helicopter.
5.2. Reset a DataFrame¶
You may notice that even though the result has two columns, pandas did not return a clean-looking table in the same way as head
did for our DataFrame. That’s because our column, a Series, acts a little bit different than the DataFrame created by read_csv
. In most instances, you can convert ugly Series into a pretty DataFrame by tacking on the reset_index
method on the end.
accident_list['latimes_make_and_model'].value_counts().reset_index()
index | latimes_make_and_model | |
---|---|---|
0 | ROBINSON R44 | 38 |
1 | BELL 206 | 30 |
2 | AIRBUS 350 | 29 |
3 | ROBINSON R22 | 20 |
4 | BELL 407 | 13 |
5 | HUGHES 369 | 13 |
6 | MCDONNELL DOUGLAS 369 | 6 |
7 | SCHWEIZER 269 | 5 |
8 | AIRBUS 135 | 4 |
9 | SIKORSKY 76 | 2 |
10 | AGUSTA 109 | 2 |
11 | AIRBUS 130 | 1 |
Why does a Series behave differently than a DataFrame? Why does reset_index
have such a weird name?
Like so much in computer programming, the answer is simply, “because the people who created the library said so.” It’s important to learn that all open-source programming tools are made by humans, and humans have their quirks. Over time you’ll see pandas has more than a few.
As a beginner, you should just accept the oddities and keep moving. As you get more advanced, if there’s something about the system you think could be improved you should consider contributing to the Python code that operates the library.
Before we move on to the next chapter, here’s a challenge. See if you can answer a few more questions a journalist might ask about our dataset. All four of the questions below can be answered using only tricks we’ve covered thus far. See if you can do it.
What was the total number of fatalities?
Which helicopter maker had the most accidents?
What was the total number of helicopter accidents by year?
Which state had the most helicopter accidents?
Once you’ve written code answered those, you’re ready to move on to the next chapter.