5. Columns¶
We’ll begin with the latimes_make_and_model
column, which records the standardized name of each helicopter that crashed. To access its contents separate from the rest of the DataFrame, append a pair of square brackets with the column’s name in quotes inside.
Show code cell content
import pandas as pd
accident_list = pd.read_csv("https://raw.githubusercontent.com/palewire/first-python-notebook/main/docs/src/_static/ntsb-accidents.csv")
accident_list["latimes_make_and_model"]
0 BELL 407
1 ROBINSON R22
2 ROBINSON R44
3 ROBINSON R44
4 ROBINSON R44
...
158 BELL 407
159 SCHWEIZER 269
160 BELL 206
161 AIRBUS 350
162 ROBINSON R44
Name: latimes_make_and_model, Length: 163, dtype: object
That will list the column out as a Series
, just like the ones we created from scratch earlier. Just as we did then, you can now start tacking on additional methods that will analyze the contents of the column.
Note
You can also access columns a second way, like this: accident_list.latimes_make_and_model
. This method is quicker to type, but it won’t work if your column has a space in its name. So we’re teaching the universal bracket method instead.
5.1. Count a column’s values¶
In this case, the column is filled with characters. So we don’t want to calculate statistics like the median and average, as we did before.
There’s another built-in pandas tool that will total up the frequency of values in a column. The method is called value_counts
and it’s just as easy to use as sum
, min
or max
. All you need to do it is add a period after the column name and chain it on the tail end of your cell.
accident_list["latimes_make_and_model"].value_counts()
latimes_make_and_model
ROBINSON R44 38
AIRBUS 350 29
BELL 206 28
ROBINSON R22 20
BELL 407 13
HUGHES 369 13
MCDONNELL DOUGLAS 369 6
SCHWEIZER 269 5
AIRBUS 135 4
bell 206 2
SIKORSKY 76 2
AGUSTA 109 2
AIRBUS 130 1
Name: count, dtype: int64
Congratulations, you’ve made your first finding. With that little line of code, you’ve calculated an important fact: During the period being studied, the Robinson R44 had more fatal accidents than any other helicopter.
But wait. Before we congratulate ourselves, let’s take a closer look at the data. Our value counts operation has turned up an imperfection that was buried in the data. Can you see it?
5.2. Cleaning data columns¶
On closer inspection, we can see that Bell 206 helicopter is listed two different ways, as BELL 206
and bell 206
. The variation in capitalization is causing pandas to treat them as two distinct values.
This is a common problem and a simple example of how “dirty” data can trip up a computer program. The solution is to clean up the column prior to analysis.
In this case, we can use the str
method, which is short for string. In many computer programming languages, string is the technical term used to refer to text. Thus, the pandas str
method is designed to manipulate a column of text. It can change the casing of text, find and replace different patterns and conduct many other useful operations.
You can access it by chaining .str
and your desired manipulation method after the column name. In this case, we want to use the upper
method, which will convert all of the text in the column to uppercase.
accident_list["latimes_make_and_model"].str.upper()
0 BELL 407
1 ROBINSON R22
2 ROBINSON R44
3 ROBINSON R44
4 ROBINSON R44
...
158 BELL 407
159 SCHWEIZER 269
160 BELL 206
161 AIRBUS 350
162 ROBINSON R44
Name: latimes_make_and_model, Length: 163, dtype: object
While it’s not useful in this case, we can try out the companion lower
method to see it do the opposite.
accident_list["latimes_make_and_model"].str.lower()
0 bell 407
1 robinson r22
2 robinson r44
3 robinson r44
4 robinson r44
...
158 bell 407
159 schweizer 269
160 bell 206
161 airbus 350
162 robinson r44
Name: latimes_make_and_model, Length: 163, dtype: object
Note
You can find a full list of str
methods, along with useful examples, in the pandas documentation.
To correct the bug, we need to assign the result of the upper
method to our existing column and overwrite what’s there. We can do that with the =
operator.
accident_list["latimes_make_and_model"] = accident_list["latimes_make_and_model"].str.upper()
Now we can run value_counts
again to see if the problem has been fixed.
accident_list["latimes_make_and_model"].value_counts()
latimes_make_and_model
ROBINSON R44 38
BELL 206 30
AIRBUS 350 29
ROBINSON R22 20
BELL 407 13
HUGHES 369 13
MCDONNELL DOUGLAS 369 6
SCHWEIZER 269 5
AIRBUS 135 4
SIKORSKY 76 2
AGUSTA 109 2
AIRBUS 130 1
Name: count, dtype: int64
Much better! We have a clean list of helicopter models and their frequencies.
In the real world, you will almost always need to clean your data before you can analyze it, though the challenges will typically be more complex than this one. Pandas offers a wide range of tools to help you clean your data, but the basic process is always the same: Identify the problem, fix it, and then check your work. The value_counts
method is one of the most useful tools in this process.
Before we move on to the next chapter, here’s a challenge. See if you can answer a few more questions a journalist might ask about our dataset. All of the questions below can be answered using only tricks we’ve covered thus far.
What was the total number of fatalities?
Which helicopter maker had the most fatal accidents?
Which year had the most fatal helicopter accidents?
How many fatal helicopter accidents occurred in Texas?
How many different helicopter makers are in the dataset?
Once you’ve written code that generates the answers, you’re ready to move on to the next chapter.