11. Concatenate¶
Show code cell content
import pandas as pd
accident_list = pd.read_csv("https://raw.githubusercontent.com/palewire/first-python-notebook/main/docs/src/_static/ntsb-accidents.csv")
accident_list["latimes_make_and_model"] = accident_list["latimes_make_and_model"].str.upper()
accident_counts = accident_list.groupby("latimes_make_and_model").size().reset_index().rename(columns={0: "accidents"})
survey = pd.read_csv("https://raw.githubusercontent.com/palewire/first-python-notebook/main/docs/src/_static/faa-survey.csv")
survey["latimes_make_and_model"] = survey["latimes_make_and_model"].str.upper()
merged_list = pd.merge(accident_counts, survey, on="latimes_make_and_model")
Let’s briefly pretend we’ve already completed our analysis, visualized and published our findings. A year later, there’s more data. We want to follow up.
Let’s crack open the old notebook and see what we can do.
new_accident_list = pd.read_csv("https://raw.githubusercontent.com/palewire/first-python-notebook/main/docs/src/_static/ntsb-accidents-update.csv")
Now that we’ve imported it, let’s peek at the new data using our familiar head()
method.
new_accident_list.head()
ntsb-model | ntsb-number | year | date | city | state | country | total-fatalities | latimes_make | latimes_model | latimes_make_and_model | event_id | ntsb_make | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 777 | NYC07FA048 | 2024 | 12/14/2024 0:00 | DAGSBORO | DE | USA | 20 | HELICOP | 777 | HELICOP 777 | 20241222X01838 | HELICOP |
1 | 777 | LAX06LA257 | 2024 | 8/10/2024 0:00 | TUCSON | AZ | USA | 11 | HELICOP | 777 | HELICOP 777 | 20240817X01187 | HELICOP |
2 | 777 | MIA06FA039 | 2024 | 1/1/2024 0:00 | GRAND RIDGE | FL | USA | 202 | HELICOP | 777 | HELICOP 777 | 20240111X00044 | HELICOP |
Okay, so we have three new accidents that we want to add in. But, hang on. Can you spot the differences in the data?
Our columns are in a slightly different order. This happens a lot from year to year in state datasets: a new data person starts, and they decide to reformat everything. But we’ve set everything up using the previous years’ formatting style.
Luckily, Pandas has a built-in method for dealing with problems like this: concat
.
This method concatenates multiple DataFrames into a single DataFrame by combining their rows and columns.
updated_accident_list = pd.concat([accident_list, new_accident_list])
How do we check what this actually did? Let’s take a quick look at our columns.
count()
gives us a brief overview of how many non-null values we have in each column. If we did this right and our data is complete, we should have the same number of values in every column.
updated_accident_list.count()
event_id 166
ntsb_make 166
ntsb_model 163
ntsb_number 163
year 166
date 166
city 166
state 165
country 166
total_fatalities 163
latimes_make 166
latimes_model 166
latimes_make_and_model 166
ntsb-model 3
ntsb-number 3
total-fatalities 3
dtype: int64
Hang on. That doesn’t look right. Three values in ntsb-model
, ntsb-number
, and total-fatalities
? It looks like we have the same total in other columns. But those three values should be combined with ntsb_model
and so on, instead of on their own.
We need to change the column names from the 2024 data so they match previous years’.
Luckily, there’s a way to do that.
Write out any column names you want to be changed in this format:
bad_columns = {"ntsb-model" : "ntsb_model",
"ntsb-number" : "ntsb_number",
"total-fatalities" : "total_fatalities"}
This is called a dictionary
. It’s a very useful data type built into Python. It basically functions like a list, except each list item has a key
that maps to a value
. In our case, we’ll be using the key
to look for the current column name and the value
to assign a new column name.
That means the old names go to the left of the colon, and the new names go to the right.
Note
Renaming columns in a dictionary can get really handy for larger datasets. You can even create a default, reusable dictionary that renames standard or common strings to your newsroom’s house style, or that changes all instances of “number” to “#”. This dataset is small, but when you start analyzing datasets with thousands of columns, this can save you a lot of pain.
Next, let’s grab the DataFrame we originally imported the 2024 data into (new_accident_list
) and apply our column-name cleaner to it before we concatenate it with the other data.
(If you rename the columns after concatenating, you’ll just end up with two columns with the same name and different variables.)
new_accident_list = new_accident_list.rename(columns=bad_columns)
Okay, let’s see how that worked.
new_accident_list.count()
ntsb_model 3
ntsb_number 3
year 3
date 3
city 3
state 3
country 3
total_fatalities 3
latimes_make 3
latimes_model 3
latimes_make_and_model 3
event_id 3
ntsb_make 3
dtype: int64
This looks more like what we want to put together.
Let’s try the concat
step again.
updated_accident_list = pd.concat([accident_list, new_accident_list])
And check our work with count
.
updated_accident_list.count()
event_id 166
ntsb_make 166
ntsb_model 166
ntsb_number 166
year 166
date 166
city 166
state 165
country 166
total_fatalities 166
latimes_make 166
latimes_model 166
latimes_make_and_model 166
dtype: int64
We now have 166 complete values and no split columns. This also reorders your columns for you automatically, because concat
matches on the column name. It defaults to the column order in the DataFrame on the left; if you wanted to keep the new_accident_list
order, you’d have to swap the order in the concat
command.
And now we can re-run our whole analysis with a fresh year of data added in — without changing anything except the name of the DataFrame it’s operating on.
We could also overwrite our original DataFrame with the new data so we don’t even have to change the DataFrame name to re-run the analysis. But that can be risky if you don’t keep track of your order of operations and leave good notes for yourself.