11. Concatenate

import pandas as pd
accident_list = pd.read_csv("https://raw.githubusercontent.com/palewire/first-python-notebook/main/docs/src/_static/ntsb-accidents.csv")
accident_list["latimes_make_and_model"] = accident_list["latimes_make_and_model"].str.upper()
accident_counts = accident_list.groupby("latimes_make_and_model").size().reset_index().rename(columns={0: "accidents"})
survey = pd.read_csv("https://raw.githubusercontent.com/palewire/first-python-notebook/main/docs/src/_static/faa-survey.csv")
survey["latimes_make_and_model"] = survey["latimes_make_and_model"].str.upper()
merged_list = pd.merge(accident_counts, survey, on="latimes_make_and_model")

Let’s briefly pretend we’ve already completed our analysis, visualized and published our findings. A year later, there’s more data. We want to follow up.

Let’s crack open the old notebook and see what we can do.

new_accident_list = pd.read_csv("https://raw.githubusercontent.com/palewire/first-python-notebook/main/docs/src/_static/ntsb-accidents-update.csv")

Now that we’ve imported it, let’s peek at the new data using our familiar head() method.

new_accident_list.head()
|   | ntsb-model | ntsb-number | year | date | city | state | country | total-fatalities | latimes_make | latimes_model | latimes_make_and_model | event_id | ntsb_make |
|---|------------|-------------|------|------|------|-------|---------|------------------|--------------|---------------|------------------------|----------|-----------|
| 0 | 777 | NYC07FA048 | 2024 | 12/14/2024 0:00 | DAGSBORO | DE | USA | 20 | HELICOP | 777 | HELICOP 777 | 20241222X01838 | HELICOP |
| 1 | 777 | LAX06LA257 | 2024 | 8/10/2024 0:00 | TUCSON | AZ | USA | 11 | HELICOP | 777 | HELICOP 777 | 20240817X01187 | HELICOP |
| 2 | 777 | MIA06FA039 | 2024 | 1/1/2024 0:00 | GRAND RIDGE | FL | USA | 202 | HELICOP | 777 | HELICOP 777 | 20240111X00044 | HELICOP |

Okay, so we have three new accidents that we want to add in. But, hang on. Can you spot the differences in the data?

Our columns are in a slightly different order. This happens a lot from year to year in state datasets: a new data person starts, and they decide to reformat everything. But we’ve set everything up using the previous years’ formatting style.

Luckily, pandas has a built-in function for dealing with problems like this: concat.

This function concatenates multiple DataFrames into a single DataFrame, stacking their rows and matching up their columns by name.

updated_accident_list = pd.concat([accident_list, new_accident_list])
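To see how concat lines columns up by name rather than by position, here's a minimal sketch with two tiny, made-up DataFrames (not our accident data) whose columns appear in different orders:

```python
import pandas as pd

# Two toy frames with the same columns in different orders.
old = pd.DataFrame({"make": ["BELL"], "year": [2023]})
new = pd.DataFrame({"year": [2024], "make": ["ROBINSON"]})

# concat matches columns by name, so nothing ends up in the wrong place.
combined = pd.concat([old, new])
print(combined["make"].tolist())  # ['BELL', 'ROBINSON']
```

The combined frame keeps the column order of the first DataFrame in the list, which is exactly what happens with our accident data below.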

How do we check what this actually did? Let’s take a quick look at our columns.

count() gives us a brief overview of how many non-null values we have in each column. If we did this right and our data is complete, we should have the same number of values in every column.

updated_accident_list.count()
event_id                  166
ntsb_make                 166
ntsb_model                163
ntsb_number               163
year                      166
date                      166
city                      166
state                     165
country                   166
total_fatalities          163
latimes_make              166
latimes_model             166
latimes_make_and_model    166
ntsb-model                  3
ntsb-number                 3
total-fatalities            3
dtype: int64

Hang on. That doesn’t look right. Only three values in ntsb-model, ntsb-number, and total-fatalities? And their underscore twins, ntsb_model, ntsb_number, and total_fatalities, are each three short of the total. Those three stray values should have been combined with the existing columns instead of landing in columns of their own.

We need to change the column names from the 2024 data so they match previous years’.

Luckily, there’s a way to do that.

Write out any column names you want to be changed in this format:

bad_columns = {"ntsb-model" : "ntsb_model",
               "ntsb-number" : "ntsb_number",
               "total-fatalities" : "total_fatalities"}

This is called a dictionary. It’s a very useful data type built into Python. It works a lot like a list, except each item is a pair: a key that maps to a value. In our case, the key is the current column name and the value is the new name we want to assign.

That means the old names go to the left of the colon, and the new names go to the right.
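If you want to see the key-to-value mapping in action before using it, you can look up any old name directly; the dictionary hands back the new one:

```python
bad_columns = {"ntsb-model": "ntsb_model",
               "ntsb-number": "ntsb_number",
               "total-fatalities": "total_fatalities"}

# Looking up a key returns the value it maps to: the new column name.
print(bad_columns["ntsb-model"])  # ntsb_model
```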

Note

Renaming columns in a dictionary can get really handy for larger datasets. You can even create a default, reusable dictionary that renames standard or common strings to your newsroom’s house style, or that changes all instances of “number” to “#”. This dataset is small, but when you start analyzing datasets with thousands of columns, this can save you a lot of pain.
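As a sketch of that idea, here's what a reusable house-style dictionary might look like. The names and rules here are invented for illustration; you'd swap in your own newsroom's conventions:

```python
import pandas as pd

# A hypothetical house-style mapping, reusable across datasets.
house_style = {"no.": "number",
               "pct": "percent",
               "dob": "date_of_birth"}

# A made-up frame with the kinds of headers you might encounter.
df = pd.DataFrame({"no.": [1], "pct": [50.0], "dob": ["1990-01-01"]})

# rename ignores any keys that don't appear in the frame, so the same
# dictionary can be applied safely to many different datasets.
df = df.rename(columns=house_style)
print(list(df.columns))  # ['number', 'percent', 'date_of_birth']
```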

Next, let’s grab the DataFrame we originally imported the 2024 data into (new_accident_list) and apply our column-name cleaner to it before we concatenate it with the other data.

(If you rename the columns after concatenating, you’ll just end up with two columns that share a name but hold different values.)

new_accident_list = new_accident_list.rename(columns=bad_columns)

Okay, let’s see how that worked.

new_accident_list.count()
ntsb_model                3
ntsb_number               3
year                      3
date                      3
city                      3
state                     3
country                   3
total_fatalities          3
latimes_make              3
latimes_model             3
latimes_make_and_model    3
event_id                  3
ntsb_make                 3
dtype: int64

This looks more like what we want to put together.

Let’s try the concat step again.

updated_accident_list = pd.concat([accident_list, new_accident_list])

And check our work with count.

updated_accident_list.count()
event_id                  166
ntsb_make                 166
ntsb_model                166
ntsb_number               166
year                      166
date                      166
city                      166
state                     165
country                   166
total_fatalities          166
latimes_make              166
latimes_model             166
latimes_make_and_model    166
dtype: int64

We now have 166 values in every column (except state, which had one missing value in the original data) and no split columns. Notice that concat also reordered the new rows’ columns for us automatically, because it matches on column name. The combined DataFrame takes its column order from the first DataFrame in the list; if you wanted to keep the new_accident_list order instead, you’d list it first in the concat command.
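One more detail worth knowing: by default, concat carries over each DataFrame's original row labels, so the combined frame ends up with repeated index values. Passing ignore_index=True tells pandas to build a fresh index instead. A minimal sketch with toy frames:

```python
import pandas as pd

a = pd.DataFrame({"x": [1, 2]})  # index 0, 1
b = pd.DataFrame({"x": [3]})     # index 0 again

# Without ignore_index, the old labels repeat: 0, 1, 0.
print(pd.concat([a, b]).index.tolist())

# With ignore_index=True, pandas renumbers from zero: 0, 1, 2.
print(pd.concat([a, b], ignore_index=True).index.tolist())
```

The same option would apply to our updated_accident_list if the repeated labels ever got in the way.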

And now we can re-run our whole analysis with a fresh year of data added in — without changing anything except the name of the DataFrame it’s operating on.

We could also overwrite our original DataFrame with the new data so we don’t even have to change the DataFrame name to re-run the analysis. But that can be risky if you don’t keep track of your order of operations and leave good notes for yourself.