12. Charts¶
Python has a number of charting tools that can work hand-in-hand with pandas. While Altair is a relatively new package compared to classics like matplotlib, it has great documentation and is easy to configure. Let’s take it for a spin.
12.1. Make a basic bar chart¶
The first thing we need to do is import Altair. In the tradition of pandas, we'll import it with the alias alt to reduce how much we need to type later on.
The setup cell below recreates the merged_list DataFrame from the previous chapters so this chapter can stand on its own.
# Suppress warning messages to keep the notebook output tidy
import warnings
warnings.simplefilter("ignore")
import pandas as pd
# Load the NTSB accident list and standardize the make-and-model column
accident_list = pd.read_csv("https://raw.githubusercontent.com/palewire/first-python-notebook/main/docs/src/_static/ntsb-accidents.csv")
accident_list["latimes_make_and_model"] = accident_list["latimes_make_and_model"].str.upper()
# Count accidents by make and model
accident_counts = accident_list.groupby(["latimes_make", "latimes_make_and_model"]).size().rename("accidents").reset_index()
# Load the FAA flight-hour survey and standardize its join column
survey = pd.read_csv("https://raw.githubusercontent.com/palewire/first-python-notebook/main/docs/src/_static/faa-survey.csv")
survey["latimes_make_and_model"] = survey["latimes_make_and_model"].str.upper()
# Merge the two tables and calculate accident rates
merged_list = pd.merge(accident_counts, survey, on="latimes_make_and_model")
merged_list["per_hour"] = merged_list.accidents / merged_list.total_hours
merged_list["per_100k_hours"] = (merged_list.accidents / merged_list.total_hours) * 100_000
import altair as alt
Note
If the import triggers an error that says your notebook doesn't have Altair, you can install it by running %pip install altair in a cell. This will download and install the library using the pip package manager and Jupyter's built-in magic command.
In a typical analysis, you’d import all of your libraries in one cell at the top of the file. That way, if you need to install or make changes to the packages a notebook uses, you know where to find them and you won’t hit errors importing a package midway through running a file.
With Altair imported, we can now feed it our DataFrame to make a simple bar chart. Let's take a look at the basic building block of an Altair chart: the Chart object. We'll tell it that we want to create a chart from merged_list by passing the DataFrame in, like so:
alt.Chart(merged_list)
---------------------------------------------------------------------------
SchemaValidationError                     Traceback (most recent call last)
...
SchemaValidationError: '{'data': {'name': 'data-37aa3d2cc96e41928ba6304b789aa722'}}' is an invalid value.

'mark' is a required property

alt.Chart(...)
OK! We got an error, but don’t panic. The error says that Altair needs a “mark” — that is to say, it needs to know not only what data we want to visualize, but also how to represent that data visually. There are lots of different marks that Altair can use (you can check them all out here). But let’s try out the most versatile mark in our visualization toolbox: the bar.
alt.Chart(merged_list).mark_bar()
That's an improvement, but we've got a new error: Altair doesn't know which columns of our DataFrame to look at! At a minimum, we also need to define the columns to use for the x- and y-axes. We can do that by chaining in the encode method.
alt.Chart(merged_list).mark_bar().encode(
x="latimes_make_and_model",
y="per_100k_hours"
)
That’s more like it!
Here's an idea: What if we drew horizontal bars instead of vertical ones? How would you rewrite this chart code to flip those bars?
alt.Chart(merged_list).mark_bar().encode(
x="per_100k_hours",
y="latimes_make_and_model"
)
This chart is an okay start, but it's sorted alphabetically by its y-axis labels, which is pretty sloppy and hard to parse visually. Let's fix that.
We want to sort the y-axis values by their corresponding x values. We know how to do that in pandas, but Altair has its own opinions about how to sort a DataFrame, so it will override any sort order on the DataFrame we pass in.
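If you want to see that for yourself, try sorting the DataFrame with pandas before handing it over. This is just a quick sketch to demonstrate the point; the bars will still come out in Altair's default alphabetical order.

# Pre-sorting in pandas doesn't change the chart: Altair applies its own
# default (alphabetical) order to the y-axis unless we tell it otherwise
presorted = merged_list.sort_values("per_100k_hours", ascending=False)
alt.Chart(presorted).mark_bar().encode(
    x="per_100k_hours",
    y="latimes_make_and_model"
)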
Until now, we've been using the shorthand syntax to create our axes, but to add more customization to our chart we'll have to switch to the longform way of defining the y-axis. To do that, we'll use a syntax like this: alt.Y(column_name). Instead of passing a string to y and letting Altair do the rest, this lets us create a y-axis object and then give it additional instructions.
alt.Chart(merged_list).mark_bar().encode(
x="per_100k_hours",
y=alt.Y("latimes_make_and_model")
)
This chart should look identical to our previous attempt when we created the y-axis the simpler way, but it opens up new options! Now we can instruct Altair to sort the y-axis by the x-axis values.
alt.Chart(merged_list).mark_bar().encode(
x="per_100k_hours",
y=alt.Y("latimes_make_and_model").sort("x")
)
That’s looking a lot neater! By default, the sort order will be small to large. Visually, if we want to feature the highest accident rates, it probably makes sense to reverse that order. We can do that by adding a minus before the axis name.
alt.Chart(merged_list).mark_bar().encode(
x="per_100k_hours",
y=alt.Y("latimes_make_and_model").sort("-x")
)
And we can’t have a chart without context. Let’s throw in a title for good measure.
alt.Chart(merged_list).mark_bar().encode(
x="per_100k_hours",
y=alt.Y("latimes_make_and_model").sort("-x")
).properties(
title="Helicopter accident rates"
)
Yay, we made a chart!
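Bonus: If you'd like friendlier axis labels than the raw column names, the same longform syntax we used for sorting works on the x-axis too. Here's a sketch; the label text is just a suggestion.

alt.Chart(merged_list).mark_bar().encode(
    # alt.X works like alt.Y: wrapping the column name unlocks extra options
    x=alt.X("per_100k_hours").title("Accidents per 100,000 flight hours"),
    y=alt.Y("latimes_make_and_model").sort("-x").title(None)
).properties(
    title="Helicopter accident rates"
)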
12.2. Other marks¶
What if we wanted to switch it up and show this data in a slightly different form? For example, in the Los Angeles Times story, the fatal accident rate is shown as a scaled circle.
We can try that out with just a few small tweaks, using Altair's mark_circle option. We'll keep the y encoding, since we still want to split out our chart by make and model. Instead of an x encoding, though, we'll pass in a size encoding, which will scale the area of each circle to that rate calculation. And hey, while we're at it, let's throw in an interactive tooltip that displays the accident rate when users hover over a mark.
alt.Chart(merged_list).mark_circle().encode(
size="per_100k_hours",
y="latimes_make_and_model",
tooltip="per_100k_hours"
)
A nice little change from all the bar charts! But once again, the default sorting is alphabetical by name. Instead, it would be really nice to sort this by rate, as we did with the bar chart. How would we go about that?
alt.Chart(merged_list).mark_circle().encode(
size="per_100k_hours",
y=alt.Y("latimes_make_and_model").sort("-size"),
tooltip="per_100k_hours"
)
12.3. datetime data¶
One thing you’ll almost certainly find yourself grappling with time and time again is date (and time) fields, so let’s talk about how to handle them.
Let's see how to do that with our original DataFrame, accident_list, which contains one record for every helicopter accident. We can remind ourselves what it contains with the info command.
accident_list.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 163 entries, 0 to 162
Data columns (total 13 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   event_id                163 non-null    object
 1   ntsb_make               163 non-null    object
 2   ntsb_model              163 non-null    object
 3   ntsb_number             163 non-null    object
 4   year                    163 non-null    int64 
 5   date                    163 non-null    object
 6   city                    163 non-null    object
 7   state                   162 non-null    object
 8   country                 163 non-null    object
 9   total_fatalities        163 non-null    int64 
 10  latimes_make            163 non-null    object
 11  latimes_model           163 non-null    object
 12  latimes_make_and_model  163 non-null    object
dtypes: int64(2), object(11)
memory usage: 16.7+ KB
When you import a CSV file with read_csv, pandas takes a guess at the type of each column — for example, integer, float or boolean — but anything it can't identify, including our dates, defaults to a generic object type, which generally behaves like a string, or text, field. You can see the data types that pandas assigned to the accident list on the right-hand side of the info table.
Take a look above and you'll see that pandas is treating the date column as an object. That means we can't chart it using Python's system for working with dates. But we can fix that. The to_datetime method included with pandas can handle the conversion. Here's how to reassign the date column after making the change.
accident_list["date"] = pd.to_datetime(accident_list["date"])
This redefines each object in that column as a date. If your dates are in an unusual or ambiguous format, you may have to pass in a specific formatter, but in this case pandas should be able to guess correctly.
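If pandas does guess wrong, here's a rough sketch of what passing a formatter looks like; the format string is only an example, so match it to how your dates actually appear.

accident_list["date"] = pd.to_datetime(
    accident_list["date"],
    format="%Y-%m-%d",  # hypothetical format; adjust to your data, e.g. "%m/%d/%Y"
    errors="coerce"     # unparseable values become NaT instead of raising an error
)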
Run info
again and you’ll notice a change. The data type for date
has changed.
accident_list.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 163 entries, 0 to 162
Data columns (total 13 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   event_id                163 non-null    object        
 1   ntsb_make               163 non-null    object        
 2   ntsb_model              163 non-null    object        
 3   ntsb_number             163 non-null    object        
 4   year                    163 non-null    int64         
 5   date                    163 non-null    datetime64[ns]
 6   city                    163 non-null    object        
 7   state                   162 non-null    object        
 8   country                 163 non-null    object        
 9   total_fatalities        163 non-null    int64         
 10  latimes_make            163 non-null    object        
 11  latimes_model           163 non-null    object        
 12  latimes_make_and_model  163 non-null    object        
dtypes: datetime64[ns](1), int64(2), object(10)
memory usage: 16.7+ KB
Now that we’ve got that out of the way, let’s see if we can chart with it, tracking fatalities over time.
alt.Chart(accident_list).mark_bar().encode(
x="date",
y="total_fatalities"
)
This works nicely on the x-axis, but the y-axis isn't quite accurate: when more than one accident happens on the same date, the bars are drawn on top of each other rather than added together. To make sure this chart is accurate, we'll need to aggregate the y-axis values in some way.
12.4. Aggregate with Altair¶
We could back out and create a new dataset grouped by date, but Altair actually lets us do some of that grouping on the fly. We want to add up everything that happens on the same date, so we'll pop a sum function onto our y column.
alt.Chart(accident_list).mark_bar().encode(
x="date",
y="sum(total_fatalities)"
)
This is getting there. But sometimes plotting on a day-by-day basis isn’t all that useful — especially over a long period of time like we have here.
Again, we could back out and create a new DataFrame grouping by month, but we don’t have to — in addition to standard operations (sum, mean, median, etc.), Altair gives us some handy datetime aggregation options. You can find a list of options in the library documentation.
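(For comparison, here's roughly what that monthly rollup would look like if we built it ourselves in pandas, a sketch using pd.Grouper that the Altair shorthand below saves us from writing.)

# Group the accidents into month-sized bins and total the fatalities in each
fatalities_by_month = (
    accident_list
    .groupby(pd.Grouper(key="date", freq="MS"))["total_fatalities"]
    .sum()
    .reset_index()
)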
alt.Chart(accident_list).mark_bar().encode(
x="yearmonth(date)",
y="sum(total_fatalities)",
)
This is great for showing the pattern of fatalities over time, but it doesn’t give us additional information that might be useful. For example, we almost certainly want to investigate the trend for each manufacturer.
What we can do is facet the chart, which will create separate charts, one for each helicopter maker.
alt.Chart(accident_list).mark_bar().encode(
x="yearmonth(date)",
y="sum(total_fatalities)",
facet="latimes_make"
)
12.5. Add a color¶
What important fact in the data is this chart not showing? There are two Robinson models in the ranking. It might be nice to emphasize them.
We have that latimes_make column in our original DataFrame, but it got lost when we created our ranking because we didn't include it in our groupby command. We can fix that by scrolling back up in our notebook and adding it to the command. You will need to replace what's there with a list containing both columns we want to keep. Note that because we're listing more than one column in the groupby call now, we need to surround the column names with a pair of square brackets, like so:
accident_counts = accident_list.groupby(["latimes_make", "latimes_make_and_model"]).size().rename("accidents").reset_index()
Rerun all of the cells after that one to update everything you’re working with and add the new column.
Note
Remember: If we change a variable, future cells that use that variable won't change unless we run them again. When you go back and make these changes, make sure to run all of the cells that come after them as well; otherwise you may not get the results you're expecting.
This is one reason it can be good to clear cell outputs and rerun your analysis every so often. If you've been going back and forth editing cells and tweaking your analysis, you may have saved variables in memory that are no longer accurate. One way to catch that is to restart your notebook's "kernel" and rerun everything to make sure it still runs as you expect (in the Jupyter menu, Kernel > Restart Kernel and Clear All Outputs, or Restart Kernel and Run Up to Selected Cell).
Now, when you inspect your merged_list
variable, you should see the latimes_make
column included.
merged_list.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 6 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   latimes_make            12 non-null     object 
 1   latimes_make_and_model  12 non-null     object 
 2   accidents               12 non-null     int64  
 3   total_hours             12 non-null     int64  
 4   per_hour                12 non-null     float64
 5   per_100k_hours          12 non-null     float64
dtypes: float64(2), int64(2), object(2)
memory usage: 708.0+ bytes
Let's put that to use with an Altair option that we haven't toyed with yet: color.
alt.Chart(merged_list).mark_bar().encode(
x="per_100k_hours",
y=alt.Y("latimes_make_and_model").sort("-x"),
color="latimes_make"
).properties(
title="Helicopter accident rates"
)
Hey now! That wasn't too hard, was it? But now there are too many colors. It would be easier to read this chart, and to highlight the information we want readers to notice, if we used one color for the Robinson bars and another color for everything else.
The simplest way to do this is to hand Altair a DataFrame with a column that holds the values we want to color-code on. We already have the latimes_make column, but in this case we don't want that many values; we just want one value for the Robinson rows and another value for all the non-Robinson rows. It doesn't really matter what those two values are!
How might we go about creating that column? (Hint: We can adapt the technique we learned about in the Filters chapter!)
One way to do this is to create a test for rows with a latimes_make value equal to "ROBINSON", like so:
merged_list["latimes_make"] == "ROBINSON"
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 True
9 True
10 False
11 False
Name: latimes_make, dtype: bool
That will give us a true/false list. In the Filters chapter, we used that list to filter the DataFrame to only rows that matched this test. But we can also simply define a new column and save that list to it. Let’s call the new column robinson
.
merged_list["robinson"] = merged_list["latimes_make"] == "ROBINSON"
If you take a look at our merged_list
DataFrame, you should now see that new column.
merged_list.head()
  latimes_make latimes_make_and_model  accidents  total_hours      per_hour  per_100k_hours  robinson
0       AGUSTA             AGUSTA 109          2       362172  5.522238e-06        0.552224     False
1       AIRBUS             AIRBUS 130          1      1053786  9.489593e-07        0.094896     False
2       AIRBUS             AIRBUS 135          4       884596  4.521838e-06        0.452184     False
3       AIRBUS             AIRBUS 350         29      3883490  7.467510e-06        0.746751     False
4         BELL               BELL 206         30      5501308  5.453249e-06        0.545325     False
Now, we can alter our chart to use that new column.
alt.Chart(merged_list).mark_bar().encode(
x="per_100k_hours",
y=alt.Y("latimes_make_and_model").sort("-x"),
color="robinson"
).properties(
title="Helicopter accident rates"
)
Bonus: This is fine for exploratory use, but we don't really need that legend, since it highlights information that's already included in the names of the helicopters. To hide it, we can use the more advanced syntax and instruct Altair to skip creating a legend.
alt.Chart(merged_list).mark_bar().encode(
x="per_100k_hours",
y=alt.Y("latimes_make_and_model").sort("-x"),
color=alt.Color("robinson", legend=None)
).properties(
title="Helicopter accident rates"
)
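If you also want to choose the exact colors, you can attach a scale to the color encoding. Here's a sketch; the hex codes are arbitrary stand-ins, not an official palette.

alt.Chart(merged_list).mark_bar().encode(
    x="per_100k_hours",
    y=alt.Y("latimes_make_and_model").sort("-x"),
    # Map True (Robinson) and False (everything else) to explicit colors
    color=alt.Color("robinson", legend=None).scale(
        domain=[True, False],
        range=["#c0392b", "#bdc3c7"]
    )
).properties(
    title="Helicopter accident rates"
)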
12.6. Polishing your chart¶
These charts give us plenty of areas where we might want to dig in and ask more questions, but none are polished enough to pop into a news story quite yet. There are lots of additional labeling, formatting and design options that you can dig into in the Altair docs — you can even create Altair themes to specify default color schemes and fonts.
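As a taste, here's a minimal sketch of registering a theme, assuming the alt.themes registry available in recent Altair releases; the settings are placeholders to tune to your own style.

# Register and enable a simple theme that subsequent charts will inherit
def newsroom_theme():
    return {
        "config": {
            "title": {"fontSize": 18, "anchor": "start"},
            "axis": {"labelFontSize": 12, "titleFontSize": 12},
        }
    }

alt.themes.register("newsroom_theme", newsroom_theme)
alt.themes.enable("newsroom_theme")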
But you may not want to do all that tweaking in Altair, especially if you’re just working on a one-off graphic. If you wanted to hand this chart off to a graphics department, all you’d have to do is head to the top right corner of your chart.
See those three dots? Click on them and you'll see lots of options. Downloading the file as an SVG will let anyone with graphics software like Adobe Illustrator take this file and tweak the design.
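You can also export a chart straight from code with Altair's save method. Here's a sketch; saving to HTML works out of the box, while PNG or SVG output typically requires the optional vl-convert-python package.

chart = alt.Chart(merged_list).mark_bar().encode(
    x="per_100k_hours",
    y=alt.Y("latimes_make_and_model").sort("-x")
)
chart.save("accident_rates.html")  # interactive HTML, no extra dependencies
chart.save("accident_rates.svg")   # may require: %pip install vl-convert-python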
To get the raw data out, you’ll need to learn one last pandas trick. It’s covered in our final chapter.