12. Charts¶
Python has a number of charting tools that can work hand-in-hand with pandas. While Altair is a relatively new package compared to classics like matplotlib, it has great documentation and is easy to configure. Let’s take it for a spin.
12.1. Make a basic bar chart¶
The first thing we need to do is import Altair. In the tradition of pandas, we'll import it with the alias alt to reduce how much we need to type later on.
The setup cell below recreates the merged_list DataFrame from the previous chapters so this chapter can stand on its own.
# Suppress warning messages to keep the notebook output tidy
import warnings
warnings.simplefilter("ignore")
import pandas as pd
# Load the NTSB accident list and standardize the make-and-model column
accident_list = pd.read_csv("https://raw.githubusercontent.com/palewire/first-python-notebook/main/docs/src/_static/ntsb-accidents.csv")
accident_list["latimes_make_and_model"] = accident_list["latimes_make_and_model"].str.upper()
# Count accidents by make and model
accident_counts = accident_list.groupby(["latimes_make", "latimes_make_and_model"]).size().rename("accidents").reset_index()
# Load the FAA flight-hour survey and standardize its join column
survey = pd.read_csv("https://raw.githubusercontent.com/palewire/first-python-notebook/main/docs/src/_static/faa-survey.csv")
survey["latimes_make_and_model"] = survey["latimes_make_and_model"].str.upper()
# Merge the two tables and calculate accident rates
merged_list = pd.merge(accident_counts, survey, on="latimes_make_and_model")
merged_list["per_hour"] = merged_list.accidents / merged_list.total_hours
merged_list["per_100k_hours"] = (merged_list.accidents / merged_list.total_hours) * 100_000
import altair as alt
Note
If the import triggers an error that says your notebook doesn't have Altair, you can install it by running %pip install altair in a cell. This will download and install the library using the pip package manager and Jupyter's built-in magic command.
In a typical analysis, you’d import all of your libraries in one cell at the top of the file. That way, if you need to install or make changes to the packages a notebook uses, you know where to find them and you won’t hit errors importing a package midway through running a file.
With Altair imported, we can now feed it our DataFrame to make a simple bar chart. Let's take a look at the basic building block of an Altair chart: the Chart object. We'll tell it that we want to create a chart from merged_list by passing the DataFrame in, like so:
alt.Chart(merged_list)
---------------------------------------------------------------------------
SchemaValidationError                     Traceback (most recent call last)
...
SchemaValidationError: '{'data': {'name': 'data-37aa3d2cc96e41928ba6304b789aa722'}}' is an invalid value.

'mark' is a required property

alt.Chart(...)
OK! We got an error, but don’t panic. The error says that Altair needs a “mark” — that is to say, it needs to know not only what data we want to visualize, but also how to represent that data visually. There are lots of different marks that Altair can use (you can check them all out here). But let’s try out the most versatile mark in our visualization toolbox: the bar.
alt.Chart(merged_list).mark_bar()
That's an improvement, but we've got a new error: Altair doesn't know which columns of our DataFrame to look at! At a minimum, we also need to define the columns to use for the x- and y-axes. We can do that by chaining in the encode method.
alt.Chart(merged_list).mark_bar().encode(
x="latimes_make_and_model",
y="per_100k_hours"
)
That’s more like it!
Here's an idea: What if we drew horizontal bars instead of vertical ones? How would you rewrite this chart code to flip those bars?
alt.Chart(merged_list).mark_bar().encode(
x="per_100k_hours",
y="latimes_make_and_model"
)
This chart is an okay start, but it's sorted alphabetically by its y-axis labels, which is pretty sloppy and hard to parse visually. Let's fix that.
We want to sort the y-axis values by their corresponding x values. We know how to do that in pandas, but Altair has its own opinions about how to sort a DataFrame, so it will override any sort order on the DataFrame we pass in.
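If you want to see that for yourself, try sorting the DataFrame with pandas before handing it over. This is just a quick sketch to demonstrate the point; the bars will still come out in Altair's default alphabetical order.

# Pre-sorting in pandas doesn't change the chart: Altair applies its own
# default (alphabetical) order to the y-axis unless we tell it otherwise
presorted = merged_list.sort_values("per_100k_hours", ascending=False)
alt.Chart(presorted).mark_bar().encode(
    x="per_100k_hours",
    y="latimes_make_and_model"
)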
Until now, we've been using the shorthand syntax to create our axes, but to add more customization to our chart we'll have to switch to the longform way of defining the y-axis. To do that, we'll use a syntax like this: alt.Y(column_name). Instead of passing a string to y and letting Altair do the rest, this lets us create a y-axis object and then give it additional instructions.
alt.Chart(merged_list).mark_bar().encode(
x="per_100k_hours",
y=alt.Y("latimes_make_and_model")
)
This chart should look identical to our previous attempt when we created the y-axis the simpler way, but it opens up new options! Now we can instruct Altair to sort the y-axis by the x-axis values.
alt.Chart(merged_list).mark_bar().encode(
x="per_100k_hours",
y=alt.Y("latimes_make_and_model").sort("x")
)
That’s looking a lot neater! By default, the sort order will be small to large. Visually, if we want to feature the highest accident rates, it probably makes sense to reverse that order. We can do that by adding a minus before the axis name.
alt.Chart(merged_list).mark_bar().encode(
x="per_100k_hours",
y=alt.Y("latimes_make_and_model").sort("-x")
)
And we can’t have a chart without context. Let’s throw in a title for good measure.
alt.Chart(merged_list).mark_bar().encode(
x="per_100k_hours",
y=alt.Y("latimes_make_and_model").sort("-x")
).properties(
title="Helicopter accident rates"
)
Yay, we made a chart!
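Bonus: If you'd like friendlier axis labels than the raw column names, the same longform syntax we used for sorting works on the x-axis too. Here's a sketch; the label text is just a suggestion.

alt.Chart(merged_list).mark_bar().encode(
    # alt.X works like alt.Y: wrapping the column name unlocks extra options
    x=alt.X("per_100k_hours").title("Accidents per 100,000 flight hours"),
    y=alt.Y("latimes_make_and_model").sort("-x").title(None)
).properties(
    title="Helicopter accident rates"
)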
12.2. Other marks¶
What if we wanted to switch it up and show this data in a slightly different form? For example, in the Los Angeles Times story, the fatal accident rate is shown as a scaled circle.
We can try that out with just a few small tweaks, using Altair's mark_circle option. We'll keep the y encoding, since we still want to split out our chart by make and model. Instead of an x encoding, though, we'll pass in a size encoding, which will scale the area of each circle to that rate calculation. And hey, while we're at it, let's throw in an interactive tooltip that displays the accident rate when users hover over a mark.
alt.Chart(merged_list).mark_circle().encode(
size="per_100k_hours",
y="latimes_make_and_model",
tooltip="per_100k_hours"
)
A nice little change from all the bar charts! But once again, the default sorting is alphabetical by name. Instead, it would be really nice to sort this by rate, as we did with the bar chart. How would we go about that?
alt.Chart(merged_list).mark_circle().encode(
size="per_100k_hours",
y=alt.Y("latimes_make_and_model").sort("-size"),
tooltip="per_100k_hours"
)
12.3. datetime data¶
One thing you’ll almost certainly find yourself grappling with time and time again is date (and time) fields, so let’s talk about how to handle them.
Let's see how to do that with our original DataFrame, accident_list, which contains one record for every helicopter accident. We can remind ourselves what it contains with the info command.
accident_list.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 163 entries, 0 to 162
Data columns (total 13 columns):
 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   event_id                163 non-null    object
 1   ntsb_make               163 non-null    object
 2   ntsb_model              163 non-null    object
 3   ntsb_number             163 non-null    object
 4   year                    163 non-null    int64 
 5   date                    163 non-null    object
 6   city                    163 non-null    object
 7   state                   162 non-null    object
 8   country                 163 non-null    object
 9   total_fatalities        163 non-null    int64 
 10  latimes_make            163 non-null    object
 11  latimes_model           163 non-null    object
 12  latimes_make_and_model  163 non-null    object
dtypes: int64(2), object(11)
memory usage: 16.7+ KB
When you import a CSV file with read_csv, pandas takes a guess at the type of each column — for example, integer, float or boolean — but anything it can't identify, including our dates, defaults to a generic object type, which generally behaves like a string, or text, field. You can see the data types that pandas assigned to the accident list on the right-hand side of the info table.
Take a look above and you'll see that pandas is treating the date column as an object. That means we can't chart it using Python's system for working with dates. But we can fix that. The to_datetime method included with pandas can handle the conversion. Here's how to reassign the date column after making the change.
accident_list["date"] = pd.to_datetime(accident_list["date"])
This redefines each object in that column as a date. If your dates are in an unusual or ambiguous format, you may have to pass in a specific formatter, but in this case pandas should be able to guess correctly.
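If pandas does guess wrong, here's a rough sketch of what passing a formatter looks like; the format string is only an example, so match it to how your dates actually appear.

accident_list["date"] = pd.to_datetime(
    accident_list["date"],
    format="%Y-%m-%d",  # hypothetical format; adjust to your data, e.g. "%m/%d/%Y"
    errors="coerce"     # unparseable values become NaT instead of raising an error
)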
Run info
again and you’ll notice a change. The data type for date
has changed.
accident_list.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 163 entries, 0 to 162
Data columns (total 13 columns):
 #   Column                  Non-Null Count  Dtype         
---  ------                  --------------  -----         
 0   event_id                163 non-null    object        
 1   ntsb_make               163 non-null    object        
 2   ntsb_model              163 non-null    object        
 3   ntsb_number             163 non-null    object        
 4   year                    163 non-null    int64         
 5   date                    163 non-null    datetime64[ns]
 6   city                    163 non-null    object        
 7   state                   162 non-null    object        
 8   country                 163 non-null    object        
 9   total_fatalities        163 non-null    int64         
 10  latimes_make            163 non-null    object        
 11  latimes_model           163 non-null    object        
 12  latimes_make_and_model  163 non-null    object        
dtypes: datetime64[ns](1), int64(2), object(10)
memory usage: 16.7+ KB
Now that we’ve got that out of the way, let’s see if we can chart with it, tracking fatalities over time.
alt.Chart(accident_list).mark_bar().encode(
x="date",
y="total_fatalities"
)
This works nicely on the x-axis, but the y-axis isn't quite accurate: when more than one accident happens on the same date, the bars are drawn on top of each other rather than added together. To make sure this chart is accurate, we'll need to aggregate the y-axis values in some way.
12.4. Aggregate with Altair¶
We could back out and create a new dataset grouped by date, but Altair actually lets us do some of that grouping on the fly. We want to add up everything that happens on the same date, so we'll pop a sum function onto our y column.
alt.Chart(accident_list).mark_bar().encode(
x="date",
y="sum(total_fatalities)"
)
This is getting there. But sometimes plotting on a day-by-day basis isn’t all that useful — especially over a long period of time like we have here.
Again, we could back out and create a new DataFrame grouping by month, but we don’t have to — in addition to standard operations (sum, mean, median, etc.), Altair gives us some handy datetime aggregation options. You can find a list of options in the library documentation.
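(For comparison, here's roughly what that monthly rollup would look like if we built it ourselves in pandas, a sketch using pd.Grouper that the Altair shorthand below saves us from writing.)

# Group the accidents into month-sized bins and total the fatalities in each
fatalities_by_month = (
    accident_list
    .groupby(pd.Grouper(key="date", freq="MS"))["total_fatalities"]
    .sum()
    .reset_index()
)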
alt.Chart(accident_list).mark_bar().encode(
x="yearmonth(date)",
y="sum(total_fatalities)",
)
This is great for showing the pattern of fatalities over time, but it doesn’t give us additional information that might be useful. For example, we almost certainly want to investigate the trend for each manufacturer.
What we can do is facet the chart, which will create separate charts, one for each helicopter maker.
alt.Chart(accident_list).mark_bar().encode(
x="yearmonth(date)",
y="sum(total_fatalities)",
facet="latimes_make"
)
12.5. Add a color¶
What important fact in the data is this chart not showing? There are two Robinson models in the ranking. It might be nice to emphasize them.
We have that latimes_make column in our original DataFrame, but it got lost when we created our ranking because we didn't include it in our groupby command. We can fix that by scrolling back up in our notebook and adding it to the command. You will need to replace what's there with a list containing both columns we want to keep. Note that because we're listing more than one column in the groupby call now, we need to surround the column names with a pair of square brackets, like so:
accident_counts = accident_list.groupby(["latimes_make", "latimes_make_and_model"]).size().rename("accidents").reset_index()
Rerun all of the cells after that one to update everything you’re working with and add the new column.
Note
Remember: If we change a variable, future cells that use that variable won't change unless we run them again. When you go back and make these changes, make sure to run all of the cells that come after them as well; otherwise you may not get the results you're expecting.
This is one reason it can be good to clear cell outputs and rerun your analysis every so often. If you've been going back and forth editing cells and tweaking your analysis, you may have saved variables in memory that are no longer accurate. One way to catch that is to restart your notebook's "kernel" and rerun everything to make sure it still runs as you expect (in the Jupyter menu, Kernel > Restart Kernel and Clear All Outputs, or Restart Kernel and Run Up to Selected Cell).
Now, when you inspect your merged_list
variable, you should see the latimes_make
column included.
merged_list.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12 entries, 0 to 11
Data columns (total 6 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   latimes_make            12 non-null     object 
 1   latimes_make_and_model  12 non-null     object 
 2   accidents               12 non-null     int64  
 3   total_hours             12 non-null     int64  
 4   per_hour                12 non-null     float64
 5   per_100k_hours          12 non-null     float64
dtypes: float64(2), int64(2), object(2)
memory usage: 708.0+ bytes
Let's put that to use with an Altair option that we haven't toyed with yet: color.
alt.Chart(merged_list).mark_bar().encode(
x="per_100k_hours",
y=alt.Y("latimes_make_and_model").sort("-x"),
color="latimes_make"
).properties(
title="Helicopter accident rates"
)
Hey now! That wasn't too hard, was it? But now there are too many colors. It would be easier to read this chart, and to highlight the information we want readers to notice, if we used one color for the Robinson bars and another color for everything else.
The simplest way to do this is to hand Altair a DataFrame with a column that holds the values we want to color-code on. We already have the latimes_make column, but in this case we don't want that many values; we just want one value for the Robinson rows and another value for all the non-Robinson rows. It doesn't really matter what those two values are!
How might we go about creating that column? (Hint: We can adapt the technique we learned about in the Filters chapter!)
One way to do this is to create a test for rows with a latimes_make value equal to "ROBINSON", like so:
merged_list["latimes_make"] == "ROBINSON"
0 False
1 False
2 False
3 False
4 False
5 False
6 False
7 False
8 True
9 True
10 False
11 False
Name: latimes_make, dtype: bool
That will give us a true/false list. In the Filters chapter, we used that list to filter the DataFrame to only rows that matched this test. But we can also simply define a new column and save that list to it. Let’s call the new column robinson
.
merged_list["robinson"] = merged_list["latimes_make"] == "ROBINSON"
If you take a look at our merged_list
DataFrame, you should now see that new column.
merged_list.head()
  latimes_make latimes_make_and_model  accidents  total_hours      per_hour  per_100k_hours  robinson
0       AGUSTA             AGUSTA 109          2       362172  5.522238e-06        0.552224     False
1       AIRBUS             AIRBUS 130          1      1053786  9.489593e-07        0.094896     False
2       AIRBUS             AIRBUS 135          4       884596  4.521838e-06        0.452184     False
3       AIRBUS             AIRBUS 350         29      3883490  7.467510e-06        0.746751     False
4         BELL               BELL 206         30      5501308  5.453249e-06        0.545325     False
Now, we can alter our chart to use that new column.
alt.Chart(merged_list).mark_bar().encode(
x="per_100k_hours",
y=alt.Y("latimes_make_and_model").sort("-x"),
color="robinson"
).properties(
title="Helicopter accident rates"
)
Bonus: This is fine for exploratory use, but we don't really need that legend, since it highlights information that's already included in the names of the helicopters. To hide it, we can use the more advanced syntax and instruct Altair to skip creating a legend.
alt.Chart(merged_list).mark_bar().encode(
x="per_100k_hours",
y=alt.Y("latimes_make_and_model").sort("-x"),
color=alt.Color("robinson", legend=None)
).properties(
title="Helicopter accident rates"
)
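If you also want to choose the exact colors, you can attach a scale to the color encoding. Here's a sketch; the hex codes are arbitrary stand-ins, not an official palette.

alt.Chart(merged_list).mark_bar().encode(
    x="per_100k_hours",
    y=alt.Y("latimes_make_and_model").sort("-x"),
    # Map True (Robinson) and False (everything else) to explicit colors
    color=alt.Color("robinson", legend=None).scale(
        domain=[True, False],
        range=["#c0392b", "#bdc3c7"]
    )
).properties(
    title="Helicopter accident rates"
)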
12.6. Polishing your chart¶
These charts give us plenty of areas where we might want to dig in and ask more questions, but none are polished enough to pop into a news story quite yet. There are lots of additional labeling, formatting and design options that you can dig into in the Altair docs — you can even create Altair themes to specify default color schemes and fonts.
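As a taste, here's a minimal sketch of registering a theme, assuming the alt.themes registry available in recent Altair releases; the settings are placeholders to tune to your own style.

# Register and enable a simple theme that subsequent charts will inherit
def newsroom_theme():
    return {
        "config": {
            "title": {"fontSize": 18, "anchor": "start"},
            "axis": {"labelFontSize": 12, "titleFontSize": 12},
        }
    }

alt.themes.register("newsroom_theme", newsroom_theme)
alt.themes.enable("newsroom_theme")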
But you may not want to do all that tweaking in Altair, especially if you’re just working on a one-off graphic. If you wanted to hand this chart off to a graphics department, all you’d have to do is head to the top right corner of your chart.
See those three dots? Click on them and you'll see lots of options. Downloading the file as an SVG will let anyone with graphics software like Adobe Illustrator take this file and tweak the design.
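You can also export a chart straight from code with Altair's save method. Here's a sketch; saving to HTML works out of the box, while PNG or SVG output typically requires the optional vl-convert-python package.

chart = alt.Chart(merged_list).mark_bar().encode(
    x="per_100k_hours",
    y=alt.Y("latimes_make_and_model").sort("-x")
)
chart.save("accident_rates.html")  # interactive HTML, no extra dependencies
chart.save("accident_rates.svg")   # may require: %pip install vl-convert-python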
To get the raw data out, you’ll need to learn one last pandas trick. It’s covered in our final chapter.