3. Pandas

Lucky for us, Python is filled with functions to do pretty much anything you’d ever want to do with a programming language: navigate the web, parse data, interact with a database, run fancy statistics, build a pretty website and so much more.

Creative people have put these tools to work to get a wide range of things done in the academy, the laboratory and even in outer space.

Some of those tools are included in a toolbox that comes with the language, known as the standard library. Others have been built by members of Python’s developer community and need to be downloaded and installed from the web.

One that’s important for this class is called pandas. It is a tool invented at a financial investment firm that has become a leading open-source library for accessing and analyzing data in many different fields.

3.1. Import pandas

Create a new cell at the top of your Jupyter notebook. There we will import the pandas library for use in our script. Type in the following and hit the play button again.

import pandas

If nothing happens, that’s good. It means you have pandas installed and ready to use.
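If you would rather see a message than silence, a minimal sketch like the one below reports whether pandas can be found and which version was loaded. It relies on `importlib.util.find_spec` from the standard library, which is not part of this lesson and is used here only for illustration. The `message` variable is a name made up for this example.

```python
import importlib.util

# find_spec returns None when Python cannot locate the named package
pandas_spec = importlib.util.find_spec("pandas")

if pandas_spec is None:
    message = "pandas is not installed in this environment"
else:
    import pandas
    message = f"pandas {pandas.__version__} is ready to use"

print(message)
```

The exact version number printed will depend on your installation.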

Note

Since pandas is created by a third party separate from the core Python developers, it wouldn’t be installed by default if you followed our advanced installation instructions.

It’s available to you because the JupyterLab Desktop developers have pre-selected a curated list of common utilities to include with their installation, another reason to love their easy installer.

Return to the cell with the import and rewrite it like this.

import pandas as pd

This will import the pandas library under the shorter name pd. This is standard practice in the pandas community and you will frequently see examples of pandas code online using it as shorthand. It’s not required, but it’s good to get in the habit so that your code will be understood by other computer programmers.

3.2. Conduct a simple data analysis

Those two little letters contain dozens of data analysis tools that we’ll use in future lessons.

They can import massive data files, compute advanced statistics, filter, sort, rank and do just about anything else you’d want to do.

We’ll get to all of that soon enough, but let’s start out with something simple.

Let’s make a list of numbers in a new notebook cell. To keep things simple, enter all of the even numbers between zero and ten. Press play.

my_list = [2, 4, 6, 8]

If you’re a skilled Python programmer, you can do some cool stuff with any list, and even run some stats. But if you hand the list over to pandas instead, you’ll be impressed by how easily you can analyze the data without knowing much computer code at all.

In this case, it’s as simple as converting that plain Python list into what pandas calls a Series. Here’s how to make it happen in your next cell.

my_series = pd.Series(my_list)
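If you are curious what the conversion produced, the short sketch below prints the Series and pulls its contents back out as plain Python lists. It repeats the import and the list so it can run on its own. The `tolist` calls are standard pandas methods that this lesson does not otherwise cover.

```python
import pandas as pd

my_list = [2, 4, 6, 8]
my_series = pd.Series(my_list)

# Printing shows each value next to an automatically assigned position label
print(my_series)

# The values and their labels can be pulled back out as plain Python lists
print(my_series.tolist())        # [2, 4, 6, 8]
print(my_series.index.tolist())  # [0, 1, 2, 3]
```

The labels on the left, starting at zero, are what pandas calls the index.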

Once the data becomes a Series, you can immediately run a wide range of descriptive statistics. Let’s try a few.

First, let’s sum all the numbers. Make a new cell and run this. It should spit out the total.

my_series.sum()
20

Then find the maximum value in the next.

my_series.max()
8

The minimum value in the next.

my_series.min()
2

How about the average, which is also known as the mean?

my_series.mean()
5.0

The median?

my_series.median()
5.0

And the standard deviation?

my_series.std()
2.581988897471611
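That figure is the sample standard deviation, which divides by one less than the number of values. The sketch below, offered as an illustration and not as part of the lesson's notebook, cross-checks it against Python's built-in `statistics` module and shows the population version for comparison. The `ddof` argument is a real pandas parameter; the variable names are made up for this example.

```python
import statistics

import pandas as pd

my_list = [2, 4, 6, 8]
my_series = pd.Series(my_list)

# pandas defaults to the sample standard deviation (ddof=1)
sample_std = my_series.std()

# Passing ddof=0 divides by the full count instead, giving the population version
population_std = my_series.std(ddof=0)

# The standard library offers the same two calculations for plain lists
stdlib_sample = statistics.stdev(my_list)
stdlib_population = statistics.pstdev(my_list)

print(sample_std, stdlib_sample)
print(population_std, stdlib_population)
```

Both sample results should agree with the number above, while the population results come out a little smaller.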

Finally, all of the above, plus a little more about the distribution, in one simple command.

my_series.describe()
count    4.000000
mean     5.000000
std      2.581989
min      2.000000
25%      3.500000
50%      5.000000
75%      6.500000
max      8.000000
dtype: float64
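The summary that comes back is itself a Series, so individual figures can be looked up by the labels shown on the left. Here is a minimal sketch, assuming the same four numbers; the `summary` variable is a name chosen for this example.

```python
import pandas as pd

my_series = pd.Series([2, 4, 6, 8])

# describe() returns a Series labeled with the name of each statistic
summary = my_series.describe()

# Look up single statistics by their label
print(summary["mean"])  # 5.0
print(summary["75%"])   # 6.5

# The 25% row matches what the quantile method computes directly
print(my_series.quantile(0.25))  # 3.5
```

The percentile rows fall between the actual values in the list because pandas interpolates between neighboring numbers by default.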

Substitute in a series of 10 million records at the top of the notebook, or even just the odd numbers between zero and ten, and your notebook would calculate all those same statistics without you needing to write any more code.
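To try the odd-number swap, a sketch along these lines should do it. The `odd_series` name is made up for this example; every method call is one already used above.

```python
import pandas as pd

# The odd numbers between zero and ten
odd_series = pd.Series([1, 3, 5, 7, 9])

# The same methods work unchanged on the new data
print(odd_series.sum())     # 25
print(odd_series.mean())    # 5.0
print(odd_series.median())  # 5.0
print(odd_series.describe())
```

Only the first line of data changed. Everything after it is the same code as before.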

Once your data, however large or complex, is imported into pandas, there’s little limit to what you can do to filter, merge, group, aggregate, compute or chart using simple methods like the ones above.

In the next chapter we’ll get started doing just that, using data tracking the flow of money in California politics.