3. Pandas

Python is filled with functions to do pretty much anything you’d ever want to do with a programming language: navigate the web, parse data, interact with a database, run fancy statistics, build a pretty website and so much more.

Creative people have put these tools to work to get a wide range of things done in the academy, the laboratory and even in outer space. Some are included in a toolbox that comes with the language, known as the standard library. Others have been built by members of Python’s developer community and need to be downloaded and installed from the web.

One third-party tool that’s important for this class is called pandas. It was invented for use at a financial investment firm and has become the leading open-source library for accessing and analyzing data in many different fields.

3.1. Import pandas

Create a new cell at the top of your notebook where we will import pandas for our use. Type in the following and hit the play button.

import pandas

If nothing happens, that’s good. It means you have pandas installed and ready to use.

Note

Since pandas is created by a third party independent from the core Python developers, it wouldn’t be installed by default if you followed our advanced installation instructions.

It’s available to you because the JupyterLab Desktop developers have pre-selected a curated list of common utilities to include with the package, another reason to love their easy installer.

If your notebook doesn’t have pandas, you can install it by running %pip install pandas in a cell. This will download and install the library using the pip package manager and Jupyter’s built-in magic command.
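If you’d like to check whether pandas is available before installing anything, here is a minimal sketch. It relies on importlib, a tool from Python’s standard library that isn’t otherwise covered in this lesson, so treat it as an optional extra rather than a required step. The variable names are made up for illustration.

```python
import importlib.util

# find_spec returns None when Python cannot locate the package
pandas_spec = importlib.util.find_spec("pandas")
pandas_installed = pandas_spec is not None

if pandas_installed:
    print("pandas is installed")
else:
    print("pandas is missing; run %pip install pandas in a cell")
```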

Return to the cell with the import and rewrite it like this.

import pandas as pd

This will import the pandas library under the shorter variable name of pd. This is standard practice in the pandas community. You will frequently see examples of pandas code online using pd as shorthand. It’s not required, but it’s good to get in the habit so that your code is more likely to be quickly understood by other programmers.

Note

In Python, a variable is a named location in the computer’s memory where a value can be stored and retrieved for later use. Variables can hold data values such as numbers, strings, lists or objects, and you can use them throughout a program to refer to the stored value.

To create your own variable in Python, you use the assignment operator (=) to assign a value to a variable. The variable name is on the left side of the assignment operator and the value is on the right side.
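For instance, a short sketch of assignment might look like this. The names my_number, my_text and total are invented for the example; you can pick any names you like.

```python
# The name goes on the left of the equals sign, the value on the right
my_number = 42
my_text = "hello"

# Once a variable exists, you can use it to refer to the stored value
total = my_number + 8
print(total)
```

Running the cell prints 50, the stored 42 plus 8.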

3.2. Conduct a simple data analysis

Those two little letters contain dozens of data analysis tools that we’ll use in future lessons. They can read in millions of records, compute advanced statistics, filter, sort, rank and do just about anything else you’d want to do with data.

We’ll get to all of that soon enough, but let’s start out with something simple.

Let’s make a list of numbers in a new notebook cell. To keep things simple, enter all of the even numbers between zero and ten. Name its variable something plain like my_list. Press play.

my_list = [2, 4, 6, 8]

You can do cool stuff with any list, even calculate advanced statistics, if you’re a skilled Python programmer who is ready and willing to write a big chunk of code. The advantage of pandas is that it saves time by quickly and easily analyzing data with hardly any computer code at all.
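To get a feel for the work pandas saves you, here is a rough sketch of how you might calculate the mean and standard deviation of that list by hand with plain Python. It is one of several possible approaches, and the variable names are made up. It divides by one less than the number of values, which matches the sample standard deviation that pandas reports by default.

```python
import math

my_list = [2, 4, 6, 8]

# The mean is the total divided by the number of values
mean = sum(my_list) / len(my_list)

# Square each value's distance from the mean
squared_diffs = [(x - mean) ** 2 for x in my_list]

# Sample standard deviation: divide by one less than the count, then take the square root
std = math.sqrt(sum(squared_diffs) / (len(my_list) - 1))

print(mean)
print(std)
```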

In this case, it’s as simple as converting that plain Python list into what pandas calls a Series. Here’s how to make it happen in your next cell. Let’s stick with simple variables and name it my_series.

my_series = pd.Series(my_list)
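If you’re curious what you just created, a quick sketch like the one below can help you inspect it. The cell repeats the earlier steps so it can run on its own; the checks themselves are optional.

```python
import pandas as pd

my_list = [2, 4, 6, 8]
my_series = pd.Series(my_list)

# Confirm the new object is a pandas Series rather than a plain list
print(type(my_series))

# A Series knows how many values it holds
print(len(my_series))

# You can always convert it back to a plain Python list
print(my_series.tolist())
```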

Once the data becomes a Series, you can immediately run a wide range of descriptive statistics. Let’s try a few.

How about summing all the numbers? Make a new cell and run this. It should spit out the total.

my_series.sum()
20

Then find the maximum value in the next.

my_series.max()
8

The minimum value in the next.

my_series.min()
2

How about the average, which is also known as the mean?

my_series.mean()
5.0

The median?

my_series.median()
5.0

The standard deviation?

my_series.std()
2.581988897471611

Finally, all of the above, plus a little more about the distribution, in one simple command.

my_series.describe()
count    4.000000
mean     5.000000
std      2.581989
min      2.000000
25%      3.500000
50%      5.000000
75%      6.500000
max      8.000000
dtype: float64
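The result of describe is itself a Series, with each statistic stored under a label. If you wanted to pull out a single number, one way is sketched below. The summary variable is a name made up for this example, and the cell rebuilds the series so it can run on its own.

```python
import pandas as pd

my_series = pd.Series([2, 4, 6, 8])

# Save the summary statistics to a variable
summary = my_series.describe()

# Look up individual statistics by their labels
print(summary["mean"])
print(summary["50%"])
```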

Before you move on, go back to the cell with your my_list variable and change what’s in the list. Here I’ll change the values from evens to odds.

my_list = [1, 3, 5, 7, 9]

Then rerun all the cells below it. You’ll see all the statistics update to reflect the different dataset. For instance, the output of the final describe call changes to:

my_series = pd.Series(my_list)
my_series.describe()
count    5.000000
mean     5.000000
std      3.162278
min      1.000000
25%      3.000000
50%      5.000000
75%      7.000000
max      9.000000
dtype: float64

If you substituted in a series of 10 million records, your notebook would calculate all those same statistics without you needing to write any more code. Once your data, however large or complex, is imported into pandas, there’s little limit to what you can do to filter, merge, group, aggregate, compute or chart using simple methods like the ones above. In the chapter to come we’ll start doing just that, using data from a real Los Angeles Times investigation.
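If you want to see the 10 million claim for yourself, here is a small sketch that uses made-up data: the whole numbers from zero up to 9,999,999, generated with Python’s built-in range function. The names big_series and big_summary are invented for the example, and the cell may take a few seconds to run on an older computer.

```python
import pandas as pd

# Ten million whole numbers, starting at zero (made-up data for illustration)
big_series = pd.Series(range(10_000_000))

# The same one-line command works no matter how large the Series is
big_summary = big_series.describe()
print(big_summary)
```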