8.1. Pandas Built-in Data Visualization#

Pandas offers a simple, high-level interface for creating plots directly from Series and DataFrame objects. It builds on Matplotlib behind the scenes, allowing you to make quick, readable plots with calls like df.plot() or Series.plot(), while still retaining access to Matplotlib’s full range of customization options.

In this section you will learn:

  1. plotting: Use df.plot(kind=...) or the shorthand methods such as df.plot.line, df.plot.hist, df.plot.scatter, and df.plot.box to create plots.

  2. custom parameters: Most plotting calls accept Matplotlib keyword arguments such as the ones below and return a Matplotlib Axes object for further customization:

    • figsize

    • lw

    • alpha

    • title

    • xlabel

    • ylabel

  3. style sheets: Apply a consistent look to all plots globally with plt.style.use(...).

  4. color & size:

    • For scatter plots you can color by a column (c) or

    • size by an array (s=df['col']*scale) and

    • use cmap for colormaps.

  5. fine-tuning: For large or dense data, consider hexbin or KDE plots instead of scatter.
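
As a minimal sketch of the first two points above: the kind=... form and the shorthand accessor draw the same type of plot, and each call returns a Matplotlib Axes that can be adjusted afterwards. The DataFrame demo and its columns x and y are hypothetical stand-ins, not part of the course data.

```python
import numpy as np
import pandas as pd

# Hypothetical demo data; any numeric DataFrame behaves the same way.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(50, 2)), columns=["x", "y"])

# Two equivalent ways to request a line plot.
ax1 = demo.plot(kind="line", figsize=(6, 3), lw=1, alpha=0.8, title="kind='line'")
ax2 = demo.plot.line(figsize=(6, 3), lw=1, alpha=0.8, title="plot.line")

# The returned object is a Matplotlib Axes, so further tweaks are plain Matplotlib.
ax2.set_xlabel("row index")
ax2.set_ylabel("value")
```

Keeping a reference to the returned Axes (here ax2) is what makes later fine-tuning possible.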

Note: Jupyter Notebooks enable inline plotting (no pop-up window) with %matplotlib inline, an IPython magic command that tells the notebook to render Matplotlib plots directly in the output cell instead of in a separate window. In a Jupyter environment with IPython >= 7 you don’t need to run %matplotlib inline yourself.

The common plot types in Pandas, with an example command and a typical use for each, include:

| Plot Type | Command Example | Usage Example |
| --- | --- | --- |
| Area | df.plot.area(alpha=...) | Show cumulative totals or overlapping trends over time |
| Bar / Stacked Bar | df.plot.bar(...), df.plot.bar(stacked=True) | Compare categorical data; stacked shows parts of a whole |
| Histogram | df.plot.hist(bins=...) | Display frequency distribution of numeric data |
| Line | df.plot.line(x=..., y=..., figsize=..., lw=...) | Visualize trends or changes over time |
| Scatter | df.plot.scatter(x=..., y=..., c=..., cmap=..., s=...) | Explore relationships or correlations between two variables |
| Box | df.plot.box() | Summarize data distribution and detect outliers |
| Hexbin | df.plot.hexbin(x=..., y=..., gridsize=..., cmap=...) | Visualize density of points in large scatter datasets |
| KDE / Density | df.plot.kde(), df.plot.density() | Estimate the probability density function of a variable |
| Pie | df.plot.pie(...) | Show proportions or percentage breakdowns of a whole |

Later, when you learn Matplotlib, you will see why these plotting methods are much easier to use. Pandas visualization balances ease of use with control over the figure. Many of the plot calls also accept the keyword arguments of the underlying matplotlib.pyplot function.
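
To illustrate how extra keyword arguments are passed through to Matplotlib, here is a small sketch. The Series ts is hypothetical; linestyle and marker are ordinary Matplotlib line properties that Pandas forwards to the underlying plotting call.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the first ten square numbers.
ts = pd.Series(np.arange(10) ** 2, name="squares")

# linestyle and marker are forwarded to Matplotlib's line plotting.
ax = ts.plot(linestyle="--", marker="o", color="tab:red")

# The returned Axes accepts any further Matplotlib call.
ax.grid(True)
```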

%pip install pandas numpy --quiet   ### ensure pandas and numpy are installed; comment out when done
%pip install matplotlib --quiet   ### ensure matplotlib is installed; comment out when done

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt    ### pandas plotting is built on Matplotlib
# %matplotlib inline               ### usually unnecessary in recent Jupyter
Note: you may need to restart the kernel to use updated packages.

8.1.1. Loading Data#

Two CSV files of synthetic data are provided; read them in as DataFrames:

df1 = pd.read_csv('../../data/df1',index_col=0)
df2 = pd.read_csv('../../data/df2')
### what does the data look like?

df1.head()
A B C D
2000-01-01 1.339091 -0.163643 -0.646443 1.041233
2000-01-02 -0.774984 0.137034 -0.882716 -2.253382
2000-01-03 -0.921037 -0.482943 -0.417100 0.478638
2000-01-04 -1.738808 -0.072973 0.056517 0.015085
2000-01-05 -0.905980 1.778576 0.381918 0.291436
### descriptive statistics

df1.describe()
A B C D
count 1000.000000 1000.000000 1000.000000 1000.000000
mean -0.017755 0.048072 -0.001723 0.002432
std 0.957223 1.004197 0.982384 1.066366
min -3.693201 -2.719020 -2.987971 -3.182746
25% -0.639101 -0.652530 -0.690831 -0.676107
50% -0.017793 0.058035 -0.012805 -0.044868
75% 0.623478 0.696946 0.706496 0.721699
max 3.412236 3.199850 3.342484 2.879793
df2.head()
a b c d
0 0.039762 0.218517 0.103423 0.957904
1 0.937288 0.041567 0.899125 0.977680
2 0.780504 0.008948 0.557808 0.797510
3 0.672717 0.247870 0.264071 0.444358
4 0.053829 0.520124 0.552264 0.190008
df2.describe()
a b c d
count 10.000000 10.000000 10.000000 10.000000
mean 0.460880 0.352935 0.587008 0.631597
std 0.340793 0.301272 0.284332 0.258158
min 0.039762 0.008948 0.103423 0.190008
25% 0.212334 0.179302 0.427949 0.457694
50% 0.371366 0.240298 0.555036 0.584144
75% 0.753558 0.515799 0.873619 0.837267
max 0.937288 0.997075 0.907307 0.977680
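
If the CSV files are not at hand, stand-in data of roughly the same shape can be generated. This is a sketch under stated assumptions: the names df1_demo and df2_demo are hypothetical, and the distributions (standard normal for df1, uniform on [0, 1) for df2) are only guessed from the summary statistics above, so the values will not match the course files.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Stand-in for df1: 1000 daily rows of standard-normal values (assumed distribution).
df1_demo = pd.DataFrame(
    rng.standard_normal((1000, 4)),
    index=pd.date_range("2000-01-01", periods=1000, freq="D"),
    columns=list("ABCD"),
)

# Stand-in for df2: 10 rows of uniform values in [0, 1) (assumed distribution).
df2_demo = pd.DataFrame(rng.random((10, 4)), columns=list("abcd"))

print(df1_demo.head())
print(df2_demo.describe())
```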

8.1.2. Style Sheets#

By default, Pandas pulls colors from Matplotlib’s axes.prop_cycle, an rcParam (runtime configuration parameter) that holds a cycler, an iterator that loops through a list of predefined colors. That’s why each line gets a different color when you plot multiple lines; the default cycle has 10 colors, after which it repeats.
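
The color cycle can be inspected directly. The sketch below reads it inside a temporary "default" style context so that a previously applied style sheet does not affect the result.

```python
import matplotlib.pyplot as plt

# Read the color cycle as defined by Matplotlib's default style.
with plt.style.context("default"):
    colors = plt.rcParams["axes.prop_cycle"].by_key()["color"]

print(len(colors))   # number of colors before the cycle repeats
print(colors[:3])    # first few colors as hex strings
```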

Matplotlib has style sheets (or themes) you can use to make your plots look a little nicer. Popular stylesheets include:

  • bmh (Bayesian Methods for Hackers)

  • fivethirtyeight (FiveThirtyEight is a news site)

  • ggplot (R’s ggplot2 default theme)

A style sheet, or theme, is a set of style rules that your plots follow. Using one gives your plots a unified, more professional look and feel. You can even create your own style sheet.
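
As a sketch of two related Matplotlib features: plt.style.available lists the installed style names, and plt.style.context applies a style only inside a with block, which is handy for trying a theme without changing the global setting. The Series s is hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Names that can be passed to plt.style.use(...) on this installation.
print(sorted(plt.style.available)[:5])

s = pd.Series([1, 3, 2, 5, 4])

# The style applies only to artists created inside this block.
with plt.style.context("ggplot"):
    fig, ax = plt.subplots(figsize=(5, 3))
    s.plot(ax=ax, title="ggplot, scoped to this block")
```

The figure and axes are created inside the block because a style takes effect when those objects are created.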

Before plt.style.use():

df1['A'].hist()
<Axes: >
../../_images/0ba1d8cb2aae51a93c2de7ebf44ff260bf63e8650e55ce77be4367a07810b062.png

Call the stylesheet. Let’s use the ggplot theme:

plt.style.use('ggplot')

After applying plt.style.use('ggplot'):

df1['A'].hist()
<Axes: >
../../_images/21ff062ae896d69701faadaa6ce736d7ea4633a294d1a88fa5e6404b5de821c2.png

Now try plt.style.use('bmh'):

plt.style.use('bmh')
df1['A'].hist()
<Axes: >
../../_images/204a078e6a639e353bcaf1d7caf67610c4e64b0955024565358358b3eff1efd5.png

Next, the fivethirtyeight style:

plt.style.use('fivethirtyeight')
df1['A'].hist()
<Axes: >
../../_images/49c33b14de86aa734b83d08fd20560a6c36ac427b67304bc5dc08b93b9d46086.png

A dark background theme:

plt.style.use('dark_background')
df1['A'].hist()
<Axes: >
../../_images/704ea2e6196f869168552ffdf32789bd0e9d1c78a6db5cbece8e56e736568f52.png

Let’s stick with the ggplot style for now.

plt.style.use('ggplot')
# plt.style.use('bmh')

8.1.3. Plot Types#

Pandas has several built-in plot types, most of them statistical in nature:

  • df.plot.area

  • df.plot.bar & df.plot.barh

  • df.plot.density

  • df.plot.hist

  • df.plot.line

  • df.plot.scatter

  • df.plot.box

  • df.plot.hexbin

  • df.plot.kde

  • df.plot.pie

Let’s call these plot-type methods (the last name in each entry above, e.g. box or barh) to see how they work.

### remember the dataframe

df2.head()
a b c d
0 0.039762 0.218517 0.103423 0.957904
1 0.937288 0.041567 0.899125 0.977680
2 0.780504 0.008948 0.557808 0.797510
3 0.672717 0.247870 0.264071 0.444358
4 0.053829 0.520124 0.552264 0.190008

8.1.3.1. Area Plot#

df2.plot.area(alpha=0.4)
<Axes: >
../../_images/1d36d79209fc6b0ebaaaf21e41c43f887fadd98d8817edaf81c8ce22565a8d4f.png

8.1.3.2. Bar Plots#

Bar plots are one of the most common ways to compare categorical data visually. They represent quantities as rectangular bars whose length (or height) corresponds to the value being measured. In Python, they are typically created with Matplotlib or Seaborn; Pandas exposes them through df.plot.bar and df.plot.barh.

Both bar() and barh() create bar charts; the difference is simply orientation. Stacked bar charts, on the other hand, show how multiple subcategories contribute to a total within each main category.

| Plot Type | Orientation | Purpose | Best For |
| --- | --- | --- | --- |
| Bar Plot | Vertical | Compare category values | Simple category comparisons |
| Barh Plot | Horizontal | Compare category values with long labels | Readability and ranking-type data |
| Stacked Bar | Either | Show part-to-whole relationships | Composition of totals across categories |
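
The three variants can be compared on a small categorical example. This is an illustrative sketch: the sales DataFrame, its region labels, and its column names are made up for the purpose.

```python
import pandas as pd

# Hypothetical data: two sales channels across three regions.
sales = pd.DataFrame(
    {"online": [12, 18, 9], "in_store": [20, 15, 11]},
    index=["North", "South", "West"],
)

# Grouped vertical bars: one group per region, one bar per column.
ax_v = sales.plot.bar(rot=0, title="Grouped vertical bars")

# Same data drawn horizontally; useful when labels are long.
ax_h = sales.plot.barh(title="Horizontal bars")

# Stacked bars: each region's bar height is the total across columns.
ax_s = sales.plot.bar(stacked=True, rot=0, title="Stacked bars")
```

The index supplies the category labels, and each column becomes one series of bars.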

8.1.3.2.1. Vertical Bar Plot#

### bar plot

df2.plot.bar()
<Axes: >
../../_images/d4673405b5ab5074f113cddb4f642b345bcdcc03a2ce8a40413693c763216e50.png

8.1.3.2.2. Horizontal Bar Plot#

### bar plot horizontal

df2.plot.barh()
<Axes: >
../../_images/d6222a4597e8945c6df0183dd2143a069f6657b9214a52a022f21a3910fb417b.png
### bar plot stacked

df2.plot.bar(stacked=True)
<Axes: >
../../_images/cb8982e0ea9c40ca22597a0dcbbcf1b05dc8f74f6a759f39f3272cb2f6529946.png

8.1.3.3. Histograms#

df1['A'].plot.hist(bins=100)
<Axes: ylabel='Frequency'>
../../_images/d76e755b14608208a84f5937dfa9935acd1852dba03707877bbb933c6b5fa766.png

8.1.3.4. Line Plots#

# df1.plot.line(x=df1.index, y='B',figsize=(12,3),lw=1)

df1.plot.line(y='B',figsize=(10,6),lw=5)
<Axes: >
../../_images/214f9662121f6cc7bc11d6279ad4205335f20f35768cad67bddfd6ea611fe12b.png

8.1.3.5. Scatter Plots#

df1.plot.scatter(x='A',y='B')
<Axes: xlabel='A', ylabel='B'>
../../_images/d469d74b1421e64c5ebbb1f2c65ba3c2c8f1b94ee20543dabe40af9426357523.png

8.1.4. Color Maps#

### the color of each point corresponds to the values in column C
### try them out

# df1.plot.scatter(x=df1['A'], y=df1['B'], c=df1['C'], cmap='coolwarm')  ### need to use x='A' instead of x=df1['A']
df1.plot.scatter(x='A', y='B', c='C', cmap='coolwarm')       ### color value from 'C'; colormap
df1.plot.scatter(x='A', y='B', c='C', cmap='viridis')      ### another colormap
<Axes: xlabel='A', ylabel='B'>
../../_images/2d4ae987b56c01bb1cdc5ad5a3e6d88868c9d08a4655a626888895699661ec8e.png ../../_images/5c8f4360651e18aea1a1d64c6af06d1191b09163ebb627b818842eeabbb449f6.png

8.1.5. Size#

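You can also use s to set the marker size from another column. Pass an array of sizes such as df1['C']*10 (recent pandas versions also accept a column name). Sizes are marker areas and must be non-negative; because column C contains negative values, the call below triggers a RuntimeWarning from Matplotlib’s square root: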

### try different sizes

# df1.plot.scatter(x='A', y='B', s=df1['C']*200)
df1.plot.scatter(x='A',y='B',s=df1['C']*10)
/Users/tychen/workspace/dsm/.venv/lib/python3.13/site-packages/matplotlib/collections.py:999: RuntimeWarning: invalid value encountered in sqrt
  scale = np.sqrt(self._sizes) * dpi / 72.0 * self._factor
<Axes: xlabel='A', ylabel='B'>
../../_images/ecaece2a69246c4155298315baa30931f2b3693d4c12683050117a1f4d3882f0.png
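
One way to avoid the warning is to derive the sizes from the magnitude of the column, so that every marker area is non-negative. This is a sketch on hypothetical data (pts), not the course’s df1; the scale factor 50 is arbitrary.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df1 with columns A, B, C.
rng = np.random.default_rng(1)
pts = pd.DataFrame(rng.standard_normal((200, 3)), columns=["A", "B", "C"])

# Marker areas must be non-negative, so scale the absolute value of C.
sizes = pts["C"].abs() * 50

ax = pts.plot.scatter(x="A", y="B", s=sizes, alpha=0.5)
```

Taking the absolute value discards the sign of C; if the sign matters, it can be shown with color (c='C' and a diverging cmap) while the size shows the magnitude.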

8.1.5.1. Box Plots#

df2.plot.box() # Can also pass a by= argument for groupby
<Axes: >
../../_images/e2c20fe165e827cdfc5dca9554779b22a9f7b15a1b8d3f3d9e7804d24aa2bd9a.png

8.1.5.2. Hexagonal Bin Plot#

Useful for bivariate data; an alternative to a scatter plot when the points are dense:

np.random.seed(42)
df = pd.DataFrame(np.random.randn(1000, 2), columns=['a', 'b'])
df.plot.hexbin(x='a', y='b', gridsize=25, cmap='Oranges')
<Axes: xlabel='a', ylabel='b'>
../../_images/fee4f0509db01a3a42e611a982f8b5b86e5b8cc7a76efa3bea74d3ba1a79efaf.png

8.1.5.3. Kernel Density Estimation plot (KDE)#

  • A smoothed version of a histogram

%pip install scipy    ### ensure scipy is installed; comment out when done

df2['a'].plot.kde()
<Axes: ylabel='Density'>
../../_images/02f36622e6423688adc2c8e0fcb1c7fe980afb1bf098e4b36b5a7991727c89bd.png

8.1.5.4. Density Plot#

df2.plot.density()
<Axes: ylabel='Density'>
../../_images/e38eabbc2a42e35e07a08dd5910d6e1880d96b3de8501a2d302047e1e03a489b.png

Using the density plot as an example, here is how Pandas visualization compares with Seaborn and Matplotlib:

| Library | Function | Notes |
| --- | --- | --- |
| Seaborn | sns.kdeplot() | most common, easy to use |
| Pandas | .plot(kind="density") | convenient for quick plots |
| Matplotlib + SciPy | gaussian_kde() | manual control over details |
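
To make the Pandas and Matplotlib + SciPy rows concrete, the sketch below draws both on the same axes. The Series values is hypothetical, and the evaluation grid (one unit beyond the data range, 200 points) is an arbitrary choice. Seaborn is left out so that the example needs no extra install.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.stats import gaussian_kde

# Hypothetical sample data.
rng = np.random.default_rng(2)
values = pd.Series(rng.standard_normal(300), name="a")

fig, ax = plt.subplots(figsize=(6, 3))

# Pandas: a single call (it relies on SciPy internally).
values.plot.kde(ax=ax, label="pandas .plot.kde()")

# Matplotlib + SciPy: estimate the density, choose the grid, then draw it.
kde = gaussian_kde(values.to_numpy())
grid = np.linspace(values.min() - 1, values.max() + 1, 200)
ax.plot(grid, kde(grid), linestyle="--", label="scipy gaussian_kde")

ax.legend()
```

The manual route takes more lines but exposes the bandwidth (the bw_method argument of gaussian_kde) and the evaluation grid.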

The end.