Pandas Built-in Data Visualization

8.1. Pandas Built-in Data Visualization#

Pandas offers a simple, high-level interface for creating plots directly from Series and DataFrame objects. It builds on Matplotlib behind the scenes, allowing you to make quick, readable plots with calls like df.plot() or Series.plot(), while still retaining access to Matplotlib’s full range of customization options.

In this section you will learn:

plotting: Use df.plot(kind=...) or the shorthand methods such as df.plot.line, df.plot.hist, df.plot.scatter, and df.plot.box to create plots.
custom parameters: Most plotting calls accept Matplotlib keyword arguments such as the ones below and return a Matplotlib Axes object for further customization:
- figsize
- lw
- alpha
- title
- xlabel
- ylabel
style sheets: Set a global look for all plots with plt.style.use(...).
color & size: For scatter plots, color points by a column (c), size them by an array (s=df['col']*scale), and choose a colormap with cmap.
fine-tuning: For large or dense data, consider hexbin or KDE plots instead of scatter.

Note: In a Jupyter Notebook, the IPython magic command %matplotlib inline tells Jupyter to render Matplotlib plots directly in the notebook output cell instead of in a separate pop-up window. You don’t need %matplotlib inline in a Jupyter environment with IPython >= 7.

The common plot types in Pandas, with an example command and a typical use for each, are:

Plot Type	Command Example	Usage Example
Area	`df.plot.area(alpha=...)`	Show cumulative totals or overlapping trends over time
Bar / Stacked Bar	`df.plot.bar(...)`, `df.plot.bar(stacked=True)`	Compare categorical data; stacked shows parts of a whole
Histogram	`df.plot.hist(bins=...)`	Display frequency distribution of numeric data
Line	`df.plot.line(x=..., y=..., figsize=..., lw=...)`	Visualize trends or changes over time
Scatter	`df.plot.scatter(x=..., y=..., c=..., cmap=..., s=...)`	Explore relationships or correlations between two variables
Box	`df.plot.box()`	Summarize data distribution and detect outliers
Hexbin	`df.plot.hexbin(x=..., y=..., gridsize=..., cmap=...)`	Visualize density of points in large scatter datasets
KDE / Density	`df.plot.kde()`, `df.plot.density()`	Estimate the probability density function of a variable
Pie	`df.plot.pie(...)`	Show proportions or percentage breakdowns of a whole

Later, when you learn Matplotlib, you will see why these plotting methods are much easier to use. Pandas visualization balances ease of use with control over the figure. Many of the plot calls also accept the additional arguments of the underlying matplotlib.pyplot call.
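As a rough illustration of passing extra arguments and then customizing the returned Axes, here is a minimal sketch. The DataFrame `walk` and its values are made up for the example and are not part of the section's data files; the Agg backend is selected only so the code also runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical demo data: a random walk indexed by date
rng = np.random.default_rng(1)
idx = pd.date_range("2000-01-01", periods=100, freq="D")
walk = pd.DataFrame({"B": rng.normal(size=100).cumsum()}, index=idx)

# figsize, title, xlabel and ylabel are handled by pandas;
# lw and alpha are passed through to Matplotlib's line drawing
ax = walk.plot.line(y="B", figsize=(8, 3), lw=1, alpha=0.8,
                    title="Random walk", xlabel="date", ylabel="value")

# The return value is a Matplotlib Axes, so Matplotlib methods work on it
ax.axhline(0, color="gray", linestyle="--")
```

Keeping the returned Axes in a variable is what makes the later Matplotlib calls possible.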

%pip install pandas numpy --quiet   ### ensure pandas and numpy are installed; comment out when done
%pip install matplotlib --quiet     ### ensure matplotlib is installed; comment out when done

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt    ### pandas plotting is built on Matplotlib
# %matplotlib inline               ### only needed in older Jupyter/IPython setups

Note: you may need to restart the kernel to use updated packages.

8.1.1. Loading Data#

Two CSV files of synthetic data are provided; read them in as DataFrames:

df1 = pd.read_csv('../../data/df1',index_col=0)
df2 = pd.read_csv('../../data/df2')
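If the CSV files are not at hand, stand-in data of a similar shape can be generated. The sketch below is an assumption based on the summaries shown in this section (1000 daily rows of roughly standard-normal values for df1, 10 rows of values between 0 and 1 for df2); it does not reproduce the actual file contents, and the seed is arbitrary.

```python
import numpy as np
import pandas as pd

# Assumption: stand-in data shaped like the section's files, not the real CSV values
rng = np.random.default_rng(101)  # arbitrary seed

# df1: 1000 daily rows of standard-normal values in columns A-D
df1 = pd.DataFrame(
    rng.standard_normal((1000, 4)),
    index=pd.date_range("2000-01-01", periods=1000, freq="D"),
    columns=list("ABCD"),
)

# df2: 10 rows of uniform values in [0, 1) in columns a-d
df2 = pd.DataFrame(rng.random((10, 4)), columns=list("abcd"))

print(df1.shape, df2.shape)
```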

### what does the data look like?

df1.head()

	A	B	C	D
2000-01-01	1.339091	-0.163643	-0.646443	1.041233
2000-01-02	-0.774984	0.137034	-0.882716	-2.253382
2000-01-03	-0.921037	-0.482943	-0.417100	0.478638
2000-01-04	-1.738808	-0.072973	0.056517	0.015085
2000-01-05	-0.905980	1.778576	0.381918	0.291436

### descriptive statistics

df1.describe()

	A	B	C	D
count	1000.000000	1000.000000	1000.000000	1000.000000
mean	-0.017755	0.048072	-0.001723	0.002432
std	0.957223	1.004197	0.982384	1.066366
min	-3.693201	-2.719020	-2.987971	-3.182746
25%	-0.639101	-0.652530	-0.690831	-0.676107
50%	-0.017793	0.058035	-0.012805	-0.044868
75%	0.623478	0.696946	0.706496	0.721699
max	3.412236	3.199850	3.342484	2.879793

df2.head()

	a	b	c	d
0	0.039762	0.218517	0.103423	0.957904
1	0.937288	0.041567	0.899125	0.977680
2	0.780504	0.008948	0.557808	0.797510
3	0.672717	0.247870	0.264071	0.444358
4	0.053829	0.520124	0.552264	0.190008

df2.describe()

	a	b	c	d
count	10.000000	10.000000	10.000000	10.000000
mean	0.460880	0.352935	0.587008	0.631597
std	0.340793	0.301272	0.284332	0.258158
min	0.039762	0.008948	0.103423	0.190008
25%	0.212334	0.179302	0.427949	0.457694
50%	0.371366	0.240298	0.555036	0.584144
75%	0.753558	0.515799	0.873619	0.837267
max	0.937288	0.997075	0.907307	0.977680

8.1.2. Style Sheets#

Pandas by default pulls colors from Matplotlib’s axes.prop_cycle, a Matplotlib rcParam (runtime configuration parameter) that holds a color iterator (cycler) cycling through a list of predefined colors. That’s why you see different colors (10 in the default style) when you plot multiple lines.
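To see which colors the cycle contains, you can inspect the rcParam directly. A minimal sketch, using the 'default' style inside a context manager so the result does not depend on any style set earlier:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter
import matplotlib.pyplot as plt

# Temporarily switch to Matplotlib's default style and read the color cycle
with plt.style.context("default"):
    cycle_colors = plt.rcParams["axes.prop_cycle"].by_key()["color"]

print(len(cycle_colors))   # number of colors before the cycle repeats
print(cycle_colors[:3])    # the first few colors as hex strings
```

Other style sheets define their own cycles, so the count and the colors change once a different style is active.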

Matplotlib has style sheets (or themes) you can use to make your plots look a little nicer. Popular stylesheets include:

bmh (Bayesian Methods for Hackers)
fivethirtyeight (FiveThirtyEight is a news site)
ggplot (R’s ggplot2 default theme)

A style sheet, or theme, defines a set of style rules that your plots follow. Using a style sheet gives your plots a unified, more professional look and feel. You can even create your own style sheet.
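Besides plt.style.use(...), which changes the style for the rest of the session, Matplotlib also offers plt.style.context(...) for applying a style to a single block of code. The sketch below lists some available style names and applies 'ggplot' temporarily; the Series `values` is made-up demo data.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical demo data
values = pd.Series(np.random.default_rng(2).normal(size=200))

# Names that can be passed to plt.style.use or plt.style.context
print(sorted(plt.style.available)[:5])

before = plt.rcParams["axes.facecolor"]
with plt.style.context("ggplot"):
    inside = plt.rcParams["axes.facecolor"]  # ggplot's background color
    ax = values.hist()                       # drawn with the ggplot style
after = plt.rcParams["axes.facecolor"]       # previous setting is restored

print(before, inside, after)
```

The context manager is convenient when you want to compare themes without changing the global setting.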

Before plt.style.use():

df1['A'].hist()

<Axes: >

../../_images/0ba1d8cb2aae51a93c2de7ebf44ff260bf63e8650e55ce77be4367a07810b062.png

Call the stylesheet. Let’s use the ggplot theme:

plt.style.use('ggplot')

After applying plt.style.use('ggplot'):

df1['A'].hist()

<Axes: >

../../_images/21ff062ae896d69701faadaa6ce736d7ea4633a294d1a88fa5e6404b5de821c2.png

Now try plt.style.use('bmh'):

plt.style.use('bmh')
df1['A'].hist()

<Axes: >

../../_images/204a078e6a639e353bcaf1d7caf67610c4e64b0955024565358358b3eff1efd5.png

The fivethirtyeight theme:

plt.style.use('fivethirtyeight')
df1['A'].hist()

<Axes: >

../../_images/49c33b14de86aa734b83d08fd20560a6c36ac427b67304bc5dc08b93b9d46086.png

A dark background theme:

plt.style.use('dark_background')
df1['A'].hist()

<Axes: >

../../_images/704ea2e6196f869168552ffdf32789bd0e9d1c78a6db5cbece8e56e736568f52.png

Let’s stick with the ggplot style for now.

plt.style.use('ggplot')
# plt.style.use('bmh')

8.1.3. Plot Types#

There are several plot types built into Pandas, most of them statistical plots by nature:

df.plot.area
df.plot.bar & df.plot.barh
df.plot.box
df.plot.density
df.plot.hexbin
df.plot.hist
df.plot.kde
df.plot.line
df.plot.pie
df.plot.scatter

Let’s call these plot type methods (the key terms shown in the list above, e.g. ‘box’, ‘barh’, etc) to see how they work.
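Each method name doubles as a value for the kind argument of df.plot. The following sketch, using a small made-up DataFrame named `demo`, draws the same horizontal bar chart both ways and checks that the bars have identical lengths.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical demo data: 5 rows, 2 columns of values in [0, 1)
demo = pd.DataFrame(np.random.default_rng(0).random((5, 2)), columns=["a", "b"])

fig, (left, right) = plt.subplots(1, 2, figsize=(8, 3))

# Same plot, two spellings
demo.plot(kind="barh", ax=left, title="plot(kind='barh')")
demo.plot.barh(ax=right, title="plot.barh()")

# Each Axes holds one rectangle per cell of the DataFrame
same = [p.get_width() for p in left.patches] == [p.get_width() for p in right.patches]
print(len(left.patches), same)
```

The accessor form (df.plot.barh) tends to be easier to discover with tab completion, while kind= is handy when the plot type is chosen at run time.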

### remember the dataframe

df2.head()

	a	b	c	d
0	0.039762	0.218517	0.103423	0.957904
1	0.937288	0.041567	0.899125	0.977680
2	0.780504	0.008948	0.557808	0.797510
3	0.672717	0.247870	0.264071	0.444358
4	0.053829	0.520124	0.552264	0.190008

8.1.3.1. Area Plot#

df2.plot.area(alpha=0.4)

<Axes: >

../../_images/1d36d79209fc6b0ebaaaf21e41c43f887fadd98d8817edaf81c8ce22565a8d4f.png

8.1.3.2. Bar Plots#

Bar plots are one of the most common ways to compare categorical data visually. They represent quantities as rectangular bars whose length (or height) corresponds to the value being measured. In Python, these are typically created using Matplotlib or Seaborn.

Both bar() and barh() create bar charts; the difference is simply orientation. Stacked bar charts, on the other hand, show how multiple subcategories contribute to a total within each main category.

Plot Type	Orientation	Purpose	Best For
Bar Plot	Vertical	Compare category values	Simple category comparisons
Barh Plot	Horizontal	Compare category values with long labels	Readability and ranking-type data
Stacked Bar	Either	Show part-to-whole relationships	Composition of totals across categories
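To make the part-to-whole idea concrete, the sketch below stacks two columns of a small made-up table (`sales`, with hypothetical quarterly numbers) and reads the bar geometry back from the Axes: the top of each stack should equal the row total.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter
import numpy as np
import pandas as pd

# Hypothetical demo data: two subcategories per quarter
sales = pd.DataFrame(
    {"online": [3, 5, 2], "store": [4, 1, 6]},
    index=["Q1", "Q2", "Q3"],
)

ax = sales.plot.bar(stacked=True)

# ax.containers holds one group of bars per column; the last group sits on top
tops = [bar.get_y() + bar.get_height() for bar in ax.containers[-1]]
row_totals = sales.sum(axis=1).tolist()

print(tops)
print(row_totals)
```

With stacked=False (the default) the same data would be drawn as side-by-side bars, which makes individual values easier to compare but hides the totals.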

8.1.3.2.1. Vertical Bar Plot#

### bar plot

df2.plot.bar()

<Axes: >

../../_images/d4673405b5ab5074f113cddb4f642b345bcdcc03a2ce8a40413693c763216e50.png

8.1.3.2.2. Horizontal Bar Plot#

### bar plot horizontal

df2.plot.barh()

<Axes: >

../../_images/d6222a4597e8945c6df0183dd2143a069f6657b9214a52a022f21a3910fb417b.png

8.1.3.2.3. Stacked Bar Plot#

### bar plot stacked

df2.plot.bar(stacked=True)

<Axes: >

../../_images/cb8982e0ea9c40ca22597a0dcbbcf1b05dc8f74f6a759f39f3272cb2f6529946.png

8.1.3.3. Histograms#

df1['A'].plot.hist(bins=100)

<Axes: ylabel='Frequency'>

../../_images/d76e755b14608208a84f5937dfa9935acd1852dba03707877bbb933c6b5fa766.png

8.1.3.4. Line Plots#

# df1.plot.line(x=df1.index, y='B',figsize=(12,3),lw=1)

df1.plot.line(y='B',figsize=(10,6),lw=5)

<Axes: >

../../_images/214f9662121f6cc7bc11d6279ad4205335f20f35768cad67bddfd6ea611fe12b.png

8.1.3.5. Scatter Plots#

df1.plot.scatter(x='A',y='B')

<Axes: xlabel='A', ylabel='B'>

../../_images/d469d74b1421e64c5ebbb1f2c65ba3c2c8f1b94ee20543dabe40af9426357523.png

8.1.4. Color Maps#

Use c to color each point by the value of another column.
Use cmap to choose the colormap.
For the full list of colormaps, see: http://matplotlib.org/users/colormaps.html

### the color of each point corresponds to the values in column C
### try them out

# df1.plot.scatter(x=df1['A'], y=df1['B'], c=df1['C'], cmap='coolwarm')  ### need to use x='A' instead of x=df1['A']
df1.plot.scatter(x='A', y='B', c='C', cmap='coolwarm')       ### color value from 'C'; colormap
df1.plot.scatter(x='A', y='B', c='C', cmap='viridis')      ### another colormap

<Axes: xlabel='A', ylabel='B'>

../../_images/2d4ae987b56c01bb1cdc5ad5a3e6d88868c9d08a4655a626888895699661ec8e.png

../../_images/5c8f4360651e18aea1a1d64c6af06d1191b09163ebb627b818842eeabbb449f6.png

8.1.5. Size#

Use s to set the marker size from another column. Recent pandas versions also accept a column name here, but passing an array lets you scale the values. Marker sizes must be non-negative, and column C contains negative values, so take the absolute value first:

### try different sizes

# df1.plot.scatter(x='A', y='B', s=df1['C'].abs()*200)
df1.plot.scatter(x='A', y='B', s=df1['C'].abs()*10)   ### abs() avoids the sqrt RuntimeWarning caused by negative sizes

<Axes: xlabel='A', ylabel='B'>

../../_images/ecaece2a69246c4155298315baa30931f2b3693d4c12683050117a1f4d3882f0.png

8.1.5.1. Box Plots#

df2.plot.box() # Can also pass a by= argument for groupby

<Axes: >

../../_images/e2c20fe165e827cdfc5dca9554779b22a9f7b15a1b8d3f3d9e7804d24aa2bd9a.png

8.1.5.2. Hexagonal Bin Plot#

A hexagonal bin plot is useful for bivariate data and is an alternative to a scatter plot when there are many points:

np.random.seed(42)
df = pd.DataFrame(np.random.randn(1000, 2), columns=['a', 'b'])
df.plot.hexbin(x='a', y='b', gridsize=25, cmap='Oranges')

<Axes: xlabel='a', ylabel='b'>

../../_images/fee4f0509db01a3a42e611a982f8b5b86e5b8cc7a76efa3bea74d3ba1a79efaf.png
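The color of each hexagon encodes how many points fell inside it. As a small sketch of that idea, the code below draws a hexbin plot for made-up data (`pts`, a larger sample than above) and reads the per-hexagon counts back from the returned Axes; reading them via ax.collections[0] relies on how Matplotlib stores the hexbin result, so treat it as an illustration rather than a stable recipe.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter
import numpy as np
import pandas as pd

# Hypothetical demo data: a dense cloud of 5000 points
rng = np.random.default_rng(42)
n = 5000
pts = pd.DataFrame(rng.standard_normal((n, 2)), columns=["a", "b"])

ax = pts.plot.hexbin(x="a", y="b", gridsize=25, cmap="Oranges")

# The hexagons are stored as one collection; its array holds the bin counts
counts = np.asarray(ax.collections[0].get_array())
print(int(counts.max()))   # points in the busiest hexagon
print(int(counts.sum()))   # points counted across all hexagons
```

A larger gridsize gives smaller hexagons and finer detail, at the cost of noisier counts.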

8.1.5.3. Kernel Density Estimation plot (KDE)#

A KDE plot is a smoothed version of a histogram. Pandas relies on SciPy to compute it, so SciPy must be installed:

%pip install scipy    ### ensure scipy is installed; comment out when done

df2['a'].plot.kde()

<Axes: ylabel='Density'>

../../_images/02f36622e6423688adc2c8e0fcb1c7fe980afb1bf098e4b36b5a7991727c89bd.png

8.1.5.4. Density Plot#

df2.plot.density()

<Axes: ylabel='Density'>

../../_images/e38eabbc2a42e35e07a08dd5910d6e1880d96b3de8501a2d302047e1e03a489b.png

The density plot is a good example of how Pandas visualization differs from Matplotlib and Seaborn:

Library	Function	Notes
Seaborn	`sns.kdeplot()`	most common, easy to use
Pandas	`.plot(kind="density")`	convenient for quick plots
Matplotlib + SciPy	`gaussian_kde()`	manual control over details
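To illustrate the first and last rows of that comparison without Seaborn, the sketch below computes a density estimate twice for the same made-up sample: once through the Pandas plotting call and once by calling SciPy's gaussian_kde by hand. The sample, the evaluation grid, and the variable names are assumptions chosen for the example; both paths use SciPy's default bandwidth rule.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed inside Jupyter
import numpy as np
import pandas as pd
from scipy.stats import gaussian_kde

# Hypothetical demo data and evaluation grid
sample = pd.Series(np.random.default_rng(3).normal(size=300), name="a")
grid = np.linspace(-4, 4, 200)

# Pandas: one call; ind= fixes the x positions where the density is evaluated
ax = sample.plot.kde(ind=grid)
pandas_curve = ax.lines[0].get_ydata()

# SciPy: build the estimator yourself, then evaluate it on the same grid
manual_curve = gaussian_kde(sample.to_numpy())(grid)

# The two curves should agree, and the density should integrate to about 1
max_diff = float(np.max(np.abs(pandas_curve - manual_curve)))
area = float(manual_curve.sum() * (grid[1] - grid[0]))
print(max_diff, area)
```

The manual route takes more lines but exposes the estimator object, which is useful when you need the density values themselves rather than just the picture.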

The end.