5.1. Pandas Visualization#

# %pip install pandas numpy     ### uncomment to install pandas and numpy if they are missing
# %pip install matplotlib       ### uncomment to install matplotlib if it is missing

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt ### because pandas.plot is still matplotlib
# %matplotlib inline            ### uncomment only if plots do not appear inline in Jupyter;
                                ### recent Jupyter/IPython versions render inline by default
                                ### use plt.show() to display plots in other environments

Pandas offers a simple, high-level interface for creating plots directly from Series and DataFrame objects. It builds on Matplotlib behind the scenes, allowing you to make quick, readable plots with calls like df.plot() or ser.plot(). You can easily produce line, bar/stacked bar, area, scatter, box, histogram, KDE, hexbin, and pie charts while using your index (including datetime) and column labels.

In this section, you will learn:

  1. Plotting: Drawing different plot types with either the df.plot() method and its kind parameter or the direct plotting methods. For example, for histograms:

    1. df.plot.hist(): Uses Pandas’ unified .plot() interface (which wraps Matplotlib) to draw a histogram.

    2. df.plot(kind='hist'): Equivalent to df.plot.hist(); the generic form of the same call.

    3. df.hist() and df.boxplot(): Two plotting methods that live directly on the DataFrame, outside the .plot accessor. df.hist() is a quick wrapper around Matplotlib’s plt.hist() that draws one histogram per numeric column.

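| # | Method | Applies to | Subplots | Overlays multiple columns | Based on | Typical use |
|---|--------|------------|----------|---------------------------|----------|-------------|
| 1 | df.hist() | DataFrame (Series has its own ser.hist()) | Yes (grid of plots) | No | plt.hist() | Quick overview of all numeric columns |
| 2 | df.plot.hist() | DataFrame or Series | Single plot | Yes | Pandas .plot() wrapper | Custom, combined histograms |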

  2. Custom parameters: Most plotting calls accept Matplotlib keyword arguments and return a Matplotlib Axes object for further customization. Frequently used parameters from the pandas API reference (pandas.DataFrame.plot) include:

    1. data: Series or DataFrame. The object for which the method is called.

    2. x: label or position, default None. Only used if data is a DataFrame.

    3. y: label, position, or list of labels/positions, default None. Allows plotting of one column versus another. Only used if data is a DataFrame.

    4. kind: str. The kind of plot to produce: line (default), bar, barh, hist, box, kde/density, area, pie, scatter (DataFrame only), hexbin (DataFrame only).

    5. ax: Matplotlib Axes object, default None. An existing Axes to draw on.

    6. figsize: tuple (width, height) in inches.

    7. use_index: bool, default True. Use the index as ticks for the x axis.

    8. title: str or list. Title to use for the plot.

    9. ticks (xticks and yticks): sequence. Values to use for ticks.

    10. lim (xlim and ylim): 2-tuple/list. Set the limits of the axes.

    11. label (xlabel and ylabel): Name to use for the label on the axis (xlabel defaults to the index name).

    12. colormap: str or Matplotlib colormap object.

    13. stacked: bool, default False in line and bar plots, and True in area plots. If True, create a stacked plot.

    14. **kwargs (Matplotlib): Extra options passed through to the underlying Matplotlib call, for example:

      1. lw: line width

      2. alpha: transparency, from 0 (invisible) to 1 (opaque)

  3. Style sheets: Change the look of all plots globally with plt.style.use(...).

  4. Color & size: For scatter plots you can

    • color points by a column (c),

    • size them by an array (s=df['col']*scale), and

    • use cmap to choose the colormap.

  5. Fine-tuning: For large or dense data, consider hexbin or KDE plots instead of scatter.

  6. Plot types: The pandas.DataFrame.plot API accepts 11 values for the kind parameter: line (default), bar, barh (horizontal bar plot), hist, box, kde, density, area, pie, scatter, and hexbin. Because density is an alias for kde, these correspond to 10 distinct plot types. They can be summarized as:

| # | Method | Shorthand | Description | Best for |
|---|--------|-----------|-------------|----------|
| 1 | .plot(kind='line') | .plot.line() | Line plot | Time series, continuous trends |
| 2 | .plot(kind='bar') | .plot.bar() | Vertical bar chart | Comparing categories side by side |
| 3 | .plot(kind='barh') | .plot.barh() | Horizontal bar chart | Long category names, rankings |
| 4 | .plot(kind='hist') | .plot.hist() | Histogram | Distribution of a single variable |
| 5 | .plot(kind='box') | .plot.box() | Box plot | Spotting outliers, comparing spread |
| 6 | .plot(kind='kde') | .plot.kde() | KDE curve | Smooth distribution, comparing shapes |
| 7 | .plot(kind='density') | .plot.density() | Alias for kde | Same as KDE |
| 8 | .plot(kind='area') | .plot.area() | Area plot | Cumulative totals, part-to-whole over time |
| 9 | .plot(kind='pie') | .plot.pie() | Pie chart | Proportions of a whole (few categories) |
| 10 | .plot(kind='scatter', x='col1', y='col2') | .plot.scatter(x='col1', y='col2') | Scatter plot | Correlation between two numeric variables |
| 11 | .plot(kind='hexbin', x='col1', y='col2') | .plot.hexbin(x='col1', y='col2') | Hexbin plot | Large datasets, 2D density visualization |

Notes:

  • kde and density are aliases: they draw the same plot and count once among the unique types

  • scatter and hexbin require x and y arguments

  • pie on a DataFrame needs either a single column (y=) or subplots=True to draw one pie per column

  • scatter and hexbin are DataFrame only
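
To make the difference between the .plot accessor and df.hist() concrete, here is a minimal sketch. The DataFrame demo and its columns x and y are made up for illustration and are not part of this section's sample data.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical demo data: two numeric columns with different centers
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "x": rng.normal(0, 1, 500),
    "y": rng.normal(2, 1, 500),
})

# 1) Accessor form: a single Axes with both columns overlaid
ax_accessor = demo.plot.hist(bins=30, alpha=0.5)

# 2) Generic form: the same plot requested through the kind parameter
ax_generic = demo.plot(kind="hist", bins=30, alpha=0.5)

# 3) DataFrame.hist(): a grid with one subplot per numeric column
axes_grid = demo.hist(bins=30)

# Outside Jupyter, call plt.show() to display the figures
```

The first two calls should produce identical figures, while the third returns an array of Axes, one per column.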

Note: In Jupyter notebooks, %matplotlib inline is an IPython magic command that renders Matplotlib plots directly in the output cell instead of in a separate pop-up window. In a Jupyter environment with IPython >= 7 you usually don’t need it, because inline rendering is already the default.

Later, when you learn Matplotlib, you will see why these plotting methods are much easier to use. Pandas visualization balances ease of use with control over the figure. Many of the plot calls also accept the arguments of the underlying matplotlib.pyplot function.
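
As a rough illustration of how pandas-level parameters, Matplotlib keyword arguments, and the returned Axes fit together, consider the sketch below. The random-walk DataFrame walk is a hypothetical stand-in, and the xlabel/ylabel parameters assume a reasonably recent pandas release.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical demo data: a random walk indexed by date
rng = np.random.default_rng(1)
idx = pd.date_range("2021-01-01", periods=60, freq="D")
walk = pd.DataFrame({"value": rng.normal(size=60).cumsum()}, index=idx)

# pandas-level parameters (figsize, title, ylabel) plus Matplotlib kwargs (lw, alpha)
ax = walk.plot(
    kind="line",
    figsize=(8, 3),
    title="Random walk",
    ylabel="value",
    lw=2,
    alpha=0.8,
)

# The returned Axes accepts ordinary Matplotlib calls for further tweaks
ax.axhline(0, color="black", linestyle="--", linewidth=1)
ax.set_xlabel("date")
```

Keeping a reference to the returned Axes is what lets you mix the quick pandas call with finer Matplotlib adjustments afterwards.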

5.1.1. Sample Data#

Let’s create some sample DataFrames to demonstrate the various plotting techniques:

np.random.seed(42)

# Create df1 with columns A, B, C for scatter plots and histograms
df1 = pd.DataFrame({
    'A': np.random.randn(1000),
    'B': np.random.randn(1000),
    'C': np.random.rand(1000)
})

# Create df2 with positive values for stacked area plots
dates = pd.date_range('2020-01-01', periods=100, freq='D')
df2 = pd.DataFrame({
    'a': np.random.rand(100),
    'b': np.random.rand(100),
    'c': np.random.rand(100)
}, index=dates)

print("Sample data created:")
print("df1 shape:", df1.shape)
print("df2 shape:", df2.shape)
Sample data created:
df1 shape: (1000, 3)
df2 shape: (100, 3)

Let’s stick with the ggplot style for now.

plt.style.use('ggplot')
# plt.style.use('bmh')
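
If you would rather try a style on a single plot without changing the global setting, Matplotlib also offers a context manager. The sketch below assumes a small made-up DataFrame named small.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Names of the style sheets bundled with the installed Matplotlib
print(sorted(plt.style.available)[:5])

small = pd.DataFrame({"demo": [3, 1, 4, 1, 5]})  # hypothetical data

# Apply a style to one plot only; the previous settings are restored afterwards
with plt.style.context("ggplot"):
    ax_styled = small.plot(title="ggplot inside the context")
```

plt.style.use(...) changes every later plot in the session, whereas plt.style.context(...) limits the change to the with block.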

5.1.2. Plot Types#

Let’s call some of these plot type methods (the kind names listed in the table above, e.g. ‘box’, ‘barh’, etc.) to see how they work.

### remember the dataframe

df2.head()
a b c
2020-01-01 0.302420 0.596911 0.402406
2020-01-02 0.563408 0.999558 0.884587
2020-01-03 0.803805 0.769449 0.895563
2020-01-04 0.137148 0.397866 0.909175
2020-01-05 0.580699 0.827553 0.313878

5.1.2.1. Area Plot#

df2.plot.area(alpha=0.4)
<Axes: >

5.1.2.2. Bar Plots#

Bar plots are one of the most common ways to compare categorical data visually. They represent quantities as rectangular bars whose length (or height) corresponds to the value being measured. In Python, these are typically created using Matplotlib or Seaborn; pandas exposes Matplotlib’s bars through .plot.bar() and .plot.barh().

Both bar() and barh() create bar charts; the difference is simply orientation. Stacked bar charts, on the other hand, show how multiple subcategories contribute to a total within each main category.

| Plot Type | Orientation | Purpose | Best For |
|-----------|-------------|---------|----------|
| Bar Plot | Vertical | Compare category values | Simple category comparisons |
| Barh Plot | Horizontal | Compare category values with long labels | Readability and ranking-type data |
| Stacked Bar | Either | Show part-to-whole relationships | Composition of totals across categories |
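
Bar charts are easiest to read with a handful of categories. The sketch below uses a small hypothetical sales table (not the df2 used in this section) to show the three variants from the table above.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical sales figures: rows are categories, columns are subcategories
sales = pd.DataFrame(
    {"online": [120, 90, 60], "in_store": [80, 110, 40]},
    index=["Books", "Games", "Music"],
)

# Vertical bars, one group per category
ax_bar = sales.plot.bar(rot=0, title="Grouped bars")

# Same data drawn horizontally
ax_barh = sales.plot.barh(title="Horizontal bars")

# Subcategories stacked so each bar's total height is the category total
ax_stack = sales.plot.bar(stacked=True, rot=0, title="Stacked bars")
```

With stacked=True each bar starts where the previous subcategory ended, so the top of the stack shows the category total.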

5.1.2.2.1. Vertical Bar Plot#

### bar plot

df2.plot.bar()
<Axes: >

5.1.2.2.2. Horizontal Bar Plot#

### bar plot horizontal

df2.plot.barh()
<Axes: >
### bar plot stacked

df2.plot.bar(stacked=True)
<Axes: >

5.1.2.3. Histograms#

df1['A'].plot.hist(bins=100)
<Axes: ylabel='Frequency'>
### EXERCISE 1: Histogram Fundamentals
# Goal: Visualize the distribution of `A` in `df1`.
# Requirements:
# 1) Plot a histogram for `df1['A']` with exactly 50 bins.
# 2) Set `figsize=(8, 4)` and `alpha=0.7`.
# 3) Add a title: "Distribution of A".
# Stretch: Add vertical dashed lines for the mean and median of `A`.

# Your code here

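# Solution
ax = df1['A'].plot.hist(bins=50, figsize=(8, 4), alpha=0.7, title='Distribution of A')
# Stretch: dashed vertical lines at the mean and median
ax.axvline(df1['A'].mean(), color='black', linestyle='--')
ax.axvline(df1['A'].median(), color='blue', linestyle='--')
plt.show()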

5.1.2.4. Line Plots#

# df1.plot.line(x=df1.index, y='B',figsize=(12,3),lw=1)

df1.plot.line(y='B',figsize=(10,6),lw=5)
<Axes: >
### EXERCISE 2: Line Plot Styling
# Goal: Show trend behavior for one feature.
# Requirements:
# 1) Plot `df1['B']` as a line plot.
# 2) Use `figsize=(8, 4)` and `lw=2`.
# 3) Set x-label to "Index" and y-label to "B values".
# Stretch: Overlay a rolling mean with window=10 on the same axes.

# Your code here

# Solution
ax = df1.plot.line(y='B', figsize=(8, 4), lw=2, xlabel='Index', ylabel='B values')

5.1.2.5. Scatter Plots#

df1.plot.scatter(x='A',y='B')
<Axes: xlabel='A', ylabel='B'>
### EXERCISE 3: Scatter Relationship
# Goal: Examine the relationship between `A` and `B`.
# Requirements:
# 1) Create a scatter plot with `x='A'` and `y='B'`.
# 2) Set `figsize=(7, 5)`.
# 3) Set title to "A vs B".
# Stretch: Add `alpha=0.6` and compare readability.

# Your code here

# Solution
ax = df1.plot.scatter(x='A', y='B', figsize=(7, 5), title='A vs B')

5.1.3. Color Maps#

### the color of each point corresponds to the values in column C
### try them out

# df1.plot.scatter(x=df1['A'], y=df1['B'], c=df1['C'], cmap='coolwarm')  ### need to use x='A' instead of x=df1['A']
df1.plot.scatter(x='A', y='B', c='C', cmap='coolwarm')       ### color value from 'C'; colormap
df1.plot.scatter(x='A', y='B', c='C', cmap='viridis')      ### another colormap
<Axes: xlabel='A', ylabel='B'>
### EXERCISE 4: Color-Encoded Scatter
# Goal: Add a third variable using color.
# Requirements:
# 1) Plot `A` vs `B` from `df1`.
# 2) Use `c='C'` and a non-default colormap (for example, `viridis`).
# 3) Set title to "A vs B colored by C".
# Stretch: Try two colormaps and decide which communicates better.

# Your code here

5.1.4. Size#

Use s to set the marker size from another column. Passing an array such as df1['C']*10 lets you rescale the values; recent pandas versions also accept a column name, which uses the raw values:

### try different sizes

# df1.plot.scatter(x='A', y='B', s=df1['C']*200)
df1.plot.scatter(x='A',y='B',s=df1['C']*10)
<Axes: xlabel='A', ylabel='B'>
### EXERCISE 5: Size-Encoded Scatter
# Goal: Add a third variable using marker size.
# Requirements:
# 1) Plot `A` vs `B` from `df1`.
# 2) Use point size based on `abs(C) * 60`.
# 3) Set `alpha=0.5` so overlap stays readable.
# Stretch: Compare size encoding to color encoding and note tradeoffs.

# Your code here

5.1.4.1. Box Plots#

df2.plot.box() # Can also pass a by= argument for groupby
<Axes: >
### EXERCISE 6: Box Plot Comparison
# Goal: Compare spread and outliers across `df2` columns.
# Requirements:
# 1) Create a box plot for all numeric columns in `df2`.
# 2) Use `figsize=(8, 4)`.
# 3) Set title to "Box Plot of df2".
# Stretch: Rotate x tick labels by 30 degrees for readability.

# Your code here

# Solution
ax = df2.plot.box(figsize=(8, 4), title='Box Plot of df2')

5.1.4.2. Hexagonal Bin Plot#

Useful for bivariate data; an alternative to a scatter plot when many points overlap:

np.random.seed(42)
df = pd.DataFrame(np.random.randn(1000, 2), columns=['a', 'b'])
df.plot.hexbin(x='a', y='b', gridsize=25, cmap='Oranges')
<Axes: xlabel='a', ylabel='b'>
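
To see why a hexbin can be easier to read than a scatter plot on dense data, you can draw both side by side with the ax parameter. This is only a sketch: the 20,000-point DataFrame dense is generated here for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical dense data: 20,000 correlated points
rng = np.random.default_rng(2)
a = rng.normal(size=20_000)
dense = pd.DataFrame({"a": a, "b": 0.6 * a + rng.normal(scale=0.8, size=20_000)})

# Two Axes in one figure; the ax parameter tells pandas where to draw
fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(10, 4))
dense.plot.scatter(x="a", y="b", s=2, alpha=0.3, ax=ax_left, title="scatter")
dense.plot.hexbin(x="a", y="b", gridsize=30, cmap="viridis", ax=ax_right, title="hexbin")
```

In the scatter panel the center tends to merge into a single blob, while the hexbin colors show how many points fall in each cell.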
### EXERCISE 7: Hexbin for Dense Data
# Goal: Visualize dense bivariate data without overplotting.
# Requirements:
# 1) Use the existing `df` with columns `a` and `b`.
# 2) Create a hexbin plot with `gridsize=25`.
# 3) Use a visible colormap (for example, `cmap='Oranges'`).
# Stretch: Change `gridsize` to 15 and 40 and compare detail vs smoothness.

# Your code here

# Solution
df.plot.hexbin(x='a', y='b', gridsize=25, cmap='Oranges')
<Axes: xlabel='a', ylabel='b'>

5.1.4.3. Kernel Density Estimation (KDE) Plot#

  • A KDE plot is a smoothed version of a histogram. Pandas computes it with SciPy, so scipy must be installed.

# %pip install scipy    ### uncomment to install scipy if it is missing

df2['a'].plot.kde()
<Axes: ylabel='Density'>
### EXERCISE 8: KDE for a Single Variable
# Goal: Plot a smooth estimate of one distribution.
# Requirements:
# 1) Plot KDE for `df2['a']`.
# 2) Set `figsize=(8, 4)`.
# 3) Add title "KDE of column a".
# Stretch: Overlay histogram + KDE in one figure for `df2['a']`.

# Your code here

# Solution
ax = df2['a'].plot.kde(figsize=(8, 4), title='KDE of column a')

5.1.4.4. Density/KDE Plot#

df2.plot.density()
<Axes: ylabel='Density'>
### EXERCISE 9: Multi-Column Density Plot
# Goal: Compare distributions of all `df2` columns on one chart.
# Requirements:
# 1) Create a density plot for `df2`.
# 2) Set `figsize=(8, 4)`.
# 3) Add title "Density Curves for df2".
# Stretch: Keep only two columns and compare how interpretation changes.

# Your code here

# Solution
ax = df2.plot.density(figsize=(8, 4), title='Density Curves for df2')

Using the density plot as an example, here is how Pandas visualization compares with Matplotlib and Seaborn:

| Library | Function | Notes |
|---------|----------|-------|
| Seaborn | sns.kdeplot() | most common, easy to use |
| Pandas | .plot(kind="density") | convenient for quick plots |
| Matplotlib + SciPy | gaussian_kde() | manual control over details |
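
To give a feel for the "manual control" row, the sketch below draws the same density twice: once with the pandas one-liner and once by calling SciPy's gaussian_kde and plotting the result with Matplotlib. The Series values is a hypothetical stand-in for a column such as df2['a'], and the evaluation grid is an arbitrary choice.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

rng = np.random.default_rng(3)
values = pd.Series(rng.random(100))  # hypothetical stand-in for df2['a']

fig, ax = plt.subplots(figsize=(8, 4))

# pandas: one call, drawn on the Axes we created
values.rename("pandas density").plot(kind="density", ax=ax)

# Matplotlib + SciPy: estimate the density, evaluate it on a grid, then draw the curve
kde = gaussian_kde(values.to_numpy())
grid = np.linspace(values.min() - 0.5, values.max() + 0.5, 200)
ax.plot(grid, kde(grid), linestyle="--", label="gaussian_kde + ax.plot")
ax.legend()
```

With default settings both routes should trace the same curve; the manual version simply exposes the bandwidth, grid, and drawing steps that pandas handles for you.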

The end.