5.1. Pandas Visualization#
# %pip install pandas numpy ### uncomment to install pandas and numpy if they are missing
# %pip install matplotlib ### uncomment to install matplotlib if it is missing
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt ### because pandas.plot is still matplotlib
# %matplotlib inline ### uncomment in Jupyter only if plots do not render inline (older IPython versions)
### outside Jupyter, call plt.show() to display plots
Pandas offers a simple, high-level interface for creating plots directly from Series and DataFrame objects. It builds on Matplotlib behind the scenes, allowing you to make quick, readable plots with calls like df.plot() or ser.plot(). You can easily produce line, bar/stacked bar, area, scatter, box, histogram, KDE, hexbin, and pie charts while using your index (including datetime) and column labels.
In this section, you will learn:
Plotting: Draw different plot types using either the `df.plot()` method with the `kind` parameter or the direct plotting methods. For example, for histograms:
- `df.plot.hist()`: Uses Pandas’ unified `.plot()` interface (which wraps Matplotlib) to draw a histogram.
- `df.plot(kind='hist')`: The same as `df.plot.hist()`; a generic form of the same call.

Pandas also has two top-level plotting methods: `df.hist()` (a quick wrapper around Matplotlib’s `plt.hist()` for all numeric columns) and `df.boxplot()`.
| # | Method | Applies to | Subplots | Overlays Multiple Columns | Based On | Typical Use |
|---|---|---|---|---|---|---|
| 1 | `df.hist()` | DataFrame only | Yes (grid of plots) | No | Matplotlib | Quick overview of all numeric columns |
| 2 | `df.plot.hist()` | DataFrame or Series | Single plot | Yes | Pandas | Custom, combined histograms |
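The difference between the two histogram routes can be sketched as follows. This is a minimal illustration that assumes a small made-up DataFrame named `demo` (not part of the section's sample data); it shows that the unified interface overlays columns on one Axes, while the top-level method returns a grid of Axes.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical demo data (an assumption for illustration): two numeric columns
rng = np.random.default_rng(0)
demo = pd.DataFrame({"A": rng.normal(size=500), "B": rng.normal(loc=2, size=500)})

# Unified interface: both calls overlay every column on a single Axes
ax_accessor = demo.plot.hist(bins=30, alpha=0.5)
ax_kind = demo.plot(kind="hist", bins=30, alpha=0.5)

# Top-level method: one subplot per numeric column, returned as an array of Axes
axes_grid = demo.hist(bins=30)

print(type(ax_accessor).__name__, axes_grid.shape)
```

Because `demo.hist()` returns an array rather than a single Axes, any further customization has to index into that array first (for example `axes_grid.flat[0]`).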
Custom parameters: Most plotting calls accept Matplotlib keyword arguments and return a Matplotlib Axes object for further customization. Frequently used parameters from the pandas API reference (`pandas.DataFrame.plot`) include:
- `data`: Series or DataFrame. The object for which the method is called.
- `x`: label or position, default None. Only used if data is a DataFrame.
- `y`: label, position or list of label, positions, default None. Allows plotting of one column versus another. Only used if data is a DataFrame.
- `kind`: str. The kind of plot to produce: line (default), bar, barh, hist, box, kde/density, area, pie, scatter (DataFrame only), hexbin (DataFrame only).
- `ax`
- `figsize`
- `use_index`
- `title`
- ticks (`xticks` and `yticks`): sequence. Values to use for ticks.
- lim (`xlim` and `ylim`): 2-tuple/list. Set the limits of the axes.
- label (`xlabel` and `ylabel`): Name to use for the label on the axis (defaults to the index name).
- `colormap`: str or matplotlib colormap object.
- `stacked`: bool, default False in line and bar plots, and True in area plot. If True, create stacked plot.
- `**kwargs` (matplotlib): `lw`, `alpha`

Style sheets: Style all plots globally with `plt.style.use(...)`.

Color & size: For scatter plots you can color by a column (`c`), size by an array (`s=df['col']*scale`), and use `cmap` for colormaps.

Fine-tuning: For large or dense data, consider hexbin or KDE plots instead of scatter.
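As a rough sketch of how these parameters combine, the example below passes pandas-level options and Matplotlib keyword arguments in one call and then customizes the returned Axes. The DataFrame `walk` and its `value` column are made up for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical demo data (an assumption for illustration): a random walk indexed by date
rng = np.random.default_rng(1)
idx = pd.date_range("2020-01-01", periods=60, freq="D")
walk = pd.DataFrame({"value": rng.normal(size=60).cumsum()}, index=idx)

# pandas-level parameters (figsize, title, xlabel, ylabel) plus Matplotlib kwargs (lw, alpha)
ax = walk.plot(
    kind="line",
    figsize=(8, 4),
    title="Random walk",
    xlabel="Date",
    ylabel="Value",
    lw=2,
    alpha=0.8,
)

# The returned Axes accepts further Matplotlib customization
ax.axhline(0, color="black", linestyle="--", lw=1, label="zero")
ax.legend()
```

Keeping a reference to the returned `ax` is what makes the later Matplotlib calls possible; without it you would have to fall back on `plt.gca()`.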
Plot Types: The `pandas.DataFrame.plot` API lists 11 values for the `kind` parameter: `line` (default), `bar`, `barh` (horizontal bar plot), `hist`, `box`, `kde`, `density` (an alias for `kde`), `area`, `pie`, `scatter`, and `hexbin`. These plots can be summarized as:
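| # | Method | Shorthand | Description | Best For |
|---|---|---|---|---|
| 1 | `df.plot(kind='line')` | `df.plot.line()` | Line plot | Time series, continuous trends |
| 2 | `df.plot(kind='bar')` | `df.plot.bar()` | Vertical bar chart | Comparing categories side by side |
| 3 | `df.plot(kind='barh')` | `df.plot.barh()` | Horizontal bar chart | Long category names, rankings |
| 4 | `df.plot(kind='hist')` | `df.plot.hist()` | Histogram | Distribution of a single variable |
| 5 | `df.plot(kind='box')` | `df.plot.box()` | Box plot | Spotting outliers, comparing spread |
| 6 | `df.plot(kind='kde')` | `df.plot.kde()` | KDE curve | Smooth distribution, comparing shapes |
| 7 | `df.plot(kind='density')` | `df.plot.density()` | Alias for kde | Same as KDE |
| 8 | `df.plot(kind='area')` | `df.plot.area()` | Area plot | Cumulative totals, part-to-whole over time |
| 9 | `df.plot(kind='pie')` | `df.plot.pie()` | Pie chart | Proportions of a whole (few categories) |
| 10 | `df.plot(kind='scatter')` | `df.plot.scatter()` | Scatter plot | Correlation between two numeric variables |
| 11 | `df.plot(kind='hexbin')` | `df.plot.hexbin()` | Hexbin plot | Large datasets, 2D density visualization |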
Notes:
- `kde` and `density` are aliases: the same plot, counted once for unique types
- `scatter` and `hexbin` require `x` and `y` arguments
- `pie` requires a single column (`y=` or per-column)
- `scatter` and `hexbin` are DataFrame only
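The argument requirements in these notes can be illustrated with a short sketch. The `sales` DataFrame and its columns are hypothetical and exist only for this example.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical demo data (an assumption for illustration)
sales = pd.DataFrame(
    {"units": [30, 45, 25], "price": [2.5, 1.8, 3.2]},
    index=["north", "south", "west"],
)

# pie needs a single column: pass y= on a DataFrame (or call it on a Series)
ax_pie = sales.plot.pie(y="units", autopct="%1.0f%%", legend=False)

# scatter needs both x and y, given as column names of the DataFrame
ax_scatter = sales.plot.scatter(x="price", y="units")
```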
Note: In Jupyter Notebooks, `%matplotlib inline` is an IPython magic command that renders Matplotlib plots directly inside the notebook output cell instead of in a separate pop-up window. You don’t need `%matplotlib inline` in a Jupyter environment with IPython >= 7.
Later, when you learn Matplotlib, you will see why these plotting methods are a lot easier to use. Pandas visualization balances ease of use with control over the figure. Many of the plot calls also accept the additional arguments of the underlying `matplotlib.pyplot` call.
5.1.1. Sample Data#
Let’s create some sample DataFrames to demonstrate the various plotting techniques:
np.random.seed(42)
# Create df1 with columns A, B, C for scatter plots and histograms
df1 = pd.DataFrame({
'A': np.random.randn(1000),
'B': np.random.randn(1000),
'C': np.random.rand(1000)
})
# Create df2 with positive values for stacked area plots
dates = pd.date_range('2020-01-01', periods=100, freq='D')
df2 = pd.DataFrame({
'a': np.random.rand(100),
'b': np.random.rand(100),
'c': np.random.rand(100)
}, index=dates)
print("Sample data created:")
print("df1 shape:", df1.shape)
print("df2 shape:", df2.shape)
Sample data created:
df1 shape: (1000, 3)
df2 shape: (100, 3)
Let’s stick with the ggplot style for now.
plt.style.use('ggplot')
# plt.style.use('bmh')
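If you want to try a style without changing the global setting, one option is Matplotlib's style context manager. The sketch below lists a few available style names and applies `bmh` to a single plot; the Series `s` is made up for illustration.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Names of a few of the style sheets shipped with Matplotlib
print(plt.style.available[:5])

s = pd.Series([1, 3, 2, 4])  # hypothetical demo data

# style.context applies the style only to plots created inside the with-block
with plt.style.context("bmh"):
    ax = s.plot(title="bmh inside the context manager")
```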
5.1.2. Plot Types#
Let’s call some of these 11 plot methods (the `kind` values shown in the list above, e.g. 'box', 'barh', etc.) to see how they work.
### remember the dataframe
df2.head()
| | a | b | c |
|---|---|---|---|
| 2020-01-01 | 0.302420 | 0.596911 | 0.402406 |
| 2020-01-02 | 0.563408 | 0.999558 | 0.884587 |
| 2020-01-03 | 0.803805 | 0.769449 | 0.895563 |
| 2020-01-04 | 0.137148 | 0.397866 | 0.909175 |
| 2020-01-05 | 0.580699 | 0.827553 | 0.313878 |
5.1.2.1. Area Plot#
df2.plot.area(alpha=0.4)
<Axes: >
5.1.2.2. Bar Plots#
Bar plots are one of the most common ways to compare categorical data visually. They represent quantities as rectangular bars whose length (or height) corresponds to the value being measured. In Python, these are typically created using Matplotlib or Seaborn.
Both bar() and barh() create bar charts; the difference is simply orientation. Stacked bar charts, on the other hand, show how multiple subcategories contribute to a total within each main category.
| Plot Type | Orientation | Purpose | Best For |
|---|---|---|---|
| Bar Plot | Vertical | Compare category values | Simple category comparisons |
| Barh Plot | Horizontal | Compare category values with long labels | Readability and ranking-type data |
| Stacked Bar | Either | Show part-to-whole relationships | Composition of totals across categories |
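To see the three variants side by side, here is a small sketch on categorical data. The `totals` DataFrame (four regions, three columns) is an assumption made for this example and is separate from the section's sample data.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical demo data (an assumption for illustration): three columns across four regions
regions = ["North", "South", "East", "West"]
rng = np.random.default_rng(2)
totals = pd.DataFrame(rng.integers(10, 50, size=(4, 3)), index=regions, columns=["a", "b", "c"])

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
totals.plot.bar(ax=axes[0], title="bar")                    # vertical bars, grouped by region
totals.plot.barh(ax=axes[1], title="barh")                  # same values, horizontal bars
totals.plot.bar(ax=axes[2], stacked=True, title="stacked")  # segments add up to each region's total
fig.tight_layout()
```

With only a handful of categories the bars stay readable; with many rows, aggregating before plotting usually gives a clearer chart.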
5.1.2.2.1. Vertical Bar Plot#
### bar plot
df2.plot.bar()
<Axes: >
5.1.2.2.2. Horizontal Bar Plot#
### bar plot horizontal
df2.plot.barh()
<Axes: >
### bar plot stacked
df2.plot.bar(stacked=True)
<Axes: >
5.1.2.3. Histograms#
df1['A'].plot.hist(bins=100)
<Axes: ylabel='Frequency'>
### EXERCISE 1: Histogram Fundamentals
# Goal: Visualize the distribution of `A` in `df1`.
# Requirements:
# 1) Plot a histogram for `df1['A']` with exactly 50 bins.
# 2) Set `figsize=(8, 4)` and `alpha=0.7`.
# 3) Add a title: "Distribution of A".
# Stretch: Add vertical dashed lines for the mean and median of `A`.
# Your code here
5.1.2.4. Line Plots#
# df1.plot.line(x=df1.index, y='B',figsize=(12,3),lw=1)
df1.plot.line(y='B',figsize=(10,6),lw=5)
<Axes: >
### EXERCISE 2: Line Plot Styling
# Goal: Show trend behavior for one feature.
# Requirements:
# 1) Plot `df1['B']` as a line plot.
# 2) Use `figsize=(8, 4)` and `lw=2`.
# 3) Set x-label to "Index" and y-label to "B values".
# Stretch: Overlay a rolling mean with window=10 on the same axes.
# Your code here
<Axes: >
5.1.2.5. Scatter Plots#
df1.plot.scatter(x='A',y='B')
<Axes: xlabel='A', ylabel='B'>
### EXERCISE 3: Scatter Relationship
# Goal: Examine the relationship between `A` and `B`.
# Requirements:
# 1) Create a scatter plot with `x='A'` and `y='B'`.
# 2) Set `figsize=(7, 5)`.
# 3) Set title to "A vs B".
# Stretch: Add `alpha=0.6` and compare readability.
# Your code here
<Axes: xlabel='A', ylabel='B'>
5.1.3. Color Maps#
You can use c to color points based on the values of another column.
Use cmap to choose which colormap to apply.
For all the colormaps, check out: http://matplotlib.org/users/colormaps.html
### the color of each point corresponds to the values in column C
### try them out
# df1.plot.scatter(x=df1['A'], y=df1['B'], c=df1['C'], cmap='coolwarm') ### need to use x='A' instead of x=df1['A']
df1.plot.scatter(x='A', y='B', c='C', cmap='coolwarm') ### color value from 'C'; colormap
df1.plot.scatter(x='A', y='B', c='C', cmap='viridis') ### another colormap
<Axes: xlabel='A', ylabel='B'>
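A quick way to explore the options is to list the registered colormap names from code. The sketch below also uses the `_r` suffix, which selects the reversed version of a colormap; the `pts` DataFrame is made up for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical demo data (an assumption for illustration)
rng = np.random.default_rng(3)
pts = pd.DataFrame({"A": rng.normal(size=200), "B": rng.normal(size=200), "C": rng.random(200)})

# List of registered colormap names
names = plt.colormaps()
print(len(names), "viridis" in names)

# Appending "_r" to a colormap name selects its reversed version
ax = pts.plot.scatter(x="A", y="B", c="C", cmap="viridis_r")
```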
### EXERCISE 4: Color-Encoded Scatter
# Goal: Add a third variable using color.
# Requirements:
# 1) Plot `A` vs `B` from `df1`.
# 2) Use `c='C'` and a non-default colormap (for example, `viridis`).
# 3) Set title to "A vs B colored by C".
# Stretch: Try two colormaps and decide which communicates better.
# Your code here
5.1.4. Size#
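Or use s to set the marker size based on another column. Pass an array such as df1['C']*10 rather than just a column name, so that the values can be scaled to visible marker sizes (recent pandas versions also accept a column name, but then the raw values are used as sizes):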
### try different sizes
# df1.plot.scatter(x='A', y='B', s=df1['C']*200)
df1.plot.scatter(x='A',y='B',s=df1['C']*10)
<Axes: xlabel='A', ylabel='B'>
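Color and size can also be combined in one call so that a single scatter plot encodes two extra values. This is a minimal sketch on made-up data; the column names and the scale factor 80 are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical demo data (an assumption for illustration)
rng = np.random.default_rng(4)
pts = pd.DataFrame({"A": rng.normal(size=200), "B": rng.normal(size=200), "C": rng.random(200)})

# Scale the 0-1 values of C up to visible marker areas (in points squared)
sizes = pts["C"] * 80

# c picks the color column by name, s receives the scaled array
ax = pts.plot.scatter(x="A", y="B", c="C", cmap="coolwarm", s=sizes, alpha=0.6)
```

Encoding the same column in both color and size is redundant on purpose here; in practice you would usually map two different columns.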
### EXERCISE 5: Size-Encoded Scatter
# Goal: Add a third variable using marker size.
# Requirements:
# 1) Plot `A` vs `B` from `df1`.
# 2) Use point size based on `abs(C) * 60`.
# 3) Set `alpha=0.5` so overlap stays readable.
# Stretch: Compare size encoding to color encoding and note tradeoffs.
# Your code here
5.1.4.1. Box Plots#
df2.plot.box() # Can also pass a by= argument for groupby
<Axes: >
### EXERCISE 6: Box Plot Comparison
# Goal: Compare spread and outliers across `df2` columns.
# Requirements:
# 1) Create a box plot for all numeric columns in `df2`.
# 2) Use `figsize=(8, 4)`.
# 3) Set title to "Box Plot of df2".
# Stretch: Rotate x tick labels by 30 degrees for readability.
# Your code here
<Axes: >
5.1.4.2. Hexagonal Bin Plot#
Hexagonal bin plots are useful for bivariate data as an alternative to a scatter plot when there are many overlapping points:
np.random.seed(42)
df = pd.DataFrame(np.random.randn(1000, 2), columns=['a', 'b'])
df.plot.hexbin(x='a', y='b', gridsize=25, cmap='Oranges')
<Axes: xlabel='a', ylabel='b'>
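By default the color of each hexagon encodes how many points fall inside it. As a sketch of one variation, `C` and `reduce_C_function` let the color encode an aggregate of a third column instead. The `dense` DataFrame and its `z` column are assumptions made for this example.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical demo data (an assumption for illustration)
rng = np.random.default_rng(5)
dense = pd.DataFrame(rng.normal(size=(2000, 2)), columns=["a", "b"])
dense["z"] = dense["a"] ** 2 + dense["b"] ** 2  # made-up third variable

# Default: color encodes the number of points in each hexagon
ax_count = dense.plot.hexbin(x="a", y="b", gridsize=20, cmap="Oranges")

# With C and reduce_C_function: color encodes the mean of z within each hexagon
ax_mean = dense.plot.hexbin(
    x="a", y="b", C="z", reduce_C_function=np.mean, gridsize=20, cmap="Oranges"
)
```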
### EXERCISE 7: Hexbin for Dense Data
# Goal: Visualize dense bivariate data without overplotting.
# Requirements:
# 1) Use the existing `df` with columns `a` and `b`.
# 2) Create a hexbin plot with `gridsize=25`.
# 3) Use a visible colormap (for example, `cmap='Oranges'`).
# Stretch: Change `gridsize` to 15 and 40 and compare detail vs smoothness.
# Your code here
<Axes: xlabel='a', ylabel='b'>
5.1.4.3. Kernel Density Estimation (KDE) Plot#
A KDE plot is a smoothed version of a histogram.
%pip install scipy ### KDE plots require SciPy; comment this line out once it is installed
df2['a'].plot.kde()
<Axes: ylabel='Density'>
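How smooth the curve looks depends on the bandwidth. The sketch below, on made-up data, compares two values of the `bw_method` parameter of `plot.kde()`; the specific values 0.2 and 1.0 are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical demo data (an assumption for illustration): 100 values uniform on [0, 1)
rng = np.random.default_rng(6)
values = pd.Series(rng.random(100), name="a")

# A smaller bandwidth follows the data more closely; a larger one gives a smoother curve
ax = values.plot.kde(bw_method=0.2, label="bw_method=0.2")
values.plot.kde(ax=ax, bw_method=1.0, label="bw_method=1.0")
ax.legend()
```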
### EXERCISE 8: KDE for a Single Variable
# Goal: Plot a smooth estimate of one distribution.
# Requirements:
# 1) Plot KDE for `df2['a']`.
# 2) Set `figsize=(8, 4)`.
# 3) Add title "KDE of column a".
# Stretch: Overlay histogram + KDE in one figure for `df2['a']`.
# Your code here
<Axes: ylabel='Density'>
5.1.4.4. Density/KDE Plot#
df2.plot.density()
<Axes: ylabel='Density'>
### EXERCISE 9: Multi-Column Density Plot
# Goal: Compare distributions of all `df2` columns on one chart.
# Requirements:
# 1) Create a density plot for `df2`.
# 2) Set `figsize=(8, 4)`.
# 3) Add title "Density Curves for df2".
# Stretch: Keep only two columns and compare how interpretation changes.
# Your code here
<Axes: ylabel='Density'>
Using the density plot as an example, here is how Pandas visualization compares with Matplotlib and Seaborn:
| Library | Function | Notes |
|---|---|---|
| Seaborn | `sns.kdeplot()` | most common, easy to use |
| Pandas | `df.plot.density()` | convenient for quick plots |
| Matplotlib + SciPy | `scipy.stats.gaussian_kde` + `plt.plot()` | manual control over details |
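The Pandas and Matplotlib + SciPy routes can be compared directly. The following is a minimal sketch on made-up data, not a definitive recipe: the pandas call does everything in one step, while the manual route estimates the density with `scipy.stats.gaussian_kde`, evaluates it on a grid, and draws the result. Seaborn is left out to avoid an extra dependency.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

# Hypothetical demo data (an assumption for illustration)
rng = np.random.default_rng(7)
sample = pd.Series(rng.normal(size=300), name="x")

# pandas: a single call estimates and draws the density
ax = sample.plot.density(label="pandas", lw=3, alpha=0.5)

# Matplotlib + SciPy: estimate the density, evaluate it on a grid, then draw it
kde = gaussian_kde(sample.to_numpy())
grid = np.linspace(sample.min() - 1, sample.max() + 1, 200)
ax.plot(grid, kde(grid), linestyle="--", label="gaussian_kde + ax.plot")
ax.legend()
```

The manual route takes more lines but exposes each step, for example the evaluation grid, which the pandas call chooses for you.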
The end.