5. Visualization
Dr. Andrew Abela created this chart chooser visualization in 2013. The four-dimension chart categorization offers great insight when choosing visualizations.
Fig. 5.1 Andrew Abela Chart Chooser
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Overview
Data visualization is the art and science of transforming raw numbers into compelling visual stories. While spreadsheets and statistical summaries provide precise information, visualizations reveal patterns, relationships, and insights that would remain hidden in tables of data.
Imagine trying to understand weather patterns from thousands of temperature readings versus seeing them plotted as a line graph over time—the visual immediately shows trends, seasonal cycles, and anomalies that numbers alone cannot convey.
Each tool has its strengths: Python excels in programmatic control and integration with data analysis workflows, R offers strong statistical visualization, and Tableau and Power BI provide user-friendly interfaces for business users.
Why Visualization is Essential
Data visualization transforms numbers into understanding, serving three fundamental purposes:
Exploration and Discovery: Visualization reveals patterns, outliers, and relationships invisible in raw data, guiding initial analysis and data cleaning decisions.
Communication and Persuasion: Well-crafted visuals convey complex findings to diverse audiences, translating technical results into accessible insights.
Decision Support and Action: Visualization provides clarity for confident decision-making through dashboards, trend analysis, and comparative displays.
Ultimately, visualization bridges the gap between data analysis and actionable insight.
Effective data visualization follows four key principles:
Context is Key: Design for your specific audience and their decision-making needs.
Keep It Simple: Use clear labels and remove unnecessary elements that don’t add value.
Choose the Right Chart Type: Match chart types to data structure:
bars for categories,
lines for trends,
scatter plots for relationships.
Tell a Story: Structure visualizations to guide the audience through a logical narrative flow.
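The chart-type pairing in the principles above can be sketched with three small plots. This is a minimal illustration, not a recipe: the data (regional sales, a monthly index, ad spend versus revenue) and all variable names are hypothetical and invented here.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical data, invented purely for illustration
rng = np.random.default_rng(0)
sales = pd.Series([120, 95, 140], index=["North", "South", "West"])  # categories
monthly = pd.Series(
    100 + rng.normal(0, 5, 12).cumsum(),
    index=pd.date_range("2024-01-01", periods=12, freq="MS"),
)  # values ordered in time
ads = pd.DataFrame({"spend": rng.uniform(0, 10, 50)})
ads["revenue"] = 3 * ads["spend"] + rng.normal(0, 2, 50)  # two related numeric variables

fig, (ax_bar, ax_line, ax_scatter) = plt.subplots(1, 3, figsize=(12, 3.5))
sales.plot.bar(ax=ax_bar, title="Bars: categories")
monthly.plot.line(ax=ax_line, title="Lines: trends")
ads.plot.scatter(x="spend", y="revenue", ax=ax_scatter, title="Scatter: relationships")
fig.tight_layout()
```

Each panel uses the chart type that fits the structure of its data: discrete groups, an ordered time axis, and a pair of numeric variables.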
Common Tools and Libraries
The data visualization landscape offers tools ranging from point-and-click platforms to code-based libraries. Understanding this ecosystem helps you choose the right tool for your needs.
Business Intelligence Platforms:
Tableau: Industry-leading dashboard creation with drag-and-drop interface
Power BI: Microsoft’s analytics platform with strong Excel and cloud integration
Python Visualization Libraries:
Matplotlib: The foundational library providing complete control over plot elements
Seaborn: Statistical graphics with simplified syntax, built on Matplotlib (which it requires as a dependency). This higher-level library is designed for creating visually appealing, informative statistical graphics with less code, and it is excellent for exploring relationships between variables.
Plotly: Interactive, web-ready visualizations including 3D and geographic plots
Pandas: Quick plotting directly from DataFrames for exploratory analysis
Why Focus on Python?
While business tools excel at dashboard creation and user-friendly interfaces, Python libraries offer several advantages for data scientists:
Integration: Seamless workflow from data analysis to visualization
Reproducibility: Code-based plots can be version controlled and automated
Customization: Complete control over every visual element
Cost: Open-source tools reduce licensing expenses
This chapter focuses on Pandas visualization—the fastest way to create exploratory plots directly from your data.
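As a rough sketch of the integration and customization points above: a Pandas plotting call returns a Matplotlib Axes object, so a quick exploratory plot can be refined with ordinary Matplotlib methods. The temperature data and column name below are hypothetical.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical daily measurements, invented for illustration
rng = np.random.default_rng(1)
df = pd.DataFrame(
    {"temperature": 20 + rng.normal(0, 1, 30).cumsum()},
    index=pd.date_range("2024-06-01", periods=30, freq="D"),
)

# One Pandas call draws the plot and returns a Matplotlib Axes
ax = df.plot(y="temperature", legend=False)

# The returned Axes accepts the usual Matplotlib customizations
ax.set_title("Daily temperature (synthetic)")
ax.set_ylabel("°C")
```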
Which Plot to Use?
It is essential to know which plot to use when you need to visualize data. Matplotlib (plt/ax), Pandas (df.plot), and Seaborn (sns) each provide a set of functions and methods for data visualization. You will learn about them later; for now, pay attention to what each plot is used for and keep this table as a reference.
| Plot Type | Pandas (df.plot) | Matplotlib (plt/ax) | Seaborn (sns) | Uses |
|---|---|---|---|---|
| Line | | | | Time series, trends |
| Scatter | | | | Correlations, relationships |
| Bar (vert.) | | | | Category comparisons |
| Bar (horiz.) | | | | Long category labels |
| Histogram | | | | Distribution of numeric data |
| Box plot | | | | Distribution + outliers |
| KDE | | — | | Smoothed distribution |
| ECDF | — | — | | Cumulative distribution |
| Area | | | — | Cumulative trends |
| Stacked area | | | — | Composition over time |
| Pie | | | — | Part-to-whole (use sparingly) |
| Hexbin | | | | Dense scatter (binning) |
| Violin | — | | | Distribution shape by category |
| Heatmap | — | | | Correlation matrices, grids |
| Count plot | — | — | | Frequency of categories |
| Pair plot | — | — | | Multivariate relationships |
| Joint plot | — | — | | Bivariate + marginals |
| Regression | — | — | | Linear relationships + CI |
| Contour | — | | | Continuous 2D fields/level sets |
Seaborn Datasets
Seaborn comes with several datasets that are commonly used for learning data science and machine learning.
| Name | Rows × Cols | What it’s about | Common uses |
|---|---|---|---|
| tips | 244 × 7 | Restaurant bills & tips | categorical plots, regression, grouping |
| | 344 × 7 | Palmer penguins (species & measurements) | scatter, hue, KDE, classification demos |
| iris | 150 × 5 | Iris flower measurements | pairplots, clustering, basics |
| | 53,940 × 10 | Diamond prices & attributes | regression, categorical + numeric |
| | 144 × 3 | Monthly air passengers (’49–’60) | heatmaps, time series |
| titanic | 891 × 15 | Titanic passengers | categorical analysis, missing data |
| planets | 1,035 × 6 | Exoplanet discoveries | distributions, facet grids |
| | 1,064 × 4 | fMRI signal over time | lineplots with CIs |
| | 90 × 4 | Exercise & pulse | catplots, faceting |
| | 44 × 3 | Anscombe’s quartet | scatter + regression; cautionary stats |
To load the datasets, use the load_dataset function with syntax:
[name] = sns.load_dataset("[dataset]")
Here we take a quick look at the most popular ones.
Iris
The Iris dataset, introduced by Ronald A. Fisher in 1936, is one of the most well-known datasets in statistics and machine learning. It is often used for testing classification and visualization techniques. It contains measurements of three species of iris flowers: Setosa, Versicolor, and Virginica. The features of the iris dataset include:
| Feature | Description | Units |
|---|---|---|
| sepal_length | Length of the outer part of the flower | cm |
| sepal_width | Width of the outer part | cm |
| petal_length | Length of the inner petal | cm |
| petal_width | Width of the inner petal | cm |
| species | Type of iris flower (setosa, versicolor, virginica) | categorical |
# Load the dataset
iris = sns.load_dataset("iris")
iris.head(3)
| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
Tips
The tips dataset records restaurant bills and gratuities along with simple demographics, making it useful for practicing categorical comparisons and relationships between numeric variables.
The tips dataset has features such as:
| Feature | Description | Type |
|---|---|---|
| total_bill | Total bill amount (including tax) | float |
| tip | Tip amount given by the customer | float |
| sex | Sex of the customer paying the bill (Male, Female) | categorical |
| smoker | Whether the party included smokers (Yes, No) | categorical |
| day | Day of the week (Thur, Fri, Sat, Sun) | categorical |
| time | Meal type (Lunch, Dinner) | categorical |
| size | Number of people in the dining party | integer |
tips = sns.load_dataset('tips')
tips.head(3)
| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
Titanic
The titanic dataset includes passenger demographics, ticket class, fares, and survival outcomes from the 1912 disaster, making it a classic dataset for classification and categorical analysis.
titanic = sns.load_dataset('titanic')
titanic.head(3)
| | survived | pclass | sex | age | sibsp | parch | fare | embarked | class | who | adult_male | deck | embark_town | alive | alone |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S | Third | man | True | NaN | Southampton | no | False |
| 1 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C | First | woman | False | C | Cherbourg | yes | False |
| 2 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S | Third | woman | False | NaN | Southampton | yes | True |
Planets
The planets dataset contains exoplanet discoveries, including discovery method, orbital period, and mass, and is useful for distributions and time-based summaries.
planets = sns.load_dataset('planets')
planets.head(3)
| | method | number | orbital_period | mass | distance | year |
|---|---|---|---|---|---|---|
| 0 | Radial Velocity | 1 | 269.300 | 7.10 | 77.40 | 2006 |
| 1 | Radial Velocity | 1 | 874.774 | 2.21 | 56.95 | 2008 |
| 2 | Radial Velocity | 1 | 763.000 | 2.60 | 19.84 | 2011 |
# Ensure the Titanic dataset is available for preview examples
if "sns" not in globals():
    import seaborn as sns
if "titanic" not in globals():
    titanic = sns.load_dataset("titanic")
Previewing Data
After you load a new dataset, always use the following methods and property to explore the data:
head() (what the dataset looks like),
describe() (descriptive statistics), and
shape (dimension information; alternatively, just evaluate the DataFrame).
Here is a preview (head()) of the first five rows so you can see the raw values.
print("head(5):")
print(titanic.head(5))
head(5):
survived pclass sex age sibsp parch fare embarked class \
0 0 3 male 22.0 1 0 7.2500 S Third
1 1 1 female 38.0 1 0 71.2833 C First
2 1 3 female 26.0 0 0 7.9250 S Third
3 1 1 female 35.0 1 0 53.1000 S First
4 0 3 male 35.0 0 0 8.0500 S Third
who adult_male deck embark_town alive alone
0 man True NaN Southampton no False
1 woman False C Cherbourg yes False
2 woman False NaN Southampton yes True
3 woman False C Southampton yes False
4 man True NaN Southampton no True
Here is a quick summary of the numeric columns in the Titanic dataset.
print("describe():")
print(titanic.describe())
describe():
survived pclass age sibsp parch fare
count 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200
Here is the dataset shape to confirm the number of rows and columns.
print("shape:", titanic.shape)
shape: (891, 15)
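In the describe() output above, the count for age (714) is lower than the 891 rows reported by shape, which signals missing values. A minimal sketch of checking this directly follows; it uses a tiny hypothetical frame with invented values that only mimics a few Titanic columns, so it runs without downloading anything.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical frame mimicking a few Titanic columns (values invented)
df = pd.DataFrame({
    "age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "fare": [7.25, 71.28, 7.93, 53.10, 8.05],
    "deck": [None, "C", None, "C", None],
})

n_rows, n_cols = df.shape   # (rows, columns)
missing = df.isna().sum()   # number of missing values per column
non_null = df.count()       # non-missing values, the "count" row in describe()

print(f"{n_rows} rows x {n_cols} columns")
print(missing)
```

For every column, the non-null count plus the missing count equals the number of rows, which is why a low count in describe() is a quick hint that a column has gaps.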
Style Sheets
A style sheet, or theme, is a set of style rules that your plots follow. Using a style sheet gives your plots a unified look and feel, making them more professional. You can even create your own style sheet.
Matplotlib ships with several built-in style sheets. Popular ones include:
bmh (Bayesian Methods for Hackers)
fivethirtyeight (FiveThirtyEight is a news site)
ggplot (R’s ggplot2 default theme)
dark_background
The syntax for using stylesheets in matplotlib is:
plt.style.use(style_name)
To see all the stylesheets available, use:
plt.style.available
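Because plt.style.use() changes settings for every plot that follows, it can help to try a style temporarily first. The sketch below checks that a style name is available and applies it only inside a with block using plt.style.context; the variable names and the small list of values are hypothetical.

```python
import matplotlib.pyplot as plt

wanted = "ggplot"
before = plt.rcParams["axes.facecolor"]  # setting before the style is applied

if wanted in plt.style.available:
    # The style is active only inside this block
    with plt.style.context(wanted):
        fig, ax = plt.subplots()
        ax.hist([1, 2, 2, 3, 3, 3, 4, 4, 5], bins=5)
        inside = plt.rcParams["axes.facecolor"]

after = plt.rcParams["axes.facecolor"]  # restored once the block exits
```

The figure created inside the block keeps the style it was drawn with, while plots made afterwards use the previous settings again.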
Note that:
We use plt.style, which means we are using Matplotlib here.
Pandas by default pulls colors from Matplotlib’s axes.prop_cycle, a Matplotlib rcParam (runtime configuration parameter) that holds a color iterator (cycler) cycling through a list of predefined colors. That’s why each line gets a different color when you plot multiple lines (the default cycle has 10 colors).
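To see the color cycle described here, you can read it straight from rcParams. This is a small sketch; the helper name cycle_colors is hypothetical, and the styles are applied only temporarily so the global settings stay untouched.

```python
import matplotlib.pyplot as plt


def cycle_colors():
    """Return the colors of the active axes.prop_cycle as a list."""
    return plt.rcParams["axes.prop_cycle"].by_key()["color"]


# Inspect the cycle under two styles without changing global settings
with plt.style.context("default"):
    default_colors = cycle_colors()

with plt.style.context("ggplot"):
    ggplot_colors = cycle_colors()

print(len(default_colors), default_colors[:3])
print(len(ggplot_colors), ggplot_colors[:3])
```

Changing the style sheet swaps this list, which is why the same Pandas plotting code produces different colors under different themes.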
Before plt.style.use(), let’s draw this histogram (this should be the default look):
titanic['age'].hist()
<Axes: >
Call the stylesheet. Let’s use the ggplot theme:
plt.style.use('ggplot')
After applying plt.style.use('ggplot'):
titanic['age'].hist()
<Axes: >
Now try plt.style.use('bmh'):
plt.style.use('bmh')
titanic['age'].hist()
<Axes: >
Next, the fivethirtyeight style:
plt.style.use('fivethirtyeight')
titanic['age'].hist()
<Axes: >
A dark background theme:
plt.style.use('dark_background')
titanic['age'].hist()
<Axes: >
Let’s stick with the ggplot style for now.
plt.style.use('ggplot')