5. Visualization

Dr. Andrew Abela created this chart chooser visualization in 2013. Its four-way categorization of charts (comparison, distribution, composition, and relationship) offers great insight when choosing a visualization.

../../_images/andrew-abela-chart-chooser.jpg

Fig. 5.1 Andrew Abela Chart Chooser

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Overview

Data visualization is the art and science of transforming raw numbers into compelling visual stories. While spreadsheets and statistical summaries provide precise information, visualizations reveal patterns, relationships, and insights that would remain hidden in tables of data.

Imagine trying to understand weather patterns from thousands of temperature readings versus seeing them plotted as a line graph over time—the visual immediately shows trends, seasonal cycles, and anomalies that numbers alone cannot convey.

Each tool has its strengths: Python excels at programmatic control and integration with data analysis workflows, R is strong in statistical visualization, and Tableau and Power BI offer user-friendly interfaces for business users.

Why Visualization is Essential

Data visualization transforms numbers into understanding, serving three fundamental purposes:

  • Exploration and Discovery: Visualization reveals patterns, outliers, and relationships invisible in raw data, guiding initial analysis and data cleaning decisions.

  • Communication and Persuasion: Well-crafted visuals convey complex findings to diverse audiences, translating technical results into accessible insights.

  • Decision Support and Action: Visualization provides clarity for confident decision-making through dashboards, trend analysis, and comparative displays.

Ultimately, visualization bridges the gap between data analysis and actionable insight.

Effective data visualization follows four key principles:

  • Context is Key: Design for your specific audience and their decision-making needs.

  • Keep It Simple: Use clear labels and remove unnecessary elements that don’t add value.

  • Choose the Right Chart Type: Match chart types to data structure:

    • bars for categories,

    • lines for trends,

    • scatter plots for relationships.

  • Tell a Story: Structure visualizations to guide the audience through a logical narrative flow.
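To make the chart-type rule concrete, here is a minimal sketch that draws one chart of each kind side by side. The three small DataFrames (sales, monthly, ads) and their column names are hypothetical toy data invented for this illustration, not part of any dataset used in this chapter.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data, made up for illustration only
sales = pd.DataFrame({
    "region": ["North", "South", "East", "West"],
    "revenue": [120, 95, 140, 80],
})
monthly = pd.DataFrame({
    "month": [1, 2, 3, 4, 5, 6],
    "visits": [200, 220, 260, 240, 300, 330],
})
ads = pd.DataFrame({
    "ad_spend": [10, 20, 30, 40, 50],
    "orders": [12, 25, 31, 48, 55],
})

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))

# Bars for categories: one bar per region
axes[0].bar(sales["region"], sales["revenue"])
axes[0].set_title("Revenue by region")

# Lines for trends: values connected in time order
axes[1].plot(monthly["month"], monthly["visits"], marker="o")
axes[1].set_title("Visits per month")

# Scatter for relationships: one point per observation
axes[2].scatter(ads["ad_spend"], ads["orders"])
axes[2].set_title("Orders vs. ad spend")

fig.tight_layout()
```

Each panel uses the same few lines of setup, so the only real decision is which chart type matches the structure of the data.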

Common Tools and Libraries

The data visualization landscape offers tools ranging from point-and-click platforms to code-based libraries. Understanding this ecosystem helps you choose the right tool for your needs.

Business Intelligence Platforms:

  • Tableau: Industry-leading dashboard creation with drag-and-drop interface

  • Power BI: Microsoft’s analytics platform with strong Excel and cloud integration

Python Visualization Libraries:

  • Matplotlib: The foundational library providing complete control over plot elements

  • Seaborn: A higher-level library built on Matplotlib (which it requires as a dependency), designed for creating visually appealing and informative statistical graphics with less code. It’s excellent for exploring relationships between variables.

  • Plotly: Interactive, web-ready visualizations including 3D and geographic plots

  • Pandas: Quick plotting directly from DataFrames for exploratory analysis

Why Focus on Python?

While business tools excel at dashboard creation and user-friendly interfaces, Python libraries offer several advantages for data scientists:

  • Integration: Seamless workflow from data analysis to visualization

  • Reproducibility: Code-based plots can be version controlled and automated

  • Customization: Complete control over every visual element

  • Cost: Open-source tools reduce licensing expenses

This chapter focuses on Pandas visualization—the fastest way to create exploratory plots directly from your data.
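As a rough illustration of why Pandas plotting is quick, the sketch below goes from a DataFrame to a labeled line chart in two statements. The temps DataFrame and its city_a/city_b columns are hypothetical values invented for this example.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical toy data: average monthly temperatures for two made-up cities
temps = pd.DataFrame(
    {"city_a": [3, 5, 9, 14, 18, 22], "city_b": [10, 11, 13, 16, 19, 23]},
    index=pd.Index([1, 2, 3, 4, 5, 6], name="month"),
)

# DataFrame.plot() draws one line per column and returns a Matplotlib Axes
ax = temps.plot(title="Average temperature by month")

# Because the result is a regular Axes, Matplotlib methods still apply
ax.set_ylabel("temperature (°C)")
```

The returned object is an ordinary Matplotlib Axes, so anything Pandas does not expose directly can still be adjusted afterwards.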

Which Plot to Use?

It is essential to know which plot to use when the need for data visualization arises. Matplotlib (plt/ax), Pandas (df.plot), and Seaborn (sns) each have a set of functions and methods for data visualization. You will learn about these functions and methods later; for now, pay attention to what each plot is used for and keep this table as a reference.

| Plot Type | Pandas (df.plot) | Matplotlib (plt) | Seaborn (sns) | Uses |
|---|---|---|---|---|
| Line | .line()/.plot() | .plot() | .lineplot() | Time series, trends |
| Scatter | .scatter(x, y) | .scatter() | .scatterplot() | Correlations, relationships |
| Bar (vert.) | .bar() | .bar() | .barplot() | Category comparisons |
| Bar (horiz.) | .barh() | .barh() | .barplot(y, x) | Long category labels |
| Histogram | .hist() | .hist() | .histplot() | Distribution of numeric data |
| Box plot | .box() | .boxplot() | .boxplot() | Distribution + outliers |
| KDE/Density | .kde()/.density() | | .kdeplot() | Smoothed distribution |
| ECDF | | .ecdf() (v3.8+) | .ecdfplot() | Cumulative distribution |
| Area | .area() | .fill_between() | | Cumulative trends |
| Stacked area | .area(stacked=True) | .stackplot() | | Composition over time |
| Pie | .pie() (ser) | .pie() | | Part-to-whole (use sparingly) |
| Hexbin | .hexbin(x, y) (df) | .hexbin() | .jointplot(kind="hex") (fig.) | Dense scatter (binning) |
| Violin | | .violinplot() | .violinplot() | Distribution shape by category |
| Heatmap | | .imshow()/.pcolormesh() | .heatmap() | Correlation matrices, grids |
| Count plot | | | .countplot() | Frequency of categories |
| Pair plot | | | .pairplot() (fig.) | Multivariate relationships |
| Joint plot | | | .jointplot() (fig.) | Bivariate + marginals |
| Regression | | | .regplot() (ax)/.lmplot() (fig.) | Linear relationships + CI |
| Contour | | .contour()/.contourf() | .kdeplot(x, y) (2D density) | Continuous 2D fields/level sets |

Seaborn Datasets

Seaborn comes with several datasets that are commonly used for learning data science and machine learning.

| Name | Rows × Cols | What it’s about | Common uses |
|---|---|---|---|
| tips | 244 × 7 | Restaurant bills & tips | categorical plots, regression, grouping |
| penguins | 344 × 7 | Palmer penguins (species & measurements) | scatter, hue, KDE, classification demos |
| iris | 150 × 5 | Iris flower measurements | pairplots, clustering, basics |
| diamonds | 53,940 × 10 | Diamond prices & attributes | regression, categorical + numeric |
| flights | 144 × 3 | Monthly air passengers (’49–’60) | heatmaps, time series |
| titanic | 891 × 15 | Titanic passengers | categorical analysis, missing data |
| planets | 1,035 × 6 | Exoplanet discoveries | distributions, facet grids |
| fmri | 1,064 × 5 | fMRI signal over time | lineplots with CIs |
| exercise | 90 × 6 | Exercise & pulse | catplots, faceting |
| anscombe | 44 × 3 | Anscombe’s quartet | scatter + regression; cautionary stats |

To load the datasets, use the load_dataset function with syntax:

[name] = sns.load_dataset("[dataset]")

Here we take a quick look at some of the popular ones.

Iris

The Iris dataset, introduced by Ronald A. Fisher in 1936, is one of the most well-known datasets in statistics and machine learning. It is often used for testing classification and visualization techniques. It contains measurements of three species of iris flowers: Setosa, Versicolor, and Virginica. The features of the iris dataset include:

| Feature | Description | Units |
|---|---|---|
| sepal_length | Length of the outer part of the flower | cm |
| sepal_width | Width of the outer part | cm |
| petal_length | Length of the inner petal | cm |
| petal_width | Width of the inner petal | cm |
| species | Type of iris flower (setosa, versicolor, virginica) | categorical |

### load dataset

iris = sns.load_dataset("iris")
iris.head(3)
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa

Tips

The tips dataset records restaurant bills and gratuities along with simple demographics, making it useful for practicing categorical comparisons and relationships between numeric variables.

The tips dataset has features such as:

| Feature | Description | Type |
|---|---|---|
| total_bill | Total bill amount (including tax) | float |
| tip | Tip amount given by the customer | float |
| sex | Sex of the person paying the bill (Male, Female) | categorical |
| smoker | Whether the party included smokers (Yes, No) | categorical |
| day | Day of the week (Thur, Fri, Sat, Sun) | categorical |
| time | Meal type (Lunch, Dinner) | categorical |
| size | Number of people in the dining party | integer |

tips = sns.load_dataset('tips')
tips.head(3)
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3

Titanic

The titanic dataset describes passengers aboard the Titanic, which sank in 1912. It includes passenger demographics, ticket class, fares, and survival outcome, making it a classic dataset for classification and categorical analysis.

titanic = sns.load_dataset('titanic')
titanic.head(3)
survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
0 0 3 male 22.0 1 0 7.2500 S Third man True NaN Southampton no False
1 1 1 female 38.0 1 0 71.2833 C First woman False C Cherbourg yes False
2 1 3 female 26.0 0 0 7.9250 S Third woman False NaN Southampton yes True

Planets

The planets dataset contains exoplanet discoveries, including discovery method, orbital period, and mass, and is useful for distributions and time-based summaries.

planets = sns.load_dataset('planets')
planets.head(3)
method number orbital_period mass distance year
0 Radial Velocity 1 269.300 7.10 77.40 2006
1 Radial Velocity 1 874.774 2.21 56.95 2008
2 Radial Velocity 1 763.000 2.60 19.84 2011
# Ensure the Titanic dataset is available for preview examples
if "sns" not in globals():
    import seaborn as sns

if "titanic" not in globals():
    titanic = sns.load_dataset("titanic")

Previewing Data

After you load a new dataset, always explore it with the following methods and attribute:

  • head() (what the dataset looks like),

  • describe() (descriptive statistics), and

  • shape (the number of rows and columns; you can also just evaluate the DataFrame)

Here is a preview (head()) of the first five rows so you can see the raw values.

print("head(5):")
print(titanic.head(5))
head(5):
   survived  pclass     sex   age  sibsp  parch     fare embarked  class  \
0         0       3    male  22.0      1      0   7.2500        S  Third   
1         1       1  female  38.0      1      0  71.2833        C  First   
2         1       3  female  26.0      0      0   7.9250        S  Third   
3         1       1  female  35.0      1      0  53.1000        S  First   
4         0       3    male  35.0      0      0   8.0500        S  Third   

     who  adult_male deck  embark_town alive  alone  
0    man        True  NaN  Southampton    no  False  
1  woman       False    C    Cherbourg   yes  False  
2  woman       False  NaN  Southampton   yes   True  
3  woman       False    C  Southampton   yes  False  
4    man        True  NaN  Southampton    no   True  

Here is a quick summary of the numeric columns in the Titanic dataset.

print("describe():")
print(titanic.describe())
describe():
         survived      pclass         age       sibsp       parch        fare
count  891.000000  891.000000  714.000000  891.000000  891.000000  891.000000
mean     0.383838    2.308642   29.699118    0.523008    0.381594   32.204208
std      0.486592    0.836071   14.526497    1.102743    0.806057   49.693429
min      0.000000    1.000000    0.420000    0.000000    0.000000    0.000000
25%      0.000000    2.000000   20.125000    0.000000    0.000000    7.910400
50%      0.000000    3.000000   28.000000    0.000000    0.000000   14.454200
75%      1.000000    3.000000   38.000000    1.000000    0.000000   31.000000
max      1.000000    3.000000   80.000000    8.000000    6.000000  512.329200

Here is the dataset shape to confirm the number of rows and columns.

print("shape:", titanic.shape)
shape: (891, 15)
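Comparing these outputs already tells you something: describe() reports a count of 714 for age, while shape reports 891 rows, so 177 ages appear to be missing. The sketch below shows the same check on a small hypothetical DataFrame (people, with invented values) so that it runs without downloading anything.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a dataset with some missing ages
people = pd.DataFrame({
    "age": [22.0, 38.0, np.nan, 35.0, np.nan],
    "fare": [7.25, 71.28, 7.92, 53.10, 8.05],
})

n_rows = people.shape[0]                      # total number of rows
non_missing = people.describe().loc["count"]  # count ignores NaN values
missing = people.isna().sum()                 # number of NaN values per column

print("rows:", n_rows)
print(non_missing)
print(missing)
```

Whenever count is lower than the number of rows, the difference is the number of missing values in that column.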

Style Sheets

A style sheet (or theme) is a set of style rules that your plots follow. Using a style sheet gives your plots a unified look and feel, making them more professional. You can even create your own style sheet.

Matplotlib ships with several built-in style sheets you can use to make your plots look a little nicer. Popular ones include:

  • bmh (Bayesian Methods for Hackers)

  • fivethirtyeight (FiveThirtyEight is a news site)

  • ggplot (R’s ggplot2 default theme)

  • dark_background

The syntax for using stylesheets in matplotlib is:

plt.style.use(style_name)

To see all the stylesheets available, use:

plt.style.available

Note that:

  • We use plt.style, which means we are using Matplotlib here.

  • Pandas by default pulls colors from Matplotlib’s axes.prop_cycle, an rcParam (runtime configuration parameter) that holds a cycler, an iterator that cycles through a list of predefined colors. That’s why each line gets a different color when you plot multiple lines (the default cycle contains 10 colors).
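If you want to see the color cycle for yourself, a minimal sketch like the following reads it from rcParams. It uses plt.style.context so that the global style is left untouched; the variable names are arbitrary.

```python
import matplotlib.pyplot as plt

# Read the color cycle under Matplotlib's default style
with plt.style.context("default"):
    default_colors = plt.rcParams["axes.prop_cycle"].by_key()["color"]

# Read the color cycle under the ggplot style
with plt.style.context("ggplot"):
    ggplot_colors = plt.rcParams["axes.prop_cycle"].by_key()["color"]

print(len(default_colors), default_colors[:3])
print(len(ggplot_colors), ggplot_colors[:3])
```

Because a style sheet can replace axes.prop_cycle, switching styles usually changes the line and bar colors as well as the background.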

Before plt.style.use(), let’s draw this histogram (this should be the default look):

titanic['age'].hist()
<Axes: >
../../_images/10e2d506f77c1d60fce71b623664d736ceb266aff3b2de1e21fb82f19288b846.png

Call the stylesheet. Let’s use the ggplot theme:

plt.style.use('ggplot')

After applying plt.style.use('ggplot'):

titanic['age'].hist()
<Axes: >
../../_images/bae5ecf7177e515bd3f6a0141a52e31ee3f5daa4186c2ddb367bbf4ba4a2c67f.png

Now try plt.style.use('bmh'):

plt.style.use('bmh')
titanic['age'].hist()
<Axes: >
../../_images/e16b239475fee80320ad395c4231790fddd702b207625873fb5b3f1f470c37c4.png

Next, the fivethirtyeight style:

plt.style.use('fivethirtyeight')
titanic['age'].hist()
<Axes: >
../../_images/f4ba3d10207120e1f9c336de7a71f830f843b9ae2a9ea4044e5779558d9cd8ac.png

A dark background theme:

plt.style.use('dark_background')
titanic['age'].hist()
<Axes: >
../../_images/0d611077b82716171a7632aa0e3fe2b3e57b4da3ec8bff5623a08022bf107947.png

Let’s stick with the ggplot style for now.

plt.style.use('ggplot')
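Note that plt.style.use() changes the style for every plot drawn afterwards. If you only want to try a style on a single figure, one option is the plt.style.context() context manager, sketched below. The ages list holds a few hypothetical values so the example runs on its own.

```python
import matplotlib.pyplot as plt

ages = [22, 38, 26, 35, 35, 54, 2, 27, 14, 4]  # hypothetical sample values

# Remember a setting before entering the temporary style
before = plt.rcParams["axes.facecolor"]

# The style applies only to plots created inside the with-block
with plt.style.context("dark_background"):
    fig, ax = plt.subplots()
    ax.hist(ages, bins=5)
    ax.set_title("Styled only inside the with-block")
    inside = plt.rcParams["axes.facecolor"]

# After the block, the previous settings are restored
after = plt.rcParams["axes.facecolor"]
print(before, inside, after)
```

This keeps experiments with different themes from leaking into the rest of a notebook.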