10. Visualization

10.1. Overview

Data visualization is the graphical representation of data. It transforms complex datasets into visual formats that make patterns, trends, and outliers easier to identify. In Python, several libraries support this process; Matplotlib, Seaborn, and Plotly are among the most widely used. In addition, Pandas' built-in plotting lets you quickly create common charts directly from Series and DataFrame objects, enabling fast, label-aware exploratory plots with minimal code.

Think of visualization as creating a map for your data. A raw list of coordinates is difficult to interpret, but a map reveals structure, relationships, and direction. Likewise, visualization turns raw data into an intuitive visual story.

Why does visualization matter? Effective visualization serves three main purposes:

  • Exploratory Data Analysis (EDA): Reveals structure, relationships, and unexpected insights.

  • Communication: Conveys complex results to non-technical audiences clearly.

  • Decision-Making: Provides visual evidence to guide business or research choices.

Effective data visualization presents complex information in a way that is clear, accurate, and easy to interpret. Well-designed visuals do more than display data: they reveal patterns, support sound decision-making, and make insights accessible to a broad audience. The four principles below help analysts create visualizations that inform, engage, and persuade.

1. Context is Key: Design your visualizations for the people who will use them.

  • User Personas: Identify who the audience is, their expertise, and what decisions they need to make.

  • Context: Consider where and how they will view the visualization, and tailor your design to answer their key questions. Context is essential for interpreting data accurately.

2. Keep It Simple: Simplicity enhances clarity. Overly complex visuals obscure meaning.

  • Clarity: Use clear labels, concise titles, and avoid technical jargon.

  • Minimalism: Remove any elements that do not add value to understanding the data. Display only the most relevant data for your message.

3. Choose the Right Chart Type: Different data structures call for different chart types:

  • Bar Charts: Compare categories or discrete groups.

  • Line Charts: Show trends or changes over time.

  • Pie Charts: Display proportions (use sparingly).

  • Scatter Plots: Reveal relationships between two continuous variables.

4. Tell a Story: A strong visualization communicates a clear narrative that guides the audience through the data. Consider the following:

  • Narrative Flow: Structure the visualization to lead the audience from one insight to the next, creating a logical progression.

  • Engagement: Use visual storytelling techniques, such as highlighting trends or changes over time, to engage the audience and make the data more relatable.
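The chart-type pairings in principle 3, combined with the labeling and minimalism advice in principle 2, can be sketched with Matplotlib as follows. This is only an illustration: the data values and variable names are made up, and the styling choices (such as hiding the top and right borders) are one possible reading of "remove elements that do not add value", not a fixed rule.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt

# Illustrative values only (not taken from a real dataset)
categories = ["A", "B", "C"]
counts = [12, 30, 21]
months = [1, 2, 3, 4, 5, 6]
sales = [10, 12, 15, 14, 18, 21]
heights = [150, 160, 165, 172, 180]
weights = [52, 58, 63, 70, 79]

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))

# Bar chart: compare discrete categories
axes[0].bar(categories, counts)
axes[0].set_title("Count by category")
axes[0].set_xlabel("Category")
axes[0].set_ylabel("Count")

# Line chart: show a trend over time
axes[1].plot(months, sales, marker="o")
axes[1].set_title("Sales by month")
axes[1].set_xlabel("Month")
axes[1].set_ylabel("Sales")

# Scatter plot: relate two continuous variables
axes[2].scatter(heights, weights)
axes[2].set_title("Weight vs. height")
axes[2].set_xlabel("Height (cm)")
axes[2].set_ylabel("Weight (kg)")

# Minimalism: hide the top and right borders, which carry no information here
for ax in axes:
    ax.spines["top"].set_visible(False)
    ax.spines["right"].set_visible(False)

fig.tight_layout()
```

In a notebook the figure is displayed automatically; in a script, `fig.savefig(...)` or `plt.show()` (with an interactive backend) would be needed to see it.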

10.2. Common Tools and Libraries

Data visualization is a core part of data analysis, helping analysts interpret and communicate complex datasets through visual means. A wide range of tools and libraries—both graphical and code-based—support the creation of effective visualizations.

Tableau is a leading platform for building interactive, shareable dashboards. Its drag-and-drop interface and support for multiple data sources make it popular among both beginners and professionals.

Power BI by Microsoft is a business analytics tool that integrates seamlessly with Excel, SQL Server, and cloud services. It is widely used in corporate environments for creating reports and dashboards with minimal coding.

While proprietary tools such as Tableau and Power BI are powerful and popular for data analysis, visualization libraries such as Matplotlib, Seaborn, and Plotly are widely adopted as part of the Python data science ecosystem.

Matplotlib is a foundational Python library for producing static, animated, and interactive plots.
It works closely with NumPy and Pandas and forms the base for many higher-level visualization tools.

Seaborn, built on Matplotlib, offers a simpler interface for creating attractive statistical graphics such as heatmaps and violin plots.
It is ideal for visualizing relationships and distributions in data.

Plotly supports Python, R, and JavaScript and excels at building interactive, web-based visualizations, including 3D and geographic plots.

Each tool has unique strengths. Proprietary tools such as Tableau and Power BI emphasize ease of use and dashboard creation. Python visualization libraries such as Matplotlib, Seaborn, Pandas plotting, and Plotly provide flexibility and analytical depth. In this chapter, we will focus on the Python visualization libraries, starting with the Pandas built-in visualization tools based on Matplotlib.
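Because Pandas plotting is built on Matplotlib, a plot created from a DataFrame can be refined afterwards with ordinary Matplotlib calls. The sketch below illustrates this under simple assumptions: the DataFrame contents and the column names `revenue` and `cost` are invented for the example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Illustrative values only (not taken from a real dataset)
df = pd.DataFrame(
    {"revenue": [10, 12, 15, 14], "cost": [7, 8, 9, 11]},
    index=pd.Index([2021, 2022, 2023, 2024], name="year"),
)

# DataFrame.plot draws with Matplotlib and returns a Matplotlib Axes object;
# each numeric column becomes one line
ax = df.plot(kind="line", marker="o", title="Revenue and cost by year")

# Since ax is a regular Axes, Matplotlib methods can adjust the chart
ax.set_ylabel("Amount")
ax.set_xticks(list(df.index))

print(type(ax))
```

The same pattern applies to the other `kind` values (for example `"bar"` or `"hist"`): let Pandas produce the first draft, then adjust labels, ticks, and limits through the returned Axes.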

10.3. Which Plot to Use?

It is essential to know which plot to use when the need for data visualization arises. Matplotlib (plt/ax), Pandas (df.plot), and Seaborn (sns) each provide a set of functions and methods for data visualization. You will learn about them in detail later; the summary below lists them side by side. For now, pay attention to what each plot type is used for.

| Plot Type | Pandas (df.plot) | Matplotlib (plt) | Seaborn (sns) | Uses |
|---|---|---|---|---|
| Line | .line()/.plot() | .plot() | .lineplot() | Time series, trends |
| Scatter | .scatter(x, y) | .scatter() | .scatterplot() | Correlations, relationships |
| Bar (vert.) | .bar() | .bar() | .barplot() | Category comparisons |
| Bar (horiz.) | .barh() | .barh() | .barplot(y, x) | Long category labels |
| Histogram | .hist() | .hist() | .histplot() | Distribution of numeric data |
| Box plot | .box() | .boxplot() | .boxplot() | Distribution + outliers |
| KDE/Density | .kde()/.density() | | .kdeplot() | Smoothed distribution |
| ECDF | | .ecdf() (v3.8+) | .ecdfplot() | Cumulative distribution |
| Area | .area() | .fill_between() | | Cumulative trends |
| Stacked area | .area(stacked=True) | .stackplot() | | Composition over time |
| Pie | .pie() (ser) | .pie() | | Part-to-whole (use sparingly) |
| Hexbin | .hexbin(x, y) (df) | .hexbin() | .jointplot(kind="hex") (fig.) | Dense scatter (binning) |
| Violin | | .violinplot() | .violinplot() | Distribution shape by category |
| Heatmap | | .imshow()/.pcolormesh() | .heatmap() | Correlation matrices, grids |
| Count plot | | | .countplot() | Frequency of categories |
| Pair plot | | | .pairplot() (fig.) | Multivariate relationships |
| Joint plot | | | .jointplot() (fig.) | Bivariate + marginals |
| Regression | | | .regplot() (ax)/.lmplot() (fig.) | Linear relationships + CI |
| Contour | | .contour()/.contourf() | .kdeplot(x, y) (2D density) | Continuous 2D fields/level sets |

Markers: (ser) and (df) indicate a method called on a Series or a DataFrame, respectively; (ax) and (fig.) indicate Seaborn axes-level and figure-level functions. An empty cell means the library has no dedicated function for that plot type.
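To make one row of the summary concrete, the sketch below draws the same scatter plot through each of the three interfaces. The DataFrame and its column names (`hours`, `score`) are invented for illustration; only the plotting calls correspond to the Scatter row.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Illustrative values only (not taken from a real dataset)
df = pd.DataFrame(
    {"hours": [1, 2, 3, 4, 5, 6], "score": [52, 55, 61, 70, 74, 83]}
)

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))

# Pandas: a DataFrame method; columns are referenced by name
df.plot.scatter(x="hours", y="score", ax=axes[0], title="Pandas")

# Matplotlib: an Axes method; the data is passed directly and labels are set by hand
axes[1].scatter(df["hours"], df["score"])
axes[1].set_title("Matplotlib")
axes[1].set_xlabel("hours")
axes[1].set_ylabel("score")

# Seaborn: a function that takes the DataFrame plus column names
sns.scatterplot(data=df, x="hours", y="score", ax=axes[2])
axes[2].set_title("Seaborn")

fig.tight_layout()
```

Note that Pandas and Seaborn pick up the axis labels from the column names, while plain Matplotlib leaves labeling to you.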

10.4. Seaborn Datasets

Seaborn comes with several datasets that are commonly used for learning data science and machine learning.

| Name | Rows × Cols | What it's about | Common uses |
|---|---|---|---|
| tips | 244 × 7 | Restaurant bills & tips | categorical plots, regression, grouping |
| penguins | 344 × 7 | Palmer penguins (species & measurements) | scatter, hue, KDE, classification demos |
| iris | 150 × 5 | Iris flower measurements | pairplots, clustering, basics |
| diamonds | 53,940 × 10 | Diamond prices & attributes | regression, categorical + numeric |
| flights | 144 × 3 | Monthly air passengers (1949–1960) | heatmaps, time series |
| titanic | 891 × 15 | Titanic passengers | categorical analysis, missing data |
| planets | 1,035 × 6 | Exoplanet discoveries | distributions, facet grids |
| fmri | 1,064 × 5 | fMRI signal over time | lineplots with CIs |
| exercise | 90 × 4 | Exercise & pulse | catplots, faceting |
| anscombe | 44 × 3 | Anscombe's quartet | scatter + regression; cautionary stats |

To load a dataset, pass its name to sns.load_dataset(), as in the examples below.

10.4.1. Iris

The Iris dataset, introduced by Ronald A. Fisher in 1936, is one of the most well-known datasets in statistics and machine learning. It is often used for testing classification and visualization techniques. It contains measurements of three species of iris flowers: Setosa, Versicolor, and Virginica. The features of the iris dataset include:

| Feature | Description | Units |
|---|---|---|
| sepal_length | Length of the outer part of the flower | cm |
| sepal_width | Width of the outer part | cm |
| petal_length | Length of the inner petal | cm |
| petal_width | Width of the inner petal | cm |
| species | Type of iris flower (setosa, versicolor, virginica) | categorical |

# Load the iris dataset and show its first five rows
import seaborn as sns

iris = sns.load_dataset("iris")
iris.head()

   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa

10.4.2. Tips

The tips dataset records restaurant bills and tips. Its features are:

| Feature | Description | Type |
|---|---|---|
| total_bill | Total bill amount (including tax) | float |
| tip | Tip amount given by the customer | float |
| sex | Sex of the customer paying the bill (Male, Female) | categorical |
| smoker | Whether the party included smokers (Yes, No) | categorical |
| day | Day of the week (Thur, Fri, Sat, Sun) | categorical |
| time | Meal type (Lunch, Dinner) | categorical |
| size | Number of people in the dining party | integer |

tips = sns.load_dataset('tips')
tips.head()

   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4
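As a quick check of the feature types listed in the table, the five rows printed above can be rebuilt by hand and inspected without downloading anything. This is only a sketch: the values are copied from the output above, the categorical conversion is done explicitly here, and the derived column name `tip_pct` is a hypothetical name chosen for the example.

```python
import pandas as pd

# First five rows of the tips dataset, copied from the output above
tips_head = pd.DataFrame(
    {
        "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59],
        "tip": [1.01, 1.66, 3.50, 3.31, 3.61],
        "sex": ["Female", "Male", "Male", "Male", "Female"],
        "smoker": ["No"] * 5,
        "day": ["Sun"] * 5,
        "time": ["Dinner"] * 5,
        "size": [2, 3, 3, 2, 4],
    }
)

# Store the text columns as pandas categoricals, matching the
# "categorical" type given in the feature table
cat_cols = ["sex", "smoker", "day", "time"]
tips_head = tips_head.astype({col: "category" for col in cat_cols})

# Derived column (hypothetical name): tip as a percentage of the total bill
tips_head["tip_pct"] = 100 * tips_head["tip"] / tips_head["total_bill"]

print(tips_head.dtypes)
print(tips_head[["total_bill", "tip", "tip_pct"]].round(2))
```

A derived ratio like this is often easier to compare across groups than the raw tip amount, which makes it a natural candidate for the categorical plots mentioned in the dataset summary.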

10.4.3. Planets

planets = sns.load_dataset('planets')
planets.head()

            method  number  orbital_period   mass  distance  year
0  Radial Velocity       1         269.300   7.10     77.40  2006
1  Radial Velocity       1         874.774   2.21     56.95  2008
2  Radial Velocity       1         763.000   2.60     19.84  2011
3  Radial Velocity       1         326.030  19.40    110.62  2007
4  Radial Velocity       1         516.220  10.50    119.47  2009