10. Visualization
10.1. Overview
Data visualization is the graphical representation of data. It transforms complex datasets into visual formats that make patterns, trends, and outliers easier to identify. In Python, several libraries support this process; Matplotlib, Seaborn, and Plotly are among the most widely used. In addition, Pandas lets you create common charts directly from Series and DataFrame objects, enabling fast, label-aware exploratory plots with minimal code.
Think of visualization as creating a map for your data. A raw list of coordinates is difficult to interpret, but a map reveals structure, relationships, and direction. Likewise, visualization turns raw data into an intuitive visual story.
Why does visualization matter? Effective visualization serves three main purposes:
- Exploratory Data Analysis (EDA): Reveals structure, relationships, and unexpected insights.
- Communication: Conveys complex results clearly to non-technical audiences.
- Decision-Making: Provides visual evidence to guide business or research choices.
Effective data visualization presents complex information in a way that is clear, accurate, and easy to interpret. Well-designed visuals do more than display data: they reveal patterns, support sound decision-making, and make insights accessible to a broad audience. The four principles below help analysts create visualizations that inform, engage, and persuade.
1. Context is Key: Design your visualizations for the people who will use them.
   - User Personas: Identify who the audience is, what expertise they have, and what decisions they need to make.
   - Context: Consider where and how they will view the visualization, and tailor your design to answer their key questions. Context is essential for interpreting data accurately.
2. Keep It Simple: Simplicity enhances clarity. Overly complex visuals obscure meaning.
   - Clarity: Use clear labels and concise titles, and avoid technical jargon.
   - Minimalism: Remove any element that does not help the reader understand the data. Display only the data most relevant to your message.
3. Choose the Right Chart Type: Different data structures call for different chart types:
   - Bar Charts: Compare categories or discrete groups.
   - Line Charts: Show trends or changes over time.
   - Pie Charts: Display proportions (use sparingly).
   - Scatter Plots: Reveal relationships between two continuous variables.
4. Tell a Story: An effective visualization tells a story that guides the audience through the data. Consider the following:
   - Narrative Flow: Structure the visualization to lead the audience from one insight to the next in a logical progression.
   - Engagement: Use visual storytelling techniques, such as highlighting trends or changes over time, to engage the audience and make the data more relatable.
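The chart-type guidance in principle 3 can be illustrated with a minimal Matplotlib sketch. The data below is hypothetical and exists only for this example; the point is how each chart type pairs with a kind of data (categories, a time sequence, two continuous variables), and how titles and axis labels follow the clarity advice in principle 2.

```python
import matplotlib.pyplot as plt

# Hypothetical data, invented for this illustration only
categories = ["A", "B", "C"]
units_sold = [12, 30, 18]
months = [1, 2, 3, 4, 5, 6]
visitors = [100, 120, 90, 150, 170, 160]
heights_cm = [150, 160, 165, 172, 180]
weights_kg = [50, 58, 63, 70, 82]

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))

# Bar chart: compare discrete categories
axes[0].bar(categories, units_sold)
axes[0].set_title("Units sold by category")
axes[0].set_xlabel("Category")
axes[0].set_ylabel("Units sold")

# Line chart: show a trend over time
axes[1].plot(months, visitors, marker="o")
axes[1].set_title("Visitors per month")
axes[1].set_xlabel("Month")
axes[1].set_ylabel("Visitors")

# Scatter plot: relationship between two continuous variables
axes[2].scatter(heights_cm, weights_kg)
axes[2].set_title("Weight vs. height")
axes[2].set_xlabel("Height (cm)")
axes[2].set_ylabel("Weight (kg)")

fig.tight_layout()
```

In a notebook the figure displays automatically; in a script, call `plt.show()` or `fig.savefig(...)` afterwards.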
10.2. Common Tools and Libraries
Data visualization is a core part of data analysis, helping analysts interpret and communicate complex datasets through visual means. A wide range of tools and libraries—both graphical and code-based—support the creation of effective visualizations.
Tableau is a leading platform for building interactive, shareable dashboards. Its drag-and-drop interface and support for multiple data sources make it popular among both beginners and professionals.
Power BI by Microsoft is a business analytics tool that integrates seamlessly with Excel, SQL Server, and cloud services. It is widely used in corporate environments for creating reports and dashboards with minimal coding.
Proprietary tools such as Tableau and Power BI are powerful and popular for data analysis, while visualization libraries such as Matplotlib, Seaborn, and Plotly are widely adopted as part of the Python data science ecosystem.
Matplotlib is a foundational Python library for producing static, animated, and interactive plots.
It works closely with NumPy and Pandas and forms the base for many higher-level visualization tools.
Seaborn, built on Matplotlib, offers a simpler interface for creating attractive statistical graphics such as heatmaps and violin plots.
It is ideal for visualizing relationships and distributions in data.
Plotly supports Python, R, and JavaScript and excels at building interactive, web-based visualizations, including 3D and geographic plots.
Each tool has unique strengths. Proprietary tools such as Tableau and Power BI emphasize ease of use and dashboard creation. Python visualization libraries such as Matplotlib, Seaborn, Pandas plotting, and Plotly provide flexibility and analytical depth. In this chapter, we will focus on the Python visualization libraries, starting with the Pandas built-in visualization tools based on Matplotlib.
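As a minimal sketch of what "Pandas built-in visualization tools based on Matplotlib" means in practice: calling `plot` on a DataFrame draws the chart and returns a Matplotlib Axes object, which can then be refined with ordinary Matplotlib methods. The DataFrame contents and column names here are hypothetical.

```python
import pandas as pd
import matplotlib.axes

# Hypothetical monthly figures, made up for this illustration
df = pd.DataFrame(
    {"revenue": [10.0, 12.5, 11.0, 15.0], "cost": [7.0, 8.0, 9.5, 9.0]},
    index=pd.Index(["Jan", "Feb", "Mar", "Apr"], name="month"),
)

# DataFrame.plot draws one line per numeric column, labeled with the column name
ax = df.plot(title="Revenue vs. cost")

# The returned object is a Matplotlib Axes, so Matplotlib methods apply to it
ax.set_ylabel("Amount (thousands)")

print(isinstance(ax, matplotlib.axes.Axes))          # True
print([line.get_label() for line in ax.get_lines()])  # ['revenue', 'cost']
```

This is why the two libraries mix freely: Pandas handles the label-aware defaults, and Matplotlib handles any further customization.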
10.3. Which Plot to Use?
Knowing which plot to use is essential whenever you need to visualize data. Pandas (df.plot), Matplotlib (plt/ax), and Seaborn (sns) each provide a set of functions and methods for plotting. You will learn them in detail later; the table below summarizes them. For now, focus on what each plot type is used for.
| Plot Type | Pandas (df.plot) | Matplotlib (plt/ax) | Seaborn (sns) | Uses |
|---|---|---|---|---|
| Line | | | | Time series, trends |
| Scatter | | | | Correlations, relationships |
| Bar (vert.) | | | | Category comparisons |
| Bar (horiz.) | | | | Long category labels |
| Histogram | | | | Distribution of numeric data |
| Box plot | | | | Distribution + outliers |
| KDE | | — | | Smoothed distribution |
| ECDF | — | — | | Cumulative distribution |
| Area | | | — | Cumulative trends |
| Stacked area | | | — | Composition over time |
| Pie | | | — | Part-to-whole (use sparingly) |
| Hexbin | | | | Dense scatter (binning) |
| Violin | — | | | Distribution shape by category |
| Heatmap | — | | | Correlation matrices, grids |
| Count plot | — | — | | Frequency of categories |
| Pair plot | — | — | | Multivariate relationships |
| Joint plot | — | — | | Bivariate + marginals |
| Regression | — | — | | Linear relationships + CI |
| Contour | — | | | Continuous 2D fields/level sets |
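To make the three calling styles concrete, here is a hedged sketch that draws the same scatter plot once with each library. The data is synthetic and the column names `x` and `y` are placeholders; the calls used (`DataFrame.plot.scatter`, `Axes.scatter`, and `seaborn.scatterplot`) are one reasonable choice per library, not necessarily the entries the summary table originally listed.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic data, generated only for this illustration
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.normal(size=50)})
df["y"] = 2 * df["x"] + rng.normal(scale=0.5, size=50)

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5), sharey=True)

# Pandas: a plotting method on the DataFrame, drawn onto an existing Axes
df.plot.scatter(x="x", y="y", ax=axes[0], title="Pandas")

# Matplotlib: a method on the Axes, with the columns passed explicitly
axes[1].scatter(df["x"], df["y"])
axes[1].set_title("Matplotlib")
axes[1].set_xlabel("x")

# Seaborn: a function that takes the DataFrame plus column names
sns.scatterplot(data=df, x="x", y="y", ax=axes[2])
axes[2].set_title("Seaborn")

fig.tight_layout()
```

All three draw onto Matplotlib Axes, so the choice is mostly about convenience: Pandas for quick looks at a DataFrame, Matplotlib for fine control, Seaborn for statistical plots.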
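10.4. Seaborn Datasets
Seaborn provides access to several example datasets that are commonly used for learning data science and machine learning. They are downloaded from an online repository on first use, so loading one requires an internet connection.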
| Name | Rows × Cols | What it's about | Common uses |
|---|---|---|---|
| | 244 × 7 | Restaurant bills & tips | categorical plots, regression, grouping |
| | 344 × 7 | Palmer penguins (species & measurements) | scatter, hue, KDE, classification demos |
| | 150 × 5 | Iris flower measurements | pairplots, clustering, basics |
| | 53,940 × 10 | Diamond prices & attributes | regression, categorical + numeric |
| | 144 × 3 | Monthly air passengers (1949–1960) | heatmaps, time series |
| | 891 × 15 | Titanic passengers | categorical analysis, missing data |
| | 1,035 × 6 | Exoplanet discoveries | distributions, facet grids |
| | 1,064 × 4 | fMRI signal over time | lineplots with CIs |
| | 90 × 4 | Exercise & pulse | catplots, faceting |
| | 44 × 3 | Anscombe's quartet | scatter + regression; cautionary stats |
To load a dataset, call sns.load_dataset() with the dataset name as a string, as the following subsections do.
10.4.1. Iris
The Iris dataset, introduced by Ronald A. Fisher in 1936, is one of the most well-known datasets in statistics and machine learning. It is often used for testing classification and visualization techniques. It contains measurements of three species of iris flowers: Setosa, Versicolor, and Virginica. The features of the iris dataset include:
| Feature | Description | Units |
|---|---|---|
| sepal_length | Length of the outer part of the flower | cm |
| sepal_width | Width of the outer part | cm |
| petal_length | Length of the inner petal | cm |
| petal_width | Width of the inner petal | cm |
| species | Type of iris flower (setosa, versicolor, virginica) | categorical |
```python
import seaborn as sns

# Load the iris dataset and preview the first five rows
iris = sns.load_dataset("iris")
iris.head()
```
| | sepal_length | sepal_width | petal_length | petal_width | species |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
10.4.2. Tips
The tips dataset records restaurant bills and tips. Its features are:

| Feature | Description | Type |
|---|---|---|
| total_bill | Total bill amount (including tax) | float |
| tip | Tip amount given by the customer | float |
| sex | Sex of the person paying the bill (Male, Female) | categorical |
| smoker | Whether the party included smokers (Yes, No) | categorical |
| day | Day of the week (Thur, Fri, Sat, Sun) | categorical |
| time | Meal type (Lunch, Dinner) | categorical |
| size | Number of people in the dining party | integer |
```python
# Load the tips dataset and preview the first five rows
tips = sns.load_dataset('tips')
tips.head()
```
| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
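As a small, hedged illustration of exploring this data, the sketch below rebuilds the five preview rows shown above by hand (so it runs without downloading anything), derives a tip percentage, and draws a quick scatter plot with Pandas. The names `tips_head` and `tip_pct` are introduced here for illustration and are not part of the dataset.

```python
import pandas as pd

# The five preview rows shown above, typed in by hand so no download is needed
tips_head = pd.DataFrame(
    {
        "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59],
        "tip": [1.01, 1.66, 3.50, 3.31, 3.61],
        "sex": ["Female", "Male", "Male", "Male", "Female"],
        "size": [2, 3, 3, 2, 4],
    }
)

# Derived column (hypothetical name): tip as a percentage of the total bill
tips_head["tip_pct"] = (tips_head["tip"] / tips_head["total_bill"] * 100).round(2)
print(tips_head[["total_bill", "tip", "tip_pct"]])

# Quick exploratory scatter plot of tip against bill; returns a Matplotlib Axes
ax = tips_head.plot.scatter(x="total_bill", y="tip", title="Tip vs. total bill")
```

With the full dataset loaded as `tips`, the same two steps apply unchanged by replacing `tips_head` with `tips`.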
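10.4.3. Planets
The planets dataset records exoplanet discoveries.
```python
# Load the planets dataset and preview the first five rows
planets = sns.load_dataset('planets')
planets.head()
```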
| | method | number | orbital_period | mass | distance | year |
|---|---|---|---|---|---|---|
| 0 | Radial Velocity | 1 | 269.300 | 7.10 | 77.40 | 2006 |
| 1 | Radial Velocity | 1 | 874.774 | 2.21 | 56.95 | 2008 |
| 2 | Radial Velocity | 1 | 763.000 | 2.60 | 19.84 | 2011 |
| 3 | Radial Velocity | 1 | 326.030 | 19.40 | 110.62 | 2007 |
| 4 | Radial Velocity | 1 | 516.220 | 10.50 | 119.47 | 2009 |