Handling Datasets

4.2. Handling Datasets#

After learning Python data structures (e.g., list, dictionary, tuple) and NumPy arrays, we are now ready to deal with data using Pandas. There are different file types used for datasets and we are going to learn how to import and export them in this section as part of data management using Pandas.

# if you have not installed pandas through the command line
# do so by opening a new command line tab/window, then
# activate the virtual environment, and issue "pip install pandas"
# or, you may do "%pip install pandas" in a Jupyter notebook.
# do remember to comment out the installation line afterwards
# so that it would not run every time you run the cell

# we also need NumPy sometimes for processing data as it's the foundation of Pandas

# %pip install pandas
# %pip install numpy
import numpy as np
import pandas as pd

4.2.1. Accessing Files#

You can also pass the path (full or relative) and file names of your data as arguments to the Pandas read_ functions to directly read the file. However, it is recommended to use a path variable for your data directory, which will make your life easier in the long run.

To access your data folder in Jupyter Notebook environment, you can either use a full path or a relative path.

Full path:
- macOS (and Linux): A full/absolute path starts with “/” (the “root” directory); e.g., data_path = "/Users/<user>/workspace/dsm/data/".
- Windows: A full path includes a drive letter and backslashes (\) for path separator; e.g. data_path = "C:\Users\<user>\workspace\dsm\data\”) although forward slashes usually work as well.
Relative path: Usually it’s just easier to use a relative path to access the data folders nearby (e.g., data_path = ../../data/).
User home directory: Instead of starting from the system root level, you can usually use your user home directory as the starting point to reach your data file directory. Python os.path.expanduser('~') function resolves the directory passed to it (e.g., ~, representing the home directory) to a full path.

4.2.1.1. Building Path Variable#

To access your dataset files using the Pandas read_ functions, the path variable and the path.join function can be used together to pass the path/file as arguments:

### works on macOS, Linux, and Windows
import os

home_dir = os.path.expanduser('~') # get the user's home directory 
data_path = os.path.expanduser(home_dir + '/workspace/dsm/data/')  ### concatenate project data path

print(data_path)    ### output will vary based on your system and project data path

df = pd.read_csv(data_path + "data.csv")  ### read the CSV file into a DataFrame
df.head(3)

/Users/tychen/workspace/dsm/data/

	Duration	Pulse	Maxpulse	Calories
0	60	110	130	409.1
1	60	117	145	479.0
2	60	103	135	340.0

4.2.1.2. Accessing Dataset Files#

After you have your path variable (e.g. data_path) to your data directory, you join the path variable with the paths/filenames using os.path.join() to access the dataset file. It can take multiple arguments like os.path.join("folder1", "folder2", "file.txt"). This way you don’t have to worry about the path separator (”/” or “\”).

### or using os.path.join(path variable, filename)

data_file = os.path.join(data_path, "data.csv")
df = pd.read_csv(data_file)
df.head(3)

	Duration	Pulse	Maxpulse	Calories
0	60	110	130	409.1
1	60	117	145	479.0
2	60	103	135	340.0

### join the path and the filename
data_file = os.path.join(data_path, "data.csv")
print("data file once: ", data_file)  # print the full path to the data file

data file once:  /Users/tychen/workspace/dsm/data/data.csv

data_file = os.path.join(data_path, 'data.csv' )
data_file

'/Users/tychen/workspace/dsm/data/data.csv'

4.2.1.3. Using Relative Paths#

To use a relative path to access a dataset file in another folder (e.g., “data”), you need to know the relative location of the data folder and usually we use .. to move up in the directory structure when needed. .. means move up one level in the directory tree. For example, to access the data directory (see figure below):

If you are working on the YOU-ARE-HERE.ipynb notebook , you need to go up one level (..) above the notebooks directory, then go down into the data directory:

data_path = pd.read_csv('../data/')
If you are working on the YOU-MAY-BE-HERE.ipynb notebook, you need to move up two levels (folder and notebooks, or ../..), then go down into the data directory: data_path = pd.read_csv('../../data/')

To visualize your project folder, you may use the tree command in the command line.


(.venv) PS C:\Users\[user]\workspace\dsm> tree
Folder PATH listing
Volume serial number is 420C-3FF4
C:.
├───.ipynb_checkpoints
├───.venv
├── data
│   ├── actors.csv
│   ├── example
│   └── wine.csv
├── notebooks
│   ├── YOU-ARE-HERE.ipynb
│   └── folder
│       └── YOU-MAY-BE-HERE.ipynb


  [user]@[host]ː~/workspace/dsm$ tree
.
├── data
│   ├── actors.csv
│   ├── example
│   └── wine.csv
├── notebooks
│   ├── YOU-ARE-HERE.ipynb
│   └── folder
│       └── YOU-MAY-BE-HERE.ipynb

You can see that a relative path can be easier to setup for accessing data if you know the relative position of the data directory to your current working directory (CWD). Often time we even use it in an ad hoc manner without building a folder path variable but accessing the datasets directly:

### You need to adjust the path depth (i.e., how many ".." you need depending on where you 
### run the code. If you run the code in the same directory as the data file, then 
### you can just use "data.csv" as the path.If you run the code in a sub-subdirectory, then you need to 
### use "../../data/data.csv" to go up two levels and then into the data directory, and so on.) 
### ./data/data.csv
### ../data/data.csv
### ../../data/data.csv
### ../../../data/data.csv

# data_file = ("../../data/data.csv")
df = pd.read_csv("../../data/data.csv")

# just to show the path variables we used
# 
data_path = os.path.expanduser('~/workspace/dsm/data')
data_path

'/Users/tychen/workspace/dsm/data'

4.2.2. Reading Datasets#

Once you’ve set up your file paths, the next step is to actually read data into Pandas DataFrames. Pandas provides a rich collection of input/output (I/O) functions that can handle virtually any data format you’ll encounter in data science work. These functions are designed to be intuitive and flexible, allowing you to read data from files, databases, web pages, and more with just a single line of code.

The most commonly used function is pd.read_csv() for reading comma-separated values (CSV) files, which is the standard format for tabular data. However, Pandas also supports Excel files, JSON, HTML tables, SQL databases, and many other formats. Each read_ function returns a DataFrame object that you can immediately work with using all of Pandas’ powerful data manipulation capabilities.

4.2.2.1. Common Pandas I/O Functions#

Pandas provides a comprehensive set of functions for reading and writing data in various formats. Below is a summary of the most commonly used functions:

Function	Description	Common Parameters	Returns
`pd.read_csv()`	Read comma-separated values (CSV) file into DataFrame	`filepath`, `sep`, `header`, `index_col`, `usecols`, `dtype`	DataFrame
`pd.read_excel()`	Read Excel file into DataFrame	`filepath`, `sheet_name`, `header`, `index_col`, `usecols`	DataFrame
`pd.read_html()`	Read HTML tables into a list of DataFrames	`url`, `match`, `header`, `index_col`	List of DataFrames
`pd.read_json()`	Read JSON string or file into DataFrame	`filepath`, `orient`, `typ`, `dtype`	DataFrame or Series
`pd.read_sql()`	Read SQL query or database table into DataFrame	`sql`, `con`, `index_col`, `columns`	DataFrame
`pd.read_sql_table()`	Read SQL database table into DataFrame	`table_name`, `con`, `schema`, `index_col`	DataFrame
`pd.read_sql_query()`	Read SQL query into DataFrame	`sql`, `con`, `index_col`, `params`	DataFrame

Output Functions:

Function	Description	Common Parameters
`df.to_csv()`	Write DataFrame to CSV file	`filepath`, `sep`, `index`, `header`, `mode`
`df.to_excel()`	Write DataFrame to Excel file	`filepath`, `sheet_name`, `index`, `header`
`df.to_json()`	Write DataFrame to JSON file	`filepath`, `orient`, `indent`
`df.to_sql()`	Write DataFrame to SQL database	`name`, `con`, `if_exists`, `index`

Key Parameters:

filepath: Path to the file (string or file-like object)
sep: Delimiter to use (CSV, default is comma)
header: Row number(s) to use as column names
index_col: Column(s) to use as row labels
usecols: Subset of columns to read
dtype: Data type for data or columns
sheet_name: Excel sheet name or index
con: Database connection object
index: Whether to write row names (index)

4.2.2.2. `pd.read_csv()`#

Now let us read the dataset file (data_file) using Pandas pd.read_csv() function. You may read the data file directly or open it as a file and save it to a variable so you can manipulate the data.

# home = os.path.expanduser('~')
# data_path = os.path.join(home, 'workspace', 'dsm', 'data').  ###, or
# data_path = os.path.join(home, '/workspace/dsm/data/')       ### cross-platform
# data_file = os.path.join(data_path, 'data.csv')

df = pd.read_csv(data_file)  ### read the csv file into a DataFrame
df.head()                    ### display the first 5 rows of the DataFrame

	Duration	Pulse	Maxpulse	Calories
0	60	110	130	409.1
1	60	117	145	479.0
2	60	103	135	340.0
3	45	109	175	282.4
4	45	117	148	406.0

4.2.3. Data Input and Output#

This section serves as the reference code for obtaining input and output with Pandas. Pandas can read a variety of file types using its pd.read_ methods. Here we cover the most common data types:

CSV
Excel
HTML
SQL

4.2.3.1. CSV#

4.2.3.1.1. CSV Input#

### note that you may need to change the path below to point to the correct location of the example file
df = pd.read_csv('../../data/data.csv')
df

	Duration	Pulse	Maxpulse	Calories
0	60	110	130	409.1
1	60	117	145	479.0
2	60	103	135	340.0
3	45	109	175	282.4
4	45	117	148	406.0
...	...	...	...	...
160	30	85	120	250.4
161	45	90	130	260.4
162	45	95	130	270.0
163	45	100	140	280.9
164	60	105	140	290.8

165 rows × 4 columns

4.2.3.1.2. CSV Output#

df.to_csv('data.csv',index=False)
df

	Duration	Pulse	Maxpulse	Calories
0	60	110	130	409.1
1	60	117	145	479.0
2	60	103	135	340.0
3	45	109	175	282.4
4	45	117	148	406.0
...	...	...	...	...
160	30	85	120	250.4
161	45	90	130	260.4
162	45	95	130	270.0
163	45	100	140	280.9
164	60	105	140	290.8

165 rows × 4 columns

4.2.3.2. Excel#

Pandas can read and write excel files, keep in mind, this only imports data. Not formulas or images, having images or macros may cause this read_excel method to crash.

4.2.3.2.1. Excel Input#

If you see the error message:

ImportError: Missing optional dependency ‘openpyxl’. Use pip or conda to install openpyxl.

Install the dependency by issuing %pip install openpyxl

### install dependencies if needed
# %pip install openpyxl

### read Excel file
df = pd.read_excel('../../data/Excel_Sample.xlsx', sheet_name='Sheet1')
df

	Unnamed: 0.14	Unnamed: 0.13	Unnamed: 0.12	Unnamed: 0.11	Unnamed: 0.10	Unnamed: 0.9	Unnamed: 0.8	Unnamed: 0.7	Unnamed: 0.6	Unnamed: 0.5	Unnamed: 0.4	Unnamed: 0.3	Unnamed: 0.2	Unnamed: 0.1	Unnamed: 0	a	b	c	d
0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	2	3
1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	4	5	6	7
2	2	2	2	2	2	2	2	2	2	2	2	2	2	2	2	8	9	10	11
3	3	3	3	3	3	3	3	3	3	3	3	3	3	3	3	12	13	14	15

type(df)

pandas.core.frame.DataFrame

4.2.3.2.2. Excel Output#

### note this output will be in the same folder as your notebook

df.to_excel('Excel_Sample.xlsx', sheet_name='Sheet1')

4.2.3.3. HTML#

You may need to install htmllib5,lxml, and BeautifulSoup4. In your terminal/command prompt run:

conda install lxml
conda install html5lib
conda install BeautifulSoup4

Then restart Jupyter Notebook. (or use pip install if you aren’t using the Anaconda Distribution)

Pandas can read table tabs off of html. For example:

4.2.3.3.1. HTML Input#

If you see the error message:

ImportError: Missing optional dependency 'lxml'. Use pip or conda to install lxml.

Install the dependency: %pip install lxml
macOS users do this to get rid of a certificate error (4 steps in the (.venv) lines below; source ):

(.venv) [user]@[host]ː~/workspace/dsm$ cd /Applications/Python 3.12/
(.venv) [user]@[host]ː/Applications/Python 3.12$ ls
        Icon?        Install Certificates.command Python Documentation.html        ReadMe.rtf
        IDLE.app     License.rtf      Python Launcher.app Update Shell Profile.command
(.venv) [user]@[host]ː/Applications/Python 3.12$ ./Install\ Certificates.command 
    -- pip install --upgrade certifi
    Collecting certifi
      Using cached certifi-2025.8.3-py3-none-any.whl.metadata (2.4 kB)
    Using cached certifi-2025.8.3-py3-none-any.whl (161 kB)
    Installing collected packages: certifi
    Successfully installed certifi-2025.8.3
(.venv) [user]@[host]ː/Applications/Python 3.12$ cd ~/workspace/dsm

Pandas read_html function will read tables off of a webpage and return a list of DataFrame objects:

##### install dependencies if needed
# %pip install lxml
# %pip install pandas

import pandas as pd
df = pd.read_html('http://www.fdic.gov/bank/individual/failed/banklist.html')

df[0].head()

	Bank Name	City	State	Cert	Acquiring Institution	Closing Date	Fund Sort ascending
0	Community Bank and Trust - West Georgia	LaGrange	Georgia	25796	Anchor Bank	May 1, 2026	10551
1	Metropolitan Capital Bank & Trust	Chicago	Illinois	57488	First Independence Bank	January 30, 2026	10550
2	The Santa Anna National Bank	Santa Anna	Texas	5520	Coleman County State Bank	June 27, 2025	10549
3	Pulaski Savings Bank	Chicago	Illinois	28611	Millennium Bank	January 17, 2025	10548
4	The First National Bank of Lindsay	Lindsay	Oklahoma	4134	First Bank & Trust Co.	October 18, 2024	10547

4.2.3.4. SQL#

(source ) The pandas.io.sql module provides a collection of query wrappers to both facilitate data retrieval and to reduce dependency on DB-specific API. Database abstraction is provided by SQLAlchemy if installed. In addition you will need a driver library for your database. Examples of such drivers are psycopg2 for PostgreSQL or pymysql for MySQL. For SQLite this is included in Python’s standard library by default. You can find an overview of supported drivers for each SQL dialect in the SQLAlchemy docs.

If SQLAlchemy is not installed, a fallback is only provided for sqlite (and for mysql for backwards compatibility, but this is deprecated and will be removed in a future version). This mode requires a Python database adapter which respect the Python DB-API.

See also some cookbook examples for some advanced strategies.

The key functions are:

read_sql_table(table_name, con[, schema, …])
- Read SQL database table into a DataFrame.
read_sql_query(sql, con[, index_col, …])
- Read SQL query into a DataFrame.
read_sql(sql, con[, index_col, …])
- Read SQL query or database table into a DataFrame.
DataFrame.to_sql(name, con[, flavor, …])
- Write records stored in a DataFrame to a SQL database.

# %pip install SQLAlchemy
from sqlalchemy import create_engine

engine = create_engine('sqlite:///:memory:')

print(type(df))
df = pd.DataFrame(df[0])

<class 'list'>

df.to_sql('data', engine)

sql_df = pd.read_sql('data',con=engine)

sql_df.head()

	index	Bank Name	City	State	Cert	Acquiring Institution	Closing Date	Fund Sort ascending
0	0	Community Bank and Trust - West Georgia	LaGrange	Georgia	25796	Anchor Bank	May 1, 2026	10551
1	1	Metropolitan Capital Bank & Trust	Chicago	Illinois	57488	First Independence Bank	January 30, 2026	10550
2	2	The Santa Anna National Bank	Santa Anna	Texas	5520	Coleman County State Bank	June 27, 2025	10549
3	3	Pulaski Savings Bank	Chicago	Illinois	28611	Millennium Bank	January 17, 2025	10548
4	4	The First National Bank of Lindsay	Lindsay	Oklahoma	4134	First Bank & Trust Co.	October 18, 2024	10547

4.2.3.4.1. Reading from SQLite Databases#

Beyond CSV files, Pandas can read from many data sources. Let’s explore loading data from a SQLite database that we’ve used in previous SQL activities. You’ll appreciate how convenient Pandas makes this!

To connect to a SQLite database, we:

Import the sqlite3 library (part of Python’s standard library)
Connect to the database using sqlite3.connect()
Use the Pandas function read_sql_query() to execute your query
Close the connection using the close() function (or use with statement)

### List all tables in the SQLite database (similar to "SHOW TABLES;" in MySQL)

import sqlite3
conn = sqlite3.connect('../../data/chinook.db')

query = "SELECT name FROM sqlite_master WHERE type='table';"

pd.read_sql_query(query, conn)

	name
0	Album
1	Artist
2	Customer
3	Employee
4	Genre
5	Invoice
6	InvoiceLine
7	MediaType
8	Playlist
9	PlaylistTrack
10	Track

import sqlite3
conn = sqlite3.connect('../../data/chinook.db')

df_customer = pd.read_sql_query("SELECT * FROM customer;", conn)
df_customer.head()

	CustomerId	FirstName	LastName	Company	Address	City	State	Country	PostalCode	Phone	Fax	Email	SupportRepId
0	1	Luís	Gonçalves	Embraer - Empresa Brasileira de Aeronáutica S.A.	Av. Brigadeiro Faria Lima, 2170	São José dos Campos	SP	Brazil	12227-000	+55 (12) 3923-5555	+55 (12) 3923-5566	luisg@embraer.com.br	3
1	2	Leonie	Köhler	None	Theodor-Heuss-Straße 34	Stuttgart	None	Germany	70174	+49 0711 2842222	None	leonekohler@surfeu.de	5
2	3	François	Tremblay	None	1498 rue Bélanger	Montréal	QC	Canada	H2G 1A7	+1 (514) 721-4711	None	ftremblay@gmail.com	3
3	4	Bjørn	Hansen	None	Ullevålsveien 14	Oslo	None	Norway	0171	+47 22 44 22 22	None	bjorn.hansen@yahoo.no	4
4	5	František	Wichterlová	JetBrains s.r.o.	Klanova 9/506	Prague	None	Czech Republic	14700	+420 2 4172 5555	+420 2 4172 5555	frantisekw@jetbrains.com	4

# BETTER PRACTICE: Use 'with' statement to automatically close connection
# The 'with' statement ensures the connection is closed even if an error occurs

sql = 'SELECT * FROM Artist LIMIT 10'
with sqlite3.connect("../../data/chinook.db") as conn:
    df_artist = pd.read_sql_query(sql, conn)

df_artist

	ArtistId	Name
0	1	AC/DC
1	2	Accept
2	3	Aerosmith
3	4	Alanis Morissette
4	5	Alice In Chains
5	6	Antônio Carlos Jobim
6	7	Apocalyptica
7	8	Audioslave
8	9	BackBeat
9	10	Billy Cobham

	Unnamed: 0.14	Unnamed: 0.13	Unnamed: 0.12	Unnamed: 0.11	Unnamed: 0.10	Unnamed: 0.9	Unnamed: 0.8	Unnamed: 0.7	Unnamed: 0.6	Unnamed: 0.5	Unnamed: 0.4	Unnamed: 0.3	Unnamed: 0.2	Unnamed: 0.1	Unnamed: 0	a	b	c	d
0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	2	3
1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	4	5	6	7
2	2	2	2	2	2	2	2	2	2	2	2	2	2	2	2	8	9	10	11
3	3	3	3	3	3	3	3	3	3	3	3	3	3	3	3	12	13	14	15

	Unnamed: 0.14	Unnamed: 0.13	Unnamed: 0.12	Unnamed: 0.11	Unnamed: 0.10	Unnamed: 0.9	Unnamed: 0.8	Unnamed: 0.7	Unnamed: 0.6	Unnamed: 0.5	Unnamed: 0.4	Unnamed: 0.3	Unnamed: 0.2	Unnamed: 0.1	Unnamed: 0	a	b	c	d
0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	0	1	2	3
1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	1	4	5	6	7
2	2	2	2	2	2	2	2	2	2	2	2	2	2	2	2	8	9	10	11
3	3	3	3	3	3	3	3	3	3	3	3	3	3	3	3	12	13	14	15