Handling Datasets

7.2. Handling Datasets#

After learning Python data structures (e.g., list, dictionary, tuple) and NumPy arrays, we are now ready to deal with data using Pandas. There are different file types used for datasets and we are going to learn how to import and export them in this section as part of data management using Pandas.

# if you have not installed pandas through the command line
# do so by opening a new command line tab/window, then
# activate the virtual environment, and issue "pip install pandas"
# or, you may do "%pip install pandas" in a Jupyter notebook.
# do remember to comment out the installation line afterwards
# so that it would not run every time you run the cell

# we also need NumPy sometimes for processing data as it's the foundation of Pandas

# %pip install pandas
# %pip install numpy
import numpy as np
import pandas as pd

7.2.1. Accessing Dataset Files#

You can also pass the path (full or relative) and file names of your data as arguments to the Pandas read_ functions to directly read the file. However, it is recommended to use a path variable for your data directory, which will make your life easier in the long run.

7.2.1.1. Building the path variable#

To build a path variable to access your data folder in Jupyter Notebook environment, you can either use a full path or a relative path.

Full path:
- macOS (and Linux): A full/absolute path starts with “/” (the “root” directory); e.g., data_path = "/Users/<user>/workspace/dsm/data/".
- Windows: A full path includes a drive letter and backslashes (\) for path separator; e.g. data_path = "C:\Users\<user>\workspace\dsm\data\”) although forward slashes usually work as well.
Relative path: Usually it’s just easier to use a relative path to access the data folders nearby (e.g., data_path = ../../data/).
User home directory: Instead of starting from the root level, you can usually use your home directory as the starting point to build the path variable to your data directory. os.path.expanduser('~') resolve the directory passed to (e.g., ~, representing the home directory) to a full path:

### works on macOS, Linux, and Windows
import os
home_dir = os.path.expanduser('~') # get the user's home directory 

data_path = os.path.expanduser(home_dir + '/workspace/dsm/data')
print("data_path 1: ", data_path)

data_path = os.path.expanduser('~/workspace/dsm/data')  # this also works on macOS and Linux
print("data_path 2: ", data_path)

data_path 1:  /Users/tcn85/workspace/dsm/data
data_path 2:  /Users/tcn85/workspace/dsm/data

7.2.1.2. Join paths#

After you have your path variable (e.g. data_path) to your data directory, you join the path variable with the paths/filenames using os.path.join() to access the dataset file. It can take multiple arguments like os.path.join("folder1", "folder2", "file.txt"). This way you don’t have to worry about the path separator (”/” or “\”).

### join the path and the filename
data_file = os.path.join(data_path, "data.csv")
print("data file once: ", data_file)  # print the full path to the data file

### os.path.join() is flexible:
data_file = os.path.join(home_dir, "workspace", "dsm", "data", "data.csv")
print("data file twice: ", data_file)  # print the full path to the data file

data file once:  /Users/tcn85/workspace/dsm/data/data.csv
data file twice:  /Users/tcn85/workspace/dsm/data/data.csv

data_file = os.path.join(data_path, 'data.csv' )
data_file

'/Users/tcn85/workspace/dsm/data/data.csv'

7.2.1.3. Using relative path#

To use a relative path to access a dataset file in another folder (e.g., “data”), you need to know the relative location of the data folder and usually we use .. to move up in the directory structure when needed. .. means move up one level in the directory tree. For example, to access the data directory (see figure below):

If you are working on the YOU-ARE-HERE.ipynb notebook , you need to go up one level (..) above the notebooks directory, then go down into the data directory:

data_path = pd.read_csv('../data/')
If you are working on the YOU-MAY-BE-HERE.ipynb notebook, you need to move up two levels (folder and notebooks, or ../..), then go down into the data directory: data_path = pd.read_csv('../../data/')


(.venv) PS C:\Users\[user]\workspace\dsm> tree
Folder PATH listing
Volume serial number is 420C-3FF4
C:.
├───.ipynb_checkpoints
├───.venv
├── data
│   ├── actors.csv
│   ├── example
│   └── wine.csv
├── notebooks
│   ├── YOU-ARE-HERE.ipynb
│   └── folder
│       └── YOU-MAY-BE-HERE.ipynb


  [user]@[host]ː~/workspace/dsm$ tree
.
├── data
│   ├── actors.csv
│   ├── example
│   └── wine.csv
├── notebooks
│   ├── YOU-ARE-HERE.ipynb
│   └── folder
│       └── YOU-MAY-BE-HERE.ipynb

You can see that a relative path can be easier to setup for accessing data if you know the relative position of the data directory to your current working directory (CWD). Often time we even use it in an ad hoc manner without building a folder path variable but accessing the datasets directly:

### You need to adjust the path depth (i.e., how many ../.. you need) 
### /data/data.csv
### ../data/data.csv
### ../../data/data.csv
### ../../../data/data.csv

data_file = ("../data/data.csv")
df = pd.read_csv(data_file)

# just to show the path variables we used
data_path

'/Users/tcn85/workspace/dsm/data'

7.2.2. Reading Dataset Files#

Now let us read the dataset file (data_file) using Pandas pd.read_csv( ). You may read the data file directly or and open it as a file and save it a variable so you can manipulate the data file.

# home = os.path.expanduser('~')
# data_path = os.path.join(home, 'workspace', 'dsm', 'data').  ###, or
# data_path = os.path.join(home, '/workspace/dsm/data/')       ### cross-platform
# data_file = os.path.join(data_path, 'data.csv')

df = pd.read_csv(data_file)  ### read the csv file into a DataFrame
df.head()                    ### display the first 5 rows of the DataFrame

	Duration	Pulse	Maxpulse	Calories
0	60	110	130	409.1
1	60	117	145	479
2	60	103	135	340
3	45	109	175	282.4
4	45	117	148	406

To access your dataset files using the Pandas read_ functions, the path variable and the path.join function can be used together to pass the path/file as arguments:

### path + filename

df = pd.read_csv(data_path + "/data.csv")

### or using os.path.join(path variable, filename)

data_file = os.path.join(data_path, "data.csv")
df = pd.read_csv(data_file)

### use the with statement to open the file
### if you need to manipulate the file object

with open(data_path+"/data.csv") as f:
    df = pd.read_csv(f)

### alternatively, change directory into the data folder 
### and read the dataset file from there.

os.chdir(data_path)
df = pd.read_csv("data.csv")
df.describe()

	Duration	Pulse	Maxpulse	Calories
count	339	339	339	329
unique	17	48	58	146
top	60	100	120	300
freq	158	38	32	10

7.2.3. Data Input and Output#

This section serves as the reference code for obtaining input and output with Pandas. Pandas can read a variety of file types using its pd.read_ methods. Here we cover the most common data types:

CSV
Excel
HTML
SQL

7.2.3.1. CSV#

7.2.3.1.1. CSV Input#

### note that you may need to change the path below to point to the correct location of the example file
df = pd.read_csv('../data/example')
df

	a	b	c	d
0	0	1	2	3
1	4	5	6	7
2	8	9	10	11
3	12	13	14	15

7.2.3.1.2. CSV Output#

df.to_csv('example',index=False)
df

	a	b	c	d
0	0	1	2	3
1	4	5	6	7
2	8	9	10	11
3	12	13	14	15

7.2.3.2. Excel#

Pandas can read and write excel files, keep in mind, this only imports data. Not formulas or images, having images or macros may cause this read_excel method to crash.

7.2.3.2.1. Excel Input#

If you see the error message:

ImportError: Missing optional dependency ‘openpyxl’. Use pip or conda to install openpyxl.

Install the dependency by issuing %pip install openpyxl

### install dependencies if needed
# %pip install openpyxl

### read Excel file
df = pd.read_excel('../data/Excel_Sample.xlsx', sheet_name='Sheet1')
df

	Unnamed: 0.1	Unnamed: 0	a	b	c	d
0	0	0	0	1	2	3
1	1	1	4	5	6	7
2	2	2	8	9	10	11
3	3	3	12	13	14	15

type(df)

pandas.core.frame.DataFrame

7.2.3.2.2. Excel Output#

### note this output will be in the same folder as your notebook

df.to_excel('Excel_Sample.xlsx', sheet_name='Sheet1')

7.2.3.3. HTML#

You may need to install htmllib5,lxml, and BeautifulSoup4. In your terminal/command prompt run:

conda install lxml
conda install html5lib
conda install BeautifulSoup4

Then restart Jupyter Notebook. (or use pip install if you aren’t using the Anaconda Distribution)

Pandas can read table tabs off of html. For example:

7.2.3.3.1. HTML Input#

If you see the error message:

ImportError: Missing optional dependency 'lxml'. Use pip or conda to install lxml.

Install the dependency: %pip install lxml
macOS users do this to get rid of a certificate error (4 steps in the (.venv) lines:
```
(.venv) [user]@[host]ː~/workspace/dsm$ cd /Applications/Python 3.12/
```

(.venv) [user]@[host]ː/Applications/Python 3.12$ ls Icon? Install Certificates.command Python Documentation.html ReadMe.rtf IDLE.app License.rtf Python Launcher.app Update Shell Profile.command (.venv) [user]@[host]ː/Applications/Python 3.12$ ./Install\ Certificates.command – pip install –upgrade certifi Collecting certifi Using cached certifi-2025.8.3-py3-none-any.whl.metadata (2.4 kB) Using cached certifi-2025.8.3-py3-none-any.whl (161 kB) Installing collected packages: certifi Successfully installed certifi-2025.8.3 (.venv) [user]@[host]ː/Applications/Python 3.12$ cd ~/workspace/dsm ```

Pandas read_html function will read tables off of a webpage and return a list of DataFrame objects:

##### install dependencies if needed
# %pip install lxml
# %pip install pandas

import pandas as pd
df = pd.read_html('http://www.fdic.gov/bank/individual/failed/banklist.html')

df[0]

	Bank Name	City	State	Cert	Acquiring Institution	Closing Date	Fund Sort ascending
0	The Santa Anna National Bank	Santa Anna	Texas	5520	Coleman County State Bank	June 27, 2025	10549
1	Pulaski Savings Bank	Chicago	Illinois	28611	Millennium Bank	January 17, 2025	10548
2	The First National Bank of Lindsay	Lindsay	Oklahoma	4134	First Bank & Trust Co.	October 18, 2024	10547
3	Republic First Bank dba Republic Bank	Philadelphia	Pennsylvania	27332	Fulton Bank, National Association	April 26, 2024	10546
4	Citizens Bank	Sac City	Iowa	8758	Iowa Trust & Savings Bank	November 3, 2023	10545
5	Heartland Tri-State Bank	Elkhart	Kansas	25851	Dream First Bank, N.A.	July 28, 2023	10544
6	First Republic Bank	San Francisco	California	59017	JPMorgan Chase Bank, N.A.	May 1, 2023	10543
7	Signature Bank	New York	New York	57053	Flagstar Bank, N.A.	March 12, 2023	10540
8	Silicon Valley Bank	Santa Clara	California	24735	First Citizens Bank & Trust Company	March 10, 2023	10539
9	Almena State Bank	Almena	Kansas	15426	Equity Bank	October 23, 2020	10538
10	First City Bank of Florida	Fort Walton Beach	Florida	16748	United Fidelity Bank, fsb	October 16, 2020	10537
11	The First State Bank	Barboursville	West Virginia	14361	MVB Bank, Inc.	April 3, 2020	10536
12	Ericson State Bank	Ericson	Nebraska	18265	Farmers and Merchants Bank	February 14, 2020	10535
13	City National Bank of New Jersey	Newark	New Jersey	21111	Industrial Bank	November 1, 2019	10534
14	Resolute Bank	Maumee	Ohio	58317	Buckeye State Bank	October 25, 2019	10533
15	Louisa Community Bank	Louisa	Kentucky	58112	Kentucky Farmers Bank Corporation	October 25, 2019	10532
16	The Enloe State Bank	Cooper	Texas	10716	Legend Bank, N. A.	May 31, 2019	10531
17	Washington Federal Bank for Savings	Chicago	Illinois	30570	Royal Savings Bank	December 15, 2017	10530
18	The Farmers and Merchants State Bank of Argonia	Argonia	Kansas	17719	Conway Bank	October 13, 2017	10529
19	Fayette County Bank	Saint Elmo	Illinois	1802	United Fidelity Bank, fsb	May 26, 2017	10528
20	Guaranty Bank, (d/b/a BestBank in Georgia & Mi...	Milwaukee	Wisconsin	30003	First-Citizens Bank & Trust Company	May 5, 2017	10527
21	First NBC Bank	New Orleans	Louisiana	58302	Whitney Bank	April 28, 2017	10526
22	Proficio Bank	Cottonwood Heights	Utah	35495	Cache Valley Bank	March 3, 2017	10525
23	Seaway Bank and Trust Company	Chicago	Illinois	19328	State Bank of Texas	January 27, 2017	10524
24	Harvest Community Bank	Pennsville	New Jersey	34951	First-Citizens Bank & Trust Company	January 13, 2017	10523

7.2.3.4. SQL#

(source ) The pandas.io.sql module provides a collection of query wrappers to both facilitate data retrieval and to reduce dependency on DB-specific API. Database abstraction is provided by SQLAlchemy if installed. In addition you will need a driver library for your database. Examples of such drivers are psycopg2 for PostgreSQL or pymysql for MySQL. For SQLite this is included in Python’s standard library by default. You can find an overview of supported drivers for each SQL dialect in the SQLAlchemy docs.

If SQLAlchemy is not installed, a fallback is only provided for sqlite (and for mysql for backwards compatibility, but this is deprecated and will be removed in a future version). This mode requires a Python database adapter which respect the Python DB-API.

See also some cookbook examples for some advanced strategies.

The key functions are:

read_sql_table(table_name, con[, schema, …])
- Read SQL database table into a DataFrame.
read_sql_query(sql, con[, index_col, …])
- Read SQL query into a DataFrame.
read_sql(sql, con[, index_col, …])
- Read SQL query or database table into a DataFrame.
DataFrame.to_sql(name, con[, flavor, …])
- Write records stored in a DataFrame to a SQL database.

# %pip install SQLAlchemy
from sqlalchemy import create_engine

engine = create_engine('sqlite:///:memory:')

print(type(df))
df = pd.DataFrame(df[0])

<class 'list'>

df.to_sql('data', engine)

sql_df = pd.read_sql('data',con=engine)

sql_df

	index	Bank Name	City	State	Cert	Acquiring Institution	Closing Date	Fund Sort ascending
0	0	The Santa Anna National Bank	Santa Anna	Texas	5520	Coleman County State Bank	June 27, 2025	10549
1	1	Pulaski Savings Bank	Chicago	Illinois	28611	Millennium Bank	January 17, 2025	10548
2	2	The First National Bank of Lindsay	Lindsay	Oklahoma	4134	First Bank & Trust Co.	October 18, 2024	10547
3	3	Republic First Bank dba Republic Bank	Philadelphia	Pennsylvania	27332	Fulton Bank, National Association	April 26, 2024	10546
4	4	Citizens Bank	Sac City	Iowa	8758	Iowa Trust & Savings Bank	November 3, 2023	10545
5	5	Heartland Tri-State Bank	Elkhart	Kansas	25851	Dream First Bank, N.A.	July 28, 2023	10544
6	6	First Republic Bank	San Francisco	California	59017	JPMorgan Chase Bank, N.A.	May 1, 2023	10543
7	7	Signature Bank	New York	New York	57053	Flagstar Bank, N.A.	March 12, 2023	10540
8	8	Silicon Valley Bank	Santa Clara	California	24735	First Citizens Bank & Trust Company	March 10, 2023	10539
9	9	Almena State Bank	Almena	Kansas	15426	Equity Bank	October 23, 2020	10538
10	10	First City Bank of Florida	Fort Walton Beach	Florida	16748	United Fidelity Bank, fsb	October 16, 2020	10537
11	11	The First State Bank	Barboursville	West Virginia	14361	MVB Bank, Inc.	April 3, 2020	10536
12	12	Ericson State Bank	Ericson	Nebraska	18265	Farmers and Merchants Bank	February 14, 2020	10535
13	13	City National Bank of New Jersey	Newark	New Jersey	21111	Industrial Bank	November 1, 2019	10534
14	14	Resolute Bank	Maumee	Ohio	58317	Buckeye State Bank	October 25, 2019	10533
15	15	Louisa Community Bank	Louisa	Kentucky	58112	Kentucky Farmers Bank Corporation	October 25, 2019	10532
16	16	The Enloe State Bank	Cooper	Texas	10716	Legend Bank, N. A.	May 31, 2019	10531
17	17	Washington Federal Bank for Savings	Chicago	Illinois	30570	Royal Savings Bank	December 15, 2017	10530
18	18	The Farmers and Merchants State Bank of Argonia	Argonia	Kansas	17719	Conway Bank	October 13, 2017	10529
19	19	Fayette County Bank	Saint Elmo	Illinois	1802	United Fidelity Bank, fsb	May 26, 2017	10528
20	20	Guaranty Bank, (d/b/a BestBank in Georgia & Mi...	Milwaukee	Wisconsin	30003	First-Citizens Bank & Trust Company	May 5, 2017	10527
21	21	First NBC Bank	New Orleans	Louisiana	58302	Whitney Bank	April 28, 2017	10526
22	22	Proficio Bank	Cottonwood Heights	Utah	35495	Cache Valley Bank	March 3, 2017	10525
23	23	Seaway Bank and Trust Company	Chicago	Illinois	19328	State Bank of Texas	January 27, 2017	10524
24	24	Harvest Community Bank	Pennsville	New Jersey	34951	First-Citizens Bank & Trust Company	January 13, 2017	10523