7.3. Pandas DataFrames#
DataFrames are the core data structure in Pandas, modeled after similar structures in the R programming language. A DataFrame can be thought of as a collection of Series objects combined into a single table, all sharing the same index. Let’s use Pandas to explore how DataFrames work!
import numpy as np
import pandas as pd
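To make the idea of "a collection of Series sharing one index" concrete, here is a small sketch (the column names and values are made up for illustration):
### two Series that share the same index
s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["a", "b", "c"])
### combining them column-wise gives a DataFrame with that shared index
pd.DataFrame({"x": s1, "y": s2})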
7.3.1. Reading Datasets#
Following the instructions in Handling Dataset, you should be able to access the dataset files in your data directory.
7.3.1.1. read_csv() to read a file#
The most common way to read a file is to use read_csv().
### use forward slashes in the path; Windows will resolve them correctly
df = pd.read_csv("../../data/data.csv")
df
Duration | Pulse | Maxpulse | Calories | |
---|---|---|---|---|
0 | 60 | 110 | 130 | 409.1 |
1 | 60 | 117 | 145 | 479 |
2 | 60 | 103 | 135 | 340 |
3 | 45 | 109 | 175 | 282.4 |
4 | 45 | 117 | 148 | 406 |
... | ... | ... | ... | ... |
334 | 60 | 105 | 140 | 290.8 |
335 | 60 | 110 | 145 | 300.4 |
336 | 60 | 115 | 145 | 310.2 |
337 | 75 | 120 | 150 | 320.4 |
338 | 75 | 125 | 150 | 330.4 |
339 rows × 4 columns
Now that we have the DataFrame object, let's use df.info() to get:
- the data type of each column
- the number of non-null values in each column
- which columns contain null values
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 339 entries, 0 to 338
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Duration 339 non-null object
1 Pulse 339 non-null object
2 Maxpulse 339 non-null object
3 Calories 329 non-null object
dtypes: object(4)
memory usage: 10.7+ KB
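Notice that all four columns are reported with dtype object even though the values look numeric, which usually means the raw CSV contains some non-numeric entries. If you want numeric dtypes, one option (a sketch, not part of the original notebook) is pd.to_numeric with errors='coerce':
### convert every column to a numeric dtype; entries that cannot be parsed become NaN
df_numeric = df.apply(pd.to_numeric, errors="coerce")
df_numeric.info() ### dtypes should now be numeric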
7.3.1.2. df Output#
There are several ways to display a DataFrame:
- df: the default table-like view, showing head(5) and tail(5)
- print(df): the same information as plain text
- df.to_string(): render the whole DataFrame
- pd.options.display.max_rows: controls how many rows are shown
df ### default view is table-like
### Pandas will give 5+5 == 10 rows by default
### the shape information is at the bottom
Duration | Pulse | Maxpulse | Calories | |
---|---|---|---|---|
0 | 60 | 110 | 130 | 409.1 |
1 | 60 | 117 | 145 | 479 |
2 | 60 | 103 | 135 | 340 |
3 | 45 | 109 | 175 | 282.4 |
4 | 45 | 117 | 148 | 406 |
... | ... | ... | ... | ... |
334 | 60 | 105 | 140 | 290.8 |
335 | 60 | 110 | 145 | 300.4 |
336 | 60 | 115 | 145 | 310.2 |
337 | 75 | 120 | 150 | 320.4 |
338 | 75 | 125 | 150 | 330.4 |
339 rows × 4 columns
print(df) ### print(), same information but not as pretty
Duration Pulse Maxpulse Calories
0 60 110 130 409.1
1 60 117 145 479
2 60 103 135 340
3 45 109 175 282.4
4 45 117 148 406
.. ... ... ... ...
334 60 105 140 290.8
335 60 110 145 300.4
336 60 115 145 310.2
337 75 120 150 320.4
338 75 125 150 330.4
[339 rows x 4 columns]
Sometimes you need to see more than the default output. You can adjust the pd.options.display.max_rows option.
print(pd.options.display.max_rows) ### max_rows is 60 by default (may vary by environment)
60
pd.options.display.max_rows = 100 ### increase max_rows
### this line is commented out to avoid overwhelming output
### you can uncomment it to see the full output
# print(df.to_string()) ### to_string() lets you print the whole df
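If you only need the larger display temporarily, Pandas also provides pd.option_context, which restores the previous setting when the block exits. A minimal sketch:
### temporarily raise the row limit only inside this block
with pd.option_context("display.max_rows", 100):
    print(df) ### up to 100 rows are shown here
### outside the block, max_rows reverts to its previous value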
7.3.1.3. DataFrame from SQLite#
Now, let us try to load the chinook SQLite database that we used in the SQL database activity. You will appreciate how convenient Pandas makes loading data.
To connect to a sqlite3 database, we:
- import the sqlite3 library (the sqlite3 module is part of the Python standard library)
- connect to the database by calling the sqlite3.connect() function
- use the Pandas function read_sql_query() to run your query
- close the connection using the sqlite3 close() function
### to check all tables just like "SHOW TABLES;" in MySQL
import sqlite3
conn = sqlite3.connect('../../data/chinook.db')
query = "SELECT name FROM sqlite_master WHERE type='table';"
pd.read_sql_query(query, conn)
name | |
---|---|
0 | Album |
1 | Artist |
2 | Customer |
3 | Employee |
4 | Genre |
5 | Invoice |
6 | InvoiceLine |
7 | MediaType |
8 | Playlist |
9 | PlaylistTrack |
10 | Track |
import sqlite3
conn = sqlite3.connect('../../data/chinook.db')
df = pd.read_sql_query("SELECT * FROM customer;", conn)
df.head()
CustomerId | FirstName | LastName | Company | Address | City | State | Country | PostalCode | Phone | Fax | SupportRepId | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Luís | Gonçalves | Embraer - Empresa Brasileira de Aeronáutica S.A. | Av. Brigadeiro Faria Lima, 2170 | São José dos Campos | SP | Brazil | 12227-000 | +55 (12) 3923-5555 | +55 (12) 3923-5566 | luisg@embraer.com.br | 3 |
1 | 2 | Leonie | Köhler | None | Theodor-Heuss-Straße 34 | Stuttgart | None | Germany | 70174 | +49 0711 2842222 | None | leonekohler@surfeu.de | 5 |
2 | 3 | François | Tremblay | None | 1498 rue Bélanger | Montréal | QC | Canada | H2G 1A7 | +1 (514) 721-4711 | None | ftremblay@gmail.com | 3 |
3 | 4 | Bjørn | Hansen | None | Ullevålsveien 14 | Oslo | None | Norway | 0171 | +47 22 44 22 22 | None | bjorn.hansen@yahoo.no | 4 |
4 | 5 | František | Wichterlová | JetBrains s.r.o. | Klanova 9/506 | Prague | None | Czech Republic | 14700 | +420 2 4172 5555 | +420 2 4172 5555 | frantisekw@jetbrains.com | 4 |
### you need to close the connection
conn.close()
### or use a with block; note that sqlite3's context manager only commits/rolls back the transaction, it does not close the connection
sql = 'SELECT * FROM Artist LIMIT 10'
with sqlite3.connect("../../data/chinook.db") as conn:
df = pd.read_sql_query(sql, conn)
df
ArtistId | Name | |
---|---|---|
0 | 1 | AC/DC |
1 | 2 | Accept |
2 | 3 | Aerosmith |
3 | 4 | Alanis Morissette |
4 | 5 | Alice In Chains |
5 | 6 | Antônio Carlos Jobim |
6 | 7 | Apocalyptica |
7 | 8 | Audioslave |
8 | 9 | BackBeat |
9 | 10 | Billy Cobham |
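If you want the connection to be closed automatically as well, one possibility (a sketch, not from the original notebook) is to wrap the connection in contextlib.closing:
from contextlib import closing
### closing() guarantees conn.close() runs when the block exits
with closing(sqlite3.connect("../../data/chinook.db")) as conn:
    artists = pd.read_sql_query("SELECT * FROM Artist LIMIT 5", conn)
artists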
7.3.2. Creating DataFrames#
7.3.2.1. Using pd.DataFrame#
pd.DataFrame creates a DataFrame object.
### create the index object. You see that the pd.date_range() function gives
### a range of days just like a regular range() function
# dates = pd.date_range() ### shift tab to see doc
dates = pd.date_range("20250901", periods=6)
### evaluate variable dates; dtype is datetime64 and freq='D'
dates
DatetimeIndex(['2025-09-01', '2025-09-02', '2025-09-03', '2025-09-04',
'2025-09-05', '2025-09-06'],
dtype='datetime64[ns]', freq='D')
### check data type of object => index
type(dates)
pandas.core.indexes.datetimes.DatetimeIndex
### turning a string into a list => column labels
list("ABCD")
['A', 'B', 'C', 'D']
### use pd.DataFrame( ) to create a DataFrame object
# pd.DataFrame ### shift tab to see doc
np.random.seed(101) ### seed keeps the "random" numbers constant
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
df
A | B | C | D | |
---|---|---|---|---|
2025-09-01 | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
2025-09-02 | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
2025-09-03 | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
2025-09-04 | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
2025-09-05 | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
2025-09-06 | 0.302665 | 1.693723 | -1.706086 | -1.159119 |
### check data type: pd.DataFrame() (like pd.read_sql_query()) returns a DataFrame
type(df)
pandas.core.frame.DataFrame
7.3.3. DataFrame Indexing#
In this section, we examine the primary techniques for retrieving data stored within a Pandas DataFrame. Mastery of these methods is essential, as they allow us to isolate rows, columns, or specific values for further analysis.
7.3.3.1. Random Seed#
from numpy.random import randn ### we can import random to get randint & rand
np.random.seed(101)
randn(5, 4)
array([[ 2.70684984, 0.62813271, 0.90796945, 0.50382575],
[ 0.65111795, -0.31931804, -0.84807698, 0.60596535],
[-2.01816824, 0.74012206, 0.52881349, -0.58900053],
[ 0.18869531, -0.75887206, -0.93323722, 0.95505651],
[ 0.19079432, 1.97875732, 2.60596728, 0.68350889]])
### do it again
np.random.seed(101)
randn(5,4)
array([[ 2.70684984, 0.62813271, 0.90796945, 0.50382575],
[ 0.65111795, -0.31931804, -0.84807698, 0.60596535],
[-2.01816824, 0.74012206, 0.52881349, -0.58900053],
[ 0.18869531, -0.75887206, -0.93323722, 0.95505651],
[ 0.19079432, 1.97875732, 2.60596728, 0.68350889]])
### seed not set in this cell, so the numbers differ from the seeded runs above
randn(5,4)
array([[ 0.30266545, 1.69372293, -1.70608593, -1.15911942],
[-0.13484072, 0.39052784, 0.16690464, 0.18450186],
[ 0.80770591, 0.07295968, 0.63878701, 0.3296463 ],
[-0.49710402, -0.7540697 , -0.9434064 , 0.48475165],
[-0.11677332, 1.9017548 , 0.23812696, 1.99665229]])
7.3.3.2. Create DataFrame#
### DataFrame
np.random.seed(101)
### parameters: data, index, columns
df = pd.DataFrame(randn(5,4),index='A B C D E'.split(),columns='W X Y Z'.split()) ### or list('WXYZ')
df
W | X | Y | Z | |
---|---|---|---|---|
A | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
7.3.3.3. Column Indexing#
7.3.3.3.1. Using column label#
### same bracket notation for indexing
df['W']
A 2.706850
B 0.651118
C -2.018168
D 0.188695
E 0.190794
Name: W, dtype: float64
### attribute access works but is NOT RECOMMENDED (it breaks if the column name clashes with a DataFrame attribute or contains spaces)
df.W
A 2.706850
B 0.651118
C -2.018168
D 0.188695
E 0.190794
Name: W, dtype: float64
7.3.3.3.2. Using a list of column labels#
# Pass a list of column names
df[['W','Z']]
W | Z | |
---|---|---|
A | 2.706850 | 0.503826 |
B | 0.651118 | 0.605965 |
C | -2.018168 | -0.589001 |
D | 0.188695 | 0.955057 |
E | 0.190794 | 0.683509 |
DataFrame Columns are just Series
type(df['W'])
pandas.core.series.Series
7.3.3.4. Row Indexing#
Access DataFrame rows with loc and iloc. Selecting a single row returns a Series.
df.loc['A']
W 2.706850
X 0.628133
Y 0.907969
Z 0.503826
Name: A, dtype: float64
ser = df.loc['A']
type(ser)
pandas.core.series.Series
Or select based on position instead of label with iloc:
df.iloc[2]
W -2.018168
X 0.740122
Y 0.528813
Z -0.589001
Name: C, dtype: float64
7.3.3.5. Selecting a subset of rows and columns#
df
W | X | Y | Z | |
---|---|---|---|---|
A | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
df.loc['B','Y']
np.float64(-0.8480769834036315)
df.loc[['A','B'],['W','Y']]
W | Y | |
---|---|---|
A | 2.706850 | 0.907969 |
B | 0.651118 | -0.848077 |
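Positional slicing works the same way with .iloc; a quick sketch (as in NumPy, the end position is excluded):
### rows 0-1 (A and B) and columns 1-2 (X and Y), selected by position
df.iloc[0:2, 1:3]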
7.3.3.6. Conditional Selection#
An important feature of Pandas is conditional selection using bracket notation, very similar to NumPy:
df
W | X | Y | Z | |
---|---|---|---|---|
A | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
df>0
W | X | Y | Z | |
---|---|---|---|---|
A | True | True | True | True |
B | True | False | False | True |
C | False | True | True | False |
D | True | False | False | True |
E | True | True | True | True |
df[df>0]
W | X | Y | Z | |
---|---|---|---|---|
A | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
B | 0.651118 | NaN | NaN | 0.605965 |
C | NaN | 0.740122 | 0.528813 | NaN |
D | 0.188695 | NaN | NaN | 0.955057 |
E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
df[df['W']>0]
W | X | Y | Z | |
---|---|---|---|---|
A | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
df[df['W']>0]['Y']
A 0.907969
B -0.848077
D -0.933237
E 2.605967
Name: Y, dtype: float64
df[df['W']>0][['Y','X']]
Y | X | |
---|---|---|
A | 0.907969 | 0.628133 |
B | -0.848077 | -0.319318 |
D | -0.933237 | -0.758872 |
E | 2.605967 | 1.978757 |
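The same selection can also be written as a single .loc call, which avoids chained indexing (a sketch):
### the boolean mask selects the rows, the list selects the columns
df.loc[df['W'] > 0, ['Y', 'X']]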
For two conditions you can use | and & with parentheses:
df[(df['W']>0) & (df['Y'] > 1)]
W | X | Y | Z | |
---|---|---|---|---|
E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
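An OR condition works the same way with |; for example, rows where either condition holds (a quick sketch, output not shown):
### rows where W is positive OR Y is greater than 1
df[(df['W'] > 0) | (df['Y'] > 1)]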
7.3.3.7. Row/Column Operations#
7.3.3.7.1. Creating a new column#
Note that this is in-place.
df['new'] = df['W'] + df['Y']
df
W | X | Y | Z | new | |
---|---|---|---|---|---|
A | 2.706850 | 0.628133 | 0.907969 | 0.503826 | 3.614819 |
B | 0.651118 | -0.319318 | -0.848077 | 0.605965 | -0.196959 |
C | -2.018168 | 0.740122 | 0.528813 | -0.589001 | -1.489355 |
D | 0.188695 | -0.758872 | -0.933237 | 0.955057 | -0.744542 |
E | 0.190794 | 1.978757 | 2.605967 | 0.683509 | 2.796762 |
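If you prefer not to modify df in place, DataFrame.assign returns a new DataFrame with the extra column (a sketch, computing the same W + Y sum):
### df itself is left unchanged; the result is a copy that carries the 'new' column
df.assign(new=df['W'] + df['Y'])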
7.3.3.7.2. Dropping columns#
Use the drop function. Note that you have to specify axis=1 to drop a column by its label.
# df['new'] = df['W'] + df['Y']
df.drop('new',axis=1)
W | X | Y | Z | |
---|---|---|---|---|
A | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
# Not inplace unless specified!
df
W | X | Y | Z | new | |
---|---|---|---|---|---|
A | 2.706850 | 0.628133 | 0.907969 | 0.503826 | 3.614819 |
B | 0.651118 | -0.319318 | -0.848077 | 0.605965 | -0.196959 |
C | -2.018168 | 0.740122 | 0.528813 | -0.589001 | -1.489355 |
D | 0.188695 | -0.758872 | -0.933237 | 0.955057 | -0.744542 |
E | 0.190794 | 1.978757 | 2.605967 | 0.683509 | 2.796762 |
Most DataFrame methods require inplace=True to perform the operation in place.
df.drop('new',axis=1,inplace=True)
df
W | X | Y | Z | |
---|---|---|---|---|
A | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
### new is gone
df
W | X | Y | Z | |
---|---|---|---|---|
A | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
7.3.3.7.3. Dropping rows#
Note that rows are dropped with axis=0 (the default).
df.drop('E', axis=0)
W | X | Y | Z | |
---|---|---|---|---|
A | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
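If remembering axis numbers feels awkward, drop also accepts the index= and columns= keywords (a sketch, equivalent to the axis-based calls above):
### equivalent to df.drop('E', axis=0) and df.drop('W', axis=1)
df.drop(index='E')
df.drop(columns=['W'])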
7.3.4. More Index Details#
Let’s discuss some more features of indexing, including resetting the index or setting it to something else. We’ll also talk about index hierarchy!
df
W | X | Y | Z | |
---|---|---|---|---|
A | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
# Reset to default 0,1...n index
df.reset_index()
index | W | X | Y | Z | |
---|---|---|---|---|---|
0 | A | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
1 | B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
2 | C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
3 | D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
4 | E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
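By default reset_index keeps the old index as a new column named index; if you just want to discard it, drop=True is available (a sketch):
### throw away the old A-E labels instead of keeping them as a column
df.reset_index(drop=True)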
newind = 'CA NY WY OR CO'.split()
df['States'] = newind
df
W | X | Y | Z | States | |
---|---|---|---|---|---|
A | 2.706850 | 0.628133 | 0.907969 | 0.503826 | CA |
B | 0.651118 | -0.319318 | -0.848077 | 0.605965 | NY |
C | -2.018168 | 0.740122 | 0.528813 | -0.589001 | WY |
D | 0.188695 | -0.758872 | -0.933237 | 0.955057 | OR |
E | 0.190794 | 1.978757 | 2.605967 | 0.683509 | CO |
df.set_index('States')
W | X | Y | Z | |
---|---|---|---|---|
States | ||||
CA | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
NY | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
WY | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
OR | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
CO | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
df
W | X | Y | Z | States | |
---|---|---|---|---|---|
A | 2.706850 | 0.628133 | 0.907969 | 0.503826 | CA |
B | 0.651118 | -0.319318 | -0.848077 | 0.605965 | NY |
C | -2.018168 | 0.740122 | 0.528813 | -0.589001 | WY |
D | 0.188695 | -0.758872 | -0.933237 | 0.955057 | OR |
E | 0.190794 | 1.978757 | 2.605967 | 0.683509 | CO |
df.set_index('States',inplace=True)
df
W | X | Y | Z | |
---|---|---|---|---|
States | ||||
CA | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
NY | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
WY | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
OR | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
CO | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
7.3.5. Multi-Index and Index Hierarchy#
Hierarchical Indexing (or MultiIndex) gives us multiple levels of row or column labels in a DataFrame, which is very useful for working with grouped or multidimensional data.
Let’s work with a MultiIndex. First, we’ll create a quick example of what a multi-indexed DataFrame looks like using pd.MultiIndex.from_tuples().
7.3.5.1. Creating multi-level index#
### index levels
outside = ['G1','G1','G1','G2','G2','G2']
inside = [1,2,3,1,2,3]
### In Python, zip is a built-in function that lets you combine
### two or more lists (or other iterables) element-by-element into pairs (or tuples).
### list1 = [1, 2, 3, 4]
### list2 = ['a', 'b']
### print(list(zip(list1, list2))) ==> [(1, 'a'), (2, 'b')]
hier_index = list(zip(outside,inside))
hier_index = pd.MultiIndex.from_tuples(hier_index)
hier_index ### now a MultiIndex built from the list of tuples
MultiIndex([('G1', 1),
('G1', 2),
('G1', 3),
('G2', 1),
('G2', 2),
('G2', 3)],
)
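You can also attach the level names at creation time via the names parameter of from_tuples (a small sketch, not used in the original notebook):
### the same index, with the level names set up front
pd.MultiIndex.from_tuples(list(zip(outside, inside)), names=['Group', 'Num'])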
7.3.5.2. Creating MultiIndex DataFrame#
Pass hier_index to the index parameter and specify the column labels.
df = pd.DataFrame(np.random.randn(6,2),index=hier_index,columns=['A','B'])
df
A | B | ||
---|---|---|---|
Group | Num | ||
G1 | 1 | -0.497104 | -0.754070 |
2 | -0.943406 | 0.484752 | |
3 | -0.116773 | 1.901755 | |
G2 | 1 | 0.238127 | 1.996652 |
2 | -0.993263 | 0.196800 | |
3 | -1.136645 | 0.000366 |
Now let’s show how to index this! For an index hierarchy we use df.loc[]; if this were on the columns axis, you would just use normal bracket notation df[]. Selecting one level of the index returns the sub-DataFrame:
df.loc['G1']
A | B | |
---|---|---|
1 | 0.302665 | 1.693723 |
2 | -1.706086 | -1.159119 |
3 | -0.134841 | 0.390528 |
df.loc['G1'].loc[1]
A 0.302665
B 1.693723
Name: 1, dtype: float64
df.loc['G2'].loc[2]['B']
np.float64(0.07295967531703869)
7.3.5.3. Changing group names#
df.index.names
FrozenList(['Group', 'Num'])
df.index.names = ['Group','Num']
df
A | B | ||
---|---|---|---|
Group | Num | ||
G1 | 1 | 0.302665 | 1.693723 |
2 | -1.706086 | -1.159119 | |
3 | -0.134841 | 0.390528 | |
G2 | 1 | 0.166905 | 0.184502 |
2 | 0.807706 | 0.072960 | |
3 | 0.638787 | 0.329646 |
7.3.5.4. .xs() method#
The .xs() method (short for cross-section) is a convenient way to select rows or columns from a Pandas DataFrame, especially when working with a MultiIndex. It works like a shortcut for slicing.
df.xs('G1')
A | B | |
---|---|---|
Num | ||
1 | 0.302665 | 1.693723 |
2 | -1.706086 | -1.159119 |
3 | -0.134841 | 0.390528 |
df.xs(1, level='Num')
A | B | |
---|---|---|
Group | ||
G1 | 0.302665 | 1.693723 |
G2 | 0.166905 | 0.184502 |
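By default .xs() drops the level you selected on; if you want to keep it in the result, the drop_level parameter is available (a sketch):
### keep the 'Num' level in the result's index
df.xs(1, level='Num', drop_level=False)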