7.3. Pandas DataFrames#

DataFrames are the core data structure in Pandas, modeled after similar structures in the R programming language. A DataFrame can be thought of as a collection of Series objects combined into a single table, all sharing the same index. Let’s use Pandas to explore how DataFrames work!

import numpy as np
import pandas as pd

7.3.1. Reading Datasets#

Following the instructions in Handling Dataset, you can access the dataset files in your data directory.

7.3.1.1. read_csv( ) to read file#

  • The most common way to read a CSV file is to use read_csv().

### forward slashes work in the path; Windows will resolve them correctly

df = pd.read_csv("../../data/data.csv")
df
Duration Pulse Maxpulse Calories
0 60 110 130 409.1
1 60 117 145 479
2 60 103 135 340
3 45 109 175 282.4
4 45 117 148 406
... ... ... ... ...
334 60 105 140 290.8
335 60 110 145 300.4
336 60 115 145 310.2
337 75 120 150 320.4
338 75 125 150 330.4

339 rows × 4 columns

Now that we have the DataFrame object, let’s use df.info() to get:

  1. the data type of each column

  2. the number of non-null values in each column

  3. which columns contain null values

df.info() 
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 339 entries, 0 to 338
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   Duration  339 non-null    object
 1   Pulse     339 non-null    object
 2   Maxpulse  339 non-null    object
 3   Calories  329 non-null    object
dtypes: object(4)
memory usage: 10.7+ KB
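
Notice that every column was read in as object (string) dtype here. As a side note, the sketch below shows one common way to coerce such columns to numbers with pd.to_numeric; the column names come from the output above, and this is only a hedged sketch, not a required step.

### sketch: coerce the object columns to numeric dtypes
for col in ["Duration", "Pulse", "Maxpulse", "Calories"]:
    df[col] = pd.to_numeric(df[col], errors="coerce")   ### unparseable values become NaN

# df.info()    ### re-check: the dtypes should now be numeric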

7.3.1.2. df Output#

  1. df: the default, table-like display

  2. print(df): shows the first 5 rows (head) and the last 5 rows (tail)

  3. to_string(): prints the whole DataFrame

  4. pd.options.display.max_rows: controls how many rows are displayed

df      ### default view is table-like
        ### Pandas will give 5+5 == 10 rows by default
        ### the shape information is at the bottom
Duration Pulse Maxpulse Calories
0 60 110 130 409.1
1 60 117 145 479
2 60 103 135 340
3 45 109 175 282.4
4 45 117 148 406
... ... ... ... ...
334 60 105 140 290.8
335 60 110 145 300.4
336 60 115 145 310.2
337 75 120 150 320.4
338 75 125 150 330.4

339 rows × 4 columns

print(df)   ### print(), same information but not as pretty
            
    Duration Pulse Maxpulse Calories
0         60   110      130    409.1
1         60   117      145      479
2         60   103      135      340
3         45   109      175    282.4
4         45   117      148      406
..       ...   ...      ...      ...
334       60   105      140    290.8
335       60   110      145    300.4
336       60   115      145    310.2
337       75   120      150    320.4
338       75   125      150    330.4

[339 rows x 4 columns]

Sometimes you need to see more rows than the default output shows. For that, use the pd.options.display.max_rows option.

print(pd.options.display.max_rows)     ### 2. display max_rows is 60 by default (depending on the system)
60
pd.options.display.max_rows = 100      ### increase max_rows
### the next line is commented out to avoid overwhelming output;
### you can uncomment it to see the full output

# print(df.to_string())          ### 3. to_string() enables printing the whole df
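
Item 2 above mentions the head and tail views; you can also request them explicitly. A small sketch (no output shown here):

df.head(3)     ### first 3 rows (default is 5)
df.tail(3)     ### last 3 rows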

7.3.1.3. DataFrame from SQLite#

Now, let us try to load the chinook SQLite database that we used in the SQL database activity. You will be able to appreciate how convenient Pandas makes loading data.

To connect to a sqlite3 database, we:

  1. import the sqlite3 library (the sqlite3 module is part of the Python standard library)

  2. connect to the database by calling the sqlite3.connect() function

  3. use the Pandas function read_sql_query() to run your query

  4. close the connection with the connection's close() method

### to check all tables just like "SHOW TABLES;" in MySQL

import sqlite3
conn = sqlite3.connect('../../data/chinook.db')

query = "SELECT name FROM sqlite_master WHERE type='table';"

pd.read_sql_query(query, conn)
name
0 Album
1 Artist
2 Customer
3 Employee
4 Genre
5 Invoice
6 InvoiceLine
7 MediaType
8 Playlist
9 PlaylistTrack
10 Track
import sqlite3
conn = sqlite3.connect('../../data/chinook.db')

df = pd.read_sql_query("SELECT * FROM customer;", conn)
df.head()
CustomerId FirstName LastName Company Address City State Country PostalCode Phone Fax Email SupportRepId
0 1 Luís Gonçalves Embraer - Empresa Brasileira de Aeronáutica S.A. Av. Brigadeiro Faria Lima, 2170 São José dos Campos SP Brazil 12227-000 +55 (12) 3923-5555 +55 (12) 3923-5566 luisg@embraer.com.br 3
1 2 Leonie Köhler None Theodor-Heuss-Straße 34 Stuttgart None Germany 70174 +49 0711 2842222 None leonekohler@surfeu.de 5
2 3 François Tremblay None 1498 rue Bélanger Montréal QC Canada H2G 1A7 +1 (514) 721-4711 None ftremblay@gmail.com 3
3 4 Bjørn Hansen None Ullevålsveien 14 Oslo None Norway 0171 +47 22 44 22 22 None bjorn.hansen@yahoo.no 4
4 5 František Wichterlová JetBrains s.r.o. Klanova 9/506 Prague None Czech Republic 14700 +420 2 4172 5555 +420 2 4172 5555 frantisekw@jetbrains.com 4
### you need to close the connection

conn.close()
### or use the with keyword (for sqlite3 this commits/rolls back the transaction; it does not actually close the connection)

sql = 'SELECT * FROM Artist LIMIT 10'
with sqlite3.connect("../../data/chinook.db") as conn:
    df = pd.read_sql_query(sql, conn)

df
ArtistId Name
0 1 AC/DC
1 2 Accept
2 3 Aerosmith
3 4 Alanis Morissette
4 5 Alice In Chains
5 6 Antônio Carlos Jobim
6 7 Apocalyptica
7 8 Audioslave
8 9 BackBeat
9 10 Billy Cobham
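
As a side note, read_sql_query() also accepts a params argument, which lets sqlite3 substitute values for ? placeholders safely instead of formatting them into the query string. A hedged sketch (the country value is just an example):

### sketch: a parameterized query with read_sql_query
sql = "SELECT FirstName, LastName, Country FROM customer WHERE Country = ?"
with sqlite3.connect("../../data/chinook.db") as conn:
    canadians = pd.read_sql_query(sql, conn, params=("Canada",))

# canadians.head()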

7.3.2. Creating DataFrames#

7.3.2.1. Use the pd.DataFrame#

pd.DataFrame creates a DataFrame object.

### create the index object. You see that the pd.date_range() function gives 
### a range of days just like a regular range() function

# dates = pd.date_range()     ### shift tab to see doc
dates = pd.date_range("20250901", periods=6)
### evaluate variable dates; dtype is datetime64 and freq='D'

dates
DatetimeIndex(['2025-09-01', '2025-09-02', '2025-09-03', '2025-09-04',
               '2025-09-05', '2025-09-06'],
              dtype='datetime64[ns]', freq='D')
### check data type of object => index

type(dates)
pandas.core.indexes.datetimes.DatetimeIndex
### turning a string into a list => column labels

list("ABCD")
['A', 'B', 'C', 'D']
### use pd.DataFrame( ) to create a DataFrame object

# pd.DataFrame    ### shift tab to see doc

np.random.seed(101)      ### seed keeps the "random" numbers constant

df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
df
A B C D
2025-09-01 2.706850 0.628133 0.907969 0.503826
2025-09-02 0.651118 -0.319318 -0.848077 0.605965
2025-09-03 -2.018168 0.740122 0.528813 -0.589001
2025-09-04 0.188695 -0.758872 -0.933237 0.955057
2025-09-05 0.190794 1.978757 2.605967 0.683509
2025-09-06 0.302665 1.693723 -1.706086 -1.159119
### check data type: pd.DataFrame() returns a DataFrame

type(df)
pandas.core.frame.DataFrame
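
pd.DataFrame also accepts a dictionary that maps column labels to equal-length sequences, which is another construction you will see often. A small sketch with made-up data:

### sketch: DataFrame from a dict of lists (hypothetical data)
people = pd.DataFrame(
    {"name": ["Ada", "Grace", "Alan"],
     "year": [1815, 1906, 1912]},
    index=["a", "b", "c"])
# people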

7.3.3. DataFrame Indexing#

In this section, we examine the primary techniques for retrieving data stored within a Pandas DataFrame. Mastery of these methods is essential, as they allow us to isolate rows, columns, or specific values for further analysis.

7.3.3.1. Random Seed#

from numpy.random import randn    ### randint and rand can be imported the same way

np.random.seed(101)
randn(5, 4)
array([[ 2.70684984,  0.62813271,  0.90796945,  0.50382575],
       [ 0.65111795, -0.31931804, -0.84807698,  0.60596535],
       [-2.01816824,  0.74012206,  0.52881349, -0.58900053],
       [ 0.18869531, -0.75887206, -0.93323722,  0.95505651],
       [ 0.19079432,  1.97875732,  2.60596728,  0.68350889]])
### do it again
np.random.seed(101)
randn(5,4)
array([[ 2.70684984,  0.62813271,  0.90796945,  0.50382575],
       [ 0.65111795, -0.31931804, -0.84807698,  0.60596535],
       [-2.01816824,  0.74012206,  0.52881349, -0.58900053],
       [ 0.18869531, -0.75887206, -0.93323722,  0.95505651],
       [ 0.19079432,  1.97875732,  2.60596728,  0.68350889]])
### seed not re-set in this cell, so the numbers change
randn(5,4)
array([[ 0.30266545,  1.69372293, -1.70608593, -1.15911942],
       [-0.13484072,  0.39052784,  0.16690464,  0.18450186],
       [ 0.80770591,  0.07295968,  0.63878701,  0.3296463 ],
       [-0.49710402, -0.7540697 , -0.9434064 ,  0.48475165],
       [-0.11677332,  1.9017548 ,  0.23812696,  1.99665229]])
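
If you prefer NumPy's newer Generator interface, the same reproducibility idea applies per generator object; note that it produces a different stream than np.random.seed(). A hedged sketch:

### sketch: the Generator API keeps its own seeded state
rng = np.random.default_rng(101)
rng.standard_normal((5, 4))      ### reproducible for this rng; values differ from randn above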

7.3.3.2. Create DataFrame#

### DataFrame
np.random.seed(101)

### parameters: data, index, columns
df = pd.DataFrame(randn(5,4),index='A B C D E'.split(),columns='W X Y Z'.split())   ### or list('WXYZ')
df
W X Y Z
A 2.706850 0.628133 0.907969 0.503826
B 0.651118 -0.319318 -0.848077 0.605965
C -2.018168 0.740122 0.528813 -0.589001
D 0.188695 -0.758872 -0.933237 0.955057
E 0.190794 1.978757 2.605967 0.683509

7.3.3.3. Column Indexing#

7.3.3.3.1. Using column label#

### same bracket notation for indexing

df['W']
A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64
### this works but NOT RECOMMENDED

df.W
A    2.706850
B    0.651118
C   -2.018168
D    0.188695
E    0.190794
Name: W, dtype: float64

7.3.3.3.2. Using a list of column labels#

# Pass a list of column names

df[['W','Z']]
W Z
A 2.706850 0.503826
B 0.651118 0.605965
C -2.018168 -0.589001
D 0.188695 0.955057
E 0.190794 0.683509

DataFrame columns are just Series objects:

type(df['W'])
pandas.core.series.Series
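
The attribute form df.W is discouraged because it breaks whenever a column name collides with a DataFrame method or is not a valid identifier. A small sketch with hypothetical column names:

### sketch: why bracket notation is safer than attribute access
tmp = pd.DataFrame({"max": [1, 2], "my col": [3, 4]})
tmp["max"]       ### always works
# tmp.max        ### this is the DataFrame.max method, not the column
# tmp.my col     ### not even valid Python syntax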

7.3.3.4. Row Indexing#

DataFrame rows can be accessed with loc (by label) and iloc (by position). The result of selecting a single row is a Series.

df.loc['A']
W    2.706850
X    0.628133
Y    0.907969
Z    0.503826
Name: A, dtype: float64
ser = df.loc['A']
type(ser)
pandas.core.series.Series

Or select based on position instead of label with iloc:

df.iloc[2]
W   -2.018168
X    0.740122
Y    0.528813
Z   -0.589001
Name: C, dtype: float64
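
Both indexers also accept slices; a short sketch (note the different end-point behavior):

df.loc['A':'C']      ### label slice, inclusive of both ends
df.iloc[0:3]         ### position slice, end-exclusive (rows A, B, C)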

7.3.3.5. Selecting subset of rows and columns#

df
W X Y Z
A 2.706850 0.628133 0.907969 0.503826
B 0.651118 -0.319318 -0.848077 0.605965
C -2.018168 0.740122 0.528813 -0.589001
D 0.188695 -0.758872 -0.933237 0.955057
E 0.190794 1.978757 2.605967 0.683509
df.loc['B','Y']
np.float64(-0.8480769834036315)
df.loc[['A','B'],['W','Y']]
W Y
A 2.706850 0.907969
B 0.651118 -0.848077
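
loc and iloc accept a slice or a list in either position, so subsets can also be written like this (a sketch):

df.loc['A':'C', 'W':'Y']     ### label slices, inclusive
df.iloc[0:2, [0, 2]]         ### positions: first two rows, columns W and Y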

7.3.3.6. Conditional Selection#

An important feature of Pandas is conditional selection using bracket notation, very similar to NumPy:

df
W X Y Z
A 2.706850 0.628133 0.907969 0.503826
B 0.651118 -0.319318 -0.848077 0.605965
C -2.018168 0.740122 0.528813 -0.589001
D 0.188695 -0.758872 -0.933237 0.955057
E 0.190794 1.978757 2.605967 0.683509
df>0
W X Y Z
A True True True True
B True False False True
C False True True False
D True False False True
E True True True True
df[df>0]
W X Y Z
A 2.706850 0.628133 0.907969 0.503826
B 0.651118 NaN NaN 0.605965
C NaN 0.740122 0.528813 NaN
D 0.188695 NaN NaN 0.955057
E 0.190794 1.978757 2.605967 0.683509
df[df['W']>0]
W X Y Z
A 2.706850 0.628133 0.907969 0.503826
B 0.651118 -0.319318 -0.848077 0.605965
D 0.188695 -0.758872 -0.933237 0.955057
E 0.190794 1.978757 2.605967 0.683509
df[df['W']>0]['Y']
A    0.907969
B   -0.848077
D   -0.933237
E    2.605967
Name: Y, dtype: float64
df[df['W']>0][['Y','X']]
Y X
A 0.907969 0.628133
B -0.848077 -0.319318
D -0.933237 -0.758872
E 2.605967 1.978757

For two conditions you can use | and & with parentheses:

df[(df['W']>0) & (df['Y'] > 1)]
W X Y Z
E 0.190794 1.978757 2.605967 0.683509
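
The same pattern works with | (or) and ~ (not), and .loc lets you apply the row mask and pick columns in one step; a sketch:

df[(df['W']>0) | (df['Y']>1)]      ### either condition holds
df[~(df['W']>0)]                   ### negate a condition
df.loc[df['W']>0, ['Y','X']]       ### mask rows and select columns at once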

7.3.3.7. Row/Column Operations#

7.3.3.7.1. Creating a new column:#

Note that adding a column this way modifies the DataFrame in place.

df['new'] = df['W'] + df['Y']
df
W X Y Z new
A 2.706850 0.628133 0.907969 0.503826 3.614819
B 0.651118 -0.319318 -0.848077 0.605965 -0.196959
C -2.018168 0.740122 0.528813 -0.589001 -1.489355
D 0.188695 -0.758872 -0.933237 0.955057 -0.744542
E 0.190794 1.978757 2.605967 0.683509 2.796762
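
If you would rather not modify df in place, assign() returns a new DataFrame with the extra column and leaves the original untouched; a sketch:

### sketch: a copy with the new column, df itself is unchanged
df_with_sum = df.assign(new=df['W'] + df['Y'])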

7.3.3.7.2. Dropping columns#

Use the drop method. Note that you have to specify axis=1 to drop a column.

# df['new'] = df['W'] + df['Y']
df.drop('new',axis=1)
W X Y Z
A 2.706850 0.628133 0.907969 0.503826
B 0.651118 -0.319318 -0.848077 0.605965
C -2.018168 0.740122 0.528813 -0.589001
D 0.188695 -0.758872 -0.933237 0.955057
E 0.190794 1.978757 2.605967 0.683509
# Not inplace unless specified!
df
W X Y Z new
A 2.706850 0.628133 0.907969 0.503826 3.614819
B 0.651118 -0.319318 -0.848077 0.605965 -0.196959
C -2.018168 0.740122 0.528813 -0.589001 -1.489355
D 0.188695 -0.758872 -0.933237 0.955057 -0.744542
E 0.190794 1.978757 2.605967 0.683509 2.796762
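
As a side note, drop() also understands columns= and index= keywords, which many people find clearer than remembering the axis number; a hedged sketch of the equivalent spellings:

df.drop(columns=['new'])     ### same as df.drop('new', axis=1)
df.drop(index=['E'])         ### same as df.drop('E', axis=0)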

Most DataFrame methods require inplace=True to modify the object in place:

df.drop('new',axis=1,inplace=True)

### new is gone
df
W X Y Z
A 2.706850 0.628133 0.907969 0.503826
B 0.651118 -0.319318 -0.848077 0.605965
C -2.018168 0.740122 0.528813 -0.589001
D 0.188695 -0.758872 -0.933237 0.955057
E 0.190794 1.978757 2.605967 0.683509

7.3.3.7.3. Dropping rows#

Note that rows are dropped with axis=0 (the default).

df.drop('E', axis=0)
W X Y Z
A 2.706850 0.628133 0.907969 0.503826
B 0.651118 -0.319318 -0.848077 0.605965
C -2.018168 0.740122 0.528813 -0.589001
D 0.188695 -0.758872 -0.933237 0.955057

7.3.4. More Index Details#

Let’s discuss some more features of indexing, including resetting the index or setting it to something else. We’ll also talk about index hierarchy!

df
W X Y Z
A 2.706850 0.628133 0.907969 0.503826
B 0.651118 -0.319318 -0.848077 0.605965
C -2.018168 0.740122 0.528813 -0.589001
D 0.188695 -0.758872 -0.933237 0.955057
E 0.190794 1.978757 2.605967 0.683509
# Reset to default 0,1...n index
df.reset_index()
index W X Y Z
0 A 2.706850 0.628133 0.907969 0.503826
1 B 0.651118 -0.319318 -0.848077 0.605965
2 C -2.018168 0.740122 0.528813 -0.589001
3 D 0.188695 -0.758872 -0.933237 0.955057
4 E 0.190794 1.978757 2.605967 0.683509
newind = 'CA NY WY OR CO'.split()
df['States'] = newind
df
W X Y Z States
A 2.706850 0.628133 0.907969 0.503826 CA
B 0.651118 -0.319318 -0.848077 0.605965 NY
C -2.018168 0.740122 0.528813 -0.589001 WY
D 0.188695 -0.758872 -0.933237 0.955057 OR
E 0.190794 1.978757 2.605967 0.683509 CO
df.set_index('States')
W X Y Z
States
CA 2.706850 0.628133 0.907969 0.503826
NY 0.651118 -0.319318 -0.848077 0.605965
WY -2.018168 0.740122 0.528813 -0.589001
OR 0.188695 -0.758872 -0.933237 0.955057
CO 0.190794 1.978757 2.605967 0.683509
df
W X Y Z States
A 2.706850 0.628133 0.907969 0.503826 CA
B 0.651118 -0.319318 -0.848077 0.605965 NY
C -2.018168 0.740122 0.528813 -0.589001 WY
D 0.188695 -0.758872 -0.933237 0.955057 OR
E 0.190794 1.978757 2.605967 0.683509 CO
df.set_index('States',inplace=True)
df
W X Y Z
States
CA 2.706850 0.628133 0.907969 0.503826
NY 0.651118 -0.319318 -0.848077 0.605965
WY -2.018168 0.740122 0.528813 -0.589001
OR 0.188695 -0.758872 -0.933237 0.955057
CO 0.190794 1.978757 2.605967 0.683509
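
Both methods accept useful keyword options; for instance reset_index(drop=True) discards the old index instead of turning it into a column. A sketch:

df.reset_index()             ### 'States' comes back as an ordinary column
# df.reset_index(drop=True)  ### would discard the state labels entirely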

7.3.5. Multi-Index and Index Hierarchy#

Hierarchical Indexing (or MultiIndex) gives us multiple levels of row or column labels in a DataFrame, which is very useful for working with grouped or multidimensional data.

Let’s work with a MultiIndex. First, we’ll create a quick example of what a multi-indexed DataFrame looks like using pd.MultiIndex.from_tuples.

7.3.5.1. Creating multi-level index#

### index levels
outside = ['G1','G1','G1','G2','G2','G2']
inside = [1,2,3,1,2,3]


### In Python, zip is a built-in function that lets you combine 
### two or more lists (or other iterables) element-by-element into pairs (or tuples).
### list1 = [1, 2, 3, 4]
### list2 = ['a', 'b']
### print(list(zip(list1, list2))) ==> [(1, 'a'), (2, 'b')]

hier_index = list(zip(outside,inside))
hier_index = pd.MultiIndex.from_tuples(hier_index)
hier_index     ### now a MultiIndex built from the list of tuples
MultiIndex([('G1', 1),
            ('G1', 2),
            ('G1', 3),
            ('G2', 1),
            ('G2', 2),
            ('G2', 3)],
           )
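
The same index can be built more compactly with pd.MultiIndex.from_product, which takes the Cartesian product of the level values; a sketch:

pd.MultiIndex.from_product([['G1', 'G2'], [1, 2, 3]])    ### same six (group, number) pairs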

7.3.5.2. Creating MultiIndex DataFrame#

Pass hier_index to the index parameter, and supply the column labels.

df = pd.DataFrame(np.random.randn(6,2),index=hier_index,columns=['A','B'])
df
A B
Group Num
G1 1 -0.497104 -0.754070
2 -0.943406 0.484752
3 -0.116773 1.901755
G2 1 0.238127 1.996652
2 -0.993263 0.196800
3 -1.136645 0.000366

Now let’s show how to index this! For the row index hierarchy we use df.loc[]; if this were on the columns axis, you would just use the normal bracket notation df[]. Selecting one level of the index returns the sub-DataFrame:

df.loc['G1']
A B
1 0.302665 1.693723
2 -1.706086 -1.159119
3 -0.134841 0.390528
df.loc['G1'].loc[1]
A    0.302665
B    1.693723
Name: 1, dtype: float64
df.loc['G2'].loc[2]['B']
np.float64(0.07295967531703869)

7.3.5.3. Changing group names#

df.index.names
FrozenList(['Group', 'Num'])
df.index.names = ['Group','Num']
df
A B
Group Num
G1 1 0.302665 1.693723
2 -1.706086 -1.159119
3 -0.134841 0.390528
G2 1 0.166905 0.184502
2 0.807706 0.072960
3 0.638787 0.329646

7.3.5.4. .xs( ) method#

The .xs() method (short for cross-section) is a convenient way to select rows or columns from a Pandas DataFrame, especially when working with MultiIndex objects. It works like a shortcut for slicing.

df.xs('G1')
A B
Num
1 0.302665 1.693723
2 -1.706086 -1.159119
3 -0.134841 0.390528
df.xs(1, level='Num')
A B
Group
G1 0.302665 1.693723
G2 0.166905 0.184502
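
For comparison, the same row can be reached with loc and a tuple, and xs() can keep the selected level in the result via drop_level=False; a hedged sketch:

df.loc[('G1', 1)]                 ### same row as df.xs('G1').loc[1]
df.xs('G1', drop_level=False)     ### keeps 'Group' in the index of the result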