7.3. Pandas DataFrames#
DataFrames are the core data structure in Pandas, modeled after similar structures in the R programming language. A DataFrame can be thought of as a collection of Series objects combined into a single table, all sharing the same index. Let’s use Pandas to explore how DataFrames work!
import numpy as np
import pandas as pd
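To make the idea of "a collection of Series sharing one index" concrete, here is a small sketch (the column names and values are made up for illustration):
### two Series that share the same index
s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["a", "b", "c"])
### combining them column-wise gives a DataFrame with that shared index
pd.DataFrame({"x": s1, "y": s2})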
7.3.1. Reading Datasets#
Following the instructions in Handling Dataset, you should be able to access the dataset files in your data directory.
7.3.1.1. read_csv() to read a file#
The most common way to read a file is to use read_csv().
### use forward slashes in the path; Windows will resolve them correctly
df = pd.read_csv("../../data/data.csv")
df
Duration | Pulse | Maxpulse | Calories | |
---|---|---|---|---|
0 | 60 | 110 | 130 | 409.1 |
1 | 60 | 117 | 145 | 479 |
2 | 60 | 103 | 135 | 340 |
3 | 45 | 109 | 175 | 282.4 |
4 | 45 | 117 | 148 | 406 |
... | ... | ... | ... | ... |
334 | 60 | 105 | 140 | 290.8 |
335 | 60 | 110 | 145 | 300.4 |
336 | 60 | 115 | 145 | 310.2 |
337 | 75 | 120 | 150 | 320.4 |
338 | 75 | 125 | 150 | 330.4 |
339 rows × 4 columns
Now that we have the DataFrame object, let's use df.info() to get:
- the data type of each column
- the number of non-null values in each column
- which columns contain null values
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 339 entries, 0 to 338
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Duration 339 non-null object
1 Pulse 339 non-null object
2 Maxpulse 339 non-null object
3 Calories 329 non-null object
dtypes: object(4)
memory usage: 10.7+ KB
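Notice that all four columns are reported with dtype object even though the values look numeric, which usually means the raw CSV contains some non-numeric entries. If you want numeric dtypes, one option (a sketch, not part of the original notebook) is pd.to_numeric with errors='coerce':
### convert every column to a numeric dtype; entries that cannot be parsed become NaN
df_numeric = df.apply(pd.to_numeric, errors="coerce")
df_numeric.info() ### dtypes should now be numeric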
7.3.1.2. df Output#
There are several ways to display a DataFrame:
- df: the default table-like view, showing head(5) and tail(5)
- print(df): the same information as plain text
- df.to_string(): render the whole DataFrame
- pd.options.display.max_rows: controls how many rows are shown
df ### default view is table-like
### Pandas will give 5+5 == 10 rows by default
### the shape information is at the bottom
Duration | Pulse | Maxpulse | Calories | |
---|---|---|---|---|
0 | 60 | 110 | 130 | 409.1 |
1 | 60 | 117 | 145 | 479 |
2 | 60 | 103 | 135 | 340 |
3 | 45 | 109 | 175 | 282.4 |
4 | 45 | 117 | 148 | 406 |
... | ... | ... | ... | ... |
334 | 60 | 105 | 140 | 290.8 |
335 | 60 | 110 | 145 | 300.4 |
336 | 60 | 115 | 145 | 310.2 |
337 | 75 | 120 | 150 | 320.4 |
338 | 75 | 125 | 150 | 330.4 |
339 rows × 4 columns
print(df) ### print(), same information but not as pretty
Duration Pulse Maxpulse Calories
0 60 110 130 409.1
1 60 117 145 479
2 60 103 135 340
3 45 109 175 282.4
4 45 117 148 406
.. ... ... ... ...
334 60 105 140 290.8
335 60 110 145 300.4
336 60 115 145 310.2
337 75 120 150 320.4
338 75 125 150 330.4
[339 rows x 4 columns]
Sometimes you need to see more than the default output. You can adjust the pd.options.display.max_rows option.
print(pd.options.display.max_rows) ### max_rows is 60 by default (may vary by environment)
60
pd.options.display.max_rows = 100 ### increase max_rows
### this line is commented out to avoid overwhelming output
### you can uncomment it to see the full output
# print(df.to_string()) ### to_string() lets you print the whole df
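If you only need the larger display temporarily, Pandas also provides pd.option_context, which restores the previous setting when the block exits. A minimal sketch:
### temporarily raise the row limit only inside this block
with pd.option_context("display.max_rows", 100):
    print(df) ### up to 100 rows are shown here
### outside the block, max_rows reverts to its previous value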
7.3.1.3. DataFrame from SQLite#
Now, let us try to load the chinook SQLite database that we used in the SQL database activity. You will appreciate how convenient Pandas makes loading data.
To connect to a sqlite3 database, we:
- import the sqlite3 library (the sqlite3 module is part of the Python standard library)
- connect to the database by calling the sqlite3.connect() function
- use the Pandas function read_sql_query() to run your query
- close the connection using the sqlite3 close() function
### to check all tables just like "SHOW TABLES;" in MySQL
import sqlite3
conn = sqlite3.connect('../../data/chinook.db')
query = "SELECT name FROM sqlite_master WHERE type='table';"
pd.read_sql_query(query, conn)
name | |
---|---|
0 | Album |
1 | Artist |
2 | Customer |
3 | Employee |
4 | Genre |
5 | Invoice |
6 | InvoiceLine |
7 | MediaType |
8 | Playlist |
9 | PlaylistTrack |
10 | Track |
import sqlite3
conn = sqlite3.connect('../../data/chinook.db')
df = pd.read_sql_query("SELECT * FROM customer;", conn)
df.head()
CustomerId | FirstName | LastName | Company | Address | City | State | Country | PostalCode | Phone | Fax | SupportRepId | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Luís | Gonçalves | Embraer - Empresa Brasileira de Aeronáutica S.A. | Av. Brigadeiro Faria Lima, 2170 | São José dos Campos | SP | Brazil | 12227-000 | +55 (12) 3923-5555 | +55 (12) 3923-5566 | luisg@embraer.com.br | 3 |
1 | 2 | Leonie | Köhler | None | Theodor-Heuss-Straße 34 | Stuttgart | None | Germany | 70174 | +49 0711 2842222 | None | leonekohler@surfeu.de | 5 |
2 | 3 | François | Tremblay | None | 1498 rue Bélanger | Montréal | QC | Canada | H2G 1A7 | +1 (514) 721-4711 | None | ftremblay@gmail.com | 3 |
3 | 4 | Bjørn | Hansen | None | Ullevålsveien 14 | Oslo | None | Norway | 0171 | +47 22 44 22 22 | None | bjorn.hansen@yahoo.no | 4 |
4 | 5 | František | Wichterlová | JetBrains s.r.o. | Klanova 9/506 | Prague | None | Czech Republic | 14700 | +420 2 4172 5555 | +420 2 4172 5555 | frantisekw@jetbrains.com | 4 |
### you need to close the connection
conn.close()
### or use a with block; note that sqlite3's context manager only commits/rolls back the transaction, it does not close the connection
sql = 'SELECT * FROM Artist LIMIT 10'
with sqlite3.connect("../../data/chinook.db") as conn:
df = pd.read_sql_query(sql, conn)
df
ArtistId | Name | |
---|---|---|
0 | 1 | AC/DC |
1 | 2 | Accept |
2 | 3 | Aerosmith |
3 | 4 | Alanis Morissette |
4 | 5 | Alice In Chains |
5 | 6 | Antônio Carlos Jobim |
6 | 7 | Apocalyptica |
7 | 8 | Audioslave |
8 | 9 | BackBeat |
9 | 10 | Billy Cobham |
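If you want the connection to be closed automatically as well, one possibility (a sketch, not from the original notebook) is to wrap the connection in contextlib.closing:
from contextlib import closing
### closing() guarantees conn.close() runs when the block exits
with closing(sqlite3.connect("../../data/chinook.db")) as conn:
    artists = pd.read_sql_query("SELECT * FROM Artist LIMIT 5", conn)
artists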
7.3.2. Creating DataFrames#
7.3.2.1. Using pd.DataFrame#
pd.DataFrame creates a DataFrame object.
### create the index object. You see that the pd.date_range() function gives
### a range of days just like a regular range() function
# dates = pd.date_range() ### shift tab to see doc
dates = pd.date_range("20250901", periods=6)
### evaluate variable dates; dtype is datetime64 and freq='D'
dates
DatetimeIndex(['2025-09-01', '2025-09-02', '2025-09-03', '2025-09-04',
'2025-09-05', '2025-09-06'],
dtype='datetime64[ns]', freq='D')
### check data type of object => index
type(dates)
pandas.core.indexes.datetimes.DatetimeIndex
### turning a string into a list => column labels
list("ABCD")
['A', 'B', 'C', 'D']
### use pd.DataFrame( ) to create a DataFrame object
# pd.DataFrame ### shift tab to see doc
np.random.seed(101) ### seed keeps the "random" numbers constant
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
df
A | B | C | D | |
---|---|---|---|---|
2025-09-01 | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
2025-09-02 | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
2025-09-03 | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
2025-09-04 | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
2025-09-05 | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
2025-09-06 | 0.302665 | 1.693723 | -1.706086 | -1.159119 |
### check data type: pd.DataFrame() (like pd.read_sql_query()) returns a DataFrame
type(df)
pandas.core.frame.DataFrame
7.3.3. DataFrame Indexing#
In this section, we examine the primary techniques for retrieving data stored within a Pandas DataFrame. Mastery of these methods is essential, as they allow us to isolate rows, columns, or specific values for further analysis.
7.3.3.1. Random Seed#
from numpy.random import randn ### we can import random to get randint & rand
np.random.seed(101)
randn(5, 4)
array([[ 2.70684984, 0.62813271, 0.90796945, 0.50382575],
[ 0.65111795, -0.31931804, -0.84807698, 0.60596535],
[-2.01816824, 0.74012206, 0.52881349, -0.58900053],
[ 0.18869531, -0.75887206, -0.93323722, 0.95505651],
[ 0.19079432, 1.97875732, 2.60596728, 0.68350889]])
### do it again
np.random.seed(101)
randn(5,4)
array([[ 2.70684984, 0.62813271, 0.90796945, 0.50382575],
[ 0.65111795, -0.31931804, -0.84807698, 0.60596535],
[-2.01816824, 0.74012206, 0.52881349, -0.58900053],
[ 0.18869531, -0.75887206, -0.93323722, 0.95505651],
[ 0.19079432, 1.97875732, 2.60596728, 0.68350889]])
### seed not set in this cell, so the numbers differ from the seeded runs above
randn(5,4)
array([[ 0.30266545, 1.69372293, -1.70608593, -1.15911942],
[-0.13484072, 0.39052784, 0.16690464, 0.18450186],
[ 0.80770591, 0.07295968, 0.63878701, 0.3296463 ],
[-0.49710402, -0.7540697 , -0.9434064 , 0.48475165],
[-0.11677332, 1.9017548 , 0.23812696, 1.99665229]])
7.3.3.2. Create DataFrame#
### DataFrame
np.random.seed(101)
### parameters: data, index, columns
df = pd.DataFrame(randn(5,4),index='A B C D E'.split(),columns='W X Y Z'.split()) ### or list('WXYZ')
df
W | X | Y | Z | |
---|---|---|---|---|
A | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
7.3.3.3. Column Indexing#
7.3.3.3.1. Using column label#
### same bracket notation for indexing
df['W']
A 2.706850
B 0.651118
C -2.018168
D 0.188695
E 0.190794
Name: W, dtype: float64
### attribute access works but is NOT RECOMMENDED (it breaks if the column name clashes with a DataFrame attribute or contains spaces)
df.W
A 2.706850
B 0.651118
C -2.018168
D 0.188695
E 0.190794
Name: W, dtype: float64
7.3.3.3.2. Using a list of column labels#
# Pass a list of column names
df[['W','Z']]
W | Z | |
---|---|---|
A | 2.706850 | 0.503826 |
B | 0.651118 | 0.605965 |
C | -2.018168 | -0.589001 |
D | 0.188695 | 0.955057 |
E | 0.190794 | 0.683509 |
DataFrame Columns are just Series
type(df['W'])
pandas.core.series.Series
7.3.3.4. Row Indexing#
Access DataFrame rows with loc and iloc. Selecting a single row returns a Series.
df.loc['A']
W 2.706850
X 0.628133
Y 0.907969
Z 0.503826
Name: A, dtype: float64
ser = df.loc['A']
type(ser)
pandas.core.series.Series
Or select based on position instead of label with iloc:
df.iloc[2]
W -2.018168
X 0.740122
Y 0.528813
Z -0.589001
Name: C, dtype: float64
7.3.3.5. Selecting a subset of rows and columns#
df
W | X | Y | Z | |
---|---|---|---|---|
A | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
df.loc['B','Y']
np.float64(-0.8480769834036315)
df.loc[['A','B'],['W','Y']]
W | Y | |
---|---|---|
A | 2.706850 | 0.907969 |
B | 0.651118 | -0.848077 |
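Positional slicing works the same way with .iloc; a quick sketch (as in NumPy, the end position is excluded):
### rows 0-1 (A and B) and columns 1-2 (X and Y), selected by position
df.iloc[0:2, 1:3]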
7.3.3.6. Conditional Selection#
An important feature of Pandas is conditional selection using bracket notation, very similar to NumPy:
df
W | X | Y | Z | |
---|---|---|---|---|
A | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
df>0
W | X | Y | Z | |
---|---|---|---|---|
A | True | True | True | True |
B | True | False | False | True |
C | False | True | True | False |
D | True | False | False | True |
E | True | True | True | True |
df[df>0]
W | X | Y | Z | |
---|---|---|---|---|
A | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
B | 0.651118 | NaN | NaN | 0.605965 |
C | NaN | 0.740122 | 0.528813 | NaN |
D | 0.188695 | NaN | NaN | 0.955057 |
E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
df[df['W']>0]
W | X | Y | Z | |
---|---|---|---|---|
A | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
df[df['W']>0]['Y']
A 0.907969
B -0.848077
D -0.933237
E 2.605967
Name: Y, dtype: float64
df[df['W']>0][['Y','X']]
Y | X | |
---|---|---|
A | 0.907969 | 0.628133 |
B | -0.848077 | -0.319318 |
D | -0.933237 | -0.758872 |
E | 2.605967 | 1.978757 |
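The same selection can also be written as a single .loc call, which avoids chained indexing (a sketch):
### the boolean mask selects the rows, the list selects the columns
df.loc[df['W'] > 0, ['Y', 'X']]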
For two conditions you can use | and & with parentheses:
df[(df['W']>0) & (df['Y'] > 1)]
W | X | Y | Z | |
---|---|---|---|---|
E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
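An OR condition works the same way with |; for example, rows where either condition holds (a quick sketch, output not shown):
### rows where W is positive OR Y is greater than 1
df[(df['W'] > 0) | (df['Y'] > 1)]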
7.3.3.7. Row/Column Operations#
7.3.3.7.1. Creating a new column#
Note that this is in-place.
df['new'] = df['W'] + df['Y']
df
W | X | Y | Z | new | |
---|---|---|---|---|---|
A | 2.706850 | 0.628133 | 0.907969 | 0.503826 | 3.614819 |
B | 0.651118 | -0.319318 | -0.848077 | 0.605965 | -0.196959 |
C | -2.018168 | 0.740122 | 0.528813 | -0.589001 | -1.489355 |
D | 0.188695 | -0.758872 | -0.933237 | 0.955057 | -0.744542 |
E | 0.190794 | 1.978757 | 2.605967 | 0.683509 | 2.796762 |
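If you prefer not to modify df in place, DataFrame.assign returns a new DataFrame with the extra column (a sketch, computing the same W + Y sum):
### df itself is left unchanged; the result is a copy that carries the 'new' column
df.assign(new=df['W'] + df['Y'])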
7.3.3.7.2. Dropping columns#
Use the drop function. Note that you have to specify axis=1 to drop a column by its label.
# df['new'] = df['W'] + df['Y']
df.drop('new',axis=1)
W | X | Y | Z | |
---|---|---|---|---|
A | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
# Not inplace unless specified!
df
W | X | Y | Z | new | |
---|---|---|---|---|---|
A | 2.706850 | 0.628133 | 0.907969 | 0.503826 | 3.614819 |
B | 0.651118 | -0.319318 | -0.848077 | 0.605965 | -0.196959 |
C | -2.018168 | 0.740122 | 0.528813 | -0.589001 | -1.489355 |
D | 0.188695 | -0.758872 | -0.933237 | 0.955057 | -0.744542 |
E | 0.190794 | 1.978757 | 2.605967 | 0.683509 | 2.796762 |
Most DataFrame methods require inplace=True to perform the operation in place.
df.drop('new',axis=1,inplace=True)
df
W | X | Y | Z | |
---|---|---|---|---|
A | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
### new is gone
df
W | X | Y | Z | |
---|---|---|---|---|
A | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
7.3.3.7.3. Dropping rows#
Note that rows are dropped with axis=0 (the default).
df.drop('E', axis=0)
W | X | Y | Z | |
---|---|---|---|---|
A | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
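If remembering axis numbers feels awkward, drop also accepts the index= and columns= keywords (a sketch, equivalent to the axis-based calls above):
### equivalent to df.drop('E', axis=0) and df.drop('W', axis=1)
df.drop(index='E')
df.drop(columns=['W'])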
7.3.4. More Index Details#
Let’s discuss some more features of indexing, including resetting the index or setting it to something else. We’ll also talk about index hierarchy!
df
W | X | Y | Z | |
---|---|---|---|---|
A | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
# Reset to default 0,1...n index
df.reset_index()
index | W | X | Y | Z | |
---|---|---|---|---|---|
0 | A | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
1 | B | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
2 | C | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
3 | D | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
4 | E | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
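By default reset_index keeps the old index as a new column named index; if you just want to discard it, drop=True is available (a sketch):
### throw away the old A-E labels instead of keeping them as a column
df.reset_index(drop=True)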
newind = 'CA NY WY OR CO'.split()
df['States'] = newind
df
W | X | Y | Z | States | |
---|---|---|---|---|---|
A | 2.706850 | 0.628133 | 0.907969 | 0.503826 | CA |
B | 0.651118 | -0.319318 | -0.848077 | 0.605965 | NY |
C | -2.018168 | 0.740122 | 0.528813 | -0.589001 | WY |
D | 0.188695 | -0.758872 | -0.933237 | 0.955057 | OR |
E | 0.190794 | 1.978757 | 2.605967 | 0.683509 | CO |
df.set_index('States')
W | X | Y | Z | |
---|---|---|---|---|
States | ||||
CA | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
NY | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
WY | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
OR | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
CO | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
df
W | X | Y | Z | States | |
---|---|---|---|---|---|
A | 2.706850 | 0.628133 | 0.907969 | 0.503826 | CA |
B | 0.651118 | -0.319318 | -0.848077 | 0.605965 | NY |
C | -2.018168 | 0.740122 | 0.528813 | -0.589001 | WY |
D | 0.188695 | -0.758872 | -0.933237 | 0.955057 | OR |
E | 0.190794 | 1.978757 | 2.605967 | 0.683509 | CO |
df.set_index('States',inplace=True)
df
W | X | Y | Z | |
---|---|---|---|---|
States | ||||
CA | 2.706850 | 0.628133 | 0.907969 | 0.503826 |
NY | 0.651118 | -0.319318 | -0.848077 | 0.605965 |
WY | -2.018168 | 0.740122 | 0.528813 | -0.589001 |
OR | 0.188695 | -0.758872 | -0.933237 | 0.955057 |
CO | 0.190794 | 1.978757 | 2.605967 | 0.683509 |
7.3.5. Multi-Index and Index Hierarchy#
Hierarchical Indexing (or MultiIndex) gives us multiple levels of row or column labels in a DataFrame, which is very useful for working with grouped or multidimensional data.
Let’s work with a MultiIndex. First, we’ll create a quick example of what a multi-indexed DataFrame looks like using pd.MultiIndex.from_tuples().
7.3.5.1. Creating multi-level index#
### index levels
outside = ['G1','G1','G1','G2','G2','G2']
inside = [1,2,3,1,2,3]
### In Python, zip is a built-in function that lets you combine
### two or more lists (or other iterables) element-by-element into pairs (or tuples).
### list1 = [1, 2, 3, 4]
### list2 = ['a', 'b']
### print(list(zip(list1, list2))) ==> [(1, 'a'), (2, 'b')]
hier_index = list(zip(outside,inside))
hier_index = pd.MultiIndex.from_tuples(hier_index)
hier_index ### now a MultiIndex built from the list of tuples
MultiIndex([('G1', 1),
('G1', 2),
('G1', 3),
('G2', 1),
('G2', 2),
('G2', 3)],
)
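You can also attach the level names at creation time via the names parameter of from_tuples (a small sketch, not used in the original notebook):
### the same index, with the level names set up front
pd.MultiIndex.from_tuples(list(zip(outside, inside)), names=['Group', 'Num'])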
7.3.5.2. Creating MultiIndex DataFrame#
Pass hier_index to the index parameter and specify the column labels.
df = pd.DataFrame(np.random.randn(6,2),index=hier_index,columns=['A','B'])
df
A | B | ||
---|---|---|---|
Group | Num | ||
G1 | 1 | -0.497104 | -0.754070 |
2 | -0.943406 | 0.484752 | |
3 | -0.116773 | 1.901755 | |
G2 | 1 | 0.238127 | 1.996652 |
2 | -0.993263 | 0.196800 | |
3 | -1.136645 | 0.000366 |
Now let’s show how to index this! For an index hierarchy we use df.loc[]; if this were on the columns axis, you would just use normal bracket notation df[]. Selecting one level of the index returns the sub-DataFrame:
df.loc['G1']
A | B | |
---|---|---|
1 | 0.302665 | 1.693723 |
2 | -1.706086 | -1.159119 |
3 | -0.134841 | 0.390528 |
df.loc['G1'].loc[1]
A 0.302665
B 1.693723
Name: 1, dtype: float64
df.loc['G2'].loc[2]['B']
np.float64(0.07295967531703869)
7.3.5.3. Changing group names#
df.index.names
FrozenList(['Group', 'Num'])
df.index.names = ['Group','Num']
df
A | B | ||
---|---|---|---|
Group | Num | ||
G1 | 1 | 0.302665 | 1.693723 |
2 | -1.706086 | -1.159119 | |
3 | -0.134841 | 0.390528 | |
G2 | 1 | 0.166905 | 0.184502 |
2 | 0.807706 | 0.072960 | |
3 | 0.638787 | 0.329646 |
7.3.5.4. .xs() method#
The .xs() method (short for cross-section) is a convenient way to select rows or columns from a Pandas DataFrame, especially when working with a MultiIndex. It works like a shortcut for slicing.
df.xs('G1')
A | B | |
---|---|---|
Num | ||
1 | 0.302665 | 1.693723 |
2 | -1.706086 | -1.159119 |
3 | -0.134841 | 0.390528 |
df.xs(1, level='Num')
A | B | |
---|---|---|
Group | ||
G1 | 0.302665 | 1.693723 |
G2 | 0.166905 | 0.184502 |
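By default .xs() drops the level you selected on; if you want to keep it in the result, the drop_level parameter is available (a sketch):
### keep the 'Num' level in the result's index
df.xs(1, level='Num', drop_level=False)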