4.3. DataFrames#
The DataFrame is the core data structure in Pandas, modeled after similar structures in the R programming language. A DataFrame is a two-dimensional, labeled data structure that can be thought of as:
A collection of Series objects combined into a single table, all sharing the same index
A spreadsheet or SQL table with rows and columns
A dictionary of Series, where each column is a Series with a shared index
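The "dictionary of Series" view can be demonstrated directly. A minimal sketch (the names and values here are made up for illustration):

```python
import pandas as pd

# Two Series that share the same index
ages = pd.Series([20, 22, 21], index=["Alice", "Bob", "Charlie"])
scores = pd.Series([85, 92, 67], index=["Alice", "Bob", "Charlie"])

# A DataFrame is effectively a dictionary of such Series:
# each key becomes a column, and the shared index labels the rows
df = pd.DataFrame({"Age": ages, "Score": scores})
print(df.loc["Bob", "Score"])  # 92
```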
Key Features of DataFrames
Labeled axes: Both rows (index) and columns have labels, making data access intuitive and explicit
Heterogeneous data: Different columns can contain different data types (integers, floats, strings, dates, etc.)
Size-mutable: You can insert and delete columns
Powerful operations: Built-in methods for filtering, grouping, merging, reshaping, and statistical analysis
Alignment of data based on labels
Missing data handling (see 0404-missing-data.ipynb for details)
Rich I/O tools for reading/writing various formats (CSV, Excel, SQL, JSON, etc.)
Why DataFrames?
While NumPy arrays are excellent for numerical computation, DataFrames offer several advantages:
Column names instead of just numeric indices
Mixed data types in different columns
Label-based access and selection of data
Easy addition and removal of columns and rows
Built-in sorting and filtering
Flexible row labels (indices), including MultiIndex
What You’ll Learn
In this notebook, we’ll explore:
Creating DataFrames from various sources
DataFrame attributes and methods
Indexing and selecting data
Sorting and filtering
Working with indices, including MultiIndex
Note:
Missing data handling is covered in detail in notebook 0404-missing-data.ipynb.
Let’s use Pandas to explore how DataFrames work!
import numpy as np
import pandas as pd
4.3.1. Creating DataFrames#
There are several common ways to create a DataFrame in pandas:
| Method | Description |
|---|---|
| `pd.read_csv("file.csv")` | Load data from a CSV file — the most common approach in practice |
| `pd.DataFrame(dict)` | Build a DataFrame from a Python dictionary (keys become column names) |
| `pd.DataFrame(list_of_lists)` | Build from a 2D list, optionally specifying column names |
| `pd.DataFrame(ndarray)` | Build from a NumPy array, useful for numeric/randomly generated data |
In this section we’ll cover the two most common: reading from a file and constructing one from scratch.
4.3.1.1. pd.read_csv( )#
The most common way to read a file is the read_csv() function.
# Read CSV file - forward slashes (/) work on Windows, macOS, and Linux
df_sample = pd.read_csv("../../data/data.csv")
4.3.1.1.1. DataFrame Output and Preview#
To preview a dataframe, we usually use the .head() and .tail() methods.
There are several ways to display a DataFrame, each with different formatting and truncation behavior:
| Method | Description |
|---|---|
| `df` | Displays the DataFrame in table format, with the first and last 5 rows by default |
| `print(df)` | Prints a plain-text representation showing the first and last 5 rows |
| `df.to_string()` | Converts the entire DataFrame to a string without truncation |
| `pd.options.display.max_rows` | Controls how many rows are displayed before truncation |
To display a DataFrame, simply type its name:
df_sample # Default view is table-like
# Pandas will show first 5 and last 5 rows (10 total) by default
# The shape information is displayed at the bottom
| Duration | Pulse | Maxpulse | Calories | |
|---|---|---|---|---|
| 0 | 60 | 110 | 130 | 409.1 |
| 1 | 60 | 117 | 145 | 479.0 |
| 2 | 60 | 103 | 135 | 340.0 |
| 3 | 45 | 109 | 175 | 282.4 |
| 4 | 45 | 117 | 148 | 406.0 |
| ... | ... | ... | ... | ... |
| 160 | 30 | 85 | 120 | 250.4 |
| 161 | 45 | 90 | 130 | 260.4 |
| 162 | 45 | 95 | 130 | 270.0 |
| 163 | 45 | 100 | 140 | 280.9 |
| 164 | 60 | 105 | 140 | 290.8 |
165 rows × 4 columns
Using print() displays the DataFrame in plain text format:
print(df_sample) # print() shows same information but in plain text format
Duration Pulse Maxpulse Calories
0 60 110 130 409.1
1 60 117 145 479.0
2 60 103 135 340.0
3 45 109 175 282.4
4 45 117 148 406.0
.. ... ... ... ...
160 30 85 120 250.4
161 45 90 130 260.4
162 45 95 130 270.0
163 45 100 140 280.9
164 60 105 140 290.8
[165 rows x 4 columns]
.to_string()
.to_string() converts the entire DataFrame to a string without truncation. This is useful when you need the complete text representation.
The example below is commented out because it would print all 165 rows of data.
### The to_string() method is used to display the entire
### DataFrame as a string, without truncation.
### commented out to avoid overwhelming output in case of large DataFrame
# print(df_sample.to_string())
Pandas has a display.max_rows option that defaults to 60 (system-dependent). When a DataFrame has more rows than max_rows, the display is truncated to the first 5 and last 5 rows (10 total). Change the pd.options.display.max_rows setting if you need to see more rows.
print(pd.options.display.max_rows, "\n") ### current max_rows setting (default is 60)
print(df_sample.head(11), "\n") ### 11 is less than max_rows, will print
# print(df_sample.head(100), "\n") ### 100 is more than max_rows, show first 5 + last 5 rows
60
Duration Pulse Maxpulse Calories
0 60 110 130 409.1
1 60 117 145 479.0
2 60 103 135 340.0
3 45 109 175 282.4
4 45 117 148 406.0
5 60 102 127 300.5
6 60 110 136 374.0
7 45 104 134 253.3
8 30 109 133 195.1
9 60 98 124 269.0
10 60 103 147 329.3
When working with large datasets, you often want to see just a sample of the data rather than the entire DataFrame.
# View first 5 rows (default)
df_sample.head()
| Duration | Pulse | Maxpulse | Calories | |
|---|---|---|---|---|
| 0 | 60 | 110 | 130 | 409.1 |
| 1 | 60 | 117 | 145 | 479.0 |
| 2 | 60 | 103 | 135 | 340.0 |
| 3 | 45 | 109 | 175 | 282.4 |
| 4 | 45 | 117 | 148 | 406.0 |
# View last 3 rows
df_sample.tail(3)
| Duration | Pulse | Maxpulse | Calories | |
|---|---|---|---|---|
| 162 | 45 | 95 | 130 | 270.0 |
| 163 | 45 | 100 | 140 | 280.9 |
| 164 | 60 | 105 | 140 | 290.8 |
For this exercise, either
download baby.csv and place it in your project data folder, or
just use pd.read_csv("../../data/baby.csv") in the Live Code cell to access the baby.csv dataset.
### EXERCISE: Reading CSV Files
# Using the data/baby.csv file:
# 1. Read the CSV file into a DataFrame called df_baby
# 2. Display the first 3 rows (use head() with appropriate argument)
# 3. Display the last 2 rows (use tail() with appropriate argument)
### Your code starts here:
### Your code ends here.
First 3 rows:
Birth Weight Gestational Days Maternal Age Maternal Height \
0 120 284 27 62
1 113 282 33 64
2 128 279 28 64
Maternal Pregnancy Weight Maternal Smoker
0 100 False
1 135 False
2 115 True
Last 2 rows:
Birth Weight Gestational Days Maternal Age Maternal Height \
1172 125 281 21 65
1173 117 297 38 65
Maternal Pregnancy Weight Maternal Smoker
1172 110 False
1173 129 False
4.3.1.2. Creating a DataFrame from NumPy Arrays#
So far we’ve been working with df_sample, which was loaded from a CSV file. Now let’s create a brand-new DataFrame from scratch using randomly generated data and a date-based index.
We use pd.date_range() to generate a sequence of consecutive dates, and np.random.default_rng(42) — the modern NumPy random number Generator — to fill the DataFrame with random integers. Passing a seed (42) ensures reproducibility so the same values are produced every time the code is run.
# np.random.seed(42) ### legacy way to set the global random seed; still works, but the Generator API below is recommended in newer versions of NumPy
rng = np.random.default_rng(42)
dates = pd.date_range("2020-01-01", periods=4) ### date_range() generates a sequence of dates starting from "2020-01-01" for 4 periods (4 days)
df_dates = pd.DataFrame(rng.integers(10, 100, size=(4, 6)), index=dates, columns=list("ABCDEF"))
df_dates
| A | B | C | D | E | F | |
|---|---|---|---|---|---|---|
| 2020-01-01 | 18 | 79 | 68 | 49 | 48 | 87 |
| 2020-01-02 | 17 | 72 | 28 | 18 | 57 | 97 |
| 2020-01-03 | 76 | 78 | 74 | 80 | 56 | 21 |
| 2020-01-04 | 85 | 50 | 55 | 43 | 26 | 93 |
4.3.1.3. Creating a DataFrame from a Dictionary#
The most common way to create a DataFrame from scratch is to pass a dictionary where each key becomes a column name and its value is a list of column values. You can also pass a custom index to label the rows.
df_students = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
'Age': [20, 22, 21, 23],
'City': ['NYC', 'LA', 'Chicago', 'Boston'],
'Score': [85, 92, 67, 78]
}, index=['S1', 'S2', 'S3', 'S4'])
df_students
| Name | Age | City | Score | |
|---|---|---|---|---|
| S1 | Alice | 20 | NYC | 85 |
| S2 | Bob | 22 | LA | 92 |
| S3 | Charlie | 21 | Chicago | 67 |
| S4 | Diana | 23 | Boston | 78 |
You can add a new column using a boolean condition.
df_students['Pass'] = df_students['Score'] >= 70
df_students
| Name | Age | City | Score | Pass | |
|---|---|---|---|---|---|
| S1 | Alice | 20 | NYC | 85 | True |
| S2 | Bob | 22 | LA | 92 | True |
| S3 | Charlie | 21 | Chicago | 67 | False |
| S4 | Diana | 23 | Boston | 78 | True |
Now create a DataFrame with employee salary data (in thousands):
df_sal = pd.DataFrame({
'Salary': [75, 82, 68, 91, 55],
'Bonus': [8.5, 12.3, 5.2, 15.8, 3.1],
'Years': [3, 5, 2, 7, 1],
'Rating': [4.2, 4.7, 3.8, 4.9, 3.5]
}, index=['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'])
df_sal
| Salary | Bonus | Years | Rating | |
|---|---|---|---|---|
| Alice | 75 | 8.5 | 3 | 4.2 |
| Bob | 82 | 12.3 | 5 | 4.7 |
| Charlie | 68 | 5.2 | 2 | 3.8 |
| Diana | 91 | 15.8 | 7 | 4.9 |
| Eve | 55 | 3.1 | 1 | 3.5 |
4.3.2. DataFrame Attributes & Methods#
DataFrames have many useful attributes and methods that help us understand the structure and content of our data. Let’s explore the most commonly used ones.
4.3.2.1. The dtypes Attribute#
The .dtypes attribute shows the data type of each column. Pandas infers types automatically when creating or loading a DataFrame. The most common types are:
| dtype | Description | Example values |
|---|---|---|
| `int64` | Integer numbers | `10`, `-3` |
| `float64` | Decimal numbers | `3.14`, `409.1` |
| `object` | Strings (or mixed types) | `'Alice'`, `'NYC'` |
| `bool` | Boolean values | `True`, `False` |
| `datetime64` | Dates and times | `2020-01-01` |
Getting the dtypes right matters — numeric operations like .mean() won’t work on object columns, and storage is more efficient with proper types.
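For example, numbers read in as strings end up with dtype object and break numeric operations; converting them fixes that. A quick sketch (the Price column here is a made-up illustration):

```python
import pandas as pd

# Hypothetical column where numbers were read in as strings
df = pd.DataFrame({"Price": ["10.5", "20.0", "15.25"]})
print(df["Price"].dtype)   # object -- calling .mean() here would fail

# Convert to a numeric dtype with pd.to_numeric() (or .astype(float))
df["Price"] = pd.to_numeric(df["Price"])
print(df["Price"].dtype)   # float64
print(df["Price"].mean())  # 15.25
```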
| Type Concept | Comes From | Used For | Returns / Represents |
|---|---|---|---|
| `type(x)` | Python | Built-in scalar data types | A Python object (e.g., `int`, `float`, `str`) |
| `np.int64`, `np.float64` | NumPy | Fixed-width numeric storage in arrays | A NumPy scalar type (e.g., `np.int64(75)`) |
| `.dtype` | NumPy | Describing the data type of a single array/Series | A NumPy `dtype` object (e.g., `dtype('int64')`) |
| `.dtypes` | pandas | Viewing all column data types in a DataFrame | A pandas Series of NumPy `dtype` objects |
To get the dtypes of the salary dataframe, you issue df.dtypes and pandas returns a Series object of column dtypes.
df_sal
| Salary | Bonus | Years | Rating | |
|---|---|---|---|---|
| Alice | 75 | 8.5 | 3 | 4.2 |
| Bob | 82 | 12.3 | 5 | 4.7 |
| Charlie | 68 | 5.2 | 2 | 3.8 |
| Diana | 91 | 15.8 | 7 | 4.9 |
| Eve | 55 | 3.1 | 1 | 3.5 |
df_sal.dtypes
Salary int64
Bonus float64
Years int64
Rating float64
dtype: object
When you select a single column (a Series) from a DataFrame, its dtype is shown at the end of the displayed Series. You can also use the .dtype attribute to access it directly.
df_sal['Salary'] ### Use bracket notation to select a single column
Alice 75
Bob 82
Charlie 68
Diana 91
Eve 55
Name: Salary, dtype: int64
df_sal['Salary'].dtype ### Get the data type of the 'Salary' column (SERIES)
dtype('int64')
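The four concepts in the table above can be seen side by side. A quick sketch (the Salary/Rating values are made up to mirror df_sal):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Salary": [75, 82], "Rating": [4.2, 4.7]})

print(type(42))              # Python's own type: <class 'int'>
value = df.at[0, "Salary"]   # a scalar pulled out of the DataFrame
print(type(value))           # a NumPy scalar type (np.int64 on most platforms)
print(df["Salary"].dtype)    # the dtype of one column (a Series): int64
print(df.dtypes)             # all column dtypes as a pandas Series
```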
4.3.2.2. Shape and Size#
Understanding the dimensions of a DataFrame is essential when working with real-world datasets. The .shape, .size, and len() properties give you a quick overview of how much data you’re working with.
| Item | Type | Parentheses? | Returns | Meaning |
|---|---|---|---|---|
| `.shape` | df property (attribute) | No | Tuple | Dimensions of the DataFrame |
| `.size` | df property (attribute) | No | Integer | Total number of elements (`rows * columns`) |
| `len(df)` | Python built-in function | Yes | Integer | Number of rows |
| `.head()` | df method | Yes | DataFrame | First 5 rows (by default) |
# Get the dimensions of the DataFrame (rows, columns)
df_sample.shape
print(df_sample.shape)
# Get the total number of elements (rows * columns)
df_sample.size
print(df_sample.size)
# Get the number of rows
len(df_sample)
print(len(df_sample))
(165, 4)
660
165
4.3.2.3. Columns and Index#
Every DataFrame has two key structural attributes: .columns, which holds the column labels, and .index, which labels the rows. Knowing how to access and work with these is fundamental to selecting and manipulating data.
# Get all column names
df_sample.columns
Index(['Duration', 'Pulse', 'Maxpulse', 'Calories'], dtype='object')
# Get the index (row labels)
df_sample.index
RangeIndex(start=0, stop=165, step=1)
4.3.2.4. Statistical Summary#
The .info() method gives you a concise overview of a DataFrame’s structure — including column names, data types, and non-null counts. The .describe() method then computes summary statistics (count, mean, min, max, etc.) for each column. Together, these are usually the first two things you run when exploring a new dataset.
4.3.2.4.1. The info() Method#
Now that we have the DataFrame object, let's use df.info() to see:
the data type of each column
the number of non-null values in each column
memory usage
df_sample.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 165 entries, 0 to 164
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Duration 165 non-null int64
1 Pulse 165 non-null int64
2 Maxpulse 165 non-null int64
3 Calories 160 non-null float64
dtypes: float64(1), int64(3)
memory usage: 5.3 KB
4.3.2.4.2. The describe() Method#
.describe() generates summary statistics for all numeric columns in one shot. It returns:
| Statistic | Description |
|---|---|
| `count` | Number of non-null values |
| `mean` | Average value |
| `std` | Standard deviation (spread of values) |
| `min` / `max` | Smallest and largest values |
| `25%`, `50%`, `75%` | Quartiles — the value below which 25%, 50%, and 75% of the data falls |
The 50% row is the median. If mean and median differ significantly, the distribution is likely skewed.
df_sample.describe() # Get summary statistics for numeric columns
| Duration | Pulse | Maxpulse | Calories | |
|---|---|---|---|---|
| count | 165.000000 | 165.000000 | 165.000000 | 160.000000 |
| mean | 63.757576 | 107.218182 | 133.721212 | 377.311250 |
| std | 42.792772 | 14.574063 | 16.508899 | 269.526378 |
| min | 15.000000 | 80.000000 | 100.000000 | 50.300000 |
| 25% | 45.000000 | 100.000000 | 124.000000 | 250.625000 |
| 50% | 60.000000 | 105.000000 | 131.000000 | 318.600000 |
| 75% | 60.000000 | 111.000000 | 140.000000 | 388.200000 |
| max | 300.000000 | 159.000000 | 184.000000 | 1860.400000 |
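To see the mean-vs-median comparison on a small example, take the Salary values from df_sal (recreated here so the snippet is self-contained):

```python
import pandas as pd

salary = pd.Series([75, 82, 68, 91, 55], name="Salary")
print(salary.mean())    # 74.2
print(salary.median())  # 75.0
# mean and median are close, so these salaries are roughly symmetric;
# in df_sample above, Calories has mean ~377 vs median ~319 -> right-skewed
```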
For this exercise, either
download actors.csv and place it in your project data folder, or
just use pd.read_csv("../../data/actors.csv") in the Live Code cell to access the actors.csv dataset.
### EXERCISE: Exploring DataFrame Attributes
#
# Using the actors.csv file from the data directory:
# 1. Read the CSV file into a DataFrame called df_actors
# 2. Find the number of rows and columns (shape)
# 3. Check the data types of each column
# 4. Get a statistical summary of the DataFrame
# 5. Display the index values
# 6. Calculate the total number of elements
#
### Your code starts here:
### Your code ends here.
Shape (rows, columns): (50, 6)
Data types:
Actor object
Total Gross float64
Number of Movies int64
Average per Movie float64
#1 Movie object
Gross float64
dtype: object
Statistical summary:
Total Gross Number of Movies Average per Movie Gross
count 50.000000 50.000000 50.000000 50.000000
mean 3034.088000 37.880000 93.518000 428.690000
std 591.528271 14.122843 58.509175 184.344412
min 2416.500000 7.000000 39.000000 169.700000
25% 2612.425000 27.500000 67.550000 279.300000
50% 2824.500000 37.000000 79.050000 419.150000
75% 3185.800000 43.750000 100.125000 519.800000
max 4871.700000 79.000000 451.800000 936.700000
Index: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49]
Total elements: 300
4.3.3. DataFrame Selection and Indexing#
Selection and indexing are fundamental operations in pandas that allow you to extract specific portions of your data. Understanding these operations is crucial for effective data analysis and manipulation.
Key Concepts
Selection refers to extracting data from a DataFrame—whether that’s entire columns, specific rows, or individual cells. Pandas provides several methods for selection, each with its own use cases:
Bracket notation []: The simplest method for selecting columns
df['column'] → selects a single column (returns a Series)
df[['col1', 'col2']] → selects multiple columns (returns a DataFrame)
Note: Use double brackets [[]] for multiple columns to get a DataFrame
.loc[]: Label-based indexing
Uses row and column labels (names)
Syntax: df.loc[row_label, column_label]
Supports slicing: df.loc['A':'C', 'col1':'col3'] (inclusive on both ends)
Ideal when you know the index names
.iloc[]: Integer position-based indexing
Uses row and column positions (integers, starting from 0)
Syntax: df.iloc[row_position, column_position]
Supports slicing: df.iloc[0:3, 0:2] (exclusive on the end, like Python lists)
Ideal when you need to select by position
.at[] / .iat[]: Fast access to a single value
.at[row_label, col_label] — label-based, faster than .loc[]
.iat[row_pos, col_pos] — position-based, faster than .iloc[]
Best used when reading or writing one cell, especially inside loops
When to use each method:
Use bracket notation for quick column selection
Use .loc[] when working with labeled indices (most common in data analysis)
Use .iloc[] when you need position-based selection (e.g., "first 5 rows")
Use .at[] / .iat[] when accessing a single cell — especially inside loops where performance matters
4.3.3.1. Common Patterns#
# Select a single column
df['Name']
# Select multiple columns
df[['Name', 'Age', 'City']]
# Select specific rows and columns by label
df.loc['Alice', 'Age']
df.loc[['Alice', 'Bob'], ['Name', 'Age']]
# Select by position
df.iloc[0, 1] # First row, second column
df.iloc[0:3, :] # First 3 rows, all columns
df.iloc[:, 0:2] # All rows, first 2 columns
# Boolean indexing
df[df['Age'] > 25] # Filter rows based on condition
Let’s practice these concepts with some exercises.
df_sal
| Salary | Bonus | Years | Rating | |
|---|---|---|---|---|
| Alice | 75 | 8.5 | 3 | 4.2 |
| Bob | 82 | 12.3 | 5 | 4.7 |
| Charlie | 68 | 5.2 | 2 | 3.8 |
| Diana | 91 | 15.8 | 7 | 4.9 |
| Eve | 55 | 3.1 | 1 | 3.5 |
4.3.3.2. Selecting Columns#
Use bracket notation with a single column name to select one column (the result is a Series). Pass a list of column names to select multiple columns (the result is a DataFrame).
df_sal['Salary']
Alice 75
Bob 82
Charlie 68
Diana 91
Eve 55
Name: Salary, dtype: int64
### Pass a list of column names
df_sal[['Salary','Rating']]
| Salary | Rating | |
|---|---|---|
| Alice | 75 | 4.2 |
| Bob | 82 | 4.7 |
| Charlie | 68 | 3.8 |
| Diana | 91 | 4.9 |
| Eve | 55 | 3.5 |
4.3.3.3. Using .loc[ ] - Label-based indexing#
ser = df_sal.loc['Alice']
print(type(ser), "\n")
ser
<class 'pandas.core.series.Series'>
Salary 75.0
Bonus 8.5
Years 3.0
Rating 4.2
Name: Alice, dtype: float64
df_sal.loc[['Alice', 'Bob']] ### pass a list
| Salary | Bonus | Years | Rating | |
|---|---|---|---|---|
| Alice | 75 | 8.5 | 3 | 4.2 |
| Bob | 82 | 12.3 | 5 | 4.7 |
### Get Salary and Rating for Alice and Bob
df_sal.loc[['Alice','Bob'],['Salary','Rating']]
| Salary | Rating | |
|---|---|---|
| Alice | 75 | 4.2 |
| Bob | 82 | 4.7 |
4.3.3.4. Using .iloc[ ] - Position-based indexing#
### Get the third row (index position 2) - Charlie's data
df_sal.iloc[2]
Salary 68.0
Bonus 5.2
Years 2.0
Rating 3.8
Name: Charlie, dtype: float64
### Get first two rows (positions 0 and 1)
df_sal.iloc[0:2]
| Salary | Bonus | Years | Rating | |
|---|---|---|---|---|
| Alice | 75 | 8.5 | 3 | 4.2 |
| Bob | 82 | 12.3 | 5 | 4.7 |
### Get value at row position 0, column position 0 (Alice's Salary)
df_sal.iloc[0, 0]
np.int64(75)
### Get first two rows and first two columns
df_sal.iloc[0:2, 0:2]
| Salary | Bonus | |
|---|---|---|
| Alice | 75 | 8.5 |
| Bob | 82 | 12.3 |
### Get all rows, but only columns at positions 0 and 2 (Salary and Rating)
df_sal.iloc[:, [0, 2]]
| Salary | Years | |
|---|---|---|
| Alice | 75 | 3 |
| Bob | 82 | 5 |
| Charlie | 68 | 2 |
| Diana | 91 | 7 |
| Eve | 55 | 1 |
4.3.3.5. .at[] and .iat[] — Fast Single-Value Access#
When you only need to read or write a single cell, .at[] and .iat[] are faster than .loc[] and .iloc[] because they are optimized for scalar access.
| Accessor | Based on | Example |
|---|---|---|
| `.at[]` | Row/column labels | `df_sal.at['Alice', 'Salary']` |
| `.iat[]` | Integer positions | `df_sal.iat[0, 0]` |
Use these when performance matters in a loop or when you just need one value.
# Using .at[] for fast label-based access to a single value
df_sal.at['Alice', 'Salary']
np.int64(75)
# Performance comparison: .at[] is faster than .loc[] for single values
# .loc[] works, but .at[] is optimized for this use case
print("Using .loc[]:", df_sal.loc['Bob', 'Rating'])
print("Using .at[]:", df_sal.at['Bob', 'Rating'])
Using .loc[]: 4.7
Using .at[]: 4.7
# .iat[] uses integer positions (row index, column index) — both are 0-based
print("Using .iloc[]:", df_sal.iloc[1, 3]) # row 1, col 3
print("Using .iat[]:", df_sal.iat[1, 3]) # same cell, faster for single value
Using .iloc[]: 4.7
Using .iat[]: 4.7
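As a sketch of the loop use case, here .iat[] reads and writes one cell per iteration (for a uniform bump like this, the vectorized df['Bonus'] += 1.0 would normally be preferred; the data mirrors a slice of df_sal):

```python
import pandas as pd

df = pd.DataFrame({"Salary": [75, 82, 68], "Bonus": [8.5, 12.3, 5.2]},
                  index=["Alice", "Bob", "Charlie"])

# Add 1.0 to every Bonus cell, one at a time, via fast positional access
col = df.columns.get_loc("Bonus")      # integer position of the Bonus column
for i in range(len(df)):
    df.iat[i, col] = df.iat[i, col] + 1.0

print([round(b, 1) for b in df["Bonus"]])  # [9.5, 13.3, 6.2]
```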
4.3.3.6. Boolean Indexing#
Boolean indexing lets you filter rows based on a condition. You pass a boolean expression inside [], and pandas returns only the rows where the condition is True.
A single condition: df[df['Salary'] > 70]
Multiple conditions: combine with & (and) or | (or), and wrap each condition in parentheses
Best practice: use .loc[] when you also want to select specific columns alongside the filter
Use .query() for a cleaner string-based syntax: df.query('Salary > 70 and Years > 2')
df_sal
| Salary | Bonus | Years | Rating | |
|---|---|---|---|---|
| Alice | 75 | 8.5 | 3 | 4.2 |
| Bob | 82 | 12.3 | 5 | 4.7 |
| Charlie | 68 | 5.2 | 2 | 3.8 |
| Diana | 91 | 15.8 | 7 | 4.9 |
| Eve | 55 | 3.1 | 1 | 3.5 |
df_sal[df_sal["Salary"]>70]
| Salary | Bonus | Years | Rating | |
|---|---|---|---|---|
| Alice | 75 | 8.5 | 3 | 4.2 |
| Bob | 82 | 12.3 | 5 | 4.7 |
| Diana | 91 | 15.8 | 7 | 4.9 |
### Get Years for employees with Salary > 70
df_sal[df_sal['Salary']>70]['Years']
Alice 3
Bob 5
Diana 7
Name: Years, dtype: int64
### BETTER PRACTICE: Use .loc[] for row and column selection together
df_sal.loc[df_sal['Salary']>70, ['Years','Bonus']]
| Years | Bonus | |
|---|---|---|
| Alice | 3 | 8.5 |
| Bob | 5 | 12.3 |
| Diana | 7 | 15.8 |
### Combine conditions with & (and) or | (or)
### Each condition must be in parentheses
df_sal.loc[(df_sal['Salary']>70) & (df_sal['Years'] > 2)]
| Salary | Bonus | Years | Rating | |
|---|---|---|---|---|
| Alice | 75 | 8.5 | 3 | 4.2 |
| Bob | 82 | 12.3 | 5 | 4.7 |
| Diana | 91 | 15.8 | 7 | 4.9 |
.query() is an alternative to boolean indexing that uses a cleaner string syntax — no need for df[...] wrappers or parentheses around each condition. It’s especially readable for multi-condition filters.
# Same result as above, but more readable
df_sal.query('Salary > 70 and Years > 2')
| Salary | Bonus | Years | Rating | |
|---|---|---|---|---|
| Alice | 75 | 8.5 | 3 | 4.2 |
| Bob | 82 | 12.3 | 5 | 4.7 |
| Diana | 91 | 15.8 | 7 | 4.9 |
### EXERCISE: DataFrame Selection and Indexing
#
# Using df_sal (the employee salary DataFrame defined earlier):
# Index: Alice, Bob, Charlie, Diana, Eve
# Columns: Salary, Bonus, Years, Rating
#
# Tasks:
# 1. Select only the 'Salary' and 'Years' columns (use double brackets)
# 2. Use .loc[] to get the 'Bonus' and 'Rating' for 'Bob' and 'Diana'
# 3. Use .iloc[] to get the first 3 rows and first 2 columns
# 4. Use boolean indexing to filter employees with Rating >= 4.5
# 5. Use .query() to find employees with Years >= 3 and Bonus > 8
#
### Your code starts here:
### Your code ends here.
Salary and Years:
Salary Years
Alice 75 3
Bob 82 5
Charlie 68 2
Diana 91 7
Eve 55 1
Bonus and Rating for Bob and Diana:
Bonus Rating
Bob 12.3 4.7
Diana 15.8 4.9
First 3 rows, first 2 columns:
Salary Bonus
Alice 75 8.5
Bob 82 12.3
Charlie 68 5.2
Employees with Rating >= 4.5:
Salary Bonus Years Rating
Bob 82 12.3 5 4.7
Diana 91 15.8 7 4.9
Employees with Years >= 3 and Bonus > 8:
Salary Bonus Years Rating
Alice 75 8.5 3 4.2
Bob 82 12.3 5 4.7
Diana 91 15.8 7 4.9
4.3.4. Modifying Rows/Columns#
4.3.4.1. Adding/Dropping columns#
Use the .drop() method to remove columns. Note that you must specify axis=1 to drop a column by its label.
Understanding axis parameter:
axis=0 → operates on rows (index) - "go down the rows"
axis=1 → operates on columns - "go across the columns"
Mnemonic: Think of axis as the direction you’re moving. axis=0 goes down (rows), axis=1 goes across (columns).
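A quick way to check the mnemonic is with .sum(), using a tiny throwaway DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"Q1": [1, 2], "Q2": [10, 20]}, index=["P1", "P2"])

print(df.sum(axis=0))  # down the rows: one total per column (Q1=3, Q2=30)
print(df.sum(axis=1))  # across the columns: one total per row (P1=11, P2=22)
```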
### Create a new column: Total Compensation
df_sal['TotalComp'] = df_sal['Salary'] + df_sal['Bonus']
# Note: column assignment with [] modifies the DataFrame in place, unlike most DataFrame methods, which return a new object by default
df_sal
| Salary | Bonus | Years | Rating | TotalComp | |
|---|---|---|---|---|---|
| Alice | 75 | 8.5 | 3 | 4.2 | 83.5 |
| Bob | 82 | 12.3 | 5 | 4.7 | 94.3 |
| Charlie | 68 | 5.2 | 2 | 3.8 | 73.2 |
| Diana | 91 | 15.8 | 7 | 4.9 | 106.8 |
| Eve | 55 | 3.1 | 1 | 3.5 | 58.1 |
df_sal.drop('TotalComp',axis=1,inplace=True)
df_sal
| Salary | Bonus | Years | Rating | |
|---|---|---|---|---|
| Alice | 75 | 8.5 | 3 | 4.2 |
| Bob | 82 | 12.3 | 5 | 4.7 |
| Charlie | 68 | 5.2 | 2 | 3.8 |
| Diana | 91 | 15.8 | 7 | 4.9 |
| Eve | 55 | 3.1 | 1 | 3.5 |
4.3.4.2. Dropping rows#
To drop rows, use the .drop() method with axis=0 (or no axis parameter since axis=0 is the default).
You can drop rows by specifying their index labels.
# Drop a single row by index label (returns a new DataFrame)
df_sal.drop('Charlie')
| Salary | Bonus | Years | Rating | |
|---|---|---|---|---|
| Alice | 75 | 8.5 | 3 | 4.2 |
| Bob | 82 | 12.3 | 5 | 4.7 |
| Diana | 91 | 15.8 | 7 | 4.9 |
| Eve | 55 | 3.1 | 1 | 3.5 |
# Drop multiple rows (pass a list of index labels)
df_sal.drop(['Alice', 'Eve'], axis=0)
| Salary | Bonus | Years | Rating | |
|---|---|---|---|---|
| Bob | 82 | 12.3 | 5 | 4.7 |
| Charlie | 68 | 5.2 | 2 | 3.8 |
| Diana | 91 | 15.8 | 7 | 4.9 |
# Use inplace=True to modify the original DataFrame
# df_sal.drop('Bob', axis=0, inplace=True)
# Note: Commented out to preserve df_sal for other examples
### EXERCISE: DataFrame Indexing Practice
#
# Given the following sales data:
sales_data = pd.DataFrame({
'Product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Webcam'],
'Q1': [50, 200, 150, 75, 90],
'Q2': [65, 210, 160, 80, 95],
'Q3': [70, 190, 155, 85, 100],
'Q4': [80, 220, 170, 90, 110]
}, index=['P1', 'P2', 'P3', 'P4', 'P5'])
#
# Tasks:
# 1. Select the Q2 and Q4 columns
# 2. Select rows P2 and P4
# 3. Select Q3 sales for rows P1, P3, and P5 (should return a Series)
# 4. Add a new column 'Total' that sums Q1 through Q4 (Tip: use .sum(axis=1))
# 5. Drop the Q1 column (without modifying the original DataFrame)
#
# Hint: Use double brackets [[]] for selecting multiple columns
#
### Your code starts here:
### Your code ends here.
Q2 and Q4 columns:
Q2 Q4
P1 65 80
P2 210 220
P3 160 170
P4 80 90
P5 95 110
Rows P2 and P4:
Product Q1 Q2 Q3 Q4
P2 Mouse 200 210 190 220
P4 Monitor 75 80 85 90
Q3 sales for P1, P3, P5:
P1 70
P3 155
P5 100
Name: Q3, dtype: int64
With Total column:
Product Q1 Q2 Q3 Q4 Total
P1 Laptop 50 65 70 80 265
P2 Mouse 200 210 190 220 820
P3 Keyboard 150 160 155 170 635
P4 Monitor 75 80 85 90 330
P5 Webcam 90 95 100 110 395
Without Q1 column:
Product Q2 Q3 Q4
P1 Laptop 65 70 80
P2 Mouse 210 190 220
P3 Keyboard 160 155 170
P4 Monitor 80 85 90
P5 Webcam 95 100 110
4.3.5. Applying Functions with .apply()#
The .apply() method is a powerful tool that allows you to apply a function along an axis of the DataFrame (rows or columns). This is useful for transformations, calculations, and custom operations.
The syntax for .apply() is:
DataFrame.apply(func, axis=0, raw=False, result_type=None, args=(), **kwargs)
Parameters:
func: Function to apply to each column or rowaxis: {0 or ‘index’, 1 or ‘columns’}, default 00 or ‘index’: apply function to each column
1 or ‘columns’: apply function to each row
raw: bool, default False. If False, passes each row/column as a Series to the function. If True, passes an ndarray
result_type: {'expand', 'reduce', 'broadcast', None}, default None
args: tuple, positional arguments to pass to func
**kwargs: additional keyword arguments to pass to func
Returns: Series or DataFrame
Performance Note: For simple operations like arithmetic or comparisons, vectorized operations are usually faster than .apply(). Use .apply() when you need to apply a complex function that can’t be vectorized.
# Vectorized (faster): df['A'] * 2
# Using apply (slower): df['A'].apply(lambda x: x * 2)
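A rough timing sketch of that comparison (absolute numbers vary by machine, but the vectorized form should win by a wide margin):

```python
import time

import numpy as np
import pandas as pd

s = pd.Series(np.arange(1_000_000))

start = time.perf_counter()
vectorized = s * 2                      # one NumPy-level operation
t_vectorized = time.perf_counter() - start

start = time.perf_counter()
applied = s.apply(lambda x: x * 2)      # one Python function call per element
t_applied = time.perf_counter() - start

print(applied.equals(vectorized))       # True: same result either way
print(t_applied > t_vectorized)         # typically True, often by 10-100x
```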
df_dates
| A | B | C | D | E | F | |
|---|---|---|---|---|---|---|
| 2020-01-01 | 18 | 79 | 68 | 49 | 48 | 87 |
| 2020-01-02 | 17 | 72 | 28 | 18 | 57 | 97 |
| 2020-01-03 | 76 | 78 | 74 | 80 | 56 | 21 |
| 2020-01-04 | 85 | 50 | 55 | 43 | 26 | 93 |
# Apply a function to each column (axis=0, default)
print(df_dates.apply(np.sum))
print()
print(type(df_dates.apply(np.sum)))
A 196
B 279
C 225
D 190
E 187
F 298
dtype: int64
<class 'pandas.core.series.Series'>
df_dates.apply(np.sum, axis=1) ### axis = 1
2020-01-01 349
2020-01-02 289
2020-01-03 385
2020-01-04 352
Freq: D, dtype: int64
# Apply a custom function (double each value)
df_dates.apply(lambda x: x * 2)
| A | B | C | D | E | F | |
|---|---|---|---|---|---|---|
| 2020-01-01 | 36 | 158 | 136 | 98 | 96 | 174 |
| 2020-01-02 | 34 | 144 | 56 | 36 | 114 | 194 |
| 2020-01-03 | 152 | 156 | 148 | 160 | 112 | 42 |
| 2020-01-04 | 170 | 100 | 110 | 86 | 52 | 186 |
### EXERCISE: Applying Functions
#
# Create a DataFrame:
apply_df = pd.DataFrame({
'A': [1, 2, 3, 4],
'B': [10, 20, 30, 40],
'C': [100, 200, 300, 400]
})
#
# Tasks:
# 1. Apply a function to get the sum of each column
# 2. Apply a function to get the maximum value in each row
# 3. Apply a lambda function to double all values in column A
# 4. Apply a custom function that adds 5 to all values
#
### Your code starts here:
### Your code ends here.
Sum of each column:
A 10
B 100
C 1000
dtype: int64
Maximum value in each row:
0 100
1 200
2 300
3 400
dtype: int64
Column A doubled:
0 2
1 4
2 6
3 8
Name: A, dtype: int64
All values plus 5:
A B C
0 6 15 105
1 7 25 205
2 8 35 305
3 9 45 405
4.3.6. Advanced Indexing#
Indices are a powerful feature in Pandas that distinguish it from simple arrays. This section covers advanced index operations including basic manipulation and hierarchical indexing.
df_dates
| A | B | C | D | E | F | |
|---|---|---|---|---|---|---|
| 2020-01-01 | 18 | 79 | 68 | 49 | 48 | 87 |
| 2020-01-02 | 17 | 72 | 28 | 18 | 57 | 97 |
| 2020-01-03 | 76 | 78 | 74 | 80 | 56 | 21 |
| 2020-01-04 | 85 | 50 | 55 | 43 | 26 | 93 |
4.3.6.1. Changing Index#
You can directly assign a new index to a DataFrame by setting the .index attribute. The new index must have the same length as the number of rows.
newind = 'MO MI MS MN'.split()
newind
['MO', 'MI', 'MS', 'MN']
df_dates.index = newind
df_dates
| A | B | C | D | E | F | |
|---|---|---|---|---|---|---|
| MO | 18 | 79 | 68 | 49 | 48 | 87 |
| MI | 17 | 72 | 28 | 18 | 57 | 97 |
| MS | 76 | 78 | 74 | 80 | 56 | 21 |
| MN | 85 | 50 | 55 | 43 | 26 | 93 |
4.3.6.1.1. Resetting and Setting Index#
Besides direct assignment, Pandas provides two powerful methods for manipulating DataFrame indices:
.reset_index(): Converts the current index back to a regular column and creates a new default integer index (0, 1, 2, …).
drop=False (default): keeps the old index as a new column
drop=True: discards the old index completely
inplace=False (default): returns a new DataFrame
inplace=True: modifies the DataFrame in place
.set_index(): Converts a column into the index.
Takes a column name (or list of column names for MultiIndex)
The specified column(s) become the new index
drop=True (default): removes the column from the DataFrame after setting it as the index
inplace=False (default): returns a new DataFrame
These methods are particularly useful for:
Preparing data for merging operations
Converting between wide and long formats
Creating hierarchical indices
Organizing data for time series analysis
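As mentioned above, .set_index() also accepts a list of column names, which produces a MultiIndex. A minimal sketch (the column names here are made up for illustration):

```python
import pandas as pd

# Illustrative data; 'state', 'year', and 'pop' are hypothetical columns
df = pd.DataFrame({
    'state': ['MO', 'MO', 'MI', 'MI'],
    'year': [2020, 2021, 2020, 2021],
    'pop': [6.1, 6.2, 10.0, 10.1]
})

# Passing a list of columns creates a two-level hierarchical index
df_multi = df.set_index(['state', 'year'])
print(df_multi)
print(df_multi.index.nlevels)  # 2
```

Rows can then be addressed by label tuples, e.g. `df_multi.loc[('MO', 2020), 'pop']`. Hierarchical indices are covered in more detail in the Multi-level index section below.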
# Example: Create a DataFrame with custom index
sample_df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie'],
'Score': [85, 92, 78],
'Grade': ['B', 'A', 'C']
}, index=[100, 200, 300])
print("Original DataFrame:")
print(sample_df)
print()
Original DataFrame:
Name Score Grade
100 Alice 85 B
200 Bob 92 A
300 Charlie 78 C
# reset_index() with drop=False: old index becomes a column named 'index'
reset_df = sample_df.reset_index()
print("After reset_index() with drop=False:")
print(reset_df)
print()
After reset_index() with drop=False:
index Name Score Grade
0 100 Alice 85 B
1 200 Bob 92 A
2 300 Charlie 78 C
# reset_index() with drop=True: old index is discarded
reset_df_dropped = sample_df.reset_index(drop=True)
print("After reset_index() with drop=True:")
print(reset_df_dropped)
print()
After reset_index() with drop=True:
Name Score Grade
0 Alice 85 B
1 Bob 92 A
2 Charlie 78 C
# set_index(): Make 'Name' column the new index
name_index_df = sample_df.set_index('Name')
print("After set_index('Name'):")
print(name_index_df)
print()
print("Notice 'Name' is now the index, not a column")
After set_index('Name'):
Score Grade
Name
Alice 85 B
Bob 92 A
Charlie 78 C
Notice 'Name' is now the index, not a column
# Chain reset_index() and set_index() together
# First reset to default index, then set 'Grade' as new index
chained_df = sample_df.reset_index(drop=True).set_index('Grade')
print("After chaining reset_index() and set_index('Grade'):")
print(chained_df)
After chaining reset_index() and set_index('Grade'):
Name Score
Grade
B Alice 85
A Bob 92
C Charlie 78
### EXERCISE: Index Manipulation
#
# Create a DataFrame:
idx_df = pd.DataFrame({
'Value': [10, 20, 30, 40],
'Category': ['A', 'B', 'A', 'C']
}, index=[100, 200, 300, 400])
#
# Tasks:
# 1. Display the original DataFrame
# 2. Reset the index (drop the old index)
# 3. Set 'Category' as the new index
# 4. Display the final DataFrame
#
### Your code starts here:
### Your code ends here.
Original DataFrame:
Value Category
100 10 A
200 20 B
300 30 A
400 40 C
After reset_index:
Value Category
0 10 A
1 20 B
2 30 A
3 40 C
With Category as index:
Value
Category
A 10
B 20
A 30
C 40
4.3.6.2. Multi-level index#
A multi-level (hierarchical) index stores two or more label levels per row. One common constructor is pd.MultiIndex.from_arrays(), which zips parallel arrays of labels into index tuples.
# Create a hierarchical index using MultiIndex.from_arrays()
# The paired entries of the two arrays form (outer_level, inner_level) labels
outside = ['G1', 'G1', 'G1', 'G2', 'G2', 'G2']
inside = [1, 2, 3, 1, 2, 3]
hier_index = pd.MultiIndex.from_arrays([outside, inside], names=['Group', 'Num'])
hier_index
MultiIndex([('G1', 1),
('G1', 2),
('G1', 3),
('G2', 1),
('G2', 2),
('G2', 3)],
names=['Group', 'Num'])
df_multi = pd.DataFrame(data = np.random.randn(6,2), index=hier_index, columns=['A','B'])
df_multi
| | | A | B |
|---|---|---|---|
| Group | Num | | |
| G1 | 1 | -0.568813 | 0.109624 |
| | 2 | -0.538005 | -0.262710 |
| | 3 | -1.862577 | -1.893508 |
| G2 | 1 | 0.307188 | -0.372445 |
| | 2 | -0.352839 | -0.453912 |
| | 3 | -0.180638 | 0.019537 |
df_multi.loc['G1']
| | A | B |
|---|---|---|
| Num | | |
| 1 | -0.568813 | 0.109624 |
| 2 | -0.538005 | -0.262710 |
| 3 | -1.862577 | -1.893508 |
df_multi.loc['G2'].loc[2]['B'] ### the 2 here is the 'Num' index label 2, not a positional index
np.float64(-0.45391226176240046)
df_multi.index.names
FrozenList(['Group', 'Num'])
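The FrozenList shown above cannot be modified item-by-item, but the level names can be replaced wholesale, either by assigning a new list to .index.names or, non-destructively, with .rename_axis(). A sketch on a freshly built frame (not part of the original notebook):

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_arrays(
    [['G1', 'G1', 'G2', 'G2'], [1, 2, 1, 2]], names=['Group', 'Num']
)
df = pd.DataFrame(np.arange(8).reshape(4, 2), index=idx, columns=['A', 'B'])

# rename_axis returns a copy with the new level names; df is untouched
renamed = df.rename_axis(index=['Team', 'Round'])
print(renamed.index.names)
print(df.index.names)  # original names are unchanged
```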
df_multi
| | | A | B |
|---|---|---|---|
| Group | Num | | |
| G1 | 1 | -0.568813 | 0.109624 |
| | 2 | -0.538005 | -0.262710 |
| | 3 | -1.862577 | -1.893508 |
| G2 | 1 | 0.307188 | -0.372445 |
| | 2 | -0.352839 | -0.453912 |
| | 3 | -0.180638 | 0.019537 |
The .xs() method (short for cross-section) is a convenient way to select rows or columns from a Pandas DataFrame, especially when working with MultiIndex objects. It works like a shortcut for slicing.
df_multi.xs(1, level='Num')
| | A | B |
|---|---|---|
| Group | | |
| G1 | -0.568813 | 0.109624 |
| G2 | 0.307188 | -0.372445 |
# Select all rows where Num = 1 (across all groups)
print(df_multi.xs(1, level='Num'))
A B
Group
G1 -0.568813 0.109624
G2 0.307188 -0.372445
print(df_multi.xs(2, level='Num'))
A B
Group
G1 -0.538005 -0.262710
G2 -0.352839 -0.453912
# Select all rows where Group = 'G1'
df_multi.xs('G1', level='Group')
| | A | B |
|---|---|---|
| Num | | |
| 1 | -0.568813 | 0.109624 |
| 2 | -0.538005 | -0.262710 |
| 3 | -1.862577 | -1.893508 |
# Select a specific (Group, Num) combination
df_multi.xs(('G1', 1))
A -0.568813
B 0.109624
Name: (G1, 1), dtype: float64
# Get Num=1 rows, only column A
df_multi.xs(1, level='Num')['A']
Group
G1 -0.568813
G2 0.307188
Name: A, dtype: float64
# Get Num=1 rows, both A and B
df_multi.xs(1, level='Num')[['A', 'B']]
| | A | B |
|---|---|---|
| Group | | |
| G1 | -0.568813 | 0.109624 |
| G2 | 0.307188 | -0.372445 |
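Notice that in the results above, the 'Num' level disappears from the index: by default .xs() drops the level it selects on. It accepts a drop_level=False argument to keep that level; a small sketch using a freshly built frame:

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_arrays(
    [['G1', 'G1', 'G2', 'G2'], [1, 2, 1, 2]], names=['Group', 'Num']
)
df = pd.DataFrame(np.arange(8).reshape(4, 2), index=idx, columns=['A', 'B'])

# drop_level=False retains the 'Num' level in the result's index
kept = df.xs(1, level='Num', drop_level=False)
print(kept)
print(kept.index.names)  # both 'Group' and 'Num' remain
```

Keeping the level is handy when the cross-section will later be concatenated back with other slices and the full index is needed for alignment.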
### EXERCISE: Working with MultiIndex
#
# Create a MultiIndex DataFrame:
regions = ['North', 'North', 'North', 'South', 'South', 'South']
quarters = ['Q1', 'Q2', 'Q3', 'Q1', 'Q2', 'Q3']
multi_idx = pd.MultiIndex.from_arrays([regions, quarters], names=['Region', 'Quarter'])
sales_multi = pd.DataFrame({
'Revenue': [100, 150, 120, 80, 90, 110],
'Costs': [60, 70, 65, 50, 55, 60]
}, index=multi_idx)
#
# Tasks:
# 1. Display the DataFrame
# 2. Select all data for the 'North' region using .loc[]
# 3. Select all Q1 data across both regions using .xs()
# 4. Get the Revenue for South region, Q2
#
### Your code starts here:
### Your code ends here.
MultiIndex DataFrame:
Revenue Costs
Region Quarter
North Q1 100 60
Q2 150 70
Q3 120 65
South Q1 80 50
Q2 90 55
Q3 110 60
Data for North region:
Revenue Costs
Quarter
Q1 100 60
Q2 150 70
Q3 120 65
Q1 data across all regions:
Revenue Costs
Region
North 100 60
South 80 50
Revenue for South region, Q2:
90