4.3. DataFrames#

The DataFrame is the core data structure in Pandas, modeled after similar structures in the R programming language. A DataFrame is a two-dimensional, labeled data structure that can be thought of as:

  • A collection of Series objects combined into a single table, all sharing the same index

  • A spreadsheet or SQL table with rows and columns

  • A dictionary of Series, where each column is a Series with a shared index
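The "dictionary of Series" view can be made concrete. Below is a minimal sketch (the names and values are made up for illustration) that builds a DataFrame from two Series sharing the same index:

```python
import pandas as pd

# Two Series that share the same index labels
heights = pd.Series([160, 175, 168], index=['Ann', 'Ben', 'Cal'])
weights = pd.Series([55, 70, 62], index=['Ann', 'Ben', 'Cal'])

# Each Series becomes one column; the shared index labels the rows
df = pd.DataFrame({'Height': heights, 'Weight': weights})
print(df)
```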

Key Features of DataFrames

  • Labeled axes: Both rows (index) and columns have labels, making data access intuitive and explicit

  • Heterogeneous data: Different columns can contain different data types (integers, floats, strings, dates, etc.)

  • Size-mutable: You can insert and delete columns

  • Powerful operations: Built-in methods for filtering, grouping, merging, reshaping, and statistical analysis

  • Alignment of data based on labels

  • Missing data handling (see 0404-missing-data.ipynb for details)

  • Rich I/O tools for reading/writing various formats (CSV, Excel, SQL, JSON, etc.)

Why DataFrames?

While NumPy arrays are excellent for numerical computation, DataFrames offer several advantages:

  • Column names instead of just numeric indices

  • Mixed data types in different columns

  • Automatic alignment of data by label when combining datasets

  • Built-in handling of missing data

  • A rich library of methods for filtering, grouping, merging, and reshaping

What You’ll Learn

In this notebook, we’ll explore:

  1. Creating DataFrames from various sources

  2. DataFrame attributes and methods

  3. Indexing and selecting data

  4. Sorting and filtering

  5. Working with indices, including MultiIndex

Let’s use Pandas to explore how DataFrames work!

import numpy as np
import pandas as pd

4.3.1. Creating DataFrames#

There are several common ways to create a DataFrame in pandas:

| Method | Description |
|---|---|
| `pd.read_csv()` | Load data from a CSV file — the most common approach in practice |
| `pd.DataFrame(dict)` | Build a DataFrame from a Python dictionary (keys become column names) |
| `pd.DataFrame(list of lists)` | Build from a 2D list, optionally specifying column names |
| `pd.DataFrame(np array)` | Build from a NumPy array, useful for numeric/randomly generated data |

In this section we’ll cover the two most common: reading from a file and constructing one from scratch.
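For completeness, here is a quick sketch (with made-up values) of the two constructor variants from the table that this section does not demonstrate: a 2D list and a NumPy array.

```python
import numpy as np
import pandas as pd

# From a 2D list, with column names given explicitly
df_list = pd.DataFrame([[1, 'a'], [2, 'b']], columns=['num', 'letter'])

# From a NumPy array; without columns=, pandas defaults to 0, 1, 2, ...
df_arr = pd.DataFrame(np.arange(6).reshape(2, 3), columns=['x', 'y', 'z'])

print(df_list)
print(df_arr)
```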

4.3.1.1. pd.read_csv( )#

  • The most common way to read a file is to use the read_csv() function.

# Read CSV file - forward slashes (/) work on Windows, macOS, and Linux

df_sample = pd.read_csv("../../data/data.csv")

4.3.1.1.1. DataFrame Output and Preview#

To preview a DataFrame, we usually use the .head() and .tail() methods.

There are several ways to display a DataFrame, each with different formatting and truncation behavior:

| Method | Description |
|---|---|
| `df_sample` | Displays the DataFrame in table format, truncated to the first and last 5 rows by default |
| `print(df_sample)` | Prints a plain-text representation, also truncated to the first and last 5 rows |
| `df_sample.to_string()` | Converts the entire DataFrame to a string without truncation |
| `pd.options.display.max_rows` | Controls how many rows are displayed before truncation |

To display a DataFrame, simply type its name:

df_sample      # Default view is table-like
               # Pandas will show first 5 and last 5 rows (10 total) by default
               # The shape information is displayed at the bottom
     Duration  Pulse  Maxpulse  Calories
0          60    110       130     409.1
1          60    117       145     479.0
2          60    103       135     340.0
3          45    109       175     282.4
4          45    117       148     406.0
..        ...    ...       ...       ...
160        30     85       120     250.4
161        45     90       130     260.4
162        45     95       130     270.0
163        45    100       140     280.9
164        60    105       140     290.8

165 rows × 4 columns

Using print() displays the DataFrame in plain text format:

print(df_sample)   # print() shows same information but in plain text format            
     Duration  Pulse  Maxpulse  Calories
0          60    110       130     409.1
1          60    117       145     479.0
2          60    103       135     340.0
3          45    109       175     282.4
4          45    117       148     406.0
..        ...    ...       ...       ...
160        30     85       120     250.4
161        45     90       130     260.4
162        45     95       130     270.0
163        45    100       140     280.9
164        60    105       140     290.8

[165 rows x 4 columns]

.to_string()

  • .to_string() converts the entire DataFrame to a string without truncation.

  • useful when you need the complete string representation

  • the sample below is commented out because it would print all 165 rows of data.

### The to_string() method is used to display the entire 
### DataFrame as a string, without truncation.
### commented out to avoid overwhelming output in case of large DataFrame

# print(df_sample.to_string())

Pandas has a display.max_rows option, which is 60 by default. When a DataFrame has more rows than max_rows, the displayed output is truncated to the default 10 rows (first 5 + last 5). Change the pd.options.display.max_rows setting if you need to print more rows.
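If you only want to raise the limit temporarily, pd.option_context is a handy alternative to changing the global setting. A small sketch:

```python
import pandas as pd

before = pd.get_option('display.max_rows')

# Inside the with-block the option is 200; it reverts automatically on exit
with pd.option_context('display.max_rows', 200):
    inside = pd.get_option('display.max_rows')

after = pd.get_option('display.max_rows')
print(before, inside, after)
```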

print(pd.options.display.max_rows, "\n")    ### current max_rows setting (default is 60)
print(df_sample.head(11), "\n")             ### 11 is less than max_rows, will print  
# print(df_sample.head(100), "\n")          ### 100 is more than max_rows, show first 5 + last 5 rows
60 

    Duration  Pulse  Maxpulse  Calories
0         60    110       130     409.1
1         60    117       145     479.0
2         60    103       135     340.0
3         45    109       175     282.4
4         45    117       148     406.0
5         60    102       127     300.5
6         60    110       136     374.0
7         45    104       134     253.3
8         30    109       133     195.1
9         60     98       124     269.0
10        60    103       147     329.3 

When working with large datasets, you often want to see just a sample of the data rather than the entire DataFrame.

# View first 5 rows (default)
df_sample.head()
   Duration  Pulse  Maxpulse  Calories
0        60    110       130     409.1
1        60    117       145     479.0
2        60    103       135     340.0
3        45    109       175     282.4
4        45    117       148     406.0
# View last 3 rows
df_sample.tail(3)
     Duration  Pulse  Maxpulse  Calories
162        45     95       130     270.0
163        45    100       140     280.9
164        60    105       140     290.8

For this exercise, either

  • download baby.csv and place it in your project data folder, or

  • just use pd.read_csv("../../data/baby.csv") in the Live Code cell to access the baby.csv dataset.

### EXERCISE: Reading CSV Files
# Using the data/baby.csv file:
# 1. Read the CSV file into a DataFrame called df_baby
# 2. Display the first 3 rows (use head() with appropriate argument)
# 3. Display the last 2 rows (use tail() with appropriate argument)
### Your code starts here:





### Your code ends here.


# Solution

import pandas as pd

# 1. Read the CSV file
df_baby = pd.read_csv('../../data/baby.csv')

# 2. Display the first 3 rows
print("First 3 rows:")
print(df_baby.head(3))
print()

# 3. Display the last 2 rows
print("Last 2 rows:")
print(df_baby.tail(2))
print()
First 3 rows:
   Birth Weight  Gestational Days  Maternal Age  Maternal Height  \
0           120               284            27               62   
1           113               282            33               64   
2           128               279            28               64   

   Maternal Pregnancy Weight  Maternal Smoker  
0                        100            False  
1                        135            False  
2                        115             True  

Last 2 rows:
      Birth Weight  Gestational Days  Maternal Age  Maternal Height  \
1172           125               281            21               65   
1173           117               297            38               65   

      Maternal Pregnancy Weight  Maternal Smoker  
1172                        110            False  
1173                        129            False  

4.3.1.2. Creating a DataFrame from NumPy Arrays#

So far we’ve been working with df_sample, which was loaded from a CSV file. Now let’s create a brand-new DataFrame from scratch using randomly generated data and a date-based index.

We use pd.date_range() to generate a sequence of consecutive dates, and np.random.default_rng(42) — the modern NumPy random number Generator — to fill the DataFrame with random integers. Passing a seed (42) ensures reproducibility so the same values are produced every time the code is run.

# np.random.seed(42)    ### old way to set random seed, still works but will show a warning in newer versions of NumPy
rng = np.random.default_rng(42)

dates = pd.date_range("2020-01-01", periods=4)  ### date_range() generates a sequence of dates starting from "2020-01-01" for 4 periods (4 days)

df_dates = pd.DataFrame(rng.integers(10, 100, size=(4, 6)), index=dates, columns=list("ABCDEF"))
df_dates
             A   B   C   D   E   F
2020-01-01  18  79  68  49  48  87
2020-01-02  17  72  28  18  57  97
2020-01-03  76  78  74  80  56  21
2020-01-04  85  50  55  43  26  93

4.3.1.3. Creating a DataFrame from a Dictionary#

The most common way to create a DataFrame from scratch is to pass a dictionary where each key becomes a column name and its value is a list of column values. You can also pass a custom index to label the rows.

df_students = pd.DataFrame({
    'Name':  ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Age':   [20, 22, 21, 23],
    'City':  ['NYC', 'LA', 'Chicago', 'Boston'],
    'Score': [85, 92, 67, 78]
}, index=['S1', 'S2', 'S3', 'S4'])

df_students
       Name  Age     City  Score
S1    Alice   20      NYC     85
S2      Bob   22       LA     92
S3  Charlie   21  Chicago     67
S4    Diana   23   Boston     78

You can add a new column using a boolean condition.

df_students['Pass'] = df_students['Score'] >= 70
df_students
       Name  Age     City  Score   Pass
S1    Alice   20      NYC     85   True
S2      Bob   22       LA     92   True
S3  Charlie   21  Chicago     67  False
S4    Diana   23   Boston     78   True

Let's create a DataFrame with employee salary data (in thousands):

df_sal = pd.DataFrame({
    'Salary': [75, 82, 68, 91, 55],
    'Bonus': [8.5, 12.3, 5.2, 15.8, 3.1],
    'Years': [3, 5, 2, 7, 1],
    'Rating': [4.2, 4.7, 3.8, 4.9, 3.5]
}, index=['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'])

df_sal
         Salary  Bonus  Years  Rating
Alice        75    8.5      3     4.2
Bob          82   12.3      5     4.7
Charlie      68    5.2      2     3.8
Diana        91   15.8      7     4.9
Eve          55    3.1      1     3.5

4.3.2. DataFrame Attributes & Methods#

DataFrames have many useful attributes and methods that help us understand the structure and content of our data. Let’s explore the most commonly used ones.

4.3.2.1. The dtypes Attribute#

The .dtypes attribute shows the data type of each column. Pandas infers types automatically when creating or loading a DataFrame. The most common types are:

| dtype | Description | Example values |
|---|---|---|
| `int64` | Integer numbers | `1, 42, -7` |
| `float64` | Decimal numbers | `3.14, NaN` |
| `object` | Strings (or mixed types) | `'Alice', 'NYC'` |
| `bool` | Boolean values | `True, False` |
| `datetime64` | Dates and times | `2020-01-01` |

Getting the dtypes right matters — numeric operations like .mean() won’t work on object columns, and storage is more efficient with proper types.
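For example, a numeric column that was read in as strings must be converted before .mean() will work. A minimal sketch with made-up values:

```python
import pandas as pd

# 'price' holds numeric strings, so pandas infers the object dtype
df = pd.DataFrame({'price': ['1.5', '2.5', '4.0']})
print(df['price'].dtype)          # object

# Convert to a proper numeric dtype, then numeric operations work
df['price'] = df['price'].astype('float64')
print(df['price'].dtype)          # float64
print(df['price'].mean())
```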

| Type Concept | Comes From | Used For | Returns / Represents |
|---|---|---|---|
| `int, float, str` | Python | Built-in scalar data types | A Python object (e.g., `5`, `3.14`, `"hello"`) |
| `int64, float64` | NumPy | Fixed-width numeric storage in arrays | A NumPy scalar type (e.g., `np.int64`, `np.float64`) |
| `dtype` | NumPy | Describing the data type of a single array/Series | A NumPy dtype object (e.g., `dtype('int64')`) |
| `dtypes` | pandas | Viewing all column data types in a DataFrame | A pandas Series of NumPy dtype objects |

To get the dtypes of the salary DataFrame, call df_sal.dtypes; pandas returns a Series of column dtypes.

df_sal
         Salary  Bonus  Years  Rating
Alice        75    8.5      3     4.2
Bob          82   12.3      5     4.7
Charlie      68    5.2      2     3.8
Diana        91   15.8      7     4.9
Eve          55    3.1      1     3.5
df_sal.dtypes
Salary      int64
Bonus     float64
Years       int64
Rating    float64
dtype: object

When you select a column (a Series) from a DataFrame, the Series' dtype is shown at the end of the displayed output. You can also access it directly via the .dtype attribute.

df_sal['Salary']     ### Use bracket notation to select a single column
Alice      75
Bob        82
Charlie    68
Diana      91
Eve        55
Name: Salary, dtype: int64
df_sal['Salary'].dtype   ### Get the data type of the 'Salary' column (SERIES)
dtype('int64')

4.3.2.2. Shape and Size#

Understanding the dimensions of a DataFrame is essential when working with real-world datasets. The .shape, .size, and len() properties give you a quick overview of how much data you’re working with.

| Item | Type | Parentheses? | Returns | Meaning |
|---|---|---|---|---|
| `.shape` | df property (attribute) | No | Tuple `(rows, columns)` | Dimensions of the DataFrame |
| `.size` | df property (attribute) | No | Integer | Total number of elements (rows × columns) |
| `len(df)` | Python built-in function | Yes | Integer | Number of rows |
| `.head()` | df method | Yes | DataFrame | First 5 rows (by default) |

# Get the dimensions of the DataFrame (rows, columns)
print(df_sample.shape)

# Get the total number of elements (rows * columns)
print(df_sample.size)

# Get the number of rows
print(len(df_sample))
(165, 4)
660
165

4.3.2.3. Columns and Index#

Every DataFrame has two key structural attributes: .columns, which holds the column labels, and .index, which labels the rows. Knowing how to access and work with these is fundamental to selecting and manipulating data.

# Get all column names
df_sample.columns
Index(['Duration', 'Pulse', 'Maxpulse', 'Calories'], dtype='object')
# Get the index (row labels)
df_sample.index
RangeIndex(start=0, stop=165, step=1)

4.3.2.4. Statistical Summary#

The .info() method gives you a concise overview of a DataFrame’s structure — including column names, data types, and non-null counts. The .describe() method then computes summary statistics (count, mean, min, max, etc.) for each column. Together, these are usually the first two things you run when exploring a new dataset.

4.3.2.4.1. The info() Method#

Now that we have the DataFrame object, let’s use df.info() to get the:

  1. data type

  2. number of values in the column

  3. Null values

# df_sample.drop(df_sample.index[df_sample.index > 169], inplace=True)
df_sample.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 165 entries, 0 to 164
Data columns (total 4 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Duration  165 non-null    int64  
 1   Pulse     165 non-null    int64  
 2   Maxpulse  165 non-null    int64  
 3   Calories  160 non-null    float64
dtypes: float64(1), int64(3)
memory usage: 5.3 KB

4.3.2.4.2. The describe() Method#

.describe() generates summary statistics for all numeric columns in one shot. It returns:

| Statistic | Description |
|---|---|
| `count` | Number of non-null values |
| `mean` | Average value |
| `std` | Standard deviation (spread of values) |
| `min` / `max` | Smallest and largest values |
| `25%, 50%, 75%` | Quartiles — the value below which 25%, 50%, and 75% of the data falls |

The 50% row is the median. If mean and median differ significantly, the distribution is likely skewed.
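A quick illustration of that point, with made-up numbers: a single outlier pulls the mean far from the median.

```python
import pandas as pd

s = pd.Series([10, 11, 12, 13, 500])   # one extreme outlier

print(s.mean())     # 109.2, pulled up by the outlier
print(s.median())   # 12.0, robust to the outlier
```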

df_sample.describe()   # Get summary statistics for numeric columns
         Duration       Pulse    Maxpulse     Calories
count  165.000000  165.000000  165.000000   160.000000
mean    63.757576  107.218182  133.721212   377.311250
std     42.792772   14.574063   16.508899   269.526378
min     15.000000   80.000000  100.000000    50.300000
25%     45.000000  100.000000  124.000000   250.625000
50%     60.000000  105.000000  131.000000   318.600000
75%     60.000000  111.000000  140.000000   388.200000
max    300.000000  159.000000  184.000000  1860.400000

For this exercise, either

  • download actors.csv and place it in your project data folder, or

  • just use pd.read_csv("../../data/actors.csv") in the Live Code cell to access the actors.csv dataset.

### EXERCISE: Exploring DataFrame Attributes
#
# Using the actors.csv file from the data directory:
# 1. Read the CSV file into a DataFrame called df_actors
# 2. Find the number of rows and columns (shape)
# 3. Check the data types of each column
# 4. Get a statistical summary of the DataFrame
# 5. Display the index values
# 6. Calculate the total number of elements
#
### Your code starts here:





### Your code ends here.


# Solution

# 1. Read the CSV file
df_actors = pd.read_csv('../../data/actors.csv')

# 2. Number of rows and columns
print("Shape (rows, columns):", df_actors.shape)
print()

# 3. Data types
print("Data types:")
print(df_actors.dtypes)
print()

# 4. Statistical summary
print("Statistical summary:")
print(df_actors.describe())
print()

# 5. Index values
print("Index:", df_actors.index.tolist())
print()

# 6. Total elements
print("Total elements:", df_actors.size)
Shape (rows, columns): (50, 6)

Data types:
Actor                 object
Total Gross          float64
Number of Movies       int64
Average per Movie    float64
#1 Movie              object
Gross                float64
dtype: object

Statistical summary:
       Total Gross  Number of Movies  Average per Movie       Gross
count    50.000000         50.000000          50.000000   50.000000
mean   3034.088000         37.880000          93.518000  428.690000
std     591.528271         14.122843          58.509175  184.344412
min    2416.500000          7.000000          39.000000  169.700000
25%    2612.425000         27.500000          67.550000  279.300000
50%    2824.500000         37.000000          79.050000  419.150000
75%    3185.800000         43.750000         100.125000  519.800000
max    4871.700000         79.000000         451.800000  936.700000

Index: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49]

Total elements: 300

4.3.3. DataFrame Selection and Indexing#

Selection and indexing are fundamental operations in pandas that allow you to extract specific portions of your data. Understanding these operations is crucial for effective data analysis and manipulation.

Key Concepts

Selection refers to extracting data from a DataFrame—whether that’s entire columns, specific rows, or individual cells. Pandas provides several methods for selection, each with its own use cases:

  1. Bracket notation []: The simplest method for selecting columns

    • df['column'] → selects a single column (returns a Series)

    • df[['col1', 'col2']] → selects multiple columns (returns a DataFrame)

    • Note: Use double brackets [[]] for multiple columns to get a DataFrame

  2. .loc[]: Label-based indexing

    • Uses row and column labels (names)

    • Syntax: df.loc[row_label, column_label]

    • Supports slicing: df.loc['A':'C', 'col1':'col3'] (inclusive on both ends)

    • Ideal when you know the index names

  3. .iloc[]: Integer position-based indexing

    • Uses row and column positions (integers, starting from 0)

    • Syntax: df.iloc[row_position, column_position]

    • Supports slicing: df.iloc[0:3, 0:2] (exclusive on the end, like Python lists)

    • Ideal when you need to select by position

  4. .at[] / .iat[]: Fast access to a single value

    • .at[row_label, col_label] — label-based, faster than .loc[]

    • .iat[row_pos, col_pos] — position-based, faster than .iloc[]

    • Best used when reading or writing one cell, especially inside loops

When to use each method:

  • Use bracket notation for quick column selection

  • Use .loc[] when working with labeled indices (most common in data analysis)

  • Use .iloc[] when you need position-based selection (e.g., “first 5 rows”)

  • Use .at[] / .iat[] when accessing a single cell — especially inside loops where performance matters

4.3.3.1. Common Patterns#

# Select a single column
df['Name']

# Select multiple columns
df[['Name', 'Age', 'City']]

# Select specific rows and columns by label
df.loc['Alice', 'Age']
df.loc[['Alice', 'Bob'], ['Name', 'Age']]

# Select by position
df.iloc[0, 1]        # First row, second column
df.iloc[0:3, :]      # First 3 rows, all columns
df.iloc[:, 0:2]      # All rows, first 2 columns

# Boolean indexing
df[df['Age'] > 25]   # Filter rows based on condition

Let’s practice these concepts with some exercises.

df_sal
         Salary  Bonus  Years  Rating
Alice        75    8.5      3     4.2
Bob          82   12.3      5     4.7
Charlie      68    5.2      2     3.8
Diana        91   15.8      7     4.9
Eve          55    3.1      1     3.5

4.3.3.2. Selecting Columns#

Pass a list of column names to select multiple columns. The result is a DataFrame.

df_sal['Salary']
Alice      75
Bob        82
Charlie    68
Diana      91
Eve        55
Name: Salary, dtype: int64
### Pass a list of column names

df_sal[['Salary','Rating']]
         Salary  Rating
Alice        75     4.2
Bob          82     4.7
Charlie      68     3.8
Diana        91     4.9
Eve          55     3.5

4.3.3.3. Using .loc[ ] - Label-based indexing#

ser = df_sal.loc['Alice']
print(type(ser), "\n")
ser
<class 'pandas.core.series.Series'> 
Salary    75.0
Bonus      8.5
Years      3.0
Rating     4.2
Name: Alice, dtype: float64
df_sal.loc[['Alice', 'Bob']]    ### pass a list
       Salary  Bonus  Years  Rating
Alice      75    8.5      3     4.2
Bob        82   12.3      5     4.7
### Get Salary and Rating for Alice and Bob
df_sal.loc[['Alice','Bob'],['Salary','Rating']]
       Salary  Rating
Alice      75     4.2
Bob        82     4.7

4.3.3.4. Using .iloc[ ] - Position-based indexing#

### Get the third row (index position 2) - Charlie's data
df_sal.iloc[2]
Salary    68.0
Bonus      5.2
Years      2.0
Rating     3.8
Name: Charlie, dtype: float64
### Get first two rows (positions 0 and 1)
df_sal.iloc[0:2]
       Salary  Bonus  Years  Rating
Alice      75    8.5      3     4.2
Bob        82   12.3      5     4.7
### Get value at row position 0, column position 0 (Alice's Salary)
df_sal.iloc[0, 0]
np.int64(75)
### Get first two rows and first two columns
df_sal.iloc[0:2, 0:2]
       Salary  Bonus
Alice      75    8.5
Bob        82   12.3
### Get all rows, but only columns at positions 0 and 2 (Salary and Years)
df_sal.iloc[:, [0, 2]]
         Salary  Years
Alice        75      3
Bob          82      5
Charlie      68      2
Diana        91      7
Eve          55      1

4.3.3.5. .at[] and .iat[] — Fast Single-Value Access#

When you only need to read or write a single cell, .at[] and .iat[] are faster than .loc[] and .iloc[] because they are optimized for scalar access.

| Accessor | Based on | Example |
|---|---|---|
| `.at[label, col]` | Row/column labels | `df.at['Alice', 'Salary']` |
| `.iat[row, col]` | Integer positions | `df.iat[0, 0]` |

Use these when performance matters in a loop or when you just need one value.

# Using .at[] for fast label-based access to a single value
df_sal.at['Alice', 'Salary']
np.int64(75)
# Performance comparison: .at[] is faster than .loc[] for single values
# .loc[] works, but .at[] is optimized for this use case
print("Using .loc[]:", df_sal.loc['Bob', 'Rating'])
print("Using .at[]:", df_sal.at['Bob', 'Rating'])
Using .loc[]: 4.7
Using .at[]: 4.7
# .iat[] uses integer positions (row index, column index) — both are 0-based
print("Using .iloc[]:", df_sal.iloc[1, 3])   # row 1, col 3
print("Using .iat[]:", df_sal.iat[1, 3])     # same cell, faster for single value
Using .iloc[]: 4.7
Using .iat[]: 4.7

4.3.3.6. Boolean Indexing#

Boolean indexing lets you filter rows based on a condition. You pass a boolean expression inside [], and pandas returns only the rows where the condition is True.

  • A single condition: df[df['Salary'] > 70]

  • Multiple conditions: combine with & (and) or | (or), and wrap each condition in parentheses

  • Best practice: use .loc[] when you also want to select specific columns alongside the filter

  • Use .query() for a cleaner string-based syntax: df.query('Salary > 70 and Years > 2')

df_sal
         Salary  Bonus  Years  Rating
Alice        75    8.5      3     4.2
Bob          82   12.3      5     4.7
Charlie      68    5.2      2     3.8
Diana        91   15.8      7     4.9
Eve          55    3.1      1     3.5
df_sal[df_sal["Salary"]>70]
       Salary  Bonus  Years  Rating
Alice      75    8.5      3     4.2
Bob        82   12.3      5     4.7
Diana      91   15.8      7     4.9
### Get Years for employees with Salary > 70
df_sal[df_sal['Salary']>70]['Years']
Alice    3
Bob      5
Diana    7
Name: Years, dtype: int64
### BETTER PRACTICE: Use .loc[] for row and column selection together
df_sal.loc[df_sal['Salary']>70, ['Years','Bonus']]
       Years  Bonus
Alice      3    8.5
Bob        5   12.3
Diana      7   15.8
### Combine conditions with & (and) or | (or)
### Each condition must be in parentheses

df_sal.loc[(df_sal['Salary']>70) & (df_sal['Years'] > 2)]
       Salary  Bonus  Years  Rating
Alice      75    8.5      3     4.2
Bob        82   12.3      5     4.7
Diana      91   15.8      7     4.9

.query() is an alternative to boolean indexing that uses a cleaner string syntax — no need for df[...] wrappers or parentheses around each condition. It’s especially readable for multi-condition filters.

# Same result as above, but more readable
df_sal.query('Salary > 70 and Years > 2')
       Salary  Bonus  Years  Rating
Alice      75    8.5      3     4.2
Bob        82   12.3      5     4.7
Diana      91   15.8      7     4.9
### EXERCISE: DataFrame Selection and Indexing
#
# Using df_sal (the employee salary DataFrame defined earlier):
#   Index: Alice, Bob, Charlie, Diana, Eve
#   Columns: Salary, Bonus, Years, Rating
#
# Tasks:
# 1. Select only the 'Salary' and 'Years' columns (use double brackets)
# 2. Use .loc[] to get the 'Bonus' and 'Rating' for 'Bob' and 'Diana'
# 3. Use .iloc[] to get the first 3 rows and first 2 columns
# 4. Use boolean indexing to filter employees with Rating >= 4.5
# 5. Use .query() to find employees with Years >= 3 and Bonus > 8
#
### Your code starts here:




### Your code ends here.


# Solution

# 1. Select Salary and Years columns
print("Salary and Years:")
print(df_sal[['Salary', 'Years']])
print()

# 2. Bonus and Rating for Bob and Diana
print("Bonus and Rating for Bob and Diana:")
print(df_sal.loc[['Bob', 'Diana'], ['Bonus', 'Rating']])
print()

# 3. First 3 rows and first 2 columns using .iloc[]
print("First 3 rows, first 2 columns:")
print(df_sal.iloc[0:3, 0:2])
print()

# 4. Use boolean indexing to filter employees with Rating >= 4.5
print("Employees with Rating >= 4.5:")
print(df_sal[df_sal['Rating'] >= 4.5])
print()

# 5. Use .query() to find employees with Years >= 3 and Bonus > 8
print("Employees with Years >= 3 and Bonus > 8:")
print(df_sal.query('Years >= 3 and Bonus > 8'))
Salary and Years:
         Salary  Years
Alice        75      3
Bob          82      5
Charlie      68      2
Diana        91      7
Eve          55      1

Bonus and Rating for Bob and Diana:
       Bonus  Rating
Bob     12.3     4.7
Diana   15.8     4.9

First 3 rows, first 2 columns:
         Salary  Bonus
Alice        75    8.5
Bob          82   12.3
Charlie      68    5.2

Employees with Rating >= 4.5:
       Salary  Bonus  Years  Rating
Bob        82   12.3      5     4.7
Diana      91   15.8      7     4.9

Employees with Years >= 3 and Bonus > 8:
       Salary  Bonus  Years  Rating
Alice      75    8.5      3     4.2
Bob        82   12.3      5     4.7
Diana      91   15.8      7     4.9

4.3.4. Modifying Rows/Columns#

4.3.4.1. Adding/Dropping columns#

To drop a column, use the .drop() method. Note that you have to specify axis=1 to drop a column label.

Understanding axis parameter:

  • axis=0 → operates on rows (index) - “go down the rows”

  • axis=1 → operates on columns - “go across the columns”

Mnemonic: Think of axis as the direction you’re moving. axis=0 goes down (rows), axis=1 goes across (columns).
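The same mnemonic applies to aggregation methods like .sum(). A small sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [10, 20]})

print(df.sum(axis=0))   # go down the rows: one total per column (a=3, b=30)
print(df.sum(axis=1))   # go across the columns: one total per row (11, 22)
```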

### Create a new column: Total Compensation
df_sal['TotalComp'] = df_sal['Salary'] + df_sal['Bonus']
# Column assignment modifies df_sal in place (unlike most DataFrame methods, which return a new object by default)
df_sal
         Salary  Bonus  Years  Rating  TotalComp
Alice        75    8.5      3     4.2       83.5
Bob          82   12.3      5     4.7       94.3
Charlie      68    5.2      2     3.8       73.2
Diana        91   15.8      7     4.9      106.8
Eve          55    3.1      1     3.5       58.1
df_sal.drop('TotalComp',axis=1,inplace=True)

df_sal
         Salary  Bonus  Years  Rating
Alice        75    8.5      3     4.2
Bob          82   12.3      5     4.7
Charlie      68    5.2      2     3.8
Diana        91   15.8      7     4.9
Eve          55    3.1      1     3.5

4.3.4.2. Dropping rows#

To drop rows, use the .drop() method with axis=0 (or no axis parameter since axis=0 is the default).

You can drop rows by specifying their index labels.

# Drop a single row by index label (returns a new DataFrame)
df_sal.drop('Charlie')
       Salary  Bonus  Years  Rating
Alice      75    8.5      3     4.2
Bob        82   12.3      5     4.7
Diana      91   15.8      7     4.9
Eve        55    3.1      1     3.5
# Drop multiple rows (pass a list of index labels)
df_sal.drop(['Alice', 'Eve'], axis=0)
         Salary  Bonus  Years  Rating
Bob          82   12.3      5     4.7
Charlie      68    5.2      2     3.8
Diana        91   15.8      7     4.9
# Use inplace=True to modify the original DataFrame
# df_sal.drop('Bob', axis=0, inplace=True)

# Note: Commented out to preserve df_sal for other examples
### EXERCISE: DataFrame Indexing Practice
#
# Given the following sales data:
sales_data = pd.DataFrame({
    'Product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Webcam'],
    'Q1': [50, 200, 150, 75, 90],
    'Q2': [65, 210, 160, 80, 95],
    'Q3': [70, 190, 155, 85, 100],
    'Q4': [80, 220, 170, 90, 110]
}, index=['P1', 'P2', 'P3', 'P4', 'P5'])
#
# Tasks:
# 1. Select the Q2 and Q4 columns
# 2. Select rows P2 and P4
# 3. Select Q3 sales for rows P1, P3, and P5 (should return a Series)
# 4. Add a new column 'Total' that sums Q1 through Q4 (Tip: use .sum(axis=1))
# 5. Drop the Q1 column (without modifying the original DataFrame)
#
# Hint: Use double brackets [[]] for selecting multiple columns
#
### Your code starts here:





### Your code ends here.


# Solution

# 1. Select the Q2 and Q4 columns
print("Q2 and Q4 columns:")
print(sales_data[['Q2', 'Q4']])
print()

# 2. Select rows P2 and P4
print("Rows P2 and P4:")
print(sales_data.loc[['P2', 'P4']])
print()

# 3. Select Q3 sales for rows P1, P3, and P5
print("Q3 sales for P1, P3, P5:")
print(sales_data.loc[['P1', 'P3', 'P5'], 'Q3'])
print()

# 4. Add a new column 'Total' that sums Q1 through Q4
sales_data_with_total = sales_data.copy()
sales_data_with_total['Total'] = sales_data_with_total[['Q1', 'Q2', 'Q3', 'Q4']].sum(axis=1)
print("With Total column:")
print(sales_data_with_total)
print()

# 5. Drop the Q1 column
print("Without Q1 column:")
print(sales_data.drop('Q1', axis=1))
Q2 and Q4 columns:
     Q2   Q4
P1   65   80
P2  210  220
P3  160  170
P4   80   90
P5   95  110

Rows P2 and P4:
    Product   Q1   Q2   Q3   Q4
P2    Mouse  200  210  190  220
P4  Monitor   75   80   85   90

Q3 sales for P1, P3, P5:
P1     70
P3    155
P5    100
Name: Q3, dtype: int64

With Total column:
     Product   Q1   Q2   Q3   Q4  Total
P1    Laptop   50   65   70   80    265
P2     Mouse  200  210  190  220    820
P3  Keyboard  150  160  155  170    635
P4   Monitor   75   80   85   90    330
P5    Webcam   90   95  100  110    395

Without Q1 column:
     Product   Q2   Q3   Q4
P1    Laptop   65   70   80
P2     Mouse  210  190  220
P3  Keyboard  160  155  170
P4   Monitor   80   85   90
P5    Webcam   95  100  110

4.3.5. Applying Functions with .apply()#

The .apply() method is a powerful tool that allows you to apply a function along an axis of the DataFrame (rows or columns). This is useful for transformations, calculations, and custom operations.

The syntax for .apply() is:

DataFrame.apply(func, axis=0, raw=False, result_type=None, args=(), **kwargs)

Parameters:

  • func: Function to apply to each column or row

  • axis: {0 or ‘index’, 1 or ‘columns’}, default 0

    • 0 or ‘index’: apply function to each column

    • 1 or ‘columns’: apply function to each row

  • raw: bool, default False. If False, passes each row/column as a Series to the function. If True, passes ndarray

  • result_type: {‘expand’, ‘reduce’, ‘broadcast’, None}, default None

  • args: tuple, positional arguments to pass to func

  • **kwargs: additional keyword arguments to pass to func

Returns: Series or DataFrame

Performance Note: For simple operations like arithmetic or comparisons, vectorized operations are usually faster than .apply(). Use .apply() when you need to apply a complex function that can’t be vectorized.

# Vectorized (faster): df['A'] * 2
# Using apply (slower): df['A'].apply(lambda x: x * 2)
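A rough way to see the difference yourself (exact timings vary by machine) is to time both forms with the standard-library timeit module:

```python
import timeit

import pandas as pd

s = pd.Series(range(100_000))

# Time the vectorized multiply vs. an equivalent .apply() with a lambda
vec = timeit.timeit(lambda: s * 2, number=20)
app = timeit.timeit(lambda: s.apply(lambda x: x * 2), number=20)

print(f"vectorized: {vec:.4f}s  apply: {app:.4f}s")
```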
df_dates
             A   B   C   D   E   F
2020-01-01  18  79  68  49  48  87
2020-01-02  17  72  28  18  57  97
2020-01-03  76  78  74  80  56  21
2020-01-04  85  50  55  43  26  93
# Apply a function to each column (axis=0, default)

print(df_dates.apply(np.sum))
print()
print(type(df_dates.apply(np.sum)))
A    196
B    279
C    225
D    190
E    187
F    298
dtype: int64

<class 'pandas.core.series.Series'>
df_dates.apply(np.sum, axis=1)   ### axis = 1
2020-01-01    349
2020-01-02    289
2020-01-03    385
2020-01-04    352
Freq: D, dtype: int64
# Apply a custom function (double each value)
df_dates.apply(lambda x: x * 2)
              A    B    C    D    E    F
2020-01-01   36  158  136   98   96  174
2020-01-02   34  144   56   36  114  194
2020-01-03  152  156  148  160  112   42
2020-01-04  170  100  110   86   52  186
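The `result_type` parameter listed above controls what happens when the applied function returns a list-like per row. With `result_type='expand'`, the list elements become separate columns. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 4], 'B': [3, 2]})

# result_type='expand' turns the list returned for each row into columns
stats = df.apply(lambda row: [row.min(), row.max()], axis=1, result_type='expand')
stats.columns = ['min', 'max']
print(stats)
#    min  max
# 0    1    3
# 1    2    4
```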
### EXERCISE: Applying Functions
#
# Create a DataFrame:
apply_df = pd.DataFrame({
    'A': [1, 2, 3, 4],
    'B': [10, 20, 30, 40],
    'C': [100, 200, 300, 400]
})
#
# Tasks:
# 1. Apply a function to get the sum of each column
# 2. Apply a function to get the maximum value in each row
# 3. Apply a lambda function to double all values in column A
# 4. Apply a custom function that adds 5 to all values
#
### Your code starts here:





### Your code ends here.


# Solution

# 1. Apply a function to get the sum of each column
print("Sum of each column:")
print(apply_df.apply(np.sum))
print()

# 2. Apply a function to get the maximum value in each row
print("Maximum value in each row:")
print(apply_df.apply(np.max, axis=1))
print()

# 3. Apply a lambda function to double all values in column A
print("Column A doubled:")
print(apply_df['A'].apply(lambda x: x * 2))
print()

# 4. Apply a custom function that adds 5 to all values
def add_five(x):
    return x + 5

print("All values plus 5:")
print(apply_df.apply(add_five))
Sum of each column:
A      10
B     100
C    1000
dtype: int64

Maximum value in each row:
0    100
1    200
2    300
3    400
dtype: int64

Column A doubled:
0    2
1    4
2    6
3    8
Name: A, dtype: int64

All values plus 5:
   A   B    C
0  6  15  105
1  7  25  205
2  8  35  305
3  9  45  405

4.3.6. Advanced Indexing#

Indices are a powerful feature in Pandas that distinguish it from simple arrays. This section covers advanced index operations including basic manipulation and hierarchical indexing.

df_dates
             A   B   C   D   E   F
2020-01-01  18  79  68  49  48  87
2020-01-02  17  72  28  18  57  97
2020-01-03  76  78  74  80  56  21
2020-01-04  85  50  55  43  26  93

4.3.6.1. Changing Index#

You can directly assign a new index to a DataFrame by setting the .index attribute. The new index must have the same length as the number of rows.

newind = 'MO MI MS MN'.split()
newind
['MO', 'MI', 'MS', 'MN']
df_dates.index = newind
df_dates
     A   B   C   D   E   F
MO  18  79  68  49  48  87
MI  17  72  28  18  57  97
MS  76  78  74  80  56  21
MN  85  50  55  43  26  93

4.3.6.1.1. Resetting and Setting Index#

Besides direct assignment, Pandas provides two powerful methods for manipulating DataFrame indices:

.reset_index(): Converts the current index back to a regular column and creates a new default integer index (0, 1, 2, …).

  • drop=False (default): Keeps the old index as a new column

  • drop=True: Discards the old index completely

  • inplace=False (default): Returns a new DataFrame

  • inplace=True: Modifies the DataFrame in place

.set_index(): Converts a column into the index.

  • Takes a column name (or list of column names for MultiIndex)

  • The specified column(s) become the new index

  • drop=True (default): Removes the column from DataFrame after setting it as index

  • inplace=False (default): Returns a new DataFrame

These methods are particularly useful for:

  • Preparing data for merging operations

  • Converting between wide and long formats

  • Creating hierarchical indices

  • Organizing data for time series analysis

# Example: Create a DataFrame with custom index
sample_df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Score': [85, 92, 78],
    'Grade': ['B', 'A', 'C']
}, index=[100, 200, 300])

print("Original DataFrame:")
print(sample_df)
print()
Original DataFrame:
        Name  Score Grade
100    Alice     85     B
200      Bob     92     A
300  Charlie     78     C
# reset_index() with drop=False: old index becomes a column named 'index'
reset_df = sample_df.reset_index()
print("After reset_index() with drop=False:")
print(reset_df)
print()
After reset_index() with drop=False:
   index     Name  Score Grade
0    100    Alice     85     B
1    200      Bob     92     A
2    300  Charlie     78     C
# reset_index() with drop=True: old index is discarded
reset_df_dropped = sample_df.reset_index(drop=True)
print("After reset_index() with drop=True:")
print(reset_df_dropped)
print()
After reset_index() with drop=True:
      Name  Score Grade
0    Alice     85     B
1      Bob     92     A
2  Charlie     78     C
# set_index(): Make 'Name' column the new index
name_index_df = sample_df.set_index('Name')
print("After set_index('Name'):")
print(name_index_df)
print()
print("Notice 'Name' is now the index, not a column")
After set_index('Name'):
         Score Grade
Name                
Alice       85     B
Bob         92     A
Charlie     78     C

Notice 'Name' is now the index, not a column
# Chain reset_index() and set_index() together
# First reset to default index, then set 'Grade' as new index
chained_df = sample_df.reset_index(drop=True).set_index('Grade')
print("After chaining reset_index() and set_index('Grade'):")
print(chained_df)
After chaining reset_index() and set_index('Grade'):
          Name  Score
Grade                
B        Alice     85
A          Bob     92
C      Charlie     78
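The `inplace=True` option mentioned above modifies the DataFrame directly instead of returning a new one (the method then returns `None`). A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Score': [85, 92]},
                  index=[100, 200])

# inplace=True mutates df itself; no assignment needed
df.reset_index(drop=True, inplace=True)
print(df.index.tolist())
# [0, 1]
```

Note that chaining (as in the cell above) only works with the default `inplace=False`, since `inplace=True` returns `None`.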
### EXERCISE: Index Manipulation
#
# Create a DataFrame:
idx_df = pd.DataFrame({
    'Value': [10, 20, 30, 40],
    'Category': ['A', 'B', 'A', 'C']
}, index=[100, 200, 300, 400])
#
# Tasks:
# 1. Display the original DataFrame
# 2. Reset the index (drop the old index)
# 3. Set 'Category' as the new index
# 4. Display the final DataFrame
#
### Your code starts here:





### Your code ends here.


# Solution

idx_df = pd.DataFrame({
    'Value': [10, 20, 30, 40],
    'Category': ['A', 'B', 'A', 'C']
}, index=[100, 200, 300, 400])

# 1. Display the DataFrame
print("Original DataFrame:")
print(idx_df)
print()

# 2. Reset index
idx_df = idx_df.reset_index(drop=True)
print("After reset_index:")
print(idx_df)
print()

# 3. Set Category as index
idx_df = idx_df.set_index('Category')

# 4. Display final DataFrame
print("With Category as index:")
print(idx_df)
Original DataFrame:
     Value Category
100     10        A
200     20        B
300     30        A
400     40        C

After reset_index:
   Value Category
0     10        A
1     20        B
2     30        A
3     40        C

With Category as index:
          Value
Category       
A            10
B            20
A            30
C            40

4.3.6.2. Multi-level index#

A MultiIndex (hierarchical index) gives a DataFrame multiple index levels, so related rows can be grouped and sliced by level. One common constructor is:

  • pd.MultiIndex.from_arrays(): builds the index from parallel lists of labels, one list per level

# Create a hierarchical index using MultiIndex.from_arrays()
# The two parallel lists supply the outer (Group) and inner (Num) level labels
outside = ['G1', 'G1', 'G1', 'G2', 'G2', 'G2']
inside = [1, 2, 3, 1, 2, 3]
hier_index = pd.MultiIndex.from_arrays([outside, inside], names=['Group', 'Num'])
hier_index
MultiIndex([('G1', 1),
            ('G1', 2),
            ('G1', 3),
            ('G2', 1),
            ('G2', 2),
            ('G2', 3)],
           names=['Group', 'Num'])
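When the index is a full Cartesian product of its levels, as here, `pd.MultiIndex.from_product()` builds the same index more compactly:

```python
import pandas as pd

# Every combination of ['G1', 'G2'] with [1, 2, 3] -- same six pairs as above
prod_index = pd.MultiIndex.from_product([['G1', 'G2'], [1, 2, 3]],
                                        names=['Group', 'Num'])
print(prod_index.tolist())
# [('G1', 1), ('G1', 2), ('G1', 3), ('G2', 1), ('G2', 2), ('G2', 3)]
```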
df_multi = pd.DataFrame(data = np.random.randn(6,2), index=hier_index, columns=['A','B'])
df_multi
                  A         B
Group Num
G1    1   -0.568813  0.109624
      2   -0.538005 -0.262710
      3   -1.862577 -1.893508
G2    1    0.307188 -0.372445
      2   -0.352839 -0.453912
      3   -0.180638  0.019537
df_multi.loc['G1']
             A         B
Num
1    -0.568813  0.109624
2    -0.538005 -0.262710
3    -1.862577 -1.893508
df_multi.loc['G2'].loc[2]['B']      ### the 2 here is the 'Num' label, not a positional index
np.float64(-0.45391226176240046)
df_multi.index.names
FrozenList(['Group', 'Num'])
df_multi
                  A         B
Group Num
G1    1   -0.568813  0.109624
      2   -0.538005 -0.262710
      3   -1.862577 -1.893508
G2    1    0.307188 -0.372445
      2   -0.352839 -0.453912
      3   -0.180638  0.019537

The .xs() method (short for cross-section) is a convenient way to select rows or columns from a Pandas DataFrame, especially when working with MultiIndex objects. It works like a shortcut for slicing.

df_multi.xs(1, level='Num')
              A         B
Group
G1    -0.568813  0.109624
G2     0.307188 -0.372445
# Select all rows where Num = 1 (across all groups)
print(df_multi.xs(1, level='Num'))
              A         B
Group                    
G1    -0.568813  0.109624
G2     0.307188 -0.372445
print(df_multi.xs(2, level='Num'))
              A         B
Group                    
G1    -0.538005 -0.262710
G2    -0.352839 -0.453912
# Select all rows where Group = 'G1'
df_multi.xs('G1', level='Group')
             A         B
Num
1    -0.568813  0.109624
2    -0.538005 -0.262710
3    -1.862577 -1.893508
# Select a specific (Group, Num) combination
df_multi.xs(('G1', 1))
A   -0.568813
B    0.109624
Name: (G1, 1), dtype: float64
# Get Num=1 rows, only column A
df_multi.xs(1, level='Num')['A']
Group
G1   -0.568813
G2    0.307188
Name: A, dtype: float64
# Get Num=1 rows, both A and B
df_multi.xs(1, level='Num')[['A', 'B']]
              A         B
Group
G1    -0.568813  0.109624
G2     0.307188 -0.372445
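By default, `.xs()` drops the level you select on, which is why `Num` disappears from the results above. Passing `drop_level=False` keeps it. A minimal sketch (using small integer data rather than the random values above):

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_product([['G1', 'G2'], [1, 2, 3]],
                                 names=['Group', 'Num'])
df_demo = pd.DataFrame(np.arange(12).reshape(6, 2),
                       index=idx, columns=['A', 'B'])

# drop_level=False preserves the 'Num' level in the result's index
kept = df_demo.xs(1, level='Num', drop_level=False)
print(kept.index.names)
# ['Group', 'Num']
```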
### EXERCISE: Working with MultiIndex
#
# Create a MultiIndex DataFrame:
regions = ['North', 'North', 'North', 'South', 'South', 'South']
quarters = ['Q1', 'Q2', 'Q3', 'Q1', 'Q2', 'Q3']
multi_idx = pd.MultiIndex.from_arrays([regions, quarters], names=['Region', 'Quarter'])
sales_multi = pd.DataFrame({
    'Revenue': [100, 150, 120, 80, 90, 110],
    'Costs': [60, 70, 65, 50, 55, 60]
}, index=multi_idx)
#
# Tasks:
# 1. Display the DataFrame
# 2. Select all data for the 'North' region using .loc[]
# 3. Select all Q1 data across both regions using .xs()
# 4. Get the Revenue for South region, Q2
#
### Your code starts here:





### Your code ends here.


# Solution
# 1. Display the DataFrame
print("MultiIndex DataFrame:")
print(sales_multi)
print()     
# 2. Select all data for the 'North' region using .loc[]
print("Data for North region:")
print(sales_multi.loc['North'])
print()
# 3. Select all Q1 data across both regions using .xs()
print("Q1 data across all regions:")
print(sales_multi.xs('Q1', level='Quarter'))
print()
# 4. Get the Revenue for South region, Q2
print("Revenue for South region, Q2:")
print(sales_multi.loc[('South', 'Q2'), 'Revenue'])
MultiIndex DataFrame:
                Revenue  Costs
Region Quarter                
North  Q1           100     60
       Q2           150     70
       Q3           120     65
South  Q1            80     50
       Q2            90     55
       Q3           110     60

Data for North region:
         Revenue  Costs
Quarter                
Q1           100     60
Q2           150     70
Q3           120     65

Q1 data across all regions:
        Revenue  Costs
Region                
North       100     60
South        80     50

Revenue for South region, Q2:
90