4.3. DataFrames#
The DataFrame is the core data structure in Pandas, modeled after similar structures in the R programming language. A DataFrame is a two-dimensional, labeled data structure that can be thought of as:
A collection of Series objects combined into a single table, all sharing the same index
A spreadsheet or SQL table with rows and columns
A dictionary of Series, where each column is a Series with a shared index
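The "dictionary of Series" view can be demonstrated directly. A minimal sketch (the names and values here are made up for illustration):

```python
import pandas as pd

# Two Series that share the same index
ages = pd.Series([20, 22, 21], index=["Alice", "Bob", "Charlie"])
scores = pd.Series([85, 92, 67], index=["Alice", "Bob", "Charlie"])

# A DataFrame is effectively a dictionary of such Series:
# each key becomes a column, and the shared index labels the rows
df = pd.DataFrame({"Age": ages, "Score": scores})
print(df.loc["Bob", "Score"])  # 92
```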
Key Features of DataFrames
Labeled axes: Both rows (index) and columns have labels, making data access intuitive and explicit
Heterogeneous data: Different columns can contain different data types (integers, floats, strings, dates, etc.)
Size-mutable: You can insert and delete columns
Powerful operations: Built-in methods for filtering, grouping, merging, reshaping, and statistical analysis
Alignment of data based on labels
Missing data handling (see 0404-missing-data.ipynb for details)
Rich I/O tools for reading/writing various formats (CSV, Excel, SQL, JSON, etc.)
Why DataFrames?
While NumPy arrays are excellent for numerical computation, DataFrames offer several advantages:
Column names instead of just numeric indices
Mixed data types in different columns
Label-based access and selection of data
Easy addition and removal of columns and rows
Built-in sorting and filtering
Flexible row labels (indices), including MultiIndex
What You’ll Learn
In this notebook, we’ll explore:
Creating DataFrames from various sources
DataFrame attributes and methods
Indexing and selecting data
Sorting and filtering
Working with indices, including MultiIndex
Note:
Missing data handling is covered in detail in notebook 0404-missing-data.ipynb.
Let’s use Pandas to explore how DataFrames work!
import numpy as np
import pandas as pd
4.3.1. Creating DataFrames#
There are several common ways to create a DataFrame in pandas:
| Method | Description |
|---|---|
| `pd.read_csv("file.csv")` | Load data from a CSV file — the most common approach in practice |
| `pd.DataFrame(dict)` | Build a DataFrame from a Python dictionary (keys become column names) |
| `pd.DataFrame(list_of_lists)` | Build from a 2D list, optionally specifying column names |
| `pd.DataFrame(ndarray)` | Build from a NumPy array, useful for numeric/randomly generated data |
In this section we’ll cover the two most common: reading from a file and constructing one from scratch.
4.3.1.1. pd.read_csv( )#
The most common way to read a file is the read_csv() function.
# Read CSV file - forward slashes (/) work on Windows, macOS, and Linux
df_sample = pd.read_csv("../../data/data.csv")
4.3.1.1.1. DataFrame Output and Preview#
To preview a dataframe, we usually use the .head() and .tail() methods.
There are several ways to display a DataFrame, each with different formatting and truncation behavior:
| Method | Description |
|---|---|
| `df` | Displays the DataFrame in table format, with the first and last 5 rows by default |
| `print(df)` | Prints a plain-text representation showing the first and last 5 rows |
| `df.to_string()` | Converts the entire DataFrame to a string without truncation |
| `pd.options.display.max_rows` | Controls how many rows are displayed before truncation |
To display a DataFrame, simply type its name:
df_sample # Default view is table-like
# Pandas will show first 5 and last 5 rows (10 total) by default
# The shape information is displayed at the bottom
| Duration | Pulse | Maxpulse | Calories | |
|---|---|---|---|---|
| 0 | 60 | 110 | 130 | 409.1 |
| 1 | 60 | 117 | 145 | 479.0 |
| 2 | 60 | 103 | 135 | 340.0 |
| 3 | 45 | 109 | 175 | 282.4 |
| 4 | 45 | 117 | 148 | 406.0 |
| ... | ... | ... | ... | ... |
| 160 | 30 | 85 | 120 | 250.4 |
| 161 | 45 | 90 | 130 | 260.4 |
| 162 | 45 | 95 | 130 | 270.0 |
| 163 | 45 | 100 | 140 | 280.9 |
| 164 | 60 | 105 | 140 | 290.8 |
165 rows × 4 columns
Using print() displays the DataFrame in plain text format:
print(df_sample) # print() shows same information but in plain text format
Duration Pulse Maxpulse Calories
0 60 110 130 409.1
1 60 117 145 479.0
2 60 103 135 340.0
3 45 109 175 282.4
4 45 117 148 406.0
.. ... ... ... ...
160 30 85 120 250.4
161 45 90 130 260.4
162 45 95 130 270.0
163 45 100 140 280.9
164 60 105 140 290.8
[165 rows x 4 columns]
.to_string()
.to_string() converts the entire DataFrame to a string without truncation. This is useful when you need the complete text representation.
The example below is commented out because it would print all 165 rows of data.
### The to_string() method is used to display the entire
### DataFrame as a string, without truncation.
### commented out to avoid overwhelming output in case of large DataFrame
# print(df_sample.to_string())
Pandas has a display.max_rows option that defaults to 60 (system-dependent). When a DataFrame has more rows than max_rows, the display is truncated to the first 5 and last 5 rows (10 total). Change the pd.options.display.max_rows setting if you need to see more rows.
print(pd.options.display.max_rows, "\n") ### current max_rows setting (default is 60)
print(df_sample.head(11), "\n") ### 11 is less than max_rows, will print
# print(df_sample.head(100), "\n") ### 100 is more than max_rows, show first 5 + last 5 rows
60
Duration Pulse Maxpulse Calories
0 60 110 130 409.1
1 60 117 145 479.0
2 60 103 135 340.0
3 45 109 175 282.4
4 45 117 148 406.0
5 60 102 127 300.5
6 60 110 136 374.0
7 45 104 134 253.3
8 30 109 133 195.1
9 60 98 124 269.0
10 60 103 147 329.3
When working with large datasets, you often want to see just a sample of the data rather than the entire DataFrame.
# View first 5 rows (default)
df_sample.head()
| Duration | Pulse | Maxpulse | Calories | |
|---|---|---|---|---|
| 0 | 60 | 110 | 130 | 409.1 |
| 1 | 60 | 117 | 145 | 479.0 |
| 2 | 60 | 103 | 135 | 340.0 |
| 3 | 45 | 109 | 175 | 282.4 |
| 4 | 45 | 117 | 148 | 406.0 |
# View last 3 rows
df_sample.tail(3)
| Duration | Pulse | Maxpulse | Calories | |
|---|---|---|---|---|
| 162 | 45 | 95 | 130 | 270.0 |
| 163 | 45 | 100 | 140 | 280.9 |
| 164 | 60 | 105 | 140 | 290.8 |
For this exercise, either
download baby.csv and place it in your project data folder, or
just use pd.read_csv("../../data/baby.csv") in the Live Code cell to access the baby.csv dataset.
### EXERCISE: Reading CSV Files
# Using the data/baby.csv file:
# 1. Read the CSV file into a DataFrame called df_baby
# 2. Display the first 3 rows (use head() with appropriate argument)
# 3. Display the last 2 rows (use tail() with appropriate argument)
### Your code starts here:
### Your code ends here.
First 3 rows:
Birth Weight Gestational Days Maternal Age Maternal Height \
0 120 284 27 62
1 113 282 33 64
2 128 279 28 64
Maternal Pregnancy Weight Maternal Smoker
0 100 False
1 135 False
2 115 True
Last 2 rows:
Birth Weight Gestational Days Maternal Age Maternal Height \
1172 125 281 21 65
1173 117 297 38 65
Maternal Pregnancy Weight Maternal Smoker
1172 110 False
1173 129 False
4.3.1.2. Creating a DataFrame from NumPy Arrays#
So far we’ve been working with df_sample, which was loaded from a CSV file. Now let’s create a brand-new DataFrame from scratch using randomly generated data and a date-based index.
We use pd.date_range() to generate a sequence of consecutive dates, and np.random.default_rng(42) — the modern NumPy random number Generator — to fill the DataFrame with random integers. Passing a seed (42) ensures reproducibility so the same values are produced every time the code is run.
# np.random.seed(42) ### legacy way to set the global random seed; still works, but the Generator API below is recommended in newer versions of NumPy
rng = np.random.default_rng(42)
dates = pd.date_range("2020-01-01", periods=4) ### date_range() generates a sequence of dates starting from "2020-01-01" for 4 periods (4 days)
df_dates = pd.DataFrame(rng.integers(10, 100, size=(4, 6)), index=dates, columns=list("ABCDEF"))
df_dates
| A | B | C | D | E | F | |
|---|---|---|---|---|---|---|
| 2020-01-01 | 18 | 79 | 68 | 49 | 48 | 87 |
| 2020-01-02 | 17 | 72 | 28 | 18 | 57 | 97 |
| 2020-01-03 | 76 | 78 | 74 | 80 | 56 | 21 |
| 2020-01-04 | 85 | 50 | 55 | 43 | 26 | 93 |
4.3.1.3. Creating a DataFrame from a Dictionary#
The most common way to create a DataFrame from scratch is to pass a dictionary where each key becomes a column name and its value is a list of column values. You can also pass a custom index to label the rows.
df_students = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
'Age': [20, 22, 21, 23],
'City': ['NYC', 'LA', 'Chicago', 'Boston'],
'Score': [85, 92, 67, 78]
}, index=['S1', 'S2', 'S3', 'S4'])
df_students
| Name | Age | City | Score | |
|---|---|---|---|---|
| S1 | Alice | 20 | NYC | 85 |
| S2 | Bob | 22 | LA | 92 |
| S3 | Charlie | 21 | Chicago | 67 |
| S4 | Diana | 23 | Boston | 78 |
You can add a new column using a boolean condition.
df_students['Pass'] = df_students['Score'] >= 70
df_students
| Name | Age | City | Score | Pass | |
|---|---|---|---|---|---|
| S1 | Alice | 20 | NYC | 85 | True |
| S2 | Bob | 22 | LA | 92 | True |
| S3 | Charlie | 21 | Chicago | 67 | False |
| S4 | Diana | 23 | Boston | 78 | True |
Now create a DataFrame with employee salary data (in thousands):
df_sal = pd.DataFrame({
'Salary': [75, 82, 68, 91, 55],
'Bonus': [8.5, 12.3, 5.2, 15.8, 3.1],
'Years': [3, 5, 2, 7, 1],
'Rating': [4.2, 4.7, 3.8, 4.9, 3.5]
}, index=['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'])
df_sal
| Salary | Bonus | Years | Rating | |
|---|---|---|---|---|
| Alice | 75 | 8.5 | 3 | 4.2 |
| Bob | 82 | 12.3 | 5 | 4.7 |
| Charlie | 68 | 5.2 | 2 | 3.8 |
| Diana | 91 | 15.8 | 7 | 4.9 |
| Eve | 55 | 3.1 | 1 | 3.5 |
4.3.2. DataFrame Attributes & Methods#
DataFrames have many useful attributes and methods that help us understand the structure and content of our data. Let’s explore the most commonly used ones.
4.3.2.1. The dtypes Attribute#
The .dtypes attribute shows the data type of each column. Pandas infers types automatically when creating or loading a DataFrame. The most common types are:
| dtype | Description | Example values |
|---|---|---|
| `int64` | Integer numbers | `10`, `-3` |
| `float64` | Decimal numbers | `3.14`, `409.1` |
| `object` | Strings (or mixed types) | `'Alice'`, `'NYC'` |
| `bool` | Boolean values | `True`, `False` |
| `datetime64` | Dates and times | `2020-01-01` |
Getting the dtypes right matters — numeric operations like .mean() won’t work on object columns, and storage is more efficient with proper types.
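For example, numbers read in as strings end up with dtype object and break numeric operations; converting them fixes that. A quick sketch (the Price column here is a made-up illustration):

```python
import pandas as pd

# Hypothetical column where numbers were read in as strings
df = pd.DataFrame({"Price": ["10.5", "20.0", "15.25"]})
print(df["Price"].dtype)   # object -- calling .mean() here would fail

# Convert to a numeric dtype with pd.to_numeric() (or .astype(float))
df["Price"] = pd.to_numeric(df["Price"])
print(df["Price"].dtype)   # float64
print(df["Price"].mean())  # 15.25
```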
| Type Concept | Comes From | Used For | Returns / Represents |
|---|---|---|---|
| `type(x)` | Python | Built-in scalar data types | A Python object (e.g., `int`, `float`, `str`) |
| `np.int64`, `np.float64` | NumPy | Fixed-width numeric storage in arrays | A NumPy scalar type (e.g., `np.int64(75)`) |
| `.dtype` | NumPy | Describing the data type of a single array/Series | A NumPy `dtype` object (e.g., `dtype('int64')`) |
| `.dtypes` | pandas | Viewing all column data types in a DataFrame | A pandas Series of NumPy `dtype` objects |
To get the dtypes of the salary dataframe, you issue df.dtypes and pandas returns a Series object of column dtypes.
df_sal
| Salary | Bonus | Years | Rating | |
|---|---|---|---|---|
| Alice | 75 | 8.5 | 3 | 4.2 |
| Bob | 82 | 12.3 | 5 | 4.7 |
| Charlie | 68 | 5.2 | 2 | 3.8 |
| Diana | 91 | 15.8 | 7 | 4.9 |
| Eve | 55 | 3.1 | 1 | 3.5 |
df_sal.dtypes
Salary int64
Bonus float64
Years int64
Rating float64
dtype: object
When you select a single column (a Series) from a DataFrame, its dtype is shown at the end of the displayed Series. You can also use the .dtype attribute to access it directly.
df_sal['Salary'] ### Use bracket notation to select a single column
Alice 75
Bob 82
Charlie 68
Diana 91
Eve 55
Name: Salary, dtype: int64
df_sal['Salary'].dtype ### Get the data type of the 'Salary' column (SERIES)
dtype('int64')
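The four concepts in the table above can be seen side by side. A quick sketch (the Salary/Rating values are made up to mirror df_sal):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Salary": [75, 82], "Rating": [4.2, 4.7]})

print(type(42))              # Python's own type: <class 'int'>
value = df.at[0, "Salary"]   # a scalar pulled out of the DataFrame
print(type(value))           # a NumPy scalar type (np.int64 on most platforms)
print(df["Salary"].dtype)    # the dtype of one column (a Series): int64
print(df.dtypes)             # all column dtypes as a pandas Series
```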
4.3.2.2. Shape and Size#
Understanding the dimensions of a DataFrame is essential when working with real-world datasets. The .shape, .size, and len() properties give you a quick overview of how much data you’re working with.
| Item | Type | Parentheses? | Returns | Meaning |
|---|---|---|---|---|
| `.shape` | df property (attribute) | No | Tuple | Dimensions of the DataFrame |
| `.size` | df property (attribute) | No | Integer | Total number of elements (`rows * columns`) |
| `len(df)` | Python built-in function | Yes | Integer | Number of rows |
| `.head()` | df method | Yes | DataFrame | First 5 rows (by default) |
# Get the dimensions of the DataFrame (rows, columns)
df_sample.shape
print(df_sample.shape)
# Get the total number of elements (rows * columns)
df_sample.size
print(df_sample.size)
# Get the number of rows
len(df_sample)
print(len(df_sample))
(165, 4)
660
165
4.3.2.3. Columns and Index#
Every DataFrame has two key structural attributes: .columns, which holds the column labels, and .index, which labels the rows. Knowing how to access and work with these is fundamental to selecting and manipulating data.
# Get all column names
df_sample.columns
Index(['Duration', 'Pulse', 'Maxpulse', 'Calories'], dtype='object')
# Get the index (row labels)
df_sample.index
RangeIndex(start=0, stop=165, step=1)
4.3.2.4. Statistical Summary#
The .info() method gives you a concise overview of a DataFrame’s structure — including column names, data types, and non-null counts. The .describe() method then computes summary statistics (count, mean, min, max, etc.) for each column. Together, these are usually the first two things you run when exploring a new dataset.
4.3.2.4.1. The info() Method#
Now that we have the DataFrame object, let's use df.info() to see:
the data type of each column
the number of non-null values in each column
memory usage
df_sample.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 165 entries, 0 to 164
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Duration 165 non-null int64
1 Pulse 165 non-null int64
2 Maxpulse 165 non-null int64
3 Calories 160 non-null float64
dtypes: float64(1), int64(3)
memory usage: 5.3 KB
4.3.2.4.2. The describe() Method#
.describe() generates summary statistics for all numeric columns in one shot. It returns:
| Statistic | Description |
|---|---|
| `count` | Number of non-null values |
| `mean` | Average value |
| `std` | Standard deviation (spread of values) |
| `min` / `max` | Smallest and largest values |
| `25%`, `50%`, `75%` | Quartiles — the value below which 25%, 50%, and 75% of the data falls |
The 50% row is the median. If mean and median differ significantly, the distribution is likely skewed.
df_sample.describe() # Get summary statistics for numeric columns
| Duration | Pulse | Maxpulse | Calories | |
|---|---|---|---|---|
| count | 165.000000 | 165.000000 | 165.000000 | 160.000000 |
| mean | 63.757576 | 107.218182 | 133.721212 | 377.311250 |
| std | 42.792772 | 14.574063 | 16.508899 | 269.526378 |
| min | 15.000000 | 80.000000 | 100.000000 | 50.300000 |
| 25% | 45.000000 | 100.000000 | 124.000000 | 250.625000 |
| 50% | 60.000000 | 105.000000 | 131.000000 | 318.600000 |
| 75% | 60.000000 | 111.000000 | 140.000000 | 388.200000 |
| max | 300.000000 | 159.000000 | 184.000000 | 1860.400000 |
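To see the mean-vs-median comparison on a small example, take the Salary values from df_sal (recreated here so the snippet is self-contained):

```python
import pandas as pd

salary = pd.Series([75, 82, 68, 91, 55], name="Salary")
print(salary.mean())    # 74.2
print(salary.median())  # 75.0
# mean and median are close, so these salaries are roughly symmetric;
# in df_sample above, Calories has mean ~377 vs median ~319 -> right-skewed
```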
For this exercise, either
download actors.csv and place it in your project data folder, or
just use pd.read_csv("../../data/actors.csv") in the Live Code cell to access the actors.csv dataset.
### EXERCISE: Exploring DataFrame Attributes
#
# Using the actors.csv file from the data directory:
# 1. Read the CSV file into a DataFrame called df_actors
# 2. Find the number of rows and columns (shape)
# 3. Check the data types of each column
# 4. Get a statistical summary of the DataFrame
# 5. Display the index values
# 6. Calculate the total number of elements
#
### Your code starts here:
### Your code ends here.
Shape (rows, columns): (50, 6)
Data types:
Actor object
Total Gross float64
Number of Movies int64
Average per Movie float64
#1 Movie object
Gross float64
dtype: object
Statistical summary:
Total Gross Number of Movies Average per Movie Gross
count 50.000000 50.000000 50.000000 50.000000
mean 3034.088000 37.880000 93.518000 428.690000
std 591.528271 14.122843 58.509175 184.344412
min 2416.500000 7.000000 39.000000 169.700000
25% 2612.425000 27.500000 67.550000 279.300000
50% 2824.500000 37.000000 79.050000 419.150000
75% 3185.800000 43.750000 100.125000 519.800000
max 4871.700000 79.000000 451.800000 936.700000
Index: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49]
Total elements: 300
4.3.3. DataFrame Selection and Indexing#
Selection and indexing are fundamental operations in pandas that allow you to extract specific portions of your data. Understanding these operations is crucial for effective data analysis and manipulation.
Key Concepts
Selection refers to extracting data from a DataFrame—whether that’s entire columns, specific rows, or individual cells. Pandas provides several methods for selection, each with its own use cases:
Bracket notation []: The simplest method for selecting columns
df['column'] → selects a single column (returns a Series)
df[['col1', 'col2']] → selects multiple columns (returns a DataFrame)
Note: Use double brackets [[]] for multiple columns to get a DataFrame
.loc[]: Label-based indexing
Uses row and column labels (names)
Syntax: df.loc[row_label, column_label]
Supports slicing: df.loc['A':'C', 'col1':'col3'] (inclusive on both ends)
Ideal when you know the index names
.iloc[]: Integer position-based indexing
Uses row and column positions (integers, starting from 0)
Syntax: df.iloc[row_position, column_position]
Supports slicing: df.iloc[0:3, 0:2] (exclusive on the end, like Python lists)
Ideal when you need to select by position
.at[] / .iat[]: Fast access to a single value
.at[row_label, col_label] — label-based, faster than .loc[]
.iat[row_pos, col_pos] — position-based, faster than .iloc[]
Best used when reading or writing one cell, especially inside loops
When to use each method:
Use bracket notation for quick column selection
Use .loc[] when working with labeled indices (most common in data analysis)
Use .iloc[] when you need position-based selection (e.g., "first 5 rows")
Use .at[] / .iat[] when accessing a single cell — especially inside loops where performance matters
4.3.3.1. Common Patterns#
# Select a single column
df['Name']
# Select multiple columns
df[['Name', 'Age', 'City']]
# Select specific rows and columns by label
df.loc['Alice', 'Age']
df.loc[['Alice', 'Bob'], ['Name', 'Age']]
# Select by position
df.iloc[0, 1] # First row, second column
df.iloc[0:3, :] # First 3 rows, all columns
df.iloc[:, 0:2] # All rows, first 2 columns
# Boolean indexing
df[df['Age'] > 25] # Filter rows based on condition
Let’s practice these concepts with some exercises.
df_sal
| Salary | Bonus | Years | Rating | |
|---|---|---|---|---|
| Alice | 75 | 8.5 | 3 | 4.2 |
| Bob | 82 | 12.3 | 5 | 4.7 |
| Charlie | 68 | 5.2 | 2 | 3.8 |
| Diana | 91 | 15.8 | 7 | 4.9 |
| Eve | 55 | 3.1 | 1 | 3.5 |
4.3.3.2. Selecting Columns#
Use bracket notation with a single column name to select one column (the result is a Series). Pass a list of column names to select multiple columns (the result is a DataFrame).
df_sal['Salary']
Alice 75
Bob 82
Charlie 68
Diana 91
Eve 55
Name: Salary, dtype: int64
### Pass a list of column names
df_sal[['Salary','Rating']]
| Salary | Rating | |
|---|---|---|
| Alice | 75 | 4.2 |
| Bob | 82 | 4.7 |
| Charlie | 68 | 3.8 |
| Diana | 91 | 4.9 |
| Eve | 55 | 3.5 |
4.3.3.3. Using .loc[ ] - Label-based indexing#
ser = df_sal.loc['Alice']
print(type(ser), "\n")
ser
<class 'pandas.core.series.Series'>
Salary 75.0
Bonus 8.5
Years 3.0
Rating 4.2
Name: Alice, dtype: float64
df_sal.loc[['Alice', 'Bob']] ### pass a list
| Salary | Bonus | Years | Rating | |
|---|---|---|---|---|
| Alice | 75 | 8.5 | 3 | 4.2 |
| Bob | 82 | 12.3 | 5 | 4.7 |
### Get Salary and Rating for Alice and Bob
df_sal.loc[['Alice','Bob'],['Salary','Rating']]
| Salary | Rating | |
|---|---|---|
| Alice | 75 | 4.2 |
| Bob | 82 | 4.7 |
4.3.3.4. Using .iloc[ ] - Position-based indexing#
### Get the third row (index position 2) - Charlie's data
df_sal.iloc[2]
Salary 68.0
Bonus 5.2
Years 2.0
Rating 3.8
Name: Charlie, dtype: float64
### Get first two rows (positions 0 and 1)
df_sal.iloc[0:2]
| Salary | Bonus | Years | Rating | |
|---|---|---|---|---|
| Alice | 75 | 8.5 | 3 | 4.2 |
| Bob | 82 | 12.3 | 5 | 4.7 |
### Get value at row position 0, column position 0 (Alice's Salary)
df_sal.iloc[0, 0]
np.int64(75)
### Get first two rows and first two columns
df_sal.iloc[0:2, 0:2]
| Salary | Bonus | |
|---|---|---|
| Alice | 75 | 8.5 |
| Bob | 82 | 12.3 |
### Get all rows, but only columns at positions 0 and 2 (Salary and Rating)
df_sal.iloc[:, [0, 2]]
| Salary | Years | |
|---|---|---|
| Alice | 75 | 3 |
| Bob | 82 | 5 |
| Charlie | 68 | 2 |
| Diana | 91 | 7 |
| Eve | 55 | 1 |
4.3.3.5. .at[] and .iat[] — Fast Single-Value Access#
When you only need to read or write a single cell, .at[] and .iat[] are faster than .loc[] and .iloc[] because they are optimized for scalar access.
| Accessor | Based on | Example |
|---|---|---|
| `.at[]` | Row/column labels | `df_sal.at['Alice', 'Salary']` |
| `.iat[]` | Integer positions | `df_sal.iat[0, 0]` |
Use these when performance matters in a loop or when you just need one value.
# Using .at[] for fast label-based access to a single value
df_sal.at['Alice', 'Salary']
np.int64(75)
# Performance comparison: .at[] is faster than .loc[] for single values
# .loc[] works, but .at[] is optimized for this use case
print("Using .loc[]:", df_sal.loc['Bob', 'Rating'])
print("Using .at[]:", df_sal.at['Bob', 'Rating'])
Using .loc[]: 4.7
Using .at[]: 4.7
# .iat[] uses integer positions (row index, column index) — both are 0-based
print("Using .iloc[]:", df_sal.iloc[1, 3]) # row 1, col 3
print("Using .iat[]:", df_sal.iat[1, 3]) # same cell, faster for single value
Using .iloc[]: 4.7
Using .iat[]: 4.7
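As a sketch of the loop use case, here .iat[] reads and writes one cell per iteration (for a uniform bump like this, the vectorized df['Bonus'] += 1.0 would normally be preferred; the data mirrors a slice of df_sal):

```python
import pandas as pd

df = pd.DataFrame({"Salary": [75, 82, 68], "Bonus": [8.5, 12.3, 5.2]},
                  index=["Alice", "Bob", "Charlie"])

# Add 1.0 to every Bonus cell, one at a time, via fast positional access
col = df.columns.get_loc("Bonus")      # integer position of the Bonus column
for i in range(len(df)):
    df.iat[i, col] = df.iat[i, col] + 1.0

print([round(b, 1) for b in df["Bonus"]])  # [9.5, 13.3, 6.2]
```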
4.3.3.6. Boolean Indexing#
Boolean indexing lets you filter rows based on a condition. You pass a boolean expression inside [], and pandas returns only the rows where the condition is True.
A single condition: df[df['Salary'] > 70]
Multiple conditions: combine with & (and) or | (or), and wrap each condition in parentheses
Best practice: use .loc[] when you also want to select specific columns alongside the filter
Use .query() for a cleaner string-based syntax: df.query('Salary > 70 and Years > 2')
df_sal
| Salary | Bonus | Years | Rating | |
|---|---|---|---|---|
| Alice | 75 | 8.5 | 3 | 4.2 |
| Bob | 82 | 12.3 | 5 | 4.7 |
| Charlie | 68 | 5.2 | 2 | 3.8 |
| Diana | 91 | 15.8 | 7 | 4.9 |
| Eve | 55 | 3.1 | 1 | 3.5 |
df_sal[df_sal["Salary"]>70]
| Salary | Bonus | Years | Rating | |
|---|---|---|---|---|
| Alice | 75 | 8.5 | 3 | 4.2 |
| Bob | 82 | 12.3 | 5 | 4.7 |
| Diana | 91 | 15.8 | 7 | 4.9 |
### Get Years for employees with Salary > 70
df_sal[df_sal['Salary']>70]['Years']
Alice 3
Bob 5
Diana 7
Name: Years, dtype: int64
### BETTER PRACTICE: Use .loc[] for row and column selection together
df_sal.loc[df_sal['Salary']>70, ['Years','Bonus']]
| Years | Bonus | |
|---|---|---|
| Alice | 3 | 8.5 |
| Bob | 5 | 12.3 |
| Diana | 7 | 15.8 |
### Combine conditions with & (and) or | (or)
### Each condition must be in parentheses
df_sal.loc[(df_sal['Salary']>70) & (df_sal['Years'] > 2)]
| Salary | Bonus | Years | Rating | |
|---|---|---|---|---|
| Alice | 75 | 8.5 | 3 | 4.2 |
| Bob | 82 | 12.3 | 5 | 4.7 |
| Diana | 91 | 15.8 | 7 | 4.9 |
.query() is an alternative to boolean indexing that uses a cleaner string syntax — no need for df[...] wrappers or parentheses around each condition. It’s especially readable for multi-condition filters.
# Same result as above, but more readable
df_sal.query('Salary > 70 and Years > 2')
| Salary | Bonus | Years | Rating | |
|---|---|---|---|---|
| Alice | 75 | 8.5 | 3 | 4.2 |
| Bob | 82 | 12.3 | 5 | 4.7 |
| Diana | 91 | 15.8 | 7 | 4.9 |
### EXERCISE: DataFrame Selection and Indexing
#
# Using df_sal (the employee salary DataFrame defined earlier):
# Index: Alice, Bob, Charlie, Diana, Eve
# Columns: Salary, Bonus, Years, Rating
#
# Tasks:
# 1. Select only the 'Salary' and 'Years' columns (use double brackets)
# 2. Use .loc[] to get the 'Bonus' and 'Rating' for 'Bob' and 'Diana'
# 3. Use .iloc[] to get the first 3 rows and first 2 columns
# 4. Use boolean indexing to filter employees with Rating >= 4.5
# 5. Use .query() to find employees with Years >= 3 and Bonus > 8
#
### Your code starts here:
### Your code ends here.
Salary and Years:
Salary Years
Alice 75 3
Bob 82 5
Charlie 68 2
Diana 91 7
Eve 55 1
Bonus and Rating for Bob and Diana:
Bonus Rating
Bob 12.3 4.7
Diana 15.8 4.9
First 3 rows, first 2 columns:
Salary Bonus
Alice 75 8.5
Bob 82 12.3
Charlie 68 5.2
Employees with Rating >= 4.5:
Salary Bonus Years Rating
Bob 82 12.3 5 4.7
Diana 91 15.8 7 4.9
Employees with Years >= 3 and Bonus > 8:
Salary Bonus Years Rating
Alice 75 8.5 3 4.2
Bob 82 12.3 5 4.7
Diana 91 15.8 7 4.9
4.3.4. Modifying Rows/Columns#
4.3.4.1. Adding/Dropping columns#
Use the .drop() method to remove columns. Note that you must specify axis=1 to drop a column by its label.
Understanding axis parameter:
axis=0 → operates on rows (index) - "go down the rows"
axis=1 → operates on columns - "go across the columns"
Mnemonic: Think of axis as the direction you’re moving. axis=0 goes down (rows), axis=1 goes across (columns).
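A quick way to check the mnemonic is with .sum(), using a tiny throwaway DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"Q1": [1, 2], "Q2": [10, 20]}, index=["P1", "P2"])

print(df.sum(axis=0))  # down the rows: one total per column (Q1=3, Q2=30)
print(df.sum(axis=1))  # across the columns: one total per row (P1=11, P2=22)
```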
### Create a new column: Total Compensation
df_sal['TotalComp'] = df_sal['Salary'] + df_sal['Bonus']
# Note: column assignment with [] modifies the DataFrame in place, unlike most DataFrame methods, which return a new object by default
df_sal
| Salary | Bonus | Years | Rating | TotalComp | |
|---|---|---|---|---|---|
| Alice | 75 | 8.5 | 3 | 4.2 | 83.5 |
| Bob | 82 | 12.3 | 5 | 4.7 | 94.3 |
| Charlie | 68 | 5.2 | 2 | 3.8 | 73.2 |
| Diana | 91 | 15.8 | 7 | 4.9 | 106.8 |
| Eve | 55 | 3.1 | 1 | 3.5 | 58.1 |
df_sal.drop('TotalComp',axis=1,inplace=True)
df_sal
| Salary | Bonus | Years | Rating | |
|---|---|---|---|---|
| Alice | 75 | 8.5 | 3 | 4.2 |
| Bob | 82 | 12.3 | 5 | 4.7 |
| Charlie | 68 | 5.2 | 2 | 3.8 |
| Diana | 91 | 15.8 | 7 | 4.9 |
| Eve | 55 | 3.1 | 1 | 3.5 |
4.3.4.2. Dropping rows#
To drop rows, use the .drop() method with axis=0 (or no axis parameter since axis=0 is the default).
You can drop rows by specifying their index labels.
# Drop a single row by index label (returns a new DataFrame)
df_sal.drop('Charlie')
| Salary | Bonus | Years | Rating | |
|---|---|---|---|---|
| Alice | 75 | 8.5 | 3 | 4.2 |
| Bob | 82 | 12.3 | 5 | 4.7 |
| Diana | 91 | 15.8 | 7 | 4.9 |
| Eve | 55 | 3.1 | 1 | 3.5 |
# Drop multiple rows (pass a list of index labels)
df_sal.drop(['Alice', 'Eve'], axis=0)
| Salary | Bonus | Years | Rating | |
|---|---|---|---|---|
| Bob | 82 | 12.3 | 5 | 4.7 |
| Charlie | 68 | 5.2 | 2 | 3.8 |
| Diana | 91 | 15.8 | 7 | 4.9 |
# Use inplace=True to modify the original DataFrame
# df_sal.drop('Bob', axis=0, inplace=True)
# Note: Commented out to preserve df_sal for other examples
### EXERCISE: DataFrame Indexing Practice
#
# Given the following sales data:
sales_data = pd.DataFrame({
'Product': ['Laptop', 'Mouse', 'Keyboard', 'Monitor', 'Webcam'],
'Q1': [50, 200, 150, 75, 90],
'Q2': [65, 210, 160, 80, 95],
'Q3': [70, 190, 155, 85, 100],
'Q4': [80, 220, 170, 90, 110]
}, index=['P1', 'P2', 'P3', 'P4', 'P5'])
#
# Tasks:
# 1. Select the Q2 and Q4 columns
# 2. Select rows P2 and P4
# 3. Select Q3 sales for rows P1, P3, and P5 (should return a Series)
# 4. Add a new column 'Total' that sums Q1 through Q4 (Tip: use .sum(axis=1))
# 5. Drop the Q1 column (without modifying the original DataFrame)
#
# Hint: Use double brackets [[]] for selecting multiple columns
#
### Your code starts here:
### Your code ends here.
Q2 and Q4 columns:
Q2 Q4
P1 65 80
P2 210 220
P3 160 170
P4 80 90
P5 95 110
Rows P2 and P4:
Product Q1 Q2 Q3 Q4
P2 Mouse 200 210 190 220
P4 Monitor 75 80 85 90
Q3 sales for P1, P3, P5:
P1 70
P3 155
P5 100
Name: Q3, dtype: int64
With Total column:
Product Q1 Q2 Q3 Q4 Total
P1 Laptop 50 65 70 80 265
P2 Mouse 200 210 190 220 820
P3 Keyboard 150 160 155 170 635
P4 Monitor 75 80 85 90 330
P5 Webcam 90 95 100 110 395
Without Q1 column:
Product Q2 Q3 Q4
P1 Laptop 65 70 80
P2 Mouse 210 190 220
P3 Keyboard 160 155 170
P4 Monitor 80 85 90
P5 Webcam 95 100 110
4.3.5. Applying Functions with .apply()#
The .apply() method is a powerful tool that allows you to apply a function along an axis of the DataFrame (rows or columns). This is useful for transformations, calculations, and custom operations.
The syntax for .apply() is:
DataFrame.apply(func, axis=0, raw=False, result_type=None, args=(), **kwargs)
Parameters:
func: Function to apply to each column or rowaxis: {0 or ‘index’, 1 or ‘columns’}, default 00 or ‘index’: apply function to each column
1 or ‘columns’: apply function to each row
raw: bool, default False. If False, passes each row/column as a Series to the function. If True, passes an ndarray
result_type: {'expand', 'reduce', 'broadcast', None}, default None
args: tuple, positional arguments to pass to func
**kwargs: additional keyword arguments to pass to func
Returns: Series or DataFrame
Performance Note: For simple operations like arithmetic or comparisons, vectorized operations are usually faster than .apply(). Use .apply() when you need to apply a complex function that can’t be vectorized.
# Vectorized (faster): df['A'] * 2
# Using apply (slower): df['A'].apply(lambda x: x * 2)
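A rough timing sketch of that comparison (absolute numbers vary by machine, but the vectorized form should win by a wide margin):

```python
import time

import numpy as np
import pandas as pd

s = pd.Series(np.arange(1_000_000))

start = time.perf_counter()
vectorized = s * 2                      # one NumPy-level operation
t_vectorized = time.perf_counter() - start

start = time.perf_counter()
applied = s.apply(lambda x: x * 2)      # one Python function call per element
t_applied = time.perf_counter() - start

print(applied.equals(vectorized))       # True: same result either way
print(t_applied > t_vectorized)         # typically True, often by 10-100x
```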
df_dates
| A | B | C | D | E | F | |
|---|---|---|---|---|---|---|
| 2020-01-01 | 18 | 79 | 68 | 49 | 48 | 87 |
| 2020-01-02 | 17 | 72 | 28 | 18 | 57 | 97 |
| 2020-01-03 | 76 | 78 | 74 | 80 | 56 | 21 |
| 2020-01-04 | 85 | 50 | 55 | 43 | 26 | 93 |
# Apply a function to each column (axis=0, default)
print(df_dates.apply(np.sum))
print()
print(type(df_dates.apply(np.sum)))
A 196
B 279
C 225
D 190
E 187
F 298
dtype: int64
<class 'pandas.core.series.Series'>
df_dates.apply(np.sum, axis=1) ### axis = 1
2020-01-01 349
2020-01-02 289
2020-01-03 385
2020-01-04 352
Freq: D, dtype: int64
# Apply a custom function (double each value)
df_dates.apply(lambda x: x * 2)
| A | B | C | D | E | F | |
|---|---|---|---|---|---|---|
| 2020-01-01 | 36 | 158 | 136 | 98 | 96 | 174 |
| 2020-01-02 | 34 | 144 | 56 | 36 | 114 | 194 |
| 2020-01-03 | 152 | 156 | 148 | 160 | 112 | 42 |
| 2020-01-04 | 170 | 100 | 110 | 86 | 52 | 186 |
### EXERCISE: Applying Functions
#
# Create a DataFrame:
apply_df = pd.DataFrame({
'A': [1, 2, 3, 4],
'B': [10, 20, 30, 40],
'C': [100, 200, 300, 400]
})
#
# Tasks:
# 1. Apply a function to get the sum of each column
# 2. Apply a function to get the maximum value in each row
# 3. Apply a lambda function to double all values in column A
# 4. Apply a custom function that adds 5 to all values
#
### Your code starts here:
### Your code ends here.
Sum of each column:
A 10
B 100
C 1000
dtype: int64
Maximum value in each row:
0 100
1 200
2 300
3 400
dtype: int64
Column A doubled:
0 2
1 4
2 6
3 8
Name: A, dtype: int64
All values plus 5:
A B C
0 6 15 105
1 7 25 205
2 8 35 305
3 9 45 405
4.3.6. Advanced Indexing#
Indices are a powerful feature in Pandas that distinguish it from simple arrays. This section covers advanced index operations including basic manipulation and hierarchical indexing.
df_dates
| A | B | C | D | E | F | |
|---|---|---|---|---|---|---|
| 2020-01-01 | 18 | 79 | 68 | 49 | 48 | 87 |
| 2020-01-02 | 17 | 72 | 28 | 18 | 57 | 97 |
| 2020-01-03 | 76 | 78 | 74 | 80 | 56 | 21 |
| 2020-01-04 | 85 | 50 | 55 | 43 | 26 | 93 |
4.3.6.1. Changing Index#
You can directly assign a new index to a DataFrame by setting the .index attribute. The new index must have the same length as the number of rows.
newind = 'MO MI MS MN'.split()
newind
['MO', 'MI', 'MS', 'MN']
df_dates.index = newind
df_dates
| A | B | C | D | E | F | |
|---|---|---|---|---|---|---|
| MO | 18 | 79 | 68 | 49 | 48 | 87 |
| MI | 17 | 72 | 28 | 18 | 57 | 97 |
| MS | 76 | 78 | 74 | 80 | 56 | 21 |
| MN | 85 | 50 | 55 | 43 | 26 | 93 |
4.3.6.1.1. Resetting and Setting Index#
Besides direct assignment, Pandas provides two powerful methods for manipulating DataFrame indices:
.reset_index(): Converts the current index back to a regular column and creates a new default integer index (0, 1, 2, …).
drop=False (default): keeps the old index as a new column
drop=True: discards the old index completely
inplace=False (default): returns a new DataFrame
inplace=True: modifies the DataFrame in place
.set_index(): Converts a column into the index.
Takes a column name (or list of column names for MultiIndex)
The specified column(s) become the new index
drop=True (default): removes the column from the DataFrame after setting it as the index
inplace=False (default): returns a new DataFrame
These methods are particularly useful for:
Preparing data for merging operations
Converting between wide and long formats
Creating hierarchical indices
Organizing data for time series analysis
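As mentioned above, .set_index() also accepts a list of column names, which produces a MultiIndex. A minimal sketch (the column names here are made up for illustration):

```python
import pandas as pd

# Illustrative data; 'state', 'year', and 'pop' are hypothetical columns
df = pd.DataFrame({
    'state': ['MO', 'MO', 'MI', 'MI'],
    'year': [2020, 2021, 2020, 2021],
    'pop': [6.1, 6.2, 10.0, 10.1]
})

# Passing a list of columns creates a two-level hierarchical index
df_multi = df.set_index(['state', 'year'])
print(df_multi)
print(df_multi.index.nlevels)  # 2
```

Rows can then be addressed by label tuples, e.g. `df_multi.loc[('MO', 2020), 'pop']`. Hierarchical indices are covered in more detail in the Multi-level index section below.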
# Example: Create a DataFrame with custom index
sample_df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie'],
'Score': [85, 92, 78],
'Grade': ['B', 'A', 'C']
}, index=[100, 200, 300])
print("Original DataFrame:")
print(sample_df)
print()
Original DataFrame:
Name Score Grade
100 Alice 85 B
200 Bob 92 A
300 Charlie 78 C
# reset_index() with drop=False: old index becomes a column named 'index'
reset_df = sample_df.reset_index()
print("After reset_index() with drop=False:")
print(reset_df)
print()
After reset_index() with drop=False:
index Name Score Grade
0 100 Alice 85 B
1 200 Bob 92 A
2 300 Charlie 78 C
# reset_index() with drop=True: old index is discarded
reset_df_dropped = sample_df.reset_index(drop=True)
print("After reset_index() with drop=True:")
print(reset_df_dropped)
print()
After reset_index() with drop=True:
Name Score Grade
0 Alice 85 B
1 Bob 92 A
2 Charlie 78 C
# set_index(): Make 'Name' column the new index
name_index_df = sample_df.set_index('Name')
print("After set_index('Name'):")
print(name_index_df)
print()
print("Notice 'Name' is now the index, not a column")
After set_index('Name'):
Score Grade
Name
Alice 85 B
Bob 92 A
Charlie 78 C
Notice 'Name' is now the index, not a column
# Chain reset_index() and set_index() together
# First reset to default index, then set 'Grade' as new index
chained_df = sample_df.reset_index(drop=True).set_index('Grade')
print("After chaining reset_index() and set_index('Grade'):")
print(chained_df)
After chaining reset_index() and set_index('Grade'):
Name Score
Grade
B Alice 85
A Bob 92
C Charlie 78
### EXERCISE: Index Manipulation
#
# Create a DataFrame:
idx_df = pd.DataFrame({
'Value': [10, 20, 30, 40],
'Category': ['A', 'B', 'A', 'C']
}, index=[100, 200, 300, 400])
#
# Tasks:
# 1. Display the original DataFrame
# 2. Reset the index (drop the old index)
# 3. Set 'Category' as the new index
# 4. Display the final DataFrame
#
### Your code starts here:
### Your code ends here.
Original DataFrame:
Value Category
100 10 A
200 20 B
300 30 A
400 40 C
After reset_index:
Value Category
0 10 A
1 20 B
2 30 A
3 40 C
With Category as index:
Value
Category
A 10
B 20
A 30
C 40
4.3.6.2. Multi-level index#
A multi-level (hierarchical) index stores two or more label levels per row. One common constructor is pd.MultiIndex.from_arrays(), which zips parallel arrays of labels into index tuples.
# Create a hierarchical index using MultiIndex.from_arrays()
# The paired entries of the two arrays form (outer_level, inner_level) labels
outside = ['G1', 'G1', 'G1', 'G2', 'G2', 'G2']
inside = [1, 2, 3, 1, 2, 3]
hier_index = pd.MultiIndex.from_arrays([outside, inside], names=['Group', 'Num'])
hier_index
MultiIndex([('G1', 1),
('G1', 2),
('G1', 3),
('G2', 1),
('G2', 2),
('G2', 3)],
names=['Group', 'Num'])
df_multi = pd.DataFrame(data = np.random.randn(6,2), index=hier_index, columns=['A','B'])
df_multi
| | | A | B |
|---|---|---|---|
| Group | Num | | |
| G1 | 1 | -0.568813 | 0.109624 |
| | 2 | -0.538005 | -0.262710 |
| | 3 | -1.862577 | -1.893508 |
| G2 | 1 | 0.307188 | -0.372445 |
| | 2 | -0.352839 | -0.453912 |
| | 3 | -0.180638 | 0.019537 |
df_multi.loc['G1']
| | A | B |
|---|---|---|
| Num | | |
| 1 | -0.568813 | 0.109624 |
| 2 | -0.538005 | -0.262710 |
| 3 | -1.862577 | -1.893508 |
df_multi.loc['G2'].loc[2]['B'] ### the 2 here is the 'Num' index label 2, not a positional index
np.float64(-0.45391226176240046)
df_multi.index.names
FrozenList(['Group', 'Num'])
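The FrozenList shown above cannot be modified item-by-item, but the level names can be replaced wholesale, either by assigning a new list to .index.names or, non-destructively, with .rename_axis(). A sketch on a freshly built frame (not part of the original notebook):

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_arrays(
    [['G1', 'G1', 'G2', 'G2'], [1, 2, 1, 2]], names=['Group', 'Num']
)
df = pd.DataFrame(np.arange(8).reshape(4, 2), index=idx, columns=['A', 'B'])

# rename_axis returns a copy with the new level names; df is untouched
renamed = df.rename_axis(index=['Team', 'Round'])
print(renamed.index.names)
print(df.index.names)  # original names are unchanged
```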
df_multi
| | | A | B |
|---|---|---|---|
| Group | Num | | |
| G1 | 1 | -0.568813 | 0.109624 |
| | 2 | -0.538005 | -0.262710 |
| | 3 | -1.862577 | -1.893508 |
| G2 | 1 | 0.307188 | -0.372445 |
| | 2 | -0.352839 | -0.453912 |
| | 3 | -0.180638 | 0.019537 |
The .xs() method (short for cross-section) is a convenient way to select rows or columns from a Pandas DataFrame, especially when working with MultiIndex objects. It works like a shortcut for slicing.
df_multi.xs(1, level='Num')
| | A | B |
|---|---|---|
| Group | | |
| G1 | -0.568813 | 0.109624 |
| G2 | 0.307188 | -0.372445 |
# Select all rows where Num = 1 (across all groups)
print(df_multi.xs(1, level='Num'))
A B
Group
G1 -0.568813 0.109624
G2 0.307188 -0.372445
print(df_multi.xs(2, level='Num'))
A B
Group
G1 -0.538005 -0.262710
G2 -0.352839 -0.453912
# Select all rows where Group = 'G1'
df_multi.xs('G1', level='Group')
| | A | B |
|---|---|---|
| Num | | |
| 1 | -0.568813 | 0.109624 |
| 2 | -0.538005 | -0.262710 |
| 3 | -1.862577 | -1.893508 |
# Select a specific (Group, Num) combination
df_multi.xs(('G1', 1))
A -0.568813
B 0.109624
Name: (G1, 1), dtype: float64
# Get Num=1 rows, only column A
df_multi.xs(1, level='Num')['A']
Group
G1 -0.568813
G2 0.307188
Name: A, dtype: float64
# Get Num=1 rows, both A and B
df_multi.xs(1, level='Num')[['A', 'B']]
| | A | B |
|---|---|---|
| Group | | |
| G1 | -0.568813 | 0.109624 |
| G2 | 0.307188 | -0.372445 |
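Notice that in the results above, the 'Num' level disappears from the index: by default .xs() drops the level it selects on. It accepts a drop_level=False argument to keep that level; a small sketch using a freshly built frame:

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_arrays(
    [['G1', 'G1', 'G2', 'G2'], [1, 2, 1, 2]], names=['Group', 'Num']
)
df = pd.DataFrame(np.arange(8).reshape(4, 2), index=idx, columns=['A', 'B'])

# drop_level=False retains the 'Num' level in the result's index
kept = df.xs(1, level='Num', drop_level=False)
print(kept)
print(kept.index.names)  # both 'Group' and 'Num' remain
```

Keeping the level is handy when the cross-section will later be concatenated back with other slices and the full index is needed for alignment.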
### EXERCISE: Working with MultiIndex
#
# Create a MultiIndex DataFrame:
regions = ['North', 'North', 'North', 'South', 'South', 'South']
quarters = ['Q1', 'Q2', 'Q3', 'Q1', 'Q2', 'Q3']
multi_idx = pd.MultiIndex.from_arrays([regions, quarters], names=['Region', 'Quarter'])
sales_multi = pd.DataFrame({
'Revenue': [100, 150, 120, 80, 90, 110],
'Costs': [60, 70, 65, 50, 55, 60]
}, index=multi_idx)
#
# Tasks:
# 1. Display the DataFrame
# 2. Select all data for the 'North' region using .loc[]
# 3. Select all Q1 data across both regions using .xs()
# 4. Get the Revenue for South region, Q2
#
### Your code starts here:
### Your code ends here.
MultiIndex DataFrame:
Revenue Costs
Region Quarter
North Q1 100 60
Q2 150 70
Q3 120 65
South Q1 80 50
Q2 90 55
Q3 110 60
Data for North region:
Revenue Costs
Quarter
Q1 100 60
Q2 150 70
Q3 120 65
Q1 data across all regions:
Revenue Costs
Region
North 100 60
South 80 50
Revenue for South region, Q2:
90