4.4. Missing Data#


import sys
from pathlib import Path

current = Path.cwd()
for parent in [current, *current.parents]:
    if (parent / '_config.yml').exists():
        project_root = parent  # ← Add project root, not chapters
        break
else:
    project_root = Path.cwd().parent.parent

sys.path.insert(0, str(project_root))

from shared import thinkpython, diagram, jupyturtle
import numpy as np
import pandas as pd

Missing data is common in real-world datasets and can affect analysis, aggregation, and model training.
In pandas, missing values are represented with special sentinel markers (placeholder values that mean “missing”), not with a separate universal null type.

Common missing-value markers in pandas:

  • None: Python’s null singleton. In pandas, it is treated as missing and often appears in object columns.

  • np.nan (NaN): IEEE floating-point “Not a Number,” commonly used for missing values in numeric/float contexts.

  • pd.NA: pandas’ missing-value scalar for nullable extension dtypes (for example Int64, boolean, and string), which helps preserve logical dtypes.

  • pd.NaT: pandas’ missing-value marker for datetime-like values (datetime64, timedelta64, etc.).
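As a quick sketch of how these markers pair with dtypes under pandas’ default inference (the dtype names shown are pandas 2.x defaults):

```python
import numpy as np
import pandas as pd

# Each marker pairs with a different dtype under default inference
print(pd.Series(['a', None]).dtype)                   # object (None)
print(pd.Series([1.0, np.nan]).dtype)                 # float64 (NaN)
print(pd.Series([1, pd.NA], dtype='Int64').dtype)     # Int64 (pd.NA)
print(pd.Series(pd.to_datetime(['2024-01-01', None])).dtype)  # datetime64[ns] (NaT)
```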

Important comparison behavior:

  • np.nan != np.nan is True

  • pd.NA == pd.NA returns <NA> (unknown), not True

Because of this, detect missing values with isna() / notna() rather than equality checks.
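A minimal sketch of these comparison rules:

```python
import numpy as np
import pandas as pd

print(np.nan == np.nan)    # False: NaN is never equal to anything, even itself
print(np.nan != np.nan)    # True
print(pd.NA == pd.NA)      # <NA>: comparing two unknowns is itself unknown
print(pd.isna(np.nan))     # True: isna() detects missing values reliably
print(pd.isna(pd.NA))      # True
```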

The following table summarizes the four sentinel missing value markers in Pandas:

| Marker | Full Name | Introduced By | dtype | Common? | Use Case |
|--------|-----------|---------------|-------|---------|----------|
| None | None | Python | object | Most common | Missing values in string/object columns |
| NaN (np.nan) | Not a Number | NumPy | float64 | Most common | Missing values in numerical/float columns |
| NA (pd.NA) | Not Available | Pandas 1.0 (new) | nullable extension dtypes | Growing | Missing marker for nullable dtypes (e.g., Int64, boolean, string) |
| NaT (pd.NaT) | Not a Time | Pandas | datetime/timedelta | Specialized | Missing values in datetime or timedelta columns |

Let’s explore each of these sentinel values in detail, starting with None.

4.4.1. None as a Sentinel Value#

A sentinel value is a special value used to signal that data is missing, invalid, or absent — essentially a placeholder that means “there’s nothing here.” In Pandas, the choice of sentinel value depends on the data type:

None as a sentinel:

  • Python’s native None object is used for object/string arrays

  • When you include None in a NumPy array, the entire array is forced to object dtype

  • This is because None is a Python object, not a native NumPy type

  • Object arrays are usually slower and less type-stable; many operations fall back to Python objects

Why NaN for numerical data:

  • For numerical arrays, Pandas uses NaN (Not a Number) as the sentinel instead

  • NaN is a special IEEE 754 floating-point value that can coexist with numbers

  • This preserves native numerical dtypes and enables fast, compiled operations

  • However, it forces integer arrays to become float arrays (since NaN is a float value)

Pay attention to how dtypes change in the following examples:

### dtype is int64

arr = np.array([1, 1, 2, 3])
arr.dtype
dtype('int64')

In the following example, NumPy infers object dtype for arr because of the None element.

arr = np.array([1, None, 2, 3])
print("arr.dtype:", arr.dtype)
arr
arr.dtype: object
array([1, None, 2, 3], dtype=object)

The problem with object dtype: when None forces an array to object dtype, NumPy operations break because they expect native numerical types:

%%expect TypeError

arr.sum()     ### will generate a TypeError
TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'

Forcing dtype=float to avoid TypeError:

To prevent the TypeError with object dtype arrays, you can explicitly set dtype=float when creating the array. This converts None to NaN, which NumPy can handle natively.

However, this doesn’t solve the missing data problem — it just changes how NumPy handles it. Arithmetic operations with NaN propagate the missing value through calculations, so the sum still results in NaN. This behavior is intentional: it forces you to explicitly decide how to handle missing data rather than silently ignoring it.

arr = np.array([1, None, 2, 3], dtype=float)

print(arr[1])           ### None is converted to NaN (Not a Number) when using float dtype

arr.sum()               ### NaN propagates through calculations, so the result is NaN
nan
np.float64(nan)

### EXERCISE: Working with None in NumPy Arrays
# 1. Create a NumPy array [1, None, 2, 3] and call it arr.
# 2. Print arr and its dtype — what type does NumPy infer?
# 3. Try calling arr.sum() and note what happens.
# 4. Create the same array with dtype=float and call it arr_float.
# 5. Print arr_float and arr_float.sum(), and observe the result.
### Your code starts here:



### Your code ends here.


# Solution
# Create an array with None
arr = np.array([1, None, 2, 3])
print("arr:", arr)
print("dtype of arr:", arr.dtype)

print()
# With dtype=float
arr_float = np.array([1, None, 2, 3], dtype=float)
print("arr_float:", arr_float)
print("dtype of arr_float:", arr_float.dtype)
print("sum of arr_float:", arr_float.sum())
arr: [1 None 2 3]
dtype of arr: object

arr_float: [ 1. nan  2.  3.]
dtype of arr_float: float64
sum of arr_float: nan

4.4.2. NaN: Missing Numerical Data#

Unlike None, NaN (Not a Number) is a special IEEE 754 floating-point value that’s standardized across computing systems. When you create an array with NaN, NumPy keeps the native floating-point dtype instead of converting to object dtype.

Key advantages of NaN over None:

  • Preserves numerical dtype (float64) rather than forcing object dtype

  • Enables fast, vectorized operations

  • Works seamlessly with NumPy’s mathematical functions

  • Recognized by specialized functions like np.nansum(), np.nanmean(), etc.

Creating an array with NaN values while preserving float dtype:

arr = np.array([1, np.nan, 3, 4], dtype=float)
print(type(arr))
arr.dtype
<class 'numpy.ndarray'>
dtype('float64')

4.4.2.1. Standard sum with NaN#

When you use regular NumPy operations like np.sum() on an array containing NaN, the result propagates the missing value — the entire sum becomes NaN. This forces you to explicitly handle missing data rather than silently ignoring it:

np.sum(arr)
np.float64(nan)

4.4.2.2. NaN-aware functions#

NumPy provides specialized functions like np.nansum(), np.nanmean(), and np.nanstd() that ignore NaN values during computation. These allow you to work with incomplete data while getting meaningful results:

### Examples of NaN-aware functions

print(f"Sum (ignoring NaN):\t {np.nansum(arr)}")
print(f"Mean (ignoring NaN):\t {np.nanmean(arr)}")
print(f"Std (ignoring NaN):\t {np.nanstd(arr)}")
print(f"Min (ignoring NaN):\t {np.nanmin(arr)}")
print(f"Max (ignoring NaN):\t {np.nanmax(arr)}")
Sum (ignoring NaN):	 8.0
Mean (ignoring NaN):	 2.6666666666666665
Std (ignoring NaN):	 1.247219128924647
Min (ignoring NaN):	 1.0
Max (ignoring NaN):	 4.0

4.4.2.3. Limitation of NaN#

A key limitation of NaN is that it’s defined only for floating-point numbers—there’s no native NaN sentinel for integers, strings, or other types.
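For example, NumPy refuses to store NaN in an existing integer array, and including NaN at creation time silently upcasts the whole array to float:

```python
import numpy as np

arr = np.array([1, 2, 3])      # int64 array
try:
    arr[0] = np.nan            # there is no integer NaN
except ValueError as err:
    print("ValueError:", err)

# Including NaN at creation time upcasts the whole array to float64
print(np.array([1, np.nan, 3]).dtype)   # float64
```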

### EXERCISE: Using NaN-Aware Functions
### 1. print: Create a NumPy array with some np.nan values:
#      1, np.nan, 3, np.nan, 5
### 2. print: the sum of the array using np.sum()
### 3. print: the sum of the array using np.nansum()
### 4. print: the mean of the array using np.nanmean()
### 5. print: the standard deviation of the array using np.nanstd()
### Your code begins here





### Your code ends here


# Solution
arr = np.array([1, np.nan, 3, np.nan, 5])
print("the array: ", arr)
print("np.sum()    :", np.sum(arr))        # nan  – NaN propagates
print("np.nansum() :", np.nansum(arr))     # 9.0  – NaN ignored
print("np.nanmean():", np.nanmean(arr))    # 3.0  – mean of valid values
print("np.nanstd() :", np.nanstd(arr))     # std  – NaN ignored
the array:  [ 1. nan  3. nan  5.]
np.sum()    : nan
np.nansum() : 9.0
np.nanmean(): 3.0
np.nanstd() : 1.632993161855452

4.4.3. None, NaN, and NA in Pandas#

Both None and NaN serve as missing-value markers in Pandas, and the library treats them nearly interchangeably, automatically converting between them as needed.

A Series with both np.nan and None shows that Pandas converts both to NaN and uses a float64 dtype:

pd.Series( [ 1, np.nan, 2, None ] )
0    1.0
1    NaN
2    2.0
3    NaN
dtype: float64

4.4.3.1. Dtype promotion and upcasting#

When Pandas needs to store values with different types in a single Series or array, it “promotes” to a more general dtype that can accommodate all values. This is especially important for missing values.

Since many dtypes don’t have a native missing-value representation, Pandas must upcast to a compatible type:

  • Integers are promoted to float64 (because NaN is a float value)

  • Booleans are promoted to object (to accommodate None)

  • Floats stay as float (already support NaN)

  • Objects stay as object (already support None or NaN)

The typical promotion hierarchy:

  • bool → int → float → complex

  • For Pandas-specific types: int → float (when NaN is needed), or → nullable dtypes like Int64 (when pd.NA is used)
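The integer case is demonstrated below; the boolean case can be sketched the same way (object by default, or the nullable boolean dtype when requested explicitly):

```python
import pandas as pd

# Boolean data with a missing value: default inference promotes to object
print(pd.Series([True, False, None]).dtype)                   # object

# The nullable alternative preserves three-valued boolean logic
print(pd.Series([True, False, None], dtype='boolean').dtype)  # boolean
```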

The examples below demonstrate how Pandas handles dtype conversion automatically when missing values are introduced:

ser = pd.Series(range(3), dtype=int)
print("=== ser: ===")
print(ser, "\n")
print("the dtype of ser is: ", ser.dtype)

ser[0] = None           ### update element[0] to None
print("\n=== ser updated with None: ===")
print(ser)
print(f"\npandas upcast the type to: {ser.dtype}")
=== ser: ===
0    0
1    1
2    2
dtype: int64 

the dtype of ser is:  int64

=== ser updated with None: ===
0    NaN
1    1.0
2    2.0
dtype: float64

pandas upcast the type to: float64

4.4.3.2. Explicit Nullable Integer#

Here we explicitly request pandas’ nullable integer dtype (Int64) so missing values are represented with pd.NA instead of forcing float upcasting.

Pandas added nullable dtypes to address situations where type casting is an issue, such as representing a true integer array with missing data. Int64 (capital I) is pandas’ nullable integer dtype, distinct from NumPy’s int64 (lowercase), which is not nullable. Nullable dtypes are distinguished by the capitalization of their names (e.g., Int64 vs. int64) and, for backward compatibility, are used only when explicitly requested. For example:

### requesting NA: dtype=Int64, not int64

pd.Series([1, np.nan, 2, None, pd.NA], dtype='Int64')
0       1
1    <NA>
2       2
3    <NA>
4    <NA>
dtype: Int64

4.4.3.3. Nullable Dtypes in Practice#

Use nullable dtypes when you want missing values without losing the original logical type:

  • Int64 for integers with missing values

  • boolean for three-state logic (True/False/<NA>)

  • string for text with pd.NA

# Compare default inference vs explicit nullable dtype
s_default = pd.Series([1, None, 3])
s_nullable = pd.Series([1, None, 3], dtype='Int64') ### explicitly request the nullable integer dtype

print('default dtype  :', s_default.dtype)
print('nullable dtype :', s_nullable.dtype)
print(s_nullable)

flags = pd.Series([True, False, pd.NA], dtype='boolean')
names = pd.Series(['Alice', None, 'Charlie'], dtype='string')
print('flags dtype    :', flags.dtype)
print('names dtype    :', names.dtype)
default dtype  : float64
nullable dtype : Int64
0       1
1    <NA>
2       3
dtype: Int64
flags dtype    : boolean
names dtype    : string
# pd.NA follows 3-valued logic: comparisons can return <NA>
print('pd.NA == pd.NA ->', pd.NA == pd.NA)
print('pd.isna(pd.NA) ->', pd.isna(pd.NA))

print()

mask = s_nullable > 1
print('mask values:')
print(mask)

print()

# For indexing, convert unknown mask entries to False
print('safe filter result:')
print(s_nullable[mask.fillna(False)])
pd.NA == pd.NA -> <NA>
pd.isna(pd.NA) -> True

mask values:
0    False
1     <NA>
2     True
dtype: boolean

safe filter result:
2    3
dtype: Int64

In summary, Pandas has two common missing-data paths: the default legacy upcasting behavior, and nullable extension dtypes.

| Type class | Default path (with None/np.nan) | Nullable path (explicit nullable dtype) | Missing marker |
|------------|--------------------------------|-----------------------------------------|----------------|
| floating | Stays float64 | Float64 (optional) | np.nan or pd.NA |
| object/text | Stays object | string | None/np.nan or pd.NA |
| integer | Upcasts to float64 | Stays nullable integer (Int64, etc.) | np.nan or pd.NA |
| boolean | Upcasts to object | Stays nullable boolean | None/np.nan or pd.NA |
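To move an existing object from the default path to the nullable path in one step, pandas provides convert_dtypes(). A minimal sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'i': [1, np.nan],       # inferred as float64
                   'b': [True, None],      # inferred as object
                   's': ['x', None]})      # inferred as object

print(df.dtypes)                   # the default (legacy) path
print(df.convert_dtypes().dtypes)  # nullable path: Int64, boolean, string
```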

### EXERCISE: Pandas Missing Value Handling
### 1. print: Create a Pandas Series with a mix of None, np.nan, and regular values.
###    Observe the dtype. What missing-value marker does it use?
### 2. print: Create the same Series with the nullable Int64 dtype.
###    What missing-value marker does it use?
### 3. Produce the same results as seen below.
### Your code begins here




### Your code ends here


# Solution
s = pd.Series([1, None, 2, np.nan, 3])
print("Default dtype:", s.dtype)   # float64 — None and NaN both become NaN
print(s)

s_int = pd.Series([1, pd.NA, 2, pd.NA, 3], dtype='Int64')
print("\nNullable Int64 dtype:", s_int.dtype)   # Int64 — uses pd.NA
print(s_int)
Default dtype: float64
0    1.0
1    NaN
2    2.0
3    NaN
4    3.0
dtype: float64

Nullable Int64 dtype: Int64
0       1
1    <NA>
2       2
3    <NA>
4       3
dtype: Int64

4.4.4. Null Value Operations#

Pandas provides a small set of core tools for null-value work:

| Tool | Purpose | Typical use |
|------|---------|-------------|
| isna() / isnull() | Detect missing values | Build a Boolean mask (True where values are missing) |
| notna() / notnull() | Detect non-missing values | Filter to valid entries (True where values are present) |
| dropna() | Remove missing data | Drop rows/columns with nulls based on rules (axis, how, thresh, subset) |
| fillna() | Replace missing data | Fill with constants, statistics, or method-based values |
isnull() and notnull() are aliases for isna() and notna().

4.4.4.1. Check Your Objects#

Before applying null-value fixes, run a quick structural check:

| Check | Why it matters |
|-------|----------------|
| df.info() | Confirms shape, non-null counts, and memory usage |
| df.isna().sum() | Counts missing values per column |
| df.dtypes | Verifies column types before/after cleaning |

Also, isna is available both as a top-level function (pd.isna) and as object methods (Series.isna, DataFrame.isna).

### build a DataFrame
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A' : [ 1, 2, np.nan ],
    'B' : [ 5, np.nan, np.nan],
    'C' : [ 1, 2, 3]
    })
df
A B C
0 1.0 5.0 1
1 2.0 NaN 2
2 NaN NaN 3

4.4.4.1.1. df.info()#

df.info() prints a compact summary of the DataFrame: row count, column names, non-null counts, dtypes, and memory usage. For missing-data checks, the key part is the Non-Null Count column, which tells you how many values are present in each column.

### df.info() will show non-null count

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   A       2 non-null      float64
 1   B       1 non-null      float64
 2   C       3 non-null      int64  
dtypes: float64(2), int64(1)
memory usage: 200.0 bytes

4.4.4.1.2. isna().sum()#

df.isna() creates a boolean DataFrame (True for missing, False for present). Chaining .sum() counts True values per column, so the result shows how many missing values each column contains.

### Count missing values by column

df_nan_sum = df.isna().sum()
print("Sum of NaN's:")
print(df_nan_sum)
Sum of NaN's:
A    1
B    2
C    0
dtype: int64

4.4.4.1.3. The dtypes Attribute#

df.dtypes returns a Series where the index is column names and each value is that column’s data type. The last line in the display (dtype: object) is the dtype of this resulting Series (not the dtype of your DataFrame columns).

### Inspect dtypes

df.dtypes
A    float64
B    float64
C      int64
dtype: object

4.4.4.2. Detecting Null Values#

Use these paired methods to build Boolean masks:

| Method | Meaning of True | Common use |
|--------|-----------------|------------|
| isnull() (isna()) | Value is missing | Locate/count nulls |
| notnull() (notna()) | Value is present | Keep valid entries |

Let’s start with a Pandas Series.

ser = pd.Series([1, np.nan, 'hello', None])
ser
0        1
1      NaN
2    hello
3     None
dtype: object

isnull()/notnull() can be called as Series methods or as top-level pandas functions; both forms return the same Boolean mask.

### method vs function forms (same result)

print("isnull() as method:")
print(ser.isnull())

print("\nisnull() as top-level function:")
print(pd.isnull(ser))

print("\nnotnull() as method:")
print(ser.notnull())

print("\nnotnull() as top-level function:")
print(pd.notnull(ser))
isnull() as method:
0    False
1     True
2    False
3     True
dtype: bool

isnull() as top-level function:
0    False
1     True
2    False
3     True
dtype: bool

notnull() as method:
0     True
1    False
2     True
3    False
dtype: bool

notnull() as top-level function:
0     True
1    False
2     True
3    False
dtype: bool

Boolean masks can also be used as an index into a Series or DataFrame:

print("ser[ser.isnull()]:")
print()
print(ser.isnull())
print()
print(ser[ser.isnull()])

print()

print("\nser[ser.notnull()]:")
print(ser[ser.notnull()])
ser[ser.isnull()]:

0    False
1     True
2    False
3     True
dtype: bool

1     NaN
3    None
dtype: object


ser[ser.notnull()]:
0        1
2    hello
dtype: object

Now let’s look at a Pandas DataFrame.

df
A B C
0 1.0 5.0 1
1 2.0 NaN 2
2 NaN NaN 3
df.isnull()
A B C
0 False False False
1 False True False
2 True True False
df.notnull()
A B C
0 True True True
1 True False True
2 False False True

Again, a Boolean mask can be used as an index. With a DataFrame mask, df[df.notnull()] keeps the original shape and leaves NaN where the mask is False:

df[ df.notnull() ]
A B C
0 1.0 5.0 1
1 2.0 NaN 2
2 NaN NaN 3
### EXERCISE: Detecting Missing Values
#
# 1. print: create a Series ("ser") with the elements:
#    1, np.nan, 'hello', None, 5
# 2. print: Use isnull() to create a boolean mask
# 3. print: Count how many missing values are in the Series
# 4. print: Filter the Series to show only non-null values
### Your code starts here:




### Your code ends here.


# Solution

# Create the Series
ser = pd.Series([1, np.nan, 'hello', None, 5])
print("Original Series:")
print(ser)
print()

# 1. Boolean mask
print("Boolean mask (isnull):")
print(ser.isnull())
print()

# 2. Count missing values
print("Number of missing values:", ser.isnull().sum())
print()

# 3. Filter to non-null values
print("Non-null values:")
print(ser[ser.notnull()])
Original Series:
0        1
1      NaN
2    hello
3     None
4        5
dtype: object

Boolean mask (isnull):
0    False
1     True
2    False
3     True
4    False
dtype: bool

Number of missing values: 2

Non-null values:
0        1
2    hello
4        5
dtype: object

4.4.4.3. Dropping Null Values#

Beyond masking, pandas provides dropna() to remove missing entries. On a Series, its behavior is straightforward:

ser = pd.Series([1, np.nan, 'hello', None])
ser
0        1
1      NaN
2    hello
3     None
dtype: object
ser.dropna()
0        1
2    hello
dtype: object

In a DataFrame, dropna() removes whole rows or whole columns, not individual cells.

  • By default, it returns a new object with missing values removed.

  • Use inplace=True only if you want to modify the original object directly.
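The how parameter controls the dropping rule: how='any' (the default) drops a row or column containing any missing value, while how='all' drops it only when every value is missing. A quick sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.0, np.nan],
                   [np.nan, np.nan]])

print(df.dropna(how='any'))   # both rows contain NaN, so the result is empty
print(df.dropna(how='all'))   # only the all-NaN second row is dropped
```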

df = pd.DataFrame(
    [
        [1, np.nan, 2],
        [2, 3, 5],
        [np.nan, 4, 6]
    ]
)

df
0 1 2
0 1.0 NaN 2
1 2.0 3.0 5
2 NaN 4.0 6
### dropping rows by default

df.dropna()
0 1 2
1 2.0 3.0 5
### dropping columns instead

# df.dropna(axis=1)    ### the same as below
df.dropna(axis='columns')
2
0 2
1 5
2 6

thresh means: keep rows (or columns) that have at least that many non-missing values.

  • df.dropna(thresh=2) keeps rows with 2 or more non-NaN values

  • rows with fewer than 2 non-missing values are dropped

So thresh sets a minimum data-completeness requirement before keeping a row/column.

df.dropna(thresh=2)
0 1 2
0 1.0 NaN 2
1 2.0 3.0 5
2 NaN 4.0 6
### EXERCISE: Dropping Missing Values
#
# Create a DataFrame:
# df = pd.DataFrame({'A': [1, np.nan, 3], 'B': [4, 5, np.nan], 'C': [np.nan, np.nan, np.nan]})
# 1. print: the df
# 2. print: Drop rows with any missing values
# 3. Drop columns where ALL values are missing
# 4. Drop rows only if they have missing values in column 'A'
### Your code starts here:




### Your code ends here.


# Solution

import numpy as np
import pandas as pd

# Create the DataFrame
df = pd.DataFrame({'A': [1, np.nan, 3], 'B': [4, 5, np.nan], 'C': [np.nan, np.nan, np.nan]})
print("Original DataFrame:")
print(df)
print()

# 1. Drop rows with any missing values
print("Drop rows with any NaN:")
print(df.dropna())
print()

# 2. Drop columns where ALL values are missing
print("Drop columns where all values are NaN:")
print(df.dropna(axis=1, how='all'))
print()

# 3. Drop rows with missing values in column 'A'
print("Drop rows with NaN in column 'A':")
print(df.dropna(subset=['A']))
Original DataFrame:
     A    B   C
0  1.0  4.0 NaN
1  NaN  5.0 NaN
2  3.0  NaN NaN

Drop rows with any NaN:
Empty DataFrame
Columns: [A, B, C]
Index: []

Drop columns where all values are NaN:
     A    B
0  1.0  4.0
1  NaN  5.0
2  3.0  NaN

Drop rows with NaN in column 'A':
     A    B   C
0  1.0  4.0 NaN
2  3.0  NaN NaN

4.4.4.3.1. Filling Null Values#

Instead of dropping missing values, you may want to substitute a valid value—either a constant (e.g., 0) or an estimate via imputation (e.g., mean) or interpolation (estimated values between observed points). While you could do this with a Boolean mask from isna()/isnull(), Pandas offers the dedicated fillna() method, which returns a new object (or can operate in place) with nulls replaced.
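The interpolation mentioned above is available as Series.interpolate(), which by default fills each gap linearly between the surrounding observed values. A small sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])
print(s.interpolate())   # linear interpolation fills the gaps with 2.0 and 4.0
```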

ser = pd.Series(
    [1, np.nan, 3, None, 5],
    index=list('abcde'),
    dtype='Int64'
)
ser
a       1
b    <NA>
c       3
d    <NA>
e       5
dtype: Int64

We can fill NA entries with a single value such as 0:

# ser.fillna(0)  ### fill NA entries with a single value, such as zero
ser.fillna(value=0)
a    1
b    0
c    3
d    0
e    5
dtype: Int64
### We can specify a forward fill to propagate the previous value forward:
ser.ffill()
a    1
b    1
c    3
d    3
e    5
dtype: Int64
### Or we can specify a backward fill to propagate the next values backward:
ser.bfill()
a    1
b    3
c    3
d    5
e    5
dtype: Int64

For DataFrames the options are similar; you can additionally specify the axis (rows or columns) along which the fill is applied:

df = pd.DataFrame({
    'A' : [ 1, 2, np.nan ],
    'B' : [ 5, np.nan, np.nan],
    'C' : [ 1, 2, 3]
    })
df
A B C
0 1.0 5.0 1
1 2.0 NaN 2
2 NaN NaN 3
### fill a column (Series) with the mean of that column

df['A'].fillna(value=df['A'].mean())
0    1.0
1    2.0
2    1.5
Name: A, dtype: float64
print(df) ### print the original df for comparison

### ffill along rows (default)

df.ffill()
     A    B  C
0  1.0  5.0  1
1  2.0  NaN  2
2  NaN  NaN  3
A B C
0 1.0 5.0 1
1 2.0 5.0 2
2 2.0 5.0 3
print(df) ### print the original df for comparison

### bfill along columns 

df.bfill(axis=1)
     A    B  C
0  1.0  5.0  1
1  2.0  NaN  2
2  NaN  NaN  3
A B C
0 1.0 5.0 1.0
1 2.0 2.0 2.0
2 3.0 3.0 3.0
### fillna with the mean of each column (numeric only)
### Per-column mean (most common)

print(df) ### print the original df for comparison

# df.fillna(value=df.mean(numeric_only=True), inplace=True)
df.fillna(value=df.mean(numeric_only=True))
     A    B  C
0  1.0  5.0  1
1  2.0  NaN  2
2  NaN  NaN  3
A B C
0 1.0 5.0 1
1 2.0 5.0 2
2 1.5 5.0 3
### Advanced; FYI only
### T is the transpose of the DataFrame, which swaps rows and columns. 
#   This allows us to compute the mean across rows instead of columns.
### Per-row mean (fill each row’s NaNs with that row’s mean)
df.T.fillna(value=df.T.mean(numeric_only=True)).T
A B C
0 1.0 5.0 1.0
1 2.0 2.0 2.0
2 3.0 3.0 3.0
### Advanced; FYI only
### Per-row mean using apply and a lambda function

df.apply(lambda row: row.fillna(row.mean()), axis=1)
A B C
0 1.0 5.0 1.0
1 2.0 2.0 2.0
2 3.0 3.0 3.0
### EXERCISE: Filling Missing Values
#
# Create a DataFrame using the dictionary: {'X': [1, 2, np.nan, 4], 'Y': [np.nan, 2, 3, 4]}
# 1. Fill missing values in column 'X' with the mean of 'X'
# 2. Fill missing values in column 'Y' with forward fill
# 3. Fill all remaining NaN with 0
#
### Your code starts here:




### Your code ends here.


# Solution

import numpy as np
import pandas as pd

# Create the DataFrame
df = pd.DataFrame({'X': [1, 2, np.nan, 4], 'Y': [np.nan, 2, 3, 4]})
print("Original DataFrame:")
print(df)
print()

# 1. Fill 'X' with mean
df_filled = df.copy()
df_filled['X'] = df_filled['X'].fillna(df_filled['X'].mean())
print("After filling X with mean:")
print(df_filled)
print()

# 2. Fill 'Y' with forward fill
df_filled['Y'] = df_filled['Y'].ffill()
print("After forward filling Y:")
print(df_filled)
print()

# 3. Fill remaining with 0
df_filled = df_filled.fillna(0)
print("After filling remaining with 0:")
print(df_filled)
Original DataFrame:
     X    Y
0  1.0  NaN
1  2.0  2.0
2  NaN  3.0
3  4.0  4.0

After filling X with mean:
          X    Y
0  1.000000  NaN
1  2.000000  2.0
2  2.333333  3.0
3  4.000000  4.0

After forward filling Y:
          X    Y
0  1.000000  NaN
1  2.000000  2.0
2  2.333333  3.0
3  4.000000  4.0

After filling remaining with 0:
          X    Y
0  1.000000  0.0
1  2.000000  2.0
2  2.333333  3.0
3  4.000000  4.0
### EXERCISE: Handling Missing Data with dropna() and fillna()
df = pd.DataFrame({'X': [1, 2, np.nan, 4], 'Y': [np.nan, 2, 3, 4]})
### Using the DataFrame df, perform the following steps:
### 1. Check missing values per column with isna().sum()
### 2. Drop rows that have any missing values; compare shapes
### 3. Fill a numeric column's NaN with the column mean
### 4. Try dropna(thresh=2) and observe the difference
### Your code begins here

# 1. Check missing counts

# 2. Drop rows with any NaN

# 3. Fill a numeric column with the mean

# 4. Try thresh

### Your code ends here


# Solution
# 1. Missing counts per column
print("1. Missing values:\n", df.isna().sum())

# 2. Drop rows with any NaN
df_clean = df.dropna()
print(f"\n2. Original shape: {df.shape}  →  After dropna(): {df_clean.shape}")

# 3. Fill a numeric column with its mean
num_col = df.select_dtypes(include='number').columns[0]
df_filled = df.copy()
df_filled[num_col] = df[num_col].fillna(df[num_col].mean())
print(f"\n3. Filled '{num_col}' NaN with mean ({df[num_col].mean():.2f})")

# 4. thresh — keep rows with at least 2 non-NaN values
df_thresh = df.dropna(thresh=2)
print(f"\n4. After dropna(thresh=2) shape: {df_thresh.shape}")
1. Missing values:
 X    1
Y    1
dtype: int64

2. Original shape: (4, 2)  →  After dropna(): (2, 2)

3. Filled 'X' NaN with mean (2.33)

4. After dropna(thresh=2) shape: (2, 2)