4.4. Missing Data#


import sys
from pathlib import Path

current = Path.cwd()
for parent in [current, *current.parents]:
    if (parent / '_config.yml').exists():
        project_root = parent  # ← Add project root, not chapters
        break
else:
    project_root = Path.cwd().parent.parent

sys.path.insert(0, str(project_root))

from shared import thinkpython, diagram, jupyturtle
import numpy as np
import pandas as pd

Missing data is common in real-world datasets and can affect analysis, aggregation, and model training.
In pandas, missing values are represented with special sentinel markers (placeholder values that mean “missing”), not with a separate universal null type.

Common missing-value markers in pandas:

  • None: Python’s null singleton. In pandas, it is treated as missing and often appears in object columns.

  • np.nan (NaN): IEEE floating-point “Not a Number,” commonly used for missing values in numeric/float contexts.

  • pd.NA: pandas’ missing-value scalar for nullable extension dtypes (for example Int64, boolean, and string), which helps preserve logical dtypes.

  • pd.NaT: pandas’ missing-value marker for datetime-like values (datetime64, timedelta64, etc.).
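As a quick sketch of how these markers pair with dtypes under pandas’ default inference (the dtype names shown are pandas 2.x defaults):

```python
import numpy as np
import pandas as pd

# Each marker pairs with a different dtype under default inference
print(pd.Series(['a', None]).dtype)                   # object (None)
print(pd.Series([1.0, np.nan]).dtype)                 # float64 (NaN)
print(pd.Series([1, pd.NA], dtype='Int64').dtype)     # Int64 (pd.NA)
print(pd.Series(pd.to_datetime(['2024-01-01', None])).dtype)  # datetime64[ns] (NaT)
```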

Important comparison behavior:

  • np.nan != np.nan is True

  • pd.NA == pd.NA returns <NA> (unknown), not True

Because of this, detect missing values with isna() / notna() rather than equality checks.
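A minimal sketch of these comparison rules:

```python
import numpy as np
import pandas as pd

print(np.nan == np.nan)    # False: NaN is never equal to anything, even itself
print(np.nan != np.nan)    # True
print(pd.NA == pd.NA)      # <NA>: comparing two unknowns is itself unknown
print(pd.isna(np.nan))     # True: isna() detects missing values reliably
print(pd.isna(pd.NA))      # True
```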

The following table summarizes the four sentinel missing value markers in Pandas:

| Marker | Full Name | Introduced By | dtype | Common? | Use Case |
|--------|-----------|---------------|-------|---------|----------|
| None | None | Python | object | Most common | Missing values in string/object columns |
| NaN (np.nan) | Not a Number | NumPy | float64 | Most common | Missing values in numerical/float columns |
| NA (pd.NA) | Not Available | Pandas 1.0 (new) | nullable extension dtypes | Growing | Missing marker for nullable dtypes (e.g., Int64, boolean, string) |
| NaT (pd.NaT) | Not a Time | Pandas | datetime/timedelta | Specialized | Missing values in datetime or timedelta columns |

Let’s explore each of these sentinel values in detail, starting with None.

4.4.1. None as a Sentinel Value#

A sentinel value is a special value used to signal that data is missing, invalid, or absent — essentially a placeholder that means “there’s nothing here.” In Pandas, the choice of sentinel value depends on the data type:

None as a sentinel:

  • Python’s native None object is used for object/string arrays

  • When you include None in a NumPy array, the entire array is forced to object dtype

  • This is because None is a Python object, not a native NumPy type

  • Object arrays are usually slower and less type-stable; many operations fall back to Python objects

Why NaN for numerical data:

  • For numerical arrays, Pandas uses NaN (Not a Number) as the sentinel instead

  • NaN is a special IEEE 754 floating-point value that can coexist with numbers

  • This preserves native numerical dtypes and enables fast, compiled operations

  • However, it forces integer arrays to become float arrays (since NaN is a float value)

Pay attention to how dtypes change in the following examples:

### dtype is int64

arr = np.array([1, 1, 2, 3])
arr.dtype
dtype('int64')

In the following example, NumPy infers object dtype for arr because of the None element.

arr = np.array([1, None, 2, 3])
print("arr.dtype:", arr.dtype)
arr
arr.dtype: object
array([1, None, 2, 3], dtype=object)

The problem with object dtype: when None forces an array to object dtype, NumPy operations break because they expect native numerical types:

%%expect TypeError

arr.sum()     ### will generate a TypeError
TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'

Forcing dtype=float to avoid TypeError:

To prevent the TypeError with object dtype arrays, you can explicitly set dtype=float when creating the array. This converts None to NaN, which NumPy can handle natively.

However, this doesn’t solve the missing data problem — it just changes how NumPy handles it. Arithmetic operations with NaN propagate the missing value through calculations, so the sum still results in NaN. This behavior is intentional: it forces you to explicitly decide how to handle missing data rather than silently ignoring it.

arr = np.array([1, None, 2, 3], dtype=float)

print(arr[1])           ### None is converted to NaN (Not a Number) when using float dtype

arr.sum()               ### NaN propagates through calculations, so the result is NaN
nan
np.float64(nan)

### EXERCISE: Working with None in NumPy Arrays
# 1. Create a NumPy array [1, None, 2, 3] and call it arr.
# 2. Print arr and its dtype — what type does NumPy infer?
# 3. Try calling arr.sum() and note what happens.
# 4. Create the same array with dtype=float and call it arr_float.
# 5. Print arr_float and arr_float.sum(), and observe the result.
### Your code starts here:



### Your code ends here.


# Solution
# Create an array with None
arr = np.array([1, None, 2, 3])
print("arr:", arr)
print("dtype of arr:", arr.dtype)

print()
# With dtype=float
arr_float = np.array([1, None, 2, 3], dtype=float)
print("arr_float:", arr_float)
print("dtype of arr_float:", arr_float.dtype)
print("sum of arr_float:", arr_float.sum())
arr: [1 None 2 3]
dtype of arr: object

arr_float: [ 1. nan  2.  3.]
dtype of arr_float: float64
sum of arr_float: nan

4.4.2. NaN: Missing Numerical Data#

Unlike None, NaN (Not a Number) is a special IEEE 754 floating-point value that’s standardized across computing systems. When you create an array with NaN, NumPy keeps the native floating-point dtype instead of converting to object dtype.

Key advantages of NaN over None:

  • Preserves numerical dtype (float64) rather than forcing object dtype

  • Enables fast, vectorized operations

  • Works seamlessly with NumPy’s mathematical functions

  • Recognized by specialized functions like np.nansum(), np.nanmean(), etc.

Creating an array with NaN values while preserving float dtype:

arr = np.array([1, np.nan, 3, 4], dtype=float)
print(type(arr))
arr.dtype
<class 'numpy.ndarray'>
dtype('float64')

4.4.2.1. Standard sum with NaN#

When you use regular NumPy operations like np.sum() on an array containing NaN, the result propagates the missing value — the entire sum becomes NaN. This forces you to explicitly handle missing data rather than silently ignoring it:

np.sum(arr)
np.float64(nan)

4.4.2.2. NaN-aware functions#

NumPy provides specialized functions like np.nansum(), np.nanmean(), and np.nanstd() that ignore NaN values during computation. These allow you to work with incomplete data while getting meaningful results:

### Examples of NaN-aware functions

print(f"Sum (ignoring NaN):\t {np.nansum(arr)}")
print(f"Mean (ignoring NaN):\t {np.nanmean(arr)}")
print(f"Std (ignoring NaN):\t {np.nanstd(arr)}")
print(f"Min (ignoring NaN):\t {np.nanmin(arr)}")
print(f"Max (ignoring NaN):\t {np.nanmax(arr)}")
Sum (ignoring NaN):	 8.0
Mean (ignoring NaN):	 2.6666666666666665
Std (ignoring NaN):	 1.247219128924647
Min (ignoring NaN):	 1.0
Max (ignoring NaN):	 4.0

4.4.2.3. Limitation of NaN#

A key limitation of NaN is that it’s defined only for floating-point numbers—there’s no native NaN sentinel for integers, strings, or other types.
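For example, NumPy refuses to store NaN in an existing integer array, and including NaN at creation time silently upcasts the whole array to float:

```python
import numpy as np

arr = np.array([1, 2, 3])      # int64 array
try:
    arr[0] = np.nan            # there is no integer NaN
except ValueError as err:
    print("ValueError:", err)

# Including NaN at creation time upcasts the whole array to float64
print(np.array([1, np.nan, 3]).dtype)   # float64
```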

### EXERCISE: Using NaN-Aware Functions
### 1. print: Create a NumPy array with some np.nan values:
#      1, np.nan, 3, np.nan, 5
### 2. print: the sum of the array using np.sum()
### 3. print: the sum of the array using np.nansum()
### 4. print: the mean of the array using np.nanmean()
### 5. print: the standard deviation of the array using np.nanstd()
### Your code begins here





### Your code ends here


# Solution
arr = np.array([1, np.nan, 3, np.nan, 5])
print("the array: ", arr)
print("np.sum()    :", np.sum(arr))        # nan  – NaN propagates
print("np.nansum() :", np.nansum(arr))     # 9.0  – NaN ignored
print("np.nanmean():", np.nanmean(arr))    # 3.0  – mean of valid values
print("np.nanstd() :", np.nanstd(arr))     # std  – NaN ignored
the array:  [ 1. nan  3. nan  5.]
np.sum()    : nan
np.nansum() : 9.0
np.nanmean(): 3.0
np.nanstd() : 1.632993161855452

4.4.3. None, NaN, and NA in Pandas#

Both None and NaN serve as missing-value markers in Pandas, and the library treats them nearly interchangeably, automatically converting between them as needed.

A Series with both np.nan and None shows that Pandas converts both to NaN and uses a float64 dtype:

pd.Series( [ 1, np.nan, 2, None ] )
0    1.0
1    NaN
2    2.0
3    NaN
dtype: float64

4.4.3.1. Dtype promotion and upcasting#

When Pandas needs to store values with different types in a single Series or array, it “promotes” to a more general dtype that can accommodate all values. This is especially important for missing values.

Since many dtypes don’t have a native missing-value representation, Pandas must upcast to a compatible type:

  • Integers are promoted to float64 (because NaN is a float value)

  • Booleans are promoted to object (to accommodate None)

  • Floats stay as float (already support NaN)

  • Objects stay as object (already support None or NaN)

The typical promotion hierarchy:

  • bool → int → float → complex

  • For Pandas-specific types: int → float (when NaN is needed), or → nullable dtypes like Int64 (when pd.NA is used)
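The integer case is demonstrated below; the boolean case can be sketched the same way (object by default, or the nullable boolean dtype when requested explicitly):

```python
import pandas as pd

# Boolean data with a missing value: default inference promotes to object
print(pd.Series([True, False, None]).dtype)                   # object

# The nullable alternative preserves three-valued boolean logic
print(pd.Series([True, False, None], dtype='boolean').dtype)  # boolean
```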

The examples below demonstrate how Pandas handles dtype conversion automatically when missing values are introduced:

ser = pd.Series(range(3), dtype=int)
print("=== ser: ===")
print(ser, "\n")
print("the dtype of ser is: ", ser.dtype)

ser[0] = None           ### update element[0] to None
print("\n=== ser updated with None: ===")
print(ser)
print(f"\npandas upcast the type to: {ser.dtype}")
=== ser: ===
0    0
1    1
2    2
dtype: int64 

the dtype of ser is:  int64

=== ser updated with None: ===
0    NaN
1    1.0
2    2.0
dtype: float64

pandas upcast the type to: float64

4.4.3.2. Explicit Nullable Integer#

Here we explicitly request pandas’ nullable integer dtype (Int64) so missing values are represented with pd.NA instead of forcing float upcasting.

Pandas added nullable dtypes to address situations where type casting is an issue, such as representing a true integer array with missing data. Int64 (capital I) is pandas’ nullable integer dtype, distinct from NumPy’s int64 (lowercase), which is not nullable. Nullable dtypes are distinguished by the capitalization of their names (e.g., Int64 vs. int64) and, for backward compatibility, are used only when explicitly requested. For example:

### requesting NA: dtype=Int64, not int64

pd.Series([1, np.nan, 2, None, pd.NA], dtype='Int64')
0       1
1    <NA>
2       2
3    <NA>
4    <NA>
dtype: Int64

4.4.3.3. Nullable Dtypes in Practice#

Use nullable dtypes when you want missing values without losing the original logical type:

  • Int64 for integers with missing values

  • boolean for three-state logic (True/False/<NA>)

  • string for text with pd.NA

# Compare default inference vs explicit nullable dtype
s_default = pd.Series([1, None, 3])
s_nullable = pd.Series([1, None, 3], dtype='Int64') ### explicitly request the nullable integer dtype

print('default dtype  :', s_default.dtype)
print('nullable dtype :', s_nullable.dtype)
print(s_nullable)

flags = pd.Series([True, False, pd.NA], dtype='boolean')
names = pd.Series(['Alice', None, 'Charlie'], dtype='string')
print('flags dtype    :', flags.dtype)
print('names dtype    :', names.dtype)
default dtype  : float64
nullable dtype : Int64
0       1
1    <NA>
2       3
dtype: Int64
flags dtype    : boolean
names dtype    : string
# pd.NA follows 3-valued logic: comparisons can return <NA>
print('pd.NA == pd.NA ->', pd.NA == pd.NA)
print('pd.isna(pd.NA) ->', pd.isna(pd.NA))

print()

mask = s_nullable > 1
print('mask values:')
print(mask)

print()

# For indexing, convert unknown mask entries to False
print('safe filter result:')
print(s_nullable[mask.fillna(False)])
pd.NA == pd.NA -> <NA>
pd.isna(pd.NA) -> True

mask values:
0    False
1     <NA>
2     True
dtype: boolean

safe filter result:
2    3
dtype: Int64

In summary, Pandas has two common missing-data paths: the default legacy upcasting behavior, and nullable extension dtypes.

| Type class | Default path (with None/np.nan) | Nullable path (explicit nullable dtype) | Missing marker |
|------------|--------------------------------|-----------------------------------------|----------------|
| floating | Stays float64 | Float64 (optional) | np.nan or pd.NA |
| object/text | Stays object | string | None/np.nan or pd.NA |
| integer | Upcasts to float64 | Stays nullable integer (Int64, etc.) | np.nan or pd.NA |
| boolean | Upcasts to object | Stays nullable boolean | None/np.nan or pd.NA |
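To move an existing object from the default path to the nullable path in one step, pandas provides convert_dtypes(). A minimal sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'i': [1, np.nan],       # inferred as float64
                   'b': [True, None],      # inferred as object
                   's': ['x', None]})      # inferred as object

print(df.dtypes)                   # the default (legacy) path
print(df.convert_dtypes().dtypes)  # nullable path: Int64, boolean, string
```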

### EXERCISE: Pandas Missing Value Handling
### 1. print: Create a Pandas Series with a mix of None, np.nan, and regular values.
###    Observe the dtype. What missing-value marker does it use?
### 2. print: Create the same Series with the nullable Int64 dtype.
###    What missing-value marker does it use?
### 3. Produce the same results as seen below.
### Your code begins here




### Your code ends here


# Solution
s = pd.Series([1, None, 2, np.nan, 3])
print("Default dtype:", s.dtype)   # float64 — None and NaN both become NaN
print(s)

s_int = pd.Series([1, pd.NA, 2, pd.NA, 3], dtype='Int64')
print("\nNullable Int64 dtype:", s_int.dtype)   # Int64 — uses pd.NA
print(s_int)
Default dtype: float64
0    1.0
1    NaN
2    2.0
3    NaN
4    3.0
dtype: float64

Nullable Int64 dtype: Int64
0       1
1    <NA>
2       2
3    <NA>
4       3
dtype: Int64

4.4.4. Null Value Operations#

Pandas provides a small set of core tools for null-value work:

| Tool | Purpose | Typical use |
|------|---------|-------------|
| isna() / isnull() | Detect missing values | Build a Boolean mask (True where values are missing) |
| notna() / notnull() | Detect non-missing values | Filter to valid entries (True where values are present) |
| dropna() | Remove missing data | Drop rows/columns with nulls based on rules (axis, how, thresh, subset) |
| fillna() | Replace missing data | Fill with constants, statistics, or method-based values |
isnull() and notnull() are aliases for isna() and notna().

4.4.4.1. Check Your Objects#

Before applying null-value fixes, run a quick structural check:

| Check | Why it matters |
|-------|----------------|
| df.info() | Confirms shape, non-null counts, and memory usage |
| df.isna().sum() | Counts missing values per column |
| df.dtypes | Verifies column types before/after cleaning |

Also, isna is available both as a top-level function (pd.isna) and as object methods (Series.isna, DataFrame.isna).

### build a DataFrame
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A' : [ 1, 2, np.nan ],
    'B' : [ 5, np.nan, np.nan],
    'C' : [ 1, 2, 3]
    })
df
A B C
0 1.0 5.0 1
1 2.0 NaN 2
2 NaN NaN 3

4.4.4.1.1. df.info()#

df.info() prints a compact summary of the DataFrame: row count, column names, non-null counts, dtypes, and memory usage. For missing-data checks, the key part is the Non-Null Count column, which tells you how many values are present in each column.

### df.info() will show non-null count

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   A       2 non-null      float64
 1   B       1 non-null      float64
 2   C       3 non-null      int64  
dtypes: float64(2), int64(1)
memory usage: 200.0 bytes

4.4.4.1.2. isna().sum()#

df.isna() creates a boolean DataFrame (True for missing, False for present). Chaining .sum() counts True values per column, so the result shows how many missing values each column contains.

### Count missing values by column

df_nan_sum = df.isna().sum()
print("Sum of NaN's:")
print(df_nan_sum)
Sum of NaN's:
A    1
B    2
C    0
dtype: int64

4.4.4.1.3. The dtypes Attribute#

df.dtypes returns a Series where the index is column names and each value is that column’s data type. The last line in the display (dtype: object) is the dtype of this resulting Series (not the dtype of your DataFrame columns).

### Inspect dtypes

df.dtypes
A    float64
B    float64
C      int64
dtype: object

4.4.4.2. Detecting Null Values#

Use these paired methods to build Boolean masks:

| Method | Meaning of True | Common use |
|--------|-----------------|------------|
| isnull() (isna()) | Value is missing | Locate/count nulls |
| notnull() (notna()) | Value is present | Keep valid entries |

Let’s start with a Pandas Series.

ser = pd.Series([1, np.nan, 'hello', None])
ser
0        1
1      NaN
2    hello
3     None
dtype: object

isnull()/notnull() can be called as Series methods or as top-level pandas functions; both forms return the same Boolean mask.

### method vs function forms (same result)

print("isnull() as method:")
print(ser.isnull())

print("\nisnull() as top-level function:")
print(pd.isnull(ser))

print("\nnotnull() as method:")
print(ser.notnull())

print("\nnotnull() as top-level function:")
print(pd.notnull(ser))
isnull() as method:
0    False
1     True
2    False
3     True
dtype: bool

isnull() as top-level function:
0    False
1     True
2    False
3     True
dtype: bool

notnull() as method:
0     True
1    False
2     True
3    False
dtype: bool

notnull() as top-level function:
0     True
1    False
2     True
3    False
dtype: bool

Boolean masks can also be used as an index into a Series or DataFrame:

print("ser[ser.isnull()]:")
print()
print(ser.isnull())
print()
print(ser[ser.isnull()])

print()

print("\nser[ser.notnull()]:")
print(ser[ser.notnull()])
ser[ser.isnull()]:

0    False
1     True
2    False
3     True
dtype: bool

1     NaN
3    None
dtype: object


ser[ser.notnull()]:
0        1
2    hello
dtype: object

Now let’s look at a Pandas DataFrame.

df
A B C
0 1.0 5.0 1
1 2.0 NaN 2
2 NaN NaN 3
df.isnull()
A B C
0 False False False
1 False True False
2 True True False
df.notnull()
A B C
0 True True True
1 True False True
2 False False True

Again, a Boolean mask can be used as an index. With a DataFrame mask, df[df.notnull()] keeps the original shape and leaves NaN where the mask is False:

df[ df.notnull() ]
A B C
0 1.0 5.0 1
1 2.0 NaN 2
2 NaN NaN 3
### EXERCISE: Detecting Missing Values
#
# 1. print: create a Series ("ser") with the elements:
#    1, np.nan, 'hello', None, 5
# 2. print: Use isnull() to create a boolean mask
# 3. print: Count how many missing values are in the Series
# 4. print: Filter the Series to show only non-null values
### Your code starts here:




### Your code ends here.


# Solution

# Create the Series
ser = pd.Series([1, np.nan, 'hello', None, 5])
print("Original Series:")
print(ser)
print()

# 1. Boolean mask
print("Boolean mask (isnull):")
print(ser.isnull())
print()

# 2. Count missing values
print("Number of missing values:", ser.isnull().sum())
print()

# 3. Filter to non-null values
print("Non-null values:")
print(ser[ser.notnull()])
Original Series:
0        1
1      NaN
2    hello
3     None
4        5
dtype: object

Boolean mask (isnull):
0    False
1     True
2    False
3     True
4    False
dtype: bool

Number of missing values: 2

Non-null values:
0        1
2    hello
4        5
dtype: object

4.4.4.3. Dropping Null Values#

Beyond masking, pandas provides dropna() to remove missing entries. On a Series, its behavior is straightforward:

ser = pd.Series([1, np.nan, 'hello', None])
ser
0        1
1      NaN
2    hello
3     None
dtype: object
ser.dropna()
0        1
2    hello
dtype: object

In a DataFrame, dropna() removes whole rows or whole columns, not individual cells.

  • By default, it returns a new object with missing values removed.

  • Use inplace=True only if you want to modify the original object directly.
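The how parameter controls the dropping rule: how='any' (the default) drops a row or column containing any missing value, while how='all' drops it only when every value is missing. A quick sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1.0, np.nan],
                   [np.nan, np.nan]])

print(df.dropna(how='any'))   # both rows contain NaN, so the result is empty
print(df.dropna(how='all'))   # only the all-NaN second row is dropped
```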

df = pd.DataFrame(
    [
        [1, np.nan, 2],
        [2, 3, 5],
        [np.nan, 4, 6]
    ]
)

df
0 1 2
0 1.0 NaN 2
1 2.0 3.0 5
2 NaN 4.0 6
### dropping rows by default

df.dropna()
0 1 2
1 2.0 3.0 5
### dropping columns instead

# df.dropna(axis=1)    ### the same as below
df.dropna(axis='columns')
2
0 2
1 5
2 6

thresh means: keep rows (or columns) that have at least that many non-missing values.

  • df.dropna(thresh=2) keeps rows with 2 or more non-NaN values

  • rows with fewer than 2 non-missing values are dropped

So thresh sets a minimum data-completeness requirement before keeping a row/column.

df.dropna(thresh=2)
0 1 2
0 1.0 NaN 2
1 2.0 3.0 5
2 NaN 4.0 6
### EXERCISE: Dropping Missing Values
#
# Create a DataFrame:
# df = pd.DataFrame({'A': [1, np.nan, 3], 'B': [4, 5, np.nan], 'C': [np.nan, np.nan, np.nan]})
# 1. print: the df
# 2. print: Drop rows with any missing values
# 3. Drop columns where ALL values are missing
# 4. Drop rows only if they have missing values in column 'A'
### Your code starts here:




### Your code ends here.


# Solution

import numpy as np
import pandas as pd

# Create the DataFrame
df = pd.DataFrame({'A': [1, np.nan, 3], 'B': [4, 5, np.nan], 'C': [np.nan, np.nan, np.nan]})
print("Original DataFrame:")
print(df)
print()

# 1. Drop rows with any missing values
print("Drop rows with any NaN:")
print(df.dropna())
print()

# 2. Drop columns where ALL values are missing
print("Drop columns where all values are NaN:")
print(df.dropna(axis=1, how='all'))
print()

# 3. Drop rows with missing values in column 'A'
print("Drop rows with NaN in column 'A':")
print(df.dropna(subset=['A']))
Original DataFrame:
     A    B   C
0  1.0  4.0 NaN
1  NaN  5.0 NaN
2  3.0  NaN NaN

Drop rows with any NaN:
Empty DataFrame
Columns: [A, B, C]
Index: []

Drop columns where all values are NaN:
     A    B
0  1.0  4.0
1  NaN  5.0
2  3.0  NaN

Drop rows with NaN in column 'A':
     A    B   C
0  1.0  4.0 NaN
2  3.0  NaN NaN

4.4.4.3.1. Filling Null Values#

Instead of dropping missing values, you may want to substitute a valid value—either a constant (e.g., 0) or an estimate via imputation (e.g., mean) or interpolation (estimated values between observed points). While you could do this with a Boolean mask from isna()/isnull(), Pandas offers the dedicated fillna() method, which returns a new object (or can operate in place) with nulls replaced.
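The interpolation mentioned above is available as Series.interpolate(), which by default fills each gap linearly between the surrounding observed values. A small sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])
print(s.interpolate())   # linear interpolation fills the gaps with 2.0 and 4.0
```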

ser = pd.Series(
    [1, np.nan, 3, None, 5],
    index=list('abcde'),
    dtype='Int64'
)
ser
a       1
b    <NA>
c       3
d    <NA>
e       5
dtype: Int64

We can fill NA entries with a single value such as 0:

# ser.fillna(0)  ### fill NA entries with a single value, such as zero
ser.fillna(value=0)
a    1
b    0
c    3
d    0
e    5
dtype: Int64
### We can specify a forward fill to propagate the previous value forward:
ser.ffill()
a    1
b    1
c    3
d    3
e    5
dtype: Int64
### Or we can specify a backward fill to propagate the next values backward:
ser.bfill()
a    1
b    3
c    3
d    5
e    5
dtype: Int64

For DataFrames the options are similar; you can additionally specify the axis (rows or columns) along which the fill is applied:

df = pd.DataFrame({
    'A' : [ 1, 2, np.nan ],
    'B' : [ 5, np.nan, np.nan],
    'C' : [ 1, 2, 3]
    })
df
A B C
0 1.0 5.0 1
1 2.0 NaN 2
2 NaN NaN 3
### fill a column (Series) with the mean of that column

df['A'].fillna(value=df['A'].mean())
0    1.0
1    2.0
2    1.5
Name: A, dtype: float64
print(df) ### print the original df for comparison

### ffill along rows (default)

df.ffill()
     A    B  C
0  1.0  5.0  1
1  2.0  NaN  2
2  NaN  NaN  3
A B C
0 1.0 5.0 1
1 2.0 5.0 2
2 2.0 5.0 3
print(df) ### print the original df for comparison

### bfill along columns 

df.bfill(axis=1)
     A    B  C
0  1.0  5.0  1
1  2.0  NaN  2
2  NaN  NaN  3
A B C
0 1.0 5.0 1.0
1 2.0 2.0 2.0
2 3.0 3.0 3.0
### fillna with the mean of each column (numeric only)
### Per-column mean (most common)

print(df) ### print the original df for comparison

# df.fillna(value=df.mean(numeric_only=True), inplace=True)
df.fillna(value=df.mean(numeric_only=True))
     A    B  C
0  1.0  5.0  1
1  2.0  NaN  2
2  NaN  NaN  3
A B C
0 1.0 5.0 1
1 2.0 5.0 2
2 1.5 5.0 3
### Advanced; FYI only
### T is the transpose of the DataFrame, which swaps rows and columns. 
#   This allows us to compute the mean across rows instead of columns.
### Per-row mean (fill each row’s NaNs with that row’s mean)
df.T.fillna(value=df.T.mean(numeric_only=True)).T
A B C
0 1.0 5.0 1.0
1 2.0 2.0 2.0
2 3.0 3.0 3.0
### Advanced; FYI only
### Per-row mean using apply and a lambda function

df.apply(lambda row: row.fillna(row.mean()), axis=1)
A B C
0 1.0 5.0 1.0
1 2.0 2.0 2.0
2 3.0 3.0 3.0
### EXERCISE: Filling Missing Values
#
# Create a DataFrame using the dictionary: {'X': [1, 2, np.nan, 4], 'Y': [np.nan, 2, 3, 4]}
# 1. Fill missing values in column 'X' with the mean of 'X'
# 2. Fill missing values in column 'Y' with forward fill
# 3. Fill all remaining NaN with 0
#
### Your code starts here:




### Your code ends here.


# Solution

import numpy as np
import pandas as pd

# Create the DataFrame
df = pd.DataFrame({'X': [1, 2, np.nan, 4], 'Y': [np.nan, 2, 3, 4]})
print("Original DataFrame:")
print(df)
print()

# 1. Fill 'X' with mean
df_filled = df.copy()
df_filled['X'] = df_filled['X'].fillna(df_filled['X'].mean())
print("After filling X with mean:")
print(df_filled)
print()

# 2. Fill 'Y' with forward fill
df_filled['Y'] = df_filled['Y'].ffill()
print("After forward filling Y:")
print(df_filled)
print()

# 3. Fill remaining with 0
df_filled = df_filled.fillna(0)
print("After filling remaining with 0:")
print(df_filled)
Original DataFrame:
     X    Y
0  1.0  NaN
1  2.0  2.0
2  NaN  3.0
3  4.0  4.0

After filling X with mean:
          X    Y
0  1.000000  NaN
1  2.000000  2.0
2  2.333333  3.0
3  4.000000  4.0

After forward filling Y:
          X    Y
0  1.000000  NaN
1  2.000000  2.0
2  2.333333  3.0
3  4.000000  4.0

After filling remaining with 0:
          X    Y
0  1.000000  0.0
1  2.000000  2.0
2  2.333333  3.0
3  4.000000  4.0
### EXERCISE: Handling Missing Data with dropna() and fillna()
df = pd.DataFrame({'X': [1, 2, np.nan, 4], 'Y': [np.nan, 2, 3, 4]})
### Using the DataFrame df, perform the following steps:
### 1. Check missing values per column with isna().sum()
### 2. Drop rows that have any missing values; compare shapes
### 3. Fill a numeric column's NaN with the column mean
### 4. Try dropna(thresh=2) and observe the difference
### Your code begins here

# 1. Check missing counts

# 2. Drop rows with any NaN

# 3. Fill a numeric column with the mean

# 4. Try thresh

### Your code ends here


# Solution
# 1. Missing counts per column
print("1. Missing values:\n", df.isna().sum())

# 2. Drop rows with any NaN
df_clean = df.dropna()
print(f"\n2. Original shape: {df.shape}  →  After dropna(): {df_clean.shape}")

# 3. Fill a numeric column with its mean
num_col = df.select_dtypes(include='number').columns[0]
df_filled = df.copy()
df_filled[num_col] = df[num_col].fillna(df[num_col].mean())
print(f"\n3. Filled '{num_col}' NaN with mean ({df[num_col].mean():.2f})")

# 4. thresh — keep rows with at least 2 non-NaN values
df_thresh = df.dropna(thresh=2)
print(f"\n4. After dropna(thresh=2) shape: {df_thresh.shape}")
1. Missing values:
 X    1
Y    1
dtype: int64

2. Original shape: (4, 2)  →  After dropna(): (2, 2)

3. Filled 'X' NaN with mean (2.33)

4. After dropna(thresh=2) shape: (2, 2)