7.4. Missing Data#
Missing data is common in real-world datasets and can affect analysis, aggregations, and model training. By default, Pandas treats missing values with a sentinel-based scheme. In computer programming, a sentinel value (also referred to as a flag value or dummy value) is a special value reserved to signal a condition, here, that an entry is missing.
Missing data is typically represented as NaN
(Not a Number), available as np.nan
in NumPy, a special floating-point value defined in the IEEE 754 standard. Pandas also introduces its own NA
scalar (pd.NA
), designed to work better with non-numeric data (like strings or booleans) and to allow consistent missing-data handling across types.
In summary, for missing values in Pandas, we have:
Sentinel missing-value markers:
None
: a Python built-in constant with object dtype; Pandas also treats None
as a missing value.
NaN
(np.nan
): marks missing numerical data in float dtypes.
pd.NA
: represents missing values across all nullable dtypes (preserving the column’s dtype instead of forcing a cast to float).
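As a quick, minimal check, pd.isna() recognizes all three markers as missing:

```python
import numpy as np
import pandas as pd

# pd.isna recognizes every missing-value marker
print(pd.isna(None))    # True
print(pd.isna(np.nan))  # True
print(pd.isna(pd.NA))   # True
```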
7.4.1. None as a Sentinel Value#
For some data types, Pandas uses None
as a sentinel value. However, because None
is a Python object, Pandas does not use None
as a sentinel for numerical arrays. Pay attention to the dtypes in the following operations.
import numpy as np
import pandas as pd
### dtype is int64
arr = np.array([1, 1, 2, 3])
arr.dtype
dtype('int64')
In the following example, NumPy infers the arr elements as Python objects because of None
.
### dtype becomes object
arr = np.array([1, None, 2, 3])
arr
array([1, None, 2, 3], dtype=object)
Possible issues with such an interpretation include:
arr.sum() ### will generate a TypeError
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[279], line 1
----> 1 arr.sum() ### will generate a TypeError
File ~/workspace/dsm/.venv/lib/python3.13/site-packages/numpy/_core/_methods.py:51, in _sum(a, axis, dtype, out, keepdims, initial, where)
49 def _sum(a, axis=None, dtype=None, out=None, keepdims=False,
50 initial=_NoValue, where=True):
---> 51 return umr_sum(a, axis, dtype, out, keepdims, initial, where)
TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'
### set dtype to avoid the error
### the sum evaluates to np.nan
arr = np.array([1, None, 2, 3], dtype=float)
arr.sum()
np.float64(nan)
7.4.2. NaN: Missing Numerical Data#
In contrast, NaN
is a special IEEE 754 floating-point value, recognized across systems that follow the standard. Notice that NumPy inferred a native floating-point dtype for this array. Unlike the earlier object-dtype array, this enables fast, vectorized operations executed in compiled code.
### use NaN (np.nan) instead of None
### specify dtype
arr = np.array([1, np.nan, 3, 4], dtype=float)
arr
array([ 1., nan, 3., 4.])
### use sum, still get nan
np.sum(arr)
np.float64(nan)
### use nansum; it works
arr_sum = np.nansum(arr)
arr_sum
np.float64(8.0)
print(arr_sum)
8.0
A key limitation of NaN
, however, is that it’s defined only for floating-point numbers; there’s no native NaN sentinel for integers, strings, or other types.
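A small sketch of this limitation: an integer NumPy array simply cannot store NaN, so assigning it raises an error.

```python
import numpy as np

arr = np.array([1, 2, 3], dtype=np.int64)
try:
    arr[0] = np.nan  # NaN is a float; an int64 array cannot store it
except ValueError as err:
    print("ValueError:", err)
```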
7.4.3. NaN, None, and NA in Pandas#
Both NaN
and None
serve as missing-value markers, and Pandas is designed to treat them nearly interchangeably, automatically converting between them when appropriate. For example:
pd.Series( [ 1, np.nan, 2, None ] )
0 1.0
1 NaN
2 2.0
3 NaN
dtype: float64
When Pandas/NumPy need a single dtype that can hold all values in an array/Series, they “promote” to a wider/more general dtype.
For dtypes without a native missing-value sentinel, Pandas upcasts the array when a missing value appears. For example, assigning np.nan or None to an integer array promotes it to a floating-point dtype so the missing value can be represented.
A common ladder looks like:
bool => int => float => complex
For Pandas types:
int => float (if NaN needed), or
object/pd.NA
See the demonstration below for how Pandas handles the casting automatically.
### dtype is int
ser = pd.Series(range(3), dtype=int)
ser
0 0
1 1
2 2
dtype: int64
### update to None
### shown as NaN
### dtype is upcast to float
ser[0] = None
ser
0 NaN
1 1.0
2 2.0
dtype: float64
The reason to use pd.NA
is that implicit type casting can become an issue; for example, how do you represent a true integer array with missing data? Pandas adds nullable dtypes
(NA
) to address this. These dtypes have capitalized names (e.g., 'Int32' versus 'int32') and, for backward compatibility, are used only when explicitly requested. For example:
### requesting NA: dtype=Int32, not int32
pd.Series([1, np.nan, 2, None, pd.NA], dtype='Int32')
0 1
1 <NA>
2 2
3 <NA>
4 <NA>
dtype: Int32
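A note on pd.NA semantics (a minimal sketch): comparisons with pd.NA propagate the missing value, while Boolean operations follow three-valued logic.

```python
import pandas as pd

print(pd.NA == 1)      # <NA>: comparisons with pd.NA propagate missingness
print(pd.NA | True)    # True: Boolean ops use three-valued (Kleene) logic
print(pd.isna(pd.NA))  # True
```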
In summary, Pandas uses the following upcasting conventions when NA values are introduced.

| Type class | Conversion with NAs | NA sentinel value |
|---|---|---|
| floating | No change | np.nan |
| object | No change | None or np.nan |
| integer | Cast to float64 | np.nan |
| boolean | Cast to object | None or np.nan |
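These conventions can be checked directly by inserting None into a Series of each type class, mirroring the assignment demo above (a sketch; exact warning behavior may vary by Pandas version):

```python
import numpy as np
import pandas as pd

for data, label in [
    ([1.0, 2.0], 'floating'),
    (['a', 'b'], 'object'),
    ([1, 2], 'integer'),
    ([True, False], 'boolean'),
]:
    ser = pd.Series(data)
    ser[0] = None  # introduce a missing value
    print(f"{label}: {ser.dtype}")
# matches the table: float64, object, float64, object
```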
7.4.4. Null Value Operations#
Pandas provides convenient tools to detect and handle missing values so you can decide whether to:
remove incomplete rows/columns,
impute values, or
use domain-specific strategies.
These tools include:
isnull
: generates a Boolean mask indicating missing values
notnull
: the opposite of isnull
dropna
: returns a filtered version of the data
fillna
: returns a copy of the data with missing values filled or imputed
Note that isnull
and notnull
are aliases of isna()
and notna()
in Pandas.
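A quick check of the alias relationship:

```python
import numpy as np
import pandas as pd

ser = pd.Series([1, np.nan, 'hello', None])
# isnull()/notnull() produce exactly the same result as isna()/notna()
print(ser.isnull().equals(ser.isna()))    # True
print(ser.notnull().equals(ser.notna()))  # True
```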
7.4.4.1. Check Your Objects#
It’s good practice to inspect df.info()
, df.isna().sum()
, and the dtypes
before applying fixes, to avoid unintended type changes or biased results. Note that the null-checking tools are available both as top-level functions (e.g., pd.isnull(obj)) and as methods (e.g., obj.isnull()).
### build a DataFrame
import numpy as np
import pandas as pd
df = pd.DataFrame({
'A' : [ 1, 2, np.nan ],
'B' : [ 5, np.nan, np.nan],
'C' : [ 1, 2, 3]
})
7.4.4.1.1. df.info()#
### df.info() will show the non-null count
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 A 2 non-null float64
1 B 1 non-null float64
2 C 3 non-null int64
dtypes: float64(2), int64(1)
memory usage: 204.0 bytes
7.4.4.1.2. df.isna().sum()#
### super useful and clear
df_nan_sum = df.isna().sum()
print("Sum of NaN's:")
print(df_nan_sum)
Sum of NaN's:
A 1
B 2
C 0
dtype: int64
7.4.4.1.3. dtypes#
### always know your data types
df.dtypes
A float64
B float64
C int64
dtype: object
7.4.4.2. Detecting Null Values#
isnull()
notnull()
Let’s look at a Pandas Series.
ser = pd.Series([1, np.nan, 'hello', None])
ser
0 1
1 NaN
2 hello
3 None
dtype: object
### use as method
ser.isnull()
0 False
1 True
2 False
3 True
dtype: bool
### use as function
pd.isnull(ser)
0 False
1 True
2 False
3 True
dtype: bool
ser.notnull()
0 True
1 False
2 True
3 False
dtype: bool
pd.notnull(ser)
0 True
1 False
2 True
3 False
dtype: bool
### boolean masks as index in Series or DataFrame
ser[ser.notnull()]
0 1
2 hello
dtype: object
Now let’s look at a Pandas DataFrame.
df
A | B | C | |
---|---|---|---|
0 | 1.0 | 5.0 | 1 |
1 | 2.0 | NaN | 2 |
2 | NaN | NaN | 3 |
df.isnull()
A | B | C | |
---|---|---|---|
0 | False | False | False |
1 | False | True | False |
2 | True | True | False |
df.notnull()
A | B | C | |
---|---|---|---|
0 | True | True | True |
1 | True | False | True |
2 | False | False | True |
### boolean masks as index
df[ df.notnull() ]
A | B | C | |
---|---|---|---|
0 | 1.0 | 5.0 | 1 |
1 | 2.0 | NaN | 2 |
2 | NaN | NaN | 3 |
7.4.4.3. Dropping Null Values#
Beyond masking, Pandas provides two conveniences: dropna()
to remove missing entries and fillna()
to replace them. On a Series, their behavior is straightforward:
ser = pd.Series([1, np.nan, 'hello', None])
ser
0 1
1 NaN
2 hello
3 None
dtype: object
ser.dropna()
0 1
2 hello
dtype: object
In a DataFrame, we can only drop entire rows or columns.
df = pd.DataFrame(
[
[1, np.nan, 2],
[2, 3, 5],
[np.nan, 4, 6]
]
)
df
0 | 1 | 2 | |
---|---|---|---|
0 | 1.0 | NaN | 2 |
1 | 2.0 | 3.0 | 5 |
2 | NaN | 4.0 | 6 |
### dropping rows by default
df.dropna()
0 | 1 | 2 | |
---|---|---|---|
1 | 2.0 | 3.0 | 5 |
### dropping columns instead
# df.dropna(axis=1) ### the same as below
df.dropna(axis='columns')
2 | |
---|---|
0 | 2 |
1 | 5 |
2 | 6 |
### keep only rows with at least 2 non-null values
df.dropna(thresh=2)
0 | 1 | 2 | |
---|---|---|---|
0 | 1.0 | NaN | 2 |
1 | 2.0 | 3.0 | 5 |
2 | NaN | 4.0 | 6 |
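The thresh=2 call above keeps every row with at least two non-null values, which is why no rows were dropped here. dropna() also accepts a how parameter to control the dropping rule:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan, 2],
                   [2, 3, 5],
                   [np.nan, 4, 6]])

print(df.dropna(how='any'))  # drop rows containing any null (the default)
print(df.dropna(how='all'))  # drop only rows where every value is null
```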
7.4.4.4. Filling Null Values#
Instead of dropping missing values, you may want to substitute a valid value—either a constant (e.g., 0) or an estimate via imputation (e.g., mean) or interpolation (estimated values between observed points). While you could do this with a Boolean mask from isna()
/isnull()
, Pandas offers the dedicated fillna()
method, which returns a new object (or can operate in place) with nulls replaced.
import pandas as pd
import numpy as np
ser = pd.Series(
[1, np.nan, 3, None, 5],
index=list('abcde'),
dtype='Int32'
)
ser
a 1
b <NA>
c 3
d <NA>
e 5
dtype: Int32
We can fill NA entries with a single value such as 0:
# ser.fillna(0) ### fill NA entries with a single value, such as zero
ser.fillna(value=0)
a 1
b 0
c 3
d 0
e 5
dtype: Int32
### We can specify a forward fill to propagate the previous value forward:
ser.ffill()
a 1
b 1
c 3
d 3
e 5
dtype: Int32
### Or we can specify a backward fill to propagate the next values backward:
ser.bfill()
a 1
b 3
c 3
d 5
e 5
dtype: Int32
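interpolate() implements the interpolation strategy mentioned earlier, estimating each missing value from its observed neighbors (linear by default). A float Series is used in this sketch, since linear interpolation generally produces non-integer results:

```python
import numpy as np
import pandas as pd

ser = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])
print(ser.interpolate())  # linear by default: fills the gaps with 2.0 and 4.0
```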
For DataFrames, the options are similar: you can additionally specify the axis (rows or columns) along which the fill should be applied:
df = pd.DataFrame({
'A' : [ 1, 2, np.nan ],
'B' : [ 5, np.nan, np.nan],
'C' : [ 1, 2, 3]
})
df
A | B | C | |
---|---|---|---|
0 | 1.0 | 5.0 | 1 |
1 | 2.0 | NaN | 2 |
2 | NaN | NaN | 3 |
### fill a column (Series) with the mean of that column
df['A'].fillna(value=df['A'].mean())
0 1.0
1 2.0
2 1.5
Name: A, dtype: float64
print(df) ### show the original data
### ffill fills forward down each column (axis=0, default)
df.ffill()
A B C
0 1.0 5.0 1
1 2.0 NaN 2
2 NaN NaN 3
A | B | C | |
---|---|---|---|
0 | 1.0 | 5.0 | 1 |
1 | 2.0 | 5.0 | 2 |
2 | 2.0 | 5.0 | 3 |
print(df) ### show the original data
### bfill backward across each row (axis=1)
df.bfill(axis=1)
A B C
0 1.0 5.0 1
1 2.0 NaN 2
2 NaN NaN 3
A | B | C | |
---|---|---|---|
0 | 1.0 | 5.0 | 1.0 |
1 | 2.0 | 2.0 | 2.0 |
2 | 3.0 | 3.0 | 3.0 |
### fillna with the mean of each column (numeric only)
### Per-column mean (most common)
print(df) ### show the original data
# df.fillna(value=df.mean(numeric_only=True), inplace=True)
df.fillna(value=df.mean(numeric_only=True))
A B C
0 1.0 5.0 1
1 2.0 NaN 2
2 NaN NaN 3
A | B | C | |
---|---|---|---|
0 | 1.0 | 5.0 | 1 |
1 | 2.0 | 5.0 | 2 |
2 | 1.5 | 5.0 | 3 |
### Advanced; FYI only
### Per-row mean (fill each row’s NaNs with that row’s mean)
df.T.fillna(value=df.T.mean(numeric_only=True)).T
A | B | C | |
---|---|---|---|
0 | 1.0 | 5.0 | 1.0 |
1 | 2.0 | 2.0 | 2.0 |
2 | 3.0 | 3.0 | 3.0 |
### Advanced; FYI only
### Per-row mean using apply and a lambda function
df.apply(lambda row: row.fillna(row.mean()), axis=1)
A | B | C | |
---|---|---|---|
0 | 1.0 | 5.0 | 1.0 |
1 | 2.0 | 2.0 | 2.0 |
2 | 3.0 | 3.0 | 3.0 |