7.4. Missing Data#
Missing data is common in real-world datasets and can affect analysis, aggregations, and model training. By default, Pandas treats missing values with a sentinel-based scheme. In computer programming, a sentinel value (also referred to as a flag value or dummy value) is a special value reserved to signal a condition, here, that an entry is missing.
Missing data is typically represented as NaN
(Not a Number), available as np.nan
in NumPy, a special floating-point value defined in the IEEE 754 standard. Pandas also introduces its own NA
scalar (pd.NA
), designed to work better with non-numeric data (like strings or booleans) and to allow consistent missing-data handling across types.
In summary, for missing values in Pandas, we have:
Sentinel missing-value markers:
None
: a Python built-in constant with object dtype; Pandas also treats None
as a missing value.
NaN
(np.nan
): marks missing numerical data in float dtypes.
pd.NA
: represents missing values across all nullable dtypes (preserving the column’s dtype instead of forcing a cast to float).
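As a quick, minimal check, pd.isna() recognizes all three markers as missing:

```python
import numpy as np
import pandas as pd

# pd.isna recognizes every missing-value marker
print(pd.isna(None))    # True
print(pd.isna(np.nan))  # True
print(pd.isna(pd.NA))   # True
```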
7.4.1. None as a Sentinel Value#
For some data types, Pandas uses None
as a sentinel value. However, because None
is a Python object, Pandas does not use None
as a sentinel for numerical arrays. Pay attention to the dtypes in the following operations.
import numpy as np
import pandas as pd
### dtype is int64
arr = np.array([1, 1, 2, 3])
arr.dtype
dtype('int64')
In the following example, NumPy infers the arr elements as Python objects because of None
.
### dtype becomes object
arr = np.array([1, None, 2, 3])
arr
array([1, None, 2, 3], dtype=object)
Possible issues with such an interpretation include:
arr.sum() ### will generate a TypeError
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[279], line 1
----> 1 arr.sum() ### will generate a TypeError
File ~/workspace/dsm/.venv/lib/python3.13/site-packages/numpy/_core/_methods.py:51, in _sum(a, axis, dtype, out, keepdims, initial, where)
49 def _sum(a, axis=None, dtype=None, out=None, keepdims=False,
50 initial=_NoValue, where=True):
---> 51 return umr_sum(a, axis, dtype, out, keepdims, initial, where)
TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'
### set dtype to avoid the error
### the sum evaluates to np.nan
arr = np.array([1, None, 2, 3], dtype=float)
arr.sum()
np.float64(nan)
7.4.2. NaN: Missing Numerical Data#
In contrast, NaN
is a special IEEE 754 floating-point value, recognized across systems that follow the standard. Notice that NumPy inferred a native floating-point dtype for this array. Unlike the earlier object-dtype array, this enables fast, vectorized operations executed in compiled code.
### use NaN (np.nan) instead of None
### specify dtype
arr = np.array([1, np.nan, 3, 4], dtype=float)
arr
array([ 1., nan, 3., 4.])
### use sum, still get nan
np.sum(arr)
np.float64(nan)
### use nansum; it works
arr_sum = np.nansum(arr)
arr_sum
np.float64(8.0)
print(arr_sum)
8.0
A key limitation of NaN
, however, is that it’s defined only for floating-point numbers; there’s no native NaN sentinel for integers, strings, or other types.
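A small sketch of this limitation: an integer NumPy array simply cannot store NaN, so assigning it raises an error.

```python
import numpy as np

arr = np.array([1, 2, 3], dtype=np.int64)
try:
    arr[0] = np.nan  # NaN is a float; an int64 array cannot store it
except ValueError as err:
    print("ValueError:", err)
```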
7.4.3. NaN, None, and NA in Pandas#
Both NaN
and None
serve as missing-value markers, and Pandas is designed to treat them nearly interchangeably, automatically converting between them when appropriate. For example:
pd.Series( [ 1, np.nan, 2, None ] )
0 1.0
1 NaN
2 2.0
3 NaN
dtype: float64
When Pandas/NumPy need a single dtype that can hold all values in an array/Series, they “promote” to a wider/more general dtype.
For dtypes without a native missing-value sentinel, Pandas upcasts the array when a missing value appears. For example, assigning np.nan or None to an integer array promotes it to a floating-point dtype so the missing value can be represented.
A common ladder looks like:
bool => int => float => complex
For Pandas types:
int => float (if NaN needed), or
object/pd.NA
See the demonstration below for how Pandas handles the casting automatically.
### dtype is int
ser = pd.Series(range(3), dtype=int)
ser
0 0
1 1
2 2
dtype: int64
### update to None
### shown as NaN
### dtype is upcast to float
ser[0] = None
ser
0 NaN
1 1.0
2 2.0
dtype: float64
The reason to use pd.NA
is that implicit type casting can become an issue; for example, how do you represent a true integer array with missing data? Pandas adds nullable dtypes
(NA
) to address this. These dtypes have capitalized names (e.g., 'Int32' versus 'int32') and, for backward compatibility, are used only when explicitly requested. For example:
### requesting NA: dtype=Int32, not int32
pd.Series([1, np.nan, 2, None, pd.NA], dtype='Int32')
0 1
1 <NA>
2 2
3 <NA>
4 <NA>
dtype: Int32
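A note on pd.NA semantics (a minimal sketch): comparisons with pd.NA propagate the missing value, while Boolean operations follow three-valued logic.

```python
import pandas as pd

print(pd.NA == 1)      # <NA>: comparisons with pd.NA propagate missingness
print(pd.NA | True)    # True: Boolean ops use three-valued (Kleene) logic
print(pd.isna(pd.NA))  # True
```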
In summary, Pandas uses the following upcasting conventions when NA values are introduced.

| Type class | Conversion with NAs | NA sentinel value |
|---|---|---|
| floating | No change | np.nan |
| object | No change | None or np.nan |
| integer | Cast to float64 | np.nan |
| boolean | Cast to object | None or np.nan |
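These conventions can be checked directly by inserting None into a Series of each type class, mirroring the assignment demo above (a sketch; exact warning behavior may vary by Pandas version):

```python
import numpy as np
import pandas as pd

for data, label in [
    ([1.0, 2.0], 'floating'),
    (['a', 'b'], 'object'),
    ([1, 2], 'integer'),
    ([True, False], 'boolean'),
]:
    ser = pd.Series(data)
    ser[0] = None  # introduce a missing value
    print(f"{label}: {ser.dtype}")
# matches the table: float64, object, float64, object
```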
7.4.4. Null Value Operations#
Pandas provides convenient tools to detect and handle missing values so you can decide whether to:
remove incomplete rows/columns,
impute values, or
use domain-specific strategies.
These tools include:
isnull
: generates a Boolean mask indicating missing values
notnull
: the opposite of isnull
dropna
: returns a filtered version of the data
fillna
: returns a copy of the data with missing values filled or imputed
Note that isnull
and notnull
are aliases of isna()
and notna()
in Pandas.
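A quick check of the alias relationship:

```python
import numpy as np
import pandas as pd

ser = pd.Series([1, np.nan, 'hello', None])
# isnull()/notnull() produce exactly the same result as isna()/notna()
print(ser.isnull().equals(ser.isna()))    # True
print(ser.notnull().equals(ser.notna()))  # True
```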
7.4.4.1. Check Your Objects#
It’s good practice to inspect df.info()
, df.isna().sum()
, and the dtypes
before applying fixes, to avoid unintended type changes or biased results. Note that the null-checking tools are available both as top-level functions (e.g., pd.isnull(obj)) and as methods (e.g., obj.isnull()).
### build a DataFrame
import numpy as np
import pandas as pd
df = pd.DataFrame({
'A' : [ 1, 2, np.nan ],
'B' : [ 5, np.nan, np.nan],
'C' : [ 1, 2, 3]
})
7.4.4.1.1. df.info()#
### df.info() will show the non-null count
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 A 2 non-null float64
1 B 1 non-null float64
2 C 3 non-null int64
dtypes: float64(2), int64(1)
memory usage: 204.0 bytes
7.4.4.1.2. df.isna().sum()#
### super useful and clear
df_nan_sum = df.isna().sum()
print("Sum of NaN's:")
print(df_nan_sum)
Sum of NaN's:
A 1
B 2
C 0
dtype: int64
7.4.4.1.3. dtypes#
### always know your data types
df.dtypes
A float64
B float64
C int64
dtype: object
7.4.4.2. Detecting Null Values#
isnull()
notnull()
Let’s look at a Pandas Series.
ser = pd.Series([1, np.nan, 'hello', None])
ser
0 1
1 NaN
2 hello
3 None
dtype: object
### use as method
ser.isnull()
0 False
1 True
2 False
3 True
dtype: bool
### use as function
pd.isnull(ser)
0 False
1 True
2 False
3 True
dtype: bool
ser.notnull()
0 True
1 False
2 True
3 False
dtype: bool
pd.notnull(ser)
0 True
1 False
2 True
3 False
dtype: bool
### boolean masks as index in Series or DataFrame
ser[ser.notnull()]
0 1
2 hello
dtype: object
Now let’s look at a Pandas DataFrame.
df
A | B | C | |
---|---|---|---|
0 | 1.0 | 5.0 | 1 |
1 | 2.0 | NaN | 2 |
2 | NaN | NaN | 3 |
df.isnull()
A | B | C | |
---|---|---|---|
0 | False | False | False |
1 | False | True | False |
2 | True | True | False |
df.notnull()
A | B | C | |
---|---|---|---|
0 | True | True | True |
1 | True | False | True |
2 | False | False | True |
### boolean masks as index
df[ df.notnull() ]
A | B | C | |
---|---|---|---|
0 | 1.0 | 5.0 | 1 |
1 | 2.0 | NaN | 2 |
2 | NaN | NaN | 3 |
7.4.4.3. Dropping Null Values#
Beyond masking, Pandas provides two conveniences: dropna()
to remove missing entries and fillna()
to replace them. On a Series, their behavior is straightforward:
ser = pd.Series([1, np.nan, 'hello', None])
ser
0 1
1 NaN
2 hello
3 None
dtype: object
ser.dropna()
0 1
2 hello
dtype: object
In a DataFrame, we can only drop entire rows or columns.
df = pd.DataFrame(
[
[1, np.nan, 2],
[2, 3, 5],
[np.nan, 4, 6]
]
)
df
0 | 1 | 2 | |
---|---|---|---|
0 | 1.0 | NaN | 2 |
1 | 2.0 | 3.0 | 5 |
2 | NaN | 4.0 | 6 |
### dropping rows by default
df.dropna()
0 | 1 | 2 | |
---|---|---|---|
1 | 2.0 | 3.0 | 5 |
### dropping columns instead
# df.dropna(axis=1) ### the same as below
df.dropna(axis='columns')
2 | |
---|---|
0 | 2 |
1 | 5 |
2 | 6 |
### keep only rows with at least 2 non-null values
df.dropna(thresh=2)
0 | 1 | 2 | |
---|---|---|---|
0 | 1.0 | NaN | 2 |
1 | 2.0 | 3.0 | 5 |
2 | NaN | 4.0 | 6 |
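The thresh=2 call above keeps every row with at least two non-null values, which is why no rows were dropped here. dropna() also accepts a how parameter to control the dropping rule:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, np.nan, 2],
                   [2, 3, 5],
                   [np.nan, 4, 6]])

print(df.dropna(how='any'))  # drop rows containing any null (the default)
print(df.dropna(how='all'))  # drop only rows where every value is null
```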
7.4.4.4. Filling Null Values#
Instead of dropping missing values, you may want to substitute a valid value—either a constant (e.g., 0) or an estimate via imputation (e.g., mean) or interpolation (estimated values between observed points). While you could do this with a Boolean mask from isna()
/isnull()
, Pandas offers the dedicated fillna()
method, which returns a new object (or can operate in place) with nulls replaced.
import pandas as pd
import numpy as np
ser = pd.Series(
[1, np.nan, 3, None, 5],
index=list('abcde'),
dtype='Int32'
)
ser
a 1
b <NA>
c 3
d <NA>
e 5
dtype: Int32
We can fill NA entries with a single value such as 0:
# ser.fillna(0) ### fill NA entries with a single value, such as zero
ser.fillna(value=0)
a 1
b 0
c 3
d 0
e 5
dtype: Int32
### We can specify a forward fill to propagate the previous value forward:
ser.ffill()
a 1
b 1
c 3
d 3
e 5
dtype: Int32
### Or we can specify a backward fill to propagate the next values backward:
ser.bfill()
a 1
b 3
c 3
d 5
e 5
dtype: Int32
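interpolate() implements the interpolation strategy mentioned earlier, estimating each missing value from its observed neighbors (linear by default). A float Series is used in this sketch, since linear interpolation generally produces non-integer results:

```python
import numpy as np
import pandas as pd

ser = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])
print(ser.interpolate())  # linear by default: fills the gaps with 2.0 and 4.0
```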
For DataFrames, the options are similar: you can additionally specify the axis (rows or columns) along which the fill should be applied:
df = pd.DataFrame({
'A' : [ 1, 2, np.nan ],
'B' : [ 5, np.nan, np.nan],
'C' : [ 1, 2, 3]
})
df
A | B | C | |
---|---|---|---|
0 | 1.0 | 5.0 | 1 |
1 | 2.0 | NaN | 2 |
2 | NaN | NaN | 3 |
### fill a column (Series) with the mean of that column
df['A'].fillna(value=df['A'].mean())
0 1.0
1 2.0
2 1.5
Name: A, dtype: float64
print(df) ### show the original data
### ffill fills forward down each column (axis=0, default)
df.ffill()
A B C
0 1.0 5.0 1
1 2.0 NaN 2
2 NaN NaN 3
A | B | C | |
---|---|---|---|
0 | 1.0 | 5.0 | 1 |
1 | 2.0 | 5.0 | 2 |
2 | 2.0 | 5.0 | 3 |
print(df) ### show the original data
### bfill backward across each row (axis=1)
df.bfill(axis=1)
A B C
0 1.0 5.0 1
1 2.0 NaN 2
2 NaN NaN 3
A | B | C | |
---|---|---|---|
0 | 1.0 | 5.0 | 1.0 |
1 | 2.0 | 2.0 | 2.0 |
2 | 3.0 | 3.0 | 3.0 |
### fillna with the mean of each column (numeric only)
### Per-column mean (most common)
print(df) ### show the original data
# df.fillna(value=df.mean(numeric_only=True), inplace=True)
df.fillna(value=df.mean(numeric_only=True))
A B C
0 1.0 5.0 1
1 2.0 NaN 2
2 NaN NaN 3
A | B | C | |
---|---|---|---|
0 | 1.0 | 5.0 | 1 |
1 | 2.0 | 5.0 | 2 |
2 | 1.5 | 5.0 | 3 |
### Advanced; FYI only
### Per-row mean (fill each row’s NaNs with that row’s mean)
df.T.fillna(value=df.T.mean(numeric_only=True)).T
A | B | C | |
---|---|---|---|
0 | 1.0 | 5.0 | 1.0 |
1 | 2.0 | 2.0 | 2.0 |
2 | 3.0 | 3.0 | 3.0 |
### Advanced; FYI only
### Per-row mean using apply and a lambda function
df.apply(lambda row: row.fillna(row.mean()), axis=1)
A | B | C | |
---|---|---|---|
0 | 1.0 | 5.0 | 1.0 |
1 | 2.0 | 2.0 | 2.0 |
2 | 3.0 | 3.0 | 3.0 |