17.3. Confidence Intervals#

We have developed a method for estimating a parameter by using random sampling and the bootstrap. Our method produces an interval of estimates, to account for chance variability in the random sample. By providing an interval of estimates instead of just one estimate, we give ourselves some wiggle room.

In the previous example we saw that our process of estimation produced a good interval about 95% of the time, a “good” interval being one that contains the parameter. We say that we are 95% confident that the process results in a good interval. Our interval of estimates is called a 95% confidence interval for the parameter, and 95% is called the confidence level of the interval.

The method is called the bootstrap percentile method because the interval is formed by picking off two percentiles of the bootstrapped estimates.

The situation in the previous example was a bit unusual. Because we happened to know the value of the parameter, we were able to check whether an interval was good or a dud, and this in turn helped us to see that our process of estimation captured the parameter about 95 out of every 100 times we used it.

But usually, data scientists don’t know the value of the parameter. That is the reason they want to estimate it in the first place. In such situations, they provide an interval of estimates for the unknown parameter by using methods like the one we have developed. Because of statistical theory and demonstrations like the one we have seen, data scientists can be confident that their process of generating the interval results in a good interval a known percent of the time.

17.3.1. Estimating a Population Median#

We will now use the bootstrap method to estimate an unknown population median. You have encountered the dataset before. It comes from a sample of newborns in a large hospital system. we will treat it as if it were a simple random sample though the sampling was done in multiple stages. Stat Labs by Deborah Nolan and Terry Speed has details about a larger dataset from which this set is drawn.

The table births contains the following variables for mother-baby pairs: the baby’s birth weight in ounces, the number of gestational days (the number of days the mother was pregnant), the mother’s age in completed years, the mother’s height in inches, pregnancy weight in pounds, and whether or not the mother smoked during pregnancy.

births = pd.read_csv(path_data + 'baby.csv')
births
Birth Weight Gestational Days Maternal Age Maternal Height Maternal Pregnancy Weight Maternal Smoker
0 120 284 27 62 100 False
1 113 282 33 64 135 False
2 128 279 28 64 115 True
3 108 282 23 67 125 True
4 136 286 25 62 93 False
... ... ... ... ... ... ...
1169 113 275 27 60 100 False
1170 128 265 24 67 120 False
1171 130 291 30 65 150 True
1172 125 281 21 65 110 False
1173 117 297 38 65 129 False

1174 rows × 6 columns

Birth weight is an important factor in the health of a newborn infant. Smaller babies tend to need more medical care in their first days than larger newborns. It is therefore helpful to have an estimate of birth weight before the baby is born. One way to do this is to examine the relationship between birth weight and the number of gestational days.

A simple measure of this relationship is the ratio of birth weight to the number of gestational days. The table ratios contains the first two columns of baby, as well as a column of the ratios. The first entry in that column was calculated as follows:

\[ \frac{120~\mbox{ounces}}{284~\mbox{days}} ~\approx ~ 0.4225~ \mbox{ounces per day} \]
ratios = births[['Birth Weight', 'Gestational Days']].copy()

ratios['Ratio BW:GD'] = births['Birth Weight']/births['Gestational Days']
ratios
Birth Weight Gestational Days Ratio BW:GD
0 120 284 0.422535
1 113 282 0.400709
2 128 279 0.458781
3 108 282 0.382979
4 136 286 0.475524
... ... ... ...
1169 113 275 0.410909
1170 128 265 0.483019
1171 130 291 0.446735
1172 125 281 0.444840
1173 117 297 0.393939

1174 rows × 3 columns

print(ratios["Ratio BW:GD"].min())
print(ratios["Ratio BW:GD"].max())
0.23673469387755103
0.7837837837837838

Here is a histogram of the ratios.

fig, ax = plt.subplots(1, 2, figsize = (12, 4))

ax[0].hist(ratios['Ratio BW:GD'], ec='white', density=True)

# xbins = np.arange(0.2, 0.81, 0.1)              ### this is to show bins config
# xbins = np.arange(0.2, 0.81, 0.05)           ### note the width of the bin
xbins = np.arange(0.2, 0.81, 0.01)           ### note the width of the bin
ax[1].hist(ratios['Ratio BW:GD'], ec='white', bins=xbins, density=True)  ### also note ethe subtle difference between ax[0] and ax[1]

ax[0].tick_params(axis='both', labelsize=10)
ax[1].tick_params(axis='both', labelsize=10)
y_vals = ax[0].get_yticks()
ax[0].set_yticklabels(['{:g}'.format(x * 100) for x in y_vals])
ax[1].set_yticklabels(['{:g}'.format(x * 100) for x in y_vals])
ax[0].set_xlabel('Ratio BW:GD', fontsize=12)
ax[0].set_ylabel('Percent per unit', fontsize=12)
/var/folders/g4/v24tl8t172g5d7rzsd63y51w0000gp/T/ipykernel_89968/1887246707.py:13: UserWarning: set_ticklabels() should only be used with a fixed number of ticks, i.e. after set_ticks() or using a FixedLocator.
  ax[0].set_yticklabels(['{:g}'.format(x * 100) for x in y_vals])
/var/folders/g4/v24tl8t172g5d7rzsd63y51w0000gp/T/ipykernel_89968/1887246707.py:14: UserWarning: set_ticklabels() should only be used with a fixed number of ticks, i.e. after set_ticks() or using a FixedLocator.
  ax[1].set_yticklabels(['{:g}'.format(x * 100) for x in y_vals])
Text(0, 0.5, 'Percent per unit')
../../_images/538223e67107b9f37325ad564709cfa4e3a4038b0c205dcc3e0a4b7657328513.png

At first glance the histogram looks quite symmetric, with the density at its maximum over the interval 0.4 ounces per day to 0.45 ounces per day. But a closer look reveals that some of the ratios were quite large by comparison. The maximum value of the ratios was just over 0.78 ounces per day, almost double the typical value.

ratios_1 = ratios.sort_values(by=['Ratio BW:GD'], ascending=False)

ratios_1.iloc[[0]]
Birth Weight Gestational Days Ratio BW:GD
238 116 148 0.783784

The median gives a sense of the typical ratio because it is unaffected by the very large or very small ratios. The median ratio in the sample is about 0.429 ounces per day.

np.median(ratios.iloc[:, 2]).item()
0.42907801418439717

But what was the median in the population? We don’t know, so we will estimate it.

Our method will be exactly the same as in the previous section. We will bootstrap the sample 5,000 times resulting in 5,000 estimates of the median. Our 95% confidence interval will be the “middle 95%” of all of our estimates.

17.3.1.1. Constructing a Bootstrap Confidence Interval#

We will start by defining a function one_bootstrap_median. It will bootstrap the sample and return one the median ratio in the bootstrapped sample.

def one_bootstrap_median():
    resample = ratios.sample(len(ratios), replace=True)
    return np.percentile(resample['Ratio BW:GD'], 50 ).item()
    # return np.median(resample['Ratio BW:GD']).item()       ### the same as the line above

Run the cell below to see how the bootstrapped ratios vary. Remember that each of them is an estimate of the unknown ratio in the population.

one_bootstrap_median()
0.43016786903440624

Now we can use a for loop to generate 5000 bootstrapped medians.

### generate medians from 5000 bootstrap samples

num_repetitions = 5000
bstrap_medians = np.array([])
for i in np.arange(num_repetitions):
    bstrap_medians = np.append(bstrap_medians, one_bootstrap_median())
### get the endpoints of the 95% confidence interval

left = np.percentile(bstrap_medians, 2.5, method='nearest')
right = np.percentile(bstrap_medians, 97.5, method='nearest')

np.array([left, right])
array([0.42545455, 0.43257503])

The 95% confidence interval goes from about 0.425 ounces per day to about 0.433 ounces per day. We estimate that the median “birth weight to gestational days” ratio in the population is somewhere in the interval of 0.425 ounces per day to 0.433 ounces per day.

The estimate of 0.429 based on the original sample happens to be halfway between the two ends of the interval, though that need not be true in general.

To visualize our results, let us draw the empirical histogram of our bootstrapped medians and place the confidence interval on the horizontal axis.

resampled_medians = pd.DataFrame({
    'Bootstrap Sample Median': bstrap_medians
})

### use pandas df.hist will give you the same thing, less the details ###
# resampled_medians['Bootstrap Sample Median'].hist(bins=15, ec='white', figsize=(5, 4), density=True)
# plt.plot([left, right], [0,0], color='yellow', lw=8)

fig, ax = plt.subplots(figsize = (5, 4))
ax.hist(resampled_medians, bins=15, density=True, ec='white')
plt.plot([left, right], [0,0], color='yellow', lw=8)

plt.xticks(rotation=90, fontsize=10)
plt.yticks(fontsize=10)
y_vals = ax.get_yticks()
ax.set_yticklabels(['{:g}'.format(x * 100) for x in y_vals])
plt.xlabel('Bootstrap Sample Median', fontsize=12)
plt.ylabel('Percent per unit', fontsize=12)
/var/folders/g4/v24tl8t172g5d7rzsd63y51w0000gp/T/ipykernel_89968/781575943.py:16: UserWarning: set_ticklabels() should only be used with a fixed number of ticks, i.e. after set_ticks() or using a FixedLocator.
  ax.set_yticklabels(['{:g}'.format(x * 100) for x in y_vals])
Text(0, 0.5, 'Percent per unit')
../../_images/c824d91df525dbe03ca5eaa141d0d1e65f6e9b74444b496c35d767afac11edac.png

This histogram and interval resemble those we drew in the previous section, with one big difference – there is no green dot showing where the parameter is. We don’t know where that dot should be, or whether it is even in the interval.

We just have an interval of estimates. It is a 95% confidence interval of estimates, because the process that generates it produces a good interval about 95% of the time. That certainly beats guessing the ratio at random!

Keep in mind that this interval is an approximate 95% confidence interval. There are many approximations involved in its computation. The approximation is not bad, but it is not exact.

17.3.2. Estimating a Population Mean#

What we have done for medians can also be applied to averages. Suppose we want to estimate the average age of the mothers in the population. A natural estimate is the average age of the mothers in the sample. Here is the distribution of their ages, along with their average age, which was approximately 27.2 years.

fig, ax = plt.subplots(figsize = (5, 4))

ax.hist(births['Maternal Age'], density=True, ec='white')

plt.xticks(rotation=90, fontsize=10)
plt.yticks(fontsize=10)
y_vals = ax.get_yticks()
ax.set_yticklabels(['{:g}'.format(x * 100) for x in y_vals])   ### density sets tick to probability
ax.set_ylabel('Percent per unit', fontsize=11)
ax.set_xlabel('Maternal Age', fontsize=11)
/var/folders/g4/v24tl8t172g5d7rzsd63y51w0000gp/T/ipykernel_89968/3047416111.py:8: UserWarning: set_ticklabels() should only be used with a fixed number of ticks, i.e. after set_ticks() or using a FixedLocator.
  ax.set_yticklabels(['{:g}'.format(x * 100) for x in y_vals])   ### density sets tick to probability
Text(0.5, 0, 'Maternal Age')
../../_images/41b50ac8c835f1fd93cdd4ced9332fc0d83573331b73e688f8c73088c2207319.png
np.average(births['Maternal Age']).item()
27.228279386712096

What was the average age of the mothers in the population? We don’t know the value of this parameter.

Let’s estimate the unknown parameter by the bootstrap method. To do this, we will adapt the code for bootstrap_median to instead define the function bootstrap_mean. The code is the same except that the statistics are means (that is, averages) instead of medians, and are collected in an array called bstrap_means instead of bstrap_medians.

def one_bootstrap_mean():
    resample = births.sample(len(births), replace=True)
    return np.average(resample['Maternal Age'])
### generate means from 5000 bootstrap samples

num_repetitions = 5000
bstrap_means = np.array([])
for i in np.arange(num_repetitions):
    bstrap_means = np.append(bstrap_means, one_bootstrap_mean())
### get the endpoints of the 95% confidence interval

left = np.percentile(bstrap_means, 2.5, method='nearest').round(2)
right = np.percentile(bstrap_means, 97.5, method='nearest').round(2)

np.array([left, right])
array([26.9 , 27.56])

The 95% confidence interval spans approximately 26.9 to 27.6 years. That is, we estimate that the average age of mothers in the population is somewhere in the interval of 26.9 to 27.6 years.

Notice how close the two ends are to the average of about 27.2 years in the original sample. The sample size is very large – 1,174 mothers – and so the sample averages don’t vary much. We will explore this observation further in the next chapter.

The empirical histogram of the 5,000 bootstrapped mean ages is shown below, along with the 95% confidence interval for the population mean age.

resampled_means = pd.DataFrame({
    'Bootstrap Sample Mean': bstrap_means
})

fig, ax = plt.subplots(figsize = (5, 3))
ax.hist(resampled_means['Bootstrap Sample Mean'], bins=15, density=True, ec='white')
ax.plot(np.array([left, right]), np.array([0,0]), color='yellow', lw=8, zorder=10)

plt.xticks(rotation=45, fontsize=10)
plt.yticks(fontsize=10)
y_vals = ax.get_yticks()
ax.set_yticklabels(['{:g}'.format(x * 100) for x in y_vals])   ### density sets tick to probability
ax.set_ylabel('Percent per unit', fontsize=12)
ax.set_xlabel('Bootstrap Sample Median', fontsize=12)
/var/folders/g4/v24tl8t172g5d7rzsd63y51w0000gp/T/ipykernel_89968/1386445648.py:12: UserWarning: set_ticklabels() should only be used with a fixed number of ticks, i.e. after set_ticks() or using a FixedLocator.
  ax.set_yticklabels(['{:g}'.format(x * 100) for x in y_vals])   ### density sets tick to probability
Text(0.5, 0, 'Bootstrap Sample Median')
../../_images/49042f9ce2e8515306f8d6a358bbefc43ad054d4cd5a35aad809ccb92cd6a2d5.png

Once again, the average of the original sample (27.23 years) is close to the center of the interval. That’s not very surprising, because each bootstrapped sample is drawn from that same original sample. The averages of the bootstrapped samples are about symmetrically distributed on either side of the average of the sample from which they were drawn.

Notice also that the empirical histogram of the resampled means has roughly a symmetric bell shape, even though the histogram of the sampled ages was not symmetric at all as seen below:

fig, ax = plt.subplots(figsize = (5, 4))

ax.hist(births['Maternal Age'], bins=15, density=True, ec='white')

plt.xticks(rotation=45, fontsize=10)
plt.yticks(fontsize=10)
y_vals = ax.get_yticks()
ax.set_yticklabels(['{:g}'.format(x * 100) for x in y_vals])   ### density sets tick to probability
ax.set_ylabel('Percent per unit', fontsize=12)
ax.set_xlabel('Maternal Age', fontsize=12)
/var/folders/g4/v24tl8t172g5d7rzsd63y51w0000gp/T/ipykernel_89968/347770336.py:8: UserWarning: set_ticklabels() should only be used with a fixed number of ticks, i.e. after set_ticks() or using a FixedLocator.
  ax.set_yticklabels(['{:g}'.format(x * 100) for x in y_vals])   ### density sets tick to probability
Text(0.5, 0, 'Maternal Age')
../../_images/7b6eacb85b3484105c71f57bf812d97cf752439f76056c1c3769c3605167c2ca.png

This is a consequence of the Central Limit Theorem (CLT) of probability and statistics. CLT says that if you repeatedly take random samples from any population with any distribution (skewed, uniform, binomial, real-world messy data), and compute the sample mean for each sample, then: The distribution of those sample means will approach a normal (bell-shaped) distribution as the sample size n becomes large (typically n ≥ 30 is enough).

17.3.3. An 80% Confidence Interval#

You can use the bootstrapped sample means to construct an interval of any level of confidence. For example, to construct an 80% confidence interval for the mean age in the population, you would take the “middle 80%” of the resampled means. So you would want 10% of the distribution in each of the two tails, and hence the endpoints would be the 10th and 90th percentiles of the resampled means.

left_80 = np.percentile(bstrap_means, 10, interpolation='nearest')
right_80 = np.percentile(bstrap_means, 90, interpolation='nearest')
np.array([left_80, right_80])
array([27.00851789, 27.43781942])
# resampled_means.hist(bins=15)
# plt.plot([left_80, right_80], [0, 0], color='yellow', lw=8);

fig, ax = plt.subplots(figsize = (5, 3))
ax.hist(resampled_means, bins=15, density=True, ec='white')
ax.plot(np.array([left_80, right_80]), np.array([0,0]), color='yellow', lw=8, zorder=10)

plt.xticks(rotation=45, fontsize=10)
plt.yticks(fontsize=10)
y_vals = ax.get_yticks()
ax.set_yticklabels(['{:g}'.format(x * 100) for x in y_vals])   ### density sets tick to probability
ax.set_ylabel('Percent per unit', fontsize=12)
ax.set_xlabel('Bootstrap Sample Mean', fontsize=12)
/var/folders/g4/v24tl8t172g5d7rzsd63y51w0000gp/T/ipykernel_89968/3946457152.py:11: UserWarning: set_ticklabels() should only be used with a fixed number of ticks, i.e. after set_ticks() or using a FixedLocator.
  ax.set_yticklabels(['{:g}'.format(x * 100) for x in y_vals])   ### density sets tick to probability
Text(0.5, 0, 'Bootstrap Sample Mean')
../../_images/a5ed0d61d0f64816c5fa644c440d8f1c3c2d7671160541932161c526a0413ab5.png

This 80% confidence interval is much shorter than the 95% confidence interval. It only goes from about 27.0 years to about 27.4 years. While that’s a tight set of estimates, you know that this process only produces a good interval about 80% of the time.

The earlier process produced a wider interval, but we had more confidence in the process that generated it.

To get a narrow confidence interval at a high level of confidence, you’ll have to start with a larger sample. We’ll see why in the next chapter.

17.3.4. Estimating a Population Proportion#

In the sample, 39% of the mothers smoked during pregnancy.

len(births[births['Maternal Smoker'] == True]) / len(births)
0.3909710391822828

Remember that a proportion is an average of zeros and ones. So the proportion of mothers who smoked could also be calculated using array operations as follows.

smoking = births['Maternal Smoker']
(np.count_nonzero(smoking) / len(smoking)).item()
0.3909710391822828

What percent of mothers in the population smoked during pregnancy? This is an unknown parameter which we can estimate by a bootstrap confidence interval. The steps are analogous to those we took to estimate the population median and mean.

In a process that is now familiar, will start by defining a function one_bootstrap_proportion that bootstraps the sample and returns the proportion of smokers in the bootstrapped sample. Then we will call the function multiple times using a for loop, and get the 2.5th perentile and 97.5th percentiles of the bootstrapped proportions.

def one_bootstrap_proportion():
    resample = births.sample(len(births), replace=True)
    smoking = resample['Maternal Smoker']
    return np.count_nonzero(smoking) / len(smoking)
### generate proportions from 5000 bootstrap samples

bstrap_proportions = np.array([])
num_repetitions = 5000
for i in np.arange(num_repetitions):
    bstrap_proportions = np.append(bstrap_proportions, one_bootstrap_proportion())
### get the endpoints of the 95% confidence interval

left = np.percentile(bstrap_proportions, 2.5, method='nearest')
right = np.percentile(bstrap_proportions, 97.5, method='nearest')

np.array([left, right])
array([0.3637138 , 0.41908007])

The confidence interval goes from about 36% to about 42%.

resampled_proportions = pd.DataFrame({
    'Bootstrap Sample Proportion': bstrap_proportions
})

# resampled_proportions.hist(bins=15)

fig, ax = plt.subplots(figsize = (5, 4))
ax.hist(resampled_proportions, bins=15, density=True, ec='white')
ax.plot([left, right], [0, 0], color='yellow', lw=8)

plt.xticks(rotation=90, fontsize=10)
plt.yticks(fontsize=10)
y_vals = ax.get_yticks()
ax.set_yticklabels(['{:g}'.format(x * 100) for x in y_vals])   ### density sets tick to probability
ax.set_ylabel('Percent per unit', fontsize=11)
ax.set_xlabel('Bootstrap Sample Mean', fontsize=11)
/var/folders/g4/v24tl8t172g5d7rzsd63y51w0000gp/T/ipykernel_89968/553625568.py:14: UserWarning: set_ticklabels() should only be used with a fixed number of ticks, i.e. after set_ticks() or using a FixedLocator.
  ax.set_yticklabels(['{:g}'.format(x * 100) for x in y_vals])   ### density sets tick to probability
Text(0.5, 0, 'Bootstrap Sample Mean')
../../_images/edbf214e4b320e331bd66867918e3ef95a979f49c192545f9cb5a7191fb380cd.png

17.3.5. Care in Using the Bootstrap Percentile Method#

The bootstrap is an elegant and powerful method. Before using it, it is important to keep some points in mind.

  • Start with a large random sample. If you don’t, the method might not work. Its success is based on large random samples (and hence also resamples from the sample) resembling the population. The Law of Averages says that this is likely to be true provided the random sample is large.

  • To approximate the probability distribution of a statistic, it is a good idea to replicate the resampling procedure as many times as possible. A few thousand replications will result in decent approximations to the distribution of sample median, especially if the distribution of the population has one peak and is fairly symmetric. We used 5,000 replications in our examples but would recommend 10,000 in general.

  • The bootstrap percentile method works well for estimating the population median or mean based on a large random sample. However, it has limitations, as do all methods of estimation. For example, it is not expected to do well in the following situations.

    • The goal is to estimate the minimum or maximum value in the population, or a very low or very high percentile, or parameters that are greatly influenced by rare elements of the population.

    • The probability distribution of the statistic is not roughly bell shaped.

    • The original sample is very small, say less than 10 or 15.