17. Estimation

In the previous chapter, we began to develop ways of inferential thinking. In particular, we learned how to use data to decide between two hypotheses about the world. But often we just want to know how big something is.

For example, during an election year, we may want to know the percentage of voters who favor a particular candidate. To assess the current state of the economy, we might be interested in the median annual income of households in the United States.

In this chapter, we will develop a way to estimate an unknown parameter. Remember that a parameter is a numerical value associated with a population.

To figure out the value of a parameter, we need data. If we have the relevant data for the entire population, we can simply calculate the parameter.

However, if the population is very large – for example, if it comprises all the households in the United States – then gathering data from the entire population might be too expensive and time-consuming. In such situations, data scientists rely on random sampling from the population.

This raises a question of inference: How can justifiable conclusions be drawn about the unknown parameter based on the data from the random sample? We will answer this question by using inferential thinking.

A statistic based on a random sample can be a reasonable estimate of an unknown parameter in the population. For example, you might want to use the median annual income of sampled households as an estimate of the median annual income of all households in the U.S.
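As a minimal sketch of this idea, the Python code below builds a synthetic population of household incomes, computes its median (the parameter), and then computes the median of a random sample (the statistic). The income values are simulated from an arbitrary distribution and are not real data. The population size, the sample size of 500, and all variable names are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: 100,000 simulated household incomes (not real data).
population = rng.lognormal(mean=11, sigma=0.8, size=100_000)

# The parameter. We can compute it here only because the whole
# synthetic population is available.
parameter = np.median(population)

# A random sample of 500 households, drawn without replacement.
sample = rng.choice(population, size=500, replace=False)

# The statistic: the sample median, used as an estimate of the parameter.
estimate = np.median(sample)

print("Population median (parameter):", round(parameter))
print("Sample median (estimate):     ", round(estimate))
```

In a real study the first printed number would be unknown; only the second would be available.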

However, the value of any statistic depends on the sample, which is based on random draws. So every time data scientists come up with an estimate based on a random sample, they are faced with a question:

“How different could this estimate have been, if the sample had come out differently?”

In this chapter, you will learn one way of answering this question. The answer will provide you with the tools to estimate a numerical parameter and quantify the error in your estimate.
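To make the question concrete, the sketch below draws many random samples from the same synthetic population and records the median of each one. The simulated incomes, the sample size, the number of repetitions, and the helper name `sample_median` are all assumptions made for illustration. The simulation is possible only because the whole synthetic population is available, which is not the case in practice.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical population of simulated household incomes (not real data).
population = rng.lognormal(mean=11, sigma=0.8, size=100_000)

def sample_median(sample_size):
    """Draw one random sample without replacement and return its median."""
    sample = rng.choice(population, size=sample_size, replace=False)
    return np.median(sample)

# Repeat the sampling process 1,000 times; each repetition gives an estimate.
estimates = np.array([sample_median(500) for _ in range(1000)])

# The estimates differ from sample to sample.
spread = estimates.max() - estimates.min()
print("Smallest estimate:", round(estimates.min()))
print("Largest estimate: ", round(estimates.max()))
print("Range of estimates:", round(spread))
```

The range of the estimates gives a rough sense of how different one estimate could have been from another.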

We will start with a preliminary topic: percentiles. The most famous percentile is the median, often used in summaries of income data. Other percentiles will be important in the method of estimation that we are about to develop, so we begin by defining percentiles carefully.
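As a quick illustration that the median is the 50th percentile, the sketch below uses NumPy on a small, made-up list of incomes (in thousands of dollars). Note that `np.percentile` interpolates between values by default, so for other percentiles its results may differ from the definition developed in this chapter.

```python
import numpy as np

# Hypothetical incomes in thousands of dollars (made up for illustration).
incomes = np.array([28, 35, 41, 52, 60, 75, 310])

median = np.median(incomes)
fiftieth_percentile = np.percentile(incomes, 50)

# Both calculations pick out the middle value of the sorted data.
print(median, fiftieth_percentile)  # 52.0 52.0
```

The one very large income does not affect the median, which is one reason the median is popular in summaries of income data.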