16. Comparing Two Samples

We have seen several examples of assessing whether a single sample looks like random draws from a specified chance model.

Did the Alameda County jury panels look like a random sample from the population of eligible jurors?
Did the pea plants that Mendel grew have colors that were consistent with the chances he specified in his model?

In all of these cases there was just one random sample, and we were trying to decide how it had been generated. But often, data scientists have to compare two random samples with each other. For example, they might have to compare the outcomes of patients who have been assigned at random to a treatment group and a control group. Or they might have randomized internet users to receive two different versions of a website, after which they would want to compare the actions of the two random groups.

In this chapter, we develop a way of using Python to compare two random samples and answer questions about the similarities and differences between them. You will see that the methods we develop have diverse applications. Our examples are from medicine and public health as well as football!

First of all, let us review the random sampling techniques we have covered so far.

path_data = '../../data/'

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

16.1. Random Sampling in Python

This section summarizes the methods for sampling at random using Python:

sample (the pandas DataFrame method)
np.random.choice
sample_proportions

16.1.1. Sampling a DataFrame

If you are sampling from a population of individuals whose data are represented in the rows of a dataframe, then you can use the Pandas method sample to randomly select rows of the table. That is, you can use sample to select a random sample of individuals.

By default, sample draws uniformly at random without replacement. Chance experiments such as rolling a die are more naturally modeled by sampling with replacement, so for these we need to set replace=True.

faces = np.arange(1, 7)
die = pd.DataFrame({
    'Face': faces})
die

	Face
0	1
1	2
2	3
3	4
4	5
5	6

Run the cell below to simulate 7 rolls of a die.

die.sample(7, replace=True)      ### default replace=False

	Face
4	5
0	1
5	6
3	4
4	5
1	2
3	4

Sometimes it is more natural to sample individuals at random without replacement. This is called a simple random sample. Because replace=False is the default for sample, you can omit the argument or state it explicitly for clarity.

actors = pd.read_csv(path_data + 'actors.csv')

print(len(actors))
actors

### showing all 50 rows because it's less than 60

	Actor	Total Gross	Number of Movies	Average per Movie	#1 Movie	Gross
0	Harrison Ford	4871.7	41	118.8	Star Wars: The Force Awakens	936.7
1	Samuel L. Jackson	4772.8	69	69.2	The Avengers	623.4
2	Morgan Freeman	4468.3	61	73.3	The Dark Knight	534.9
3	Tom Hanks	4340.8	44	98.7	Toy Story 3	415.0
4	Robert Downey, Jr.	3947.3	53	74.5	The Avengers	623.4
5	Eddie Murphy	3810.4	38	100.3	Shrek 2	441.2
6	Tom Cruise	3587.2	36	99.6	War of the Worlds	234.3
7	Johnny Depp	3368.6	45	74.9	Dead Man's Chest	423.3
8	Michael Caine	3351.5	58	57.8	The Dark Knight	534.9
9	Scarlett Johansson	3341.2	37	90.3	The Avengers	623.4
10	Gary Oldman	3294.0	38	86.7	The Dark Knight	534.9
11	Robin Williams	3279.3	49	66.9	Night at the Museum	250.9
12	Bruce Willis	3189.4	60	53.2	Sixth Sense	293.5
13	Stellan Skarsgard	3175.0	43	73.8	The Avengers	623.4
14	Anthony Daniels	3162.9	7	451.8	Star Wars: The Force Awakens	936.7
15	Ian McKellen	3150.4	31	101.6	Return of the King	377.8
16	Will Smith	3149.1	24	131.2	Independence Day	306.2
17	Stanley Tucci	3123.9	50	62.5	Catching Fire	424.7
18	Matt Damon	3107.3	39	79.7	The Martian	228.4
19	Robert DeNiro	3081.3	79	39.0	Meet the Fockers	279.3
20	Cameron Diaz	3031.7	34	89.2	Shrek 2	441.2
21	Liam Neeson	2942.7	63	46.7	The Phantom Menace	474.5
22	Andy Serkis	2890.6	23	125.7	Star Wars: The Force Awakens	936.7
23	Don Cheadle	2885.4	34	84.9	Avengers: Age of Ultron	459.0
24	Ben Stiller	2827.0	37	76.4	Meet the Fockers	279.3
25	Helena Bonham Carter	2822.0	36	78.4	Harry Potter / Deathly Hallows (P2)	381.0
26	Orlando Bloom	2815.8	17	165.6	Dead Man's Chest	423.3
27	Woody Harrelson	2815.8	50	56.3	Catching Fire	424.7
28	Cate Blanchett	2802.6	39	71.9	Return of the King	377.8
29	Julia Roberts	2735.3	42	65.1	Ocean's Eleven	183.4
30	Elizabeth Banks	2726.3	35	77.9	Catching Fire	424.7
31	Ralph Fiennes	2715.3	36	75.4	Harry Potter / Deathly Hallows (P2)	381.0
32	Emma Watson	2681.9	17	157.8	Harry Potter / Deathly Hallows (P2)	381.0
33	Tommy Lee Jones	2681.3	46	58.3	Men in Black	250.7
34	Brad Pitt	2680.9	40	67.0	World War Z	202.4
35	Adam Sandler	2661.0	32	83.2	Hotel Transylvania 2	169.7
36	Daniel Radcliffe	2634.4	17	155.0	Harry Potter / Deathly Hallows (P2)	381.0
37	Jonah Hill	2605.1	29	89.8	The LEGO Movie	257.8
38	Owen Wilson	2602.3	39	66.7	Night at the Museum	250.9
39	Idris Elba	2580.6	26	99.3	Avengers: Age of Ultron	459.0
40	Bradley Cooper	2557.7	25	102.3	American Sniper	350.1
41	Mark Wahlberg	2549.8	36	70.8	Transformers 4	245.4
42	Jim Carrey	2545.2	27	94.3	The Grinch	260.0
43	Dustin Hoffman	2522.1	43	58.7	Meet the Fockers	279.3
44	Leonardo DiCaprio	2518.3	25	100.7	Titanic	658.7
45	Jeremy Renner	2500.3	21	119.1	The Avengers	623.4
46	Philip Seymour Hoffman	2463.7	40	61.6	Catching Fire	424.7
47	Sandra Bullock	2462.6	35	70.4	Minions	336.0
48	Chris Evans	2457.8	23	106.9	The Avengers	623.4
49	Anne Hathaway	2416.5	25	96.7	The Dark Knight Rises	448.1

### simple random sample of 5 rows

actors.sample(5, replace=False)

	Actor	Total Gross	Number of Movies	Average per Movie	#1 Movie	Gross
19	Robert DeNiro	3081.3	79	39.0	Meet the Fockers	279.3
16	Will Smith	3149.1	24	131.2	Independence Day	306.2
21	Liam Neeson	2942.7	63	46.7	The Phantom Menace	474.5
0	Harrison Ford	4871.7	41	118.8	Star Wars: The Force Awakens	936.7
27	Woody Harrelson	2815.8	50	56.3	Catching Fire	424.7

Since sample gives you the entire sample in the order in which the rows were selected, you can use Pandas methods on the sampled table to answer many questions about the sample. For example, you can find the number of times the die showed six spots, or the average number of movies in which the sampled actors appeared, or whether one specified actor appeared in the sample. You might need multiple lines of code to get some of this information.
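The questions listed above can be sketched in a few lines. This is only an illustration: the small table actors_small is a hypothetical stand-in built from the first five rows of the actors table shown earlier, and random_state is set only so that the run is repeatable.

```python
import numpy as np
import pandas as pd

# The die table from above
die = pd.DataFrame({'Face': np.arange(1, 7)})

# 7 rolls of the die; random_state makes the draw repeatable
rolls = die.sample(7, replace=True, random_state=0)

# Number of times the die showed six spots
num_sixes = (rolls['Face'] == 6).sum()

# Hypothetical stand-in for the actors table (first five rows shown earlier)
actors_small = pd.DataFrame({
    'Actor': ['Harrison Ford', 'Samuel L. Jackson', 'Morgan Freeman',
              'Tom Hanks', 'Robert Downey, Jr.'],
    'Number of Movies': [41, 69, 61, 44, 53],
})

# Simple random sample of 3 actors
sampled = actors_small.sample(3, replace=False, random_state=0)

# Average number of movies among the sampled actors
avg_movies = sampled['Number of Movies'].mean()

# Whether one specified actor appeared in the sample
has_hanks = (sampled['Actor'] == 'Tom Hanks').any()

print(num_sixes, avg_movies, has_hanks)
```

Each answer comes from an ordinary column operation on the sampled table, which is possible only because sample returns the sampled rows themselves.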

16.1.2. Sampling an Array

If you are sampling from a population of individuals whose data are represented as an array, you can use the NumPy function np.random.choice to randomly select elements of the array.

By default, np.random.choice samples at random with replacement.

### the faces of a die, as an array

faces

array([1, 2, 3, 4, 5, 6])

### 7 rolls of the die

np.random.choice(faces, 7)

array([4, 4, 2, 6, 5, 1, 3])

The argument replace=False allows you to get a simple random sample, that is, a sample drawn at random without replacement.

### array of actor names (converted from a pandas Series)

actor_names = actors['Actor'].to_numpy()

# Simple random sample of 5 actor names
np.random.choice(actor_names, 5, replace=False)

array(['Don Cheadle', 'Adam Sandler', 'Jeremy Renner', 'Michael Caine',
       'Stellan Skarsgard'], dtype=object)

Like sample, np.random.choice gives you the entire sequence of sampled elements. You can use array operations to answer many questions about the sample. For example, you can find which actor was the second one to be drawn, or the number of faces of the die that appeared more than once. Some answers might need multiple lines of code.
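As a rough sketch of those two questions, the code below uses np.unique to count repeated faces and plain indexing to find the second element drawn. The names array is a hypothetical stand-in containing five actors from the table above, and the seed is fixed only for repeatability.

```python
import numpy as np

np.random.seed(0)  # fixed seed, only so the run is repeatable

# 7 rolls of the die, drawn with replacement (the default)
faces = np.arange(1, 7)
rolls = np.random.choice(faces, 7)

# Count how often each face appeared, then count the faces seen more than once
values, counts = np.unique(rolls, return_counts=True)
num_repeated_faces = np.count_nonzero(counts > 1)

# Hypothetical stand-in for the array of actor names
names = np.array(['Harrison Ford', 'Samuel L. Jackson', 'Morgan Freeman',
                  'Tom Hanks', 'Robert Downey, Jr.'])

# Simple random sample of 3 names; the array keeps the order of the draws
drawn = np.random.choice(names, 3, replace=False)
second_drawn = drawn[1]

print(rolls, num_repeated_faces, second_drawn)
```

With 7 rolls and only 6 faces, at least one face must repeat, so num_repeated_faces is always at least 1 here.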

16.1.3. Sampling a Categorical Distribution

Sometimes we are interested in a categorical attribute of our sampled individuals. For example, we might be looking at whether a coin lands Heads or Tails; or we might be interested in the political parties of randomly selected voters.

In such cases, we frequently need the proportions of sampled individuals in the different categories. If we have the entire sample, we can calculate these proportions. The function sample_proportions does that work for us. It is tailored for sampling at random with replacement from a categorical distribution and returns the proportions of sampled elements in each category.

The sample_proportions function takes two arguments:

the sample size
the distribution of the categories in the population, as a list or array of proportions that add up to 1

It returns an array containing the distribution of the categories in a random sample of the given size taken from the population. That’s an array consisting of the sample proportions in all the different categories, in the same order in which they appeared in the population distribution.

For example, suppose each plant of a species is red-flowering with a chance of 25%, pink-flowering with a chance of 50%, and white-flowering with a chance of 25%, regardless of the flower colors of all other plants. You can use sample_proportions to see the proportions of the different colors among 300 plants of the species.

# Fix the seed so that the results below are reproducible
np.random.seed(42)

def sample_proportions(sample_size, probabilities):
    """Return the proportion of random draws for each outcome in a distribution.

    This function is similar to np.random.multinomial, but returns proportions
    instead of counts.

    Args:
        ``sample_size``: The size of the sample to draw from the distribution.
        ``probabilities``: An array of probabilities that forms a distribution.

    Returns:
        An array with the same length as ``probabilities`` that sums to 1.
    """
    return np.random.multinomial(sample_size, probabilities) / sample_size

### Species distribution of flower colors:
### Proportions are in the order Red, Pink, White

species_proportions = [0.25, 0.5, .25]

sample_size = 300

# Distribution of sample
sample_distribution = sample_proportions(sample_size, species_proportions)
sample_distribution

array([0.24, 0.5 , 0.26])

As you would expect, the proportions in the sample sum to 1.

sum(sample_distribution)

np.float64(1.0)

The categories in species_proportions are in the order Red, Pink, White. That order is preserved by sample_proportions. If you just want the proportion of pink-flowering plants in the sample, you can use item:

### sample proportion of pink-flowering plants

sample_distribution.item(1)

0.5

You can use sample_proportions and array operations to answer questions based only on the proportions of sampled individuals in the different categories. You will not be able to answer questions that require more detailed information about the sample, such as which of the sampled plants had each of the different colors.
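To illustrate the kind of question that proportions alone can answer, here is a minimal sketch that recovers the category counts and measures how far each sample proportion falls from the model. The helper is repeated so that the block runs on its own; the seed and the names sample_counts, deviation, and most_common are choices made for this example.

```python
import numpy as np

def sample_proportions(sample_size, probabilities):
    """Proportions of draws in each category (same helper as above)."""
    return np.random.multinomial(sample_size, probabilities) / sample_size

np.random.seed(0)  # fixed seed, only so the run is repeatable

# Proportions are in the order Red, Pink, White
species_proportions = np.array([0.25, 0.5, 0.25])
sample_size = 300
sample_distribution = sample_proportions(sample_size, species_proportions)

# Recover the counts in each category from the proportions
sample_counts = np.round(sample_distribution * sample_size).astype(int)

# Absolute difference between each sample proportion and the model
deviation = np.abs(sample_distribution - species_proportions)

# Index of the most common color in the sample (0 = Red, 1 = Pink, 2 = White)
most_common = np.argmax(sample_distribution)

print(sample_counts, deviation, most_common)
```

Nothing here reveals which individual plant had which color; only category totals are available.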