15.2. Assessing Model #2
In 2010, the American Civil Liberties Union (ACLU) of Northern California presented a report on jury selection in Alameda County, California. The report concluded that certain racial and ethnic groups are underrepresented among jury panelists in Alameda County, and suggested some reforms of the process by which eligible jurors are assigned to panels. As you saw in the previous section, trial jurors are selected from a larger jury panel. By law, the jury panel is supposed to be representative of the population of eligible jurors.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
15.2.1. Alameda County Jury Panels
The focus of the study by the ACLU of Northern California was the racial and ethnic composition of jury panels in Alameda County. The ACLU compiled data on the composition of the jury panels in 11 felony trials in Alameda County in the years 2009 and 2010. In those panels, the total number of people who reported for jury service was 1453. The ACLU gathered demographic data on all of these prospective jurors and compared that data with the composition of all eligible jurors in the county.
The data are tabulated below in a table called jury. In each category, the first numerical value is the proportion of all eligible juror candidates in that category. The second value is the proportion of people in that category among those who appeared for the process of selection into the jury.
The labels for the different categories are taken from the ACLU report, which says, “The category ‘other’ includes people who identified as mixed race without identifying one primary racial identity, and individuals who did not identify a race or ethnicity.” The label Asian/PI means “Asian or Pacific Islander.” The label Black/AA means “Black or African-American.”
data = {
'Ethnicity': ['Asian/PI', 'Black/AA', 'Caucasian', 'Hispanic', 'Other'],
'Eligible': [0.15, 0.18, 0.54, 0.12, 0.01],
'Panels': [0.26, 0.08, 0.54, 0.08, 0.04]
}
jury = pd.DataFrame(data)
jury
| | Ethnicity | Eligible | Panels |
|---|---|---|---|
| 0 | Asian/PI | 0.15 | 0.26 |
| 1 | Black/AA | 0.18 | 0.08 |
| 2 | Caucasian | 0.54 | 0.54 |
| 3 | Hispanic | 0.12 | 0.08 |
| 4 | Other | 0.01 | 0.04 |
Some ethnicities are overrepresented and some are underrepresented on the jury panels in the study. In particular, panelists in the Black/AA and Hispanic categories are underrepresented. A bar chart is helpful for visualizing the differences.
jury.plot(x='Ethnicity', kind='barh')
<Axes: ylabel='Ethnicity'>
15.2.2. Data vs. Random
What if we select a random sample of 1,453 people from the population of eligible jurors? Will the distribution of their ethnicities look like the distribution of the panels above?
We can answer these questions by using sample_proportions and augmenting the jury table with a column of the proportions in our sample.
Technical note: Random samples of prospective jurors would be selected without replacement. However, when the size of a sample is small relative to the size of the population, sampling without replacement resembles sampling with replacement; the proportions in the population don’t change much between draws. The population of eligible jurors in Alameda County is over a million, and compared to that, a sample size of about 1500 is quite small. We will therefore sample with replacement.
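To see this resemblance concretely, here is a small sketch (ours, not part of the original analysis; the population size of one million is a hypothetical round number) that draws a sample of 1453 both without and with replacement and compares the resulting proportions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population of one million eligible jurors, split according
# to the eligible proportions in the jury table
eligible = np.array([0.15, 0.18, 0.54, 0.12, 0.01])
population_counts = (eligible * 1_000_000).astype(int)

# Without replacement: draw 1453 people from the finite population
without = rng.multivariate_hypergeometric(population_counts, 1453) / 1453

# With replacement: 1453 independent draws from the fixed proportions
with_repl = rng.multinomial(1453, eligible) / 1453

print(np.round(without, 3))
print(np.round(with_repl, 3))
```

Both sets of proportions land within a few hundredths of the eligible distribution, which is why sampling with replacement is a harmless simplification here.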
In the cell below, we sample at random 1453 times from the distribution of eligible jurors, and display the distribution of the random sample along with the distributions of the eligible jurors and the panel in the data.
np.random.seed(42)
def sample_proportions(sample_size, probabilities):
    """Return the proportion of random draws for each outcome in a distribution.

    This function is similar to np.random.multinomial, but returns proportions
    instead of counts.

    Args:
        ``sample_size``: The size of the sample to draw from the distribution.
        ``probabilities``: An array of probabilities that forms a distribution.

    Returns:
        An array with the same length as ``probabilities`` that sums to 1.
    """
    return np.random.multinomial(sample_size, probabilities) / sample_size
eligible_population = jury['Eligible']
sample_distribution = sample_proportions(1453, eligible_population)
jury['Random Sample'] = sample_distribution
jury
| | Ethnicity | Eligible | Panels | Random Sample |
|---|---|---|---|---|
| 0 | Asian/PI | 0.15 | 0.26 | 0.143840 |
| 1 | Black/AA | 0.18 | 0.08 | 0.182381 |
| 2 | Caucasian | 0.54 | 0.54 | 0.555403 |
| 3 | Hispanic | 0.12 | 0.08 | 0.108741 |
| 4 | Other | 0.01 | 0.04 | 0.009635 |
The distribution of the random sample is quite close to the distribution of the eligible population, unlike the distribution of the panels. As always, it helps to visualize.
jury.plot('Ethnicity', kind='barh', figsize=(6, 5))
<Axes: ylabel='Ethnicity'>
The bar chart shows that the distribution of the random sample resembles the eligible population, but the distribution of the panels does not.
To assess whether this observation is particular to one random sample or is more general, we can simulate multiple panels under the model of random selection and see what the simulations predict. But we won’t be able to look at thousands of bar charts like the one above. We need a statistic that will help us assess whether or not the model of random selection is supported by the data.
15.2.3. Statistic: Distance between Distributions
We know how to measure how different two numbers are: if the numbers are x and y, the distance between them is | x - y |. Now we have to quantify the distance between two distributions. For example, we have to measure the distance between the blue and gold distributions below.
jury.drop('Random Sample', axis=1).plot('Ethnicity', kind='barh', figsize=(6, 5))
<Axes: ylabel='Ethnicity'>
For this, we will compute a quantity called the total variation distance between two distributions. The calculation is an extension of how we find the distance between two numbers.
To compute the total variation distance, we first find the difference between the two proportions in each category.
jury = jury.drop('Random Sample', axis=1)
jury['Difference'] = jury['Panels'] - jury['Eligible']
jury
| | Ethnicity | Eligible | Panels | Difference |
|---|---|---|---|---|
| 0 | Asian/PI | 0.15 | 0.26 | 0.11 |
| 1 | Black/AA | 0.18 | 0.08 | -0.10 |
| 2 | Caucasian | 0.54 | 0.54 | 0.00 |
| 3 | Hispanic | 0.12 | 0.08 | -0.04 |
| 4 | Other | 0.01 | 0.04 | 0.03 |
Take a look at the column Difference and notice that the sum of its entries is 0: the positive entries add up to 0.14, exactly canceling the total of the negative entries, which is -0.14.
This is numerical evidence of the fact that in the bar chart, the gold bars exceed the blue bars by exactly as much as the blue bars exceed the gold. The proportions in each of the two columns Panels and Eligible add up to 1, and so the give-and-take between their entries must add up to 0.
To avoid the cancellation, we drop the negative signs and then add all the entries. But this gives us two times the total of the positive entries (equivalently, two times the total of the negative entries, with the sign removed). We don’t need that doubling, so we divide the sum by 2.
jury['Absolute Difference'] = np.abs(jury['Difference'])
jury
| | Ethnicity | Eligible | Panels | Difference | Absolute Difference |
|---|---|---|---|---|---|
| 0 | Asian/PI | 0.15 | 0.26 | 0.11 | 0.11 |
| 1 | Black/AA | 0.18 | 0.08 | -0.10 | 0.10 |
| 2 | Caucasian | 0.54 | 0.54 | 0.00 | 0.00 |
| 3 | Hispanic | 0.12 | 0.08 | -0.04 | 0.04 |
| 4 | Other | 0.01 | 0.04 | 0.03 | 0.03 |
jury['Absolute Difference'].sum() / 2
np.float64(0.14)
This quantity 0.14 is the total variation distance (TVD) between the distribution of ethnicities in the eligible juror population and the distribution in the panels.
In general, the total variation distance between two distributions measures how close the distributions are. The larger the TVD, the more different the two distributions appear.
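As a quick sanity check on the scale of the statistic (this check is ours, not part of the ACLU analysis), the TVD of a distribution with itself is 0, and the TVD of two distributions with no overlap at all is 1, the largest possible value:

```python
import numpy as np

def tvd(d1, d2):
    # Total variation distance: half the sum of the absolute differences
    return np.abs(np.array(d1) - np.array(d2)).sum() / 2

print(tvd([0.5, 0.5], [0.5, 0.5]))  # identical distributions: 0.0
print(tvd([1.0, 0.0], [0.0, 1.0]))  # completely disjoint: 1.0, the maximum
```

Every TVD therefore falls between 0 and 1. The 0.14 computed above sits well inside that range but, as we will see, far outside what random sampling typically produces.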
Technical Note: We could have obtained the same result by just adding the positive differences. But our method of including all the absolute differences eliminates the need to keep track of which differences are positive and which are not.
We will use the total variation distance between distributions as the statistic to simulate under the assumption of random selection. Large values of the distance will be evidence against random selection.
15.2.4. Simulating the Statistic Under the Model
To see how the TVD varies across random samples, we will simulate it repeatedly under the model of random selection from the eligible population. Let’s organize our calculation. Since we are going to be computing total variation distance repeatedly, we will first write a function that computes it for two given distributions.
The function total_variation_distance takes two arrays containing the distributions to compare, and returns the TVD between them.
def total_variation_distance(distribution_1, distribution_2):
    return sum(np.abs(distribution_1 - distribution_2)) / 2
This function will help us calculate our statistic in each repetition of the simulation. But first let’s check that it gives the right answer when we use it to compute the distance between the blue (eligible) and gold (panels) distributions above. These are the distributions in the ACLU study.
total_variation_distance(jury['Panels'], jury['Eligible'])
0.14
This agrees with the value that we computed directly without using the function.
In the cell below, we use the function to compute the TVD between the distributions of the eligible jurors and one random sample. Recall that eligible_population is the array containing the distribution of the eligible jurors, and that our sample size is 1453.
In the first line, we use sample_proportions to generate a random sample from the eligible population. In the next line, we use total_variation_distance to compute the TVD between the distributions in the random sample and the eligible population.
sample_distribution = sample_proportions(1453, eligible_population)
total_variation_distance(sample_distribution, eligible_population)
0.012704748795595331
Run the cell a few times and notice that the distances are quite a bit smaller than 0.14, the distance between the distribution of the panels and the eligible jurors.
We are now ready to run a simulation to assess the model of random selection.
15.2.4.1. Simulating One Value of the Statistic
In the same way that we start every simulation, let’s define a function one_simulated_tvd that returns one simulated value of the total variation distance under the hypothesis of random selection.
The code in the body of the definition is based on the cell above.
# Compute one simulated value of
# the total variation distance between
# the distribution of a sample selected at random
# and the distribution of the eligible population

def one_simulated_tvd():
    sample_distribution = sample_proportions(1453, eligible_population)
    return total_variation_distance(sample_distribution, eligible_population)
15.2.4.2. Simulating Multiple Values
Now we can apply the familiar process of using a for loop to create an array consisting of 5000 such distances.
tvds = np.array([])

repetitions = 5000
for i in np.arange(repetitions):
    tvds = np.append(tvds, one_simulated_tvd())
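As an aside on the loop above: np.append copies the whole array on every call, so building the array with a list comprehension is both shorter and faster. Here is a self-contained sketch of the same simulation (the seed and the inlined helper are ours):

```python
import numpy as np

rng = np.random.default_rng(7)
eligible = np.array([0.15, 0.18, 0.54, 0.12, 0.01])

def one_simulated_tvd(sample_size=1453):
    # One random panel drawn from the eligible distribution, and its TVD
    sample = rng.multinomial(sample_size, eligible) / sample_size
    return np.abs(sample - eligible).sum() / 2

# Build all 5000 simulated distances at once instead of appending one by one
tvds = np.array([one_simulated_tvd() for _ in range(5000)])
print(tvds.shape)
```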
15.2.5. Assessing the Model of Random Selection
Here is the empirical histogram of the simulated distances. It shows that if you draw 1453 panelists at random from the pool of eligible candidates, then the distance between the distributions of the panelists and the eligible population is rarely more than about 0.05.
The panels in the study, however, were not quite so similar to the eligible population. The total variation distance between the panels and the population was 0.14, shown as the red dot on the horizontal axis. It is far beyond the tail of the histogram and does not look at all like a typical distance between the distributions of a random sample and the eligible population.
TVD = pd.DataFrame({'TVD': tvds})

fig, ax = plt.subplots()
# density=True puts the histogram on the density scale;
# relabel the y-ticks as percents per unit
ax.hist(TVD['TVD'], bins=np.arange(0, 0.2, 0.005), density=True, ec='white')
y_vals = ax.get_yticks()
ax.set_yticks(y_vals)  # fix the tick locations before relabeling
ax.set_yticklabels(['{:g}'.format(y * 100) for y in y_vals])
ax.set_title('Prediction Assuming Random Selection')
ax.set_xlabel('TVD')
ax.set_ylabel('Percent Per Unit')
plt.xlim(0, 0.15)
plt.ylim(-0.75, 55)
# Mark the observed TVD of the panels in the study with a red dot
plt.scatter(0.14, 0, color='red', s=30)
<matplotlib.collections.PathCollection at 0x1107f5bd0>
The simulation shows that the composition of the panels in the ACLU study is not consistent with the model of random selection. Our analysis supports the ACLU’s conclusion that the panels were not representative of the distribution provided for the eligible jurors.
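To make “far beyond the tail” quantitative, one natural extension (ours, not in the ACLU report) is the empirical proportion of simulated distances at least as large as the observed 0.14. A self-contained sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
eligible = np.array([0.15, 0.18, 0.54, 0.12, 0.01])
observed_tvd = 0.14  # distance between the panels and the eligible population

def one_simulated_tvd(sample_size=1453):
    sample = rng.multinomial(sample_size, eligible) / sample_size
    return np.abs(sample - eligible).sum() / 2

tvds = np.array([one_simulated_tvd() for _ in range(5000)])

# Proportion of simulated panels at least as far from the eligible
# population as the panels in the study
p_value = np.count_nonzero(tvds >= observed_tvd) / len(tvds)
print(p_value)
```

In 5000 simulations this proportion comes out to essentially zero: under random selection, a distance as large as 0.14 just does not occur.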
15.2.6. Conclusion of the Data Analysis
15.2.6.1. Data Quality
We have developed a technique that helps us decide whether one distribution looks like a random sample from another. But we should examine our data sources to better understand exactly what we can conclude from our analysis. Good data science includes a thoughtful examination of how the data were gathered. For Case 2:
Eligible Jurors. It is important to keep in mind that not everyone is eligible to serve on a jury. On its website, the Superior Court of Alameda County says, “You may be called to serve if you are 18 years old, a U.S. citizen and a resident of the county or district where summoned. You must be able to understand English, and be physically and mentally capable of serving. In addition, you must not have served as any kind of juror in the past 12 months, nor have been convicted of a felony.”
Though the Census is a natural source of demographic information, the Census doesn’t collect data on all these categories. Thus the ACLU had to obtain the demographics of eligible jurors in some other way.
What they used was a set of estimates that had been developed by a San Diego State University professor for an Alameda County trial in 2002. Those estimates were based on the 2000 Census and also took into account the criteria required for eligibility as a juror.
So we have to keep in mind that the distribution that we used for the eligible population is itself an estimate and is therefore subject to error. Moreover, as the ACLU noted, using estimates based on the 2000 Census for populations in 2010 might not be accurate due to the changing demographics in California.
Thus the distribution used for the population of eligible jurors might not be accurate.
Panels. It is not clear from the report exactly how the 1453 panelists were classified into the different ethnic categories. The report says only that “attorneys … cooperated in collecting jury pool data”.
Significant social, cultural, and political factors affect who gets classified or self-classifies into each ethnic category. It is also not clear if the definitions of those categories in the panels were the same as those used in the estimates of the distribution of eligible jurors. This raises questions about the correspondence between the two distributions being compared.
Moreover, the data on panelists were obtained from those who reported for service. As we have discussed, not all panelists do so. The reasons for not reporting are associated with race and ethnicity, and disproportionately affect panelists from under-resourced communities.
Responding to jury summons is required by law. But as the ACLU noted, at the time of the study Alameda County did not have an effective process for following up on prospective jurors who had been called but had failed to appear. The ACLU recommended the creation of “a failure to appear program that requires, at a minimum, the issuance of a second notice to prospective jurors.”
Indeed, following up on non-respondents would reduce bias. For our analysis, we must acknowledge that we don’t have the data needed to fill in the demographic information for panelists who were selected but did not report for service.