14. Descriptive Statistics & Distribution#

Goals

  • Describe a variable using center, spread, and shape

  • Compute descriptive statistics: mean, median, variance, standard deviation

  • Understand common distributions (normal, binomial, uniform)

  • Connect probability to data-generating processes

  • Simulate randomness with NumPy

  • Build intuition for the Law of Large Numbers

### imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

14.1. Data and Variables#

When describing data, make sure you cover center, spread, and shape.

tips = sns.load_dataset('tips')
tips                      ### size is the number of people in the party
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
... ... ... ... ... ... ... ...
239 29.03 5.92 Male No Sat Dinner 3
240 27.18 2.00 Female Yes Sat Dinner 2
241 22.67 2.00 Male Yes Sat Dinner 2
242 17.82 1.75 Male No Sat Dinner 2
243 18.78 3.00 Female No Thur Dinner 2

244 rows × 7 columns

tips.shape       ### .shape is a (rows, columns) tuple
(244, 7)

14.1.1. Population vs sample#

Population is the full group we care about (e.g., “all lunch bills at the restaurant this year”), whereas sample is the subset we actually measured (e.g., the 244 rows in the tips dataset). We only ever see the sample, but we talk like we know the population.
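
As a minimal sketch of the distinction, we can treat the 244 rows as if they were the whole population and draw a smaller sample from it; the sample mean will usually differ a bit from the "population" mean. (The 50-row sample size and the seed are arbitrary choices for illustration.)

### treat the full dataset as the "population" and draw a random sample from it
population_mean = tips['total_bill'].mean()

sample = tips.sample(n=50, random_state=0)      ### 50 rows, arbitrary seed
sample_mean = sample['total_bill'].mean()

print('population mean:', round(population_mean, 2))
print('sample mean:    ', round(sample_mean, 2))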

14.1.2. Variable vs Observation#

Each row in the dataframe/dataset is an observation (e.g., one restaurant bill), whereas columns are variables (e.g., total_bill, tip, smoker, time (Dinner/Lunch), etc.).

  • Quantitative variable: e.g., total_bill (numeric, measured)

  • Categorical variable: time (Lunch vs Dinner), smoker (Yes/No)
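
To see the distinction in code, selecting a row gives one observation and selecting a column gives one variable. A quick sketch (row label 0 is an arbitrary choice):

tips.loc[0]                  ### one observation: the first bill
tips['total_bill'].head()    ### one variable: the total_bill column
tips.dtypes                  ### each variable's type (numeric vs categorical)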

14.1.3. Distribution#

A distribution shows how often different values occur. For numeric data:

  • Are most bills around $10? $20? $50?

  • Are there a few very large bills? (outliers)

For categorical data:

  • How common is “Dinner” vs “Lunch”?
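
For a categorical variable, value_counts() answers this directly; a quick check on time (normalize=True gives proportions instead of counts):

tips['time'].value_counts()                  ### counts of Dinner vs Lunch
tips['time'].value_counts(normalize=True)    ### the same, as proportions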

14.1.4. Descriptive Statistics#

The .describe() method returns summary statistics for the numeric columns of a DataFrame.

tips.describe()
total_bill tip size
count 244.000000 244.000000 244.000000
mean 19.785943 2.998279 2.569672
std 8.902412 1.383638 0.951100
min 3.070000 1.000000 1.000000
25% 13.347500 2.000000 2.000000
50% 17.795000 2.900000 2.000000
75% 24.127500 3.562500 3.000000
max 50.810000 10.000000 6.000000

Key summaries for a numeric variable:

  • Mean (average)

  • Median (middle value)

  • Standard deviation (typical distance from the mean)

  • IQR (middle 50% range)

Let’s compute them for total_bill.

14.1.5. Center and spread: mean, median, SD, IQR#

col = tips['total_bill']

summary = pd.DataFrame({
    'mean': [col.mean()],
    'median': [col.median()],
    'std (SD)': [col.std()],
    'IQR': [col.quantile(0.75) - col.quantile(0.25)]
})
summary
mean median std (SD) IQR
0 19.785943 17.795 8.902412 10.78
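
The mean is pulled around by extreme values while the median is not. A minimal sketch, appending one made-up $500 bill to see how each reacts:

### add one extreme, made-up bill and compare how mean and median move
with_outlier = pd.concat([col, pd.Series([500.0])], ignore_index=True)

print('mean:  ', round(col.mean(), 2), '->', round(with_outlier.mean(), 2))
print('median:', round(col.median(), 2), '->', round(with_outlier.median(), 2))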

14.1.6. Visualizing One Variable#

We’ll use:

  • Histogram (counts or density)

  • Boxplot (median, quartiles, outliers)

  • KDE / density curve (smoothed shape)

Interpretation checklist: center / spread / shape / outliers.

sns.set_theme()

fig, axes = plt.subplots(1, 2, figsize=(10,4))

sns.histplot(tips, x='total_bill', kde=True, ax=axes[0])
axes[0].set_title('Histogram + KDE')

sns.boxplot(y=tips['total_bill'], ax=axes[1])
axes[1].set_title('Boxplot')

plt.tight_layout()
../../_images/db811e3fa6f728ac1bb2f6e2e5d853128456d97305eebc9994cf4cc67696881d.png

14.2. Distributions#

14.2.1. Normal Distribution#

A normal distribution—also called a Gaussian distribution—is a symmetrical, bell-shaped probability distribution centered on the mean, where most data points cluster around that mean with frequencies tapering off equally toward both extremes.

Key Properties:

  • Defined by its mean (μ) and standard deviation (σ).

  • The mean, median, and mode are equal and located at the center.​

  • Symmetrical: the left and right sides are mirror images.

  • 68% of the data falls within 1 standard deviation of the mean, 95% within 2, and 99.7% within 3 (the “Empirical Rule”; checked in the sketch after this list).

  • Frequently models real-world phenomena due to the Central Limit Theorem.
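
A minimal check of the Empirical Rule, simulating standard normal draws (the 100,000-draw size and the seed are arbitrary choices):

### simulate standard normal draws and check the 68-95-99.7 rule
np.random.seed(0)                            ### arbitrary seed for illustration
z = np.random.normal(size=100_000)

for k in [1, 2, 3]:
    frac = np.mean(np.abs(z) <= k)
    print(f'within {k} SD: {frac:.3f}')      ### expect roughly 0.683, 0.954, 0.997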

Example Applications:

  • Heights of People: Human heights in a given population typically follow a normal distribution—most people are near the average height, with fewer extremely short or tall.​

  • IQ Scores: Measured intelligence is designed to follow a normal curve, with an average of 100 and most people close to the average.​

  • Measurement Errors: Repeated measurements of something (e.g., lengths, weights) often cluster around the true value, following a normal distribution.​

  • Test Scores: Large-scale standardized test results (e.g., SAT, GRE) often approximate a normal distribution after some normalization.

Example         Mean (μ)   Standard Deviation (σ)    What is Measured
Adult heights   170 cm     10 cm                     Human height distribution
IQ scores       100        15                        Intelligence
SAT scores      1050       200                       Standardized tests

Let us use the random module in NumPy to explore.

from numpy import random
random.seed(42)

x = random.normal(size=(2, 3))
print(type(x))
print(x)
<class 'numpy.ndarray'>
[[ 0.49671415 -0.1382643   0.64768854]
 [ 1.52302986 -0.23415337 -0.23413696]]

Now let’s add mean and standard deviation (because this is a normal distribution). Here we are controlling the center and the spread of the distribution.

### generate a random normal distribution of size 2X3 with mean at 1 and SD of 2

random.seed(42)
x = random.normal(loc=1, scale=2, size=(2, 3))
print(x)
[[1.99342831 0.7234714  2.29537708]
 [4.04605971 0.53169325 0.53172609]]
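
To connect this back to the table above, here is a quick sketch that simulates each example with its μ and σ and checks that the sample mean and SD land close to them (the parameters are the table's illustrative values, not real data):

### simulate the three examples from the table and check mean / SD
random.seed(0)                                   ### arbitrary seed
examples = {'heights': (170, 10), 'IQ': (100, 15), 'SAT': (1050, 200)}

for name, (mu, sigma) in examples.items():
    sample = random.normal(loc=mu, scale=sigma, size=10_000)
    print(f'{name}: mean={sample.mean():.1f}, SD={sample.std():.1f}')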

Time for some visualization. Note that we are using height and aspect (ratio) to control figure size.

random.seed(42)

sns.displot(random.normal(size=1000), height=3, aspect=2, kde=True)
<seaborn.axisgrid.FacetGrid at 0x1159eec10>
../../_images/55cbcd06713e5f8ec87d1ba982de23c4f24d71ca12f1a7e4ef98c11f76827a37.png

A KDE only plot. Note that we are controlling figure size by

  • plt.figure(figsize)

  • dpi

random.seed(42)

plt.figure(figsize=(6.5, 3))
sns.kdeplot(random.normal(size=1000))
<Axes: ylabel='Density'>
../../_images/b3ac7681b457c78c8345ff7274c35573d58b90edfb24b90c9e049a35fc89766e.png

Now let’s add dpi to control figure size.

np.random.seed(42)

import matplotlib as mpl
with mpl.rc_context({"figure.dpi": 100}):
    plt.figure(figsize=(6.5, 3))
    sns.kdeplot(random.normal(size=1000))
../../_images/b3ac7681b457c78c8345ff7274c35573d58b90edfb24b90c9e049a35fc89766e.png
np.random.seed(42)

import matplotlib as mpl
with mpl.rc_context({"figure.dpi": 50}):
    plt.figure(figsize=(6.5, 3))
    sns.kdeplot(random.normal(size=1000))
../../_images/eabbb54c94ac8b6944edfac1fffbea73a0e08f33ef5a446b4df4ae5846f779bc.png

14.2.2. Binomial Distribution#

A binomial distribution is a discrete probability distribution that models the number of successes in a fixed number of independent trials, where each trial has only two outcomes: “success” or “failure.” Every trial must have the same probability of success. This distribution is fundamental for analyzing repeated yes/no-type experiments (Bernoulli trials) and appears frequently in statistics and data science.

Key Features:

  • There are n fixed trials.

  • Each trial is independent.

  • There are only two possible outcomes per trial (success or failure).

  • The chance of success p is constant for each trial.

For example, flipping a fair coin 10 times and counting the number of heads. Here, each flip is independent, and the probability of heads (success) is 0.5. The distribution of the number of heads follows a binomial distribution with n=10, p=0.5.
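
For that coin example, the exact probabilities come from the binomial formula P(k) = C(n, k) * p^k * (1-p)^(n-k). A minimal sketch using math.comb:

### exact binomial probabilities for 10 fair coin flips
from math import comb

n, p = 10, 0.5
for k in range(n + 1):
    prob = comb(n, k) * p**k * (1 - p)**(n - k)    ### binomial PMF
    print(f'P({k} heads) = {prob:.4f}')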

# binomial vs normal distributions
random.seed(42)

norm = random.normal(loc=50, scale=5, size=1000)
bino = random.binomial(n=100, p=0.5, size=1000)
fig, ax = plt.subplots(figsize=(4, 3))
sns.kdeplot(data=norm, label='Normal', color='red')
sns.kdeplot(data=bino, label='Binomial', color='green')
ax.legend()                                   ### show the labels set above
plt.tight_layout()
../../_images/67c4915d9fbffce73ea7d211a556ca8cd7c9e28c1d15a56c858620761a9f6db3.png

Visually comparing normal distribution and binomial distribution.

### draw a single value (returned as a length-1 array)
random.binomial(n=100, p=0.5, size=1)
array([45])
### normal distribution vs binomial
### binomial approximates the normal distribution when n is large (rule of thumb: np >= 5 and n(1-p) >= 5)
### binomial "spikes" because it's integers

random.seed(42)
norm = random.normal(loc=50, scale=5, size=1000)  ### mean=np=50, SD=sqrt(np(1-p))=5, matching the binomial
bino = random.binomial(n=100, p=0.5, size=1000)

fig, ax = plt.subplots(1, 2, figsize=(8, 3))
sns.histplot(norm, label='normal', color='red', ax=ax[0], kde=True)
sns.histplot(bino, label="binomial", color='green', ax=ax[1], kde=True)
fig.legend()
fig.tight_layout()
../../_images/02386812dde3c386442124099ace292d83828be7c29cc342b72fcf166c56d877.png

Coin flipping! (If you repeat the 100-flip experiment many times and count the number of heads each time, the counts will be approximately normally distributed; see the sketch after this cell.)

# Define the coin sides
coin = ['Heads', 'Tails']

# Simulate 100 coin flips
random.seed(42)
flips = np.random.choice(coin, size=100)

# Count the number of heads
num_heads = np.count_nonzero(flips == 'Heads')

print("Number of Heads:", num_heads)

sns.displot(flips, height=3, aspect=2, bins=10, kde=False)  
Number of Heads: 44
<seaborn.axisgrid.FacetGrid at 0x11598d1d0>
../../_images/8a90e78d20e83c190de96b2dfbfa5d4facff2af576bc11a4f7618eb478ed8c67.png
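
To see the parenthetical claim above in action, we can repeat the 100-flip experiment many times and histogram the head counts; they cluster around 50 in a roughly bell-shaped pattern. A minimal sketch (2,000 repetitions is an arbitrary choice):

### repeat the 100-flip experiment many times and look at the head counts
random.seed(42)
head_counts = random.binomial(n=100, p=0.5, size=2000)   ### one head-count per repetition

plt.figure(figsize=(4, 3))
sns.histplot(head_counts, kde=True)
plt.xlabel('Number of heads in 100 flips')
plt.tight_layout()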

14.2.3. Uniform Distribution#

A uniform distribution is a probability distribution where every outcome within a certain range is equally likely to occur. This means there is no bias—each result has the same probability as any other within the specified set.

Examples:

  • Rolling a fair six-sided die. Each face (1 through 6) has a 1/6 chance of appearing, so the probability of rolling any particular number is equal.​

  • Flipping a fair coin. Heads and tails each have a 50% chance.​

The numpy.random.uniform() method draws samples from a uniform distribution. Samples are uniformly distributed over the half-open interval [low, high) (includes low, but excludes high). In other words, any value within the given interval is equally likely to be drawn by uniform.

The syntax:

random.uniform(low=0.0, high=1.0, size=None)

random.seed(42)

unif1 = random.uniform()                               ### return 1 number [0, 1)
unif2 = random.uniform(low=1, high=5, size=3)          ### return size=3 numbers in [1, 5)
unif3 = random.uniform(low=5, high=10, size=(3, 3))    ### return a 3x3 2D array
unif4 = np.round(random.uniform(low=5, high=10, size=(3, 3)), 2)     ### rounded
unif5 = random.randint(low=5, high=10, size=(3, 3))    ### random integers in [5, 10), high exclusive

for unif in [ unif1, unif2, unif3, unif4, unif5 ]:
    print(unif, '\n')         
0.3745401188473625 

[4.80285723 3.92797577 3.39463394] 

[[5.7800932  5.7799726  5.29041806]
 [9.33088073 8.00557506 8.54036289]
 [5.10292247 9.84954926 9.1622132 ]] 

[[6.06 5.91 5.92]
 [6.52 7.62 7.16]
 [6.46 8.06 5.7 ]] 

[[8 8 7]
 [8 8 5]
 [7 9 7]] 

14.2.3.1. Best Example: Rolling a Die#

We eventually roll the die 1 million times here and see that all six faces occur with roughly equal frequency.

### roll the die an increasing number of times

fig, axes = plt.subplots(1, 5, figsize=(12, 3))

flips = []
for i, ran in enumerate([ 10, 100, 1000, 10000, 1000000 ]):
    for flip in range(ran):
        flips.append(random.randint(1, 7))   ### randint(1, 7) gives ints 1-6; the list keeps growing,
    # sns.histplot(flips, ax=axes[i])        ### so each panel includes all rolls so far
    sns.histplot(flips, ax=axes[i], bins=6)

plt.tight_layout()
../../_images/76ecb7b5ba4ed55e8483c142a4b942b9866ae69f2bc6a2a550e722f0dd7a1b0e.png

Note that binning can affect the plotting greatly.

### displot with height and aspect

sns.displot(flips, height=2.5, aspect=3/2)
sns.displot(flips, height=2.5, aspect=3/2, bins=6)
<seaborn.axisgrid.FacetGrid at 0x1169302d0>
../../_images/8d7adb4ce21f6c1446f8716cb3cdaef81705bf9cadadf8a1e56bb80c2eecbb28.png ../../_images/e391d3cf416cd56e767aeda6ae0d31df19aeb69a58b38a6c57edbdc65914336b.png

Now use randint() to append 1,000,000 random ints.

### roll the die 1 million times
random.seed(42)

flips2 = []
for flip in range(1000000):
    flips2.append(random.randint(1, 7))

plt.figure(figsize=(4, 3))
sns.histplot(flips2)

print(len(flips2))
print(len(flips2)/6)
1000000
166666.66666666666
../../_images/f304b08103e61da1e3c4346e3b65181b76bebaaaf64250d48e64b3de45b014e4.png

Use the random.uniform() function to check. Looks uniform to me.

random.seed(42)
unif = random.uniform(0, 5, size=1000000)

fig, ax = plt.subplots(1, 2, figsize=(6, 4))

sns.histplot(data=unif, bins=6, ax=ax[0])
sns.histplot(data=unif, bins=100, ax=ax[1])

plt.tight_layout()
../../_images/853f38e3c4e8e6298bc0b1b35f653b3d1539ac9b77d1529aba84241620c297b3.png

But when we have a smaller n and more bins, we see spikes:

random.seed(42)
unif = random.uniform(0, 5, size=1000)    ### low inclusive, high exclusive
sns.displot(data=unif, kind='hist', height=3, aspect=3/2, bins=100)
# sns.displot(unif, kind='kde')
# plt.show()
<seaborn.axisgrid.FacetGrid at 0x119279f90>
../../_images/23afbc4b3ae9e7da5fb652f19951731261c1f2c82da8c98ca7d970491a3bdb15.png

Using a scatter plot to visualize the draws.

random.seed(42)
unif = random.uniform(0, 5, size=1000)

sns.scatterplot(data=unif)
<Axes: >
../../_images/4b9ad9b3d0ecee22cf29895b91eb8b418df45a22955f1a8250ce57dcf1d119de.png
fig, axes = plt.subplots(1, 4, figsize=(12, 3)) ### one fig with four axes

unif1 = random.uniform(0, 5, size=100)
unif2 = random.uniform(0, 5, size=1000)
unif3 = random.uniform(0, 5, size=10000)
unif4 = random.uniform(0, 5, size=10000000)

sns.scatterplot(unif1, ax=axes[0])          
sns.scatterplot(unif2, ax=axes[1])
sns.scatterplot(unif3, ax=axes[2])
sns.scatterplot(unif4, ax=axes[3])
<Axes: >
../../_images/9294fe419dcfb764bb15633aebb7007c255e66c7642768e49fed55b17d448de2.png

Now let's plot histograms of the same uniform samples (you could try the same for random.randn()). Note the binning strategy here: the Freedman–Diaconis rule.

fig, axes = plt.subplots(1, 4, figsize=(12, 3)) ### one fig with four axes

datasets = [ unif1, unif2, unif3, unif4 ]
for i, data in enumerate(datasets):

    q25, q75 = np.percentile(data, [25, 75])             ### interquartile range
    bin_width = 2 * (q75 - q25) * len(data) ** (-1/3)    ### Freedman-Diaconis bin width
    bins = int((data.max() - data.min()) / bin_width)    ### number of bins
    
    sns.histplot(data=data, ax=axes[i], bins=bins)      ### plot

plt.tight_layout()
../../_images/449316f8327fda4bb751b4bfb4b9744aef496c797d7e936957c52c2661431ecd.png

Visualizing using KDE.

fig, axes = plt.subplots(1, 4, figsize=(12, 3)) ### one fig with four axes

for i, data in enumerate(datasets):
    sns.kdeplot(data=data, ax=axes[i])      ### KDE needs no binning rule

plt.tight_layout()
../../_images/ebf9ddcd2456f931a140afcbb7e879fa79232a6fd6ac1c5348aba474ef2f0513.png

14.3. The Law of Large Numbers#

We’ll simulate coin flips to show:

  • Randomness

  • Law of large numbers (the running average stabilizes)

Each flip is 0 (tails) or 1 (heads). (The tips.sample(5) call below is a quick reminder that drawing a random set of rows from a DataFrame is itself a form of randomness.)

tips.sample(5)
total_bill tip sex smoker day time size
101 15.38 3.00 Female Yes Fri Dinner 2
127 14.52 2.00 Female No Thur Lunch 2
75 10.51 1.25 Male No Sat Dinner 2
22 15.77 2.23 Female No Sat Dinner 2
220 12.16 2.20 Male Yes Fri Lunch 2
rng = np.random.default_rng(42)        ### the newer Generator API, seeded with 42

type(rng.integers(0, 2, 10))           ### integers() returns a numpy ndarray
rng = np.random.default_rng(42)
flips = rng.integers(0, 2, 200)        ### 200 flips: 0 = tails, 1 = heads
flips
array([0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0,
       0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0,
       0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0,
       1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0,
       1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1,
       0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1,
       0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0,
       0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0,
       1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0,
       0, 1])
np.arange(len(flips)) + 1     ### flip counts 1, 2, 3, ...: the denominators, growing linearly
array([  1,   2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,  13,
        14,  15,  16,  17,  18,  19,  20,  21,  22,  23,  24,  25,  26,
        27,  28,  29,  30,  31,  32,  33,  34,  35,  36,  37,  38,  39,
        40,  41,  42,  43,  44,  45,  46,  47,  48,  49,  50,  51,  52,
        53,  54,  55,  56,  57,  58,  59,  60,  61,  62,  63,  64,  65,
        66,  67,  68,  69,  70,  71,  72,  73,  74,  75,  76,  77,  78,
        79,  80,  81,  82,  83,  84,  85,  86,  87,  88,  89,  90,  91,
        92,  93,  94,  95,  96,  97,  98,  99, 100, 101, 102, 103, 104,
       105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117,
       118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130,
       131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143,
       144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156,
       157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169,
       170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182,
       183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195,
       196, 197, 198, 199, 200])
flips.cumsum()                ### cumulative number of heads; grows more slowly since each flip adds 0 or 1
array([  0,   1,   2,   2,   2,   3,   3,   4,   4,   4,   5,   6,   7,
         8,   9,  10,  11,  11,  12,  12,  13,  13,  13,  14,  15,  16,
        16,  17,  18,  18,  18,  18,  18,  19,  20,  20,  21,  22,  22,
        23,  23,  24,  25,  25,  25,  26,  26,  27,  28,  29,  30,  30,
        30,  30,  30,  30,  31,  31,  32,  33,  34,  35,  35,  36,  36,
        36,  37,  37,  37,  37,  38,  38,  38,  38,  39,  39,  39,  39,
        40,  41,  42,  42,  42,  43,  44,  45,  45,  45,  46,  47,  47,
        48,  49,  49,  50,  50,  50,  51,  52,  52,  53,  53,  54,  54,
        55,  56,  57,  58,  58,  59,  59,  60,  61,  61,  62,  63,  63,
        63,  63,  63,  63,  64,  65,  65,  66,  67,  67,  68,  69,  70,
        71,  72,  72,  73,  74,  74,  75,  75,  75,  75,  76,  76,  76,
        76,  77,  78,  78,  78,  79,  79,  80,  80,  81,  81,  81,  82,
        82,  83,  84,  85,  86,  87,  87,  87,  87,  88,  88,  88,  88,
        88,  88,  88,  89,  90,  91,  91,  92,  92,  93,  94,  95,  95,
        95,  96,  96,  96,  96,  96,  96,  96,  97,  98,  99,  99, 100,
       100, 100, 100, 100, 101])
flips.cumsum() / (np.arange(len(flips)) + 1)  ### running proportion of heads; settles near 0.5
array([0.        , 0.5       , 0.66666667, 0.5       , 0.4       ,
       0.5       , 0.42857143, 0.5       , 0.44444444, 0.4       ,
       0.45454545, 0.5       , 0.53846154, 0.57142857, 0.6       ,
       0.625     , 0.64705882, 0.61111111, 0.63157895, 0.6       ,
       0.61904762, 0.59090909, 0.56521739, 0.58333333, 0.6       ,
       0.61538462, 0.59259259, 0.60714286, 0.62068966, 0.6       ,
       0.58064516, 0.5625    , 0.54545455, 0.55882353, 0.57142857,
       0.55555556, 0.56756757, 0.57894737, 0.56410256, 0.575     ,
       0.56097561, 0.57142857, 0.58139535, 0.56818182, 0.55555556,
       0.56521739, 0.55319149, 0.5625    , 0.57142857, 0.58      ,
       0.58823529, 0.57692308, 0.56603774, 0.55555556, 0.54545455,
       0.53571429, 0.54385965, 0.53448276, 0.54237288, 0.55      ,
       0.55737705, 0.56451613, 0.55555556, 0.5625    , 0.55384615,
       0.54545455, 0.55223881, 0.54411765, 0.53623188, 0.52857143,
       0.53521127, 0.52777778, 0.52054795, 0.51351351, 0.52      ,
       0.51315789, 0.50649351, 0.5       , 0.50632911, 0.5125    ,
       0.51851852, 0.51219512, 0.5060241 , 0.51190476, 0.51764706,
       0.52325581, 0.51724138, 0.51136364, 0.51685393, 0.52222222,
       0.51648352, 0.52173913, 0.52688172, 0.5212766 , 0.52631579,
       0.52083333, 0.51546392, 0.52040816, 0.52525253, 0.52      ,
       0.52475248, 0.51960784, 0.52427184, 0.51923077, 0.52380952,
       0.52830189, 0.53271028, 0.53703704, 0.53211009, 0.53636364,
       0.53153153, 0.53571429, 0.53982301, 0.53508772, 0.53913043,
       0.54310345, 0.53846154, 0.53389831, 0.52941176, 0.525     ,
       0.52066116, 0.52459016, 0.52845528, 0.52419355, 0.528     ,
       0.53174603, 0.52755906, 0.53125   , 0.53488372, 0.53846154,
       0.54198473, 0.54545455, 0.54135338, 0.54477612, 0.54814815,
       0.54411765, 0.54744526, 0.54347826, 0.53956835, 0.53571429,
       0.53900709, 0.53521127, 0.53146853, 0.52777778, 0.53103448,
       0.53424658, 0.53061224, 0.52702703, 0.53020134, 0.52666667,
       0.52980132, 0.52631579, 0.52941176, 0.52597403, 0.52258065,
       0.52564103, 0.52229299, 0.52531646, 0.52830189, 0.53125   ,
       0.53416149, 0.53703704, 0.53374233, 0.5304878 , 0.52727273,
       0.53012048, 0.52694611, 0.52380952, 0.52071006, 0.51764706,
       0.51461988, 0.51162791, 0.51445087, 0.51724138, 0.52      ,
       0.51704545, 0.51977401, 0.51685393, 0.51955307, 0.52222222,
       0.52486188, 0.52197802, 0.51912568, 0.52173913, 0.51891892,
       0.51612903, 0.51336898, 0.5106383 , 0.50793651, 0.50526316,
       0.5078534 , 0.51041667, 0.51295337, 0.51030928, 0.51282051,
       0.51020408, 0.50761421, 0.50505051, 0.50251256, 0.505     ])
rng = np.random.default_rng(42)        ### random generator; seed: 42

flips = rng.integers(0, 2, size=200)   ### 200 0 or 1
running_prop = flips.cumsum() / (np.arange(len(flips)) + 1)     ### cumsum(): cumulative sum
                                                                ### output: an array
plt.figure(figsize=(12,4))
plt.plot(running_prop, marker='o', markersize=2, linewidth=1)   ### matplotlib.pyplot the array
plt.axhline(0.5, color='red', linestyle='--')
plt.xlabel('Number of flips')
plt.ylabel('Proportion heads so far')
plt.title('Law of Large Numbers intuition')
plt.ylim(0,1)
plt.show()
../../_images/05c587a789dfe378fd558db38c9f28a824067a63fdee228602995c6a2475dcff.png