Aggregation and groupby

7.6. Aggregation and groupby#

The groupby method allows you to group rows of data together and call aggregate functions

Efficient summarization is the core of data analysis. Aggregations (such as sum, mean, median, minimum, and maximum) compress key properties of datasets into single values, which can provide significant insights into data when conducting exploratory data analysis. This section examines aggregation in Pandas, starting with array-like operations comparable to those in NumPy, and advancing to more powerful techniques built on grouping (the concept of groupby).

import numpy as np
import pandas as pd

7.6.1. Simple Aggregations#

Now we can start with reviewing some simple aggregations on random values using NumPy and Pandas, e.g., sum and mean. Note that these operations return single values.

### let's use NumPy's new random API to generate some random numbers

### create a random number generator with a fixed SEED for reproducibility
rng = np.random.default_rng(42)
ser = pd.Series(rng.integers(0, 10, 5))
ser
0    0
1    7
2    6
3    4
4    4
dtype: int64
### sum
print(ser.sum())
21
### mean
print(ser.mean())
4.2

Now let’s try a few more aggregations on a DataFrame.

### create a DataFrame with random numbers for two columns 'A' and 'B'
### using the same random number generator
### note this is a dictionary where keys are column names and 
### values are sequences (lists/arrays/range) of data

rng = np.random.default_rng(42)
df = pd.DataFrame(
    {
     'A': rng.random(5),
     'B': rng.random(5)
     }
)
df
A B
0 0.773956 0.975622
1 0.438878 0.761140
2 0.858598 0.786064
3 0.697368 0.128114
4 0.094177 0.450386
df.mean()  ### mean for each column
A    0.572596
B    0.620265
dtype: float64
df.sum()   ### sum for each column
A    2.862978
B    3.101326
dtype: float64

Now, we’ll start with loading the Planets dataset (which is available through the Seaborn visualization package and will be covered later). This dataset has details about more than one thousand extrasolar planets discovered up to 2014.

### more than likely you have not installed seaborn
### you can do so by running the following command in your terminal
### pip install seaborn
### or uncomment the following line to install via pip directly in the notebook 

# %pip install seaborn

### don't forget to comment it back out after installation
### run this cell to load the dataset

import seaborn as sns
planets = sns.load_dataset('planets')   ### syntax to load a dataset from seaborn
### check out the first few rows of the dataset

planets.head()
method number orbital_period mass distance year
0 Radial Velocity 1 269.300 7.10 77.40 2006
1 Radial Velocity 1 874.774 2.21 56.95 2008
2 Radial Velocity 1 763.000 2.60 19.84 2011
3 Radial Velocity 1 326.030 19.40 110.62 2007
4 Radial Velocity 1 516.220 10.50 119.47 2009

Aggregation methods work similarly/the same. Here’s a list of Pandas aggregation methods.

Aggregation

Returns

count

Total number of items

first, last

First and last item

mean, median

Mean and median

min, max

Minimum and maximum

std, var

Standard deviation and variance

mad

Mean absolute deviation

prod

Product of all items

sum

Sum of all items

7.6.2. groupby#

While simple aggregations provide a quick overview of a dataset, analyses often require conditional aggregation by category or index level. In Pandas, this is accomplished with the groupby operation. The term originates from SQL’s GROUP BY, but conceptually it follows Hadley Wickham’s split–apply–combine paradigm: split the data into groups, apply a function to each group, and combine the results into a single output groupby.

../../_images/groupby.png

Fig. 7.1 A visual representation of a groupby operation [Vanderplas, 2022]#

### create sample dataframe
### start with a dictionary

data = {
    'Company': ['GOOG', 'GOOG', 'MSFT', 'MSFT', 'FB', 'FB'],
    'Person': ['Sam', 'Charlie', 'Amy', 'Vanessa', 'Carl', 'Sarah'],
    'Sales': [200, 120, 340, 124, 243, 350]
}
### check out the data dictionary

data
{'Company': ['GOOG', 'GOOG', 'MSFT', 'MSFT', 'FB', 'FB'],
 'Person': ['Sam', 'Charlie', 'Amy', 'Vanessa', 'Carl', 'Sarah'],
 'Sales': [200, 120, 340, 124, 243, 350]}
### create the dataframe

df = pd.DataFrame(data)
df
Company Person Sales
0 GOOG Sam 200
1 GOOG Charlie 120
2 MSFT Amy 340
3 MSFT Vanessa 124
4 FB Carl 243
5 FB Sarah 350

Now you can use the .groupby() method to group rows together based off of a column name. For instance let’s group based off of Company. This will create a DataFrameGroupBy object:

df.groupby('Company')
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x10c2cb250>

You can save this object as a new variable:

by_comp = df.groupby("Company")

And then call aggregate methods off the object:

# by_comp.mean()
by_comp['Sales'].mean()
Company
FB      296.5
GOOG    160.0
MSFT    232.0
Name: Sales, dtype: float64
# df.groupby('Company').mean()
df.groupby('Company')['Sales'].mean()
Company
FB      296.5
GOOG    160.0
MSFT    232.0
Name: Sales, dtype: float64

More examples of aggregate methods:

by_comp['Sales'].std()
Company
FB       75.660426
GOOG     56.568542
MSFT    152.735065
Name: Sales, dtype: float64
by_comp.min()
Person Sales
Company
FB Carl 243
GOOG Charlie 120
MSFT Amy 124
by_comp.max()
Person Sales
Company
FB Sarah 350
GOOG Sam 200
MSFT Vanessa 340
by_comp.count()
Person Sales
Company
FB 2 2
GOOG 2 2
MSFT 2 2
by_comp.describe()
Sales
count mean std min 25% 50% 75% max
Company
FB 2.0 296.5 75.660426 243.0 269.75 296.5 323.25 350.0
GOOG 2.0 160.0 56.568542 120.0 140.00 160.0 180.00 200.0
MSFT 2.0 232.0 152.735065 124.0 178.00 232.0 286.00 340.0
by_comp.describe().transpose()
Company FB GOOG MSFT
Sales count 2.000000 2.000000 2.000000
mean 296.500000 160.000000 232.000000
std 75.660426 56.568542 152.735065
min 243.000000 120.000000 124.000000
25% 269.750000 140.000000 178.000000
50% 296.500000 160.000000 232.000000
75% 323.250000 180.000000 286.000000
max 350.000000 200.000000 340.000000
by_comp.describe().transpose()['GOOG']
Sales  count      2.000000
       mean     160.000000
       std       56.568542
       min      120.000000
       25%      140.000000
       50%      160.000000
       75%      180.000000
       max      200.000000
Name: GOOG, dtype: float64