Sharing some programming knowledge.

0%

Pandas:groupby(Grouping and Aggregating the Data)

groupby()

1
import pandas as pd
1
2
users = pd.read_csv("https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user", sep="|", index_col='user_id')
users.head()

age gender occupation zip_code
user_id
1 24 M technician 85711
2 53 F other 94043
3 23 M writer 32067
4 24 M technician 43537
5 33 F other 15213
1
2
# Use DataFrame.groupby() to return a groupby object
users.groupby("occupation")
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f68cbf2c790>
1
2
3
# Functions like max(), min(), mean(), first(), last() can be applied to the groupby object to calculate summary statistics for each group
# Here we calulated the mean age for each occupation
users.groupby("occupation").mean().head()

age
occupation
administrator 38.746835
artist 31.392857
doctor 43.571429
educator 42.010526
engineer 36.388060
1
2
# You can group by multiple columns
users.groupby(["occupation", "gender"]).mean().head()

age
occupation gender
administrator F 40.638889
M 37.162791
artist F 30.307692
M 32.333333
doctor M 43.571429
1
2
3
# You can specify which column of the groups to calculate
# Here we choose zip_code column
users.groupby(["occupation", "gender"]).zip_code.min().head()
occupation     gender
administrator  F         03062
               M         02154
artist         F         02139
               M         01945
doctor         M         47401
Name: zip_code, dtype: object
1
2
3
4
# There are many parameters in groupby(): 
# DataFrame.groupby(self, by=None, axis=0, level=None, as_index: bool = True, sort: bool = True,
# group_keys: bool = True, squeeze: bool = False, observed: bool = False)
users.groupby(["occupation", "gender"], as_index=False, sort=True).mean().head()

occupation gender age
0 administrator F 40.638889
1 administrator M 37.162791
2 artist F 30.307692
3 artist M 32.333333
4 doctor M 43.571429

agg()

1
2
3
4
5
6
7
8
9
10
11
12
13
14
# The aggregation functionality provided by the agg() function allows multiple statistics to be calculated per 
# group in one calculation.
# Usage of agg():

# simple usage:
gender_ocup = users.groupby(['occupation','gender']).agg('count')
# It equivalent with gender_ocup = users.groupby(['occupation','gender']).count()

# use a dictionary:
# The dictionary keys are used to specify the columns upon which you’d like to perform operations, and the
# dictionary values to specify the function (name) to run.
# use numpy aggregation functions
gender_ocup = users.groupby(['occupation', 'gender']).agg({'gender': 'count'})
gender_ocup.head()

age
occupation gender
administrator F 40
M 49
artist F 29
M 25
doctor M 36
1
2
3
# can use lambdas to cutermize your function
gender_ocup = users.groupby(['occupation', 'gender']).agg({'age': lambda x: max(x)-min(x)})
gender_ocup.head()

age
occupation gender
administrator F 40
M 49
artist F 29
M 25
doctor M 36
1
2
3
# Apply multiple functions to columns in groups using a list of functions:
gender_ocup = users.groupby(['occupation', 'gender']).agg({'age': ['min', 'max', 'mean']})
gender_ocup.head()

age
min max mean
occupation gender
administrator F 22 62 40.638889
M 21 70 37.162791
artist F 19 48 30.307692
M 20 45 32.333333
doctor M 28 64 43.571429
1
2
3
# Rename grouped aggregation columns using tuple:
gender_ocup = users.groupby(['occupation', 'gender']).agg(gender_count=('gender','count'),min_age=('age','min'))
gender_ocup.head()

gender_count min_age
occupation gender
administrator F 36 22
M 43 21
artist F 13 19
M 15 20
doctor M 7 28

Reference:
summarising-aggregation-and-grouping-data-in-python-pandas