Data Wrangling in Python – How to Apply Operations To Groups In Pandas

Apply Operations To Groups In Pandas

Preliminaries


/* import modules */
import pandas as pd
/* Create dataframe */
raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'], 
        'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd','1st', '1st', '2nd', '2nd'], 
        'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon', 'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'], 
        'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
        'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['regiment', 'company', 'name', 'preTestScore', 'postTestScore'])
df
regiment company name preTestScore postTestScore
0 Nighthawks 1st Miller 4 25
1 Nighthawks 1st Jacobson 24 94
2 Nighthawks 2nd Ali 31 57
3 Nighthawks 2nd Milner 2 62
4 Dragoons 1st Cooze 3 70
5 Dragoons 1st Jacon 4 25
6 Dragoons 2nd Ryaner 24 94
7 Dragoons 2nd Sone 31 57
8 Scouts 1st Sloan 2 62
9 Scouts 1st Piger 3 70
10 Scouts 2nd Riani 2 62
11 Scouts 2nd Ali 3 70

/* Create a groupby variable that groups preTestScores by regiment */
groupby_regiment = df['preTestScore'].groupby(df['regiment'])
groupby_regiment
<pandas.core.groupby.SeriesGroupBy object at 0x113ddb550>

“This grouped variable is now a GroupBy object. It has not actually computed anything yet except for some intermediate data about the group key df['key1']. The idea is that this object has all of the information needed to then apply some operation to each of the groups.” – Python for Data Analysis.

 

View a grouping

Use list() to show what a grouping looks like

list(df['preTestScore'].groupby(df['regiment']))
[('Dragoons', 4     3
  5     4
  6    24
  7    31
  Name: preTestScore, dtype: int64), ('Nighthawks', 0     4
  1    24
  2    31
  3     2
  Name: preTestScore, dtype: int64), ('Scouts', 8     2
  9     3
  10    2
  11    3
  Name: preTestScore, dtype: int64)]

Descriptive statistics by group

df['preTestScore'].groupby(df['regiment']).describe()
count mean std min 25% 50% 75% max
regiment
Dragoons 4.0 15.50 14.153916 3.0 3.75 14.0 25.75 31.0
Nighthawks 4.0 15.25 14.453950 2.0 3.50 14.0 25.75 31.0
Scouts 4.0 2.50 0.577350 2.0 2.00 2.5 3.00 3.0

Mean of each regiment’s preTestScore

groupby_regiment.mean()
regiment
Dragoons      15.50
Nighthawks    15.25
Scouts         2.50
Name: preTestScore, dtype: float64

Mean preTestScores grouped by regiment and company

df['preTestScore'].groupby([df['regiment'], df['company']]).mean()
regiment    company
Dragoons    1st         3.5
            2nd        27.5
Nighthawks  1st        14.0
            2nd        16.5
Scouts      1st         2.5
            2nd         2.5
Name: preTestScore, dtype: float64

Mean preTestScores grouped by regiment and company without heirarchical indexing

df['preTestScore'].groupby([df['regiment'], df['company']]).mean().unstack()
company 1st 2nd
regiment
Dragoons 3.5 27.5
Nighthawks 14.0 16.5
Scouts 2.5 2.5

Group the entire dataframe by regiment and company

df.groupby(['regiment', 'company']).mean()
preTestScore postTestScore
regiment company
Dragoons 1st 3.5 47.5
2nd 27.5 75.5
Nighthawks 1st 14.0 59.5
2nd 16.5 59.5
Scouts 1st 2.5 66.0
2nd 2.5 66.0

Number of observations in each regiment and company

df.groupby(['regiment', 'company']).size()
regiment    company
Dragoons    1st        2
            2nd        2
Nighthawks  1st        2
            2nd        2
Scouts      1st        2
            2nd        2
dtype: int64

Iterate an operations over groups


/* Group the dataframe by regiment, and for each regiment, */
for name, group in df.groupby('regiment'): 
    /* print the name of the regiment */
    print(name)
    /* print the data of that regiment */
    print(group)
Dragoons
   regiment company    name  preTestScore  postTestScore
4  Dragoons     1st   Cooze             3             70
5  Dragoons     1st   Jacon             4             25
6  Dragoons     2nd  Ryaner            24             94
7  Dragoons     2nd    Sone            31             57
Nighthawks
     regiment company      name  preTestScore  postTestScore
0  Nighthawks     1st    Miller             4             25
1  Nighthawks     1st  Jacobson            24             94
2  Nighthawks     2nd       Ali            31             57
3  Nighthawks     2nd    Milner             2             62
Scouts
   regiment company   name  preTestScore  postTestScore
8    Scouts     1st  Sloan             2             62
9    Scouts     1st  Piger             3             70
10   Scouts     2nd  Riani             2             62
11   Scouts     2nd    Ali             3             70

Group by columns

Specifically in this case: group by the data types of the columns (i.e. axis=1) and then use list() to view what that grouping looks like

list(df.groupby(df.dtypes, axis=1))
[(dtype('int64'),     preTestScore  postTestScore
  0              4             25
  1             24             94
  2             31             57
  3              2             62
  4              3             70
  5              4             25
  6             24             94
  7             31             57
  8              2             62
  9              3             70
  10             2             62
  11             3             70),
 (dtype('O'),       regiment company      name
  0   Nighthawks     1st    Miller
  1   Nighthawks     1st  Jacobson
  2   Nighthawks     2nd       Ali
  3   Nighthawks     2nd    Milner
  4     Dragoons     1st     Cooze
  5     Dragoons     1st     Jacon
  6     Dragoons     2nd    Ryaner
  7     Dragoons     2nd      Sone
  8       Scouts     1st     Sloan
  9       Scouts     1st     Piger
  10      Scouts     2nd     Riani
  11      Scouts     2nd       Ali)]

In the dataframe “df”, group by “regiments, take the mean values of the other variables for those groups, then display them with the prefix_mean

df.groupby('regiment').mean().add_prefix('mean_')
mean_preTestScore mean_postTestScore
regiment
Dragoons 15.50 61.5
Nighthawks 15.25 59.5
Scouts 2.50 66.0

Create a function to get the stats of a group

def get_stats(group):
    return {'min': group.min(), 'max': group.max(), 'count': group.count(), 'mean': group.mean()}

Create bins and bin up postTestScore by those pins


bins = [0, 25, 50, 75, 100]
group_names = ['Low', 'Okay', 'Good', 'Great']
df['categories'] = pd.cut(df['postTestScore'], bins, labels=group_names)

Apply the get_stats() function to each postTestScore bin

df['postTestScore'].groupby(df['categories']).apply(get_stats).unstack()
count max mean min
categories
Good 8.0 70.0 63.75 57.0
Great 2.0 94.0 94.00 94.0
Low 2.0 25.0 25.00 25.0
Okay 0.0 NaN NaN NaN

Python Example for Beginners

Two Machine Learning Fields

There are two sides to machine learning:

  • Practical Machine Learning:This is about querying databases, cleaning data, writing scripts to transform data and gluing algorithm and libraries together and writing custom code to squeeze reliable answers from data to satisfy difficult and ill defined questions. It’s the mess of reality.
  • Theoretical Machine Learning: This is about math and abstraction and idealized scenarios and limits and beauty and informing what is possible. It is a whole lot neater and cleaner and removed from the mess of reality.

Data Science Resources: Data Science Recipes and Applied Machine Learning Recipes

Introduction to Applied Machine Learning & Data Science for Beginners, Business Analysts, Students, Researchers and Freelancers with Python & R Codes @ Western Australian Center for Applied Machine Learning & Data Science (WACAMLDS) !!!

Latest end-to-end Learn by Coding Recipes in Project-Based Learning:

Applied Statistics with R for Beginners and Business Professionals

Data Science and Machine Learning Projects in Python: Tabular Data Analytics

Data Science and Machine Learning Projects in R: Tabular Data Analytics

Python Machine Learning & Data Science Recipes: Learn by Coding

R Machine Learning & Data Science Recipes: Learn by Coding

Comparing Different Machine Learning Algorithms in Python for Classification (FREE)

Disclaimer: The information and code presented within this recipe/tutorial is only for educational and coaching purposes for beginners and developers. Anyone can practice and apply the recipe/tutorial presented here, but the reader is taking full responsibility for his/her actions. The author (content curator) of this recipe (code / program) has made every effort to ensure the accuracy of the information was correct at time of publication. The author (content curator) does not assume and hereby disclaims any liability to any party for any loss, damage, or disruption caused by errors or omissions, whether such errors or omissions result from accident, negligence, or any other cause. The information presented here could also be found in public knowledge domains.  

Google –> SETScholars