# Summarising, Aggregating, and Grouping data in Python Pandas

I’ve recently started using Python’s excellent Pandas library as a data analysis tool, and, while finding the transition from R’s excellent data.table library frustrating at times, I’m finding my way around and finding most things work quite well.

One aspect that I’ve recently been exploring is the task of grouping large data frames by different variables, and applying summary functions on each group. This is accomplished in Pandas using the “groupby()” and “agg()” functions of Panda’s DataFrame objects.

## A Sample DataFrame

In order to demonstrate the effectiveness and simplicity of the grouping commands, we will need some data. For an example dataset, I have extracted my own mobile phone usage records. I analyse this type of data using Pandas during my work on KillBiller. If you’d like to follow along – the full csv file is available here.

The dataset contains 830 entries from my mobile phone log spanning a total time of 5 months. The CSV file can be loaded into a pandas DataFrame using the pandas.DataFrame.from_csv() function, and looks like this:

index | date | duration | item | month | network | network_type |
---|---|---|---|---|---|---|

0 | 15/10/14 06:58 | 34.429 | data | 2014-11 | data | data |

1 | 15/10/14 06:58 | 13.000 | call | 2014-11 | Vodafone | mobile |

2 | 15/10/14 14:46 | 23.000 | call | 2014-11 | Meteor | mobile |

3 | 15/10/14 14:48 | 4.000 | call | 2014-11 | Tesco | mobile |

4 | 15/10/14 17:27 | 4.000 | call | 2014-11 | Tesco | mobile |

5 | 15/10/14 18:55 | 4.000 | call | 2014-11 | Tesco | mobile |

6 | 16/10/14 06:58 | 34.429 | data | 2014-11 | data | data |

7 | 16/10/14 15:01 | 602.000 | call | 2014-11 | Three | mobile |

8 | 16/10/14 15:12 | 1050.000 | call | 2014-11 | Three | mobile |

9 | 16/10/14 15:30 | 19.000 | call | 2014-11 | voicemail | voicemail |

10 | 16/10/14 16:21 | 1183.000 | call | 2014-11 | Three | mobile |

11 | 16/10/14 22:18 | 1.000 | sms | 2014-11 | Meteor | mobile |

… | … | … | … | … | … | … |

The main columns in the file are:

**date:**The date and time of the entry**duration:**The duration (in seconds) for each call, the amount of data (in MB) for each data entry, and the number of texts sent (usually 1) for each sms entry.**item:**A description of the event occurring – can be one of call, sms, or data.**month:**The billing month that each entry belongs to – of form ‘YYYY-MM’.**network:**The mobile network that was called/texted for each entry.**network_type:**Whether the number being called was a mobile, international (‘world’), voicemail, landline, or other (‘special’) number.

Phone numbers were removed for privacy. The date column can be parsed using the extremely handy dateutil library.

1 2 3 4 5 6 7 | import pandas as pd import dateutil # Load data from csv file data = pd.DataFrame.from_csv('phone_data.csv') # Convert date from string to date times data['date'] = data['date'].apply(dateutil.parser.parse, dayfirst=True) |

## Summarising the DataFrame

Once the data has been loaded into Python, Pandas makes the calculation of different statistics very simple. For example, mean, max, min, standard deviations and more for columns are easily calculable:

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 | # How many rows the dataset data['item'].count() Out[38]: 830 # What was the longest phone call / data entry? data['duration'].max() Out[39]: 10528.0 # How many seconds of phone calls are recorded in total? data['duration'][data['item'] == 'call'].sum() Out[40]: 92321.0 # How many entries are there for each month? data['month'].value_counts() Out[41]: 2014-11 230 2015-01 205 2014-12 157 2015-02 137 2015-03 101 dtype: int64 # Number of non-null unique network entries data['network'].nunique() Out[42]: 9 |

The need for custom functions is minimal unless you have very specific requirements. The full range of basic statistics that are quickly calculable and built into the base Pandas package are:

Function | Description |
---|---|

count | Number of non-null observations |

sum | Sum of values |

mean | Mean of values |

mad | Mean absolute deviation |

median | Arithmetic median of values |

min | Minimum |

max | Maximum |

mode | Mode |

abs | Absolute Value |

prod | Product of values |

std | Unbiased standard deviation |

var | Unbiased variance |

sem | Unbiased standard error of the mean |

skew | Unbiased skewness (3rd moment) |

kurt | Unbiased kurtosis (4th moment) |

quantile | Sample quantile (value at %) |

cumsum | Cumulative sum |

cumprod | Cumulative product |

cummax | Cumulative maximum |

cummin | Cumulative minimum |

## Summarising Groups in the DataFrame

There’s further power put into your hands by mastering the Pandas “groupby()” functionality. Groupby essentially splits the data into different groups depending on a variable of your choice. For example, the expression data.groupby('month') will split our current DataFrame by month. The groupby() function returns a GroupBy object, but essentially describes how the rows of the original data set has been split. the GroupBy object .groups variable is a dictionary whose keys are the computed unique groups and corresponding values being the axis labels belonging to each group. For example:

1 2 3 4 5 | data.groupby(['month']).groups.keys() Out[59]: ['2014-12', '2014-11', '2015-02', '2015-03', '2015-01'] len(data.groupby(['month']).groups['2014-11']) Out[61]: 230 |

Functions like max(), min(), mean(), first(), last() can be quickly applied to the GroupBy object to obtain summary statistics for each group – an immensely useful function. This functionality is similar to the dplyr and plyr libraries for R. Different variables can be excluded / included from each summary requirement.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 | # Get the first entry for each month data.groupby('month').first() Out[69]: date duration item network network_type month 2014-11 2014-10-15 06:58:00 34.429 data data data 2014-12 2014-11-13 06:58:00 34.429 data data data 2015-01 2014-12-13 06:58:00 34.429 data data data 2015-02 2015-01-13 06:58:00 34.429 data data data 2015-03 2015-02-12 20:15:00 69.000 call landline landline # Get the sum of the durations per month data.groupby('month')['duration'].sum() Out[70]: month 2014-11 26639.441 2014-12 14641.870 2015-01 18223.299 2015-02 15522.299 2015-03 22750.441 Name: duration, dtype: float64 # Get the number of dates / entries in each month data.groupby('month')['date'].count() Out[74]: month 2014-11 230 2014-12 157 2015-01 205 2015-02 137 2015-03 101 Name: date, dtype: int64 # What is the sum of durations, for calls only, to each network data[data['item'] == 'call'].groupby('network')['duration'].sum() Out[78]: network Meteor 7200 Tesco 13828 Three 36464 Vodafone 14621 landline 18433 voicemail 1775 Name: duration, dtype: float64 |

You can also group by more than one variable, allowing more complex queries.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 | # How many calls, sms, and data entries are in each month? data.groupby(['month', 'item'])['date'].count() Out[76]: month item 2014-11 call 107 data 29 sms 94 2014-12 call 79 data 30 sms 48 2015-01 call 88 data 31 sms 86 2015-02 call 67 data 31 sms 39 2015-03 call 47 data 29 sms 25 Name: date, dtype: int64 # How many calls, texts, and data are sent per month, split by network_type? data.groupby(['month', 'network_type'])['date'].count() Out[82]: month network_type 2014-11 data 29 landline 5 mobile 189 special 1 voicemail 6 2014-12 data 30 landline 7 mobile 108 voicemail 8 world 4 2015-01 data 31 landline 11 mobile 160 .... |

## Multiple Statistics per Group

The final piece of syntax that we’ll examine is the “agg()” function for Pandas. The aggregation functionality provided by the agg() function allows multiple statistics to be calculated per group in one calculation. The syntax is simple, and is similar to that of MongoDB’s aggregation framework.

Instructions for aggregation are provided in the form of a python dictionary. Use the dictionary keys to specify the columns upon which you’d like to operate, and the values to specify the function to run.

For example:

1 2 3 4 | # Group the data frame by month and item and extract a number of stats from each group data.groupby(['month', 'item']).agg({'duration':sum, # find the sum of the durations for each group 'network_type': "count", # find the number of network type entries 'date': 'first'}) # get the first date per group |

The aggregation dictionary syntax is flexible and can be defined before the operation. You can also define functions inline using “lambda” functions to extract statistics that are not provided by the built-in options.

1 2 3 4 5 6 | # Define the aggregation procedure outside of the groupby operation aggregations = { 'duration':'sum', 'date': lambda x: max(x) } data.groupby('month').agg(aggregations) |

The final piece of the puzzle is the ability to rename the newly calculated columns and to calculate multiple statistics from a single column in the original data frame. Such calculations are possible through nested dictionaries, or by passing a list of functions for a column. Our final example calculates multiple values from the duration column and names the results appropriately. Note that the results have multi-indexed column headers.

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 | # Define the aggregation calculations aggregations = { 'duration': { # work on the "duration" column 'total_duration': 'sum', # get the sum, and call this result 'total_duration' 'average_duration': 'mean', # get mean, call result 'average_duration' 'num_calls': 'count' }, 'date': { # Now work on the "date" column 'max_date': 'max', # Find the max, call the result "max_date" 'min_date': 'min', 'num_days': lambda x: max(x) - min(x) # Calculate the date range per group }, 'network': ["count", "max"] # Calculate two results for the 'network' column with a list } # Perform groupby aggregation by "month", but only on the rows that are of type "call" data[data['item'] == 'call'].groupby('month').agg(aggregations) |

The groupby functionality in Pandas is well documented in the official docs and performs at speeds on a par (unless you have massive data and are picky with your milliseconds) with R’s data.table and dplyr libraries.

If you are interested in another example for practice, I used these same techniques to analyse weather data for this post, and I’ve put “how-to” instructions here.

## shenwei356

it’s cool

## JB

Man, this is gold! Good job!

## evanglerm

Most Intelligent and best blog which Finally solved all syntax for Pandas for me. Thanks a lot.

## Purni

Awesome ! :-B)

## Senthil

This blog answers most of my questions in Pandas!! Thank You!

## evanglerm

I keep Comming to this blog for groupby related things. Very goo Blog/information.

## Chandler

Great post ! Learned so much from your blog. Thanks.

BTW: How to draw a table like you did in out[47] ? It looks nice !

Thank you !

## shanelynn

Hi Chandler, thanks for writing in, great to see that people are reading the blog and finding it interesting.

The output in the diagram is the default layout for Pandas Dataframes when you print them in a “Jupyter notebook”. If you’re not already using notebooks – I would recommend having a look as a great tool to share analysis results and explore things. You should find the install instructions at http://jupyter.org and they also come installed with the “Anaconda” python distribution. https://www.continuum.io/downloads

## Ashley

This is amazing, thank you! One question, once you’ve grouped and aggregated data, how do you select it and filter on it? For example, in your last example, you have a column for count. How would you limit the data in your df to only include counts of above or below a certain number? This is easy for me in SQL, but I haven’t been able to understand how to do this in Pandas.

Thanks!

## shanelynn

Hi Ashley, thanks for the feedback. Filtering in Pandas is pretty easy, I tend to go with logical vectors to filter the data frame. So for example, to filter a data frame “df” on the “count” variable, you can use df[df[‘count’] > 5] or df[df[‘count’] == 10], or you can specify the index separately:

idx = (df[‘count’] > 1) & (df[‘count’] < 10) # get index where count is between 1 and 10. df = df[idx] # Filter the actual data frame.

## Ashley

Thank you very much!

## Marc Weber

Great post!

## Khurram

Love the post, it helped me understand the agg function.

## Marion

Really helpful, thanks a lot for this great post 🙂

## Am

Hi , I tried using the agg() to get mean,std and max for a column in DF , but it gives me an error ‘Series object has no attribute ‘agg’ . Could you help me with that.

## shanelynn

Hey Am, sounds like you are trying to apply the .agg() function to a pandas Series rather than a DataFrame – have a look at the datatype to check before you run the code.

## Marta Karaś

This post is awesome – clearly written, comprehensive enough, with good, short examples. Helped me a lot and arrived in my wiki bookmarks. Thank you!

## Trevor Hughes

Thanks – very helpful – please note typo in the first code block

data = pd.DataFrame.from_csv(‘phone_data.cv’) you have cv instead of csv – took me a few runs to figure it out!

## shanelynn

Great spot Trevor, I’ll fix that up now! Glad you found the post useful, and good luck with your python and pandas learning.

## gongyunrou

really helpful , thanks so much for sharing

## Chala

Excellent blog. Thanks a lot for taking time to put this together and in the process helping many people like me understand Pandas better.

There is lot of material out there on Pandas but I think this is one of the best in terms of explaining stuff with excellent example and clarity.

Great Job!

## shanelynn

Thanks Chala, glad that you found it useful. I think it fits well with the pandas stuff out there – but perhaps with a slant towards a data-science user, which is how I use pandas!

## germ

Very nice write-up. Do you know how to preserve the order of the aggregated columns? They do not show up in the same order as give in the aggregators object.

## shanelynn

Great question, and I don’t know the answer – the columns in the results do appear to be relatively randomly ordered. There appears to be a relationship with whether you have “sub-queries” to the order in the pandas output, but I think you may just have to order them yourself afterwards if order is important.

## Naren

Amazing post and even more amazing color combination for the website….its very unique!!!

## Stanley

Thanks for this amazing tutorial!

What’s the best way to output to a new df after the groupby and aggregate?

## shanelynn

Hi Stanley – you should be able to assign a new variable with the result… e.g. new_df = df.groupby(..).agg(..)

## Manuel Hernandez (@manuelhe)

Awesome work. Thanks a bunch!

## Sarah

Thank you so much for this post! You’ve solved the problem I’d been struggling with for ages due to a misunderstanding about pandas operations. I can finally move on with my project!

## Michael

Thanks for a great post, really useful!

I just had one problem reproducing your last block on defining the aggregation calculation and renaming columns – specifically

‘num_days’: lambda x: max(x) – min(x) # Calculate the date range per group

returned an error

TypeError: unsupported operand type(s) for -: ‘str’ and ‘str’

I am using python 3.5 – dont know if that makes a difference?

## shanelynn

Hey Michael, sounds like the data type for the column “num_days” in your data frame is being loaded as a string. Have a look using data[‘num_days’].dtype and ensure that it’s an integer/float before you run the code. There may be something up with the CSV input data in this case.

## jeril

Very informative

## kisky

Great Blog !!! Very helful.

I have a question .. please help me to clarify.

I have to add a new column in my panda dataframe and needs to copy the records in new column from the other column of same dataframe on condition like df.groupby([id]).first().

I know how to do it in sas but not sure how to do it in pandas:

In sas i can do something like this:

data df;

retain col_1″ “;

set df1;

by id;

if first.id then col_1=col_2;

else col_1=col_1;

run;

## Peter

Nice and helpful!

I’m comparing this to SQL and have 1 question:

How about finding a sum for only 2 longest calls every moth?

Can we use groupby(), agg() and rank() together in one go?

For example, can we use data.groupby(‘month’) and then use [‘duration’].rank(ascending=False)<=2 to find only the 2 longest call durations for every month, and then also agg() those 2 call durations for every month?

## DongMei

Thank you for sharing. The post is very helpful.

I have one question here. In aggregations, how to get quantile .05, .25? I used quantile(.05), but always got error below:

‘SeriesGroupBy’ object has no attribute ‘quantile(.05)

Can someone help to advise the solution?

## shanelynn

Hi DongMei, I think you can work out the quantile using a lambda function. Try something like:

`data.groupby("group_column").agg({"work_column": lambda x: x.quantile(0.05)})`

For two columns in the output, you can drop a level from the column headers using:

`data.columns = data.columns.droplevel(level=0)`

Be careful of your naming conventions though if you do this.

## DongMei

One more question, based on the output, how to change the table format to show only one column header?

## DongMei

The post is very helpful.

I have one question here. In the agg statement, how to get quantile 0.05, 0.25, …?

I tried quantile(.05) in agg, but always got following error:

AttributeError: ‘SeriesGroupBy’ object has no attribute ‘quantile(0.05)’

Can someone help on it? Thank you

## DongMei

Thank you Shanelynn. Truly appreciate for the help. ^-^

The lambda is working well in this way as you suggested.

Yet, as my aggregations are too many, so I defined a dictionary: dict, and then used in agg(dict) function.

Others like mean, std are working well, only for quantile, it always prompt error.

I tried following ways:

1. dict[‘work_column’] = ‘quantile’ – fine, but no 0.05

2. dict[‘work_column’] = ‘quantile(0.05)’ – error

3. dict[‘work_column’] = lambda x: x.quantile(0.05) – error

So if I use this way, how to solve the error?

## DongMei

Sorry, I made a mistake by missing out one “x” in lambda in my python program. Solution 3: dict[‘work_column’] = lambda x: x.quantile(0.05) is working now.

Thank you Shanelynn again for the help there.

## Mallik

I would like to convert spark dataframe to pandas data frame for below calculations. would some one help me here please.I am new to pandas dataframe.

for x in (column name list which is coming from the input parameter)

value = df.withcolumn(x,df[x]*df[column name which is coming from the input parameter]

value1 = df.groupby(column name which is coming from the input parameter).agg(*F.sum(x).alias(‘y_’+x) for x in (column name list which is coming from the input parameter),F.sum(y).alias(yy))

value2 = value1.withcolumn(‘y_’+x,value1[‘y_’+x]/valu1[yy]

## Daisy

Hey shanelynn, I have the same question, it cannot to check the type of the data[‘num_days’], because it is a new variable established in the aggregations. when loading the data at first, the type of the variables (except duration) are all object. Could you kindly help me solve this problems, thanks a lot.

## Marcus

This is really helpful! Thank you for pulling everything together.