Statistical analysis is used to examine results and deduce or infer meaning about the underlying dataset, or the reality it attempts to describe.
Statistical analysis may be used to:
- Present key findings revealed by a dataset.
- Summarize information.
- Calculate measures of cohesiveness, relevance, or diversity in data.
- Make future predictions based on previously recorded data.
- Test experimental predictions.
Any data scientist will spend much of their day performing these functions.
Aggregation functions
An essential piece of analyzing large data is efficient summarization: computing aggregations like sum(), mean(), median(), min(), and max(), in which a single number gives insight into the nature of a potentially large dataset. In this section, we’ll explore aggregations in Pandas, starting with simple operations akin to those we’ve seen on NumPy arrays.
The .sum() function returns the sum of the values. Applied to a DataFrame, it returns the sum of each column by default.
Consider the following DataFrame:

|   | A  | B  | C  | D  |
|---|----|----|----|----|
| 0 | 83 | 67 | 17 | 38 |
| 1 | 92 | 72 | 81 | 16 |
| 2 | 54 | 83 | 27 | 13 |
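If you want to follow along, here is a minimal sketch of how this DataFrame could be constructed (the values are copied from the table above, and the variable name df is assumed throughout this section):

import pandas as pd

# Example data matching the table above
df = pd.DataFrame({
    'A': [83, 92, 54],
    'B': [67, 72, 83],
    'C': [17, 81, 27],
    'D': [38, 16, 13]
})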
df.sum()
Output:
A | 229 |
B | 222 |
C | 125 |
D | 67 |
dtype: int64
We can also find the sum of an individual column in the DataFrame:
df['A'].sum()
Output:
229
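Aggregations can also be computed across each row instead of each column by passing axis=1 (equivalently axis='columns'); for example:

df.sum(axis=1)   # row-wise sums: one value per row of the DataFrame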
Now we will learn how to find the mean(), median(), min(), and max() values. We already know what the mean and median are.
The .mean() function returns the mean of the values for the requested axis. If the method is applied to a pandas Series object, it returns a scalar value: the mean of all the observations in the Series. If it is applied to a pandas DataFrame object, it returns a pandas Series containing the mean of the values over the specified axis.
df.mean()
Output:
A | 76.33333 |
B | 74.00000 |
C | 41.66667 |
D | 22.33333 |
dtype: float64
We can also find the mean of individual columns in the DataFrame
df['A'].mean()
Output:
76.33333333333333
The .median() function returns the median of the underlying data in the given DataFrame.
df.median()
Output:
A | 83.0 |
B | 72.0 |
C | 27.0 |
D | 16.0 |
dtype: float64
The .max() function returns the maximum of the values in the given object. If the input is a Series, the method returns a scalar, the maximum of the values in the Series. If the input is a DataFrame, the method returns a Series with the maximum of the values over the specified axis. By default, the axis is the index axis.
df.max()
Output:
A | 92 |
B | 83 |
C | 81 |
D | 38 |
dtype: int64
Similarly, the .min() function returns the minimum of the values in the given object. If the input is a Series, the method returns a scalar, the minimum of the values in the Series. If the input is a DataFrame, the method returns a Series with the minimum of the values over the specified axis.
df.min()
Output:
A | 54 |
B | 67 |
C | 17 |
D | 13 |
dtype: int64
.count() is used to count the number of non-NA/null observations across the given axis. It works with non-floating-point data as well.
df.count()
Output:
A | 3 |
B | 3 |
C | 3 |
D | 3 |
dtype: int64
The .std() function returns the sample standard deviation over the requested axis.
df.std()
Output:
A | 19.857828 |
B | 8.185353 |
C | 34.428670 |
D | 13.650397 |
dtype: float64
If there are missing values in the DataFrame, we can skip them by passing the skipna argument, e.g. df.std(skipna=True); skipping NaN values is also the default behaviour.
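As a quick illustration of skipna, consider a small hypothetical Series containing a missing value (NumPy is assumed to be available for np.nan):

import numpy as np

s = pd.Series([83, np.nan, 54])
s.std(skipna=True)    # the NaN is ignored; the std is computed from 83 and 54
s.std(skipna=False)   # returns NaN because a missing value is present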
.describe() is used to view basic statistical details such as the percentiles, mean, and standard deviation of a DataFrame or a Series of numeric values. When the method is applied to a Series of strings, it returns a different kind of summary, shown after the numeric example below.
df.describe()
Output:
|       | A         | B         | C         | D         |
|-------|-----------|-----------|-----------|-----------|
| count | 3.000000  | 3.000000  | 3.000000  | 3.000000  |
| mean  | 76.333333 | 74.000000 | 41.666667 | 22.333333 |
| std   | 19.857828 | 8.185353  | 34.428670 | 13.650397 |
| min   | 54.000000 | 67.000000 | 17.000000 | 13.000000 |
| 25%   | 68.500000 | 69.500000 | 22.000000 | 14.500000 |
| 50%   | 83.000000 | 72.000000 | 27.000000 | 16.000000 |
| 75%   | 87.500000 | 77.500000 | 54.000000 | 27.000000 |
| max   | 92.000000 | 83.000000 | 81.000000 | 38.000000 |
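For a Series of strings, .describe() reports counts and frequencies rather than numeric statistics. A small hypothetical example:

s = pd.Series(['Riders', 'Devils', 'Riders'])
s.describe()   # reports count, unique, top (most frequent value) and freq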
.corr() is used to find the pairwise correlation of all columns in the DataFrame. Any NaN values are automatically excluded, and non-numeric columns are ignored.
df.corr()
Output:
|   | A         | B         | C         | D         |
|---|-----------|-----------|-----------|-----------|
| A | 1.000000  | -0.858233 | 0.569956  | 0.394121  |
| B | -0.858233 | 1.000000  | -0.067421 | -0.809964 |
| C | 0.569956  | -0.067421 | 1.000000  | -0.530536 |
| D | 0.394121  | -0.809964 | -0.530536 | 1.000000  |
The output DataFrame can be read as follows: the value in each cell is the correlation between the row variable and the column variable. Since the correlation of a variable with itself is always 1, all the diagonal values are 1.00.
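As a cross-check of the table above, a single pairwise correlation can also be computed directly on two columns; for example, the correlation between A and B:

df['A'].corr(df['B'])   # approximately -0.858233, matching the A/B cell in the correlation matrix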
Importance of GroupBy in Statistical Analysis
Pandas’ GroupBy is a powerful and versatile tool. It allows you to split your data into separate groups and perform computations on each group for better analysis. GroupBy is used to quickly summarize and aggregate data in a way that’s easy to interpret.
The .groupby() method allows you to group rows of data together and call aggregate functions on each group.
Consider the following DataFrame:
|    | Points | Rank | Team   | Year |
|----|--------|------|--------|------|
| 0  | 876    | 1    | Riders | 2014 |
| 1  | 789    | 2    | Riders | 2015 |
| 2  | 863    | 2    | Devils | 2014 |
| 3  | 673    | 3    | Devils | 2015 |
| 4  | 741    | 3    | Kings  | 2014 |
| 5  | 812    | 4    | kings  | 2015 |
| 6  | 756    | 1    | Kings  | 2016 |
| 7  | 788    | 1    | Kings  | 2017 |
| 8  | 694    | 2    | Riders | 2016 |
| 9  | 701    | 4    | Royals | 2014 |
| 10 | 804    | 1    | Royals | 2015 |
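For reference, this DataFrame could be built as follows (values copied from the table above; note that 'kings' in row 5 is deliberately lower-case, which will matter when we group by Team):

df = pd.DataFrame({
    'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804],
    'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1],
    'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings',
             'Kings', 'Kings', 'Riders', 'Royals', 'Royals'],
    'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015]
})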
Now you can use the .groupby() method to group rows together based on a column name. For instance, let’s group by Team. This creates a DataFrameGroupBy object.
df.groupby('Team')
The grouped data is saved in an object
Output:
<pandas.core.groupby.DataFrameGroupBy object at 0x113014128>
Now we can save this object as a new variable with the name team
team = df.groupby('Team')
Now we can perform all the aggregation functions on that variable
team.mean().astype('int32')
Output:
| Team   | Points | Rank | Year |
|--------|--------|------|------|
| Devils | 768    | 2    | 2014 |
| Kings  | 761    | 1    | 2015 |
| Riders | 786    | 1    | 2015 |
| Royals | 752    | 2    | 2014 |
| kings  | 812    | 4    | 2015 |
We can perform aggregation functions such as mean, median, min, and max on this grouped data.
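If you want several aggregates at once, the .agg() method accepts a list of function names. A small sketch, restricted to the Points column to keep the output compact:

team['Points'].agg(['mean', 'median', 'max'])   # one row per team, one column per aggregate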
We can also group by two columns at the same time. This is how you can do it:
teams=df.groupby(['Team','Rank'])
teams.mean().astype('int32')
Output:
| Team   | Rank | Points | Year |
|--------|------|--------|------|
| Devils | 2    | 863    | 2014 |
|        | 3    | 673    | 2015 |
| Kings  | 1    | 772    | 2016 |
|        | 3    | 741    | 2014 |
| Riders | 1    | 876    | 2014 |
|        | 2    | 741    | 2015 |
| Royals | 1    | 804    | 2015 |
|        | 4    | 701    | 2014 |
| kings  | 4    | 812    | 2015 |
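The result above is indexed by a MultiIndex of (Team, Rank). If plain columns are more convenient, for example before exporting the result, the index can be turned back into ordinary columns with reset_index():

teams.mean().astype('int32').reset_index()   # Team and Rank become regular columns again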
Filtering and Sorting
Statistical analysis often requires a lot of filtering operations, and Pandas provides many ways to filter a DataFrame.
We can filter the data using Python Comparison Operators and Python Boolean Operators
Comparison Operators
These operators compare the values on either side of them and determine the relation between them. They are also called relational operators.
- == : If the values of the two operands are equal, the condition is true.
- != : If the values of the two operands are not equal, the condition is true.
- > : If the value of the left operand is greater than the value of the right operand, the condition is true.
- < : If the value of the left operand is less than the value of the right operand, the condition is true.
- >= : If the value of the left operand is greater than or equal to the value of the right operand, the condition is true.
- <= : If the value of the left operand is less than or equal to the value of the right operand, the condition is true.
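Applied to a pandas column, a comparison operator works element-wise and returns a boolean Series (often called a mask), which can then be used to select rows. For example, using the DataFrame from the GroupBy section:

mask = df['Rank'] == 1   # element-wise comparison; a boolean Series with one value per row
df[mask]                 # keeps only the rows where the mask is True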
Boolean Operators
Python has three Boolean operators that are typed out as plain English words:
- or : Returns True if at least one of the statements is true.
- and : Returns True if both statements are true.
- not : Reverses the result; returns False if the result is true.
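Note that when combining boolean masks built from pandas columns, the plain keywords and/or/not do not work element-wise. Use the operators & (and), | (or), and ~ (not) instead, and wrap each condition in parentheses:

df[(df['Rank'] == 1) & (df['Points'] > 800)]   # both conditions must hold
df[(df['Rank'] == 1) | (df['Points'] > 800)]   # at least one condition must hold
df[~(df['Rank'] == 1)]                         # negates the condition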
Now we will learn how to filter the values of the DataFrame based on certain conditions.
Consider the following DataFrame
|    | Points | Rank | Team   | Year |
|----|--------|------|--------|------|
| 0  | 876    | 1    | Riders | 2014 |
| 1  | 789    | 2    | Riders | 2015 |
| 2  | 863    | 2    | Devils | 2014 |
| 3  | 673    | 3    | Devils | 2015 |
| 4  | 741    | 3    | Kings  | 2014 |
| 5  | 812    | 4    | kings  | 2015 |
| 6  | 756    | 1    | Kings  | 2016 |
| 7  | 788    | 1    | Kings  | 2017 |
| 8  | 694    | 2    | Riders | 2016 |
| 9  | 701    | 4    | Royals | 2014 |
| 10 | 804    | 1    | Royals | 2015 |
Suppose we want to find the teams with rank 1. This is how we can do it:
df[df['Rank'] == 1]
Output:
|    | Points | Rank | Team   | Year |
|----|--------|------|--------|------|
| 0  | 876    | 1    | Riders | 2014 |
| 6  | 756    | 1    | Kings  | 2016 |
| 7  | 788    | 1    | Kings  | 2017 |
| 10 | 804    | 1    | Royals | 2015 |
If we want to find the teams with rank less than 3:
df[df['Rank'] < 3]
Output:
|    | Points | Rank | Team   | Year |
|----|--------|------|--------|------|
| 0  | 876    | 1    | Riders | 2014 |
| 1  | 789    | 2    | Riders | 2015 |
| 2  | 863    | 2    | Devils | 2014 |
| 6  | 756    | 1    | Kings  | 2016 |
| 7  | 788    | 1    | Kings  | 2017 |
| 8  | 694    | 2    | Riders | 2016 |
| 10 | 804    | 1    | Royals | 2015 |
We can also check multiple conditions by combining masks with the element-wise Boolean operators described above.
Now we will find the teams with ranking 1 and points greater than 800
df[(df['Rank'] == 1) & (df['Points'] > 800)]
Output:
|    | Points | Rank | Team   | Year |
|----|--------|------|--------|------|
| 0  | 876    | 1    | Riders | 2014 |
| 10 | 804    | 1    | Royals | 2015 |
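Although the examples above focus on filtering, the same DataFrame can also be sorted. A brief sketch using sort_values():

df.sort_values('Points', ascending=False)                     # highest-scoring rows first
df.sort_values(['Team', 'Points'], ascending=[True, False])   # by Team, then by Points within each team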