Statistical analysis

Applying Descriptive Statistics using Pandas

Statistical analysis is used to analyze data and to infer meaning about the underlying dataset or the reality that it attempts to describe.

Statistical analysis may be used to:

  • Present key findings revealed by a dataset.
  • Summarize information.
  • Calculate measures of cohesiveness, relevance, or diversity in data.
  • Make future predictions based on previously recorded data.
  • Test experimental predictions.

Any data scientist will spend much of their day performing these functions.

Aggregation functions

An essential piece of analysis of large data is efficient summarization: computing aggregations like sum(), mean(), median(), min(), and max(), in which a single number gives insight into the nature of a potentially large dataset. In this section, we'll explore aggregations in Pandas, starting with simple operations akin to what we've seen on NumPy arrays and moving on to grouped aggregations with GroupBy.

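The .sum() function returns the sum of the values over the requested axis. Applied to a DataFrame, it returns a Series with one sum per column by default; applied to a Series, it returns a single number.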

Consider the following DataFrame:

    A   B   C   D
0  83  67  17  38
1  92  72  81  16
2  54  83  27  13

df.sum()

Output:

A    229
B    222
C    125
D     67
dtype: int64

We can also find the sum of an individual column in the DataFrame:

df['A'].sum()

Output:

229
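The article does not show how df is created. The following is a minimal sketch, assuming pandas is imported as pd and the DataFrame is built from a plain dictionary (that construction step is an assumption; the values are those of the table above). It also shows axis=1, which sums across each row instead of down each column.

```python
import pandas as pd

# Rebuild the example DataFrame from the table above.
# Building it from a dict is an assumption; the article does not show this step.
df = pd.DataFrame({
    "A": [83, 92, 54],
    "B": [67, 72, 83],
    "C": [17, 81, 27],
    "D": [38, 16, 13],
})

col_sums = df.sum()         # one sum per column (axis=0 is the default)
row_sums = df.sum(axis=1)   # one sum per row
a_sum = df["A"].sum()       # a single column (Series) reduces to one number

print(col_sums)
print(row_sums)
print(a_sum)
```

The column sums match the output shown above; the row sums are an extra view of the same data.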

Now we will learn how to find the mean(), median(), min(), and max() values. We already know what the mean and the median are.

The .mean() function returns the mean of the values over the requested axis. Applied to a pandas Series, it returns a scalar value: the mean of all the observations in the Series. Applied to a pandas DataFrame, it returns a Series that contains the mean of the values over the specified axis.

df.mean()

Output:

A    76.333333
B    74.000000
C    41.666667
D    22.333333
dtype: float64

We can also find the mean of an individual column in the DataFrame:

df['A'].mean()

Output:

76.33333333333333
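The difference between the Series and DataFrame return types described above can be checked directly. This is a small sketch that assumes the same example data, rebuilt here so the snippet runs on its own.

```python
import pandas as pd

# Same example data as in the article, rebuilt so the snippet is self-contained
df = pd.DataFrame({
    "A": [83, 92, 54],
    "B": [67, 72, 83],
    "C": [17, 81, 27],
    "D": [38, 16, 13],
})

col_means = df.mean()         # DataFrame -> Series, one mean per column
row_means = df.mean(axis=1)   # axis=1 -> Series, one mean per row
a_mean = df["A"].mean()       # Series -> a single float

print(type(col_means).__name__)   # Series
print(row_means.tolist())         # [51.25, 65.25, 44.25]
print(a_mean)
```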

The .median() function returns the median of each column in the given DataFrame.

df.median()

Output:

A    83.0
B    72.0
C    27.0
D    16.0
dtype: float64

The .max() function returns the maximum of the values in the given object. If the input is a Series, it returns a scalar: the maximum of the values in the Series. If the input is a DataFrame, it returns a Series with the maximum over the specified axis. The default is the index axis (axis=0), which gives one maximum per column.

df.max()

Output:

A    92
B    83
C    81
D    38
dtype: int64

Similarly, the .min() function returns the minimum of the values in the given object. If the input is a Series, it returns a scalar: the minimum of the values in the Series. If the input is a DataFrame, it returns a Series with the minimum over the specified axis.

df.min()

Output:

A    54
B    67
C    17
D    13
dtype: int64
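The axis argument works the same way for .median(), .max(), and .min(). The sketch below, which again assumes the article's example data, computes these per row and also derives the range of each column as max minus min (the range is an illustrative extra, not something the article computes).

```python
import pandas as pd

# Same example data as in the article, rebuilt so the snippet is self-contained
df = pd.DataFrame({
    "A": [83, 92, 54],
    "B": [67, 72, 83],
    "C": [17, 81, 27],
    "D": [38, 16, 13],
})

row_max = df.max(axis=1)         # largest value in each row
row_min = df.min(axis=1)         # smallest value in each row
row_median = df.median(axis=1)   # median of each row
col_range = df.max() - df.min()  # spread of each column

print(row_max.tolist())      # [83, 92, 83]
print(row_min.tolist())      # [17, 16, 13]
print(row_median.tolist())   # [52.5, 76.5, 40.5]
print(col_range.tolist())    # [38, 16, 64, 25]
```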

The .count() function counts the number of non-NA/null observations along the given axis. It works with non-floating-point data as well.

df.count()

Output:

A    3
B    3
C    3
D    3
dtype: int64
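Because the example DataFrame has no gaps, every count above is 3. A sketch with hypothetical data (the frame df_missing and its values are made up for illustration) shows how missing entries are left out, including in a text column.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps; not taken from the article
df_missing = pd.DataFrame({
    "A": [83, np.nan, 54],
    "B": [67, 72, 83],
    "name": ["x", None, "z"],
})

counts = df_missing.count()            # non-missing values per column
row_counts = df_missing.count(axis=1)  # non-missing values per row

print(counts)
print(row_counts.tolist())   # [3, 1, 3]
```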

The .std() function returns the sample standard deviation over the requested axis. By default it is normalized by N-1.

df.std()

Output:

A    19.857828
B     8.185353
C    34.428670
D    13.650397
dtype: float64

If the DataFrame has missing values, they are skipped by default, so df.std() and df.std(skipna=True) give the same result. Passing skipna=False instead returns NaN for any column that contains a missing value.
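The effect of skipna can be seen on a single column. This sketch uses the values of column A with one hypothetical missing entry added; ddof is the pandas parameter that switches between the sample (N-1) and population (N) formulas.

```python
import numpy as np
import pandas as pd

# Column A from the article, with one hypothetical gap inserted
s = pd.Series([83, 92, np.nan, 54])

sample_std = s.std()               # defaults: skipna=True, ddof=1
strict_std = s.std(skipna=False)   # a single NaN makes the result NaN
population_std = s.std(ddof=0)     # divide by N instead of N - 1

print(sample_std)       # same as the value for A above, since the NaN is skipped
print(strict_std)       # nan
print(population_std)
```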

The .describe() function shows basic statistical details such as count, mean, std, and percentiles for a DataFrame or a Series of numeric values. When it is applied to a Series of strings, it returns a different summary: count, unique, top, and freq.

df.describe()

Output:

               A          B          C          D
count   3.000000   3.000000   3.000000   3.000000
mean   76.333333  74.000000  41.666667  22.333333
std    19.857828   8.185353  34.428670  13.650397
min    54.000000  67.000000  17.000000  13.000000
25%    68.500000  69.500000  22.000000  14.500000
50%    83.000000  72.000000  27.000000  16.000000
75%    87.500000  77.500000  54.000000  27.000000
max    92.000000  83.000000  81.000000  38.000000
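For a Series of strings the summary looks different. A minimal sketch, using a hypothetical list of team names rather than data from the article:

```python
import pandas as pd

# Hypothetical string data for illustration
teams = pd.Series(["Riders", "Kings", "Riders", "Royals"])

summary = teams.describe()
print(summary)

# count  = number of non-missing entries
# unique = number of distinct values
# top    = most frequent value
# freq   = how often the top value occurs
```

Here count is 4, unique is 3, top is "Riders", and freq is 2.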

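The .corr() function finds the pairwise correlation of all columns in the DataFrame (Pearson correlation by default). Any NaN values are automatically excluded. Non-numeric columns were silently ignored in older pandas versions; since pandas 2.0 they have to be dropped first or excluded with numeric_only=True.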

df.corr()

Output:

          A         B         C         D
A  1.000000 -0.858233  0.569956  0.394121
B -0.858233  1.000000 -0.067421 -0.809964
C  0.569956 -0.067421  1.000000 -0.530536
D  0.394121 -0.809964 -0.530536  1.000000

The output DataFrame can be read cell by cell: each cell holds the correlation between its row variable and its column variable, so the matrix is symmetric. The correlation of a variable with itself is 1, which is why all the diagonal values are 1.00.
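To make the meaning of a single cell concrete, the sketch below recomputes the A/B entry by hand with the Pearson formula and compares it with the value from df.corr(). It assumes the article's example data; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Same example data as in the article, rebuilt so the snippet is self-contained
df = pd.DataFrame({
    "A": [83, 92, 54],
    "B": [67, 72, 83],
    "C": [17, 81, 27],
    "D": [38, 16, 13],
})

corr = df.corr()   # Pearson correlation is the default method

# Pearson r for A and B by hand: the sum of the products of deviations,
# divided by the product of the root sums of squared deviations.
dev_a = df["A"] - df["A"].mean()
dev_b = df["B"] - df["B"].mean()
r_manual = (dev_a * dev_b).sum() / (
    np.sqrt((dev_a ** 2).sum()) * np.sqrt((dev_b ** 2).sum())
)

print(corr.loc["A", "B"])
print(r_manual)
```

Both values agree at about -0.858, and corr.loc["A", "B"] equals corr.loc["B", "A"] because the matrix is symmetric.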

Importance of GroupBy in Statistical Analysis

Pandas' GroupBy is a powerful and versatile feature. It allows you to split your data into separate groups and perform computations on each group for better analysis. GroupBy is used to quickly summarize data and aggregate it in a way that's easy to interpret.

The .groupby() method allows you to group rows of data together and call aggregate functions on each group.

Consider the following DataFrame:

    Points  Rank    Team  Year
0      876     1  Riders  2014
1      789     2  Riders  2015
2      863     2  Devils  2014
3      673     3  Devils  2015
4      741     3   Kings  2014
5      812     4   kings  2015
6      756     1   Kings  2016
7      788     1   Kings  2017
8      694     2  Riders  2016
9      701     4  Royals  2014
10     804     1  Royals  2015

Now you can use the .groupby() method to group rows together based on a column name. For instance, let's group by Team. This creates a DataFrameGroupBy object:

df.groupby('Team')

The grouped data is held in an object rather than displayed as a table:

Output:

<pandas.core.groupby.DataFrameGroupBy object at 0x113014128>

Now we can save this object in a new variable named team:

team = df.groupby('Team')

Now we can call aggregation functions on that variable:

team.mean().astype('int32')

Output:

        Points  Rank  Year
Team                      
Devils     768     2  2014
Kings      761     1  2015
Riders     786     1  2015
Royals     752     2  2014
kings      812     4  2015
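The grouping step can be reproduced end to end. The sketch below assumes the DataFrame is built from a dictionary (the article does not show its construction). It also illustrates why "kings" appears as its own group: group labels are case-sensitive. Normalizing the labels with .str.title() is an optional extra step that is not part of the article.

```python
import pandas as pd

# Rebuild the team data from the table above (construction is an assumption)
df = pd.DataFrame({
    "Points": [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804],
    "Rank":   [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1],
    "Team":   ["Riders", "Riders", "Devils", "Devils", "Kings", "kings",
               "Kings", "Kings", "Riders", "Royals", "Royals"],
    "Year":   [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015],
})

team = df.groupby("Team")
team_means = team.mean().astype("int32")   # astype truncates the decimals
print(team_means)

# "Kings" and "kings" are different labels, so they form two groups.
# Optional: normalize the capitalization first so they are merged.
merged = df.assign(Team=df["Team"].str.title()).groupby("Team")["Points"].mean()
print(merged)
```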

We can apply aggregation functions such as mean, median, min, and max to this grouped data.

We can also group by two columns at the same time. This is how you can do it:

teams = df.groupby(['Team', 'Rank'])
teams.mean().astype('int32')

Output:

             Points  Year
Team   Rank              
Devils 2        863  2014
       3        673  2015
Kings  1        772  2016
       3        741  2014
Riders 1        876  2014
       2        741  2015
Royals 1        804  2015
       4        701  2014
kings  4        812  2015
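A result grouped by two columns has a two-level index, and a single cell is addressed with a (Team, Rank) tuple. The sketch below assumes the same team data and also shows .agg(), which is one way to request several aggregations in a single call; the variable names are illustrative.

```python
import pandas as pd

# Rebuild the team data from the article (construction is an assumption)
df = pd.DataFrame({
    "Points": [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804],
    "Rank":   [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1],
    "Team":   ["Riders", "Riders", "Devils", "Devils", "Kings", "kings",
               "Kings", "Kings", "Riders", "Royals", "Royals"],
    "Year":   [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015],
})

teams = df.groupby(["Team", "Rank"])
means = teams.mean().astype("int32")

# Look up one cell of the two-level result with a (Team, Rank) tuple
kings_rank1_points = means.loc[("Kings", 1), "Points"]
print(kings_rank1_points)   # 772

# Several aggregations of one column in a single call
stats = df.groupby("Team")["Points"].agg(["mean", "median", "min", "max"])
print(stats)
```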

Filtering and Sorting 

To perform statistical analysis, we often need to filter the data first. Pandas provides many ways to filter a DataFrame.

We can filter the data using Python comparison operators and Boolean operators.

Comparison Operators

These operators compare the values on either side of them and decide the relation between them. They are also called relational operators.

==    If the values of the two operands are equal, the condition becomes true.
!=    If the values of the two operands are not equal, the condition becomes true.
>     If the value of the left operand is greater than the value of the right operand, the condition becomes true.
<     If the value of the left operand is less than the value of the right operand, the condition becomes true.
>=    If the value of the left operand is greater than or equal to the value of the right operand, the condition becomes true.
<=    If the value of the left operand is less than or equal to the value of the right operand, the condition becomes true.
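Applied to a pandas Series, each of these operators compares every element and returns a Series of True/False values, which is what makes them usable as filters. A small sketch with hypothetical rank values:

```python
import pandas as pd

# Hypothetical values for illustration
rank = pd.Series([1, 2, 3, 4], name="Rank")

equal_one = rank == 1      # [True, False, False, False]
not_one = rank != 1        # [False, True, True, True]
above_two = rank > 2       # [False, False, True, True]
at_most_two = rank <= 2    # [True, True, False, False]

print(equal_one.tolist())
print(int(above_two.sum()))   # True counts as 1, so this is the number of matches
```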

Boolean Operators

Python has three Boolean operators that are typed out as plain English words:

or     Returns True if at least one of the statements is true.
and    Returns True if both statements are true.
not    Reverses the result: returns False if the result is true.
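The keywords work on single Python values. On a pandas Series they raise an error because a whole Series has no single truth value, so the element-wise operators & (and), | (or), and ~ (not) are used instead, with each comparison wrapped in parentheses. A short sketch with hypothetical values:

```python
import pandas as pd

# Plain Python values: the keyword operators work as described
rank, points = 1, 876
both = rank == 1 and points > 800     # True
either = rank == 4 or points > 800    # True
negated = not rank == 1               # False

# Series (hypothetical values): use &, | and ~ with parentheses
ranks = pd.Series([1, 2, 4])
pts = pd.Series([876, 789, 701])

mask_and = (ranks == 1) & (pts > 800)   # [True, False, False]
mask_or = (ranks == 1) | (pts < 750)    # [True, False, True]
mask_not = ~(ranks == 1)                # [False, True, True]

print(mask_and.tolist(), mask_or.tolist(), mask_not.tolist())
```

The parentheses matter because & and | bind more tightly than the comparison operators.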

Now we will learn how to filter the rows of a DataFrame based on a condition.

Consider the following DataFrame:

    Points  Rank    Team  Year
0      876     1  Riders  2014
1      789     2  Riders  2015
2      863     2  Devils  2014
3      673     3  Devils  2015
4      741     3   Kings  2014
5      812     4   kings  2015
6      756     1   Kings  2016
7      788     1   Kings  2017
8      694     2  Riders  2016
9      701     4  Royals  2014
10     804     1  Royals  2015

To find the teams with ranking 1, we can do the following:

df[df['Rank']==1]

Output:

    Points  Rank    Team  Year
0      876     1  Riders  2014
6      756     1   Kings  2016
7      788     1   Kings  2017
10     804     1  Royals  2015

To find the teams with a ranking less than 3:

df[df['Rank']<3]

Output:

    Points  Rank    Team  Year
0      876     1  Riders  2014
1      789     2  Riders  2015
2      863     2  Devils  2014
6      756     1   Kings  2016
7      788     1   Kings  2017
8      694     2  Riders  2016
10     804     1  Royals  2015

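We can also combine multiple conditions. Conditions on a DataFrame are combined with & rather than the keyword and, and each condition is wrapped in parentheses.

Now we will find the teams with ranking 1 and points greater than 800:

df[(df['Rank']==1)&(df['Points']>800)]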

Output:

    Points  Rank    Team  Year
0      876     1  Riders  2014
10     804     1  Royals  2015
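The filters above return whole rows. If only the team names are wanted, .loc with a condition and a column label is one way to get them. The section title also mentions sorting, which the text does not demonstrate; sort_values is the standard pandas call for it. The sketch below assumes the same team data, and the variable names are illustrative.

```python
import pandas as pd

# Rebuild the team data from the article (construction is an assumption)
df = pd.DataFrame({
    "Points": [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804],
    "Rank":   [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1],
    "Team":   ["Riders", "Riders", "Devils", "Devils", "Kings", "kings",
               "Kings", "Kings", "Riders", "Royals", "Royals"],
    "Year":   [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015],
})

# Only the Team column of the rows with rank 1
rank1_names = df.loc[df["Rank"] == 1, "Team"].tolist()
print(rank1_names)   # ['Riders', 'Kings', 'Kings', 'Royals']

# Rank 1 and more than 800 points
top = df[(df["Rank"] == 1) & (df["Points"] > 800)]
top_names = top["Team"].tolist()
print(top_names)     # ['Riders', 'Royals']

# Sort all rows by points, highest first
by_points = df.sort_values("Points", ascending=False)
print(by_points.head(3))
```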
