Statistical analysis is used to examine results and deduce or infer meaning about the underlying dataset, or the reality it attempts to describe.
Statistical analysis may be used to:
- Present key findings revealed by a dataset.
- Summarize information.
- Calculate measures of cohesiveness, relevance, or diversity in data.
- Make future predictions based on previously recorded data.
- Test experimental predictions.
Any data scientist will spend much of their day performing these functions.
Aggregation functions
An essential piece of analyzing large data is efficient summarization: computing aggregations like sum(), mean(), median(), min(), and max(), in which a single number gives insight into the nature of a potentially large dataset. In this section, we’ll explore aggregations in Pandas, starting with simple operations akin to those we’ve seen on NumPy arrays.
The .sum() function returns the sum of the values. Applied to a DataFrame, it returns the sum of each column by default.
Consider the following DataFrame:

|   | A  | B  | C  | D  |
|---|----|----|----|----|
| 0 | 83 | 67 | 17 | 38 |
| 1 | 92 | 72 | 81 | 16 |
| 2 | 54 | 83 | 27 | 13 |
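If you want to follow along, here is a minimal sketch of how this DataFrame could be constructed (the values are copied from the table above, and the variable name df is assumed throughout this section):

import pandas as pd

# Example data matching the table above
df = pd.DataFrame({
    'A': [83, 92, 54],
    'B': [67, 72, 83],
    'C': [17, 81, 27],
    'D': [38, 16, 13]
})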
df.sum()
Output:
A | 229 |
B | 222 |
C | 125 |
D | 67 |
dtype: int64
We can also find the sum of an individual column in the DataFrame:
df['A'].sum()
Output:
229
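Aggregations can also be computed across each row instead of each column by passing axis=1 (equivalently axis='columns'); for example:

df.sum(axis=1)   # row-wise sums: one value per row of the DataFrame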
Now we will learn how to find the mean(), median(), min(), and max() values. We already know what the mean and median are.
The .mean() function returns the mean of the values for the requested axis. If the method is applied to a pandas Series object, it returns a scalar value: the mean of all the observations in the Series. If it is applied to a pandas DataFrame object, it returns a pandas Series containing the mean of the values over the specified axis.
df.mean()
Output:
A | 76.33333 |
B | 74.00000 |
C | 41.66667 |
D | 22.33333 |
dtype: float64
We can also find the mean of individual columns in the DataFrame
df['A'].mean()
Output:
76.33333333333333
The .median() function returns the median of the underlying data in the given DataFrame.
df.median()
Output:
A | 83.0 |
B | 72.0 |
C | 27.0 |
D | 16.0 |
dtype: float64
The .max() function returns the maximum of the values in the given object. If the input is a Series, the method returns a scalar, the maximum of the values in the Series. If the input is a DataFrame, the method returns a Series with the maximum of the values over the specified axis. By default, the axis is the index axis.
df.max()
Output:
A | 92 |
B | 83 |
C | 81 |
D | 38 |
dtype: int64
Similarly, the .min() function returns the minimum of the values in the given object. If the input is a Series, the method returns a scalar, the minimum of the values in the Series. If the input is a DataFrame, the method returns a Series with the minimum of the values over the specified axis.
df.min()
Output:
A | 54 |
B | 67 |
C | 17 |
D | 13 |
dtype: int64
.count() is used to count the number of non-NA/null observations across the given axis. It works with non-floating-point data as well.
df.count()
Output:
A | 3 |
B | 3 |
C | 3 |
D | 3 |
dtype: int64
The .std() function returns the sample standard deviation over the requested axis.
df.std()
Output:
A | 19.857828 |
B | 8.185353 |
C | 34.428670 |
D | 13.650397 |
dtype: float64
If there are missing values in the DataFrame, we can skip them by passing the skipna argument, e.g. df.std(skipna=True); skipping NaN values is also the default behaviour.
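As a quick illustration of skipna, consider a small hypothetical Series containing a missing value (NumPy is assumed to be available for np.nan):

import numpy as np

s = pd.Series([83, np.nan, 54])
s.std(skipna=True)    # the NaN is ignored; the std is computed from 83 and 54
s.std(skipna=False)   # returns NaN because a missing value is present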
.describe() is used to view basic statistical details such as the percentiles, mean, and standard deviation of a DataFrame or a Series of numeric values. When the method is applied to a Series of strings, it returns a different kind of summary, shown after the numeric example below.
df.describe()
Output:
|       | A         | B         | C         | D         |
|-------|-----------|-----------|-----------|-----------|
| count | 3.000000  | 3.000000  | 3.000000  | 3.000000  |
| mean  | 76.333333 | 74.000000 | 41.666667 | 22.333333 |
| std   | 19.857828 | 8.185353  | 34.428670 | 13.650397 |
| min   | 54.000000 | 67.000000 | 17.000000 | 13.000000 |
| 25%   | 68.500000 | 69.500000 | 22.000000 | 14.500000 |
| 50%   | 83.000000 | 72.000000 | 27.000000 | 16.000000 |
| 75%   | 87.500000 | 77.500000 | 54.000000 | 27.000000 |
| max   | 92.000000 | 83.000000 | 81.000000 | 38.000000 |
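For a Series of strings, .describe() reports counts and frequencies rather than numeric statistics. A small hypothetical example:

s = pd.Series(['Riders', 'Devils', 'Riders'])
s.describe()   # reports count, unique, top (most frequent value) and freq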
.corr() is used to find the pairwise correlation of all columns in the DataFrame. Any NaN values are automatically excluded, and non-numeric columns are ignored.
df.corr()
Output:
|   | A         | B         | C         | D         |
|---|-----------|-----------|-----------|-----------|
| A | 1.000000  | -0.858233 | 0.569956  | 0.394121  |
| B | -0.858233 | 1.000000  | -0.067421 | -0.809964 |
| C | 0.569956  | -0.067421 | 1.000000  | -0.530536 |
| D | 0.394121  | -0.809964 | -0.530536 | 1.000000  |
The output DataFrame can be read as follows: the value in each cell is the correlation between the row variable and the column variable. Since the correlation of a variable with itself is always 1, all the diagonal values are 1.00.
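As a cross-check of the table above, a single pairwise correlation can also be computed directly on two columns; for example, the correlation between A and B:

df['A'].corr(df['B'])   # approximately -0.858233, matching the A/B cell in the correlation matrix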
Importance of GroupBy in Statistical Analysis
Pandas’ GroupBy is a powerful and versatile tool. It allows you to split your data into separate groups and perform computations on each group for better analysis. GroupBy is used to quickly summarize and aggregate data in a way that’s easy to interpret.
The .groupby() method allows you to group rows of data together and call aggregate functions on each group.
Consider the following DataFrame:
|    | Points | Rank | Team   | Year |
|----|--------|------|--------|------|
| 0  | 876    | 1    | Riders | 2014 |
| 1  | 789    | 2    | Riders | 2015 |
| 2  | 863    | 2    | Devils | 2014 |
| 3  | 673    | 3    | Devils | 2015 |
| 4  | 741    | 3    | Kings  | 2014 |
| 5  | 812    | 4    | kings  | 2015 |
| 6  | 756    | 1    | Kings  | 2016 |
| 7  | 788    | 1    | Kings  | 2017 |
| 8  | 694    | 2    | Riders | 2016 |
| 9  | 701    | 4    | Royals | 2014 |
| 10 | 804    | 1    | Royals | 2015 |
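For reference, this DataFrame could be built as follows (values copied from the table above; note that 'kings' in row 5 is deliberately lower-case, which will matter when we group by Team):

df = pd.DataFrame({
    'Points': [876, 789, 863, 673, 741, 812, 756, 788, 694, 701, 804],
    'Rank': [1, 2, 2, 3, 3, 4, 1, 1, 2, 4, 1],
    'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings', 'kings',
             'Kings', 'Kings', 'Riders', 'Royals', 'Royals'],
    'Year': [2014, 2015, 2014, 2015, 2014, 2015, 2016, 2017, 2016, 2014, 2015]
})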
Now you can use the .groupby() method to group rows together based on a column name. For instance, let’s group by Team. This creates a DataFrameGroupBy object.
df.groupby('Team')
The grouped data is saved in an object
Output:
<pandas.core.groupby.DataFrameGroupBy object at 0x113014128>
Now we can save this object as a new variable with the name team
team = df.groupby('Team')
Now we can perform all the aggregation functions on that variable
team.mean().astype('int32')
Output:
| Team   | Points | Rank | Year |
|--------|--------|------|------|
| Devils | 768    | 2    | 2014 |
| Kings  | 761    | 1    | 2015 |
| Riders | 786    | 1    | 2015 |
| Royals | 752    | 2    | 2014 |
| kings  | 812    | 4    | 2015 |
We can perform aggregation functions such as mean, median, min, and max on this grouped data.
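If you want several aggregates at once, the .agg() method accepts a list of function names. A small sketch, restricted to the Points column to keep the output compact:

team['Points'].agg(['mean', 'median', 'max'])   # one row per team, one column per aggregate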
We can also group by two columns at the same time. This is how you can do it:
teams=df.groupby(['Team','Rank'])
teams.mean().astype('int32')
Output:
| Team   | Rank | Points | Year |
|--------|------|--------|------|
| Devils | 2    | 863    | 2014 |
|        | 3    | 673    | 2015 |
| Kings  | 1    | 772    | 2016 |
|        | 3    | 741    | 2014 |
| Riders | 1    | 876    | 2014 |
|        | 2    | 741    | 2015 |
| Royals | 1    | 804    | 2015 |
|        | 4    | 701    | 2014 |
| kings  | 4    | 812    | 2015 |
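The result above is indexed by a MultiIndex of (Team, Rank). If plain columns are more convenient, for example before exporting the result, the index can be turned back into ordinary columns with reset_index():

teams.mean().astype('int32').reset_index()   # Team and Rank become regular columns again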
Filtering and Sorting
Statistical analysis often requires a lot of filtering operations, and Pandas provides many ways to filter a DataFrame.
We can filter the data using Python Comparison Operators and Python Boolean Operators
Comparison Operators
These operators compare the values on either side of them and determine the relation between them. They are also called relational operators.
- == : If the values of the two operands are equal, the condition is true.
- != : If the values of the two operands are not equal, the condition is true.
- > : If the value of the left operand is greater than the value of the right operand, the condition is true.
- < : If the value of the left operand is less than the value of the right operand, the condition is true.
- >= : If the value of the left operand is greater than or equal to the value of the right operand, the condition is true.
- <= : If the value of the left operand is less than or equal to the value of the right operand, the condition is true.
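Applied to a pandas column, a comparison operator works element-wise and returns a boolean Series (often called a mask), which can then be used to select rows. For example, using the DataFrame from the GroupBy section:

mask = df['Rank'] == 1   # element-wise comparison; a boolean Series with one value per row
df[mask]                 # keeps only the rows where the mask is True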
Boolean Operators
Python has three Boolean operators that are typed out as plain English words:
- or : Returns True if at least one of the statements is true.
- and : Returns True if both statements are true.
- not : Reverses the result; returns False if the result is true.
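Note that when combining boolean masks built from pandas columns, the plain keywords and/or/not do not work element-wise. Use the operators & (and), | (or), and ~ (not) instead, and wrap each condition in parentheses:

df[(df['Rank'] == 1) & (df['Points'] > 800)]   # both conditions must hold
df[(df['Rank'] == 1) | (df['Points'] > 800)]   # at least one condition must hold
df[~(df['Rank'] == 1)]                         # negates the condition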
Now we will learn how to filter the values of the DataFrame based on certain conditions.
Consider the following DataFrame
|    | Points | Rank | Team   | Year |
|----|--------|------|--------|------|
| 0  | 876    | 1    | Riders | 2014 |
| 1  | 789    | 2    | Riders | 2015 |
| 2  | 863    | 2    | Devils | 2014 |
| 3  | 673    | 3    | Devils | 2015 |
| 4  | 741    | 3    | Kings  | 2014 |
| 5  | 812    | 4    | kings  | 2015 |
| 6  | 756    | 1    | Kings  | 2016 |
| 7  | 788    | 1    | Kings  | 2017 |
| 8  | 694    | 2    | Riders | 2016 |
| 9  | 701    | 4    | Royals | 2014 |
| 10 | 804    | 1    | Royals | 2015 |
Suppose we want to find the teams with rank 1. This is how we can do it:
df[df['Rank'] == 1]
Output:
|    | Points | Rank | Team   | Year |
|----|--------|------|--------|------|
| 0  | 876    | 1    | Riders | 2014 |
| 6  | 756    | 1    | Kings  | 2016 |
| 7  | 788    | 1    | Kings  | 2017 |
| 10 | 804    | 1    | Royals | 2015 |
If we want to find the teams with rank less than 3:
df[df['Rank'] < 3]
Output:
|    | Points | Rank | Team   | Year |
|----|--------|------|--------|------|
| 0  | 876    | 1    | Riders | 2014 |
| 1  | 789    | 2    | Riders | 2015 |
| 2  | 863    | 2    | Devils | 2014 |
| 6  | 756    | 1    | Kings  | 2016 |
| 7  | 788    | 1    | Kings  | 2017 |
| 8  | 694    | 2    | Riders | 2016 |
| 10 | 804    | 1    | Royals | 2015 |
We can also check multiple conditions by combining masks with the element-wise Boolean operators described above.
Now we will find the teams with ranking 1 and points greater than 800
df[(df['Rank'] == 1) & (df['Points'] > 800)]
Output:
|    | Points | Rank | Team   | Year |
|----|--------|------|--------|------|
| 0  | 876    | 1    | Riders | 2014 |
| 10 | 804    | 1    | Royals | 2015 |
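Although the examples above focus on filtering, the same DataFrame can also be sorted. A brief sketch using sort_values():

df.sort_values('Points', ascending=False)                     # highest-scoring rows first
df.sort_values(['Team', 'Points'], ascending=[True, False])   # by Team, then by Points within each team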