In the previous article, all of the examples involved univariate distributions (distributions of a single variable), perhaps conditional on a second variable assigned to hue.
Now we will assign a second variable to y, and the result is a bivariate distribution. We will use the same penguins dataset here.
sns.displot(penguins, x="bill_length_mm", y="bill_depth_mm")
Output :
Assigning a hue variable will plot multiple heatmaps or contour sets using different colors. For bivariate histograms, this will only work well if there is minimal overlap between the conditional distributions:
sns.displot(penguins, x="bill_length_mm", y="bill_depth_mm", hue="species")
Output :
A bivariate histogram bins the data within rectangles that tile the plot and then shows the count of observations within each rectangle with the fill color. Similarly, a bivariate KDE plot smooths the (x, y) observations with a 2D Gaussian. The default representation then shows the contours of the 2D density:
sns.displot(penguins, x="bill_length_mm", y="bill_depth_mm", kind="kde")
Output :
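To make "smoothing with a 2D Gaussian" more concrete, here is a minimal NumPy sketch that places one Gaussian bump on every observation and averages the bumps on a grid. It is not seaborn's implementation: the data are synthetic stand-ins for the two bill measurements, and the bandwidths bw_x and bw_y are hand-picked, hypothetical values rather than the ones seaborn would choose.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for (bill_length_mm, bill_depth_mm); NOT the real penguins data.
x = rng.normal(44.0, 5.0, size=300)
y = rng.normal(17.0, 2.0, size=300)

# Hypothetical, hand-picked bandwidths (kernel standard deviations), one per axis.
bw_x, bw_y = 2.0, 0.8

# Grid on which the density estimate is evaluated.
grid_x = np.linspace(x.min() - 3 * bw_x, x.max() + 3 * bw_x, 80)
grid_y = np.linspace(y.min() - 3 * bw_y, y.max() + 3 * bw_y, 80)

# One axis-aligned Gaussian kernel per observation, evaluated along each axis.
kx = np.exp(-0.5 * ((grid_x[:, None] - x[None, :]) / bw_x) ** 2) / (bw_x * np.sqrt(2 * np.pi))
ky = np.exp(-0.5 * ((grid_y[:, None] - y[None, :]) / bw_y) ** 2) / (bw_y * np.sqrt(2 * np.pi))

# Average the product kernels over observations:
# density[i, j] is the estimate at (grid_x[i], grid_y[j]).
density = kx @ ky.T / x.size

# Sanity check: the estimate should integrate to roughly 1 over the grid.
cell_area = (grid_x[1] - grid_x[0]) * (grid_y[1] - grid_y[0])
total_mass = density.sum() * cell_area
```

The contours in a KDE plot are level curves of a surface like density. Larger bandwidths give a smoother surface with fewer bumps, and smaller ones follow individual points more closely.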
The contour approach of the bivariate KDE plot lends itself better to evaluating overlap, although a plot with too many contours can get busy:
sns.displot(penguins, x="bill_length_mm", y="bill_depth_mm", hue="species", kind="kde")
Output :
Just as with univariate plots, the choice of bin size or smoothing bandwidth will determine how well the plot represents the underlying bivariate distribution.
The same parameters apply, but they can be tuned for each variable by passing a pair of values:
sns.displot(penguins, x="bill_length_mm", y="bill_depth_mm", binwidth=(2, .5))
Output :
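As a rough illustration of what a pair of bin widths means, the counts behind such a heatmap can be sketched with NumPy's histogram2d. This is only a sketch under stated assumptions: the data are synthetic stand-ins for the bill measurements, the helper edges() is a hypothetical name introduced here, and seaborn may place its bin edges differently.

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic stand-in for the two bill measurements; NOT the real penguins data.
bill_length = rng.normal(44.0, 5.0, size=300)
bill_depth = rng.normal(17.0, 2.0, size=300)

def edges(values, width):
    """Bin edges of the given width that cover the full range of the values."""
    n_bins = max(int(np.ceil((values.max() - values.min()) / width)), 1)
    return values.min() + width * np.arange(n_bins + 1)

# A pair of bin widths, one per variable, mirroring binwidth=(2, .5).
x_edges = edges(bill_length, 2.0)
y_edges = edges(bill_depth, 0.5)

# counts[i, j] is the number of observations in the rectangle (x bin i, y bin j).
counts, _, _ = np.histogram2d(bill_length, bill_depth, bins=[x_edges, y_edges])
print(counts.shape, int(counts.sum()))
```

Each cell of counts corresponds to one rectangle of the heatmap, and its value is what the fill color encodes.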
To aid interpretation of the heatmap, add a colorbar to show the mapping between counts and color intensity:
sns.displot(penguins, x="bill_length_mm", y="bill_depth_mm", binwidth=(2, .5), cbar=True)
Output :
The meaning of the bivariate density contours is less straightforward. Because the density is not directly interpretable, the contours are drawn at iso-proportions of the density, meaning that each curve shows a level set such that some proportion p of the density lies below it.
The p values are evenly spaced, with the lowest level controlled by the thresh parameter and the number of levels controlled by levels:
sns.displot(penguins, x="bill_length_mm", y="bill_depth_mm", kind="kde", thresh=.2, levels=4)
Output :
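The idea of iso-proportion levels can be sketched in a few lines of NumPy. The example below is an illustration, not seaborn's code: the density is a plain 2D Gaussian evaluated on a grid, the function name iso_proportion_levels is hypothetical, and the spacing of p (evenly from thresh up to 1) is an assumption based on the description above.

```python
import numpy as np

# A simple density on a grid: an uncorrelated 2D Gaussian (illustrative only).
gx, gy = np.meshgrid(np.linspace(-4, 4, 101), np.linspace(-4, 4, 101))
density = np.exp(-0.5 * (gx**2 + gy**2))

def iso_proportion_levels(density, proportions):
    """For each p, return a density level with about a fraction p of the total mass below it."""
    values = np.sort(density.ravel())              # density values, ascending
    cumulative = np.cumsum(values) / values.sum()  # share of the mass at or below each value
    idx = np.searchsorted(cumulative, proportions)
    idx = np.clip(idx, 0, values.size - 1)
    return values[idx]

# Assumption: p runs evenly from thresh up to 1, with n_levels values in total.
thresh, n_levels = 0.2, 4
proportions = np.linspace(thresh, 1.0, n_levels)
levels = iso_proportion_levels(density, proportions)

# Share of the total mass that falls strictly below the lowest contour level.
mass_below_lowest = density[density < levels[0]].sum() / density.sum()
```

Under these assumptions, raising thresh drops the outermost contours, because more of the low-density tail is left undrawn.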
The bivariate distribution histogram allows one or both variables to be discrete. Plotting one discrete and one continuous variable offers another way to compare conditional univariate distributions.
sns.displot(penguins, x="species", y="body_mass_g", hue="sex")
Output :
In contrast, plotting two discrete variables is an easy way to show the cross-tabulation of the observations:
sns.displot(penguins, x="species", y="island")
Output :
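The counts that such a plot encodes as color are the same numbers a cross-tabulation produces. A minimal pandas sketch, using a handful of made-up rows rather than the real penguins data:

```python
import pandas as pd

# Toy rows with made-up values; NOT the real penguins data.
toy = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap"],
    "island": ["Torgersen", "Biscoe", "Dream", "Biscoe", "Biscoe", "Dream"],
})

# Rows are species, columns are islands, cells are observation counts.
table = pd.crosstab(toy["species"], toy["island"])
print(table)
```

Combinations that never occur show up as zeros in the table, which correspond to empty cells in the plot.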
Visualizing statistical relationships using seaborn
We will discuss three seaborn functions in this tutorial.
• relplot()
• scatterplot()
• lineplot()
As we will see, these functions can be quite illuminating because they use simple and easily-understood representations of data that can nevertheless represent complex dataset structures. They can do so because they plot two-dimensional graphics that can be enhanced by mapping up to three additional variables using the semantics of hue, size, and style.
Relating variables with scatter plots
The scatter plot is a mainstay of statistical visualization. It depicts the joint distribution of two variables using a cloud of points, where each point represents an observation in the dataset. This depiction allows the eye to infer a substantial amount of information about whether there is any meaningful relationship between them.
There are several ways to draw a scatter plot in seaborn. The most basic, which should be used when both variables are numeric, is the scatterplot() function.
In the categorical visualization tutorial, we will see specialized tools for using scatterplots to visualize categorical data. scatterplot() is the default kind in relplot(); it can also be requested explicitly by setting kind="scatter".
Here we will use the tips dataset from seaborn:
df = sns.load_dataset("tips")
df.head()
Output :
  | total_bill | tip | sex | smoker | day | time | size |
0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
sns.relplot(x="total_bill", y="tip", data=df)
Output :
While the points are plotted in two dimensions, another dimension can be added to the plot by coloring the points according to a third variable. In seaborn, this is referred to as using a "hue semantic", because the color of the point gains meaning:
sns.relplot(x="total_bill", y="tip", hue="smoker", data=df)
Output :
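A further variable can be shown by changing the size of each point with the size semantic. Unlike with matplotlib.pyplot.scatter(), the literal value of the variable is not used to pick the area of the point. Instead, the range of values in data units is normalized into a range in area units. This range can be customized with the sizes parameter: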
sns.relplot(x="total_bill", y="tip", size="size", sizes=(15, 200), data=df)
Output :
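One way to picture the normalization into area units is a linear rescaling of the data range onto the sizes range. The sketch below assumes a simple linear mapping and uses a hypothetical helper, area_from_values; seaborn's own normalization may differ in detail.

```python
import numpy as np

def area_from_values(values, sizes=(15, 200)):
    """Linearly rescale data values into the (smallest, largest) marker-area range."""
    values = np.asarray(values, dtype=float)
    lo, hi = sizes
    span = values.max() - values.min()
    if span == 0:
        # All values equal: use the middle of the range (a choice made for this sketch).
        return np.full(values.shape, (lo + hi) / 2)
    return lo + (values - values.min()) / span * (hi - lo)

# The "size" column of the tips dataset (party size) takes values from 1 to 6.
party_sizes = [1, 2, 3, 4, 5, 6]
areas = area_from_values(party_sizes)
print(areas)
```

With this mapping the smallest party gets the smallest area and the largest party the largest, regardless of the units of the original variable.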
Emphasizing continuity with line plots
Scatter plots are highly effective, but there is no universally optimal type of visualization. Instead, the visual representation should be adapted for the specifics of the dataset and to the question you are trying to answer with the plot.
With some datasets, you may want to understand changes in one variable as a function of time or a similarly continuous variable. In this situation, a good choice is to draw a line plot. In seaborn, this can be accomplished by the lineplot() function, either directly or with relplot() by setting kind="line":
df = pd.DataFrame(dict(time=np.arange(500), value=np.random.randn(500).cumsum()))
sns.relplot(x="time", y="value", kind="line", data=df)
Output :
Aggregation and representing uncertainty
More complex datasets will have multiple measurements for the same value of the x variable. The default behavior in seaborn is to aggregate the multiple measurements at each x value by plotting the mean and the 95% confidence interval around the mean. We will use the fmri dataset for this:
df = sns.load_dataset("fmri")
df.head()
Output :
  | subject | timepoint | event | region | signal |
0 | s13 | 18 | stim | parietal | -0.017552 |
1 | s5 | 14 | stim | parietal | -0.080883 |
2 | s12 | 18 | stim | parietal | -0.081033 |
3 | s11 | 18 | stim | parietal | -0.046134 |
4 | s10 | 18 | stim | parietal | -0.037970 |
sns.relplot(x="timepoint", y="signal", kind="line", data=df)
Output :
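The aggregation described above can be sketched by hand: group the rows by x value, take the mean, and estimate a 95% confidence interval for that mean. The example below uses a percentile bootstrap on synthetic data. The helper mean_ci and its defaults are hypothetical, and the details are not meant to mirror seaborn's internals.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
# Synthetic repeated measurements: 14 hypothetical subjects at each of 5 timepoints.
toy = pd.DataFrame({
    "timepoint": np.repeat(np.arange(5), 14),
    "signal": rng.normal(0.0, 0.1, size=70) + np.repeat(np.linspace(0, 0.2, 5), 14),
})

def mean_ci(values, n_boot=1000, level=95):
    """Mean plus a percentile-bootstrap confidence interval for the mean."""
    values = np.asarray(values)
    # Resample with replacement and record the mean of each resample.
    resamples = rng.choice(values, size=(n_boot, values.size), replace=True)
    boot_means = resamples.mean(axis=1)
    half = (100 - level) / 2
    low, high = np.percentile(boot_means, [half, 100 - half])
    return {"mean": values.mean(), "low": low, "high": high}

# One row per timepoint: the line follows "mean", the band spans "low" to "high".
rows = [{"timepoint": t, **mean_ci(group["signal"])} for t, group in toy.groupby("timepoint")]
summary = pd.DataFrame(rows).set_index("timepoint")
print(summary)
```

The band narrows as the number of measurements per x value grows, which is why it describes uncertainty about the mean rather than the spread of the raw data.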
Another good option, especially with larger data, is to represent the spread of the distribution at each time point by plotting the standard deviation instead of a confidence interval:
sns.relplot(x="timepoint", y="signal", kind="line", ci="sd", data=df);
Output :
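For comparison, a standard-deviation band can be computed directly with a pandas groupby. This is a sketch on synthetic data, not the fmri dataset. Note also that in seaborn 0.12 and later the ci parameter is deprecated, and the same request is written as errorbar="sd".

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
# Synthetic repeated measurements: 14 values at each of 5 timepoints.
toy = pd.DataFrame({
    "timepoint": np.repeat(np.arange(5), 14),
    "signal": rng.normal(0.0, 0.1, size=70),
})

# Mean and sample standard deviation of the signal at each timepoint.
stats = toy.groupby("timepoint")["signal"].agg(["mean", "std"])

# The band covers one standard deviation on either side of the mean.
stats["low"] = stats["mean"] - stats["std"]
stats["high"] = stats["mean"] + stats["std"]
print(stats)
```

Unlike a confidence interval, this band does not shrink as more measurements are added, because it describes the spread of the observations themselves.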
Plotting subsets of data with semantic mappings
The lineplot() function has the same flexibility as scatterplot(): it can show up to three additional variables by modifying the hue, size, and style of the plot elements.
It does so using the same API as scatterplot(), meaning that we don't need to stop and think about the parameters that control the look of lines vs. points in matplotlib.
Using semantics in lineplot() will also determine how the data get aggregated. For example, adding a hue semantic with two levels splits the plot into two lines and error bands, coloring each to indicate which subset of the data they correspond to.
sns.relplot(x="timepoint", y="signal", hue="event", kind="line", data=df)
Output :
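The effect of a hue semantic on aggregation can be mimicked with a pandas groupby: grouping by timepoint alone yields one series, while grouping by event and timepoint yields one series per event level. The data below are synthetic stand-ins that only borrow the fmri column names.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
# Synthetic stand-in: two event levels, 4 timepoints, 10 measurements each.
toy = pd.DataFrame({
    "event": np.repeat(["stim", "cue"], 40),
    "timepoint": np.tile(np.repeat(np.arange(4), 10), 2),
    "signal": rng.normal(0.0, 0.1, size=80),
})

# No hue variable: one mean per timepoint, so a single line.
one_line = toy.groupby("timepoint")["signal"].mean()

# With hue="event": one mean per (event, timepoint) pair, so one line per event level.
per_event = toy.groupby(["event", "timepoint"])["signal"].mean().unstack("event")
print(per_event)
```

Each column of per_event corresponds to one colored line, and the error band for that line is computed only from the rows in its subset.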
In the next article, we will learn how to plot categorical variables using seaborn.