Visualizing bivariate distributions using seaborn


In the previous article, all of the examples were related to univariate distributions (distributions of a single variable), perhaps conditional on a second variable assigned to hue.

Now we will assign a second variable to y, and the result is a bivariate distribution. We will use the same penguins dataset here.

sns.displot(penguins, x="bill_length_mm", y="bill_depth_mm")

Output:

[Figure: bivariate histogram of bill length against bill depth]

Assigning a hue variable will plot multiple heatmaps or contour sets using different colors. For bivariate histograms, this will only work well if there is minimal overlap between the conditional distributions.

sns.displot(penguins, x="bill_length_mm", y="bill_depth_mm", hue="species")

Output:

[Figure: bivariate histogram of bill length against bill depth, colored by species]

A bivariate histogram bins the data within rectangles that tile the plot and then shows the count of observations within each rectangle with the fill color. Similarly, a bivariate KDE plot smooths the (x, y) observations with a 2D Gaussian. The default representation then shows the contours of the 2D density.

sns.displot(penguins, x="bill_length_mm", y="bill_depth_mm", kind="kde")

Output:

[Figure: bivariate KDE contours of bill length against bill depth]
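To make the idea of smoothing with a 2D Gaussian more concrete, here is a minimal NumPy sketch that places one Gaussian bump on each observation and averages them on a grid. The data are synthetic, and the function name gaussian_kde_2d, the axis-aligned kernel, and the hand-picked bandwidths are assumptions made for this illustration; seaborn chooses its own bandwidth and kernel shape.

```python
import numpy as np

# Synthetic stand-in for two measurements (not the penguins data).
rng = np.random.default_rng(0)
x = rng.normal(44.0, 5.0, size=300)
y = rng.normal(17.0, 2.0, size=300)

def gaussian_kde_2d(x, y, grid_x, grid_y, bw_x, bw_y):
    """Average of axis-aligned 2D Gaussian bumps, one per observation."""
    # 1D Gaussian kernel values: rows are grid points, columns are observations.
    kx = np.exp(-0.5 * ((grid_x[:, None] - x[None, :]) / bw_x) ** 2)
    ky = np.exp(-0.5 * ((grid_y[:, None] - y[None, :]) / bw_y) ** 2)
    kx /= bw_x * np.sqrt(2 * np.pi)
    ky /= bw_y * np.sqrt(2 * np.pi)
    # Product kernel averaged over observations:
    # density[i, j] is the estimate at (grid_x[i], grid_y[j]).
    return kx @ ky.T / len(x)

grid_x = np.linspace(15, 75, 200)
grid_y = np.linspace(5, 30, 200)
density = gaussian_kde_2d(x, y, grid_x, grid_y, bw_x=2.0, bw_y=0.8)

# A density should integrate to about 1 over a grid that covers the data.
cell_area = (grid_x[1] - grid_x[0]) * (grid_y[1] - grid_y[0])
total_mass = density.sum() * cell_area
print(round(total_mass, 3))
```

Larger bandwidths give a smoother surface and smaller ones a bumpier surface, which is the same trade-off as choosing a bin size for a histogram.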

The contour approach of the bivariate KDE plot lends itself better to evaluating overlap, although a plot with too many contours can get busy.

sns.displot(penguins, x="bill_length_mm", y="bill_depth_mm", hue="species", kind="kde")

Output:

[Figure: bivariate KDE contours of bill length against bill depth, one contour set per species]

Just as with univariate plots, the choice of bin size or smoothing bandwidth will determine how well the plot represents the underlying bivariate distribution.  

The same parameters apply, but they can be tuned for each variable by passing a pair of values.

sns.displot(penguins, x="bill_length_mm", y="bill_depth_mm", binwidth=(2, .5))

Output:

[Figure: bivariate histogram with bins 2 mm wide in bill length and 0.5 mm wide in bill depth]
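The counting behind a bivariate histogram with a separate bin width per variable can be sketched with np.histogram2d. The data are synthetic, and the helper edges_for is a hypothetical function written for this example; seaborn's own choice of bin edges may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(44.0, 5.0, size=300)  # hypothetical "bill length"-like values
y = rng.normal(17.0, 2.0, size=300)  # hypothetical "bill depth"-like values

def edges_for(values, width):
    """Fixed-width bin edges that cover the observed range."""
    start = np.floor(values.min() / width) * width
    return np.arange(start, values.max() + width, width)

# One bin width per variable, mirroring binwidth=(2, .5).
x_edges = edges_for(x, 2.0)
y_edges = edges_for(y, 0.5)

# counts[i, j] is the number of points in the rectangle
# [x_edges[i], x_edges[i + 1]) x [y_edges[j], y_edges[j + 1]).
counts, _, _ = np.histogram2d(x, y, bins=[x_edges, y_edges])
print(counts.shape, int(counts.sum()))
```

Each entry of counts corresponds to one rectangle of the heatmap, and its value is what the fill color encodes.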

To aid interpretation of the heatmap, add a colorbar to show the mapping between counts and color intensity.

sns.displot(penguins, x="bill_length_mm", y="bill_depth_mm", binwidth=(2, .5), cbar=True)

Output:

[Figure: bivariate histogram with a colorbar showing the count scale]

The meaning of the bivariate density contours is less straightforward.  Because the density is not directly interpretable, the contours are drawn at iso-proportions of the density, meaning that each curve shows a level set such that some proportion p of the density lies below it.  

The p values are evenly spaced, with the lowest level controlled by the thresh parameter and the number controlled by levels.

sns.displot(penguins, x="bill_length_mm", y="bill_depth_mm", kind="kde", thresh=.2, levels=4)

Output:

[Figure: bivariate KDE with four contour levels, the lowest at p = 0.2]
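One way to turn proportions p into contour heights on a gridded density is sketched below. It is an illustration of the iso-proportion idea, not necessarily how seaborn computes its levels; the function name iso_proportion_levels is hypothetical, the spacing of p as an even grid from thresh to 1 is an assumption, and the density is a standard 2D Gaussian standing in for a KDE result.

```python
import numpy as np

# Gridded density of a standard 2D Gaussian (stand-in for a KDE result).
g = np.linspace(-4, 4, 201)
xx, yy = np.meshgrid(g, g)
density = np.exp(-0.5 * (xx**2 + yy**2)) / (2 * np.pi)

def iso_proportion_levels(density, thresh=0.2, levels=4):
    """Density values below which a proportion p of the total mass lies."""
    p = np.linspace(thresh, 1, levels)            # evenly spaced proportions
    values = np.sort(density.ravel())             # ascending density values
    cum_mass = np.cumsum(values) / values.sum()   # mass held by the lowest cells
    idx = np.searchsorted(cum_mass, p)
    return p, np.take(values, idx, mode="clip")

p, cuts = iso_proportion_levels(density, thresh=0.2, levels=4)

# Check: the share of mass in cells below each cut should be close to p.
mass_below = [density[density < c].sum() / density.sum() for c in cuts]
print(np.round(p, 3), np.round(mass_below, 3))
```

Raising thresh removes the outermost, lowest-density contours, while levels sets how many curves are drawn between that threshold and the peak.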

The bivariate histogram allows one or both variables to be discrete. Plotting one discrete and one continuous variable offers another way to compare conditional univariate distributions.

sns.displot(penguins, x="species", y="body_mass_g", hue="sex")

Output:

[Figure: bivariate histogram of body mass for each species, colored by sex]

In contrast, plotting two discrete variables is an easy way to show the cross-tabulation of the observations.

sns.displot(penguins, x="species", y="island")

Output:

[Figure: heatmap of observation counts for each combination of species and island]
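The cross-tabulation that such a plot encodes as color can also be computed directly with pandas. The rows below are made up for illustration and are not the real penguins observations.

```python
import pandas as pd

# Hypothetical observations, made up for illustration.
obs = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo", "Chinstrap"],
    "island": ["Torgersen", "Biscoe", "Biscoe", "Biscoe", "Biscoe", "Dream"],
})

# Rows are species, columns are islands, cells are observation counts.
table = pd.crosstab(obs["species"], obs["island"])
print(table)
```

Each cell of table corresponds to one rectangle in the two-discrete-variable histogram, with combinations that never occur showing up as zero.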

Visualizing statistical relationships using seaborn  

We will discuss three seaborn functions in this tutorial. 

• relplot()  

• scatterplot()  

• lineplot()  

As we will see, these functions can be quite illuminating because they use simple and easily-understood representations of data that can nevertheless represent complex dataset structures. They can do so because they plot two-dimensional graphics that can be enhanced by mapping up to three additional variables using the semantics of hue,  size, and style. 
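As a rough illustration of mapping three additional variables at once, the sketch below draws one scatter plot with hue, style, and size all in use. The DataFrame demo is synthetic, with column names chosen to resemble a restaurant-bill dataset, and the Agg backend is selected only so that the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import numpy as np
import pandas as pd
import seaborn as sns

rng = np.random.default_rng(1)
n = 60
# Synthetic data; the values are invented for this example.
demo = pd.DataFrame({
    "total_bill": rng.uniform(5, 50, n),
    "smoker": rng.choice(["Yes", "No"], n),
    "time": rng.choice(["Lunch", "Dinner"], n),
    "size": rng.integers(1, 7, n),
})
demo["tip"] = 0.15 * demo["total_bill"] + rng.normal(0, 1, n)

# x and y give the position; hue, style, and size add three more variables.
g = sns.relplot(data=demo, x="total_bill", y="tip",
                hue="smoker", style="time", size="size", sizes=(15, 200))
g.savefig("relplot_demo.png")
```

Using all three semantics at once can make a plot hard to read, so in practice it is often better to map only the variables that matter for the question at hand.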

Relating variables with scatter plots  

The scatter plot is a mainstay of statistical visualization. It depicts the joint distribution of two variables using a cloud of points, where each point represents an observation in the dataset. This depiction allows the eye to infer a substantial amount of information about whether there is any meaningful relationship between them. 

There are several ways to draw a scatter plot in seaborn. The most basic, which should be used when both variables are numeric, is the scatterplot() function.  

In the categorical visualization tutorial, we will see specialized tools for using scatter plots to visualize categorical data. scatterplot() is the default kind in relplot(); it can also be requested explicitly by setting kind="scatter".

Here we will use the tips dataset from seaborn.

df = sns.load_dataset("tips")
df.head()

Output:

   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4

sns.relplot(x="total_bill", y="tip", data=df)

Output:

[Figure: scatter plot of tip against total bill]

While the points are plotted in two dimensions, another dimension can be added to the plot by coloring the points according to a third variable. In seaborn, this is referred to as using a "hue semantic", because the color of the point gains meaning.

sns.relplot(x="total_bill", y="tip", hue="smoker", data=df)

Output:

[Figure: scatter plot of tip against total bill, colored by smoker]

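A size semantic changes the size of each point according to the value of another variable. Unlike with matplotlib.pyplot.scatter(), the literal value of the variable is not used to pick the area of the point. Instead, the range of values in data units is normalized into a range in area units, and this range can be customized with the sizes parameter.

sns.relplot(x="total_bill", y="tip", size="size", sizes=(15, 200), data=df)

Output:

[Figure: scatter plot of tip against total bill, with point size showing party size]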
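The rescaling of a size variable into a range of point areas can be sketched as a simple linear map. The function scale_to_area is hypothetical and written for this example; it assumes a linear normalization between the smallest and largest value, which may not match seaborn's behavior in every case.

```python
import numpy as np

def scale_to_area(values, sizes=(15, 200)):
    """Linearly map data values onto a (smallest, largest) range of point areas."""
    values = np.asarray(values, dtype=float)
    lo, hi = sizes
    # Normalize to the 0..1 range, then stretch to the requested areas.
    unit = (values - values.min()) / (values.max() - values.min())
    return lo + unit * (hi - lo)

party_size = np.array([1, 2, 3, 4, 5, 6])
areas = scale_to_area(party_size, sizes=(15, 200))
print(areas)
```

With this kind of mapping, the smallest value always gets the smallest area and the largest value the largest area, whatever the units of the original variable.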

Emphasizing continuity with line plots  

Scatter plots are highly effective, but there is no universally optimal type of visualization. Instead, the visual representation should be adapted for the specifics of the dataset and to the question you are trying to answer with the plot. 

With some datasets, you may want to understand changes in one variable as a function of time or a similarly continuous variable. In this situation, a good choice is to draw a line plot. In seaborn, this can be accomplished by the lineplot() function, either directly or with relplot() by setting kind="line". The example below assumes numpy and pandas are imported as np and pd.

df = pd.DataFrame(dict(time=np.arange(500), value=np.random.randn(500).cumsum()))
sns.relplot(x="time", y="value", kind="line", data=df)

Output:

[Figure: line plot of value against time]

Aggregation and representing uncertainty  

More complex datasets will have multiple measurements for the same value of the x variable. The default behavior in seaborn is to aggregate the multiple measurements at each x value by plotting the mean and the 95% confidence interval around the mean. We will use the fmri dataset for this.

df = sns.load_dataset("fmri")
df.head()

Output:

  subject  timepoint event    region    signal
0     s13         18  stim  parietal -0.017552
1      s5         14  stim  parietal -0.080883
2     s12         18  stim  parietal -0.081033
3     s11         18  stim  parietal -0.046134
4     s10         18  stim  parietal -0.037970

sns.relplot(x="timepoint", y="signal", kind="line", data=df)

Output:

[Figure: line plot of mean signal against timepoint with a 95% confidence band]
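The aggregation step can be approximated by hand: group the rows by x value, take the mean of each group, and estimate a 95% confidence interval by bootstrapping. The data below are synthetic, the helper mean_and_ci is a hypothetical function written for this sketch, and the percentile bootstrap with 1000 resamples is an assumption about a reasonable default rather than a statement of seaborn's exact procedure.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical repeated measurements: 20 "subjects" at each of 5 timepoints.
timepoint = np.repeat(np.arange(5), 20)
signal = np.sin(timepoint / 2.0) + rng.normal(0.0, 0.3, size=timepoint.size)
data = pd.DataFrame({"timepoint": timepoint, "signal": signal})

def mean_and_ci(values, n_boot=1000, level=95):
    """Mean of the values plus a percentile-bootstrap confidence interval."""
    values = np.asarray(values)
    # Resample with replacement and record the mean of each resample.
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    boot_means = values[idx].mean(axis=1)
    half = (100 - level) / 2
    low, high = np.percentile(boot_means, [half, 100 - half])
    return values.mean(), low, high

rows = []
for t, group in data.groupby("timepoint"):
    mean, low, high = mean_and_ci(group["signal"].to_numpy())
    rows.append({"timepoint": t, "mean": mean, "low": low, "high": high})
summary = pd.DataFrame(rows).set_index("timepoint")
print(summary)
```

The mean column corresponds to the line, and the low and high columns to the edges of the shaded band around it.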

Another good option, especially with larger data, is to represent the spread of the distribution at each time point by plotting the standard deviation instead of a confidence interval. In seaborn 0.12 and later, the ci parameter is deprecated and errorbar="sd" should be passed instead.

sns.relplot(x="timepoint", y="signal", kind="line", ci="sd", data=df)

Output:

[Figure: line plot of mean signal against timepoint with a standard deviation band]
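A standard deviation band can be sketched with a plain pandas aggregation: the line is the group mean and the band is one standard deviation on either side. The data are synthetic, and the column names lower and upper are chosen only for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical repeated measurements: 20 values at each of 5 timepoints.
timepoint = np.repeat(np.arange(5), 20)
signal = np.sin(timepoint / 2.0) + rng.normal(0.0, 0.3, size=timepoint.size)
data = pd.DataFrame({"timepoint": timepoint, "signal": signal})

# Mean and sample standard deviation of the signal at each timepoint.
stats = data.groupby("timepoint")["signal"].agg(["mean", "std"])
stats["lower"] = stats["mean"] - stats["std"]
stats["upper"] = stats["mean"] + stats["std"]
print(stats)
```

Unlike a confidence interval for the mean, this band does not shrink as more measurements are added, because it describes the spread of the observations rather than the uncertainty of the estimate.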

Plotting subsets of data with semantic mappings 

The lineplot() function has the same flexibility as scatterplot(): it can show up to three additional variables by modifying the hue, size, and style of the plot elements.

It does so using the same API as scatterplot(), meaning that we don't need to stop and think about the parameters that control the look of lines vs. points in matplotlib.

Using semantics in lineplot() will also determine how the data get aggregated. For example, adding a hue semantic with two levels splits the plot into two lines and error bands, coloring each to indicate which subset of the data they correspond to.

sns.relplot(x="timepoint", y="signal", hue="event", kind="line", data=df)

Output:

[Figure: line plot of mean signal against timepoint, with one line and error band per event]
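The effect of a hue semantic on aggregation can be sketched as a grouped mean: the data are split by the hue variable first, and each subset is averaged at every x value. The data below are synthetic, with event labels borrowed from the example above and values invented for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical long-form data: 2 events x 5 timepoints x 10 repeats.
events = np.repeat(["stim", "cue"], 50)
timepoint = np.tile(np.repeat(np.arange(5), 10), 2)
offset = np.where(events == "stim", 0.5, 0.0)
signal = offset + rng.normal(0.0, 0.1, size=events.size)
data = pd.DataFrame({"event": events, "timepoint": timepoint, "signal": signal})

# One mean per (event, timepoint); each column of `lines` is one plotted line.
lines = data.groupby(["event", "timepoint"])["signal"].mean().unstack("event")
print(lines)
```

Each column of lines holds the y values of one colored line, so a hue variable with two levels yields two lines, each with its own error band.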

In the next article, we will learn how to plot categorical variables using seaborn.
