Visualizing bivariate distribution using seaborn

In the previous article, all of the examples dealt with univariate distributions (distributions of a single variable), perhaps conditional on a second variable assigned to hue.

Now we will assign a second variable to y, which produces a plot of the bivariate distribution. We will use the same penguins dataset here:

sns.displot(penguins, x="bill_length_mm", y="bill_depth_mm")

Output :

Assigning a hue variable will plot multiple heatmaps or contour sets using different colors. For bivariate histograms, this will only work well if there is minimal overlap between the conditional distributions:

sns.displot(penguins, x="bill_length_mm", y="bill_depth_mm", hue="species")

Output :

A bivariate histogram bins the data within rectangles that tile the plot and then shows the count of observations within each rectangle with the fill color. Similarly, a bivariate KDE plot smooths the (x, y) observations with a 2D Gaussian. The default representation then shows the contours of the 2D density:

sns.displot(penguins, x="bill_length_mm", y="bill_depth_mm", kind="kde")

Output :
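
To make the smoothing step concrete, here is a minimal NumPy sketch of a 2D Gaussian kernel density estimate. The observations, the bandwidth h, and the evaluation grid are made up for illustration, and a single isotropic bandwidth is assumed. seaborn's own estimate relies on SciPy and chooses the bandwidth automatically, so read this as an outline of the idea rather than the library's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
pts = rng.normal(size=(50, 2))   # hypothetical (x, y) observations
h = 0.5                          # hypothetical bandwidth (kernel standard deviation)

# Regular grid on which the density is evaluated
gx = np.linspace(-6, 6, 121)
gy = np.linspace(-6, 6, 121)
X, Y = np.meshgrid(gx, gy)

# Squared distance from every grid node to every observation
d2 = (X[..., None] - pts[:, 0]) ** 2 + (Y[..., None] - pts[:, 1]) ** 2

# Average of one 2D Gaussian bump per observation
density = np.exp(-d2 / (2 * h**2)).sum(axis=-1) / (len(pts) * 2 * np.pi * h**2)

# A density should integrate to roughly 1 over the grid
cell_area = (gx[1] - gx[0]) * (gy[1] - gy[0])
total = density.sum() * cell_area
print(round(total, 3))
```

The contours in a KDE plot are level curves of an array like density.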

The contour approach of the bivariate KDE plot lends itself better to evaluating overlap, although a plot with too many contours can get busy:

sns.displot(penguins, x="bill_length_mm", y="bill_depth_mm", hue="species", kind="kde")

Output :

Just as with univariate plots, the choice of bin size or smoothing bandwidth will determine how well the plot represents the underlying bivariate distribution.

The same parameters apply, but they can be tuned for each variable by passing a pair of values. Here the bins are 2 units wide along x and 0.5 units wide along y:

sns.displot(penguins, x="bill_length_mm",  y="bill_depth_mm", binwidth=(2, .5))

Output :
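
The binning that a pair of bin widths implies can be sketched with NumPy's histogram2d. The data below are random stand-ins for the two bill measurements, and the edges helper is a hypothetical convenience function. seaborn may place its bin edges differently, so this only illustrates the idea of one width per axis.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-ins for bill length and bill depth (mm)
x = rng.normal(44, 5, size=300)
y = rng.normal(17, 2, size=300)

def edges(values, width):
    """Fixed-width bin edges that cover the full range of the values."""
    start = np.floor(values.min() / width) * width
    n_bins = int(np.floor((values.max() - start) / width)) + 1
    return start + width * np.arange(n_bins + 1)

x_edges = edges(x, 2)      # bins 2 units wide along x
y_edges = edges(y, 0.5)    # bins 0.5 units wide along y

# counts[i, j] is the number of observations in rectangle (i, j)
counts, _, _ = np.histogram2d(x, y, bins=[x_edges, y_edges])
print(counts.shape, int(counts.sum()))
```

Each cell of counts corresponds to one rectangle of the heatmap, and the fill color encodes its value.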

To aid interpretation of the heatmap, add a colorbar to show the mapping between counts and color intensity:

sns.displot(penguins, x="bill_length_mm", y="bill_depth_mm", binwidth=(2, .5), cbar=True)

Output :

The meaning of the bivariate density contours is less straightforward. Because the density is not directly interpretable, the contours are drawn at iso-proportions of the density, meaning that each curve shows a level set such that some proportion p of the density lies below it.

The p values are evenly spaced, with the lowest level controlled by the thresh parameter and the number of levels controlled by levels:

sns.displot(penguins, x="bill_length_mm",  y="bill_depth_mm", kind="kde", thresh=.2, levels=4)

Output :
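
One way to turn proportions into contour heights is sketched below. The density is a hypothetical standard 2D Gaussian on a grid, and iso_levels is an illustrative helper, not a seaborn function. The assumption is that the p values run evenly from thresh up to 1 and that each level is the density value below which a share p of the total mass lies.

```python
import numpy as np

# Hypothetical density: an unnormalized standard 2D Gaussian on a grid
g = np.linspace(-4, 4, 200)
X, Y = np.meshgrid(g, g)
density = np.exp(-(X**2 + Y**2) / 2)

def iso_levels(density, thresh=0.2, levels=4):
    """Density values whose level sets leave a share p of the mass below them."""
    proportions = np.linspace(thresh, 1, levels)    # evenly spaced p values
    values = np.sort(density.ravel())[::-1]         # grid densities, highest first
    mass_above = np.cumsum(values) / values.sum()   # share of mass in the densest cells
    idx = np.searchsorted(mass_above, 1 - proportions)
    return proportions, np.take(values, idx, mode="clip")

proportions, contour_levels = iso_levels(density)
for p, level in zip(proportions, contour_levels):
    below = density[density < level].sum() / density.sum()
    print(f"p={p:.2f}  level={level:.3f}  mass below={below:.2f}")
```

With thresh=.2 the outermost contour leaves about 20% of the mass outside it, which is why raising thresh trims the sparse outer region of the plot.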

The bivariate histogram allows one or both variables to be discrete. Plotting one discrete and one continuous variable offers another way to compare conditional univariate distributions:

sns.displot(penguins, x="species", y="body_mass_g", hue="sex")

Output :

In contrast, plotting two discrete variables is an easy way to show the cross-tabulation of the observations:

sns.displot(penguins, x="species", y="island")

Output :
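
The counts behind such a plot are an ordinary cross-tabulation, which pandas can compute directly. The small data frame below is a hypothetical sample that only borrows the column names of the penguins dataset.

```python
import pandas as pd

# Hypothetical mini-sample with the same column names as the penguins dataset
sample = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Adelie", "Chinstrap", "Gentoo", "Gentoo"],
    "island": ["Torgersen", "Biscoe", "Dream", "Dream", "Biscoe", "Biscoe"],
})

# Rows are species, columns are islands, cells are observation counts
table = pd.crosstab(sample["species"], sample["island"])
print(table)
```

Each cell of table plays the role of one colored rectangle in the plot, and combinations that never occur get a count of zero.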

Visualizing statistical relationships using seaborn

We will discuss three seaborn functions in this tutorial.

• relplot()

• scatterplot()

• lineplot()

As we will see, these functions can be quite illuminating because they use simple and easily-understood representations of data that can nevertheless represent complex dataset structures. They can do so because they plot two-dimensional graphics that can be enhanced by mapping up to three additional variables using the semantics of hue, size, and style.

Relating variables with scatter plots

The scatter plot is a mainstay of statistical visualization. It depicts the joint distribution of two variables using a cloud of points, where each point represents an observation in the dataset. This depiction allows the eye to infer a substantial amount of information about whether there is any meaningful relationship between them.

There are several ways to draw a scatter plot in seaborn. The most basic, which should be used when both variables are numeric, is the scatterplot() function.

In the categorical visualization tutorial, we will see specialized tools for using scatter plots to visualize categorical data. scatterplot() is the default kind in relplot(); it can also be requested explicitly by setting kind="scatter".

Here we will use the tips dataset from seaborn:

df = sns.load_dataset("tips")
df.head()

Output :

	total_bill	tip	sex	smoker	day	time	size
0	16.99	1.01	Female	No	Sun	Dinner	2
1	10.34	1.66	Male	No	Sun	Dinner	3
2	21.01	3.50	Male	No	Sun	Dinner	3
3	23.68	3.31	Male	No	Sun	Dinner	2
4	24.59	3.61	Female	No	Sun	Dinner	4

sns.relplot(x="total_bill", y="tip", data=df)

Output :

While the points are plotted in two dimensions, another dimension can be added to the plot by coloring the points according to a third variable. In seaborn, this is referred to as using a "hue semantic", because the color of the point gains meaning:

sns.relplot(x="total_bill", y="tip", hue="smoker", data=df)

Output :

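A further variable can be mapped to the size of each point with the size parameter. Unlike with matplotlib.pyplot.scatter(), the literal value of the variable is not used to pick the area of the point. Instead, the range of values in data units is normalized into a range in area units. This range can be customized with the sizes parameter:

sns.relplot(x="total_bill", y="tip", size="size", sizes=(15, 200), data=df)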

Output :
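
The normalization into area units can be pictured as a linear rescaling. The sketch below assumes a plain min-max mapping onto the sizes range. scale_sizes is a hypothetical helper written for this illustration, and seaborn's internal normalization may differ in its details.

```python
import numpy as np

def scale_sizes(values, sizes=(15, 200)):
    """Map data values linearly onto a (smallest, largest) marker-area range."""
    values = np.asarray(values, dtype=float)
    lo, hi = sizes
    normed = (values - values.min()) / (values.max() - values.min())  # 0..1
    return lo + normed * (hi - lo)

party_size = np.array([1, 2, 3, 4, 5, 6])   # the range of the "size" column in tips
areas = scale_sizes(party_size)
print(areas)   # 15, 52, 89, 126, 163, 200
```

The smallest party gets the smallest area and the largest party the largest, regardless of the units of the original variable.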

Emphasizing continuity with line plots

Scatter plots are highly effective, but there is no universally optimal type of visualization. Instead, the visual representation should be adapted for the specifics of the dataset and to the question you are trying to answer with the plot.

With some datasets, you may want to understand changes in one variable as a function of time or a similarly continuous variable. In this situation, a good choice is to draw a line plot. In seaborn, this can be accomplished by the lineplot() function, either directly or with relplot() by setting kind="line". The example below assumes that NumPy and pandas have been imported as np and pd:

df = pd.DataFrame(dict(time=np.arange(500), value=np.random.randn(500).cumsum()))
sns.relplot(x="time", y="value", kind="line", data=df)

Output :

Aggregation and representing uncertainty

More complex datasets will have multiple measurements for the same value of the x variable. The default behavior in seaborn is to aggregate the multiple measurements at each x value by plotting the mean and the 95% confidence interval around the mean. We will use the fmri dataset for this:

df = sns.load_dataset("fmri")
df.head()

Output :

	subject	timepoint	event	region	signal
0	s13	18	stim	parietal	-0.017552
1	s5	14	stim	parietal	-0.080883
2	s12	18	stim	parietal	-0.081033
3	s11	18	stim	parietal	-0.046134
4	s10	18	stim	parietal	-0.037970

sns.relplot(x="timepoint", y="signal", kind="line",  data=df)

Output :
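
The aggregation can be reproduced by hand to see what the line and the band stand for. The sketch below uses made-up data with 20 repeated measurements at each of three x values and estimates the interval with a percentile bootstrap. seaborn's default interval is also bootstrap-based, but the number of resamples, the seed, and the mean_with_ci helper are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 20 repeated measurements at each of three x values
x = np.repeat([0, 1, 2], 20)
y = 0.5 * x + rng.normal(scale=1.0, size=x.size)

def mean_with_ci(values, n_boot=1000, level=95):
    """Mean of the values and a percentile-bootstrap interval around it."""
    resamples = rng.choice(values, size=(n_boot, values.size), replace=True)
    boot_means = resamples.mean(axis=1)
    tail = (100 - level) / 2
    low, high = np.percentile(boot_means, [tail, 100 - tail])
    return values.mean(), low, high

# One (mean, low, high) triple per distinct x value
summary = {int(xv): mean_with_ci(y[x == xv]) for xv in np.unique(x)}
for xv, (mean, low, high) in summary.items():
    print(f"x={xv}: mean={mean:.2f}, 95% CI=({low:.2f}, {high:.2f})")
```

The means trace the line, and the low and high values trace the lower and upper edges of the shaded band.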

Another good option, especially with larger data, is to represent the spread of the distribution at each time point by plotting the standard deviation instead of a confidence interval. In seaborn 0.12 and later this is requested with errorbar="sd"; earlier versions use ci="sd":

sns.relplot(x="timepoint", y="signal", kind="line", errorbar="sd", data=df)

Output :

Plotting subsets of data with semantic mappings

The lineplot() function has the same flexibility as scatterplot(): it can show up to three additional variables by modifying the hue, size, and style of the plot elements.

It does so using the same API as scatterplot(), meaning that we don't need to stop and think about the parameters that control the look of lines vs. points in matplotlib.

Using semantics in lineplot() will also determine how the data get aggregated. For example, adding a hue semantic with two levels splits the plot into two lines and error bands, coloring each to indicate which subset of the data they correspond to.

sns.relplot(x="timepoint", y="signal", hue="event",  kind="line", data=df)

Output :
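
The effect of a hue semantic on aggregation resembles adding the hue column to a group-by. The sketch below builds a small hypothetical long-form table shaped like fmri, with made-up signal values, and averages within each (event, timepoint) pair, giving one series per event.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical long-form data: 10 repeated signals per (event, timepoint)
rows = [
    {"event": event, "timepoint": t, "signal": base + 0.1 * t + rng.normal(scale=0.05)}
    for event, base in [("stim", 1.0), ("cue", 0.0)]
    for t in range(3)
    for _ in range(10)
]
long_df = pd.DataFrame(rows)

# Averaging within each (event, timepoint) pair yields one line per event
lines = long_df.groupby(["event", "timepoint"])["signal"].mean().unstack("event")
print(lines)
```

Each column of lines corresponds to one colored line in the plot, and the error band for a line is computed from the same subset of rows.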

In the next article, we will learn how to plot categorical variables using seaborn.

Steven Roger

Steven Roger is a technology blogger for the H2K Infosys blog, where he brings complex tech concepts to life with clear, engaging insights. With a passion for IT education and over a decade of industry experience, Steven specializes in demystifying the latest in software development, business analysis, and quality assurance training. His articles provide readers with practical knowledge and tips on upskilling for successful careers in tech.
