Visualizing categorical data using seaborn


In the relational plot tutorial, we saw how to use different visual representations to show the relationship between multiple variables in a dataset. In the examples, we focused on cases where the main relationship was between two numerical variables. If one of the main variables is categorical (divided into discrete groups), it may be helpful to use a more specialized approach to visualization.

In seaborn, there are several different ways to visualize a relationship involving categorical data. Similar to the relationship between relplot() and either scatterplot() or lineplot(), there are two ways to make these plots. There are several axes-level functions for plotting categorical data in different ways and a figure-level interface, catplot(), that gives unified higher-level access to them. 
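As a rough illustration of the two entry points described above, the sketch below draws the same box plot once through catplot() and once through the axes-level boxplot(). The small DataFrame and its column names (group, value) are made up for this example, and the Agg backend is only selected so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption so this runs headless
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical toy data, not part of the tutorial's dataset
df = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "b"],
    "value": [1.0, 2.0, 4.0, 3.0, 5.0, 9.0],
})

# Figure-level: catplot() creates its own figure and returns a FacetGrid
g = sns.catplot(x="group", y="value", kind="box", data=df)

# Axes-level: boxplot() draws onto an existing matplotlib Axes and returns it
fig, ax = plt.subplots()
returned_ax = sns.boxplot(x="group", y="value", data=df, ax=ax)

print(type(g).__name__)    # FacetGrid
print(returned_ax is ax)   # True
```

The figure-level call manages the figure for you, while the axes-level call is the one to reach for when the plot has to live inside a figure you have already laid out.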

It’s helpful to think of the different categorical plot kinds as belonging to three different families, which we’ll discuss in detail below. 

Categorical scatterplots:

• stripplot() (with kind="strip"; the default)

• swarmplot() (with kind="swarm")

Categorical distribution plots:

• boxplot() (with kind="box")

• violinplot() (with kind="violin")

• boxenplot() (with kind="boxen")

Categorical estimate plots:

• pointplot() (with kind="point")

• barplot() (with kind="bar")

• countplot() (with kind="count")

These families represent the data using different levels of granularity.  When deciding which to use, you’ll have to think about the question  that you want to answer.  

The unified API makes it easy to switch between different kinds and see your data from several perspectives. In this tutorial, we’ll mostly focus on the figure-level interface, catplot(). Remember that this function is a higher-level interface to each of the functions above, so we’ll reference them when we show each kind of plot, keeping the more verbose kind-specific API documentation at hand. We will use the tips dataset.

import seaborn as sns
import matplotlib.pyplot as plt

# Load the example "tips" dataset that ships with seaborn and preview the first rows
tips = sns.load_dataset("tips")
tips.head()

   total_bill   tip     sex  smoker  day    time  size
0       16.99  1.01  Female      No  Sun  Dinner     2
1       10.34  1.66    Male      No  Sun  Dinner     3
2       21.01  3.50    Male      No  Sun  Dinner     3
3       23.68  3.31    Male      No  Sun  Dinner     2
4       24.59  3.61  Female      No  Sun  Dinner     4

Categorical scatterplots  

The default representation of the data in catplot() uses a scatterplot. There are two different categorical scatter plots in seaborn. They take different approaches to resolving the main challenge in representing categorical data with a scatter plot, which is that all of the points belonging to one category would fall on the same position along the axis corresponding to the categorical variable. The approach used by stripplot(), which is the default kind in catplot(), is to adjust the positions of points on the categorical axis with a small amount of random jitter.

sns.catplot(x="day", y="total_bill", data=tips)

The jitter parameter controls the magnitude of jitter or disables it  altogether. 

sns.catplot(x="day", y="total_bill", data=tips,  jitter=False)
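One way to see what the jitter parameter does is to read back the x coordinates of the plotted points. The sketch below is only an illustration: the tiny DataFrame is invented, x_positions() is a hypothetical helper written for this example, and the value 0.3 is an arbitrary choice for the jitter magnitude.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption so this runs headless
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical toy data with two categories
df = pd.DataFrame({
    "day": ["Thur"] * 4 + ["Fri"] * 4,
    "total_bill": [10.0, 12.5, 15.0, 20.0, 11.0, 14.0, 18.0, 25.0],
})

def x_positions(grid):
    """Hypothetical helper: x coordinates of every point on a single-facet grid."""
    offsets = [np.asarray(c.get_offsets()) for c in grid.ax.collections]
    return np.vstack(offsets)[:, 0]

# jitter=False: every point sits exactly on its category position
g_off = sns.catplot(x="day", y="total_bill", jitter=False, data=df)
x_off = x_positions(g_off)

# jitter=0.3: points are spread randomly around the category position
g_wide = sns.catplot(x="day", y="total_bill", jitter=0.3, data=df)
x_wide = x_positions(g_wide)

# Distance of each point from the nearest category position
print(np.abs(x_off - np.round(x_off)).max())
print(np.abs(x_wide - np.round(x_wide)).max())
```

With jitter disabled the first distance should be zero, while the second should stay within the requested jitter magnitude.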

The second approach adjusts the points along the categorical axis using an algorithm that prevents them from overlapping. It can give a better representation of the distribution of observations, although it only works well for relatively small datasets. This kind of plot is sometimes called a “beeswarm” and is drawn in seaborn by swarmplot(), which is activated by setting kind="swarm" in catplot().

sns.catplot(x="day", y="total_bill", kind="swarm",  data=tips)

Similar to the relational plots, it’s possible to add another dimension to a categorical plot by using a hue semantic. The categorical plots do not currently support size or style semantics.  

Each different categorical plotting function handles the hue semantic differently. For the scatter plots, it is only necessary to change the color of the points. 

sns.catplot(x="day", y="total_bill", hue="sex", kind="swarm", data=tips)

Unlike with numerical data, it is not always obvious how to order the levels of the categorical variable along its axis. In general, the seaborn categorical plotting functions try to infer the order of categories from the data. If your data have a pandas Categorical datatype, then the default order of the categories can be set there. If the variable passed to the categorical axis looks numerical, the levels will be sorted. But the data are still treated as categorical and drawn at ordinal positions on the categorical axes even when numbers are used to label them.

sns.catplot(x="size", y="total_bill", data=tips)
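The two ordering rules above can be checked on a small example. This is a sketch under stated assumptions: the DataFrame is invented (it only borrows the column names day, size and total_bill from the tips data), and the category order passed to pd.Categorical is chosen for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption so this runs headless
import pandas as pd
import seaborn as sns

# Hypothetical toy data; rows are deliberately not in weekday order
df = pd.DataFrame({
    "day": ["Sun", "Sat", "Thur", "Fri", "Sun", "Sat"],
    "size": [6, 2, 1, 2, 6, 1],
    "total_bill": [30.0, 18.5, 9.0, 21.0, 42.0, 11.5],
})

# Case 1: a pandas Categorical column carries its own category order
df["day"] = pd.Categorical(
    df["day"], categories=["Thur", "Fri", "Sat", "Sun"], ordered=True
)
g_day = sns.catplot(x="day", y="total_bill", data=df)
g_day.ax.figure.canvas.draw()  # make sure tick labels are populated
day_labels = [t.get_text() for t in g_day.ax.get_xticklabels()]

# Case 2: a numeric-looking variable is sorted, but still drawn at
# ordinal positions rather than at its numeric values
g_size = sns.catplot(x="size", y="total_bill", data=df)
g_size.ax.figure.canvas.draw()
size_labels = [t.get_text() for t in g_size.ax.get_xticklabels()]
size_ticks = [float(t) for t in g_size.ax.get_xticks()]

print(day_labels)
print(size_labels, size_ticks)
```

The day labels should follow the order given to pd.Categorical rather than the order of the rows, and the sizes 1, 2 and 6 should appear as evenly spaced ticks instead of being placed at their numeric values.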

The other option for choosing a default ordering is to take the levels of  the category as they appear in the dataset. The ordering can also be  controlled on a plot-specific basis using the order parameter. This can  be important when drawing multiple categorical plots in the same  figure, which we’ll see more of below. 

sns.catplot(x="smoker", y="tip", order=["No", "Yes"], data=tips)

Distributions of observations within categories 

As the size of the dataset grows, categorical scatter plots become  limited in the information they can provide about the distribution of  values within each category. When this happens, there are several  approaches for summarizing the distributional information in ways that  facilitate easy comparisons across the category levels. 

Boxplots  

The first is the familiar boxplot(). This kind of plot shows the three  quartile values of the distribution along with extreme values. The  whiskers extend to points that lie within 1.5 IQRs of the lower and  upper quartile, and then observations that fall outside this range are  displayed independently. This means that each value in the boxplot  corresponds to an actual observation in the data.

sns.catplot(x="day", y="total_bill", kind="box", data=tips)
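To make the whisker rule described above concrete, the following sketch computes the box plot statistics by hand with NumPy. The numbers are invented for the example, and the calculation assumes the 1.5 IQR rule stated above together with NumPy's default linear-interpolation percentiles; it is not seaborn's internal code.

```python
import numpy as np

# Hypothetical bill amounts, with one unusually large value
bills = np.array([10, 12, 14, 15, 16, 18, 20, 22, 48], dtype=float)

# The three quartile values that define the box and its center line
q1, median, q3 = np.percentile(bills, [25, 50, 75])
iqr = q3 - q1

# Limits at 1.5 IQRs beyond the lower and upper quartile
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme observations inside the limits
inside = bills[(bills >= lower_fence) & (bills <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()

# Observations outside the limits are drawn as individual points
outliers = bills[(bills < lower_fence) | (bills > upper_fence)]

print(q1, median, q3)             # 14.0 16.0 20.0
print(whisker_low, whisker_high)  # 10.0 22.0
print(outliers)                   # [48.]
```

Note that the whiskers stop at 10 and 22, which are real observations, rather than at the computed limits of 5 and 29.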

When adding a hue semantic, the box for each level of the semantic  variable is moved along the categorical axis so they don’t overlap. 

sns.catplot(x="day", y="total_bill", hue="smoker", kind="box", data=tips)

This behavior is called “dodging” and is turned on by default because it  is assumed that the semantic variable is nested within the main  categorical variable. 
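When the hue variable is not nested within the categorical variable, dodging can be switched off with the dodge parameter. The sketch below assumes a made-up DataFrame and a hypothetical weekend column derived from day, so each day has only one hue level and there is nothing to place side by side.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; an assumption so this runs headless
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical toy data: three bills per day
df = pd.DataFrame({
    "day": ["Thur"] * 3 + ["Fri"] * 3 + ["Sat"] * 3 + ["Sun"] * 3,
    "total_bill": [12.0, 15.5, 20.0, 11.0, 17.0, 22.5,
                   14.0, 21.0, 30.0, 16.0, 24.0, 35.0],
})

# The hue variable is fully determined by the day, so it is not nested
df["weekend"] = np.where(df["day"].isin(["Sat", "Sun"]), "Weekend", "Weekday")

# dodge=False keeps each box centered on its category position
g = sns.catplot(x="day", y="total_bill", hue="weekend",
                kind="box", dodge=False, data=df)
```

Here the hue only colors the boxes; leaving dodging on would shift each box off-center to reserve space for a second hue level that never occurs on that day.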

A related function, boxenplot(), draws a plot that is similar to a box  plot but optimized for showing more information about the shape of the  distribution. It is best suited for larger datasets. 

sns.catplot(x="day", y="total_bill", kind="boxen", data=tips, hue="smoker")

Violinplots

A violin plot plays a similar role to a box-and-whisker plot. It shows the distribution of quantitative data across several levels of one or more categorical variables such that those distributions can be compared. Unlike a box plot, in which all of the plot components correspond to actual datapoints, the violin plot features a kernel density estimation of the underlying distribution.

This can be an effective and attractive way to show multiple  distributions of data at once, but keep in mind that the estimation  procedure is influenced by the sample size, and violins for relatively  small samples might look misleadingly smooth.

sns.catplot(x="day", y="total_bill", kind="violin", data=tips)

This approach uses the kernel density estimate to provide a richer  description of the distribution of values. Additionally, the quartile and  whisker values from the boxplot are shown inside the violin.  

It is also possible to “split” the violins when the hue parameter has only two levels, which can allow for a more efficient use of space:

sns.catplot(x="day", y="total_bill", kind="violin", split=True, data=tips, hue="smoker")

It can also be useful to combine swarmplot() or stripplot() with a box plot or violin plot to show each observation along with a summary of the distribution:

g = sns.catplot(x="day", y="total_bill", kind="violin", inner=None, data=tips)
sns.swarmplot(x="day", y="total_bill", color="k", size=3, data=tips, ax=g.ax)

In the next article, we will learn how to plot statistical estimates within categories, along with joint and pair plots.
