Data Visualization with Seaborn
Seaborn is a Python data visualization library that provides simple code to create elegant visualizations for statistical exploration and insight.
Seaborn is built on top of Matplotlib, another Python plotting library, and improves on it in several ways:
- Seaborn provides a more visually appealing plotting style and concise syntax.
- Seaborn natively understands Pandas DataFrames, so data loaded from CSV files can be plotted directly.
- Seaborn can easily summarize Pandas DataFrames with many rows of data into aggregated charts.
DataFrames contain data structured into rows and columns. DataFrames look similar to other data tables, but they are designed specifically to be used with Python. We can create a DataFrame from a local CSV file (CSV files store data in a tabular format).
To create a DataFrame from a local CSV file, use the pandas read_csv() function, as in df = pd.read_csv('file_name.csv').
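A runnable sketch of that call follows. The file name file_name.csv comes from the text; the column names and rows are hypothetical and are written to a temporary folder only so the example runs on its own.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample rows, written to disk only so the example is self-contained.
csv_path = os.path.join(tempfile.mkdtemp(), "file_name.csv")
with open(csv_path, "w") as f:
    f.write("label,review_length\npositive,120\nnegative,95\npositive,143\n")

# Read the local CSV file into a DataFrame.
df = pd.read_csv(csv_path)
print(df.head())
```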
This code creates a DataFrame and saves it to a variable named df. The data in df comes from a local CSV file named file_name.csv. Once we have prepared and organized a pandas DataFrame, we are ready to plot with Seaborn. To import Seaborn under its conventional alias, use import seaborn as sns.
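Seaborn is usually imported alongside Matplotlib's pyplot module, which controls the figure Seaborn draws on. A minimal sketch of the imports (the plt alias is a common convention, not something the text requires):

```python
import matplotlib.pyplot as plt  # controls the figure that Seaborn draws on
import seaborn as sns            # conventional alias used in the rest of the article

print(sns.__version__)
```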
Plotting Bars with Seaborn
Consider the imdb-dataset-sentiment-analysis dataset, downloaded from Kaggle in CSV format. In Seaborn, we can create a bar plot with the sns.barplot() function, which takes at least three keyword arguments:
- data: a Pandas DataFrame that contains the data
- x: a string that tells Seaborn which column in the DataFrame contains x-labels
- y: a string that tells Seaborn which column in the DataFrame contains the heights we want to plot for each bar
To draw a bar plot from the DataFrame df, pass it as data together with the column names to use for x and y.
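The call might look like the sketch below. The actual columns of the Kaggle file are not listed in the text, so the label and review_length columns and their values are hypothetical stand-ins.

```python
import pandas as pd
import seaborn as sns

# Hypothetical stand-in for the DataFrame loaded from the CSV file.
df = pd.DataFrame({
    "label": ["positive", "negative", "positive", "negative", "positive", "negative"],
    "review_length": [120, 95, 143, 80, 131, 102],
})

# One bar per label; each bar's height is the mean review_length for that label.
ax = sns.barplot(data=df, x="label", y="review_length")
ax.set_title("Mean review length by label")
```

When running this as a script, calling plt.show() from matplotlib.pyplot displays the figure.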
By default, Seaborn places error bars on each bar when you use the barplot() function. Error bars are the small lines that extend above and below the top of each bar. By default they show a 95% confidence interval for the plotted aggregate, that is, the range in which the true value is likely to fall.
Calculating Different Aggregates
In most cases, we’ll want to plot the mean of our data, but sometimes, we’ll want something different:
- If our data has many outliers, we may want to plot the median.
- If our data is categorical, we might want to count how many times each category appears.
Seaborn is flexible and can calculate any aggregate we want. To do so, we use the keyword argument estimator, which accepts any function that reduces a list of values to a single number.
For example, to plot the median instead of the mean, pass NumPy's median function to the estimator keyword, as in estimator=np.median.
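As an illustration, here is a minimal sketch of a bar plot that shows medians. The DataFrame and its column names are hypothetical, chosen only to make the example runnable.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical data: three review lengths per label.
df = pd.DataFrame({
    "label": ["positive", "negative", "positive", "negative", "positive", "negative"],
    "review_length": [120, 95, 143, 80, 131, 102],
})

# estimator=np.median makes each bar show the median instead of the mean.
ax = sns.barplot(data=df, x="label", y="review_length", estimator=np.median)

# The bar heights should match the medians computed directly with pandas.
print(df.groupby("label")["review_length"].median())
```

Passing estimator=len instead would make each bar show the number of rows in its category.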
KDE Plots
Bar plots can tell us what the mean of our dataset is, but they don’t give us any hints as to the distribution of the dataset values. For all we know, the data could be clustered around the mean or spread out evenly across the entire range. To find out more about each of these datasets, we’ll need to examine their distributions. A common way of doing so is by plotting the data as a histogram, but histograms have their drawbacks as well. Seaborn offers another option for graphing distributions: KDE Plots.
KDE stands for Kernel Density Estimator. A KDE plot shows the distribution of a univariate dataset as a smooth curve. A univariate dataset has only one variable and is also referred to as one-dimensional, as opposed to bivariate or two-dimensional datasets, which have two variables.
To plot a KDE in Seaborn, we use the function sns.kdeplot().
A KDE plot takes the following arguments:
- data: the univariate dataset being visualized, like a Pandas DataFrame, Python list, or NumPy array
- shade: a boolean that determines whether or not the space underneath the curve is shaded (Seaborn 0.11 and later name this parameter fill, and shade is deprecated)
Let’s examine a KDE plot of the data from our CSV file.
KDE plots are often preferable to histograms because, depending on how you group the data into bins and how wide the bins are, a histogram can lead you to wildly different conclusions about the shape of the data. A KDE plot can mitigate these issues because it smooths the dataset, lets us generalize over the shape of our data, and isn’t beholden to specific data points.
Box Plots
While a KDE plot can tell us about the shape of the data, it’s cumbersome to compare multiple KDE plots at once. They also can’t tell us other statistical information, like the values of outliers.
Unlike a KDE plot, the box plot (also known as a box-and-whisker plot) can’t show the detailed shape of our dataset’s distribution. But it shows us the range of our dataset, gives us an idea about where a significant portion of our data lies, and reveals whether or not any outliers are present.
Let’s examine how we interpret a box plot:
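- The box represents the interquartile range; its edges are the first and third quartiles
- The line in the middle of the box is the median
- The end lines (whiskers) extend to the most extreme data points that are not considered outliers, by default those within 1.5 times the interquartile range of the box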
- The diamonds show outliers
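The parts of a box plot can be worked out by hand. The sketch below does so with NumPy for a small hypothetical dataset, assuming the common 1.5 × IQR whisker rule; the exact quartile values a plotting library reports can differ slightly depending on its interpolation method.

```python
import numpy as np

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])  # hypothetical data

# Box edges (first and third quartiles) and the median line.
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Points beyond these fences are drawn as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

inside = values[(values >= lower_fence) & (values <= upper_fence)]
outliers = values[(values < lower_fence) | (values > upper_fence)]

print("box:", q1, "to", q3, "median:", median)
print("whiskers:", inside.min(), "to", inside.max())
print("outliers:", outliers)
```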
To plot a box plot in Seaborn, we use the function sns.boxplot().
A box plot takes the following arguments:
- data: the dataset we’re plotting, like a DataFrame, list, or an array
- x: a one-dimensional set of values, like a Series, list, or array
- y: a second set of one-dimensional data
If you use a Pandas Series for the x and y values, the Series names will also be used as the axis labels. For example, if you pass a Series named value as your y data, Seaborn will automatically apply that name as the y-axis label.
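The sketch below passes two Pandas Series directly as x and y and then prints the axis labels Seaborn picked up from their names. The DataFrame and its column names are hypothetical.

```python
import pandas as pd
import seaborn as sns

# Hypothetical data: one categorical column and one numeric column.
df = pd.DataFrame({
    "label": ["positive", "negative", "positive", "negative", "positive", "negative"],
    "review_length": [120, 95, 143, 80, 131, 102],
})

# Each Series carries its column name, which becomes the axis label.
ax = sns.boxplot(x=df["label"], y=df["review_length"])
print(ax.get_xlabel(), ax.get_ylabel())
```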
Violin Plots
Seaborn gives us another option for comparing distributions: the violin plot. Violin plots provide more information than box plots because, rather than only marking quartiles and individual outlying points, they show an estimate of the dataset’s full distribution.
Violin plots are less familiar and trickier to read, so let’s break down the different parts:
- There are two KDE plots that are symmetrical along the center line.
- A white dot represents the median.
- The thick black line in the center of each violin represents the interquartile range.
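- The thin lines that extend from the center are whiskers like those of a box plot: by default, Seaborn draws a miniature box plot inside each violin, so these lines reach to the most extreme points within 1.5 times the interquartile range rather than marking a confidence interval.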
To plot a violin plot in Seaborn, use the function sns.violinplot().
sns.violinplot() accepts the following arguments:
- data: the dataset that we’re plotting, such as a list, DataFrame, or array
- x, y, and hue: each a one-dimensional set of data, such as a Series, list, or array
- any of the parameters to the function sns.boxplot()
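A minimal sketch of a violin plot comparing two groups is shown below. The data is randomly generated and the column names are hypothetical.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical data: 100 numeric values for each of two categories.
rng = np.random.default_rng(seed=1)
df = pd.DataFrame({
    "label": ["positive"] * 100 + ["negative"] * 100,
    "review_length": np.concatenate([
        rng.normal(130, 20, 100),
        rng.normal(100, 30, 100),
    ]),
})

# One violin per label, each showing a mirrored KDE of review_length.
ax = sns.violinplot(data=df, x="label", y="review_length")
```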