Interpreting Boxplots and Density Curves

Here's a quick example of how helpful comparative boxplots and density curves can be for visualizing the behaviour of a data set. I used RStudio to create these visualizations, but the most important part of this post is the interpretation of the plots. If you do not use RStudio, feel free to skip to the section where I address the plots themselves. If you are an RStudio user, however, you might pick up some coding tips along the way. These plots can be created with most statistical software packages on the market (I also use Minitab, ProcessMA, and Sigma XL). These are the plots that we are going to look at:



First things first, I've loaded the following libraries in RStudio:


Then I used the rnorm function in R to randomly generate three sets of data. These are heights in inches for males, females, and NBA players. I created 100 data points for each variable, and you can see in the screenshot below the mean and standard deviation for each data set. For example, "Male" has an average height of 70 inches with a standard deviation of 4 inches.
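In case the screenshot does not come through, here is a minimal sketch of what that step could look like. The seed and the variable names are my own assumptions, and the NBA mean of 80.4 is taken from the value quoted later in this post, so treat it as an approximation of the original setup rather than the exact code.

```r
# Assumption: a fixed seed so the numbers can be reproduced; the original post does not state one.
set.seed(42)

n <- 100  # 100 data points per group, as described above

# Means and standard deviations as quoted in the post (heights in inches).
# The NBA mean of 80.4 is the value mentioned later in the text.
male   <- rnorm(n, mean = 70,   sd = 4)
female <- rnorm(n, mean = 65,   sd = 3.5)
nba    <- rnorm(n, mean = 80.4, sd = 5)

# Sample mean and standard deviation of each group; these will be close to,
# but not exactly equal to, the values passed to rnorm().
group_summary <- sapply(
  list(Male = male, Female = female, NBA = nba),
  function(x) c(mean = mean(x), sd = sd(x))
)
print(round(group_summary, 2))
```

Because the data are random draws, the sample means and standard deviations land near the requested values rather than exactly on them.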


I created a data frame with my three variables of interest, then melted the variables into a long form to facilitate the creation of plots using the (awesome) package ggplot2.
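A rough sketch of this reshaping step is below. I am assuming melt() comes from the reshape2 package, and the column names "group" and "height" are names I picked for illustration; the original code may differ.

```r
# Assumption: melt() is taken from the reshape2 package.
library(reshape2)

set.seed(42)  # assumed seed, for reproducibility only

# Wide form: one column per group, 100 rows.
heights <- data.frame(
  Male   = rnorm(100, mean = 70,   sd = 4),
  Female = rnorm(100, mean = 65,   sd = 3.5),
  NBA    = rnorm(100, mean = 80.4, sd = 5)
)

# Long form: one row per observation, with a column naming the group
# and a column holding the height. This is the shape ggplot2 works best with.
heights_long <- melt(
  heights,
  measure.vars  = c("Male", "Female", "NBA"),
  variable.name = "group",
  value.name    = "height"
)

head(heights_long)
```

The long data frame has 300 rows (3 groups times 100 observations), which lets a single ggplot2 call split the data by group.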


The following code created my comparative boxplots and my density curves.
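For readers who cannot see the original code, here is one way these two plots could be built with ggplot2. The data frame and column names are assumptions of mine, and the colours will be the ggplot2 defaults rather than the ones in my plots.

```r
library(ggplot2)

set.seed(42)  # assumed seed, for reproducibility only

# Long-form data: one row per observation (names are illustrative).
heights_long <- data.frame(
  group  = factor(rep(c("Male", "Female", "NBA"), each = 100),
                  levels = c("Male", "Female", "NBA")),
  height = c(rnorm(100, mean = 70,   sd = 4),
             rnorm(100, mean = 65,   sd = 3.5),
             rnorm(100, mean = 80.4, sd = 5))
)

# Comparative boxplots: one box per group on a shared height axis.
box_plot <- ggplot(heights_long, aes(x = group, y = height, fill = group)) +
  geom_boxplot() +
  labs(x = NULL, y = "Height (inches)")

# Overlapping density curves: alpha makes the overlap visible.
density_plot <- ggplot(heights_long, aes(x = height, fill = group)) +
  geom_density(alpha = 0.5) +
  labs(x = "Height (inches)", y = "Density")

print(box_plot)
print(density_plot)
```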


Ok, let's get to the plots and what exactly they are telling us. First, let's have a look at the comparative boxplots.


Recall the features that boxplots showcase. The bottom whisker covers roughly the lowest 25% of the data set, up to the first quartile. The box itself contains the middle 50% of the data (its height is the inter-quartile range), and the top whisker covers the remaining 25%, from the third quartile upwards. Those little dots you see are the outliers; by the usual convention, a point is flagged as an outlier when it lies more than 1.5 times the inter-quartile range below the first quartile or above the third quartile. Finally, the bar crossing the box represents the median, the true midpoint of the data set.
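To make the outlier rule concrete, here is a small base R sketch that computes the quartiles, the inter-quartile range, and the 1.5 x IQR fences for one simulated group. The seed and variable names are assumptions; the rule itself is the common convention described above.

```r
set.seed(42)  # assumed seed, for reproducibility only
female <- rnorm(100, mean = 65, sd = 3.5)

# First quartile, median, and third quartile.
q <- quantile(female, probs = c(0.25, 0.5, 0.75))

# Inter-quartile range: the height of the box.
iqr <- IQR(female)

# Fences: anything beyond these is drawn as an individual dot.
lower_fence <- q[["25%"]] - 1.5 * iqr
upper_fence <- q[["75%"]] + 1.5 * iqr

outliers <- female[female < lower_fence | female > upper_fence]

print(q)
print(c(lower = lower_fence, upper = upper_fence))
print(outliers)  # may be empty if no point falls outside the fences
```

The whiskers extend to the most extreme observations that are still inside the fences, which is why a whisker can end before the fence value itself.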

We can see here that males have a higher median height than females; compared to NBA players, however, they are shorter. Of course, NBA players are also part of the population at large, so this specific group is not mutually exclusive when compared to males and females in general. For the sake of the exercise here, let's just consider that these were three distinct and mutually exclusive groups of people, chosen randomly for the study. Moving on, we can also see that the tallest female is still shorter than the NBA players' median height. The heights of the boxes (their thickness) are also a bit different, which makes sense since the standard deviations for males, females, and basketball players are 4, 3.5, and 5 inches, respectively. We can also see that the shortest basketball player is taller than 75% of the women in the study: that 75% is the bottom whisker (25%) plus the box, which again contains 50% of the data. From this point on, you can quickly start to visualize all sorts of comparisons amongst these three groups.

Next, let's have a quick look at the density curves for the same data sets.

 

The very first mistake most people make when looking at density curves is to assume that the height (peak) of each curve represents the highest observed value. Not so. Notice that females have the highest peak, yet they do not have the highest values (inches in this case) amongst the three data sets. The less variation we find in the data, the higher the peak: in this example, females have a standard deviation of 3.5 inches, while males and NBA players have 4 and 5 inches respectively. The height of each curve represents how strongly the data are concentrated (the density) around the mean value, so less variation means a taller curve. You can see here how each group (yellow for females, green for males, and purple for NBA players) has its peak around its mean value (65, 70, and 80.4 respectively). The shape of the curve is an outcome of the dispersion (spread, variation) of the data. Remember, these are randomly generated, normally distributed data sets. Density curves, especially when compared side by side, or in this case in an overlapping fashion, can very quickly showcase each data set's central tendency and spread.
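The link between spread and peak height can be checked directly. For a normal distribution the density at the mean is 1 / (sd * sqrt(2 * pi)), so it depends only on the standard deviation. The sketch below uses the theoretical curves (not the 100 simulated points), with the means and standard deviations quoted in this post; the variable names are mine.

```r
# Means and standard deviations quoted in the post (inches).
means <- c(Female = 65,  Male = 70, NBA = 80.4)
sds   <- c(Female = 3.5, Male = 4,  NBA = 5)

# Height of each theoretical normal curve at its own mean.
peaks <- dnorm(means, mean = means, sd = sds)

# Same values from the closed form: 1 / (sd * sqrt(2 * pi)).
closed_form <- 1 / (sds * sqrt(2 * pi))

print(round(peaks, 4))
print(round(closed_form, 4))
```

The peaks come out at about 0.114 for females, 0.100 for males, and 0.080 for NBA players: the smallest standard deviation gives the tallest curve, regardless of where the curve sits on the height axis.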

For some practice on mean shifting and spread of the data, check out my web app here:


