Interpreting Boxplots and Density Curves
Here's a quick example of how helpful comparative boxplots and density curves can be for visualizing the behaviour of a data set. I used RStudio to create these visualizations, but the most important part of this post is the interpretation of the plots. If you do not use RStudio, feel free to skip to the section where I address the plots themselves. If you are an RStudio user, though, you might pick up some coding tips starting at the top of this post. These plots can be created with most statistical software packages on the market (I also use Minitab, ProcessMA, and Sigma XL). These are the plots that we are going to look at:
First things first, I've loaded the following libraries in RStudio:
Then I used the rnorm function in R to randomly generate three sets of data: heights in inches for males, females, and NBA players. I created 100 data points for each variable, and you can see the mean and standard deviation for each data set in the screenshot below. For example, "Male" has a mean height of 70 inches and a standard deviation of 4 inches.
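For readers who want to follow along, here is a minimal sketch of that simulation step in base R. The Male mean and standard deviation (70 and 4) and the other two standard deviations (3.5 and 5) are the ones quoted in this post; the seed, the variable names, and the Female and NBA means (64 and 79) are assumptions picked only for illustration.

```r
# Reproducible simulation of three height samples (in inches).
set.seed(42)  # assumed seed, only so the numbers repeat between runs

n <- 100      # 100 data points per group, as in the post

# Male mean/sd and the other two sds come from the post;
# the Female and NBA means are hypothetical placeholders.
male   <- rnorm(n, mean = 70, sd = 4)
female <- rnorm(n, mean = 64, sd = 3.5)
nba    <- rnorm(n, mean = 79, sd = 5)

# Long format: one row per person, with a grouping column.
heights <- data.frame(
  group  = factor(rep(c("Male", "Female", "NBA"), each = n),
                  levels = c("Male", "Female", "NBA")),
  height = c(male, female, nba)
)

# Sample means and standard deviations land close to, but not exactly on,
# the values passed to rnorm().
aggregate(height ~ group, data = heights,
          FUN = function(x) c(mean = mean(x), sd = sd(x)))
```

Because rnorm draws random values, the sample mean and standard deviation of each group will differ slightly from the parameters you ask for, and they will change from run to run unless you fix the seed.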
The following code created my comparative boxplots and my density curves.
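A minimal base-R sketch of how plots like these could be produced is given here. It is a stand-in under stated assumptions rather than the exact code behind the figures in this post: it uses only base graphics, and the seed, the Female and NBA means, the colours, and the titles are all made up for illustration.

```r
# Simulate the three groups (Female and NBA means are hypothetical).
set.seed(42)
n <- 100
heights <- data.frame(
  group  = factor(rep(c("Male", "Female", "NBA"), each = n),
                  levels = c("Male", "Female", "NBA")),
  height = c(rnorm(n, mean = 70, sd = 4),
             rnorm(n, mean = 64, sd = 3.5),
             rnorm(n, mean = 79, sd = 5))
)
cols <- c("steelblue", "salmon", "goldenrod")

# Comparative boxplots: one box per group on a shared height axis.
bp <- boxplot(height ~ group, data = heights, col = cols,
              main = "Height by group",
              xlab = "Group", ylab = "Height (inches)")

# Density curves: estimate one kernel density per group...
dens <- lapply(split(heights$height, heights$group), density)

# ...find axis limits that fit all three curves...
x_range <- range(sapply(dens, function(d) range(d$x)))
y_max   <- max(sapply(dens, function(d) max(d$y)))

# ...then draw an empty frame and overlay the curves.
plot(x_range, c(0, y_max), type = "n",
     main = "Height density by group",
     xlab = "Height (inches)", ylab = "Density")
for (i in seq_along(dens)) {
  lines(dens[[i]], col = cols[i], lwd = 2)
}
legend("topright", legend = names(dens), col = cols, lwd = 2)
```

Putting all three groups on a shared axis is what makes the plots comparative: shifts in centre and differences in spread are visible at a glance.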
Ok, let's get to the plots and what exactly they are telling us. First, let's have a look at the comparative boxplots.
Recall the features that boxplots showcase. The bottom whisker, together with any outlier dots beyond it, covers the lowest 25% of the data set (everything below the first quartile). The box itself contains the middle 50% of the data (the inter-quartile range), and the top whisker covers the remaining 25% (from the third quartile upwards). Those little dots you see are the outliers; by the usual convention, a point is flagged as an outlier when it lies more than 1.5 times the inter-quartile range below the first quartile or above the third quartile. Finally, the bar crossing the box represents the median, the true midpoint of the data set.
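To make those pieces concrete, here is a small sketch that computes them by hand for one simulated sample. The seed and variable names are assumptions. Note also that R's boxplot() places the box edges at hinges, which can differ slightly from the quantile() values used below.

```r
set.seed(42)  # assumed seed, for repeatability only
male <- rnorm(100, mean = 70, sd = 4)

# Quartiles and median: the box edges and the bar across the box.
q1  <- unname(quantile(male, 0.25))
med <- median(male)
q3  <- unname(quantile(male, 0.75))
iqr <- q3 - q1  # inter-quartile range: the height of the box

# Outlier fences: 1.5 * IQR beyond each edge of the box.
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Points outside the fences are drawn as individual dots.
outliers <- male[male < lower_fence | male > upper_fence]

# Whiskers stop at the most extreme points still inside the fences.
whisker_low  <- min(male[male >= lower_fence])
whisker_high <- max(male[male <= upper_fence])

# Share of the sample that falls inside the box (about one half).
share_in_box <- mean(male >= q1 & male <= q3)
```

The fences themselves are not drawn on the plot; they only decide which points become dots and where the whiskers end.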
We can see here that males have a higher median height than females; compared to NBA players, however, they are shorter. Now, NBA players are of course also part of the population at large, so this group is not mutually exclusive with the other two. For the sake of the exercise, let's just treat these as three distinct and mutually exclusive groups of people, chosen randomly for the study. Moving on, we can also see that the tallest female is still shorter than the NBA players' median height. The boxes also differ a bit in thickness, which makes sense since the standard deviations for males, females, and basketball players are 4, 3.5, and 5 inches, respectively. We can also see that the shortest basketball player is taller than 75% of the women in the study: that 75% is the bottom whisker (25%) plus the box, which again contains 50% of the data. From this point on, you can quickly start to visualize all sorts of comparisons amongst these three groups.
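The same comparisons can be checked numerically instead of read off the plot. The sketch below does that on freshly simulated data; the seed and the Female and NBA means are assumptions, so the exact answers will not necessarily match the figures discussed above.

```r
set.seed(42)  # assumed seed
n <- 100
male   <- rnorm(n, mean = 70, sd = 4)
female <- rnorm(n, mean = 64, sd = 3.5)  # hypothetical mean
nba    <- rnorm(n, mean = 79, sd = 5)    # hypothetical mean

# Medians: the bars across the boxes.
medians <- c(Male = median(male), Female = median(female), NBA = median(nba))

# Inter-quartile ranges: the thickness of the boxes.
spreads <- c(Male = IQR(male), Female = IQR(female), NBA = IQR(nba))

# Is the tallest female shorter than the NBA median?
tallest_female_below_nba_median <- max(female) < median(nba)

# What share of females is shorter than the shortest NBA player?
share_females_below_min_nba <- mean(female < min(nba))

print(round(medians, 1))
print(round(spreads, 1))
print(tallest_female_below_nba_median)
print(share_females_below_min_nba)
```

A share of 0.75 or more on the last line corresponds to the shortest player sitting at or above the top of the female box, which is the statement read off the plot above.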
Next, let's have a quick look at the density curves for the same data sets.
For some practice on mean shifting and spread of the data, check out my web app here: