Running Stats with R | A Full Analysis
I have completed only two full marathons so far, along with many half marathons. I love running and enjoy all the benefits that come with it. My doctor recently cleared me to start running again after a long break caused by some back issues, so I set a goal of getting back on the road in September. The thing is, I spent half of the month down in Brazil, and I noticed a difference in my running pace. Brazil's temperatures are definitely warmer than Canada's in September (an average of 26.8 C in Brazil versus 17 C in Canada), but I also ran through a neighbourhood with a lot of hills while I was down south. So I decided to run some plots (no pun intended) and some statistical tests on the data, using R for everything, as a way of sharing some of the cool things you can do for your Six Sigma projects with this incredible statistical program.
Here's a quick look at the dataset:
And here's my research question: "Is there a statistically significant difference in my running pace between the two countries I ran in during September?"
To answer this question, the first thing I wanted to see was a set of different visualizations of the data I have available. To start off, these are the comparative boxplots between the two countries.
I noticed right away that my overall pace in Brazil was slower (more minutes to cover one km), and there's also more variation in my running pace down there.
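For readers who want to try this kind of comparison themselves, here is a minimal base R sketch. The data frame `runs`, its column names, and the simulated values are all assumptions made for illustration; they are not the runs from this post.

```r
# Hypothetical data: simulated paces, NOT the actual runs from this post.
set.seed(42)
runs <- data.frame(
  country = rep(c("Canada", "Brazil"), each = 10),
  pace    = c(rnorm(10, mean = 6.1, sd = 0.10),   # minutes per km
              rnorm(10, mean = 6.5, sd = 0.25))
)

# Side-by-side boxplots of pace by country
boxplot(pace ~ country, data = runs,
        ylab = "Pace (minutes per km)",
        main = "Running pace by country")

# Numbers behind the picture: mean and standard deviation per country
pace_summary <- aggregate(pace ~ country, data = runs,
                          FUN = function(x) c(mean = mean(x), sd = sd(x)))
print(pace_summary)
```

A higher box means a slower pace, and a taller box means more variation from run to run.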
Moving on, using the same dataset of course, I explored other visualizations such as these comparative density plots:
All of these plots show the same slower pace and higher variation in Brazil, each through a slightly different type of visualization.
Then I decided to create I-MR control charts that would distinguish my running pace in the two countries while keeping the runs in time order. Here they are:
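Comparative density plots can be sketched in base R along these lines. The vectors `canada` and `brazil` hold simulated paces and are assumptions for illustration only, not my actual data.

```r
# Hypothetical data: simulated paces, NOT the actual runs from this post.
set.seed(42)
canada <- rnorm(10, mean = 6.1, sd = 0.10)  # minutes per km
brazil <- rnorm(10, mean = 6.5, sd = 0.25)

# Kernel density estimate for each country
d_ca <- density(canada)
d_br <- density(brazil)

# Overlay both curves on shared axes so they can be compared directly
plot(d_ca,
     xlim = range(d_ca$x, d_br$x),
     ylim = c(0, max(d_ca$y, d_br$y)),
     main = "Pace density by country",
     xlab = "Pace (minutes per km)")
lines(d_br, lty = 2)
legend("topright", legend = c("Canada", "Brazil"), lty = c(1, 2))
```

A curve shifted to the right indicates a slower pace, and a wider, flatter curve indicates more variation.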
From these I-MR control charts I noticed a clear "jump" in the central tendency (average time in minutes per km) of my running in Brazil as well as more variation in the data when compared to my running in Canada.
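As a rough illustration of what sits behind an I-MR chart, the sketch below computes the limits by hand in base R using the standard constants for a moving range of two consecutive points (2.66 for the individuals chart, 3.267 for the moving range chart). The helper `imr_limits` and the simulated `pace` vector are hypothetical names and data; dedicated packages offer the same charts with less typing.

```r
# Hypothetical helper: I-MR limits for a series of runs in time order.
imr_limits <- function(x) {
  mr     <- abs(diff(x))          # moving ranges between consecutive runs
  mr_bar <- mean(mr)
  c(center = mean(x),
    lcl    = mean(x) - 2.66 * mr_bar,   # 2.66 = 3 / d2, with d2 = 1.128
    ucl    = mean(x) + 2.66 * mr_bar,
    mr_bar = mr_bar,
    mr_ucl = 3.267 * mr_bar)            # D4 for a moving range of 2
}

# Hypothetical data: simulated paces, NOT the actual runs from this post.
set.seed(42)
pace <- rnorm(10, mean = 6.1, sd = 0.10)
lim  <- imr_limits(pace)

op <- par(mfrow = c(2, 1))
# Individuals chart
plot(pace, type = "b", ylim = range(pace, lim[c("lcl", "ucl")]),
     xlab = "Run", ylab = "Minutes per km", main = "Individuals chart")
abline(h = lim[c("lcl", "center", "ucl")], lty = c(2, 1, 2))
# Moving range chart
mr <- abs(diff(pace))
plot(mr, type = "b", ylim = c(0, max(mr, lim["mr_ucl"])),
     xlab = "Run", ylab = "Moving range", main = "Moving range chart")
abline(h = lim[c("mr_bar", "mr_ucl")], lty = c(1, 2))
par(op)
```

Calling `imr_limits` separately on each country's runs gives one set of limits per country, which is what makes a shift in the center line and a widening of the limits visible.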
I then performed a two-sample t-test to see whether there was indeed a statistically significant difference in my running pace between the two countries. Here's the output from R:
The p-value was extremely low, well below the 5% alpha risk that goes with a 95% confidence level, so I confirmed that there is a statistically significant difference between my running pace in Canada and in Brazil. As the previous plots showed, the Brazil pace was the slower one, with more minutes per km.
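A two-sample t-test in R can be sketched with `t.test`. The data below are simulated and purely illustrative. Note that `t.test` defaults to the Welch version, which does not assume equal variances; that seems a sensible choice when one group is clearly more variable, but it is an assumption here rather than a statement of which variant produced the output above.

```r
# Hypothetical data: simulated paces, NOT the actual runs from this post.
set.seed(42)
canada <- rnorm(10, mean = 6.1, sd = 0.10)  # minutes per km
brazil <- rnorm(10, mean = 6.5, sd = 0.25)

# Welch two-sample t-test (var.equal = FALSE is the default)
result <- t.test(brazil, canada, conf.level = 0.95)
print(result)

# Decision rule at alpha = 0.05
alpha <- 0.05
significant <- result$p.value < alpha
significant
```

The printed confidence interval is for the difference in mean pace (Brazil minus Canada); an interval that excludes zero tells the same story as a p-value below alpha.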
Finally, although temperature was not the only factor in my running performance in Brazil, I did look into the relationship between temperature in Celsius and my running pace. Here's the scatter diagram for it:
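A scatter diagram like this takes only a few lines of base R. The vectors `temp`, `pace`, and `country` below are simulated and hypothetical; they only mimic the general shape of the relationship.

```r
# Hypothetical data: simulated temperatures and paces, NOT the actual runs.
set.seed(42)
temp    <- c(runif(10, 14, 20), runif(10, 24, 30))   # degrees Celsius
pace    <- 5.4 + 0.04 * temp + rnorm(20, sd = 0.15)  # minutes per km
country <- rep(c("Canada", "Brazil"), each = 10)

# Scatter diagram, with a different symbol for each country
plot(temp, pace,
     pch  = ifelse(country == "Brazil", 17, 1),
     xlab = "Temperature (C)",
     ylab = "Pace (minutes per km)",
     main = "Pace versus temperature")
legend("topleft", legend = c("Canada", "Brazil"), pch = c(1, 17))

# Strength and direction of the linear relationship
r <- cor(temp, pace)
r
```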
And here's the single-predictor linear regression on the data:
Generally speaking, the higher the temperature, the slower I was (said differently, the more minutes I spent running one km). With an R-squared of 0.5704, temperature explains about 57% of the variation in my pace, so the relationship is positive and moderate. The equation of the best-fit line is Minutes per km = 5.44338 + 0.03949 x Temperature.
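Here is a short sketch of how a single-predictor regression is fitted with `lm`, followed by the fitted equation from this post applied to two temperatures. The data frame `runs` is simulated and hypothetical, so its coefficients will not match mine. The helper `predict_pace` simply encodes the equation above, and plugging in the two September climate averages quoted earlier (17 C and 26.8 C) is just an illustration, since those are not the temperatures of any particular run.

```r
# Hypothetical data: simulated temperatures and paces, NOT the actual runs.
set.seed(42)
runs <- data.frame(temp = c(runif(10, 14, 20), runif(10, 24, 30)))
runs$pace <- 5.4 + 0.04 * runs$temp + rnorm(20, sd = 0.15)

# Fit pace as a linear function of temperature
fit <- lm(pace ~ temp, data = runs)
coef(fit)                 # intercept and slope for the simulated data
summary(fit)$r.squared    # share of pace variation explained by temperature

# The equation reported in this post, wrapped in a hypothetical helper
predict_pace <- function(temp_c) 5.44338 + 0.03949 * temp_c
predict_pace(c(17, 26.8))   # about 6.11 and 6.50 minutes per km

# Correlation implied by the reported R-squared (positive, like the slope)
sqrt(0.5704)                # about 0.755
```

On the reported equation, each extra degree Celsius adds roughly 0.04 minutes (about 2.4 seconds) per km.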
The bottom line is: running in Brazil proved to be a lot harder than back home in Canada. I guess with all the Brazilian BBQ, the hills, the higher temperatures, and some nice local brews at the end of the work day I could not have expected anything different.