Visualizing UCI Hourrecord attempt

Last year the UCI (international cycling union) changed the rules for the World Hourrecord. This change sparked a lot of interest from great riders in the cycling community. The goal of the hourrecord is to ride the farthest distance in one hour. This is usually done on a Velodrome, a wooden track of (normally) 250m.
Since the rule change there have been several attempts at the records, including that of Jack Bobridge. After his attempt I got a hold of his laptimes and I thought I could make a nice visualization with it.

The dataset contains the lapnumber, lapname and laptime. Knowing the track length (250m) I could then add variables like average laptime and average speed. Knowing the current records (51852m) I could add other interesting variables like the average laptime for the record and required laptimes. This last one is particularity interesting, more on that later.

hourrecord <- 51852
lapdistance <- 250

allhourdata <- allhourdata %>% mutate(Actual.laptimes.cummulative = cumsum(Actual.laptimes)) %>%
        mutate(Average.actual.laptimes = Actual.laptimes.cummulative/Id) %>%
        mutate(Target.laptimes=(60*60/(hourrecord/lapdistance))) %>%
        mutate(Target.laptimes.cummulative = cumsum(Target.laptimes)) %>%
        mutate(Difference = Actual.laptimes - Target.laptimes) %>%
        mutate(Difference.cummulative = cumsum(Difference)) %>%
        mutate(Projected.distance = (Id*lapdistance)/Actual.laptimes.cummulative*3.6) %>%
        mutate(Required.laptimes.for.record =
                       ((60*60)-Actual.laptimes.cummulative)/((hourrecord-(lapdistance*Id))/lapdistance)) %>%
        mutate(Required.speed = 250/Required.laptimes.for.record*3.6) %>%
        mutate(laptime.to.slow = ifelse(Actual.laptimes < Required.laptimes.for.record, 0, Actual.laptimes))

I started of designing a static visualization but soon realized an animation would far better suit the dataset and the activity the rider performed. I needed to make a loop to create a plot for each lap. I used two variables so I could generate just a few plots when needed.

begin_ani <- 1
end_ani <- nrow(allhourdata)

for (i in begin_ani:end_ani){
    ## Code below comes here
}

The first part of the loop was to select the data to create one plot for that specific lap. I wanted to show the laptimes he had done up until that lap and show the status of the recordattempt in that lap. I wanted to show the following metrics:

  • Points for each laptime
  • A line representing the current record
  • A line representing the current average laptime/speed
  • A line representing the laptimes/speed needed for the remaining laps to beat the record

To do that I filtered the dataset to get the data up until that lap (for the points) and selected the variables for that lap and create new data frames to display the lines. The color coding I added to be able to show if he was on track to beat the record( green) or not (red).

var <- allhourdata[i,]
hourdata <- allhourdata %>% filter(Id <= i)

avg_color <- ifelse(var[,10]<hourrecord/1000, "#800000", "#006600")
hline_avg <- data.frame(Id=1:206, avg=as.numeric(var[,5]))
hline_req <- data.frame(Id=1:206, req=as.numeric(var[,11]))
hline_rec <- data.frame(Id=1:206, rec=as.numeric(var[,6]))

Now the interesting part, creating the plot. I’m a big fan of ggplot2 because of the countless possibilities to customize. I did borrow some of the layout from examples on the internet.

Step 1: Setting up the plot with basic themes, colors and margins.

gg <- ggplot()

gg <- gg + theme_bw() +
        theme(panel.background=element_rect(fill="#F0F0F0")) +
        theme(plot.background=element_rect(fill="#F0F0F0")) +
        theme(panel.border=element_rect(colour="#F0F0F0")) +
        theme(panel.grid.major=element_line(colour="#D0D0D0",size=.5)) 

gg <- gg + ggtitle(paste("Hourrecord attempt Jack Bobridge: ", as.character(var[,2]))) +
        theme(plot.title=element_text(face="bold",hjust=-.08,vjust=2,colour="#3C3C3C",size=20)) +
        theme(legend.position="none") +
        ylab("Lap time") + xlab("Lap") +
        theme(axis.text.x=element_text(size=11,colour="#535353",face="bold")) +
        theme(axis.text.y=element_text(size=11,colour="#535353",face="bold")) +
        theme(axis.title.y=element_text(size=11,colour="#535353",face="bold",vjust=1.5)) +
        theme(axis.title.x=element_text(size=11,colour="#535353",face="bold",vjust=-.5))

gg <- gg + theme(plot.margin = unit(c(1.5, 1.5, 1, 1), "cm"))

Step 2: Adding the laptimes. I also added an annotation with the laptime.

## Laptimes
gg <- gg + geom_point(data=hourdata, aes(x=Id, y=Actual.laptimes), colour="#000099", size=3) +
        annotate("text",x=as.numeric(var[,1])+2,y=as.numeric(var[,3]),label=round(as.numeric(var[,3]),digits=3),colour="#000099", hjust = 0)

Step 3: Adding the lines. Each lines is coded separately.

## Required Laptimes
gg <- gg + geom_line(data=hline_req, aes(x=Id, y=req), colour=avg_color, size=1, alpha=1/2) +
        annotate("text",x=206+2,y=as.numeric(var[,11]),label=round(var[,12],digits=3),colour=avg_color, hjust = 0, alpha=1/2) +
        annotate("text",x=0,y=as.numeric(var[,11]),label=round(var[,11],digits=3),colour=avg_color, hjust = 1, alpha=1/2) +
        annotate("text",x=206,y=as.numeric(var[,11])+0.075,label="Required for record",colour=avg_color, hjust = 1, alpha=1/2)

## Current Record line
gg <- gg + geom_line(data=hline_rec, aes(x=Id, y=rec),colour="#909090", size=1, linetype=2) +
        annotate("text",x=206+2,y=as.numeric(var[,6]),label="51.852",colour="#909090", hjust = 0) +
        annotate("text",x=0,y=as.numeric(var[,6]),label=round(var[,6],digits=3),colour="#909090", hjust = 1)

## Average speed/laptime line
gg <- gg + geom_line(data=hline_avg, aes(x=Id, y=avg),colour=avg_color, size=1, alpha=1/2) +
        annotate("text",x=206+2,y=as.numeric(var[,5]),label=round(var[,10],digits=3),colour=avg_color, hjust = 0, alpha=1/2) +
        annotate("text",x=0,y=as.numeric(var[,5]),label=round(var[,5],digits=3),colour=avg_color, hjust = 1, alpha=1/2) +
        annotate("text",x=1,y=as.numeric(var[,5])+0.075,label="Average",colour=avg_color, hjust = 0, alpha=1/2)

Step 4: Changing the scale of the plot. In other words, zooming in on the range where most of the laptimes are. I’ve zoomed in to -5 laps to create some room for the laptimes.

gg <- gg + scale_x_continuous(minor_breaks=0,breaks=seq(0,200,50),limits=c(-5,216)) +
        scale_y_continuous(minor_breaks=0,breaks=seq(16, 18.5, 0.5)) + #,limits=c(16, 18.5)) +
        coord_cartesian(ylim=c(15.9, 18.6)) + ## zoom only, no cutoff data
        theme(axis.ticks=element_blank()) 

Step 5: Show and save the laptimes.

print(gg)
ggsave(file=paste("pr",i,".jpg", sep=""))

Step 6: Here you could add code to create the animated GIF. I tried it but ran out of memory on my PC, so I did it by hand.

files = sprintf('pr%d.png', begin_ani:end_ani)
im.convert(files, output = 'hourrecord.gif')

The result is the animated gif below. I’m pleased with the result (I ‘m sure the code can be better).
The general opinion after the record attempt was that it was a close call. But in the animation you can see that it all lost at 3/4 of the hour. The important variable in this is the required laptime, that is what he needs to ride to make it to the record. He does get to that laptime from around 100 laps. He did a great effort and was probably seriously hurting and that is what people saw and probably what made them think he was still able to make it until a few minutes before the end. But the data shows that was not the case.

Update 9-2-2015: Yesterday another attempt was made by Rohan Dennis. He made it and set a new World Record wth a distance of 52.491 meters.

You can find the dataset and code on GitHub.

Multivariate regression in R

I found a very nice article on the internet that explained how to develop a multivariate regression model. In the article a dataset for several hundred 2005 used General Motors (GM) cars is supplied which is the basis for the step by step explanation. I think this is a good way to learn about regression models. I thought it was a good idea to try to reproduce the result, intercepts, slopes and plots in the article using R.

This (my) article will just show how to reproduce the results in the for-mentioned article using R. I have also added two other ways to get the best fit using the power of R.

The article uses 4 equations to develop the model into the best fit.

  • Equation 1: Based on Mileage alone
  • Equation 2: Based on Mileage, Cylinder, Doors, Cruise, Sound, Leather
  • Equation 3: Based on Mileage, Buick, Cadillac, Chev, Pontiac, SAAB predicting the log of price
  • Equation 4: Based on Mileage, Liter, Doors, Cruise, Sound, Leather, Buick, Cadillac, Chev, Pontiac, SAAB, Conv, Hatchback, Sedan predicting the log of price

Before starting modelling the regressions with R first we need to read the data and setup the required libraries.

library("tidyr")
library("dplyr")

kuiper <- read.csv("kuiper.csv", sep=";", dec=",")

Equation 1

The first equation is simple and just to explore:

model <- lm(Price ~ Mileage, data=kuiper)

Result:

## (Intercept) 24764.5590
## Mileage        -0.1725
## ---
## Multiple R-squared:  0.0205, Adjusted R-squared:  0.0192 

To create the plot of the model use this code:

plot(model$residuals, col="blue", type="l")
points(model$residuals, col="red", type="p", pch=20)
abline(h=0)

unnamed-chunk-4-1

Equation 2
Still a quite simple model, just with more predictors.

model_acc <- lm(Price ~ Mileage + Cylinder +
                Doors + Cruise + Sound + Leather, data=kuiper)
## (Intercept)  7323.1643
## Mileage        -0.1705
## Cylinder     3200.1246
## Doors       -1463.3991
## Cruise       6205.5113
## Sound       -2024.4007
## Leather      3327.1433
## ---
## Multiple R-squared:  0.446,  Adjusted R-squared:  0.442 

If we plot the model we get 4 plots which explain different aspects of the model.

par(mfrow=c(2,2))
plot(model_acc)
par(mfrow=c(1,1))

unnamed-chunk-8-1

Fot the histogram and distribution of the residuals we need different plots.

hist(model_acc$residuals, breaks=30)

plot(model_acc$residuals, col="blue", type="l")
points(model_acc$residuals, col="red", type="p", pch=20)
abline(h=0)

unnamed-chunk-9-1

unnamed-chunk-10-1

Equation 3

For this equation we first need to alter the dataset to add the factor variable Make as boolean variables and take the log of price.

kuiper$id <- 1:nrow(kuiper)
make <- as.data.frame(table(kuiper$id, kuiper$Make))
make <- make %>% spread(Var2, Freq) %>% rename(id = Var1)
kuiper <- merge(kuiper, make, by="id")
kuiper <- select(kuiper, -Make)

kuiper$tPrice <- log10(kuiper$Price)
model_make <- lm(tPrice ~ Mileage + Liter +
             Buick + Cadillac + Chevrolet +
             Pontiac + SAAB, data=kuiper)
## (Intercept)  3.979911512
## Mileage     -0.000003476
## Liter        0.099724553
## Buick        0.039968616
## Cadillac     0.249302587
## Chevrolet   -0.009372280
## Pontiac      0.013613268
## SAAB         0.345305253
## ---
## Multiple R-squared:  0.917,  Adjusted R-squared:  0.916 

Same plot as before.

par(mfrow=c(2,2))
plot(model_make)
par(mfrow=c(1,1))

unnamed-chunk-13-1

Histogram and residuals.

hist(model_make$residuals, breaks=30)
plot(model_make$residuals, col="blue", type="l")
points(model_make$residuals, col="red", type="p", pch=20)
abline(h=0)

unnamed-chunk-14-1

unnamed-chunk-15-1

Equation 4
For equation 4 the dataset needs to be altered for the Type variable.

type <- as.data.frame(table(kuiper$id, kuiper$Type))
type <- type %>% spread(Var2, Freq) %>% rename(id = Var1)
kuiper <- merge(kuiper, type, by="id")
kuiper <- select(kuiper, -Type)
model_all <- lm(tPrice ~ Mileage + Liter + Doors +
                Cruise + Sound + Leather + Buick +
                Cadillac + Chevrolet + Pontiac + SAAB +
                Convertible + Hatchback + Sedan, data=kuiper)
## (Intercept)  3.91810880
## Mileage     -0.00000358
## Liter        0.09576214
## Doors        0.03352658
## Cruise       0.00751699
## Sound        0.00522264
## Leather      0.00625998
## Buick        0.04165259
## Cadillac     0.23303437
## Chevrolet   -0.01331520
## Pontiac     -0.00042065
## SAAB         0.28109769
## Convertible  0.13781907
## Hatchback   -0.08898857
## Sedan       -0.07114898
## ---
## Multiple R-squared:  0.952,  Adjusted R-squared:  0.951 

Again the same plots.

par(mfrow=c(2,2))
plot(model_all)
par(mfrow=c(1,1))

hist(model_all$residuals, breaks=30)
plot(model_all$residuals, col="blue", type="l")
points(model_all$residuals, col="red", type="p", pch=20)
abline(h=0)

unnamed-chunk-18-1

unnamed-chunk-19-1

unnamed-chunk-20-1

Let R find the best model

R has the Step function. You can supply two models and R steps through the model to get to the best fit. I’ve removed the variables I don’t want to use for predicting.

kuiper <- kuiper %>% select(-Price, -id, -Model, -Trim)

model <- lm(tPrice ~ Mileage, data=kuiper)
model_all <- lm(tPrice ~ ., data=kuiper)
bestfit <- step(model, scope=list(lower=model, upper=model_all), direction="forward")

Result:

## (Intercept)  4.006511863
## Mileage     -0.000003579
## Liter        0.109135503
## SAAB         0.276731601
## Cadillac     0.241903489
## Convertible  0.142481573
## Wagon        0.069803273
## Buick        0.041279545
## Hatchback   -0.013685317
## Chevrolet   -0.014330497
## Cylinder    -0.012359839
## Cruise       0.008355646
## Leather      0.005336807
## Sound        0.004812547
## ---
## Multiple R-squared:  0.952,  Adjusted R-squared:  0.952 

This model is slightly different and produces almost the same R-squared.

Machine learning

Using the Caret package we can use machine learning to predict the best model on the dataset.

library("caret")

ml_model <- train(tPrice ~ ., data=kuiper, method="lm")
## (Intercept)  3.93932861
## Mileage     -0.00000358
## Cylinder    -0.01224029
## Liter        0.10880574
## Doors        0.03377855
## Cruise       0.00808055
## Sound        0.00462693
## Leather      0.00517194
## Buick        0.04375913
## Cadillac     0.24448881
## Chevrolet   -0.01264695
## Pontiac      0.00261046
## SAAB         0.27901919
## Saturn               NA
## Convertible  0.14078278
## Coupe                NA
## Hatchback   -0.08269889
## Sedan       -0.06948186
## Wagon                NA
## ---
## Multiple R-squared:  0.952,  Adjusted R-squared:  0.951 

Remarks

I would like to add that normally you would not use the complete dataset to fit a model for prediction. You would want to split the dataset into a training and testset to be able to cross validate your model to prevent over fitting of the model.

Predictive Analytics with R

I did a project for a data scientist training on Coursera. Using machine learning techniques in R our goal was to predict which exercise subjects did using sensor data.

Using devices such as Jawbone UpNike FuelBand, and Fitbit it is now possible to collect a large amount of data about personal activity relatively inexpensively. These type of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, your goal will be to use data from accelerometers on the belt, forearm, arm, and dumbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways:

  • exactly according to the specification (Class A)
  • throwing the elbows to the front (Class B)
  • lifting the dumbbell only halfway (Class C)
  • lowering the dumbbell only halfway (Class D)
  • throwing the hips to the front (Class E)

After creating a model the goal was to predict how 20 subjects performed the barbell lift. If you use the predict function R gives you (in this case) 20 answers (A to E). (bestfit is the model, pml.submission is the dataset with 20 subjects)

final_predictions <- predict(bestfit, pml.submission)
final_predictions
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E

But what R does is give you the answer with the highest probability. Because it calculates a probability for each class for every subject. If you add type=”prob” to your predict call you get the probability for each class. I’ve used ggplot2 to visualize the probability distribution for the classes. It clearly shows that the model prediction varies over the subjects.

predprob <- predict(bestfit, pml.submission, type = "prob")
predprob$testcase <- 1:nrow(predprob)
predprob <- gather(predprob, "class", "prob", 1:5)
ggplot(predprob, aes(testcase, class)) +
        geom_tile(aes(fill = prob), colour = "white") +
        geom_text(aes(fill = prob, label = round(prob, 2)), size=3, colour="grey25") +
        scale_fill_gradient(low = "white", high = "red") +
        scale_x_discrete(expand = c(0, 0)) +
        scale_y_discrete(expand = c(0, 0))


It clearly shows that the model prediction varies over the subjects. For a full report of my project see GitHub.