Predictive Analytics with R

I did a project for a data science course on Coursera. Using machine learning techniques in R, our goal was to predict from sensor data how subjects performed an exercise.

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement, a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal was to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways:

  • exactly according to the specification (Class A)
  • throwing the elbows to the front (Class B)
  • lifting the dumbbell only halfway (Class C)
  • lowering the dumbbell only halfway (Class D)
  • throwing the hips to the front (Class E)

After creating a model, the goal was to predict how the barbell lift was performed in 20 test cases. If you use the predict function, R gives you (in this case) 20 answers (A to E). Here bestfit is the model and pml.submission is the dataset with the 20 test cases.

final_predictions <- predict(bestfit, pml.submission)
final_predictions
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
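
The same fit-then-predict pattern can be tried on a small example. This is only a minimal sketch, not the project's code: it assumes a classification tree from the rpart package and the built-in iris data as stand-ins for bestfit and the sensor data, and the names holdout_rows, training, holdout, fit, and holdout_predictions are made up for illustration. With rpart, class labels are requested with type = "class".

```r
# Hypothetical stand-in for the project's model and data:
# a classification tree (rpart) on the built-in iris data set.
library(rpart)

set.seed(42)
holdout_rows <- sample(nrow(iris), 20)   # 20 rows play the role of the test cases
training     <- iris[-holdout_rows, ]
holdout      <- iris[holdout_rows, ]

fit <- rpart(Species ~ ., data = training, method = "class")

# type = "class" returns one predicted label per row, as a factor
holdout_predictions <- predict(fit, holdout, type = "class")
holdout_predictions
```

As in the output above, the result is a factor with one label per row of the new data.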

What R returns here is only the answer with the highest probability, but the model actually calculates a probability for each class for every test case. If you add type = "prob" to your predict call, you get the probability for each class. I've used ggplot2 to visualize the probability distribution over the classes.

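library(tidyr)    # gather()
library(ggplot2)

# Class probabilities: one row per test case, one column per class (A to E)
predprob <- predict(bestfit, pml.submission, type = "prob")
# Number the test cases; a factor keeps the x axis discrete
predprob$testcase <- factor(seq_len(nrow(predprob)))
# Reshape to long format: one row per (test case, class) pair
predprob <- gather(predprob, "class", "prob", 1:5)
ggplot(predprob, aes(testcase, class)) +
        geom_tile(aes(fill = prob), colour = "white") +
        geom_text(aes(label = round(prob, 2)), size = 3, colour = "grey25") +
        scale_fill_gradient(low = "white", high = "red") +
        scale_x_discrete(expand = c(0, 0)) +
        scale_y_discrete(expand = c(0, 0))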


The plot shows that the model's confidence in its prediction varies from test case to test case. For a full report of my project, see GitHub.
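
The link between the predicted labels and the class probabilities can also be checked numerically. The following is a minimal sketch under stated assumptions, not the code from the project: it uses a classification tree from the rpart package on the built-in iris data as a stand-in for bestfit, and the names prob, top_class, labels, and confidence are made up for illustration.

```r
# Hypothetical stand-in model: a classification tree on the iris data set
library(rpart)

set.seed(42)
holdout_rows <- sample(nrow(iris), 20)
fit     <- rpart(Species ~ ., data = iris[-holdout_rows, ], method = "class")
holdout <- iris[holdout_rows, ]

# Matrix of class probabilities: one row per case, one column per class
prob <- predict(fit, holdout, type = "prob")

# Picking the column with the highest probability in each row ...
top_class <- colnames(prob)[max.col(prob, ties.method = "first")]

# ... can be compared with the labels returned by type = "class"
labels <- as.character(predict(fit, holdout, type = "class"))
all(top_class == labels)

# The highest probability per row is a rough measure of how confident
# the model is; the lowest values point to the least certain cases
confidence <- apply(prob, 1, max)
head(sort(confidence), 3)
```

Sorting by the highest probability per row is one simple way to list the cases that would show up as the palest tiles in a heatmap like the one above.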