In part 4 of this series on benchmarking R, we’ll explore loops and a common alternative, vectorizing. This is probably the “biggest” issue leading people to think that R is a slow language. Essentially, other procedural languages rely on explicit loops; programmers moving from those languages to R start with the same constructs and find that R is slow. We will discuss a range of ways to make loops faster and how vectorizing can help.
There are many other resources about this topic; we will try to be concise and show the worst case, the best case, and many little steps in between.
Loops
Loops have been the Achilles heel of R in the past. From version 3.1 onward, much of this problem appears to be gone. As could be seen in the previous post in this series (https://predictiveecology.org/2015/05/06/Is-R-fast-enough-03.html), pre-allocating a vector and filling it up inside a loop can now be very fast and efficient in native R. To demonstrate these points, below are several ways to achieve the same result in R, beginning with a naive loop approach and working up to the fully vectorized approach. I am using a very fast vectorized function, runif, to emphasize the differences between using loops and optimized vectorized functions.
The basic code below generates random numbers. The sequence goes from a fully unvectorized, looped structure, with no pre-allocation of the output vector, through to pure vectorized code. The intermediate steps are:
Loop
Loop with pre-allocated length of output
sapply (like loops)
sapply with pipe operator
vectorized
vectorized with no intermediate objects
C++ vectorized (a sketch of this variant appears after the benchmark results below)
library(magrittr)  # for the pipe %>%
N = 1e5
mb = microbenchmark::microbenchmark(
  times = 100L,
  # no pre-allocating of vector length, generating uniform random numbers once,
  # then calling them within each loop
  loopWithNoPreallocate = {
    set.seed(104)
    a <- numeric()
    unifs = runif(N)
    for (i in 1:N) {
      a[i] = unifs[i]
    }
    a
  },
  # pre-allocating vector length, generating uniform random numbers once,
  # then calling them within each loop
  loopWithPreallocate = {
    set.seed(104)
    unifs <- runif(N)
    b <- numeric(N)
    for (i in 1:N) {
      b[i] = unifs[i]
    }
    b
  },
  # sapply - generally faster than loops
  sapplyVector1 = {
    set.seed(104)
    b <- runif(N)
    sapply(b, function(x) x)
  },
  # sapply with pipe operator: no intermediate objects are created
  sapplyWithPipe = {
    set.seed(104)
    b <- (runif(N)) %>%
      sapply(., function(x) x)
  },
  # vectorized with intermediate object before return
  vectorizedWithCopy = {
    set.seed(104)
    unifs <- runif(N)
    unifs
  },
  # no intermediate object before return
  vectorizedWithNoCopy = {
    set.seed(104)
    runif(N)
  }
)
summary(mb)[c(1, 2, 5, 7)]
# Test that all results return the same vector
all.equalV(loopWithNoPreallocate, loopWithPreallocate, sapplyVector1,
           sapplyWithPipe, vectorizedWithCopy, vectorizedWithNoCopy)
[1] TRUE
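all.equalV is a small helper used earlier in this series to check that all of its arguments are equal. If you are following along in a fresh session, a minimal stand-in could look like the sketch below; this is an assumption on our part, not necessarily the original definition.

# Hypothetical stand-in for the all.equalV helper used in this series:
# compares every argument against the first using all.equal
all.equalV <- function(...) {
  vals <- list(...)
  all(vapply(vals[-1], function(x) isTRUE(all.equal(vals[[1]], x)), logical(1)))
}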
sumLoops <- round(summary(mb)[[5]], 0)
The fully vectorized function is 15x faster than the fully naive loop. Making as few intermediate objects as possible can also help, although comparing vectorizedWithCopy and vectorizedWithNoCopy (where the only difference is one extra copy of the object) shows virtually no change here. This, I believe, is due to improvements in R since version 3.1 that reduce copying for vectors and matrices. Using pipes instead of intermediate objects also made little difference to speed in this test. These are simple tests; for larger or more complex objects, it is likely that using pipes will be faster.
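The list above also mentions a C++ vectorized variant, which is not included in the benchmark code here. As a rough sketch of what such a variant could look like: the function name runifCpp is our own, and because Rcpp sugar draws from the same random number generator as R, set.seed still applies and the results match the R versions.

library(Rcpp)
cppFunction('
  NumericVector runifCpp(int n) {
    // Rcpp sugar runif draws from the same RNG stream as R, so results match
    return runif(n);
  }
')
set.seed(104)
cppResult <- runifCpp(1e5)
set.seed(104)
all.equal(cppResult, runif(1e5))   # TRUE

Such a variant could be added as another expression in the microbenchmark call above to compare it directly with the pure R approaches.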
Conclusions
Write vectorized code in R where possible. If not possible, pre-allocate prior to writing loops.
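As a minimal illustration of that guideline, using a trivial squaring task rather than the benchmark above:

N <- 1e5
out <- numeric(N)        # pre-allocate the full length once
for (i in seq_len(N)) {
  out[i] <- i^2          # fill in place; the vector never grows inside the loop
}
out2 <- seq_len(N)^2     # the vectorized equivalent, preferred where possible
all.equal(out, out2)     # TRUE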
Next time
We move on to higher-level operations, specifically some GIS operations.