In part 2 of this series on benchmarking R, we'll explore sorting. This has been the topic of numerous blog posts and discussions around the internet, including this r-bloggers post. Similarly, julialang.org showed that sorting was particularly slow in R. We, again, felt that this was a case of poor R coding, or, more accurately, of missing the point of whether R was capable or not. Another example here compares R sorting with the C++ standard library sort, called from R.
In all cases, we felt that one of the points of using R is that there are concise ways of doing things, because the open source community has brought them to R. So let's take advantage of that! We will cover both real number sorting and integer sorting.
Sorting
This was in part inspired by a blog post by Wingfeet at https://www.r-bloggers.com/quicksort-speed-just-in-time-compiling-and-vectorizing/ which drew on the benchmark tests at https://julialang.org/ Essentially, that Julia test was a benchmark to demonstrate the speed of Julia. It showed that, for quicksort, R is 524x slower than C. That procedural version (qsort) is reproduced in the Functions used section below. But there was no explicit comparison of how the base R sort would match up with C, nor of how any of the more recent packages with sorting capability fare against these procedural versions written in low level languages.
Real number sorting
# Packages used throughout (see the session info at the end of the post)
library(rbenchmark)
library(data.table)
library(dplyr)
library(Rcpp)

x = runif(1e5)
xtbl <- tbl_df(data.frame(x = x))
(mbReal <- benchmark(
    a <- qsort(x),                               # procedural quicksort in plain R
    d <- sort(x),                                # base R default sort
    e <- sort(x, method = "quick"),              # base R quicksort
    f <- .Internal(sort(x, decreasing = FALSE)), # internal sort, skipping dispatch
    g <- data.table(x = x, key = "x"),           # data.table keyed (sorted) table
    h <- arrange(xtbl, x),                       # dplyr arrange
    i <- stl_sort(x),                            # C++ std::sort via Rcpp
    replications = 25L, columns = c("test", "elapsed", "relative"),
    order = "relative"))
## test elapsed relative
## 7 i <- stl_sort(x) 0.19 1.000
## 5 g <- data.table(x = x, key = "x") 0.21 1.105
## 3 e <- sort(x, method = "quick") 0.22 1.158
## 4 f <- .Internal(sort(x, decreasing = FALSE)) 0.26 1.368
## 2 d <- sort(x) 0.28 1.474
## 6 h <- arrange(xtbl, x) 1.48 7.789
## 1 a <- qsort(x) 86.58 455.684
all.equalV(a, d, e, f, g$x, h$x, i)
## [1] TRUE
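For reference, the relative column reported by rbenchmark is simply each test's elapsed time divided by the smallest elapsed time in the table, so the values above can be verified by hand:
0.21 / 0.19    # ~1.105, the relative value reported for the data.table sort
86.58 / 0.19   # ~455.7, the relative value reported for qsort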
Integer sorting
x = sample(1e6, size = 1e5)
xtbl <- tbl_df(data.frame(x = x))
(mbInteger <- benchmark(
    a <- qsort(x),
    d <- sort(x),
    e <- sort(x, method = "quick"),
    f <- .Internal(sort(x, decreasing = FALSE)),
    g <- data.table(x = x, key = "x"),
    h <- arrange(xtbl, x),
    i <- stl_sort(x),
    replications = 25L, columns = c("test", "elapsed", "relative"),
    order = "relative"))
## test elapsed relative
## 5 g <- data.table(x = x, key = "x") 0.13 1.000
## 3 e <- sort(x, method = "quick") 0.17 1.308
## 7 i <- stl_sort(x) 0.19 1.462
## 4 f <- .Internal(sort(x, decreasing = FALSE)) 0.23 1.769
## 2 d <- sort(x) 0.25 1.923
## 6 h <- arrange(xtbl, x) 0.67 5.154
## 1 a <- qsort(x) 89.28 686.769
all.equalV(a, d, e, f, g$x, h$x, i)
## [1] TRUE
Both real numbers and integers can be sorted quickly with R. The slowest function is indeed the procedural qsort written in native R without any optimization; this is also the qsort that the Julia testers used. Our numbers match almost exactly those in the table at julialang.org; however, here we also test the real world R usage that a normal R user would face (i.e., we can all just call sort()). We show that R competes quite favourably and regularly outperforms the C++ standard library sort (and Julia, though that is not tested here explicitly).
Take home points:
- The basic R sorting functions are fast. The sort(method="quick") call is about as fast as the standard C++ library sort (about 11% faster on the integer benchmark).
- Using data.table on integers is about 32% faster than the C++ standard library sort (see the sketch after this list).
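As a minimal sketch of the data.table approach timed above (assuming only the data.table package): the benchmark builds a table that is keyed, and therefore sorted, on x; the same result can be obtained on an existing data.table with setkey().
library(data.table)
x <- runif(1e5)

dt1 <- data.table(x = x, key = "x")  # keyed (sorted) on construction, as timed above
dt2 <- data.table(x = x)
setkey(dt2, x)                       # sort an existing data.table in place by x

identical(dt1$x, dt2$x)              # TRUE: both columns hold x in sorted order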
In real world situations, where we want to use the easiest, shortest code to produce fast, accurate results, R certainly holds its own compared to the C++ standard library. But of course, there are many ways to do things in R, and some are much faster than others.
Conclusion
Using sort(method="quick") and data.table sorting, we were able to sort a vector of real numbers 412x faster than naive procedural code (qsort), and a vector of integers 687x faster. This puts the data.table sort as fast as, or substantially faster than, the C, Fortran and Julia versions of quicksort (based on the timings at julialang.org).
YES! R is more than fast enough.
Next time (really! I promised it last time…)
We will redo the Fibonacci series, a common low level benchmarking test that shows R to be slow. But it turns out that this, too, is a case of bad coding…
Functions used
The C++ function (compiled via Rcpp) that was used is:
cppFunction('NumericVector stl_sort(NumericVector x) {
NumericVector y = clone(x);
std::sort(y.begin(), y.end());
return y;
}')
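Note that stl_sort takes a NumericVector, so in the integer benchmark the integer vector is coerced to doubles when it is passed in. If one wanted to sort integers without that coercion, a minimal sketch (not part of the original benchmark; stl_sort_int is just an illustrative name) would be:
cppFunction('IntegerVector stl_sort_int(IntegerVector x) {
    IntegerVector y = clone(x);     // copy so the input is left untouched
    std::sort(y.begin(), y.end());  // C++ standard library sort on ints
    return y;
}')
The plain R helper functions were as follows.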
# Procedural quicksort written in plain R (the version used by the Julia testers)
qsort = function(a) {
    qsort_kernel = function(lo, hi) {
        i = lo
        j = hi
        while (i < hi) {
            pivot = a[floor((lo+hi)/2)]
            while (i <= j) {
                while (a[i] < pivot) i = i + 1
                while (a[j] > pivot) j = j - 1
                if (i <= j) {
                    t = a[i]
                    a[i] <<- a[j]
                    a[j] <<- t
                    i = i + 1;
                    j = j - 1;
                }
            }
            if (lo < j) qsort_kernel(lo, j)
            lo = i
            j = hi
        }
    }
    qsort_kernel(1, length(a))
    return(a)
}
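For example, qsort returns a sorted copy of its argument rather than sorting the caller's vector in place:
qsort(c(3.2, 1.5, 2.8))        # 1.5 2.8 3.2
y <- runif(10)
identical(qsort(y), sort(y))   # TRUE
The all.equal-based helper used to check that all methods return the same values is: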
# TRUE if every later argument is all.equal to the first one
all.equalV = function(...) {
    vals <- list(...)
    all(sapply(vals[-1], function(x) all.equal(vals[[1]], x)))
}
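Because it uses all.equal rather than identical, tiny floating point differences are tolerated, e.g.:
a <- sort(runif(10))
all.equalV(a, a + 1e-12, a)   # TRUE: differences are within all.equal's tolerance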
System used:
Tests were done on an HP Z400, Xeon 3.33 GHz processor, running Windows 7 Enterprise, using:
## R version 3.2.0 (2015-04-16)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 7 x64 (build 7601) Service Pack 1
##
## locale:
## [1] LC_COLLATE=English_Canada.1252 LC_CTYPE=English_Canada.1252
## [3] LC_MONETARY=English_Canada.1252 LC_NUMERIC=C
## [5] LC_TIME=English_Canada.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] data.table_1.9.4 Rcpp_0.11.5 dplyr_0.4.1 rbenchmark_1.0.0
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.8 assertthat_0.1 chron_2.3-45 plyr_1.8.1
## [5] DBI_0.3.1 formatR_1.0 magrittr_1.5 evaluate_0.5.5
## [9] lazyeval_0.1.10 reshape2_1.4.1 rmarkdown_0.5.1 tools_3.2.0
## [13] stringr_0.6.2 yaml_2.1.13 parallel_3.2.0 htmltools_0.2.6
## [17] knitr_1.9