The objective of a reproducible workflow is that the entire workflow, from raw data to publication, decision support, report writing, presentation building, etc., can be rebuilt on demand, on any computer and operating system, from any starting conditions.
As part of a reproducible workflow, caching of function calls, code chunks, and other elements of a project can be very valuable. Caching allows a code writer to run all code regularly without “secretly” skipping certain lines because they take too long to run. This has two benefits:
Each line gets run regularly so when failures crop up, they are detected quickly and can be fixed when they are introduced.
Running code “somewhere else” (different machine, person, operating system) will be more likely to work on an ongoing basis, i.e., it will help maintain a “reproducible” state.
These benefits allow code to be constantly in a state of functioning, from start to finish, which lowers the effort “at the end” to make all the work reproducible.
The reproducible::Cache() function is built to work with many R functions, including some that are used for their side effects or that use pointers (e.g., a terra SpatRaster) instead of regular R objects.
10.1 How to use Cache
The Cache function can be used with any function. A user can wrap it around another function call, use the base pipe operator |>, or specify the function and its arguments as arguments to Cache. The three forms are equivalent: the second and third return the same output as the first.
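The three call forms described above can be sketched as follows, using the calls from this chapter's code listing. The temporary `cachePath` and the object names `r1`, `r2`, `r3` are assumptions added here to keep the illustration isolated from any existing cache. Values are compared after stripping attributes, because Cache may attach metadata to the objects it returns.

```r
library(reproducible)

# Assumption: an isolated, temporary cache folder for this illustration
cachePath <- file.path(tempdir(), "cache_call_forms")

# 1. Wrap Cache around the function call
r1 <- Cache(rnorm(1), cachePath = cachePath)

# 2. Pipe the function call into Cache
r2 <- rnorm(1) |> Cache(cachePath = cachePath)

# 3. Pass the function and its arguments separately
r3 <- Cache(rnorm, n = 1, cachePath = cachePath)

# Compare the values only (as.numeric drops any attributes)
sameValue <- c(as.numeric(r1) == as.numeric(r2),
               as.numeric(r1) == as.numeric(r3))
```

Because `rnorm` would normally return a different number on every call, equal values here indicate that the second and third calls were recovered from the cache rather than re-run.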
The most obvious case to use Cache is when the calculation is expensive. For example, GIS operations are commonly time consuming. In the example below, we will use terra::project three times, with timings.
10.2.1 From disk
Code
# Data setup
library(terra)
tmpDir <- file.path(tempdir(), "reproducible_examples", "Cache")
dir.create(tmpDir, recursive = TRUE)
x <- y <- 2001
ras <- terra::rast(terra::ext(0, x, 0, y), vals = sample(1:(x*y)), res = 1)
terra::crs(ras) <- "+proj=lcc +lat_1=48 +lat_2=33 +lon_0=-100 +datum=WGS84"
newCRS <- "+init=epsg:4326" # A longlat crs

# Call slow operation: project
# No Cache
system.time(map1 <- terra::project(ras, newCRS)) # Warnings due to new PROJ
user system elapsed
0.11 0.00 2.12
Code
# With Cache -- a little slower the first time because saving to disk
system.time(map2 <- terra::project(ras, newCRS) |> Cache())
user system elapsed
1.69 0.07 9.31
Code
# faster the second time; improvement depends on size of object and time to run function
system.time(map3 <- terra::project(ras, newCRS) |> Cache())
user system elapsed
0.19 0.00 1.06
In this example, the output of terra::project is cached, keyed on the function and its arguments (ras and newCRS), and map3 is recovered from that cache entry. If either the function or the supplied arguments change, Cache repeats the operation and stores the output in a new cache entry.
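A minimal sketch of this behaviour, using a small stand-in function instead of a GIS operation. The names `slowSquare` and `nRuns` and the temporary `cachePath` are hypothetical and exist only for this illustration. The counter records how many times the function body actually runs.

```r
library(reproducible)

# Assumption: an isolated, temporary cache folder for this illustration
cachePath <- file.path(tempdir(), "cache_args_demo")

# Hypothetical stand-in for an expensive function; nRuns counts real executions
nRuns <- 0
slowSquare <- function(x) {
  nRuns <<- nRuns + 1
  x^2
}

first  <- Cache(slowSquare(4), cachePath = cachePath) # runs the function
second <- Cache(slowSquare(4), cachePath = cachePath) # same function and argument: recovered
third  <- Cache(slowSquare(5), cachePath = cachePath) # argument changed: runs again

nRuns # number of times the function body was actually evaluated
```

In a fresh cache folder, the second call should not increment the counter, while the third call, with a different argument, should.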
10.2.2 From disk and memory
We can set an option so that cached objects are saved to disk as normal and are also kept as objects in RAM (“memoising”).
For a persistent cache, we should change to a permanent folder path. This can be done in two ways:
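The option in question appears in this chapter's code listing as `reproducible.useMemoise`. The following sketch follows that listing but swaps the raster projection for a small `rnorm` call. The temporary `cachePath` and the names `m1` and `m2` are assumptions made for this illustration.

```r
library(reproducible)

# Assumption: an isolated, temporary cache folder for this illustration
cachePath <- file.path(tempdir(), "cache_memoise_demo")

# Turn memoising on; options() returns the previous value so it can be restored
optOrig <- options("reproducible.useMemoise" = TRUE)

m1 <- Cache(rnorm(5), cachePath = cachePath) # computed and cached
m2 <- Cache(rnorm(5), cachePath = cachePath) # recovered from the cache

# Restore the previous setting
options(optOrig)

# Compare values only (as.numeric drops any attributes)
sameValues <- all(as.numeric(m1) == as.numeric(m2))
```

How much faster the second call is will depend on the size of the object, as with the disk-only example.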
Code
Cache(rnorm(1), cachePath = "~/SpaDES_book/cache")
options("reproducible.cachePath") ## still the temporary directory
options("reproducible.cachePath" = "~/SpaDES_book/cache")
Cache(rnorm(1))
The second, using options(), sets the cachePath for every subsequent Cache call.
In a SpaDES workflow context, the cache directory can be set by passing list(..., cachePath = <a_path>) to the paths argument in setupProject, simInit or simInitAndSpaDES.
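Because an option set this way stays in effect for the rest of the session, it can be useful to limit it to one piece of code. The helper below is a sketch in base R only. `withCachePath` is a hypothetical name and is not part of reproducible; only the option name `reproducible.cachePath` comes from the text above.

```r
# Hypothetical helper (not part of reproducible): evaluate `expr` while the
# "reproducible.cachePath" option points at `path`, then restore the old value.
withCachePath <- function(path, expr) {
  old <- options(reproducible.cachePath = path) # returns the previous setting
  on.exit(options(old), add = TRUE)             # restore on exit, even on error
  force(expr)                                   # expr is evaluated lazily, i.e. here
}

before <- getOption("reproducible.cachePath")
inside <- withCachePath(
  file.path(tempdir(), "project_cache"),
  getOption("reproducible.cachePath")
)
after <- getOption("reproducible.cachePath")
```

Any Cache call placed in `expr` would see the temporary path through the option, while code outside the helper keeps whatever path was set before.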
10.4 Caching examples
10.4.1 Basic cache use with tags
We can add tags to identify the Cache call.
Code
ranNumsA <- Cache(rnorm, 4, userTags = c("random number generator"))
showCache(userTags = "random number generator")
We can use either clearCache, keepCache, or cc to remove things from the Cache database. clearCache removes everything that matches the query. keepCache keeps everything that matches the query. cc removes the most recent entry (i.e., it is a shorthand for a commonly used option).
Code
# Two different functions
a <- rnorm(1) |> Cache()
b <- runif(1) |> Cache()

# Clear only the first one
clearCache(userTags = "rnorm", ask = FALSE)
a2 <- rnorm(1) |> Cache()
b2 <- runif(1) |> Cache()

# b2 and b are still identical; a and a2 are not
a == a2 # FALSE
b == b2 # TRUE

# This time keep ONLY the rnorm
keepCache(userTags = "rnorm", ask = FALSE)
a3 <- rnorm(1) |> Cache()
b3 <- runif(1) |> Cache()

# a3 and a2 are identical; b3 and b2 are not
a2 == a3 # TRUE
b2 == b3 # FALSE
10.6 Nested Caching
Nested caching occurs when a cached function call sits inside an outer function that is itself cached. This is a critical element of working within a reproducible workflow. Ideally, at all points in a development cycle, it should be possible to get to any line of code, starting from the very first steps and running everything up to that point, in less than a few seconds. If the workflow can be kept this fast, it is very likely to work whenever it is tested.
In the example here, we run an outer function that calls an inner function. If we decide to change the outer function along the way, and the inner function is unaffected, then we can still recover the cached version of the inner call.
Warning: this does not necessarily work the other way. If the inner function is changed, the cached outer call will not notice the change until we run clearCache and rerun.
Code
# Make 2 functions
inner <- function(mean) {
  d <- 1
  Cache(rnorm(3, mean = mean))
}
outer <- function(n) {
  Cache(inner(0.1))
}

# Call outer function
Cache(outer(n = 2))
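The two situations described above (changing the outer function, then changing the inner one) can be sketched by continuing the example, following the steps in this chapter's code listing. The temporary cache folder and the names `out1`, `out2` and `out3` are assumptions added for this illustration. Values are compared after stripping attributes.

```r
library(reproducible)

# Assumption: an isolated, temporary cache folder, set through the option so
# that the nested Cache calls also use it
optOrig <- options(reproducible.cachePath = file.path(tempdir(), "cache_nested_demo"))

inner <- function(mean) {
  d <- 1
  Cache(rnorm(3, mean = mean))
}
outer <- function(n) {
  Cache(inner(0.1))
}
out1 <- Cache(outer(n = 2))

# Change the outer function: the outer call is re-run,
# but the inner call is still recovered from the cache
outer <- function(n) {
  a <- 0.1
  Cache(inner(a))
}
out2 <- Cache(outer(n = 2))

# Change the inner function: the cached outer call is recovered as-is,
# so the change to inner goes unnoticed
inner <- function(mean) {
  d <- 2 # Changed d
  Cache(rnorm(3, mean = mean))
}
out3 <- Cache(outer(n = 2))

options(optOrig)

unchanged12 <- all(as.numeric(out1) == as.numeric(out2))
unchanged23 <- all(as.numeric(out2) == as.numeric(out3))
```

If the change to `inner` matters, the relevant entries need to be removed (for example with clearCache) before rerunning, as the warning above notes.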
library(reproducible)
reproducible::Cache(rnorm(1))
rnorm(1) |> reproducible::Cache()
reproducible::Cache(rnorm, n = 1)

# Data setup
library(terra)
tmpDir <- file.path(tempdir(), "reproducible_examples", "Cache")
dir.create(tmpDir, recursive = TRUE)
x <- y <- 2001
ras <- terra::rast(terra::ext(0, x, 0, y), vals = sample(1:(x*y)), res = 1)
terra::crs(ras) <- "+proj=lcc +lat_1=48 +lat_2=33 +lon_0=-100 +datum=WGS84"
newCRS <- "+init=epsg:4326" # A longlat crs

# Call slow operation: project
# No Cache
system.time(map1 <- terra::project(ras, newCRS)) # Warnings due to new PROJ

# With Cache -- a little slower the first time because saving to disk
system.time(map2 <- terra::project(ras, newCRS) |> Cache())

# faster the second time; improvement depends on size of object and time to run function
system.time(map3 <- terra::project(ras, newCRS) |> Cache())

optOrig <- options("reproducible.useMemoise" = TRUE)
system.time(map4 <- terra::project(ras, newCRS) |> Cache())
system.time(map5 <- terra::project(ras, newCRS) |> Cache())
options(optOrig)

options("reproducible.cachePath")

ranNumsA <- Cache(rnorm, 4, userTags = c("random number generator"))
showCache(userTags = "random number generator")

# Two different functions
a <- rnorm(1) |> Cache()
b <- runif(1) |> Cache()

# Clear only the first one
clearCache(userTags = "rnorm", ask = FALSE)
a2 <- rnorm(1) |> Cache()
b2 <- runif(1) |> Cache()

# b2 and b are still identical; a and a2 are not
a == a2 # FALSE
b == b2 # TRUE

# This time keep ONLY the rnorm
keepCache(userTags = "rnorm", ask = FALSE)
a3 <- rnorm(1) |> Cache()
b3 <- runif(1) |> Cache()

# a3 and a2 are identical; b3 and b2 are not
a2 == a3 # TRUE
b2 == b3 # FALSE

# Make 2 functions
inner <- function(mean) {
  d <- 1
  Cache(rnorm(3, mean = mean))
}
outer <- function(n) {
  Cache(inner(0.1))
}

# Call outer function
Cache(outer(n = 2))

# Change outer function
outer <- function(n) {
  a <- 0.1
  Cache(inner(a))
}

# Still recovers inner
Cache(outer(n = 2))

# BUT if we change the inner, it won't work
inner <- function(mean) {
  d <- 2 # Changed d
  Cache(rnorm(3, mean = mean))
}
Cache(outer(n = 2))