10  Introduction to Cache

Author

Eliot J. B. McIntire

Published

November 15, 2024

See Barebones R script for the code shown in this chapter

The objective of a reproducible workflow is that the entire workflow, from raw data to publication, decision support, report writing, presentation building, etc., can be rebuilt on demand and reproduced anywhere: on any computer and operating system, with any starting conditions.

As part of a reproducible workflow, caching of function calls, code chunks, and other elements of a project can be very valuable. Caching allows a code writer to run all code regularly without “secretly” skipping certain lines because they take too long to run. This has two benefits:

  1. Each line gets run regularly so when failures crop up, they are detected quickly and can be fixed when they are introduced.

  2. Running code “somewhere else” (different machine, person, operating system) will be more likely to work on an ongoing basis, i.e., it will help maintain a “reproducible” state.

These benefits allow code to be constantly in a state of functioning, from start to finish, which lowers the effort “at the end” to make all the work reproducible.

The reproducible::Cache() function is built to work with many R functions, including some that are used for their side effects or that use pointers (e.g., a terra SpatRaster) instead of regular R objects.

10.1 How to use Cache

The Cache function can be used with any function. A user can wrap it around another function call, use the base pipe operator |>, or specify the function and its arguments as arguments to Cache. The following three calls are equivalent, so the second and third return the same (cached) output as the first:

Code
library(reproducible)
reproducible::Cache(rnorm(1))
[1] -0.2146587
attr(,".Cache")
attr(,".Cache")$newCache
[1] TRUE

attr(,"tags")
[1] "cacheId:422bae4ed2f770cc"
attr(,"call")
[1] ""
Code
rnorm(1) |>
  reproducible::Cache()
[1] -0.2146587
attr(,".Cache")
attr(,".Cache")$newCache
[1] FALSE

attr(,"tags")
[1] "cacheId:422bae4ed2f770cc"
attr(,"call")
[1] ""
Code
reproducible::Cache(rnorm,
                    n = 1)
[1] -0.2146587
attr(,".Cache")
attr(,".Cache")$newCache
[1] FALSE

attr(,"tags")
[1] "cacheId:422bae4ed2f770cc"
attr(,"call")
[1] ""

See ?reproducible::Cache for many more examples.
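
The .Cache attribute printed with each result above can also be inspected programmatically. The following is a minimal sketch, assuming a fresh cache folder (the tempfile() path is used only for this illustration) and that the attribute has the structure shown in the printed output:

```r
library(reproducible)

# Use a fresh, throwaway cache folder so the first call cannot hit an old entry
# (this path is an assumption made for the illustration only)
opts <- options(reproducible.cachePath = tempfile("cacheDemo"))

first  <- Cache(rnorm(1))   # computed, then stored
second <- Cache(rnorm(1))   # same call: recovered from the cache

# The ".Cache" attribute records whether a new cache entry was created
isNewFirst  <- attr(first, ".Cache")$newCache
isNewSecond <- attr(second, ".Cache")$newCache

# Compare the values only, ignoring the attributes that Cache attaches
sameValue <- identical(as.numeric(first), as.numeric(second))

print(c(first = isNewFirst, second = isNewSecond, sameValue = sameValue))

options(opts)  # restore the previous cache path
```

Checking newCache in this way can be handy in scripts that should behave differently depending on whether a result was recomputed or recovered.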

10.2 When to use Cache

The most obvious case to use Cache is when the calculation is expensive. For example, GIS operations are commonly time consuming. In the example below, we will use terra::project three times, with timings.

10.2.1 From disk

Code
# Data setup
library(terra)
tmpDir <- file.path(tempdir(), "reproducible_examples", "Cache")
dir.create(tmpDir, recursive = TRUE)

x <- y <- 2001
ras <- terra::rast(terra::ext(0, x, 0, y), vals = sample(1:(x*y)), res = 1)
terra::crs(ras) <- "+proj=lcc +lat_1=48 +lat_2=33 +lon_0=-100 +datum=WGS84"
newCRS <- "+init=epsg:4326" # A longlat crs

# Call slow operation: project
# No Cache
system.time(map1 <- terra::project(ras, newCRS)) # Warnings due to new PROJ
   user  system elapsed 
   0.11    0.00    2.12 
Code
# With Cache -- a little slower the first time because saving to disk
system.time(map2 <- terra::project(ras, newCRS) |> Cache())
   user  system elapsed 
   1.69    0.07    9.31 
Code
# faster the second time; improvement depends on size of object and time to run function
system.time(map3 <- terra::project(ras, newCRS) |> Cache())
   user  system elapsed 
   0.19    0.00    1.06 

In this example, the output of terra::project is cached the first time Cache is called (map2), keyed on the function and its arguments (ras and newCRS); map3 is then recovered from that cache entry. If either the function or the supplied arguments change, Cache repeats the operation and saves the output in a new cache entry.
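
The idea of keying a cache entry on the function and its arguments can be sketched in base R. This is a toy illustration only, not how reproducible::Cache is implemented: toyCache and slowSquare are hypothetical names, and the key here is simply the deparsed text of the function and its arguments, whereas Cache computes digests.

```r
# A toy in-memory cache keyed on the function's code and its arguments.
# toyCache is a hypothetical helper written for this illustration only.
toyCache <- local({
  store <- list()
  function(FUN, ...) {
    # The key changes whenever the function body or any argument changes
    key <- paste(deparse(list(FUN, ...)), collapse = "\n")
    if (key %in% names(store)) {
      return(list(value = store[[key]], newCache = FALSE))
    }
    value <- FUN(...)
    store[[key]] <<- value
    list(value = value, newCache = TRUE)
  }
})

slowSquare <- function(x) {
  Sys.sleep(0.1)  # stand-in for an expensive computation
  x^2
}

r1 <- toyCache(slowSquare, 4)  # first call: computed
r2 <- toyCache(slowSquare, 4)  # same function and argument: recovered
r3 <- toyCache(slowSquare, 5)  # argument changed: computed again

# Changing the function body also produces a new entry
slowSquare <- function(x) {
  Sys.sleep(0.1)
  x * x
}
r4 <- toyCache(slowSquare, 4)

print(c(r1 = r1$newCache, r2 = r2$newCache, r3 = r3$newCache, r4 = r4$newCache))
```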

10.2.2 From disk and memory

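We can set an option so that objects are still saved to disk as normal but are also kept in RAM (“memoising”). In the timings below, the first call after setting the option (map4) is still read from disk, while the second (map5) is recovered from memory.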

Code
optOrig <- options("reproducible.useMemoise" = TRUE)
system.time(map4 <- terra::project(ras, newCRS) |> Cache())
   user  system elapsed 
   0.19    0.02    1.06 
Code
system.time(map5 <- terra::project(ras, newCRS) |> Cache())
   user  system elapsed 
   0.00    0.00    0.14 
Code
options(optOrig)
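
The set-then-restore pattern used above (optOrig <- options(...) followed by options(optOrig)) can be wrapped in a small helper so that the option is restored even if the code fails part way. This is a base R sketch; withOptions is a hypothetical helper, not part of reproducible.

```r
# Temporarily set options while evaluating an expression, then restore them.
# withOptions is a hypothetical helper written for this illustration only.
withOptions <- function(newOptions, expr) {
  old <- options(newOptions)
  on.exit(options(old), add = TRUE)  # restored even if expr throws an error
  force(expr)                        # expr is evaluated only now (lazy evaluation)
}

before <- getOption("reproducible.useMemoise")

inside <- withOptions(
  list(reproducible.useMemoise = TRUE),
  getOption("reproducible.useMemoise")
)

after <- getOption("reproducible.useMemoise")

print(inside)                    # value seen while the option was set
print(identical(before, after))  # the original setting is back in place
```

In real use, the second argument would be the Cache call that should run with memoising switched on.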

10.3 Where does the cache live?

By default, the cache lives in a temporary folder that does not persist between R sessions. To see where this folder is, run:

Code
options("reproducible.cachePath")
$reproducible.cachePath
[1] "C:\\Users\\emcintir\\AppData\\Local\\Temp\\Rtmp6d1bIb/reproducible/cache"

For a persistent cache, we should change this to a permanent folder path. This can be done in two ways:

Code
Cache(rnorm(1), 
      cachePath = "~/SpaDES_book/cache")
options("reproducible.cachePath")   ## still the temporary directory


options("reproducible.cachePath" = "~/SpaDES_book/cache")
Cache(rnorm(1))

The second approach, using options, sets the cachePath for all subsequent Cache calls.

In a SpaDES workflow context, the cache directory can be set by passing list(..., cachePath = <a_path>) (see footnote 1) to the paths argument in setupProject, simInit or simInitAndSpaDES.
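
Each cache folder is independent of the others, so the same call cached under two different cachePath values produces two separate entries. A minimal sketch, using throwaway tempfile() folders as stand-ins for permanent paths (an assumption made for the illustration):

```r
library(reproducible)

# Two separate cache folders (temporary here; use permanent paths in real work)
pathA <- tempfile("cacheA")
pathB <- tempfile("cacheB")

a1 <- Cache(rnorm(1), cachePath = pathA)  # new entry in folder A
b1 <- Cache(rnorm(1), cachePath = pathB)  # folder B knows nothing about A
a2 <- Cache(rnorm(1), cachePath = pathA)  # recovered from folder A

newFlags <- c(
  a1 = attr(a1, ".Cache")$newCache,
  b1 = attr(b1, ".Cache")$newCache,
  a2 = attr(a2, ".Cache")$newCache
)
print(newFlags)
```

This is one reason to set the path once with options in a project: every Cache call then reads from and writes to the same folder.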

10.4 Caching examples

10.4.1 Basic cache use with tags

We can add tags to identify the Cache call.

Code
ranNumsA <- Cache(rnorm, 4, userTags = c("random number generator"))

showCache(userTags = "random number generator")
             cacheId              tagKey                  tagValue
              <char>              <char>                    <char>
 1: ad0ea27476c50b66            function                     rnorm
 2: ad0ea27476c50b66            userTags   random number generator
 3: ad0ea27476c50b66            accessed 2024-11-15 17:14:11.00470
 4: ad0ea27476c50b66             inCloud                     FALSE
 5: ad0ea27476c50b66   elapsedTimeDigest           0.01090217 secs
 6: ad0ea27476c50b66           preDigest        n:7eef4eae85fd9229
 7: ad0ea27476c50b66           preDigest     mean:c40c00762a0dac94
 8: ad0ea27476c50b66           preDigest       sd:853b1797f54b229c
 9: ad0ea27476c50b66           preDigest     .FUN:4f604aa46882b368
10: ad0ea27476c50b66               class                   numeric
11: ad0ea27476c50b66         object.size                        80
12: ad0ea27476c50b66            fromDisk                     FALSE
13: ad0ea27476c50b66          resultHash                          
14: ad0ea27476c50b66 elapsedTimeFirstRun          0.001736879 secs
                  createdDate
                       <char>
 1: 2024-11-15 17:14:11.00822
 2: 2024-11-15 17:14:11.00822
 3: 2024-11-15 17:14:11.00822
 4: 2024-11-15 17:14:11.00822
 5: 2024-11-15 17:14:11.00822
 6: 2024-11-15 17:14:11.00822
 7: 2024-11-15 17:14:11.00822
 8: 2024-11-15 17:14:11.00822
 9: 2024-11-15 17:14:11.00822
10: 2024-11-15 17:14:11.00822
11: 2024-11-15 17:14:11.00822
12: 2024-11-15 17:14:11.00822
13: 2024-11-15 17:14:11.00822
14: 2024-11-15 17:14:11.00822
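
The table returned by showCache can be filtered like any other table. The sketch below assumes the column names shown in the output above (cacheId, tagKey, tagValue) and uses a throwaway cache folder and a made-up tag, demoRandomNumbers, for the illustration:

```r
library(reproducible)

# Throwaway cache folder for this illustration only
opts <- options(reproducible.cachePath = tempfile("cacheTags"))

ranNums <- Cache(rnorm, 4, userTags = "demoRandomNumbers")

# Convert to a plain data.frame so that base R subsetting applies
entry <- as.data.frame(showCache(userTags = "demoRandomNumbers"))

cachedFunction <- entry$tagValue[entry$tagKey == "function"]
cachedTags     <- entry$tagValue[entry$tagKey == "userTags"]

print(cachedFunction)
print(cachedTags)

options(opts)  # restore the previous cache path
```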

10.5 Clean up cache

We can use either clearCache, keepCache, or cc to remove things from the Cache database. clearCache removes everything that matches the query. keepCache keeps everything that matches the query. cc removes the most recent entry (i.e., it is a shorthand for a commonly used option).

Code
# Two different functions
a <- rnorm(1) |> Cache()
b <- runif(1) |> Cache()

# Clear only the first one
clearCache(userTags = "rnorm", ask = FALSE)
a2 <- rnorm(1) |> Cache()
b2 <- runif(1) |> Cache()

# b2 and b are still identical; a and a2 are not
a == a2 # FALSE
b == b2 # TRUE

# This time keep ONLY the rnorm
keepCache(userTags = "rnorm", ask = FALSE)
a3 <- rnorm(1) |> Cache()
b3 <- runif(1) |> Cache()

# a2 and a3 are identical (rnorm entry was kept); b2 and b3 are not
a2 == a3 # TRUE
b2 == b3 # FALSE
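
To see the difference between clearCache and keepCache more directly, we can list which functions have entries in the cache after each step. This is a sketch under assumptions: it uses a throwaway cache folder, and cachedFunctions is a hypothetical helper that relies on the tagKey and tagValue columns shown in the showCache output earlier.

```r
library(reproducible)

# Throwaway cache folder for this illustration only
opts <- options(reproducible.cachePath = tempfile("cacheCleanup"))

# Hypothetical helper: names of the functions that currently have cache entries
cachedFunctions <- function() {
  sc <- as.data.frame(showCache())
  sort(sc$tagValue[sc$tagKey == "function"])
}

a <- Cache(rnorm, 1)
b <- Cache(runif, 1)
before <- cachedFunctions()      # both functions have an entry

# clearCache removes what matches the query
clearCache(userTags = "runif", ask = FALSE)
afterClear <- cachedFunctions()  # only the rnorm entry should remain

# Re-create the runif entry, then keep ONLY what matches the query
b <- Cache(runif, 1)
keepCache(userTags = "runif", ask = FALSE)
afterKeep <- cachedFunctions()   # only the runif entry should remain

print(list(before = before, afterClear = afterClear, afterKeep = afterKeep))

options(opts)  # restore the previous cache path
```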

10.6 Nested Caching

Nested caching is when caching of a function occurs inside an outer function, which is itself cached. This is a critical element of working within a reproducible workflow. Ideally, at all points in a development cycle, it should be possible to get to any line of code starting from the very initial steps, running through everything up to that point, in less than a few seconds. If the workflow can be kept very fast like this, it will be very likely to work at any point it is tested.

In the example here, we run an outer function that calls an inner function. If we decide to change the outer function along the way, and the inner function is unaffected, then we can still recover the cached version of the inner call.

Warning: this will not necessarily work the other way, i.e., if inner is changed, the change goes unnoticed until we clearCache and rerun.

Code
# Make 2 functions
inner <- function(mean) {
  d <- 1
  Cache(rnorm(3, mean = mean))
}
outer <- function(n) {
  Cache(inner(0.1))
}

# Call outer function
Cache(outer(n = 2))
[1] -0.594529768 -0.005023595  1.419813487
attr(,".Cache")
attr(,".Cache")$newCache
[1] TRUE

attr(,"tags")
[1] "cacheId:aa5fa876944e27be"
attr(,"call")
[1] ""
Code
# Change outer function
outer <- function(n) {
  a <- 0.1
  Cache(inner(a))
}

# Still recovers inner 
Cache(outer(n = 2))
[1] -0.594529768 -0.005023595  1.419813487
attr(,".Cache")
attr(,".Cache")$newCache
[1] TRUE

attr(,"tags")
[1] "cacheId:293cb55e4c25d110"
attr(,"call")
[1] ""
Code
# BUT if we change inner, the change is not detected: the cached outer result is returned
inner <- function(mean) {
  d <- 2                        # Changed d
  Cache(rnorm(3, mean = mean))
}
Cache(outer(n = 2))
[1] -0.594529768 -0.005023595  1.419813487
attr(,".Cache")
attr(,".Cache")$newCache
[1] FALSE

attr(,"tags")
[1] "cacheId:293cb55e4c25d110"
attr(,"call")
[1] ""
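
Why the change to inner goes unnoticed can be illustrated with a toy cache in base R. This is a sketch of the general mechanism only, not of how reproducible::Cache builds its digests: toyCache, innerFun and outerFun are hypothetical names, and the key is just the deparsed code of the function being cached.

```r
# Toy cache keyed on the code of the cached function and its arguments
# (hypothetical helper written for this illustration only)
toyCache <- local({
  store <- list()
  function(FUN, ...) {
    key <- paste(deparse(list(FUN, ...)), collapse = "\n")
    if (key %in% names(store)) {
      return(list(value = store[[key]], newCache = FALSE))
    }
    value <- FUN(...)
    store[[key]] <<- value
    list(value = value, newCache = TRUE)
  }
})

innerFun <- function(mean) mean + 1
outerFun <- function() innerFun(0.1)

r1 <- toyCache(outerFun)  # computed with the original innerFun

# Change the inner function only: the code of outerFun is untouched,
# because it refers to innerFun by name
innerFun <- function(mean) mean + 2

r2 <- toyCache(outerFun)  # key is unchanged, so the stale value is returned
fresh <- outerFun()       # what an uncached call returns now

print(c(cached = r2$value, fresh = fresh, newCache = r2$newCache))
```

Clearing the cache forces the outer call to be re-evaluated, at which point the updated inner function is used.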

10.7 Best practices

In general, we have found the use of Cache to be beneficial when used as follows:

  1. Slow functions get cached. “Slow” can simply mean “slower than retrieving the result with Cache”.

  2. Regularly run clearCache(ask = FALSE), e.g., at the end of a day or week of work, then rerun the whole workflow and let it rebuild the cache.

Some of our team also recommend:

  1. If using SpaDES, use cache at the event level, if the event is non-stochastic.

  2. Don’t cache a simInit call; instead, implement internal caching in the modules and use event caching. See vignette(topic = "iii-cache", package = "SpaDES.core").
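
Whether a function is “slow” in the sense of the first practice above can be checked by timing it with and without Cache. A minimal sketch, assuming a throwaway cache folder; slowMean is a hypothetical function that uses Sys.sleep as a stand-in for real work:

```r
library(reproducible)

# Throwaway cache folder for this illustration only
opts <- options(reproducible.cachePath = tempfile("cacheTiming"))

# Hypothetical slow function: the sleep stands in for an expensive computation
slowMean <- function(n) {
  Sys.sleep(0.5)
  mean(seq_len(n))
}

tRaw    <- system.time(slowMean(10))[["elapsed"]]
tFirst  <- system.time(Cache(slowMean(10)))[["elapsed"]]         # compute and save
tCached <- system.time(res <- Cache(slowMean(10)))[["elapsed"]]  # recover

# Caching pays off when the recovery time is clearly below the raw time
print(c(raw = tRaw, firstCached = tFirst, recovered = tCached))

options(opts)  # restore the previous cache path
```

The timings depend on the machine and on the size of the cached object, so it is worth repeating this check with the real function and data.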

10.8 See also

SpaDES.core vignette on caching

10.9 Barebones R script

Code
library(reproducible)
reproducible::Cache(rnorm(1))

rnorm(1) |>
  reproducible::Cache()

reproducible::Cache(rnorm,
                    n = 1)

# Data setup
library(terra)
tmpDir <- file.path(tempdir(), "reproducible_examples", "Cache")
dir.create(tmpDir, recursive = TRUE)

x <- y <- 2001
ras <- terra::rast(terra::ext(0, x, 0, y), vals = sample(1:(x*y)), res = 1)
terra::crs(ras) <- "+proj=lcc +lat_1=48 +lat_2=33 +lon_0=-100 +datum=WGS84"
newCRS <- "+init=epsg:4326" # A longlat crs

# Call slow operation: project
# No Cache
system.time(map1 <- terra::project(ras, newCRS)) # Warnings due to new PROJ

# With Cache -- a little slower the first time because saving to disk
system.time(map2 <- terra::project(ras, newCRS) |> Cache())

# faster the second time; improvement depends on size of object and time to run function
system.time(map3 <- terra::project(ras, newCRS) |> Cache())

optOrig <- options("reproducible.useMemoise" = TRUE)
system.time(map4 <- terra::project(ras, newCRS) |> Cache())

system.time(map5 <- terra::project(ras, newCRS) |> Cache())
options(optOrig)

options("reproducible.cachePath")



ranNumsA <- Cache(rnorm, 4, userTags = c("random number generator"))

showCache(userTags = "random number generator")

# Two different functions
a <- rnorm(1) |> Cache()
b <- runif(1) |> Cache()

# Clear only the first one
clearCache(userTags = "rnorm", ask = FALSE)
a2 <- rnorm(1) |> Cache()
b2 <- runif(1) |> Cache()

# b2 and b are still identical; a and a2 are not
a == a2 # FALSE
b == b2 # TRUE

# This time keep ONLY the rnorm
keepCache(userTags = "rnorm", ask = FALSE)
a3 <- rnorm(1) |> Cache()
b3 <- runif(1) |> Cache()

# a2 and a3 are identical (rnorm entry was kept); b2 and b3 are not
a2 == a3 # TRUE
b2 == b3 # FALSE

# Make 2 functions
inner <- function(mean) {
  d <- 1
  Cache(rnorm(3, mean = mean))
}
outer <- function(n) {
  Cache(inner(0.1))
}

# Call outer function
Cache(outer(n = 2))

# Change outer function
outer <- function(n) {
  a <- 0.1
  Cache(inner(a))
}

# Still recovers inner 
Cache(outer(n = 2))

# BUT if we change inner, the change is not detected: the cached outer result is returned
inner <- function(mean) {
  d <- 2                        # Changed d
  Cache(rnorm(3, mean = mean))
}
Cache(outer(n = 2))

  1. where ... are other paths, like modulePath.