10  Introduction to Cache

Author

Eliot J. B. McIntire

Published

June 21, 2024

See Barebones R script for the code shown in this chapter

The objective of a reproducible workflow is that the entire workflow, from raw data to publication, decision support, report writing, presentation building, etc., can be built and reproduced anywhere, on any computer and operating system, with any starting conditions, on demand.

As part of a reproducible workflow, caching of function calls, code chunks, and other elements of a project can be very valuable. Caching allows a code writer to run all code regularly without “secretly” skipping certain lines because they take too long to run. This has two benefits:

  1. Each line gets run regularly so when failures crop up, they are detected quickly and can be fixed when they are introduced.

  2. Running code “somewhere else” (different machine, person, operating system) will be more likely to work on an ongoing basis, i.e., it will help maintain a “reproducible” state.

Together, these benefits keep code in a working state from start to finish, which lowers the effort needed “at the end” to make all the work reproducible.

The reproducible::Cache() function is built to work with many R functions, including some that are used for their side effects or that use pointers (e.g., a terra SpatRaster) instead of regular R objects.

10.1 How to use Cache

The Cache function can be used with any function. A user can wrap it around another function call, use the base pipe operator |>, or specify the function and function arguments as arguments to Cache. The following calls are equivalent, so the 2nd and 3rd return the same output as the first (recovered from the cache rather than recomputed):

Code
library(reproducible)
reproducible::Cache(rnorm(1))
[1] -0.6206162
attr(,".Cache")
attr(,".Cache")$newCache
[1] TRUE

attr(,"tags")
[1] "cacheId:422bae4ed2f770cc"
attr(,"call")
[1] ""
Code
rnorm(1) |>
  reproducible::Cache()
[1] -0.6206162
attr(,".Cache")
attr(,".Cache")$newCache
[1] FALSE

attr(,"tags")
[1] "cacheId:422bae4ed2f770cc"
attr(,"call")
[1] ""
Code
reproducible::Cache(rnorm,
                    n = 1)
[1] -0.6206162
attr(,".Cache")
attr(,".Cache")$newCache
[1] FALSE

attr(,"tags")
[1] "cacheId:422bae4ed2f770cc"
attr(,"call")
[1] ""

See ?reproducible::Cache for many more examples.
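
One way to check whether a result was computed or recovered is to inspect the .Cache attribute that appears in the printed output above. The following is a minimal sketch, assuming the attribute layout shown in this chapter; the scratch folder name is made up for illustration so that the example does not touch any existing cache.

```r
library(reproducible)

# Hypothetical scratch cache folder, used only for this illustration
tmpCache <- file.path(tempdir(), "cache_howto")

first  <- Cache(rnorm(1), cachePath = tmpCache)     # computed and stored
second <- rnorm(1) |> Cache(cachePath = tmpCache)   # same call, recovered

# TRUE when the call was run and newly cached, FALSE when recovered
attr(first, ".Cache")$newCache
attr(second, ".Cache")$newCache

# The recovered value equals the stored one (attributes aside)
as.numeric(first) == as.numeric(second)
```

Stripping the attributes with as.numeric before comparing keeps the check focused on the values themselves.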

10.2 When to use Cache

The most obvious time to use Cache is when the calculation is expensive. For example, GIS operations are commonly time-consuming. In the example below, we will use terra::project three times, with timings.

10.2.1 From disk

Code
# Data setup
library(terra)
tmpDir <- file.path(tempdir(), "reproducible_examples", "Cache")
dir.create(tmpDir, recursive = TRUE)

x <- y <- 2001
ras <- terra::rast(terra::ext(0, x, 0, y), vals = sample(1:(x*y)), res = 1)
terra::crs(ras) <- "+proj=lcc +lat_1=48 +lat_2=33 +lon_0=-100 +datum=WGS84"
newCRS <- "+init=epsg:4326" # A longlat crs

# Call slow operation: project
# No Cache
system.time(map1 <- terra::project(ras, newCRS)) # Warnings due to new PROJ
   user  system elapsed 
   1.91    0.06    2.00 
Code
# With Cache -- slower the first time because inputs are digested and the result is saved to disk
system.time(map2 <- terra::project(ras, newCRS) |> Cache())
   user  system elapsed 
  18.57    0.56   19.17 
Code
# faster the second time; improvement depends on size of object and time to run function
system.time(map3 <- terra::project(ras, newCRS) |> Cache())
   user  system elapsed 
   1.58    0.12    1.72 

In this example, the output of terra::project is cached the first time Cache is called (map2), along with a digest of the function and its arguments (ras and newCRS); map3 is then recovered from the cache. If either the function or the supplied arguments change, Cache repeats the operation and stores the output in a new cache entry.
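
The rule that a changed function or changed argument triggers a rerun can be sketched with a lightweight stand-in for an expensive operation. This is only an illustration: slowSquare and the scratch folder are hypothetical names that do not appear elsewhere in this chapter, and the .Cache attribute is assumed to behave as in the printed output earlier.

```r
library(reproducible)

tmpCache <- file.path(tempdir(), "cache_when")  # hypothetical scratch folder

# Hypothetical stand-in for an expensive operation
slowSquare <- function(x) {
  Sys.sleep(0.5)
  x^2
}

r1 <- Cache(slowSquare(2), cachePath = tmpCache)  # runs and caches
r2 <- Cache(slowSquare(2), cachePath = tmpCache)  # same function and argument: recovered
r3 <- Cache(slowSquare(3), cachePath = tmpCache)  # argument changed: runs again

# Changing the function definition also changes what Cache digests
slowSquare <- function(x) {
  Sys.sleep(0.5)
  x * x
}
r4 <- Cache(slowSquare(2), cachePath = tmpCache)  # function changed: runs again

# Which of the four calls created a new cache entry?
sapply(list(r1, r2, r3, r4), function(r) attr(r, ".Cache")$newCache)
```

Only the second call should be recovered; each of the others differs from everything stored so far in either its function or its argument.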

10.2.2 From disk and memory

We can set an option so that objects are still saved to disk as normal, but are also kept in RAM (“memoising”), which makes subsequent recovery faster.

Code
optOrig <- options("reproducible.useMemoise" = TRUE)
system.time(map4 <- terra::project(ras, newCRS) |> Cache())
   user  system elapsed 
   1.32    0.06    1.41 
Code
system.time(map5 <- terra::project(ras, newCRS) |> Cache())
   user  system elapsed 
   0.14    0.03    0.19 
Code
options(optOrig)

10.3 Where does the cache live?

By default, the cache lives in a temporary folder that does not persist between R sessions. To see where this folder is, run:

Code
options("reproducible.cachePath")
$reproducible.cachePath
[1] "C:\\Users\\cbarros\\AppData\\Local\\Temp\\RtmpqA2PkG/reproducible/cache"

For a persistent cache, we should change to a permanent folder path. This can be done in two ways:

Code
Cache(rnorm(1), 
      cachePath = "~/SpaDES_book/cache")
options("reproducible.cachePath")   ## still the temporary directory


options("reproducible.cachePath" = "~/SpaDES_book/cache")
Cache(rnorm(1))

The second, using options, sets the cachePath for any subsequent Cache call.

In a SpaDES workflow context, the cache directory can be set by passing list(..., cachePath = <a_path>)¹ to the paths argument in setupProject, simInit or simInitAndSpaDES.
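
The options-based approach can be sketched as follows. The folder name is hypothetical and sits under tempdir() so the example cleans up after itself; a real project would point to a permanent path instead. Keeping the value returned by options allows the previous setting to be restored afterwards.

```r
library(reproducible)

# Hypothetical project cache folder; a real project would use a permanent path
projCache <- file.path(tempdir(), "myProject", "cache")

optOrig <- options(reproducible.cachePath = projCache)  # keep the old value

x <- Cache(rnorm(1))                  # no cachePath argument needed now
getOption("reproducible.cachePath")   # the folder set above
entries <- showCache()                # lists entries in the folder set by the option

options(optOrig)                      # restore the previous cache folder
```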

10.4 Caching examples

10.4.1 Basic cache use with tags

We can add tags to identify the Cache call.

Code
ranNumsA <- Cache(rnorm, 4, userTags = c("random number generator"))

showCache(userTags = "random number generator")
             cacheId                  tagKey                   tagValue
              <char>                  <char>                     <char>
 1: ad0ea27476c50b66 random number generator    random number generator
 2: ad0ea27476c50b66                function                      rnorm
 3: ad0ea27476c50b66                   class                    numeric
 4: ad0ea27476c50b66             object.size                       1008
 5: ad0ea27476c50b66                accessed 2024-06-21 20:31:06.275774
 6: ad0ea27476c50b66                 inCloud                      FALSE
 7: ad0ea27476c50b66                fromDisk                      FALSE
 8: ad0ea27476c50b66              resultHash                           
 9: ad0ea27476c50b66       elapsedTimeDigest           0.007800817 secs
10: ad0ea27476c50b66     elapsedTimeFirstRun          0.0001580715 secs
11: ad0ea27476c50b66          otherFunctions                    execute
12: ad0ea27476c50b66          otherFunctions               process_file
13: ad0ea27476c50b66          otherFunctions              process_group
14: ad0ea27476c50b66          otherFunctions                 call_block
15: ad0ea27476c50b66          otherFunctions                 block_exec
16: ad0ea27476c50b66          otherFunctions                      eng_r
17: ad0ea27476c50b66          otherFunctions               in_input_dir
18: ad0ea27476c50b66          otherFunctions                     in_dir
19: ad0ea27476c50b66          otherFunctions                  timing_fn
20: ad0ea27476c50b66          otherFunctions                     handle
21: ad0ea27476c50b66               preDigest         n:7eef4eae85fd9229
22: ad0ea27476c50b66               preDigest      mean:c40c00762a0dac94
23: ad0ea27476c50b66               preDigest        sd:853b1797f54b229c
24: ad0ea27476c50b66               preDigest      .FUN:4f604aa46882b368
             cacheId                  tagKey                   tagValue
                   createdDate
                        <char>
 1: 2024-06-21 20:31:06.277134
 2: 2024-06-21 20:31:06.277134
 3: 2024-06-21 20:31:06.277134
 4: 2024-06-21 20:31:06.277134
 5: 2024-06-21 20:31:06.277134
 6: 2024-06-21 20:31:06.277134
 7: 2024-06-21 20:31:06.277134
 8: 2024-06-21 20:31:06.277134
 9: 2024-06-21 20:31:06.277134
10: 2024-06-21 20:31:06.277134
11: 2024-06-21 20:31:06.277134
12: 2024-06-21 20:31:06.277134
13: 2024-06-21 20:31:06.277134
14: 2024-06-21 20:31:06.277134
15: 2024-06-21 20:31:06.277134
16: 2024-06-21 20:31:06.277134
17: 2024-06-21 20:31:06.277134
18: 2024-06-21 20:31:06.277134
19: 2024-06-21 20:31:06.277134
20: 2024-06-21 20:31:06.277134
21: 2024-06-21 20:31:06.277134
22: 2024-06-21 20:31:06.277134
23: 2024-06-21 20:31:06.277134
24: 2024-06-21 20:31:06.277134
                   createdDate
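
Because showCache returns a table with the columns shown above (cacheId, tagKey, tagValue, createdDate), tags can also be used to group several calls and query them together. A minimal sketch, assuming that column layout; the tag "rng-demo" and the scratch folder are made-up names for illustration.

```r
library(reproducible)

tmpCache <- file.path(tempdir(), "cache_tags")  # hypothetical scratch folder

# Two different calls sharing one made-up tag
ranA <- Cache(rnorm, 4, userTags = "rng-demo", cachePath = tmpCache)
ranB <- Cache(runif, 4, userTags = "rng-demo", cachePath = tmpCache)

tags <- showCache(tmpCache, userTags = "rng-demo")

# One cacheId per cached call
unique(tags$cacheId)

# Which functions produced the tagged entries
tags$tagValue[tags$tagKey == "function"]
```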

10.5 Clean up cache

We can use clearCache, keepCache, or cc to remove things from the Cache database. clearCache removes everything that matches the query. keepCache keeps everything that matches the query and removes the rest. cc removes the most recent entry (i.e., it is a shorthand for a commonly used option).

Code
# Two different functions
a <- rnorm(1) |> Cache()
b <- runif(1) |> Cache()

# Clear only the first one
clearCache(userTags = "rnorm", ask = FALSE)
a2 <- rnorm(1) |> Cache()
b2 <- runif(1) |> Cache()

# b2 and b are still identical; a and a2 are not
a == a2 # FALSE
[1] FALSE
Code
b == b2 # TRUE
[1] TRUE
Code
# This time keep ONLY the rnorm
keepCache(userTags = "rnorm", ask = FALSE)
             cacheId              tagKey                   tagValue
              <char>              <char>                     <char>
 1: 422bae4ed2f770cc            function                      rnorm
 2: 422bae4ed2f770cc               class                    numeric
 3: 422bae4ed2f770cc         object.size                        984
 4: 422bae4ed2f770cc            accessed 2024-06-21 20:31:06.605013
 5: 422bae4ed2f770cc             inCloud                      FALSE
 6: 422bae4ed2f770cc            fromDisk                      FALSE
 7: 422bae4ed2f770cc          resultHash                           
 8: 422bae4ed2f770cc   elapsedTimeDigest           0.007511854 secs
 9: 422bae4ed2f770cc elapsedTimeFirstRun          0.0001630783 secs
10: 422bae4ed2f770cc      otherFunctions                    execute
11: 422bae4ed2f770cc      otherFunctions               process_file
12: 422bae4ed2f770cc      otherFunctions              process_group
13: 422bae4ed2f770cc      otherFunctions                 call_block
14: 422bae4ed2f770cc      otherFunctions                 block_exec
15: 422bae4ed2f770cc      otherFunctions                      eng_r
16: 422bae4ed2f770cc      otherFunctions               in_input_dir
17: 422bae4ed2f770cc      otherFunctions                     in_dir
18: 422bae4ed2f770cc      otherFunctions                  timing_fn
19: 422bae4ed2f770cc      otherFunctions                     handle
20: 422bae4ed2f770cc           preDigest         n:853b1797f54b229c
21: 422bae4ed2f770cc           preDigest      mean:c40c00762a0dac94
22: 422bae4ed2f770cc           preDigest        sd:853b1797f54b229c
23: 422bae4ed2f770cc           preDigest      .FUN:4f604aa46882b368
             cacheId              tagKey                   tagValue
                   createdDate
                        <char>
 1: 2024-06-21 20:31:06.606392
 2: 2024-06-21 20:31:06.606392
 3: 2024-06-21 20:31:06.606392
 4: 2024-06-21 20:31:06.606392
 5: 2024-06-21 20:31:06.606392
 6: 2024-06-21 20:31:06.606392
 7: 2024-06-21 20:31:06.606392
 8: 2024-06-21 20:31:06.606392
 9: 2024-06-21 20:31:06.606392
10: 2024-06-21 20:31:06.606392
11: 2024-06-21 20:31:06.606392
12: 2024-06-21 20:31:06.606392
13: 2024-06-21 20:31:06.606392
14: 2024-06-21 20:31:06.606392
15: 2024-06-21 20:31:06.606392
16: 2024-06-21 20:31:06.606392
17: 2024-06-21 20:31:06.606392
18: 2024-06-21 20:31:06.606392
19: 2024-06-21 20:31:06.606392
20: 2024-06-21 20:31:06.606392
21: 2024-06-21 20:31:06.606392
22: 2024-06-21 20:31:06.606392
23: 2024-06-21 20:31:06.606392
                   createdDate
Code
a3 <- rnorm(1) |> Cache()
b3 <- runif(1) |> Cache()

# a2 and a3 are still identical; b2 and b3 are not
a2 == a3 # TRUE
[1] TRUE
Code
b2 == b3 # FALSE
[1] FALSE

10.6 Nested Caching

Nested caching is when a cached function call occurs inside an outer function that is itself cached. This is a critical element of working within a reproducible workflow. Ideally, at all points in a development cycle, it should be possible to get to any line of code, starting from the very first steps and running through everything up to that point, in less than a few seconds. If the workflow can be kept this fast, it can be rerun from the start often, so it is very likely to work at any point it is tested.

In the example here, we run an outer function that calls an inner function. If we decide to change the outer function along the way, and the inner function is unaffected, then we can still recover the cached version of the inner call.

Warning: this will not necessarily work the other way, i.e., if inner is changed, we won’t notice until we clearCache and rerun.

Code
# Make 2 functions
inner <- function(mean) {
  d <- 1
  Cache(rnorm(3, mean = mean))
}
outer <- function(n) {
  Cache(inner(0.1))
}

# Call outer function
Cache(outer(n = 2))
[1] 1.2429132 0.9079636 0.4121525
attr(,".Cache")
attr(,".Cache")$newCache
[1] TRUE

attr(,"tags")
[1] "cacheId:aa5fa876944e27be"
attr(,"call")
[1] ""
Code
# Change outer function
outer <- function(n) {
  a <- 0.1
  Cache(inner(a))
}

# Still recovers inner 
Cache(outer(n = 2))
[1] 1.2429132 0.9079636 0.4121525
attr(,".Cache")
attr(,".Cache")$newCache
[1] TRUE

attr(,"tags")
[1] "cacheId:293cb55e4c25d110"
attr(,"call")
[1] ""
Code
# BUT if we change inner, the change is not detected: the cached outer result is returned
inner <- function(mean) {
  d <- 2                        # Changed d
  Cache(rnorm(3, mean = mean))
}
Cache(outer(n = 2))
[1] 1.2429132 0.9079636 0.4121525
attr(,".Cache")
attr(,".Cache")$newCache
[1] FALSE

attr(,"tags")
[1] "cacheId:293cb55e4c25d110"
attr(,"call")
[1] ""
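
The warning above, and the suggested remedy of clearing and rerunning, can be sketched with deterministic stand-ins so that the stale value is easy to see. This is an illustration under assumptions: the function bodies and the scratch folder are hypothetical, and the whole scratch cache is cleared rather than a single entry.

```r
library(reproducible)

tmpCache <- file.path(tempdir(), "cache_nested")  # hypothetical scratch folder

# Deterministic stand-ins for the functions above, so values are easy to check
inner <- function(mean) {
  d <- 1
  mean + d
}
outer <- function(n) {
  Cache(inner(10), cachePath = tmpCache)
}

res1 <- Cache(outer(n = 2), cachePath = tmpCache)  # 10 + 1

# Change inner only; the outer call is recovered, so the change goes unseen
inner <- function(mean) {
  d <- 2
  mean + d
}
res2 <- Cache(outer(n = 2), cachePath = tmpCache)  # still the old value

# Clear the scratch cache and rerun to pick up the new inner
clearCache(tmpCache, ask = FALSE)
res3 <- Cache(outer(n = 2), cachePath = tmpCache)  # 10 + 2

c(as.numeric(res1), as.numeric(res2), as.numeric(res3))
```

Clearing everything is the bluntest option; in a real project a narrower query (for example by userTags) would usually be preferable so that unrelated entries survive.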

10.7 Best practices

In general, we have found the use of Cache to be beneficial when used as follows:

  1. Slow functions get cached. “Slow” can be “slower than using Cache”.

  2. Regularly clearCache(ask = FALSE), e.g., at the end of a day or week of work, then let it run.

Some of our team regularly add:

  1. If using SpaDES, use cache at the event level, if the event is non-stochastic.

  2. Don’t cache a simInit call – instead implement internal caching in the modules and use event caching. See vignette(topic = "iii-cache", package = "SpaDES.core")

10.8 See also

SpaDES.core vignette on caching

10.9 Barebones R script

Code
library(reproducible)
reproducible::Cache(rnorm(1))

rnorm(1) |>
  reproducible::Cache()

# Data setup
library(terra)
tmpDir <- file.path(tempdir(), "reproducible_examples", "Cache")
dir.create(tmpDir, recursive = TRUE)
x <- y <- 2001
ras <- terra::rast(terra::ext(0, x, 0, y), vals = sample(1:(x*y)), res = 1)
terra::crs(ras) <- "+proj=lcc +lat_1=48 +lat_2=33 +lon_0=-100 +datum=WGS84"
newCRS <- "+init=epsg:4326" # A longlat crs

# Call slow operation: project
# No Cache
system.time(map1 <- terra::project(ras, newCRS)) # Warnings due to new PROJ

# With Cache -- slower the first time because inputs are digested and the result is saved to disk
system.time(map2 <- terra::project(ras, newCRS) |> Cache())

# faster the second time; improvement depends on size of object and time to run function
system.time(map3 <- terra::project(ras, newCRS) |> Cache())

optOrig <- options("reproducible.useMemoise" = TRUE)
system.time(map4 <- terra::project(ras, newCRS) |> Cache())

system.time(map5 <- terra::project(ras, newCRS) |> Cache())
options(optOrig)

ranNumsA <- Cache(rnorm, 4, userTags = c("random number generator"))

showCache(userTags = "random number generator")

# Two different functions
a <- rnorm(1) |> Cache()
b <- runif(1) |> Cache()

# Clear only the first one
clearCache(userTags = "rnorm", ask = FALSE)
a2 <- rnorm(1) |> Cache()
b2 <- runif(1) |> Cache()

# b2 and b are still identical; a and a2 are not
a == a2 # FALSE
b == b2 # TRUE

# This time keep ONLY the rnorm
keepCache(userTags = "rnorm", ask = FALSE)
a3 <- rnorm(1) |> Cache()
b3 <- runif(1) |> Cache()

# a2 and a3 are still identical; b2 and b3 are not
a2 == a3 # TRUE
b2 == b3 # FALSE

# Make 2 functions
inner <- function(mean) {
  d <- 1
  Cache(rnorm(3, mean = mean))
}
outer <- function(n) {
  Cache(inner(0.1))
}

# Call outer function
Cache(outer(n = 2))

# Change outer function
outer <- function(n) {
  a <- 0.1
  Cache(inner(a))
}

# Still recovers inner 
Cache(outer(n = 2))

# BUT if we change inner, the change is not detected: the cached outer result is returned
inner <- function(mean) {
  d <- 2                        # Changed d
  Cache(rnorm(3, mean = mean))
}
Cache(outer(n = 2))

  1. Where ... are other paths, such as modulePath.