11  reproducible::prepInputs for Data

Author

Eliot McIntire

Published

June 21, 2024

See Barebones R script for the code shown in this chapter

Every project that has data will need to get the data from its original source into a form and format that works for a given project. To do this in a reproducible/reusable manner is not trivial.

Some elements that should be present when bringing the data into a project in a reproducible/reusable manner can ideally include:

Similar, if the data are to be re-used in many projects, the approach to getting the data should be:

We designed a function to achieve all this: reproducible::prepInputs

11.1 prepInputs

This function can be used to prepare R objects from remote or local data sources. The object of this function is to provide a reproducible version of a series of commonly used steps for getting, loading, and processing data.

11.2 Simplest case: a url

If we can point directly to a file on a cloud location, prepInputs does its thing, getting the local file.

Code
Require::Require(c("terra", "reproducible"), 
                 repos = c(getOption("predictiveecology.r-universe.dev", getOption("repos"))))
url <- "https://github.com/tati-micheletti/host/raw/master/data/rasterTest.tif"
ras <- prepInputs(url = url, destinationPath = "~/SpaDES_book/data/prepInputs")

terra::plot(ras)

11.3 Case: in an archive (like zip)

Often, cloud-stored files are in an archive, like a zip or tar.gz. These can be handled automatically with prepInputs. By default, it will try to extract all files from the archive, and load one of them into R:

Code
url <- "http://sis.agr.gc.ca/cansis/nsdb/ecostrat/zone/ecozone_shp.zip"
shpEcozone <- prepInputs(url = url, destinationPath = "~/SpaDES_book/data/prepInputs")
terra::plot(shpEcozone[1])

11.4 Case: local stash using reproducible.inputPaths

If we want to use the same file in many projects, we can obviously just download it again in each project, using the same lines of prepInputs. However, we can setup a local “stash” of data files. This is a set it and forget it option in reproducible.

Code
options(reproducible.inputPaths = "~/data")
url <- "http://sis.agr.gc.ca/cansis/nsdb/ecostrat/zone/ecozone_shp.zip"
shpEcozone <- prepInputs(url = url, destinationPath = "~/SpaDES_book/data/prepInputs")
terra::plot(shpEcozone[1])

If we run this chunk in the same project, results will be like the previous chunk: a vector dataset from the Canadian federal government. BUT, if we run it in a different project, it will check the stashed diretory and won’t need to download it again. Moreover, it will not actually make a copy but a link to the file, so it won’t take extra space on your hard drive.

If you are running in a *nix-like system (Mac, Linux, etc.) or you have Git Bash installed, you can see the number of links to a file with ls. The column after the attributes (e.g., drwxr-xr-x), is the number of pointers to a file. Anything more than 1 means that there are several pointers.

system(paste0("ls -l", getOption("inputPaths")))

11.5 Stages of prepInputs

This function has three stages, using three internal functions:

  • reproducible::preProcess which is getting data (download, extracting from archives, checksumming)

  • reproducible::process loads the data into R (note: user can’t access this directly)

  • reproducible::postProcessTo is used with Spatial* and Raster* objects, and crops, reprojects, mask/intersects.

preProcess will be run, if a user supplies url or archive.

process will be run if fun is not NA. It will try to guess fun based on file type if fun is left at the default of NULL.

postProcessTo will be triggered if the user passes any its aruments (to, maskTo, cropTo) studyArea or rasterToMatch are supplied. Other arguments like method can also be to inner functions (e.g. terra::project) to control spatial operations. See ... in ?postProcessTo.

11.6 With GIS

The data we download may cover a larger area than our study needs. For example, there may be a provincial or national dataset, but we are only using a part of that. If we create a studyArea, we can use this as part of prepInputs, passing it to to or one of the other *to functions.

11.6.1 Vector data

The following will get the national ecozone map, project, crop and mask it to the SK dataset.

Code
SK <- SpaDES.project::setupStudyArea(list(level = 2, NAME_1 = "Saskatchewan"))
shpEcozoneSK <- prepInputs(
  url = "http://sis.agr.gc.ca/cansis/nsdb/ecostrat/zone/ecozone_shp.zip",
  to = SK, destinationPath = "~/SpaDES_book/data/prepInputs"
)
terra::plot(shpEcozoneSK[1])

11.6.2 Raster data

The following will get a national DEM layer, crop and mask it to the SK dataset. We leave it in the projection it came in at the original source.

Code
urlTif1 <- "https://geodata.ucdavis.edu/geodata/elv/CAN_elv_msk.zip"
# Turns out the previous steps (original data or postProccessed) have an error
try(DEMSK <- prepInputs(
  url = urlTif1,
  maskTo = shpEcozoneSK, 
  cropTo = shpEcozoneSK,
  destinationPath = "~/SpaDES_book/data/prepInputs"
))

shpEcozoneSK <- fixErrorsIn(shpEcozoneSK)
# Rerun
DEMSK <- prepInputs(
  url = urlTif1,
  cropTo = shpEcozoneSK,
  maskTo = shpEcozoneSK, 
  destinationPath = "~/SpaDES_book/data/prepInputs"
)

terra::plot(DEMSK)
terra::plot(shpEcozoneSK[1], add = TRUE, col = "transparent")

Of course, we may do many other things, including change the resolution of the downloaded object, change the projection, and resolution. prepInputs uses a template raster to know what to project, crop, and mask to

Code
SK <- SpaDES.project::setupStudyArea(list(level = 2, NAME_1 = "Saskatchewan", epsg = 5070))
SKras <- terra::rast(terra::ext(SK), res = 250, val = 1)
terra::crs(SKras) <- terra::crs(SK)

DEMSK250m <- prepInputs(
  url = urlTif1,
  to = SKras,                   # <-   template raster
  destinationPath = "~/SpaDES_book/data/prepInputs"
)
terra::plot(DEMSK250m)

So, we can easily put these prepInputs calls in our workflows, in any project, with minimal new effort each time.

11.7 Caching

prepInputs is friendly with Cache and, in addition to the local stashing of a file, caches some internal operations (e.g. prepProcess). GIS operations, however, are not internally cached. Instead the user should use |> Cache() to enable having a workflow that doesn’t need to be broken.

Code
DEMSK250m <- prepInputs(
  url = urlTif1,
  to = SKras, 
  destinationPath = "~/SpaDES_book/data/prepInputs"
) |> 
  Cache()

# run later... very fast
DEMSK250m <- prepInputs(
  url = urlTif1,
  to = SKras, 
  destinationPath = "~/SpaDES_book/data/prepInputs"
) |> 
  Cache()

Note that changing the default cachePath is advisable to have persistent caches between R sessions. See Section 10.3 in Chapter 10.

11.8 Extending prepInputs with fun

It is impossible for prepInputs to be able to handle all data capturing cases. When customization is needed, a few options are possible.

  1. Make a custom fun argument. Optionally, dlFun can be used to specify how to download the data.
  2. Make a custom prepInputs_XXX function that calls prepInputs internally.

The following example uses a custom fun. These examples are fast, so the benefits of caching won’t be as apparent as for when more complex GIS operations are needed, e.g. reprojecting large raster datasets.

Code
LuxembourgFull <- prepInputs(
 dlFun =
   geodata::gadm(name = "GADM", country = "LUX", level = 0, path = "~/SpaDES_book/prepInputs") # preProcess keeps file from this!
) |> Cache()

LuxembourgInnerBuffer <- prepInputs(
  dlFun =
    geodata::gadm(name = "GADM", country = "LUX", level = 0, path = "~/SpaDES_book/prepInputs") # preProcess keeps file from this!
  ,
  fun = {                      # Custom function
    out <- readRDS(targetFile)
    out <- terra::vect(out)
    terra::buffer(out, 5000)
  }
) |> Cache()

terra::plot(LuxembourgInnerBuffer)
terra::plot(Luxembourg, add = TRUE)

By rerunning the previous code, the caching will be in effect. This means that the slow, two-or-more-stage processes of downloading, custom operations, and possibly GIS using `to functions can all be Cached in a single step, resulting in the smallest possible output being saved, i.e., intermediate large objects have only limited “saving”.

NOTE: Rmarkdown and Quarto both have a form of caching of chunks. This can be effective in some circumstances, but it is somewhat limited (as are all forms of Caching – they will hit their limits somewhere). For instance, they do not deal with the pointers of C++ objects in terra.

11.9 See also

There are many other elements of this function that can be effectively used.

reproducible::prepInputs

reproducible::preProcess

reproducible::postProcessTo

11.10 Barebones R script

Code
Require::Require(c("terra", "reproducible"), 
                 repos = c(getOption("predictiveecology.r-universe.dev", getOption("repos"))))
url <- "https://github.com/tati-micheletti/host/raw/master/data/rasterTest.tif"
ras <- prepInputs(url = url, destinationPath = "~/SpaDES_book/data/prepInputs")

terra::plot(ras)

url <- "http://sis.agr.gc.ca/cansis/nsdb/ecostrat/zone/ecozone_shp.zip"
shpEcozone <- prepInputs(url = url, destinationPath = "~/SpaDES_book/data/prepInputs")
terra::plot(shpEcozone[1])

options(reproducible.inputPaths = "~/data")
url <- "http://sis.agr.gc.ca/cansis/nsdb/ecostrat/zone/ecozone_shp.zip"
shpEcozone <- prepInputs(url = url, destinationPath = "~/SpaDES_book/data/prepInputs")
terra::plot(shpEcozone[1])

SK <- SpaDES.project::setupStudyArea(list(level = 2, NAME_1 = "Saskatchewan"))
shpEcozoneSK <- prepInputs(
  url = "http://sis.agr.gc.ca/cansis/nsdb/ecostrat/zone/ecozone_shp.zip",
  to = SK, destinationPath = "~/SpaDES_book/data/prepInputs"
)
terra::plot(shpEcozoneSK[1])

urlTif1 <- "https://geodata.ucdavis.edu/geodata/elv/CAN_elv_msk.zip"
# Turns out the previous steps (original data or postProccessed) have an error
try(DEMSK <- prepInputs(
  url = urlTif1,
  maskTo = shpEcozoneSK, 
  cropTo = shpEcozoneSK,
  destinationPath = "~/SpaDES_book/data/prepInputs"
))

shpEcozoneSK <- fixErrorsIn(shpEcozoneSK)
# Rerun
DEMSK <- prepInputs(
  url = urlTif1,
  cropTo = shpEcozoneSK,
  maskTo = shpEcozoneSK, 
  destinationPath = "~/SpaDES_book/data/prepInputs"
)

terra::plot(DEMSK)
terra::plot(shpEcozoneSK[1], add = TRUE, col = "transparent")

SK <- SpaDES.project::setupStudyArea(list(level = 2, NAME_1 = "Saskatchewan", epsg = 5070))
SKras <- terra::rast(terra::ext(SK), res = 250, val = 1)
terra::crs(SKras) <- terra::crs(SK)

DEMSK250m <- prepInputs(
  url = urlTif1,
  to = SKras,                   # <-   template raster
  destinationPath = "~/SpaDES_book/data/prepInputs"
)
terra::plot(DEMSK250m)

DEMSK250m <- prepInputs(
  url = urlTif1,
  to = SKras, 
  destinationPath = "~/SpaDES_book/data/prepInputs"
) |> 
  Cache()

# run later... very fast
DEMSK250m <- prepInputs(
  url = urlTif1,
  to = SKras, 
  destinationPath = "~/SpaDES_book/data/prepInputs"
) |> 
  Cache()

LuxembourgFull <- prepInputs(
 dlFun =
   geodata::gadm(name = "GADM", country = "LUX", level = 0, path = "~/SpaDES_book/prepInputs") # preProcess keeps file from this!
) |> Cache()

LuxembourgInnerBuffer <- prepInputs(
  dlFun =
    geodata::gadm(name = "GADM", country = "LUX", level = 0, path = "~/SpaDES_book/prepInputs") # preProcess keeps file from this!
  ,
  fun = {                      # Custom function
    out <- readRDS(targetFile)
    out <- terra::vect(out)
    terra::buffer(out, 5000)
  }
) |> Cache()

terra::plot(LuxembourgInnerBuffer)
terra::plot(Luxembourg, add = TRUE)