# lm fits a linear model
lm(biomass ~ height)
Ceres Barros
Eliot McIntire
July 5, 2024
Developing repeatable, reproducible, reusable and transparent (\(R^3T\)) workflows in R (and probably across many other languages) requires not only automating workflow steps/processes, which we can achieve with SpaDES, but also being attentive to the way we code and set up the actual (SpaDES) workflow.
Here are some general and workflow best practices that we recommend to SpaDES users and R programmers in general.
Script, script, script – this means script everything you do, including package installation.
“DOComment” your code – use comments that describe the intention of the code, not necessarily a code translation into human language.
Minimize the number of software/languages used – the more jumps between languages/software, the less robust and reproducible the workflow is. Remember: other users will need to install the same software and be able to perform the same jumps/operations that you did without you looking over their shoulder.
Modularise and “functionise” – avoid loooooong scripts, which are harder to read, learn and debug. Instead bundle your code into modules and functions.
SpaDES helps the modularisation part a great deal but it is still up to the developer to create functions and to decide what workflow components can/should be broken into separate modules or events.
When using embedded module and event caching, think about what workflow steps can be cached to speed up development and re-runs and consider turning these into separate modules or events.
Consider “packaging” your functions so they can be used across projects and SpaDES modules.
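As a minimal sketch of "functionising", the lm(biomass ~ height) call could be wrapped in a small function. The function name fitBiomassModel and the toy data are hypothetical and used for illustration only.

```r
# Hypothetical helper: fit biomass as a linear function of height.
# Bundling the step in a function makes it reusable and easier to test.
fitBiomassModel <- function(data) {
  # Fail early if the input is not what the function expects
  stopifnot(is.data.frame(data), all(c("biomass", "height") %in% names(data)))
  lm(biomass ~ height, data = data)
}

# Toy data (made up for illustration)
trees <- data.frame(height = c(2, 4, 6, 8), biomass = c(5, 9, 13, 17))
fit <- fitBiomassModel(trees)
coef(fit)
```

A function like this can later be moved into a package or a SpaDES module without changing how it is called.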
Centralize workflow in a single script – this is quite intuitive when using SpaDES, as usually there is a “controller” script that sets up and runs the workflow.
Use project-oriented and self-contained workflows – pick a structure that is self-explanatory and make sure that the project can be “moved” between users/machines without changing the code.
Don’t use absolute paths – use relative paths (relative to project directory) – RStudio Projects and SpaDES.project do this automatically.
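The idea of relative paths can be sketched in base R as follows. The folder and file names ("inputs", "biomass.csv") are hypothetical, and the sketch assumes the working directory is the project directory, as is the case when an RStudio Project is opened.

```r
# Assumption: the working directory is the project directory
projectDir <- getwd()

# Relative path: stays valid when the project is moved to another machine
inputFile <- file.path("inputs", "biomass.csv")

# Resolve against the project directory only when a full path is needed
fullPath <- file.path(projectDir, inputFile)
```

Because only projectDir depends on the machine, the rest of the code does not change when the project folder is copied elsewhere.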
Version control – great for collaboration, great for tracking changes, great for recovering old versions of a project/code.
Test (!) – test your code and your workflow if you want them to endure and persist through time and users.
Limit the number of objects in .GlobalEnv – instead, put them in functions or inside a module. When objects are in modules, you can add default object values (e.g. via .inputObjects()), or save them into files that are loaded by setupProject or simInit. Alternatively, you can pass objects, or the code to make them, directly to setupProject() as ... arguments, or to the simInit objects argument. See 6 Providing Module Inputs and, for an example with setupProject, 14 Forest Landscape Modelling with LandR and setupProject.
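A plain R analogue of this advice, independent of SpaDES, is to pass everything a function needs as an argument with a default value rather than relying on an object that happens to exist in .GlobalEnv. The names projectBiomass and growthRate are hypothetical.

```r
# Less robust: the function silently depends on an object in .GlobalEnv
# growthRate <- 0.1
# projectBiomass <- function(biomass) biomass * (1 + growthRate)

# More robust: the function receives what it needs, with a default value
projectBiomass <- function(biomass, growthRate = 0.1) {
  biomass * (1 + growthRate)
}

projectBiomass(100)                     # uses the default growth rate
projectBiomass(100, growthRate = 0.2)   # explicit override
```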
Objects left in .GlobalEnv may be used by functions when they shouldn’t be (functions should use them from the simList). SpaDES.project::setupProject and simInit deal with these issues by loading objects directly into a separate environment (the simList).
Don’t install packages “manually” and “separately” – script package installation in a reproducible way, i.e. one that includes reference to package versions. Even better, let setupProject deal with package installation.
setupProject was designed to deal with this issue using the Require package.
Don’t pre-load packages – let simInit do it when the simulation is being initialized and only after it has asserted that packages are available and in the correct versions.
setupProject (and simInit) will fail if they attempt to update an already-loaded package to respect module requirements.
Avoid installing packages into a cloud-synced/managed folder – files and folders managed by cloud-storage software like OneDrive, Google Drive and Dropbox are constantly being screened by this software, which can lead to conflicts during package installation (e.g., the software holds on to certain files and R is unable to delete and re-install them). If package installation is scripted, there is no need to sync it with a server – you can recover it at any time.
Avoid pointing the cache to a cloud-synced/managed folder – for the same reasons as in the previous point.
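The advice above on scripting package requirements with explicit versions can be illustrated with a base R check. This is only a sketch of the idea and not how setupProject or Require work internally; the function checkVersions and the minimum versions listed are hypothetical.

```r
# Hypothetical minimum versions required by a workflow
required <- c(stats = "3.0.0", utils = "3.0.0")

# Check that each package is installed and recent enough,
# without loading or attaching it
checkVersions <- function(required) {
  vapply(names(required), function(pkg) {
    installed <- nzchar(system.file(package = pkg))
    installed && packageVersion(pkg) >= required[[pkg]]
  }, logical(1))
}

ok <- checkVersions(required)
if (!all(ok)) {
  stop("Missing or outdated packages: ", paste(names(ok)[!ok], collapse = ", "))
}
```

Keeping a check like this in the controller script documents the versions the workflow expects and fails early on a machine that does not meet them.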
EFI’s reproducible forecasting workflows
The Practice of Reproducible Research
Advanced R - Style guide; by Hadley Wickham
Paper by Sarma et al. (2016) on code testing
Assertions and unit tests are smaller and generally test specific operations within a workflow (e.g. if (1 + x != 2) stop("there is an error")) or things like object integrity (e.g. if (!is(x, "data.frame")) stop("there is an error")). Integration tests are larger in the sense that they test many components of a workflow and whether they work well together (they can also test an entire workflow).
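The difference between these kinds of tests can be sketched in base R. The functions cleanData and meanBiomass are hypothetical workflow steps made up for this illustration.

```r
# Hypothetical workflow steps
cleanData <- function(x) x[!is.na(x$biomass), , drop = FALSE]

meanBiomass <- function(x) {
  # Assertion: check object integrity before using the object
  if (!inherits(x, "data.frame")) stop("x must be a data.frame")
  mean(x$biomass)
}

# Unit-style test: one specific operation
stopifnot(meanBiomass(data.frame(biomass = c(1, 3))) == 2)

# Integration-style test: several steps working together
raw <- data.frame(biomass = c(2, NA, 4))
result <- meanBiomass(cleanData(raw))
stopifnot(nrow(cleanData(raw)) == 2, result == 3)
```

In a real project these checks would typically live in dedicated test files rather than in the workflow script itself.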