Working with Continuous Workflows

Authors

Eliot McIntire

Ceres Barros

Published

July 5, 2024

As many have found, it is not easy to create a project that runs every step, from start to finish, without failing. Numerous studies (none yet in ecology) have found very low success rates in this area (Janssen, Pritchard, and Lee 2020). Indeed, in only one out of 453 published studies were the authors able to run the code without error (Sharma et al. 2023). Where does that leave us? We face a large challenge.

The reality is that it is not enough to script an analysis. It must be scripted and tested regularly, by multiple actors (computers, servers, people), on multiple systems. By regularly, we mean from start to finish, over and over. In our experience this is impossible without a few tools, because rerunning code from start to finish takes too long, so we don’t.

The most important indicator that a workflow is broken is that there are lines of code that are selectively run or skipped by a user.

Tip 1: Rerun-tolerance

A rerun-tolerant function has 2 features:

every time the function is run with the same inputs, it must return the same thing
time-consuming function calls must be very fast after the first time
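
The two features above can be illustrated with a minimal base R sketch. It is not how the SpaDES ecosystem implements caching; `make_rerun_tolerant()`, `slow_square()` and `fast_square()` are hypothetical names invented for this example, and the cache lives only in memory for the current R session.

```r
# Hypothetical helper: wrap a function so that repeated calls with the same
# arguments reuse the stored result instead of recomputing it.
make_rerun_tolerant <- function(fun) {
  store <- new.env(parent = emptyenv())  # in-memory cache for this wrapper
  function(...) {
    # Build a lookup key from the deparsed argument values.
    # (Simplification: named and positional calls produce different keys.)
    key <- paste(deparse(list(...)), collapse = "")
    if (!exists(key, envir = store, inherits = FALSE)) {
      assign(key, fun(...), envir = store)   # slow path: first call only
    }
    get(key, envir = store, inherits = FALSE) # fast path: every later call
  }
}

n_calls <- 0
slow_square <- function(x) {
  n_calls <<- n_calls + 1  # count how often the real work is done
  Sys.sleep(0.2)           # stand-in for a time-consuming step
  x^2
}

fast_square <- make_rerun_tolerant(slow_square)
first  <- fast_square(4)  # slow: does the work
second <- fast_square(4)  # fast: retrieved from the cache

identical(first, second)  # TRUE: same inputs, same result
n_calls                   # 1: the slow step ran only once
```

Because the second call costs almost nothing, there is no incentive to comment it out or skip it when rerunning the script.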

A rerun-tolerant workflow has a further feature:

rerun-tolerance is used hierarchically, i.e., at many levels of the code: outer functions, middle functions, and inner functions.

If all 3 of these features are in place, it becomes unnecessary to manually break up a workflow, so it can be rerun regularly from start to finish.
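
A minimal base R sketch of the hierarchical idea follows. The memoiser `memo()` and the functions `prep_data()` and `fit_model()` are hypothetical names used only for illustration. The point is that when an outer step must rerun because one of its arguments changed, an unchanged inner step is still retrieved from its own cache.

```r
# Hypothetical in-memory memoiser (a stand-in for a real caching tool).
memo <- function(fun) {
  store <- new.env(parent = emptyenv())
  function(...) {
    key <- paste(deparse(list(...)), collapse = "")
    if (!exists(key, envir = store, inherits = FALSE)) {
      assign(key, fun(...), envir = store)
    }
    get(key, envir = store, inherits = FALSE)
  }
}

n_inner <- 0

# Inner level: an "upstream" data preparation step.
prep_data <- memo(function(n) {
  n_inner <<- n_inner + 1  # count real executions of the inner step
  seq_len(n)
})

# Outer level: a step that depends on the inner one.
fit_model <- memo(function(n, power) {
  sum(prep_data(n)^power)
})

a <- fit_model(10, 2)  # first run: inner and outer both execute
b <- fit_model(10, 2)  # outer cache hit: nothing is recomputed
c3 <- fit_model(10, 3) # outer reruns, but prep_data(10) is an inner cache hit

n_inner  # 1: the inner step ran only once across all three calls
```

Caching only the outer function would force the inner step to rerun whenever any outer argument changes; caching only the inner function would leave the outer step slow on every rerun.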

The key to getting our code into shape is to use approaches that satisfy Tip 1. We will introduce 3 important tools to achieve this:

Caching – This is a general tool that many people will be somewhat familiar with.
reproducible::prepInputs – a “do-it-all” function for getting data into shape.
Debugging – for general R and specific SpaDES situations.

SpaDES and Continuous Workflows

SpaDES allows us to implement feature 3 (hierarchical caching) of Tip 1 because we can Cache at many levels: functions, events, modules, meta-modules, or entire projects. As we are developing inside a project, we can be working on a step inside a module that relies on previous “upstream” modules, but those modules can run very fast, so we can quickly get to where we need to be without breaking the workflow.

Thinking about this situation, we can now re-visit the idea of “Module Types” (introduced in 8.4 Module types) in terms of how they might depend on each other and certain types of inputs, and the types of outputs they produce.

Module Types (Revisited)

SpaDES doesn’t have explicit module types. Rather, by convention, we associate different modules with the generic things they do. A fairly comprehensive list of the module types that we create is:

Data Munging – uses prepInputs to produce clean, usable data for a project; its metadata mostly specifies createsOutputs
Parameterization – takes data from, e.g., a Data Munging module and applies statistics to them. This would be recognizable as a complete ecological study for publication.
Calibration – similar to Parameterization but uses repeated simulation to estimate parameters.
Study Area specifics – creates outputs that are very idiosyncratic to a study area
Static Forecasting – creates outputs at different times that it does not itself use as inputs
Dynamic Forecasting – creates outputs at different times that are also its own inputs
Validation – takes many repeated inputs, often many replicates
Visualization – takes inputs from one or many replicated module runs and builds rich visuals
Post-Processing – takes inputs from one or many replicated module runs and builds data analytics. These may be similar to Validation

Thus, the differences are about how the inputObjects and outputObjects are specified and designed.
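
As one illustration of such a difference, the contrast between Static and Dynamic Forecasting in the list above can be sketched in plain R, without any SpaDES machinery. The function names, the growth rule and the `rate` parameter are assumptions made up for this example.

```r
# Static forecasting: each output depends only on fixed inputs and the time,
# never on the function's own earlier outputs.
static_forecast <- function(baseline, times, rate = 0.1) {
  vapply(times, function(t) baseline * (1 + rate * t), numeric(1))
}

# Dynamic forecasting: the output at one time step is the input to the next.
dynamic_forecast <- function(initial, times, rate = 0.1) {
  state <- initial
  out <- numeric(length(times))
  for (i in seq_along(times)) {
    state <- state * (1 + rate)  # consumes its own previous output
    out[i] <- state
  }
  out
}

static_result  <- static_forecast(100, 1:3)   # linear in time
dynamic_result <- dynamic_forecast(100, 1:3)  # compounds from step to step
```

In module terms, the static version would list `baseline` only among its inputs, while the dynamic version would list its state among both its inputs and its outputs.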