This repository contains a Nextflow workflow used to orchestrate state- and county-level runs on AWS Batch and the Yale Center for Research Computing clusters. The workflow takes care of the following steps:
- Cleaning data for all counties and states, using scripts from `covidestim-sources`.
- Performing a model run for each county or state on that input data, using the `covidestim` package.
- Aggregating these results.
- Making the results available to the public by inserting them into our database, and by generating static files used on our public website, using scripts from `webworker`.
Nextflow:
- Manages and executes the Singularity (on YCRC) and Docker (local execution / AWS) containers that are used to run different scripts
- Passes data between different Nextflow processes
- Produces logs of what happened
- In certain situations, reruns models when they time out or fail
Broadly speaking, a typical workflow execution produces state or county estimates, and optionally makes them publicly available.
Notes:
- Broadcast and aggregation steps are generally handled through Nextflow channel operators; see `main.nf`.
- CLI options and profiles control various behaviors of each process.
Install Nextflow from nextflow.io. You need at least version 21.09.0-edge because we use the Secrets feature. Nextflow will take care of downloading all containers from Docker Hub the first time you run this pipeline. Next, take note of the following configuration options. You'll want to use different configuration options depending on whether you're developing locally, testing on the cluster, or deploying production code.
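For example, a typical installation and version check looks like this (assuming `curl` and a recent Java runtime are available; see nextflow.io for other installation methods):

```bash
# Download the Nextflow launcher into the current directory
curl -s https://get.nextflow.io | bash

# Confirm the version; it must be at least 21.09.0-edge for Secrets support
./nextflow -version
```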
There are four different configurations of the API:
- No API
- Local API + optional dbstan integration
- Test API + optional dbstan integration
- Production API + optional dbstan integration
Most configurations require setting Nextflow Secrets. Be sure to read the Secrets documentation before proceeding.
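For example, to define the `COVIDESTIM_JWT` secret referenced in the configurations below (a sketch; depending on your Nextflow version the subcommand may be `set` or `put`, so check the Secrets documentation):

```bash
# Secrets must be enabled in the environment that runs the pipeline
export NXF_ENABLE_SECRETS=true

# Register the JWT under the name the workflow expects, then verify it is defined
nextflow secrets set COVIDESTIM_JWT "<your-jwt-here>"
nextflow secrets list
```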
This runs the pipeline without persisting any data to a database, which is useful for model testing.
| option | value |
|---|---|
| `--insertApi` | `false` |
You can easily spin up a local database and API by cloning covidestim/db and running `docker-compose up` from the root of that repository. This will give you a database and API that matches our production schema.
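A minimal sketch of that setup (the clone URL is inferred from the repository name; adjust if your remote differs):

```bash
# Clone the database/API repository and start the local database + API stack
git clone https://github.com/covidestim/db.git
cd db
docker-compose up
```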
You must generate a JWT for your local API server so that you can define the mandatory Nextflow Secret `COVIDESTIM_JWT`. Follow steps 2 and 3 of this PostgREST tutorial.
| option | value |
|---|---|
| `-profile` | `api_local` |
| Nextflow Secret `COVIDESTIM_JWT` | defined |
We run a test API server at https://api2-test.covidestim.org. Its schema is the same as the production database, but there is less model data stored in it. You'll need to generate or be provided a JWT token to insert runs via this API.
| option | value |
|---|---|
| `-profile` | `api_test` |
| Nextflow Secret `COVIDESTIM_JWT` | defined |
The production API is https://api2.covidestim.org. As with the test API, a JWT must be generated or provided.
| option | value |
|---|---|
| `-profile` | `api_prod` |
| Nextflow Secret `COVIDESTIM_JWT` | defined |
Enabling dbstan integration
Keep in mind that `dbstan` inserts take up a lot of space in the database, and we don't yet automatically delete old dbstan runs. Enabling the dbstan integration is not necessary to run the pipeline for test or production.
| option | value |
|---|---|
| `-profile` | `dbstan_enable` |
| Nextflow Secret `COVIDESTIM_DBSTAN_HOST` | defined; must be the hostname of a Postgres server listening on port 5432 |
| Nextflow Secret `COVIDESTIM_DBSTAN_DBNAME` | defined; the Postgres database name |
| Nextflow Secret `COVIDESTIM_DBSTAN_USER` | defined; the Postgres user, see here (`dbstan_writer`) |
| Nextflow Secret `COVIDESTIM_DBSTAN_PASS` | defined; the Postgres user's password |
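As a sketch, the four secrets could be registered like this (placeholder values shown; as above, older Nextflow versions may use `put` instead of `set`):

```bash
nextflow secrets set COVIDESTIM_DBSTAN_HOST   "db.example.org"  # hostname of a Postgres server listening on 5432
nextflow secrets set COVIDESTIM_DBSTAN_DBNAME "dbstan"          # Postgres database name (placeholder)
nextflow secrets set COVIDESTIM_DBSTAN_USER   "dbstan_writer"   # Postgres user
nextflow secrets set COVIDESTIM_DBSTAN_PASS   "<password>"      # that user's password
```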
There are parameters defined in `main.nf` which can be configured at runtime on the CLI (`nextflow run . --arg1 val1 --arg2 val2 ...`). The available CLI options are listed below; a combined example invocation follows the list.

`-profile`

- Specify `states` or `counties`
- Specify `local` or `slurm`
- See the Configuration section above for API- and dbstan-specific profiles.

Optional Flags

- `--branch <tag>`: Use the Docker Hub container with tag = `<tag>` when running the model. Note that the GitHub branch `master` will exist as the Docker Hub container with tag = `latest`.
- `--key <state|fips>`: Key geographies by state name (`state`) or county FIPS code (`fips`); see the example scripts below.
- `--date YYYY-MM-DD`: Sets the nominal "run date" for the run. This does not need to be today's date; for example, if you are generating historical results, you would set `--date` to a date other than today's date.
- `--inputUrl <url>`: Bypasses the usual data-cleaning process and instead passes premade input data to the instances of the model. `<url>` must point to a `.tar.gz` file containing `data.csv`, `metadata.json`, and `rejects.csv`. These files must have the same schema that would be produced by the normal data-cleaning process, but need not contain all geographies. Hint: an easy way to create these three files is to take the output of `jhuStateVaxData` or `combinedVaxData`, modify it to suit your needs, then run `tar -czf custom-inputs.tar.gz data.csv metadata.json rejects.csv`.
- `--ngroups`: When you have more geographies to model than you have hourly submissions to the SLURM scheduler, set this to batch geographies together into processes that each contain multiple geographies.
- `--raw`: Save all `covidestim-result` objects to `.RDS` files, using the name of each geography as the filename, or the group id if `--ngroups` is used. This can take up a lot of space. These objects are archival objects created by the Covidestim R package.
- `--splicedate`: Deprecated.
- `--testtracts`: Deprecated.
- `--alwaysoptimize`: Always use BFGS.
- `--alwayssample`: Always sample, never fall back to BFGS.
- `--n <number>`: Run the first `n` counties or states (in no particular order). Useful for testing.
- `--s3pub`: Publish results to AWS S3. The AWS CLI must be available on the Nextflow host system, and must be configured with the permissions needed to copy files to the destination bucket.
- `-stub`: Use the stub methods for input data generation. This will use premade data from `covidestim-sources/example-output`, and is much faster than invoking `make` to create all input data from scratch. Useful for testing.
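For example, a quick local smoke test might combine several of these flags. This is an illustrative invocation, not one of the provided scripts:

```bash
#!/usr/bin/env bash
export NXF_ENABLE_SECRETS=true

# Stubbed input data, first two states only, no database inserts
nextflow run . \
  --key state -profile "states,local" \
  -stub --n 2 \
  --insertApi false \
  --outdir smoke-test-$(date +%Y-%m-%d) \
  --date $(date +%Y-%m-%d)
```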
County production run, local
Run the county pipeline locally, inserting the results into the database and uploading static files to S3. Available in the repository as `scripts/runLocal-counties-prod.sh`.
```bash
#!/usr/bin/env bash
export NXF_ENABLE_SECRETS=true

# NOTE: Execute this from the repository root.
branch="latest"
key=fips
date=$(date +%Y-%m-%d)

nextflow run . \
  --key $key -profile "counties,local_prod,api_prod" \
  --branch $branch \
  --outdir $date \
  --date $date
```
State test run, local
Run the state pipeline locally with a custom branch, without publishing or inserting results.
```bash
#!/usr/bin/env bash
export NXF_ENABLE_SECRETS=true

# NOTE: Execute this from the repository root.
branch="my-custom-branch-tag"
key=state
date=$(date +%Y-%m-%d)

nextflow run . \
  --key $key -profile "states,local" \
  --branch $branch \
  --outdir test-state-$date \
  --date $date
```
How do I change the number of attempts that will be made to successfully run a model, as well as the length of each attempt?
Change `nextflow.config` by modifying the `states` and `counties` profiles.
How do I change how Stan is invoked?
See covidestim-batch for available CLI options. To change functionality that is not modifiable via this CLI, modify `covidestim-batch` yourself and rebuild a local `webworker` container so that Nextflow executes your modified script.
How do I pass made-up case/death data to the model?
Use the `--inputUrl` CLI flag (see Optional Flags). Alternatively, use the model outside the Nextflow workflow, which may be easier in some circumstances, like testing new kinds of input data.
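A hypothetical end-to-end sketch, assuming the archive is served from a local static file server (any URL reachable from the Nextflow host works, such as an S3 object URL):

```bash
# Package the modified inputs (same schema as the normal cleaning output)
tar -czf custom-inputs.tar.gz data.csv metadata.json rejects.csv

# Serve the archive locally (hypothetical; any web server will do)
python3 -m http.server 8000 &

# Point the pipeline at the premade inputs, skipping the cleaning steps
nextflow run . \
  --key state -profile "states,local" \
  --inputUrl http://localhost:8000/custom-inputs.tar.gz \
  --outdir custom-inputs-test \
  --date $(date +%Y-%m-%d)
```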
How do I run the workflow using an updated or different version of the model, or a new version of the `webworker` container?
For running a custom model version, pass the `--branch` flag when issuing the `nextflow run` command in the terminal. `<branch>` must be the name of a tag which exists on Docker Hub. If no container (tag) exists for that branch or tag, you'll need to push a new branch or tag to the GitHub remote at `covidestim/covidestim`, and then set up a rule in Docker Hub so that it auto-builds that branch or tag. You need special privileges to set these rules. You can also build a container locally and push it to Docker Hub using `docker push`; see the sketch below.
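For the local-build route, a sketch might look like this (the checkout path and tag name are placeholders; pushing requires write access to the `covidestim` organization on Docker Hub):

```bash
# Build the model container from a local checkout and push it under a custom tag
docker build -t covidestim/covidestim:my-custom-tag /path/to/covidestim-checkout
docker push covidestim/covidestim:my-custom-tag

# The tag can then be selected at runtime:
# nextflow run . --branch my-custom-tag ...
```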
For running a custom webworker container that doesn't have the `latest` tag on Docker Hub, you'll need to modify the `container` directives in all process definitions that list `covidestim/webworker:latest` as their container. To find which processes do this, run `find main.nf src/ -type f | xargs grep --color container` from the repository root.
Important: For running a new model that is now the `HEAD` of the branch currently being used, you need to ensure that three things have happened, lest Nextflow run your old model version instead:

- The new commit must have been successfully pushed to `covidestim/covidestim`.
- It must have been built and tagged on Docker Hub (either autobuilt, or pushed to Docker Hub).
- The local registry must have the container. Locally, just run `docker pull covidestim/covidestim:TAG`. On the cluster, run `rm -rf work/singularity`, which forces Singularity to rebuild the Docker Hub-sourced container the next time Nextflow executes.
The file structure is as follows:

- `main.nf`: The workflow
- `src/*.nf`: All of the Nextflow "processes"
- `nextflow.config`: Nextflow configuration for different execution environments (YCRC/AWS Batch) and levels of geography (counties/states)
- `scripts/`: Example bash scripts to run the workflow in different ways