covidestim/dailyFlow

dailyFlow

This repository contains a Nextflow workflow used to orchestrate state- and county-level runs on AWS Batch and the Yale Center for Research Computing clusters. The workflow takes care of the following steps:

  • Cleaning input data for all counties and states, using scripts from covidestim-sources.
  • Performing a model run for each county or state on that input data, using the covidestim package.
  • Aggregating the results.
  • Making the results available to the public by inserting them into our database and by generating the static files used on our public website, using scripts from webworker.

Nextflow:

  • Manages and executes the Singularity (on YCRC) and Docker (local execution / AWS) containers that are used to run different scripts
  • Passes data between different Nextflow processes
  • Produces logs of what happened
  • In certain situations, reruns models when they time out or fail

Workflow diagrams

Broadly speaking, a typical workflow execution produces state or county estimates, and optionally makes them publicly available.

States

[Diagram: State workflow]

Notes:

  • Broadcast and aggregation steps are generally handled through the use of Nextflow channel operators - see main.nf.
  • CLI options and profiles control various behaviors of each process.

Counties

[Diagram: County workflow]

Notes:

  • CLI options and profiles control various behaviors of each process.

Getting started

Install Nextflow from nextflow.io. You need at least version 21.09.0-edge because we use the Secrets feature. Nextflow will download all containers from Docker Hub the first time you run this pipeline. Next, take note of the following configuration options. Which ones you use depends on whether you're developing locally, testing on the cluster, or deploying production code.

Database configuration

There are four different configurations of the API, each described below: No API, Local API, Test API, and Production API.

Most configurations require setting Nextflow Secrets. Be sure to read the Secrets documentation before proceeding.

No API

This runs the pipeline without persisting any data to a database, useful for model testing.

| option | value |
| --- | --- |
| --insertApi | false |

Local API

You can easily spin up a local database and API by cloning covidestim/db and running docker-compose up from the root of that repository. This will give you a database and API that matches our production schema.

You must generate a JWT for your local API server so that you can define the mandatory Nextflow Secret COVIDESTIM_JWT. Follow steps 2 and 3 of this PostgREST tutorial.

| option | value |
| --- | --- |
| -profile | api_local |
| Nextflow Secret COVIDESTIM_JWT | defined |
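
As a rough illustration of what generating such a token involves, the sketch below signs an HS256 JWT using only openssl. It assumes your local API verifies tokens with a shared secret and reads a role claim, as in the PostgREST tutorial. The secret value and the role name example_writer are hypothetical placeholders, not values taken from covidestim/db.

```shell
#!/usr/bin/env bash
set -euo pipefail

# base64url-encode stdin: standard base64, then swap '+/' for '-_' and drop '=' padding.
b64url() { openssl base64 -A | tr '+/' '-_' | tr -d '='; }

# make_jwt SECRET PAYLOAD_JSON
# Prints an HS256-signed JWT carrying the given JSON payload.
make_jwt() {
  local secret=$1 payload=$2
  local header='{"alg":"HS256","typ":"JWT"}'
  local signing_input signature
  signing_input="$(printf '%s' "$header" | b64url).$(printf '%s' "$payload" | b64url)"
  signature=$(printf '%s' "$signing_input" \
    | openssl dgst -sha256 -hmac "$secret" -binary | b64url)
  printf '%s.%s\n' "$signing_input" "$signature"
}

# Hypothetical values: substitute the secret and role configured for your local API.
jwt_secret="replace-with-your-local-jwt-secret"
token=$(make_jwt "$jwt_secret" '{"role":"example_writer"}')
echo "$token"
```

In recent Nextflow releases the resulting token can be stored with nextflow secrets set COVIDESTIM_JWT followed by the token. The secrets subcommand syntax has changed between releases, so check nextflow secrets help for the version you have installed.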

Test API

We run a test API server at https://api2-test.covidestim.org. Its schema is the same as the production database, but it stores less model data. You'll need to generate or be provided a JWT to insert runs via this API.

| option | value |
| --- | --- |
| -profile | api_local |
| Nextflow Secret COVIDESTIM_JWT | defined |

Production API

This is https://api2.covidestim.org. You'll need to generate or be provided a JWT to insert runs via this API.

| option | value |
| --- | --- |
| -profile | api_prod |
| Nextflow Secret COVIDESTIM_JWT | defined |

Enabling dbstan integration

Keep in mind that dbstan inserts take up a lot of space in the database, and we don't yet automatically delete old dbstan runs. Enabling the dbstan integration is not necessary to run the pipeline for test or production.

| option | value |
| --- | --- |
| -profile | dbstan_enable |
| Nextflow Secret COVIDESTIM_DBSTAN_HOST | defined, must be hostname of a Postgres server that is listening on port 5432 |
| Nextflow Secret COVIDESTIM_DBSTAN_DBNAME | defined, Postgres db name |
| Nextflow Secret COVIDESTIM_DBSTAN_USER | defined, Postgres user, see here (dbstan_writer) |
| Nextflow Secret COVIDESTIM_DBSTAN_PASS | defined, Postgres user password |

Runtime options

There are parameters defined in main.nf which can be configured at runtime in the CLI (nextflow run --arg1 val1 --arg2 val2 ...). The available CLI options are:

Required Flags

  • -profile
    • Specify states or counties
    • Specify local or slurm
    • See the Database configuration and Enabling dbstan integration sections above for API/dbstan-specific profiles.
  • --branch <tag>: Use the Docker Hub container with tag = <tag> when running the model. Note that the GitHub branch master will exist as the Docker Hub container with tag = latest.
  • --key <state|fips>: Use state for state-level runs and fips for county-level runs.
  • --date YYYY-MM-DD: Sets the nominal "run date" for the run. Does not necessarily need to be the same as today's date. For example, if you are generating historical results, you would set --date to a date other than today's date.

Optional Flags

  • --inputUrl <url>: This bypasses the usual data-cleaning process, and instead passes premade input data to the instances of the model. <url> must point to a .tar.gz file containing data.csv, metadata.json, and rejects.csv. These files must have the same schema that the normal data-cleaning process would produce, but need not contain all geographies. Hint: An easy way to create these three files is to take the output of jhuStateVaxData or combinedVaxData and modify it to suit your needs, then run tar -czf custom-inputs.tar.gz data.csv metadata.json rejects.csv.
  • --ngroups: When you have more geographies to model than you have hourly submissions to the SLURM scheduler, set this to cause geographies to be batched together into processes that contain multiple geographies.
  • --raw: This will save all covidestim-result objects to .RDS files, using the name of each geography as the filename, or the group id, if --ngroups is used. This can take up a lot of space. These objects are archival objects created by the Covidestim R package.
  • --splicedate: Deprecated.
  • --testtracts: Deprecated.
  • --alwaysoptimize: Always use BFGS.
  • --alwayssample: Always sample, never fall back to BFGS.
  • --n <number>: Run the first n counties or states (in no particular order). Useful for testing.
  • --s3pub: Publish results to AWS S3. The AWS CLI must be available on the Nextflow host system, and must be configured with the permissions necessary to copy files to the destination bucket.
  • -stub: Use the stub methods for input data generation. This will use premade data from covidestim-sources/example-output, and is much faster than invoking make to create all input data from scratch. Useful for testing.
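
Before hosting an archive for --inputUrl, it may help to confirm that it contains the three required files. The sketch below does this with tar. The helper check_inputs_archive is hypothetical, and the empty files created here are stand-ins used only to exercise the check, so they do not follow the real input schema.

```shell
#!/usr/bin/env bash
set -euo pipefail

# check_inputs_archive ARCHIVE
# Succeeds only if ARCHIVE lists data.csv, metadata.json and rejects.csv.
check_inputs_archive() {
  local archive=$1 listing f missing=0
  listing=$(tar -tzf "$archive")
  for f in data.csv metadata.json rejects.csv; do
    # Accept both "name" and "./name" entries, matched as whole fixed strings.
    if ! grep -qxF -e "$f" -e "./$f" <<<"$listing"; then
      echo "missing from $archive: $f" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Demonstration with empty placeholder files (not real model inputs).
workdir=$(mktemp -d)
(
  cd "$workdir"
  : > data.csv
  : > metadata.json
  : > rejects.csv
  tar -czf custom-inputs.tar.gz data.csv metadata.json rejects.csv
)
check_inputs_archive "$workdir/custom-inputs.tar.gz" && echo "archive OK"
```

The check only covers file names. Whether the contents match the schema expected by the model still has to be verified separately.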

Examples

County production run, local

Run the county pipeline locally, inserting the results into the database, and uploading static files to S3. Available in repository as scripts/runLocal-counties-prod.sh.

#!/usr/bin/env bash
export NXF_ENABLE_SECRETS=true

# NOTE: Execute this from the repository root.

branch="latest"
key=fips
date=$(date +%Y-%m-%d)

nextflow run . \
  --key $key -profile "counties,local_prod,api_prod" \
  --branch $branch \
  --outdir $date \
  --date $date 

State test run, local

Run the state pipeline locally with a custom branch, without publishing or inserting results.

#!/usr/bin/env bash
export NXF_ENABLE_SECRETS=true

# NOTE: Execute this from the repository root.

branch="my-custom-branch-tag"
key=state
date=$(date +%Y-%m-%d)

nextflow run . \
  --key $key -profile "states,local" \
  --branch $branch \
  --outdir test-state-$date \
  --date $date 

FAQ

How do I change the number of attempts that will be made to successfully run a model, as well as the length of each attempt?

Change nextflow.config by modifying the states and counties profiles.

How do I change how Stan is invoked?

See covidestim-batch for available CLI options. To change functionality that is not modifiable via this CLI, modify covidestim-batch yourself, and rebuild a local webworker container so that Nextflow executes your modified script.

How do I pass made-up case/death data to the model?

Use the --inputUrl CLI flag (see Optional Flags). Alternatively, use the model outside the Nextflow workflow, which may be easier in some circumstances, like testing new kinds of input data.

How do I run the workflow using an updated or different version of the model? Or a new version of the webworker container?

To run a custom model version, pass the --branch <tag> flag to the nextflow run command. <tag> must be the name of a tag that exists on Docker Hub. If no container (tag) exists for that branch or tag, you'll need to push a new branch or tag to the GitHub remote at covidestim/covidestim, and then set up a rule in Docker Hub so that it auto-builds that branch or tag. You need special privileges to set these rules. Alternatively, you can build a container locally and push it to Docker Hub using docker push.

To run a custom webworker container that doesn't have the latest tag on Docker Hub, you'll need to modify the container directives in all process definitions that list covidestim/webworker:latest as their container. To find which processes do this, run find main.nf src/ -type f | xargs grep --color container from the repository root.

Important: For running a new model that is now the HEAD of the branch currently being used, you need to ensure that three things have happened, lest Nextflow run your old model version instead:

  1. The new commit must have been successfully pushed to covidestim/covidestim.
  2. It must have been built and tagged on Docker Hub (either autobuilt, or pushed to Docker Hub).
  3. The local registry must have the container. Locally, just run docker pull covidestim/covidestim:TAG. On the cluster, run rm -rf work/singularity, which forces Singularity to rebuild the Docker Hub-sourced container the next time Nextflow executes.
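
Step 3 above differs between local and cluster execution, which the following sketch captures. The helper refresh_cmd is a hypothetical name. It only prints the command from step 3 for the chosen environment, so the output can be reviewed before anything is pulled or deleted.

```shell
#!/usr/bin/env bash
set -euo pipefail

# refresh_cmd MODE TAG
# Prints the command that refreshes the cached model container.
#   MODE=local   -> Docker: pull the tag again
#   MODE=cluster -> Singularity: drop the cached images under work/singularity
refresh_cmd() {
  local mode=$1 tag=$2
  case $mode in
    local)   echo "docker pull covidestim/covidestim:$tag" ;;
    cluster) echo "rm -rf work/singularity" ;;
    *)       echo "unknown mode: $mode (expected local or cluster)" >&2; return 1 ;;
  esac
}

refresh_cmd local latest
refresh_cmd cluster latest
```

On the cluster, run the printed rm command from the directory where Nextflow was launched, since work/singularity is a relative path.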

File structure/Processes

The file structure is as follows:

  • main.nf: The workflow
  • src/*.nf: All of the Nextflow "processes"
  • nextflow.config: Nextflow configuration for different execution environments (YCRC/AWS Batch) and levels of geography (counties/states)
  • scripts/: Example bash scripts to run the workflow in different ways.
