Workflow for processing nifH amplicon data sets

This repository contains all post-pipeline software stages and data deliverables described in Morando, Magasin et al. 2024. The workflow was used to process nearly all published nifH amplicon MiSeq data sets that existed at the time of publication, as well as two new data sets produced by the Zehr Lab at UC Santa Cruz. The samples are shown in this map, which links to an interactive Google map with study names, sample IDs, and collection information for each sample.

Workflow overview

The following figure shows the workflow. DADA2 ASVs were created by our DADA2 nifH pipeline (green). Post-pipeline stages (lavender), each executed by a Makefile or Snakefile, were used to gather the ASVs from all studies, filter them for quality, annotate them, and download sample-colocated environmental data from the Simons Collaborative Marine Atlas Project (CMAP). The nifH ASV database generated by the workflow will support future research into N₂-fixing marine microbes. The published database and any updated versions are available in the WorkspaceStartup directory, both as nifH_ASV_database.tgz and as the R image workspace.RData. The published database is also available at https://doi.org/10.6084/m9.figshare.23795943.v1.
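As a minimal sketch of how the archived database could be inspected and unpacked from the shell: the internal layout of nifH_ASV_database.tgz is not described here, so the helper below lists the contents before extracting them. The function name unpack_database and the destination directory in the usage comment are hypothetical, chosen only for illustration.

```shell
# Sketch: list, then extract, a .tgz archive into a chosen directory.
# ASSUMPTION: unpack_database is a hypothetical helper, not part of the workflow.
unpack_database() {
    archive="$1"   # e.g. WorkspaceStartup/nifH_ASV_database.tgz
    dest="$2"      # directory to extract into (created if missing)

    if [ ! -f "$archive" ]; then
        echo "Archive not found: $archive" >&2
        return 1
    fi
    mkdir -p "$dest" || return 1

    # Show what the archive contains before writing anything to disk.
    tar -tzf "$archive" || return 1

    # Extract into the destination directory.
    tar -xzf "$archive" -C "$dest"
}

# Usage, from the top of the repository (destination name is an assumption):
#   unpack_database WorkspaceStartup/nifH_ASV_database.tgz nifH_ASV_database_unpacked
```

The R image, workspace.RData, is loaded from within R rather than from the shell.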

Running the workflow

The workflow requires the DADA2 nifH pipeline as well as its ancillary tools. Please see the Installation directory in the pipeline repository. Then you must install the additional packages required by the post-pipeline stages as described in these Installation instructions.

The DADA2 nifH pipeline outputs for all studies in the nifH ASV database are provided in the Data directory, so you do not need to run the pipeline to recreate the database. However, if you wish to run the pipeline, the parameter files used for each study are included in Data, and you are free to modify them.

Each of the post-pipeline stages 1 through 6 can be run, in order, by entering the associated directory and running "make" from your shell's command line. For example, to run the GatherAsvs stage I would do the following from the command line:

(base) [jmagasin@thalassa]$ conda activate nifH_ASV_workflow
(nifH_ASV_workflow) [jmagasin@thalassa]$ cd GatherAsvs
(nifH_ASV_workflow) [jmagasin@thalassa]$ make &> log.18July2023.txt &

Here I am using a Bash shell (recommended). First I activate the nifH_ASV_workflow environment, a critical step that ensures that all tools and packages needed by the workflow are available. Note how activation changes the prompt to begin with "(nifH_ASV_workflow)" on the second line. On the third line, I run make for the GatherAsvs stage and save the Makefile messages (both standard output and standard error, via &>) to a log file. Most stages take at least a few minutes to complete, so I run them in the background (the trailing &).
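To run several stages back to back, the same steps can be wrapped in a small shell function. The following is only a sketch, not part of the workflow itself: run_stages is a hypothetical helper, and the stage order shown in the usage comment is an assumption that should be checked against the workflow figure and the documentation at the top of each Makefile. Unlike the example above, it runs each stage in the foreground so that a later stage never starts before an earlier one has finished, and it stops at the first failure.

```shell
# Sketch: run "make" in each stage directory, in the order given,
# stopping at the first stage that is missing or fails.
# ASSUMPTION: run_stages is a hypothetical helper, not part of the workflow.
run_stages() {
    for stage in "$@"; do
        if [ ! -d "$stage" ]; then
            echo "Missing stage directory: $stage" >&2
            return 1
        fi

        # Dated log file name, e.g. log.18July2023.txt
        log="log.$(date +%d%B%Y).txt"
        echo "Running stage: $stage (messages in $stage/$log)"

        # The subshell keeps the cd from changing the caller's directory.
        # Both stdout and stderr of make go to the log file.
        if ! ( cd "$stage" && make > "$log" 2>&1 ); then
            echo "Stage failed: $stage (see $stage/$log)" >&2
            return 1
        fi
    done
}

# Usage, from the top of the repository after
# "conda activate nifH_ASV_workflow" (stage order is an assumption):
#   run_stages GatherAsvs FilterAuids AnnotateAuids GatherMetadata CMAP WorkspaceStartup
```

Because each stage depends on the outputs of the earlier ones, stopping at the first failure avoids building later stages from stale or incomplete inputs.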

Please see documentation at the top of each Makefile for an overview of the stage.

