################################################################################
####################### Example of the DADA2 nifH pipeline #####################
################################################################################
Copyright (C) 2023 Jonathan D. Magasin
The steps below run the DADA2 nifH pipeline on a small data set comprising
ten Arctic samples from Harding et al. 2018: five from the 0.2-3um size
fraction and five from >3um. Each size fraction is processed completely
independently.

Execute each of steps 1-4 from the command line in your unix shell, which can
be whatever you like (bash, tcsh, etc.). The DADA2 pipeline (step 4) takes 5
minutes to run on our workstation -- for this abridged data set with only ten
samples and just their first 10K read pairs. Full data sets with a few
hundred samples might take several hours.
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
0. One time: Install tools and create a conda environment called DADA2_nifH
   that makes the tools accessible. Described in Installation/INSTALL.txt.
1. Activate the DADA2_nifH conda environment. If your shell prompt starts with
   "(DADA2_nifH)" then the environment is already active.

      conda activate DADA2_nifH
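   If activation fails, you can check whether the environment was created, a
   quick sanity check using standard conda commands:

      conda env list | grep DADA2_nifH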
2. Go to the example directory (the one that contains this file, EXAMPLE.txt).
   Input files for the example are already in this directory. Pipeline outputs
   will be placed here too. [For your own analyses you should work outside of
   the DADA2_nifH_pipeline directory!]

      cd ~/DADA2_nifH_pipeline/Example
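   For your own analyses (not this example) you might instead create a
   separate working directory and run the pipeline from there, e.g. with a
   hypothetical path:

      mkdir -p ~/my_nifH_analysis && cd ~/my_nifH_analysis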
3. Gather the FASTQs into a directory hierarchy rooted at LinksToFastqs. The
   hierarchy is organized according to the processing groups defined in
   fastqMap.tsv. You only need to do this step once (not each time you run the
   pipeline). See the documentation at the top of fastqMap.tsv, as well as the
   help from "organizeFastqs.R --help", which also describes the FASTQ
   filename requirements.

      organizeFastqs.R fastqMap.tsv
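   To confirm the hierarchy was created, you can list the linked FASTQs (the
   subdirectory layout will reflect the processing groups defined in
   fastqMap.tsv):

      find LinksToFastqs -name "*.fastq*" | head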
4. Run the DADA2 pipeline. See the notes in params.example.csv. You can run
   the pipeline multiple times to try different DADA2 parameters. When results
   from early stages (cutadapt and error model creation) are detected, they
   are reused. See the documentation from "run_DADA2_pipeline.sh --help".

      run_DADA2_pipeline.sh params.example.csv > log.txt
   The pipeline is now running and will take ~18 minutes to complete.
   Suggestion: Rather than wait ~18 minutes before you can interact with the
   shell, press control-z. This temporarily suspends the current job (the
   pipeline). After control-z, do the following steps to: see the pipeline
   job [jobs]; resume the pipeline as a background process [bg]; verify that
   the pipeline is running in the background [jobs]; and then look at the
   last 20 lines of the log [tail].

      jobs
      bg
      jobs
      tail -n 20 log.txt
   You can run the tail command whenever you like to see how the pipeline is
   progressing, or jobs to see whether it is still running. In particular,
   you do *not* have to remain logged in to your machine until the job
   completes. For long jobs, I usually log out of our workstation and ssh
   back in every few hours to see how the job is going, using the 'ps'
   command [not 'jobs'] and 'tail' to check for a "Done!" message (or an
   error) at the end of the log.
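   For example, here is one way (standard shell tools, not part of the
   pipeline itself) to launch the run so that it keeps going after you log
   out, and to check on it later:

      nohup run_DADA2_pipeline.sh params.example.csv > log.txt 2>&1 &
      # ...log out, then later ssh back in and check on the job:
      ps aux | grep run_DADA2_pipeline    # Is it still running?
      tail -n 20 log.txt                  # "Done!" or an error at the end?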
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Key outputs:
1. See the log file. It summarizes each stage, flags warnings and errors, and
   points to outputs of interest.
2. A summary of reads that passed cutadapt primer trimming is in:
      Data.nobackup/Data.trimmed/summary.cutadapt.txt
3. To get an idea of how many nifH-like sequences are in your data, see:
      Data.nobackup/Data.NifH_prefilter/summary.NifH_prefilter.txt
   These are the reads that were used to create the error models used by
   DADA2. You might also spot samples that look problematic (e.g. with very
   few nifH-like reads) and decide to drop them.
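   Both summaries are plain text, so you can view them directly, e.g.:

      less Data.nobackup/Data.trimmed/summary.cutadapt.txt
      less Data.nobackup/Data.NifH_prefilter/summary.NifH_prefilter.txt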
4. Inspect the error model plots (.pdf files) in Data.nobackup/ErrorModels
   (separate plots for each size fraction). You will need some guidance from
   the DADA2 web pages to interpret these. If they look terrible, then you
   should try to improve them before working with the ASVs. (Bad error models
   are likely to cause rarer ASVs not to be inferred.)
5. DADA2 output. There is separate output for each size fraction and for each
   date on which you ran the pipeline. For example:
      Data.nobackup/Dada2PipeOutput/Filt_3/Out.2023Jul18/
   within which you should have a look at:
   - The DADA2 log file. It has messages specific to DADA2 (stage 4), unlike
     the top-level log file.
   - TextData: ASV sequences and abundance tables.
   - Plots: This directory includes quality plots of the forward and reverse
     reads, critical for assessing data quality and tuning DADA2 parameters.
   - readCountsDuringProcessing.....png: This plot shows the percentage of
     reads retained at each stage of the pipeline. Use it to spot excessive
     loss of reads that might be fixed by parameter adjustments. E.g. are
     reads dropped due to poor quality, or because they fail to overlap at
     the merge step?
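   Because each output directory name includes the run date, a wildcard is
   handy for locating your results (shown here for the example's Filt_3 size
   fraction):

      ls -d Data.nobackup/Dada2PipeOutput/Filt_3/Out.*/
      ls Data.nobackup/Dada2PipeOutput/Filt_3/Out.*/TextData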
-- END --