Preparing Your Data

alexandriai168 edited this page Aug 14, 2023 · 11 revisions

This pipeline requires certain file naming conventions and information from the primer_data.csv file.

primer_data.csv

Many of the scripts in this pipeline read from a file named primer_data.csv. This file contains information about your primers and the parameters dada2 uses to infer ASVs. If you're using primers other than Dloop, Mifish, or C16, you will need to fill out this information for your primer.

Sample Primer Data:

name locus_shorthand seq_f primer_length_f seq_r primer_length_r F_qual R_qual tapestation_amplicon_length_F tapestation_amplicon_length_R max_amplicon_length min_amplicon_length max_trim overlap db_name known_hashes_name
MFU MFU GCCGGTAAAACTCGTGCCAGC 21 CATAGTGGGGTATCTAATCCCAGTTTG 27 30 30 222 222 185 163 100 20 MURI_MFU.fasta MFU_prev_hashes.csv
DL DL TCACCCAAAGCTGRARTTCTA 21 GCGGGTTGCTGGTTTCACG 19 29 20 429 429 475 200 450 12 Cetacean_Dloop_Baker_NWFSC.fasta DL_prev_hashes.csv
C16 C16 GACGAGAAGACCCTAWTGAGCT 22 AAATTACGCTGTTATCCCT 19 NA NA 249 249 320 NA NA NA ceph_C16_sanger.fasta C16_prev_hashes.csv

Guide to Primer Data Fields:

name long primer name
locus_shorthand short primer name
seq_f forward primer sequence
primer_length_f length of the forward primer
seq_r reverse primer sequence
primer_length_r length of the reverse primer
F_qual quality to trim to when determining truncLen (dada2) for forward sequences
R_qual quality to trim to when determining truncLen (dada2) for reverse sequences
tapestation_amplicon_length_F amplicon length of forward reads before sequencing (with primers, adapters, barcodes, etc.)
tapestation_amplicon_length_R amplicon length of reverse reads before sequencing (with primers, adapters, barcodes, etc.)
max_amplicon_length maximum amplicon length to keep when filtering by length after merging forward and reverse reads
min_amplicon_length minimum amplicon length to keep when filtering by length after merging forward and reverse reads
max_trim maximum quality to trim to for truncLen (dada2)
overlap minimum overlap when merging forward and reverse reads (minOverlap)
db_name name of taxonomy database to use for the primer
known_hashes_name name of database of known hashes to use for the primer
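
When adding a new primer row, a quick consistency check can catch typos before the pipeline runs. The following is a minimal sketch, not part of the pipeline: it assumes primer_data.csv is comma-delimited with the column names listed above, and the helper names (`to_int`, `check_primer_rows`) are hypothetical. It checks that every column is present, that each stated primer length matches the length of its sequence, and that min_amplicon_length does not exceed max_amplicon_length. Cells containing NA are treated as missing.

```python
import csv
import io

# Columns listed in the field guide above.
REQUIRED_COLUMNS = [
    "name", "locus_shorthand", "seq_f", "primer_length_f",
    "seq_r", "primer_length_r", "F_qual", "R_qual",
    "tapestation_amplicon_length_F", "tapestation_amplicon_length_R",
    "max_amplicon_length", "min_amplicon_length", "max_trim",
    "overlap", "db_name", "known_hashes_name",
]


def to_int(value):
    """Return an int, or None when the cell is empty or 'NA'."""
    value = value.strip()
    return None if value in ("", "NA") else int(value)


def check_primer_rows(handle):
    """Return a list of human-readable problems found in a primer table."""
    reader = csv.DictReader(handle)
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return ["missing columns: " + ", ".join(missing)]

    problems = []
    for row in reader:
        label = row["locus_shorthand"]
        # Stated primer lengths should match the actual sequence lengths.
        for side in ("f", "r"):
            stated = to_int(row["primer_length_" + side])
            actual = len(row["seq_" + side].strip())
            if stated != actual:
                problems.append(
                    f"{label}: primer_length_{side} is {stated}, "
                    f"but seq_{side} has {actual} bases"
                )
        # Length filter bounds must be ordered when both are given.
        low = to_int(row["min_amplicon_length"])
        high = to_int(row["max_amplicon_length"])
        if low is not None and high is not None and low > high:
            problems.append(
                f"{label}: min_amplicon_length exceeds max_amplicon_length"
            )
    return problems


# Assumed comma-delimited layout, using the DL row from the sample above.
SAMPLE = (
    ",".join(REQUIRED_COLUMNS) + "\n"
    "DL,DL,TCACCCAAAGCTGRARTTCTA,21,GCGGGTTGCTGGTTTCACG,19,29,20,"
    "429,429,475,200,450,12,Cetacean_Dloop_Baker_NWFSC.fasta,DL_prev_hashes.csv\n"
)

if __name__ == "__main__":
    print(check_primer_rows(io.StringIO(SAMPLE)) or "no problems found")
```

To check a real file, pass an open file handle instead of the in-memory sample, for example `check_primer_rows(open("primer_data.csv", newline=""))`.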

Sample Naming Conventions

The cutadapt portion of the metabarcoding pipeline (primer_trimming.sh) is dependent on the naming of each sample. Below is an example of how the fastq files must be named.

photo of sample naming scheme

Primer Primer used in sample. Must match primer_data.csv
SampleID SampleID of sample. Can be formatted differently but must match Hake_2019_metadata.csv to successfully create the .html file.
Dilution Amount of dilution performed on sample due to environmental contamination
Replicate Technical Replicate number
Miseq Sample Data Sample data created during Illumina sequencing. This is appended to the output file names by the Illumina sequencer.
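
Before running primer_trimming.sh, it can help to screen file names for the two parts of the scheme that are easy to verify automatically. The sketch below is an illustration under stated assumptions, not the pipeline's own check: it assumes each file name begins with a primer name from primer_data.csv and ends with the usual MiSeq suffix of the form `_S<number>_L<lane>_R<1 or 2>_001.fastq` (optionally `.gz`). The function name `check_fastq_name` and the example file names are hypothetical, and the layout of SampleID, Dilution, and Replicate is deliberately not checked.

```python
import re

# Suffix a MiSeq run appends to demultiplexed FASTQ files,
# e.g. "_S12_L001_R1_001.fastq.gz".
ILLUMINA_SUFFIX = re.compile(
    r"_S(?P<sample_number>\d+)_L(?P<lane>\d{3})_R(?P<read>[12])_001\.fastq(?:\.gz)?$"
)


def check_fastq_name(filename, known_primers):
    """Return (primer, read) if the name passes both checks, else None.

    Assumption: the file name begins with the primer name. The separator and
    the layout of SampleID, Dilution, and Replicate are not checked here.
    """
    match = ILLUMINA_SUFFIX.search(filename)
    if match is None:
        return None
    # Try longer names first so that one primer name cannot shadow another.
    for primer in sorted(known_primers, key=len, reverse=True):
        if filename.startswith(primer):
            return primer, "R" + match.group("read")
    return None


if __name__ == "__main__":
    primers = ["MFU", "DL", "C16"]  # primer names from the sample table
    # Hypothetical file names, for illustration only.
    for name in ["MFU-example_S1_L001_R1_001.fastq.gz", "example.fastq.gz"]:
        print(name, "->", check_fastq_name(name, primers))
```

A file that returns None either uses a primer missing from primer_data.csv or lacks the sequencer-appended suffix, and is worth inspecting by hand.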