Misc Genomics Scripts

Requirements

Python 2.7 with Biopython and the full SciPy stack

FastA manipulation and statistics

calc.ATCG.content.bash: a unix way to quickly get A,T,C,G,N,- content from a FastA file; handles linewraps and multiple records
filter.contigs.py: cleans up a de novo assembly from SPAdes, Velvet, or IDBA (requires Biopython). IDBA includes space-delimited data in its contig headers, and because SeqIO splits deflines on whitespace, these spaces will need to be removed or replaced (e.g., sed -i 's/ /|/g' assembly.fna). SPAdes and Velvet lack whitespace in their contig deflines, so their output files can be fed directly into this filtering script.

Example batch usage: If there are many SPAdes assembly output directories beginning with 3009 that need filtering, first cd into the parent dir containing all of the 3009* dirs and execute:

 for F in 3009*/contigs.fasta; do B=$(dirname $F | awk -F \/ '{print $1}'); filter.contigs.py -i $F -g -m -c 5 -l 250 -o "$B".fna; done

This took me 1 min for 340 assemblies.

find.dupes.bash: given a FastA file, identifies repetitive regions, and outputs a BED file; flexible opts for defining repetitive sites; depends on BEDTools and MUMmer (nucmer)
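
As a rough illustration of the kind of tally calc.ATCG.content.bash reports, the sketch below counts A, T, C, G, N, and - across every record of a line-wrapped FastA. It is not the script's implementation: the function name count_residues is hypothetical, and folding lowercase bases into uppercase is an assumption that may not match what the bash script does.

```python
from __future__ import print_function
from collections import Counter


def count_residues(lines):
    """Tally sequence characters over all records, skipping deflines.

    Assumption: lowercase bases are counted together with uppercase ones.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # blank lines and record headers carry no sequence
        counts.update(line.upper())
    return counts


if __name__ == "__main__":
    # Two records, the first wrapped over two lines
    example = [">contig_1 demo", "ATCGN", "AT-", ">contig_2", "ggcc"]
    tally = count_residues(example)
    for symbol in "ATCGN-":
        print(symbol, tally[symbol])
```

Because wrapped lines are simply summed, the line width of the input file has no effect on the totals. To run it on a real file, pass an open file handle instead of the example list.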

Fetching data from NCBI

biosample2FastQ.py: download FastQ read files from NCBI for a given BioSample (or SRR) accession
genbankacc2gbk.py: fetches a GenBank file from NCBI given an accession number. When more than one accession is provided (e.g., for taxa with >1 chromosome or harboring plasmids), all records are merged into a single output file. The --min-length option keeps unusually small sequences from making their way into downstream analyses. Biopython is used to avoid a dependency on efetch.

GenBank file manipulations

extract.nucl.from.GBK.py: specify a search term such as a gene name or locus_tag and extract its nucleotide sequence. An ERROR message is printed if the query returns no hits or more than one hit.
gbk2proteome.molec.weights.py: prints the molecular weights of all proteins from coding sequences to stdout, along with the corresponding locus_tag and product for each. Handles unknown residues in proteins ('X') by estimating each as 128.16 Daltons and prepends a '~' to the calculated molecular weight to indicate that it is an approximation.
locus_tag2faa.py: given a locus tag, an amino acid FastA is printed to stdout
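
The handling of unknown residues described for gbk2proteome.molec.weights.py can be sketched roughly as follows. This is an illustration under stated assumptions, not the script's code: the names protein_weight and format_weight are hypothetical, the table holds standard average weights of the free amino acids, and treating 128.16 Da as a free-amino-acid weight (with one water subtracted per peptide bond) is a guess at how the estimate is applied.

```python
from __future__ import print_function

# Average molecular weights (Da) of the 20 standard free amino acids
FREE_AA_WEIGHT = {
    "A": 89.09, "R": 174.20, "N": 132.12, "D": 133.10, "C": 121.16,
    "E": 147.13, "Q": 146.15, "G": 75.07, "H": 155.16, "I": 131.17,
    "L": 131.17, "K": 146.19, "M": 149.21, "F": 165.19, "P": 115.13,
    "S": 105.09, "T": 119.12, "W": 204.23, "Y": 181.19, "V": 117.15,
}
WATER = 18.015           # lost for each peptide bond formed
UNKNOWN_WEIGHT = 128.16  # estimate for 'X' quoted in the description above


def protein_weight(seq):
    """Return (weight_in_Da, is_approximate) for a protein sequence."""
    seq = seq.upper().rstrip("*")  # drop a trailing stop symbol if present
    total = 0.0
    approximate = False
    for residue in seq:
        if residue == "X":
            # Assumption: X is counted like a free amino acid of 128.16 Da
            total += UNKNOWN_WEIGHT
            approximate = True
        else:
            total += FREE_AA_WEIGHT[residue]
    total -= WATER * max(len(seq) - 1, 0)
    return total, approximate


def format_weight(seq):
    """Format the weight, prefixing '~' when any residue was estimated."""
    weight, approximate = protein_weight(seq)
    return ("~" if approximate else "") + "%.2f" % weight


if __name__ == "__main__":
    for protein in ("MKV", "MKXV"):
        print(protein, format_weight(protein))
```

Residues outside the table other than X (for example B, Z, or U) raise a KeyError in this sketch; a real script would need to decide how to treat them.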

Cleaning up disk space

BLAST searching requires index files that can be easily and quickly re-generated, so remove all leftover binary files within $HOME:

 find $HOME -type f -regextype posix-extended -regex '.*\.(nhr|nih|nin|nog|nsd|nsi|nsq|psi|psq)' -print | xargs rm -v

SPAdes keeps a lot of intermediate files, so delete these but keep the essential log and FastA files needed to repeat the assembly if necessary:

 prune.SPAdes.assembly.dirs.bash $HOME
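
Before deleting anything, it can help to preview what the find command above would match. The sketch below is a dry run that only lists candidate files and never removes them. The function name find_blast_index_files is hypothetical, and the extension list is copied as-is from the command above.

```python
from __future__ import print_function
import os

# Extensions copied verbatim from the find command above
BLAST_INDEX_EXTENSIONS = (
    ".nhr", ".nih", ".nin", ".nog", ".nsd", ".nsi", ".nsq", ".psi", ".psq",
)


def find_blast_index_files(top):
    """Return sorted paths of regular files under top with a listed extension."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(top):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Mirror 'find -type f': skip symlinks and other non-regular files
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            if name.endswith(BLAST_INDEX_EXTENSIONS):
                matches.append(path)
    return sorted(matches)


if __name__ == "__main__":
    for path in find_blast_index_files(os.path.expanduser("~")):
        print(path)  # listing only; nothing is deleted
```

Reviewing this list first is a cheap safeguard, since the rm in the original pipeline is not reversible.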

Example Installation of 'bpy2' environment

1 - get anaconda

 cd ~/Downloads/
 wget https://repo.continuum.io/archive/Anaconda2-4.4.0-Linux-x86_64.sh
 chmod u+x Anaconda2-4.4.0-Linux-x86_64.sh
 bash Anaconda2-4.4.0-Linux-x86_64.sh

2 - make it available

 echo 'export PATH="$LAB_HOME/.anaconda2/bin:$PATH"' >> ~/.bashrc
 source ~/.bashrc

3 - create an environment called 'bpy2' with Python 2.7.12

 conda create -n bpy2 python=2.7.12

4 - hop into the bpy2 env

 source activate bpy2

5 - install modules within the bpy2 environment

 conda config --add channels bioconda
 conda install biopython=1.68 dendropy=4.2.0 matplotlib=2.0.0 numpy=1.12.1 pandas=0.19.2 readline=6.2 reportlab=3.4.0 ruffus=2.6.3 scipy=0.19.0 seaborn=0.7.1 scikit-learn=0.18.1 sqlite=3.13.0

6 - get out of the environment

 source deactivate
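
After step 5, and while the bpy2 environment is still active, a quick check can confirm that the pinned versions were actually installed. The sketch below is one possible way to do that; the names PINNED, installed_version, and report are hypothetical, and only a few of the packages from step 5 are listed. Note that the biopython package is imported as Bio.

```python
from __future__ import print_function
import importlib

# conda package name -> (import name, version pinned in step 5)
PINNED = {
    "biopython": ("Bio", "1.68"),
    "numpy": ("numpy", "1.12.1"),
    "pandas": ("pandas", "0.19.2"),
    "scipy": ("scipy", "0.19.0"),
}


def installed_version(import_name):
    """Return the module's __version__, 'unknown' if absent, or None if not importable."""
    try:
        module = importlib.import_module(import_name)
    except ImportError:
        return None
    return getattr(module, "__version__", "unknown")


def report(pinned=PINNED):
    """Print one line per package comparing the wanted and installed versions."""
    for package, (import_name, wanted) in sorted(pinned.items()):
        found = installed_version(import_name)
        status = "ok" if found == wanted else "differs"
        print("%-10s wanted %-8s found %-8s %s" % (package, wanted, found, status))


if __name__ == "__main__":
    report()
```

A package reported with found None could not be imported at all, which usually means the environment was not activated before running the check.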
