This script extracts subsequences from FASTA/FASTQ files very quickly, but it consumes a lot of memory.
$ pypy fastx_subseq.py -f FASTX -l SEQ_NAME_LIST # No detailed information printed.
or
$ pypy fastx_subseq.py -f FASTX -l SEQ_NAME_LIST -o OUT_DIR -v
# Warning: This script is memory-consuming! #
Initializing...
Extracting...
[================================================================================] Processing 100.0%...
All done.
If the "-o" option is not specified, all extracted sequences are written to a folder named extracted_sequences
in the current working directory by default. Make sure you have write permission there.
An example of using the script as a Python module:
import sys
sys.path.append('/path/to/fastx_subseq/') # If necessary.
from fastx_subseq import Fastx
f = Fastx(FASTX, verbose=True) # To process verbosely, set "verbose=True" (default).
f.ExtractInfo() # To extract the FASTX's info (consumes memory).
f.FetchSeq(SEQ_NAME_LIST, OUT_DIR) # To fetch sequences.
f.ReleaseMemory() # Recommended.
For more details:
>>> from fastx_subseq import Fastx
>>> help(Fastx)
FASTX must be a file in FASTA format or 4-line FASTQ format.
SEQ_NAME_LIST is a plain-text file containing sequence names, one per line (no spaces), such as:
$ head SEQ_NAME_LIST
E00247:343:HYMLVCCXX:8:1101:11363:40583
E00247:343:HYMLVCCXX:8:1101:1813:43941
E00247:343:HYMLVCCXX:8:1101:23023:68658
E00247:343:HYMLVCCXX:8:1101:23409:33041
E00247:343:HYMLVCCXX:8:1101:2656:67058
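The list format described above can be read and checked with a few lines of Python. This is a minimal sketch, not part of fastx_subseq: `read_name_list` is a hypothetical helper, and it assumes that blank lines should be skipped and that a name containing whitespace should be rejected.

```python
def read_name_list(path):
    """Return the sequence names found in a SEQ_NAME_LIST-style file, in order."""
    names = []
    with open(path) as handle:
        for line_number, line in enumerate(handle, start=1):
            name = line.rstrip("\r\n")
            if not name:
                continue  # Assumption: blank lines are ignored.
            if any(ch.isspace() for ch in name):
                # The list format allows no spaces inside a name.
                raise ValueError(
                    "line %d: name contains whitespace: %r" % (line_number, name)
                )
            names.append(name)
    return names
```

Running such a check before extraction can catch a malformed list early, before the large FASTX file is loaded.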
OUT_DIR is a custom output directory (default: "./extracted_sequences/").
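To illustrate the speed/memory trade-off mentioned at the top, here is a minimal sketch of the general technique of holding every record in an in-memory dictionary so that each lookup is a single hash access. It is not the actual implementation of fastx_subseq: `load_records` and `fetch` are hypothetical names, and the sketch assumes that a sequence name is the first whitespace-delimited token of the header line and that FASTQ records are strictly four lines each.

```python
def load_records(path):
    """Map each sequence name to its full record text (FASTA or 4-line FASTQ)."""
    records = {}
    with open(path) as handle:
        first = handle.readline()
        if not first:
            return records  # Empty file.
        handle.seek(0)
        if first.startswith("@"):
            # 4-line FASTQ: header, sequence, '+' separator, quality.
            while True:
                chunk = [handle.readline() for _ in range(4)]
                if not chunk[0].strip():
                    break  # End of file (or trailing blank line).
                # Assumption: the name is the first token after '@'.
                name = chunk[0][1:].split()[0]
                records[name] = "".join(chunk)
        else:
            # FASTA: a '>' header followed by one or more sequence lines.
            name, lines = None, []
            for line in handle:
                if line.startswith(">"):
                    if name is not None:
                        records[name] = "".join(lines)
                    name, lines = line[1:].split()[0], [line]
                elif name is not None:
                    lines.append(line)
            if name is not None:
                records[name] = "".join(lines)
    return records


def fetch(records, names):
    """Return the records for the requested names, skipping unknown names."""
    return [records[name] for name in names if name in records]
```

Every record stays resident in the dictionary, so memory use grows with the size of the input file; in exchange, extracting many names costs no further passes over the file.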
If only a few subsequences need to be extracted from a FASTA file, samtools is recommended instead:
$ samtools faidx INPUT_FASTA # Build an index for your FASTA file first.
$ samtools faidx INPUT_FASTA SEQ_NAME > OUTPUT_FASTA # Extract the subseq.