GenSum

RNA-seq gene expression counter

Counts reads on genes. Fast. Inspired by htseq-count GenSum support two quantification methods: union and strict, which are compatible with htseq-count union and intersection-strict. GenSum and htseq-count produce identical counts in the union mode, and minor differences in strict mode (1 in 35000 reads). GenSum is much much faster. A 15 million read bam file can be quantified in around 5s while htseq-count takes about 11 minutes.

Installation

For now manually build using the rust toolchain (https://rustup.rs/).

cargo build --release
# binary ends up here:
./target/release/gensum --help

Docker containers

Docker containers for gensum can be found on dockerhub and ghcr.

Input

Required input is a GTF file that contains the exons and the gene-ids. GenSum uses a very naive parser to extract only 'exon' features. The GTF exon features are required to have an entry in the attributes that contains: gene_id: "<the gene id to count>". It it recommended to use the files generated by the ensembl team at: http://ftp.ensembl.org/pub/current_gtf/

The second input is the .bam file created by an aligner. TopHat/HiSat2/STAR should all work fine. Stranded libraries as well as paired end data are supported. When using a stranded RNA library supply the library type using the --strandness flag to restrict counting only the correctly oriented reads. Paired-end reads are expected to be oriented inwards (--->...<---). Sorting the bam file by position or name is not required, but paired end reads are stored until the mate is encountered which can affect memory usage.

Options

USAGE:
    gensum [FLAGS] [OPTIONS] --bam <FILE> --gtf <FILE>

FLAGS:
    -h, --help        Prints help information
    -n, --nosingle    Do not count paired-end reads that have only 1 mapped end. Default allows one mapped end. Only
                      affects paired-end reads.
    -d, --usedups     Also count read (pairs) marked as (optical) duplicate, default excludes duplicates. Requires a bam
                      files processed with a markdups tool
    -V, --version     Prints version information

OPTIONS:
    -b, --bam <FILE>                 The bam file to quantify
    -f, --gtf <FILE>                 The .gtf reference transcriptome file
    -q, --mapq <0-255>               The minimum required mapping quality to include [default: 10]
    -o, --output <FILE>              The output file
    -m, --method <qmethod>           The quantification method, 'strict' or 'union'. 'union' counts all genes that
                                     overlap any part of the reads, 'strict' requires the read to map within the exon
                                     boundaries. [default: union]  [possible values: union, strict]
    -s, --strandness <strandness>    The RNA library strandness [F]orward, [R]everse or [U]nstranded [default: U]
                                     [possible values: F, R, U]

Output

The output is a simple two column <tab> delimited file. The first column contains the gene_id or a descriptive name for unassigned reads. The second column the counts on that gene.

Name		Name	Last commit message	Last commit date
Latest commit History 37 Commits
.github/workflows		.github/workflows
src		src
.gitignore		.gitignore
Cargo.lock		Cargo.lock
Cargo.toml		Cargo.toml
Changelog.md		Changelog.md
Dockerfile		Dockerfile
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

GenSum

RNA-seq gene expression counter

Installation

Docker containers

Input

Options

Output

About

Releases

Packages

Languages

License

NKI-GCF/gensum

Folders and files

Latest commit

History

Repository files navigation

GenSum

RNA-seq gene expression counter

Installation

Docker containers

Input

Options

Output

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages