Query database size and predictmatch matches #1

Open
imrambo opened this issue Jun 26, 2020 · 1 comment

imrambo commented Jun 26, 2020

Hello,

I am using SpacePHARER for a project and have a few questions about how query database size affects predictmatch results. I ran spacepharer predictmatch with two query databases, each against the same target database. Only 31% of the hits from the first run (which used the smaller query database) were found in the second run.

#First run spacer query DB:
2,206 spacers from 38 genomes.

#First run results:
161 spacer hits to viral target DB.

#Second run spacer query DB:
15,730 spacers from 450 genomes.

#Second run results:
1,764 spacer hits to viral target DB.
50 of the 161 spacers with hits in the first run were retained in the second run output.
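
For reference, the retention figure above can be checked with a small set comparison. This is only a minimal sketch: it assumes the spacer identifiers with hits have already been extracted from each run's output into plain Python collections, and the function name `hit_retention` and the toy spacer IDs are hypothetical, not part of SpacePHARER.

```python
def hit_retention(first_run_hits, second_run_hits):
    """Compare spacer IDs with hits in two runs.

    Returns (retained, lost, fraction_retained), where the fraction is
    relative to the number of distinct spacers with hits in the first run.
    """
    first = set(first_run_hits)
    second = set(second_run_hits)
    retained = first & second   # spacers with a hit in both runs
    lost = first - second       # spacers with a hit only in the first run
    fraction = len(retained) / len(first) if first else 0.0
    return retained, lost, fraction


# Toy example with made-up spacer IDs (not real data).
first = {"spacer_a", "spacer_b", "spacer_c", "spacer_d"}
second = {"spacer_b", "spacer_x", "spacer_y"}
retained, lost, fraction = hit_retention(first, second)
print(sorted(retained), sorted(lost), f"{fraction:.0%}")

# The counts reported above: 50 retained out of 161 first-run spacers.
reported_fraction = 50 / 161
print(f"{reported_fraction:.0%}")
```

The `lost` set is the useful part for debugging, since it lists exactly which first-run spacers to look up in the second run's log.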

Main question

I am wondering why spacers from the smaller query database that had a hit in the first run are absent from the output of the second run, which used the larger query database. Does the --simple-best-hit setting affect this?

The tmp folder was emptied after each run.

Environment

SpacePHARER Version: 2.fc5e668
Conda
Ubuntu 16.04

#The same parameters are used for both runs:
--strand 2 --fmt 2 --fdr 0.01 --simple-best-hit 1 --use-all-table-starts 1 --translate 1 --search-type 1 --translation-table 11 --rescore-mode 0 --num-iterations 4 --cov 0.50 --e-profile 0.0001 -s 5.70 --report-pam 1 --gap-open 16 --gap-extend 2 --cov-mode 0 --min-seq-id 0.95 --max-seq-id 1.00 --orf-start-mode 1 --remove-tmp-files 0

I'm reading through the supplemental information of the bioRxiv paper to better understand the algorithm.

Thank you very much for your time and help.

Cheers,
Ian

@RuoshiZhang
Member

Hi,

Could you please provide the full log of the two runs?
What is the size of your target DB? Are the matches (virus-host pairs) for the remaining hits also missing from the output?
The --simple-best-hit parameter in SpacePHARER is fixed and should not be related to this problem.
