Antonio Mallia, Michał Siedlaczek, Joel Mackenzie, and Torsten Suel
This is the Docker image for PISA: Performant Indexes and Search for Academia (v0.6.6), conforming to the OSIRRC jig for the Open-Source IR Replicability Challenge (OSIRRC) at SIGIR 2019. This image is available on Docker Hub and has been tested with the jig at commit f6c6ef4 (19/6/2019).
- Supported test collections: `robust04`, `core17`, `core18` (newswire); `gov2`, `cw09b`, `cw12b` (web)
- Supported hooks: `init`, `index`, `search`
The following `jig` command can be used to index TREC disks 4/5 for `robust04`:
python run.py prepare \
--repo osirrc2019/pisa \
--tag v0.1.1 \
--collections Robust04=/data/collections/disk45
The following `jig` command can be used to perform a retrieval run on the `robust04` test collection:
python run.py search \
--repo osirrc2019/pisa \
--tag v0.1.1 \
--collection Robust04 \
--topic topics/topics.robust04.txt \
--output $(pwd)/output \
--qrels $(pwd)/qrels/qrels.robust04.txt
The PISA image supports the following retrieval methods:
- BM25: k1=0.9, b=0.4 (Robertson et al., 1995)
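For orientation, the sketch below scores a single term with a common textbook formulation of BM25 using the parameters listed above. It is an illustration only: the IDF variant and the function name `bm25_term_score` are assumptions, and PISA's internal scorer may differ in detail.

```python
import math

K1 = 0.9  # term-frequency saturation, as listed above
B = 0.4   # document-length normalization, as listed above


def bm25_term_score(tf, doc_len, avg_doc_len, num_docs, doc_freq, k1=K1, b=B):
    """Score one query term against one document (textbook BM25 variant)."""
    # Assumed IDF variant: log(1 + (N - df + 0.5) / (df + 0.5)), never negative.
    idf = math.log(1.0 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Length normalization: documents longer than average are penalized.
    norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / (tf + norm)
```

A document's score for a query would be the sum of this value over the query terms. With b=0.4, document length has a fairly mild effect on the score.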
The default search system can be changed. For example, several index compression schemes and search algorithms are supported. These options are supplied using `--opts [option]=[value]`.
The `index` hook accepts the following options:
- `stemmer` can be either `krovetz` or `porter2`; the default is `porter2`.
- `compressor` can be `opt` (Partitioned Elias-Fano), `block_interpolative` (Interpolative), `block_simdbp` (SIMD-BP128), or `block_optpfor` (OPT-PFor Delta); `block_simdbp` is the default. Multiple compressors can be passed using a comma delimiter, e.g., `--opts compressor="block_simdbp,opt"`.
- `block_type` can be either `fixed` or `variable`; the default is `variable`. If `fixed` is used, `block_size` must also be supplied, where `block_size` is a positive integer.
- `skip_reordering=1`: if provided, document reordering will not be performed.
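The indexing rules above can be summarized as a small validation routine. This is a minimal sketch, not code from the image: the function name `check_index_opts` and the dictionary representation of the options are assumptions made for illustration.

```python
COMPRESSORS = {"opt", "block_interpolative", "block_simdbp", "block_optpfor"}
STEMMERS = {"krovetz", "porter2"}
BLOCK_TYPES = {"fixed", "variable"}


def check_index_opts(opts):
    """Fill in the documented defaults and reject invalid combinations."""
    resolved = {
        "stemmer": opts.get("stemmer", "porter2"),
        "compressor": opts.get("compressor", "block_simdbp"),
        "block_type": opts.get("block_type", "variable"),
        # Reordering is skipped only when skip_reordering=1 is supplied.
        "skip_reordering": str(opts.get("skip_reordering", "")) == "1",
    }
    if resolved["stemmer"] not in STEMMERS:
        raise ValueError(f"unknown stemmer: {resolved['stemmer']}")
    # Indexing accepts a comma-delimited list of compressors.
    for name in resolved["compressor"].split(","):
        if name not in COMPRESSORS:
            raise ValueError(f"unknown compressor: {name}")
    if resolved["block_type"] not in BLOCK_TYPES:
        raise ValueError(f"unknown block_type: {resolved['block_type']}")
    if resolved["block_type"] == "fixed":
        # Fixed blocks require block_size, which must be a positive integer.
        size = str(opts.get("block_size", ""))
        if not size.isdigit() or int(size) <= 0:
            raise ValueError("block_type=fixed requires a positive integer block_size")
        resolved["block_size"] = int(size)
    return resolved
```

For example, `check_index_opts({"block_type": "fixed"})` raises a `ValueError` because no `block_size` was given, while an empty dictionary resolves to the defaults.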
The `search` hook accepts the following options:
- `stemmer` is the same as above and is used for stemming queries.
- `compressor` is the same as above and should match a `compressor` used during indexing. However, only a single `compressor` can be provided at a time.
- `algorithm` can be `wand`, `maxscore`, or `block_max_wand`; `block_max_wand` is the default.
- `block_type` is the same as above and should match the `block_type` used during indexing. For search, `block_size` is not required.
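The constraints between search-time and index-time options can be expressed as a consistency check. The sketch below is illustrative only: `check_search_opts` and the dictionary representation are assumptions, not part of the image.

```python
ALGORITHMS = {"wand", "maxscore", "block_max_wand"}


def check_search_opts(search_opts, index_opts):
    """Validate search options against the options used when indexing."""
    algorithm = search_opts.get("algorithm", "block_max_wand")
    if algorithm not in ALGORITHMS:
        raise ValueError(f"unknown algorithm: {algorithm}")

    compressor = search_opts.get("compressor", "block_simdbp")
    # Search takes exactly one compressor, even if several were indexed.
    if "," in compressor:
        raise ValueError("only a single compressor can be provided for search")
    indexed = index_opts.get("compressor", "block_simdbp").split(",")
    if compressor not in indexed:
        raise ValueError(f"compressor {compressor} was not used during indexing")

    block_type = search_opts.get("block_type", "variable")
    if block_type != index_opts.get("block_type", "variable"):
        raise ValueError("block_type must match the one used during indexing")

    # block_size is deliberately not checked: it is not required for search.
    return {
        "stemmer": search_opts.get("stemmer", "porter2"),
        "compressor": compressor,
        "algorithm": algorithm,
        "block_type": block_type,
    }
```

If the index was built with `compressor="block_simdbp,opt"`, a search with `compressor=opt` passes this check, whereas `compressor=block_optpfor` is rejected.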
For indexing, the corpus name defines the indexing configuration. The following values are supported:
- robust04 - TREC Disks 4&5.
- core17 - the New York Times corpus.
- core18 - the TREC Washington Post (WAPO) corpus.
- gov2 - the TREC GOV2 corpus.
- cw09b - the TREC ClueWeb09 corpus.
- cw12b - the TREC ClueWeb12 corpus.
As discussed above, the default configuration is as follows:
- Porter 2 Stemming
- SIMD-BP128 compression
- Variable-sized blocks and Block-Max WAND, leading to the "Variable BMW" algorithm
Since the variable-sized blocks depend on a parameter, lambda, we searched offline for suitable values of lambda and hardcoded them into the `lamb()` method within the `index` script. We chose values of lambda that result in a mean block size of 40 elements per block, with a tolerance of ±0.5 elements. Please note that these lambda values apply only to the default parsing and indexing setup, and would need to be searched again if settings are changed (for example, if a different stemmer were used).
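One way such an offline search could be carried out is by bisection. The sketch below is a generic illustration, not the procedure actually used for this image: it assumes the mean block size is non-decreasing in lambda, and `toy_mean_block_size` is a hypothetical stand-in for building the index and measuring its blocks.

```python
def find_lambda(mean_block_size, target=40.0, tolerance=0.5,
                lo=1e-3, hi=1e3, max_iter=100):
    """Bisect lambda until mean_block_size(lambda) is within tolerance of target.

    Assumes mean_block_size is non-decreasing in lambda on [lo, hi].
    """
    for _ in range(max_iter):
        mid = (lo + hi) / 2.0
        size = mean_block_size(mid)
        if abs(size - target) <= tolerance:
            return mid
        if size < target:
            lo = mid  # blocks too small: try a larger lambda
        else:
            hi = mid  # blocks too large: try a smaller lambda
    raise RuntimeError("no lambda found within tolerance")


def toy_mean_block_size(lam):
    """Hypothetical stand-in; a real run would index and measure block sizes."""
    return 8.0 * lam


lam = find_lambda(toy_mean_block_size)
```

In a real setting each evaluation would require partitioning the postings with the candidate lambda, which is why the values are searched once offline and then hardcoded.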
robust04

BM25 | MAP | P@30 | NDCG@20 |
---|---|---|---|
TREC 2004 Robust Track Topics | 0.2534 | 0.3120 | 0.4221 |

core17

BM25 | MAP | P@30 | NDCG@20 |
---|---|---|---|
TREC 2017 Common Core Track Topics | 0.2078 | 0.4260 | 0.3898 |

core18

BM25 | MAP | P@30 | NDCG@20 |
---|---|---|---|
TREC 2018 Common Core Track Topics | 0.2384 | 0.3500 | 0.3927 |

gov2

BM25 | MAP | P@30 | NDCG@20 |
---|---|---|---|
TREC 2004 Terabyte Track: Topics 701-750 | 0.2638 | 0.4776 | 0.4070 |
TREC 2005 Terabyte Track: Topics 751-800 | 0.3305 | 0.5487 | 0.5073 |
TREC 2006 Terabyte Track: Topics 801-850 | 0.2950 | 0.4680 | 0.4925 |

cw09b

BM25 | MAP | P@30 | NDCG@20 |
---|---|---|---|
TREC 2010 Web Track: Topics 51-100 | 0.1009 | 0.2521 | 0.1509 |
TREC 2011 Web Track: Topics 101-150 | 0.1093 | 0.2507 | 0.2177 |
TREC 2012 Web Track: Topics 151-200 | 0.1054 | 0.2100 | 0.1311 |

cw12b

BM25 | MAP | P@30 | NDCG@20 |
---|---|---|---|
TREC 2013 Web Track: Topics 201-250 | 0.0449 | 0.1940 | 0.1529 |
TREC 2014 Web Track: Topics 251-300 | 0.0217 | 0.1240 | 0.1484 |
The following is a quick breakdown of what happens in each of the scripts in this repo.
The Dockerfile derives from the official PISA Docker image. Additionally, it installs dependencies (python3, etc.), copies scripts to the root dir, and sets the working dir to `/work`.

The `init` script is empty since all initialization is performed while the Docker image is built.
The `index` script reads a JSON string (see here) containing at least one collection to index (including the name, path, and format). Each collection is indexed and placed in a directory with the same name as the collection in the working dir (e.g., `/work/robust04`). At this point, `jig` takes a snapshot and the indexed collections are persisted for the `search` hook.
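The mapping from the JSON payload to index directories can be sketched as follows. The key names (`collections`, `name`, `path`, `format`), the example path, and the format value are assumptions for illustration; the jig documentation defines the actual schema.

```python
import json
from pathlib import PurePosixPath

WORK_DIR = PurePosixPath("/work")  # working dir set in the Dockerfile


def plan_indexes(json_string):
    """Map each collection in the payload to the directory holding its index."""
    payload = json.loads(json_string)
    collections = payload["collections"]  # assumed key name
    if not collections:
        raise ValueError("at least one collection is required")
    plan = []
    for coll in collections:
        plan.append({
            "name": coll["name"],
            "source": coll["path"],
            "format": coll["format"],
            # The index directory carries the same name as the collection.
            "index_dir": str(WORK_DIR / coll["name"]),
        })
    return plan


# Hypothetical payload; the path and format values are made up.
example = '{"collections": [{"name": "robust04", "path": "/input/robust04", "format": "trectext"}]}'
plan = plan_indexes(example)
```

Because the directory name is derived from the collection name alone, the `search` hook can locate the index later from the collection name it receives.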
The `search` script reads a JSON string (see here) containing the collection name (to map back to the index directory from the `index` hook) and topic path, among other options. The retrieval run is performed (using additional `--opts` params, see above) and output is placed in `/output` for the `jig` to evaluate using `trec_eval`.
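Since `trec_eval` consumes run files in the standard six-column TREC format (query id, the literal `Q0`, document id, rank, score, run tag), the output step can be pictured as below. This is a sketch under assumptions: the function name `format_run`, the run tag `pisa`, and the input structure are hypothetical and not taken from the image's scripts.

```python
def format_run(results, tag="pisa"):
    """Render ranked results as TREC run lines: qid Q0 docno rank score tag.

    `results` maps a query id to a list of (docno, score) pairs that are
    already sorted by descending score.
    """
    lines = []
    for qid, ranked in results.items():
        for rank, (docno, score) in enumerate(ranked, start=1):
            lines.append(f"{qid} Q0 {docno} {rank} {score:.4f} {tag}")
    return "".join(line + "\n" for line in lines)


# Made-up document ids and scores, for illustration only.
run_text = format_run({"301": [("DOC-A", 12.5), ("DOC-B", 11.0)]})
```

Writing the resulting text to a file under `/output` would make the run available to the jig for evaluation.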
- Documentation reviewed at commit `8f88235` (2019-06-17) by Ryan Clancy