Luc4IR (pronounced Lucifer) is a Java implementation of sparse indexing and retrieval. The code is distributed in the hope that it will be useful for IR practitioners and students who want to get started with retrieving documents from a collection and measuring effectiveness with standard evaluation metrics.
Due to file size restrictions, the index could not be made available in this repository. To recreate the index, download the TREC disks 4/5 collection from here.
After downloading and unzipping the collection, build the index by executing the following script:
./index_trecd45 <path to the collection>
Alternatively, you can download the index from this shared OneDrive folder.
For retrieval, run the script:
./retrieve_trecd45.sh <INDEX-PATH> <QUERY FILE> <QRELS FILE>
This script executes the queries in a TREC-formatted topic file using the LM-Dir retrieval model (a language model with Dirichlet smoothing) and reports mean average precision (MAP).
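To make the reported metric concrete, here is a minimal, self-contained sketch of how MAP can be computed from ranked lists and relevance judgments. It is an illustration only and not the code that the script runs: the class name `MapSketch`, its method names, and the toy document IDs are all hypothetical, and relevance is treated as binary.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration; not part of Luc4IR.
class MapSketch {

    /** Average precision of one ranked list against a set of relevant document IDs. */
    static double averagePrecision(List<String> ranked, Set<String> relevant) {
        if (relevant.isEmpty()) {
            return 0.0;
        }
        int hits = 0;
        double sumPrecision = 0.0;
        for (int i = 0; i < ranked.size(); i++) {
            if (relevant.contains(ranked.get(i))) {
                hits++;
                // Precision at the rank (1-based) where a relevant document appears.
                sumPrecision += (double) hits / (i + 1);
            }
        }
        // Relevant documents that were never retrieved contribute zero.
        return sumPrecision / relevant.size();
    }

    /** Mean of the per-query average precision over queries with at least one relevant document. */
    static double meanAveragePrecision(Map<String, List<String>> runs,
                                       Map<String, Set<String>> qrels) {
        double sum = 0.0;
        int numQueries = 0;
        for (Map.Entry<String, Set<String>> entry : qrels.entrySet()) {
            if (entry.getValue().isEmpty()) {
                continue; // assumption: skip queries without relevant documents
            }
            List<String> ranked = runs.getOrDefault(entry.getKey(), List.of());
            sum += averagePrecision(ranked, entry.getValue());
            numQueries++;
        }
        return numQueries == 0 ? 0.0 : sum / numQueries;
    }

    public static void main(String[] args) {
        // Toy data: two queries with made-up document IDs.
        Map<String, List<String>> runs = Map.of(
                "q1", List.of("d1", "d2", "d3", "d4"),
                "q2", List.of("d5", "d6"));
        Map<String, Set<String>> qrels = Map.of(
                "q1", Set.of("d1", "d3"),
                "q2", Set.of("d6"));
        System.out.println("MAP = " + meanAveragePrecision(runs, qrels));
    }
}
```

In the toy data, the first query finds its relevant documents at ranks 1 and 3 and the second at rank 2, so the per-query values are 5/6 and 1/2 before averaging.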
Another small test collection included in the repository is the ToucheV2 dataset. To run BM25, execute the following commands, which prepare the index and execute retrieval on 49 test queries. The result file, named touche.res, is saved in the project base folder and can then be evaluated with trec_eval.
./index_touche.sh
./retrieve_touche.sh
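For readers new to BM25, the sketch below shows one common textbook form of the scoring function. It is an assumption-laden illustration, not the implementation the scripts invoke: the class `Bm25Sketch`, its methods, the parameter values `K1 = 1.2` and `B = 0.75`, and the toy statistics are all hypothetical, and library implementations may differ in details such as the IDF formula or constant factors.

```java
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of Luc4IR.
class Bm25Sketch {
    static final double K1 = 1.2;  // assumed term-frequency saturation parameter
    static final double B = 0.75;  // assumed document-length normalization parameter

    /** Smoothed inverse document frequency; stays positive for every document frequency. */
    static double idf(long numDocs, long docFreq) {
        return Math.log(1.0 + (numDocs - docFreq + 0.5) / (docFreq + 0.5));
    }

    /** Contribution of a single query term to the score of one document. */
    static double termScore(int tf, int docLen, double avgDocLen, long numDocs, long docFreq) {
        if (tf == 0) {
            return 0.0;
        }
        // Documents longer than average are penalized; shorter ones are boosted.
        double lengthNorm = K1 * (1.0 - B + B * docLen / avgDocLen);
        return idf(numDocs, docFreq) * tf * (K1 + 1.0) / (tf + lengthNorm);
    }

    /** Sum of the per-term contributions over all query terms. */
    static double score(List<String> queryTerms, Map<String, Integer> docTermFreqs, int docLen,
                        double avgDocLen, long numDocs, Map<String, Long> docFreqs) {
        double total = 0.0;
        for (String term : queryTerms) {
            int tf = docTermFreqs.getOrDefault(term, 0);
            long df = docFreqs.getOrDefault(term, 0L);
            total += termScore(tf, docLen, avgDocLen, numDocs, df);
        }
        return total;
    }

    public static void main(String[] args) {
        // Toy statistics: 1000 documents, average length 100 tokens.
        Map<String, Long> docFreqs = Map.of("nuclear", 10L, "energy", 200L);
        Map<String, Integer> docTermFreqs = Map.of("nuclear", 3, "energy", 1);
        double s = score(List.of("nuclear", "energy"), docTermFreqs, 120, 100.0, 1000L, docFreqs);
        System.out.println("BM25 score = " + s);
    }
}
```

The rarer term ("nuclear" in the toy data) receives a larger IDF weight, and repeated occurrences of a term add progressively less to the score because of the saturation controlled by `K1`.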