- `config`: contains configuration files
- `semanticscholar`: contains the raw data files from https://api.semanticscholar.org/corpus/. The file endings were changed to `.json.gz` with the `./src/etl/rename.py` script (sketched below)
- `elasticsearch`: contains the index data with the semanticscholar corpus
- `src`: contains the modules
- `training`: contains the provided training docs, group definitions, a python script to generate query sequences, and the training corpus for feature engineering
- `evaluation`: contains the provided evaluation files and a python script to validate a submission
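The rename step is small; a minimal sketch of what `./src/etl/rename.py` might do, assuming the downloaded corpus chunks sit in `./semanticscholar` and still carry their original `.gz` ending:

```python
from pathlib import Path

# Hypothetical sketch: append ".json.gz" so downstream tools recognise the
# chunks as gzipped JSON lines. Directory and naming scheme are assumptions.
for path in Path("./semanticscholar").glob("*.gz"):
    if not path.name.endswith(".json.gz"):
        path.rename(path.with_name(path.stem + ".json.gz"))
```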
- training, evaluation and submission files from the fair-trec website
- pandas
- fairsearchdeltr
- elasticsearch python client
- elasticsearch docker instance
```bash
docker pull docker.elastic.co/elasticsearch/elasticsearch:7.0.1
docker build -t elasticsearch-ltr ./config
docker run -d --rm --name es \
    -p 9200:9200 -p 9300:9300 \
    -e "discovery.type=single-node" \
    -e "http.cors.enabled=true" \
    -e "http.cors.allow-origin=*" \
    -e "http.cors.allow-headers=X-Requested-With,X-Auth-Token,Content-Type,Content-Length,Authorization" \
    -e "http.cors.allow-credentials=true" \
    -e "ES_JAVA_OPTS=-Xms2g -Xmx2g" \
    -v `pwd`/elasticsearch:/usr/share/elasticsearch/data \
    elasticsearch-ltr
```
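Once the container is running, a quick connectivity check with the elasticsearch Python client can save debugging time later (a sketch; host and port follow the `docker run` command above):

```python
from elasticsearch import Elasticsearch

# Connect to the single-node instance started above.
es = Elasticsearch(["http://localhost:9200"])

# ping() returns True if the cluster answers; info() shows version details.
print(es.ping())
print(es.info())
```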
```bash
python ./src/etl/data_to_es.py
python ./src/etl/remove_missing_ids.py
```
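For orientation, bulk-indexing the gzipped JSON-lines corpus files with the Python client could look roughly like the sketch below; the index name `semanticscholar`, the directory, and the use of the paper `id` field as document id are assumptions, not necessarily what `data_to_es.py` does:

```python
import gzip
import json
from pathlib import Path

from elasticsearch import Elasticsearch
from elasticsearch.helpers import streaming_bulk

es = Elasticsearch(["http://localhost:9200"])

def actions(corpus_dir="./semanticscholar", index="semanticscholar"):
    # Each corpus chunk is assumed to be gzipped JSON lines, one paper per line.
    for path in Path(corpus_dir).glob("*.json.gz"):
        with gzip.open(path, "rt", encoding="utf-8") as handle:
            for line in handle:
                doc = json.loads(line)
                # Assumed: use the corpus "id" field as the elasticsearch _id.
                yield {"_index": index, "_id": doc["id"], "_source": doc}

# streaming_bulk yields an (ok, result) tuple per indexed document.
for ok, result in streaming_bulk(es, actions(), chunk_size=500):
    if not ok:
        print("failed:", result)
```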
```bash
python -m src.runs.random
python -m src.runs.lambdamart
```

See the example scripts in `src/runs`.
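As rough orientation for the retrieval side of such a run, the sketch below pulls a candidate set from elasticsearch with the Python client; the index name and the `title`/`paperAbstract` fields are assumptions, and the actual run scripts rerank this kind of candidate set with their respective models:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

def candidates(query_text, size=50, index="semanticscholar"):
    # Simple full-text match over title and abstract to build a candidate set.
    body = {
        "size": size,
        "query": {
            "multi_match": {
                "query": query_text,
                "fields": ["title", "paperAbstract"],
            }
        },
    }
    response = es.search(index=index, body=body)
    return [hit["_id"] for hit in response["hits"]["hits"]]

print(candidates("learning to rank"))
```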
- the corpus contains 46 947 044 unique documents
- the training sample contains 4641 documents (4490 unique docs) and 652 queries
- the cleaned training sample contains 557 queries, as some doc_ids are missing in the corpus (see `./src/etl/remove_missing_ids.py`)
- 3863 docs from the training sample are included in the corpus
- the length of each ranking ranges from 2 to 26 docs, with an average of 7 docs
- on average, around 50.94% of the docs per query are not relevant
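These figures can be recomputed with a few lines of pandas once the training sample is loaded; the sketch below assumes the fair-trec training file is JSON lines with a `documents` list of `{doc_id, relevance}` entries per query (file path and field names are assumptions):

```python
import json

import pandas as pd

# Assumed path and format: one JSON object per line, each with a "qid" and
# a "documents" list of {"doc_id": ..., "relevance": ...} entries.
with open("./training/fair-TREC-training-sample.json") as handle:
    queries = [json.loads(line) for line in handle]

rows = [
    {"qid": q["qid"], "doc_id": d["doc_id"], "relevance": d["relevance"]}
    for q in queries
    for d in q["documents"]
]
df = pd.DataFrame(rows)

lengths = df.groupby("qid").size()
non_relevant = (df["relevance"] == 0).groupby(df["qid"]).mean()

print("queries:", df["qid"].nunique())
print("documents:", len(df), "unique docs:", df["doc_id"].nunique())
print("ranking length:", lengths.min(), "to", lengths.max(), "mean:", lengths.mean())
print("avg share of non-relevant docs per query:", non_relevant.mean())
```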
- contains the final run scripts that build on all other modules
- imports and maps the raw data into the elasticsearch index
- processes the input training and group files (`inputhandler`)
- layer between the program modules and elasticsearch (`corpus`)
- contains the learning-to-rank model used to rerank the document sets; the DELTR algorithm is used for training (see the sketch after this list)
- provides implementations of the evaluation measures (`evaluation.py`)
- provides a module to generate features from the corpus (`features.py`)
- contains modules for command line arguments, log file initialization, and IO functionality
- contains test files and scripts
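For orientation, training and reranking with the `fairsearchdeltr` package follow roughly the pattern below; the protected attribute, feature columns, and hyperparameters are placeholders for illustration, not the values used in this project:

```python
import pandas as pd
from fairsearchdeltr import Deltr

# Toy training frame: query id, document id, the protected attribute first,
# then the remaining features, and the relevance judgment last.
train = pd.DataFrame(
    {
        "q_id": [1, 1, 1, 2, 2, 2],
        "doc_id": [1, 2, 3, 4, 5, 6],
        "group": [1, 0, 0, 1, 1, 0],              # placeholder protected attribute
        "score": [0.9, 0.7, 0.3, 0.8, 0.5, 0.2],  # placeholder feature
        "judgment": [1, 1, 0, 1, 0, 0],
    }
)

# gamma balances relevance against exposure fairness; the iteration count is illustrative.
dtr = Deltr("group", 1.0, 1000, standardize=True)
dtr.train(train)

# Rerank one query's documents: same columns as training, without the judgment.
to_rank = train[train["q_id"] == 1].drop(columns=["judgment"])
print(dtr.rank(to_rank))
```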