This repository contains the scripts, the experimental data, and the evaluations of the paper *Validating Synthetic Usage Data in Living Lab Environments*. Below, you will find detailed instructions on how to reproduce the experiments as well as overviews of the scripts and the project structure.
Prerequisites: Docker and Python (the experiments were run on Ubuntu 20.04).
1.1 Install the Python requirements:
pip install -r requirements.txt
1.2 Install PyClick:
python -m pip install git+https://github.com/markovi/PyClick.git
1.3 Obtain the TripClick dataset and place it in the directory expected by ir_datasets (~/.ir_datasets/ by default), or alternatively create a symbolic link to the TripClick dataset files:
ln -s path/to/tripclick/ ~/.ir_datasets/
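To check that ir_datasets can actually find the TripClick files, a quick sanity check along the following lines may help; the dataset ID tripclick/train is an assumption here, so use whichever TripClick subset you need:

```python
import ir_datasets

# The dataset ID below is an assumption for illustration; pick the TripClick
# subset you actually need (see the ir_datasets documentation).
dataset = ir_datasets.load("tripclick/train")

# If the files are in place, iterating over the first document succeeds.
for doc in dataset.docs_iter():
    print(doc.doc_id)
    break
```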
1.5 Download TripJudge and check if the path to benchmark.exp.tripjudge.2.grade.csv in src/benchmark_bar.py is correct:
git clone https://github.com/sophiaalthammer/tripjudge
1.6 Create and start a Docker container running a MongoDB instance:
python start_db.py
1.7 Parse the TripClick session logs in ./data/ and write them into the MongoDB instance:
python index_logs.py
1.8 Index the TripClick dataset with PyTerrier/ir_datasets. The index files are written to ./indices/tripclick/:
python index_tripclick.py
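The script takes care of the indexing; purely for orientation, indexing a corpus exposed via ir_datasets with PyTerrier typically follows a pattern like the sketch below. The dataset ID and the indexed field name are assumptions, and index_tripclick.py may differ in its details.

```python
import os

import pyterrier as pt

if not pt.started():
    pt.init()

# Wrap the ir_datasets TripClick corpus as a PyTerrier dataset;
# the dataset ID is an assumption for illustration.
dataset = pt.get_dataset("irds:tripclick/train")

# Write a Terrier index to ./indices/tripclick/ (the indexed field name
# "text" is an assumption).
indexer = pt.IterDictIndexer(os.path.abspath("./indices/tripclick"))
index_ref = indexer.index(dataset.get_corpus_iter(), fields=["text"])
```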
Determine the top-k queries and write them to a CSV file (e.g., ./experimental_results/train.head.50.csv):
python topk_queries.py
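For orientation only, determining the most frequent queries from the indexed session logs could look roughly like the following; the database, collection, and field names are assumptions and are not taken from topk_queries.py:

```python
from pymongo import MongoClient

# Connect to the local MongoDB instance started by start_db.py.
client = MongoClient("localhost", 27017)

# Database, collection, and field names are assumptions for illustration.
sessions = client["tripclick"]["sessions"]

# Count how often each query occurs and keep the 50 most frequent ones.
top_queries = sessions.aggregate([
    {"$group": {"_id": "$query", "count": {"$sum": 1}}},
    {"$sort": {"count": -1}},
    {"$limit": 50},
])
for entry in top_queries:
    print(entry["_id"], entry["count"])
```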
Train the click models with 1-100 sessions and 10 trials. The files with the click model parameters are named by the scheme <click-model>.<sessions>.<trial>.json, e.g., ./experimental_results/dctr/params/dctr.100.1.json for the DCTR model with 100 sessions for the 1st trial:
python train.py
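For orientation, training a click model with PyClick roughly follows the pattern below. The toy session is purely illustrative; train.py builds the sessions from the TripClick logs stored in MongoDB, and the import paths should be checked against the installed PyClick version.

```python
from pyclick.click_models.CTR import DCTR
from pyclick.search_session.SearchSession import SearchSession
from pyclick.search_session.SearchResult import SearchResult

# A toy search session: a query, a ranked result list, and the observed clicks.
# Real sessions are parsed from the TripClick logs; this one is illustrative only.
session = SearchSession("example query")
session.web_results = [
    SearchResult("doc1", 1),  # clicked
    SearchResult("doc2", 0),
    SearchResult("doc3", 0),
]

# Train a DCTR click model on a list of such sessions.
click_model = DCTR()
click_model.train([session])
print(click_model)
```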
Please note that we cannot share the click model parameters publicly, as the TripClick license does not allow sharing statistical models or any further resources created from the dataset. However, after obtaining the TripClick dataset, they can be recreated with the commands above.
Determine the Jaccard similarity between the first 20 results for each of the 50 queries for the IRM and LRM systems and write them to the CSV files ./experimental_results/jacc.irm.csv and ./experimental_results/jacc.lrm.csv. Afterwards, plot them as heatmaps:
python jacc_irm.py && python jacc_lrm.py && python jacc_plt.py
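The Jaccard similarity between two top-k result lists is simply the size of their intersection divided by the size of their union; as a minimal illustration:

```python
def jaccard(results_a, results_b):
    """Jaccard similarity between two result lists (e.g., the top-20 documents)."""
    a, b = set(results_a), set(results_b)
    return len(a & b) / len(a | b)

# Two top-5 rankings sharing three documents: 3 / 7 ≈ 0.43.
print(jaccard(["d1", "d2", "d3", "d4", "d5"],
              ["d2", "d3", "d5", "d6", "d7"]))
```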
Evaluate the interpolated retrieval method (IRM) and the lexical retrieval methods (LRM) based on the editorial relevance judgments of TripJudge. Afterwards, make bar plots for P@10, nDCG, and AP:
python benchmark_irm.py && python benchmark_lrm.py && python benchmark_bar.py
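For orientation, a run can be evaluated against qrels such as the TripJudge judgments with ir_measures; this is only a sketch, and benchmark_irm.py/benchmark_lrm.py may use a different evaluation library. The toy qrels and run below are assumptions.

```python
import ir_measures
from ir_measures import AP, P, nDCG

# qrels and run in the nested-dict form accepted by ir_measures:
# {query_id: {doc_id: relevance}} and {query_id: {doc_id: score}}.
# The values are illustrative; the real qrels come from TripJudge.
qrels = {"q1": {"d1": 2, "d2": 0, "d3": 1}}
run = {"q1": {"d1": 1.2, "d3": 1.0, "d2": 0.4}}

print(ir_measures.calc_aggregate([P@10, nDCG, AP], qrels, run))
```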
Determine the log-likelihood for the IRM and LRM systems evaluated by different click models over an increasing number of sessions (from 1 to 100 sessions). The scripts output CSV files whose names depend on the click model and the number of queries, e.g., ./experimental_results/dctr/ll/dctr.ll.irm.50.csv for the DCTR model with 50 queries:
python ll_irm.py && python ll_lrm.py
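Conceptually, the log-likelihood measures how well a trained click model M predicts the observed clicks of a set of sessions S; up to normalization details, it is the session-averaged log probability of each observed click given the clicks at the preceding ranks:

$$
\mathcal{LL}(M) = \frac{1}{|\mathcal{S}|} \sum_{s \in \mathcal{S}} \frac{1}{|s|} \sum_{r=1}^{|s|} \log P_M\!\left(C_r = c_r^{(s)} \mid c_1^{(s)}, \ldots, c_{r-1}^{(s)}\right)
$$

Here, $c_r^{(s)}$ is the observed click (0 or 1) at rank $r$ of session $s$; values closer to zero indicate that the click model explains the recorded interactions better.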
Plot the log-likelihood over 100 sessions for all click models and the IRM and LRM systems:
python ll_plt.py
Plot the Kendall's tau heatmaps based on the log-likelihood over different numbers of queries and sessions:
python ll_ktau.py
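Kendall's tau is computed between two rankings of the compared systems (e.g., the ranking induced by the click-model-based evaluation and a reference ranking); with SciPy this boils down to:

```python
from scipy.stats import kendalltau

# Two rankings of the same five systems, given as rank positions or scores.
# The values are illustrative only.
reference_ranking = [1, 2, 3, 4, 5]
click_model_ranking = [2, 1, 3, 5, 4]

tau, p_value = kendalltau(reference_ranking, click_model_ranking)
print(tau, p_value)
```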
Determine the outcomes of the interleaving experiments for the IRM and LRM systems:
python outcome_irm.py && python outcome_lrm.py
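For background, interleaving experiments merge the result lists of two systems and credit each (simulated) click to the system that contributed the clicked document. The sketch below shows team-draft interleaving with a simple per-impression outcome; it is an illustrative example only and not necessarily the interleaving variant or outcome measure implemented in outcome_irm.py and outcome_lrm.py.

```python
import random

def team_draft_interleave(ranking_a, ranking_b, k=20):
    """Team-draft interleaving of two rankings (illustrative sketch)."""
    a, b = list(ranking_a), list(ranking_b)
    interleaved, team_a, team_b = [], set(), set()
    while len(interleaved) < k and (a or b):
        # The team with fewer picks goes next; ties are broken randomly.
        pick_a = len(team_a) < len(team_b) or (
            len(team_a) == len(team_b) and random.random() < 0.5)
        ranking, team = (a, team_a) if pick_a else (b, team_b)
        if not ranking:
            ranking, team = (b, team_b) if pick_a else (a, team_a)
        # Skip documents that already occur in the interleaved list.
        while ranking and ranking[0] in interleaved:
            ranking.pop(0)
        if not ranking:
            continue
        doc = ranking.pop(0)
        interleaved.append(doc)
        team.add(doc)
    return interleaved, team_a, team_b

def outcome(clicked_docs, team_a, team_b):
    """Positive if system A wins the impression, negative if B wins, zero for a tie."""
    wins_a = sum(1 for doc in clicked_docs if doc in team_a)
    wins_b = sum(1 for doc in clicked_docs if doc in team_b)
    return (wins_a > wins_b) - (wins_a < wins_b)
```

Per-impression outcomes of this kind can then be aggregated over the simulated sessions to decide which system wins the experiment.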
Plot the bar chart that compares the outcomes of the IRM and LRM interleaving experiments:
python outcome_bar.py
Plot the heatmap with the Jaccard similarities between winning and losing queries:
python jacc_queries_irm.py
Plot the Kendall's tau heatmaps based on the outcome measure and the relative error of Kendall's tau over an increasing number of sessions:
python outcome_ktau.py && python outcome_rel_err.py
Directory | Description |
---|---|
./src/ | Contains the Python scripts that are used to reproduce the experiments and figures. |
./data/ | Place the TripClick session logs here (the uncompressed logs.tar.gz). |
./experimental_results/ | Contains the experimental data and figures. |
Script | Description | Output |
---|---|---|
util.py | Contains different utility functions. | - |
start_db.py | Creates and starts a Docker Container running a MongoDB instance. | Docker container running MongoDB |
stop_db.py | Stops the MongoDB Docker container and optionally removes it. | - |
index_tripclick.py | Indexes the TripClick dataset with PyTerrier/ir_datasets. | Writes index files to disk, default directory: ./indices/tripclick/ |
index_logs.py | Parses the session logs and writes them into a MongoDB instance. | - |
topk_queries.py | Determines the top-k queries and writes them to a CSV file. | CSV file containing the top-k queries, e.g., ./experimental_results/train.head.50.csv. |
train.py | Trains a click model with m sessions and n trials. | File with the click model parameters, named by the scheme <click-model>.<sessions>.<trial>.json, e.g., ./experimental_results/dctr/params/dctr.100.1.json for the DCTR model with 100 sessions for the 1st trial. |
jacc_irm.py | Determines the Jaccard similarity between the first 20 results for each of the 50 queries for the IRM systems and writes them to a CSV file. | ./experimental_results/jacc.irm.csv |
jacc_lrm.py | Determines the Jaccard similarity between the first 20 results for each of the 50 queries for the LRM systems and writes them to a CSV file. | ./experimental_results/jacc.lrm.csv |
jacc_plt.py | Plots the Jaccard similarity in a heatmap. | ./experimental_results/figures/jacc_sim.pdf |
benchmark_irm.py | Evaluates the interpolated retrieval method (IRM) based on the editorial relevance judgments of TripJudge. | ./experimental_results/benchmark.irm.tripjudge.2.grade.csv |
benchmark_lrm.py | Evaluates the lexical retrieval methods (LRM) based on the editorial relevance judgments of TripJudge. | ./experimental_results/benchmark.lrm.tripjudge.2.grade.csv |
benchmark_bar.py | Plots and compares IRM and LRM systems. | ./experimental_results/figures/benchmarks.tripjudge.pdf |
ll_irm.py | Determines the log-likelihood for the IRM systems evaluated by different click models over an increasing number of sessions (from 1 to 100 sessions). | Outputs different CSV files depending on the click model and the number of queries, e.g., ./experimental_results/dctr/ll/dctr.ll.irm.50.csv for the DCTR model with 50 queries. |
ll_lrm.py | Determines the log-likelihood for the LRM systems evaluated by different click models over an increasing number of sessions (from 1 to 100 sessions). | Outputs different CSV files depending on the click model and the number of queries, e.g., dctr.ll.lrm.50.csv for the DCTR model with 50 queries. |
ll_plt.py | Plots the log-likelihood over an increasing number of sessions for either 5 or 50 queries for the IRM and LRM systems. | dctr.dcm.sdbn.irm.5.50.ll.pdf and dctr.dcm.sdbn.lrm.5.50.ll.pdf |
ll_ktau.py | Plots the Kendall's tau heatmaps based on the log-likelihood. | dctr.ktau.ll.heatmaps.pdf and dcm.sdbn.ktau.ll.heatmaps.pdf |
outcome_irm.py | Determines the outcomes of interleaving experiments for the IRM systems. | dctr.outcome.irm.50.csv |
outcome_lrm.py | Determines the outcomes of interleaving experiments for the LRM systems. | dctr.outcome.lrm.50.csv |
outcome_bar.py | Plots the bar chart that compares the outcomes of the IRM and LRM interleaving experiments. | bar.plots.outcome.50.pdf |
outcome_ktau.py | Plots the Kendall's tau heatmaps based on the outcome measure. | ktau.outcome.heatmaps.pdf |
outcome_rel_err.py | Plots the relative error of Kendall's tau over an increasing number of sessions. | dctr.sdbn.dcm.rel.err.50.pdf |