ACORDAR 2.0 is a test collection for ad hoc content-based dataset retrieval, which is the task of answering a keyword query with a ranked list of datasets. Keywords may refer to the metadata and/or the data of each dataset. Compared with ACORDAR 1.0, ACORDAR 2.0 adds two dense retrieval models for pooling and evaluation, as well as question-style versions of the keyword queries. For details about this test collection, please refer to the paper cited at the end of this README.
We reused the 31,589 RDF datasets from ACORDAR 1.0, which were collected from 540 data portals. The "./Data/datasets.json" file provides the ID and metadata of each dataset in JSON format, structured as shown below. Each dataset can be downloaded via the links in the "download" field; we recommend using Apache Jena to parse the downloaded RDF files.
We also provide deduplicated RDF data files in N-Triples format for each dataset, available on Zenodo.
{
  "datasets": [
    {
      "license": "...",
      "download": "...",
      "size": "...",
      "author": "...",
      "created": "...",
      "dataset_id": "...",
      "description": "...",
      "title": "...",
      "version": "...",
      "updated": "...",
      "tags": "..."
    },
    ...
  ]
}
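For quick inspection without an RDF toolkit, the metadata file can be read directly. The following is a minimal sketch, assuming the field layout shown above and a working directory at the repository root:

```python
import json

# Read the dataset metadata described above.
with open("./Data/datasets.json", encoding="utf-8") as f:
    datasets = json.load(f)["datasets"]

print(len(datasets), "datasets")
for d in datasets[:3]:
    # The "download" field holds the link(s) for fetching the RDF dump of this dataset,
    # which can then be parsed with Apache Jena as recommended above.
    print(d["dataset_id"], "-", d["title"], "|", d["download"])
```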
The "./Data/all_queries.txt" file provides 510 keyword queries. Each row represents a query with two tab-separated columns: query_id and query_text. The queries can be divided into synthetic queries created by our human annotators ("./Data/synthetic_queries.txt") and TREC queries imported from the ad hoc topics (titles) used in the English Test Collections of TREC 1-8 ("./Data/trec_queries.txt").
The "./Data/question_queries.json" file provides 1,377 question queries, grouped under their corresponding keyword queries as shown below.
[
{
"query_id": "...",
"query_text": "...",
"split": "...",
"questions": [
"...",
"..."
]
},
...
]
The "./Data/qrels.txt" file contains 19,340 qrels in TREC's qrels format, one per row. Each row has four tab-separated columns: query_id, iteration (always zero and not used), dataset_id, and relevance (0: irrelevant; 1: partially relevant; 2: highly relevant).
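A minimal sketch for loading the queries and qrels into Python dictionaries, assuming the tab-separated layouts described above and no header rows (the synthetic and TREC query files can be read the same way):

```python
import csv
from collections import defaultdict

# queries[query_id] = query_text
queries = {}
with open("./Data/all_queries.txt", encoding="utf-8") as f:
    for query_id, query_text in csv.reader(f, delimiter="\t"):
        queries[query_id] = query_text

# qrels[query_id][dataset_id] = graded relevance (0, 1, or 2)
qrels = defaultdict(dict)
with open("./Data/qrels.txt", encoding="utf-8") as f:
    for query_id, _iteration, dataset_id, relevance in csv.reader(f, delimiter="\t"):
        qrels[query_id][dataset_id] = int(relevance)

print(len(queries), "queries,", sum(len(v) for v in qrels.values()), "qrels")
```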
To make evaluation results comparable, one should use the train-valid-test splits that we provide for five-fold cross-validation. The "./Data/Splits for Cross Validation" folder has five sub-folders. In each sub-folder we provide three qrel files as the training, validation, and test sets, respectively.
We have evaluated four sparse retrieval models: (1) TF-IDF based cosine similarity, (2) BM25, (3) language model with Dirichlet-prior smoothing (LMD), and (4) Fielded Sequential Dependence Model (FSDM); and two dense retrieval models: (5) Dense Passage Retrieval (DPR) and (6) Contextualized Late Interaction over BERT (ColBERT). The sparse models were run over an inverted index of four metadata fields (title, description, author, tags) and four data fields (literals, classes, properties, entities); the dense models were run over pseudo metadata documents and (sampled) data documents. In each fold, for each sparse model, we merged the training and validation sets and performed grid search to tune its field weights from 0 to 1 in 0.1 increments, using NDCG@10 as the optimization target. The dense models were fine-tuned in the standard way on the training and validation sets.
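The toy sketch below illustrates the field-weight tuning loop (it is not the authors' code; see ./Code/sparse/fieldWeightsTuing for the actual implementation). It grid-searches weights for just two hypothetical fields over made-up per-field scores and judgments, and keeps the combination that maximizes mean NDCG@10 (linear-gain variant):

```python
import itertools
from math import log2

def ndcg_at_k(ranking, rels, k=10):
    """NDCG@k for one query; `rels` maps dataset_id -> graded relevance (0/1/2)."""
    dcg = sum(rels.get(d, 0) / log2(i + 2) for i, d in enumerate(ranking[:k]))
    ideal = sorted(rels.values(), reverse=True)[:k]
    idcg = sum(g / log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg else 0.0

# Toy per-field scores: scores[query_id][dataset_id][field] = score of that field.
scores = {
    "q1": {"d1": {"title": 2.0, "literals": 0.5},
           "d2": {"title": 0.3, "literals": 1.8},
           "d3": {"title": 1.0, "literals": 1.0}},
}
# Toy graded judgments on the tuning split: query_id -> dataset_id -> 0/1/2.
qrels = {"q1": {"d1": 2, "d3": 1}}

fields = ["title", "literals"]
grid = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0

best_weights, best_ndcg = None, -1.0
for combo in itertools.product(grid, repeat=len(fields)):
    w = dict(zip(fields, combo))
    per_query = []
    for qid, cand in scores.items():
        # Rank candidates by the weighted sum of their per-field scores.
        ranking = sorted(cand, reverse=True,
                         key=lambda d: sum(w[f] * cand[d].get(f, 0.0) for f in fields))
        per_query.append(ndcg_at_k(ranking, qrels.get(qid, {}), k=10))
    mean_ndcg = sum(per_query) / len(per_query)
    if mean_ndcg > best_ndcg:
        best_weights, best_ndcg = w, mean_ndcg

print("best weights:", best_weights, "NDCG@10:", round(best_ndcg, 4))
```

The real tuning runs over the eight index fields listed above and, in each fold, over the merged training and validation sets.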
The "./Baselines" folder provides the output of each baseline method in TREC's results format. Below we show the mean evaluation results over the test sets in all five folds. One can use trec_eval for evaluation.
| Model | NDCG@5 | NDCG@10 | MAP@5 | MAP@10 |
| --- | --- | --- | --- | --- |
| TF-IDF | 0.4572 | 0.4605 | 0.1920 | 0.2654 |
| BM25 | 0.5067 | 0.5020 | 0.2134 | 0.2910 |
| LMD | 0.4725 | 0.4783 | 0.2105 | 0.2848 |
| FSDM | 0.5222 | 0.5078 | 0.2395 | 0.3080 |
| DPR | 0.3597 | 0.3469 | 0.1452 | 0.1809 |
| ColBERT | 0.2788 | 0.2676 | 0.1133 | 0.1387 |
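The run files under ./Baselines use TREC's standard six-column run format (query_id, Q0, dataset_id, rank, score, run_tag), so a new system can be evaluated in the same way. A minimal sketch for writing such a run file from a toy ranking:

```python
# Toy ranking: query_id -> list of (dataset_id, score), sorted by descending score.
run = {"q1": [("d1", 12.3), ("d3", 9.8), ("d2", 4.1)]}

with open("my_run.txt", "w", encoding="utf-8") as f:
    for query_id, ranked in run.items():
        for rank, (dataset_id, score) in enumerate(ranked, start=1):
            # Columns: query_id  Q0  dataset_id  rank  score  run_tag
            f.write(f"{query_id}\tQ0\t{dataset_id}\t{rank}\t{score}\tmy_model\n")
```

Such a file can then be scored against ./Data/qrels.txt with trec_eval, e.g. using its ndcg_cut and map_cut measures to obtain the metrics reported above.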
All the source code of our implementation is provided in ./Code. Dependencies:
- JDK 8+
- Apache Lucene 8.7.0
- Python 3.6
- torch 1.10
- Inverted Index: Each RDF dataset was stored as a pseudo document in an inverted index consisting of eight fields. See the code in ./Code/sparse/indexing for details.
  - Four metadata fields: title, description, tags, and author.
  - Four data fields: classes, properties, entities, and literals.
- Sparse Retrieval Models: We implemented TF-IDF, BM25, LMD, and FSDM on top of Apache Lucene 8.7.0. See the code in ./Code/sparse/models for details.
- Field Weights Tuning: For each sparse model, we performed grid search to tune its field weights from 0 to 1 in 0.1 increments, using NDCG@10 as the optimization objective. See the code in ./Code/sparse/fieldWeightsTuing for details. The field weights used for pooling are stored in ./Code/sparse/pooling-field-weights.txt.
- Retrieval Experiments: We evaluated all four sparse models on ACORDAR 2.0. See the code in ./Code/sparse/experiment for details.
- Triples Extraction: We used IlluSnip to sample the content of each RDF dataset. See ./Code/dense/preprocess/README.md for details.
- Pseudo Documents: To apply the dense models to RDF datasets, we created two pseudo documents for each dataset: a metadata document concatenating the human-readable information in its metadata, and a data document concatenating the human-readable forms of the subject, predicate, and object of each sampled RDF triple (see the sketch after this list). See ./Code/dense/preprocess/README.md for details.
- Training and Retrieval: Both dense models (DPR and ColBERT) were implemented on the basis of their original source code. See ./Code/dense/DPR/README.md and ./Code/dense/ColBERT/README.md for details.
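As a rough illustration of the pseudo documents mentioned above (this is not the authors' preprocessing code, which lives in ./Code/dense/preprocess), the sketch below builds the two documents for one dataset from toy metadata and toy sampled triples; the label() helper is a simplified way of obtaining a human-readable form of an IRI or literal:

```python
def label(term: str) -> str:
    """Crude human-readable form: the local name of an IRI, or the literal itself."""
    if term.startswith("http://") or term.startswith("https://"):
        return term.rstrip("/").rsplit("/", 1)[-1].rsplit("#", 1)[-1].replace("_", " ")
    return term.strip('"')

# Toy metadata and toy sampled triples for one dataset (made up for this example).
metadata = {
    "title": "Air Quality 2020",
    "description": "Hourly air quality measurements.",
    "author": "Example Agency",
    "tags": "air;quality;environment",
}
sampled_triples = [
    ("http://example.org/Station_42",
     "http://www.w3.org/2000/01/rdf-schema#label",
     '"Station 42"'),
    ("http://example.org/Station_42",
     "http://example.org/locatedIn",
     "http://example.org/Springfield"),
]

# Metadata document: concatenate the human-readable metadata values.
metadata_doc = " ".join(metadata.values())
# Data document: concatenate the human-readable forms of s, p, o of each sampled triple.
data_doc = " ".join(label(t) for triple in sampled_triples for t in triple)

print(metadata_doc)
print(data_doc)
```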
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
@inproceedings{ACORDAR2,
author = {Qiaosheng Chen and
Weiqing Luo and
Zixian Huang and
Tengteng Lin and
Xiaxia Wang and
Ahmet Soylu and
Basil Ell and
Baifan Zhou and
Evgeny Kharlamov and
Gong Cheng},
title = {{ACORDAR} 2.0: {A} Test Collection for Ad Hoc Dataset Retrieval with
Densely Pooled Datasets and Question-Style Queries},
booktitle = {Proceedings of the 47th International {ACM} {SIGIR} Conference on
Research and Development in Information Retrieval, {SIGIR} 2024, Washington
DC, USA, July 14-18, 2024},
publisher = {{ACM}},
year = {2024},
url = {https://doi.org/10.1145/3626772.3657866},
doi = {10.1145/3626772.3657866}
}