Answering Sentence Selection

When doing QA on unstructured data, we need to identify which of the sentences in the retrieved search results contain the answer. This specific problem of Answering Sentence Selection has been studied extensively on its own; see:

http://aclweb.org/aclwiki/index.php?title=Question_Answering_(State_of_the_art)

Dataset

There is a reference dataset by Wang et al., 2007, with a technically nicer version available from Jacana (Yao et al., 2013). The dataset has a smaller, manually tagged train split and a larger, automatically tagged train-all split. It seems to be quite problematic and unbalanced: the splitting was not done randomly, and the test split is much easier than the train split and differs in character as well.

We have therefore built our own dataset, based on Solr enwiki searches for question clues derived from the curated factoid dataset by v1.1 (XXX redo with v1.2?). TODO publish

Baseline Approach [v1.2]

Each search result produces three sentences. The sentence score is |clues| + |aboutclues|/4, i.e. the number of clues matched within the sentence; aboutclues are clues that also occur in the document title, and these are weighted down by a factor of four.

Code references: PassScoreSimple
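A minimal sketch of this scoring rule (hypothetical standalone code, not the actual PassScoreSimple implementation; clue matching is reduced to a naive case-insensitive substring test):

import java.util.List;

/** Hypothetical sketch of the baseline |clues| + |aboutclues|/4 sentence score. */
class SentenceScorerSketch {
    /**
     * @param sentence   candidate sentence from a search result
     * @param clues      question clues not occurring in the document title
     * @param aboutClues question clues that also occur in the document title
     */
    static double score(String sentence, List<String> clues, List<String> aboutClues) {
        String s = sentence.toLowerCase();
        double score = 0;
        for (String c : clues)
            if (s.contains(c.toLowerCase()))
                score += 1.0;    // full weight for ordinary clues
        for (String c : aboutClues)
            if (s.contains(c.toLowerCase()))
                score += 0.25;   // title ("about") clues weighted 1/4
        return score;
    }
}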

Investigated Alternatives

We have reproduced the main results of Wang and Ittycheriah, 2015, who use a classifier based on averaged word2vec embeddings of the question and the answering passage. This lives in the f/sentence-selection branch.
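As a rough illustration of the averaged-word2vec idea (hypothetical code; it assumes embeddings are already loaded into a map, and reduces the classifier input to a single cosine-similarity feature, which is a simplification of the actual setup):

import java.util.List;
import java.util.Map;

/** Hypothetical sketch: averaged word2vec representation of a token sequence. */
class AvgEmbeddingSketch {
    static float[] average(List<String> tokens, Map<String, float[]> word2vec, int dim) {
        float[] avg = new float[dim];
        int n = 0;
        for (String t : tokens) {
            float[] v = word2vec.get(t.toLowerCase());
            if (v == null) continue;   // skip out-of-vocabulary words
            for (int i = 0; i < dim; i++) avg[i] += v[i];
            n++;
        }
        if (n > 0) for (int i = 0; i < dim; i++) avg[i] /= n;
        return avg;
    }

    /** Cosine similarity of the averaged question and passage vectors,
     *  one possible input feature for the selection classifier. */
    static double cosine(float[] a, float[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return (na == 0 || nb == 0) ? 0 : dot / Math.sqrt(na * nb);
    }
}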

The classifier experiments etc. live in

https://github.com/brmson/Sentence-selection

Initial work on this did not succeed in improving end-to-end quality.

TODO task-specific benchmarks

v1.2 Merged Passages Experiments

Baseline: PassFilter aggregating all search results. We use 36 wiki passages; technically, we should have used 18 for equivalence with the old data. Curated: APR 80.7%, MRR 0.431. Large2180: APR 77.5%, MRR 0.400. So APR gets a lot better, but precision suffers; in general, noise is high. (In the logs below, the three counts are correctly answered / answer-recalled / total questions, followed by the corresponding percentages, of which the second is the APR figure quoted here, then MRR and the average per-question time.)
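Since the results below are largely compared by MRR, here is a minimal sketch of how mean reciprocal rank is computed (hypothetical helper; assumes ranks[i] holds the 1-based rank of the first correct answer for question i, or 0 when none was found):

/** Hypothetical sketch: mean reciprocal rank over a question set. */
class MrrSketch {
    static double mrr(int[] ranks) {
        double sum = 0;
        for (int r : ranks)
            if (r > 0)
                sum += 1.0 / r;   // unanswered questions contribute 0
        return sum / ranks.length;
    }
}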

curated-test  efcdf3e 2015-09-06 PassFilter comparato... 135/292/430 31.4%/67.9% mrr 0.409 avgtime 2882.092
curated-test uefcdf3e 2015-09-06 PassFilter comparato... 146/347/430 34.0%/80.7% mrr 0.431 avgtime 2602.012
curated-test vefcdf3e 2015-09-06 PassFilter comparato... 139/292/430 32.3%/67.9% mrr 0.413 avgtime 2810.312
curated-trai  efcdf3e 2015-09-06 PassFilter comparato... 312/317/430 72.6%/73.7% mrr 0.731 avgtime 3428.575
curated-trai uefcdf3e 2015-09-06 PassFilter comparato... 193/349/430 44.9%/81.2% mrr 0.535 avgtime 3063.320
curated-trai vefcdf3e 2015-09-06 PassFilter comparato... 281/317/430 65.3%/73.7% mrr 0.691 avgtime 3324.276
large2180-te  8c4c2b6 2015-09-07 AnswerScoreDecisionF... 231/455/694 33.3%/65.6% mrr 0.421 avgtime 4479.787
large2180-te u8c4c2b6 2015-09-07 AnswerScoreDecisionF... 215/538/694 31.0%/77.5% mrr 0.400 avgtime 4073.270
large2180-te v8c4c2b6 2015-09-07 AnswerScoreDecisionF... 233/455/694 33.6%/65.6% mrr 0.419 avgtime 4383.052
large2180-te uefcdf3e 2015-09-06 PassFilter comparato... 157/362/455 34.5%/79.6% mrr 0.426 avgtime 11.181
large2180-tr  efcdf3e 2015-09-06 PassFilter comparato... 736/933/1479 49.8%/63.1% mrr 0.552 avgtime 10803.541
large2180-tr uefcdf3e 2015-09-06 PassFilter comparato... 469/1093/1479 31.7%/73.9% mrr 0.398 avgtime 9807.870
large2180-tr vefcdf3e 2015-09-06 PassFilter comparato... 592/933/1479 40.0%/63.1% mrr 0.481 avgtime 10569.594

Experiment with 18 instead of 36 filtered passages, to check for an overfitting issue; there was none:

curated-test  b977952 2015-09-06 PassFilter: 36 -> 18... 125/281/430 29.1%/65.3% mrr 0.391 avgtime 2116.870
curated-test ub977952 2015-09-06 PassFilter: 36 -> 18... 137/336/430 31.9%/78.1% mrr 0.413 avgtime 1890.348
curated-test vb977952 2015-09-06 PassFilter: 36 -> 18... 127/281/430 29.5%/65.3% mrr 0.397 avgtime 2068.418
curated-trai  b977952 2015-09-06 PassFilter: 36 -> 18... 309/316/430 71.9%/73.5% mrr 0.726 avgtime 2374.899
curated-trai ub977952 2015-09-06 PassFilter: 36 -> 18... 200/340/430 46.5%/79.1% mrr 0.547 avgtime 2088.684
curated-trai vb977952 2015-09-06 PassFilter: 36 -> 18... 278/316/430 64.7%/73.5% mrr 0.684 avgtime 2302.559

word2vec Inclusion

(Word-embedding-based selection without passages merged across results was tested long ago; it also didn't work well.)

PassageLogScore using word embeddings (f/sentence-selection branch), 36 wiki passages. Curated: APR 79.5%, MRR 0.424:

curated-test  3ca8dde 2015-09-07 PassageLogScore is n... 158/287/430 36.7%/66.7% mrr 0.443 avgtime 5583.548
curated-test u3ca8dde 2015-09-07 PassageLogScore is n... 149/342/430 34.7%/79.5% mrr 0.424 avgtime 5351.174
curated-test v3ca8dde 2015-09-07 PassageLogScore is n... 152/287/430 35.3%/66.7% mrr 0.437 avgtime 5523.453
curated-trai  3ca8dde 2015-09-07 PassageLogScore is n... 298/309/430 69.3%/71.9% mrr 0.705 avgtime 6067.949
curated-trai u3ca8dde 2015-09-07 PassageLogScore is n... 183/345/430 42.6%/80.2% mrr 0.506 avgtime 5771.900
curated-trai v3ca8dde 2015-09-07 PassageLogScore is n... 261/309/430 60.7%/71.9% mrr 0.656 avgtime 5987.042

Word embeddings, using 18 wiki passages instead of 36. Curated: APR 76.3%, MRR 0.409:

curated-test  d8c1cdf 2015-09-09 Reverted to old pass... 141/288/430 32.8%/67.0% mrr 0.413 avgtime 4004.076
curated-test ud8c1cdf 2015-09-09 Reverted to old pass... 140/328/430 32.6%/76.3% mrr 0.409 avgtime 3800.090
curated-test vd8c1cdf 2015-09-09 Reverted to old pass... 145/288/430 33.7%/67.0% mrr 0.418 avgtime 3943.449
curated-trai  d8c1cdf 2015-09-09 Reverted to old pass... 293/305/430 68.1%/70.9% mrr 0.694 avgtime 4247.339
curated-trai ud8c1cdf 2015-09-09 Reverted to old pass... 169/334/430 39.3%/77.7% mrr 0.486 avgtime 3987.276
curated-trai vd8c1cdf 2015-09-09 Reverted to old pass... 260/305/430 60.5%/70.9% mrr 0.650 avgtime 4165.975

Word embeddings, 36 wiki passages, disabling 76 features with usage under 1%. Curated: APR 79.5%, MRR 0.399:

curated-test  c98dfad 2015-09-10 Reverted to 36 wiki ... 142/280/430 33.0%/65.1% mrr 0.415 avgtime 5521.780
curated-test uc98dfad 2015-09-10 Reverted to 36 wiki ... 133/342/430 30.9%/79.5% mrr 0.399 avgtime 5289.513
curated-test vc98dfad 2015-09-10 Reverted to 36 wiki ... 136/280/430 31.6%/65.1% mrr 0.409 avgtime 5462.999
curated-trai  c98dfad 2015-09-10 Reverted to 36 wiki ... 251/313/430 58.4%/72.8% mrr 0.639 avgtime 6057.289
curated-trai uc98dfad 2015-09-10 Reverted to 36 wiki ... 163/345/430 37.9%/80.2% mrr 0.476 avgtime 5767.036
curated-trai vc98dfad 2015-09-10 Reverted to 36 wiki ... 261/313/430 60.7%/72.8% mrr 0.653 avgtime 5978.567

Conclusion: this approach falls short in two respects:

  • A merged set of passages across results is worse than considering the top N passages of each result (see the sketch after this list).

  • Selection based on word embeddings does not work well in the end-to-end system, even though task-specific MRR measurements suggest it should.
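To make the first point concrete, here is a hypothetical sketch contrasting the two pooling strategies (the Passage type and field names are illustrative, not YodaQA's actual classes):

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Illustrative passage with a relevance score. */
class Passage {
    String text;
    double score;
}

class PassagePoolingSketch {
    /** Merge all passages across results and keep the global top k
     *  (the strategy that turned out worse). */
    static List<Passage> mergedTopK(List<List<Passage>> perResult, int k) {
        List<Passage> all = new ArrayList<>();
        for (List<Passage> r : perResult)
            all.addAll(r);
        all.sort(Comparator.comparingDouble((Passage p) -> p.score).reversed());
        return all.subList(0, Math.min(k, all.size()));
    }

    /** Keep the top n passages of each result independently
     *  (the strategy that works better end-to-end). */
    static List<Passage> topNPerResult(List<List<Passage>> perResult, int n) {
        List<Passage> kept = new ArrayList<>();
        for (List<Passage> r : perResult) {
            List<Passage> sorted = new ArrayList<>(r);
            sorted.sort(Comparator.comparingDouble((Passage p) -> p.score).reversed());
            kept.addAll(sorted.subList(0, Math.min(n, sorted.size())));
        }
        return kept;
    }
}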

Other Ideas

  • A CNN instead of simple averaging for word2vec (a rough sketch of what this means follows below).

  • Include signal from the deep parse tree, which we also have available. Possibly a TreeRNN.

  • Figure out a good way to match clues that can't be easily embedded (like proper names).

  • Use the word2vec alignment to build an attention model for answer production, possibly adding it as a signal to the BIO-tagger CRF.
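For the first idea, a minimal sketch of replacing averaging with a one-layer convolution plus max-pooling over the token embedding sequence (hypothetical code; the filter weights would have to be learned, and real experiments would use a proper NN library):

/** Hypothetical sketch: 1-D convolution + max-pooling over token embeddings,
 *  producing a fixed-size sentence vector instead of a plain average. */
class ConvPoolSketch {
    /**
     * @param tokens  token embeddings, [sentenceLength][dim]
     * @param filters learned filters, [numFilters][filterWidth][dim]
     * @return one max-pooled activation per filter
     */
    static float[] encode(float[][] tokens, float[][][] filters) {
        float[] pooled = new float[filters.length];  // stays zero if the sentence is shorter than a filter
        for (int f = 0; f < filters.length; f++) {
            int width = filters[f].length;
            for (int start = 0; start + width <= tokens.length; start++) {
                float act = 0;
                for (int w = 0; w < width; w++)
                    for (int d = 0; d < filters[f][w].length; d++)
                        act += filters[f][w][d] * tokens[start + w][d];
                act = Math.max(0, act);                // ReLU
                pooled[f] = Math.max(pooled[f], act);  // max over positions
            }
        }
        return pooled;
    }
}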