Entity Linking
A specific task is identifying concepts in the question and linking them to Wikipedia articles. This task is commonly referred to as "entity linking" (or "concept mapping" in some contexts). It involves two sub-problems: (i) recognition (NER), i.e. finding a string that refers to some concept (is "how i met your mother" a concept? is "partner" a concept? is "mrs. obama" a concept?), and (ii) disambiguation (NED), i.e. figuring out which specific concept it refers to (is "oscar" an award, a movie, Oscar Wilde, ...?).
We measure performance on this task against a gold-standard dataset built from the moviesC train split (moviesC-concepts, in dataset-factoid-movies), created by Nguyen Hoang Long. The metrics are micro/macro precision and recall of the linked entities. Micro metrics are typically used during task development, while macro metrics are useful for evaluating whole-system impact.
XXX: Once we start doing any serious tuning and machine learning, we should split the dataset into devtest, trainmodel and val splits.
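To make the evaluation concrete, here is a minimal sketch of the micro (per-entity) vs. macro (per-question) precision/recall computation; it is an illustration, not the actual concept_linking_performance.py code:

```python
def prf(tp, fp, fn):
    """Precision, recall, F1 from raw counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def evaluate(questions):
    """questions: list of (predicted, gold) entity sets, one pair per question."""
    # micro: pool entity counts over the whole dataset
    tp = sum(len(pred & gold) for pred, gold in questions)
    fp = sum(len(pred - gold) for pred, gold in questions)
    fn = sum(len(gold - pred) for pred, gold in questions)
    micro = prf(tp, fp, fn)
    # macro: average the per-question precision/recall/F1
    per_q = [prf(len(p & g), len(p - g), len(g - p)) for p, g in questions]
    macro = tuple(sum(m[i] for m in per_q) / len(per_q) for i in range(3))
    return micro, macro
```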
On v1.2, the impact of entity linking as measured on moviesC-concepts:
:: answer MRR per entity error type (wΔ is MRR drop against correct weighted by question proportion)
MRR for questions with exact match of concepts: 0.622
MRR for questions with superfluous concepts: 0.512 (wΔ 0.023)
MRR for questions with missing concepts: 0.427 (wΔ 0.019)
MRR for questions with no concepts at all: 0.007 (wΔ 0.058)
total MRR: 0.521
FIXME update for master
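The wΔ values above can be reproduced from the per-type MRR drop weighted by the fraction of questions falling into that error type; a sketch, assuming the fractions from the per-question statistics reported below apply to this run:

```python
# wΔ = (MRR for exact-match questions - MRR for this error type)
#      * fraction of questions falling into this error type
mrr_exact = 0.622
error_types = {            # (MRR, fraction of questions of this type)
    "superfluous": (0.512, 0.20849),
    "missing":     (0.427, 0.09963),
    "no concepts": (0.007, 0.09410),
}
for name, (mrr, frac) in error_types.items():
    print(name, round((mrr_exact - mrr) * frac, 3))
# -> superfluous 0.023, missing 0.019, no concepts 0.058
```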
Extra set of entity labels: the CrossWikis dataset (as used in Bast and Haussmann, 2015). It contains many aliases for entities, including the probability of each mapping (disregarding context).
Performance (concept_linking_performance.py): TODO
Generate clues - all nouns, NPs, NEs, the SV, LAT and Subject. Each clue is looked up in DBpedia via fuzzy lookup with edit distance ≤ 3. Whitespace, punctuation and possessive-'s edits cost less than 1; a case change costs a fixed 0.5. No disambiguation is performed; all matched articles (of the same edit distance) are considered. In case of exact matches, sub-strings of a match are not searched again.
The fuzzy search does not compute an exhaustive edit distance, just a sorted-list approximation. In general, this baseline is inspired by Yao: Lean Question Answering over Freebase from Scratch (2015).
Code references: CluesToConcepts, DBpediaTitles, label-lookup.
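For illustration, a rough sketch of the weighted edit-cost scheme described above; the exact costs and helper names are assumptions, the real implementation lives in label-lookup and DBpediaTitles:

```python
import string

# Illustrative costs only; the real lookup is in label-lookup / DBpediaTitles.
CHEAP = set(string.whitespace) | set(string.punctuation)  # plus 's edits in the real scheme

def edit_cost(a, b):
    """Weighted Levenshtein: <1 for whitespace/punctuation edits,
    0.5 for a pure case change, 1.0 otherwise."""
    def sub(x, y):
        if x == y:
            return 0.0
        return 0.5 if x.lower() == y.lower() else 1.0
    def indel(x):
        return 0.25 if x in CHEAP else 1.0
    prev = [0.0]
    for j, y in enumerate(b, 1):
        prev.append(prev[j - 1] + indel(y))
    for x in a:
        cur = [prev[0] + indel(x)]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + indel(x),        # delete x from a
                           cur[j - 1] + indel(y),     # insert y
                           prev[j - 1] + sub(x, y)))  # substitute x -> y
        prev = cur
    return prev[-1]

def fuzzy_matches(clue, titles, max_dist=3):
    """All article titles within the threshold; no disambiguation is done."""
    return [t for t in titles if edit_cost(clue, t) <= max_dist]
```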
Performance (concept_linking_performance.py):
:: per-question statistics (macro measure)
exact match: 324 questions (59.779%)
partial_match (extra): 113 questions (20.849%)
partial_match (missing): 54 questions (9.963%)
not found: 51 questions (9.410%)
precision 68.144%, recall 86.285%, F1 70.291%
:: per-entity statistics (micro measure)
precision: 619/2890, 21.419%
recall: 619/721, 85.853%
F1 34.284%
We should create a blacklist of common but never appropriate concepts (like "name").
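A trivial sketch of such a blacklist filter ("name" is the example from the text above; any longer word list, such as the clue-label list used in a later experiment, is an assumption):

```python
# Illustrative only; "name" is the one example given in the text.
CONCEPT_BLACKLIST = {"name"}

def filter_clues(clues):
    """Drop clues whose label is common but never an appropriate concept."""
    return [c for c in clues if c.lower() not in CONCEPT_BLACKLIST]
```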
We plan to add an interactive entity selection step to the QA process as an optional part (mainly for the web interface).
ClueWeb as processed by Google (Bast and Haussmann, 2015) also contains popularity info.
We should take question context into account, probably using semantic enrichment (vector embeddings of entities built from their descriptions).
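A minimal sketch of what such semantic enrichment could look like; everything here (embedding source, dimensionality, field names) is an assumption:

```python
import numpy as np

def embed(text, word_vectors, dim=300):
    """Average word embedding of a text (hypothetical word_vectors dict)."""
    vecs = [word_vectors[w] for w in text.lower().split() if w in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def rank_candidates(question, candidates, word_vectors):
    """Re-rank candidate entities by cosine similarity between the question
    and each entity's description (abstract)."""
    q = embed(question, word_vectors)
    def cos(a, b):
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        return float(a @ b / denom) if denom else 0.0
    return sorted(candidates,
                  key=lambda c: cos(q, embed(c["abstract"], word_vectors)),
                  reverse=True)
```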
Custom CRF: Training our own CRF for NER might be interesting and low-effort. We already have a CRF implementation for answer production anyway. Compared to many other systems, we can leverage deep parse data. This would help only with NER, not NED.
STAGG (2015 state-of-the-art on WebQuestions): uses S-MART, a proprietary NER/NED system by Microsoft. Not open source. http://www.cc.gatech.edu/~yyang319/download/yang-acl-2015.pdf
FOX: http://fox-demo.aksw.org/#!/home - an open-source framework for NER+NED. The demo is slow. AGDISTIS may be an option for NED on its own, but a few informal tries were not convincing - wrong and slow.
IBM Watson: Statistical true-casing; custom deep parser (ESG) and relation extraction to predicate-argument structure (PAS); arguments are entities. Unclear how disambiguation is done.
TODO clean up and sort links below.
General entity linking reading list: http://nlp.cs.rpi.edu/kbp/2014/elreading.html
The academically investigated entity linking task most similar to our interactive QA seems to be entity linking in tweets.
NEEL challenge: http://ceur-ws.org/Vol-1395/microposts2015_neel-challenge-report/microposts2015_neel-challenge-report.pdf
Overview of recent available systems: https://github.com/AKSW/gerbil/wiki/Available-APIs-and-approaches
Other papers (TODO read and decide):
- http://nlp.cs.rpi.edu/paper/elemnlp15.pdf
- http://nlp.cs.rpi.edu/paper/edl2014overview.pdf
- http://nlp.cs.rpi.edu/paper/amrel.pdf
- http://nlp.cs.rpi.edu/paper/morphdecoding15.pdf
Using the CrossWikis dataset in a web API, we obtain the correct label and its probability given the query string. We tried several strategies on the moviesC-train dataset.
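The strategies below essentially vary three knobs: the score used for sorting (DBpedia popularity, CrossWikis probability, ...), how many concepts are kept per clue, and how many are kept overall after sorting. Schematically (a sketch; field names are assumptions):

```python
def select_concepts(candidates, score_key="popularity", per_clue=None, top_n=None):
    """candidates: dicts with at least a 'clue' field and the score_key field.
    Keep up to per_clue concepts for each clue, then up to top_n overall
    after sorting by the chosen score."""
    by_clue = {}
    for c in sorted(candidates, key=lambda c: c[score_key], reverse=True):
        by_clue.setdefault(c["clue"], []).append(c)
    kept = []
    for group in by_clue.values():
        kept.extend(group if per_clue is None else group[:per_clue])
    kept.sort(key=lambda c: c[score_key], reverse=True)
    return kept if top_n is None else kept[:top_n]
```

For example, `select_concepts(cands, score_key="prob", per_clue=1, top_n=5)` would correspond to "sorted by CrossWiki probability, 1 concept per clue, take top5 concepts after sort".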
The reference questionDump using the latest (21.9.2015) yodaQA pull and live label lookup, sorted by dbpedia popularity
:: per-question statistics (macro measure)
exact match: 251 questions (46.310%)
partial_match (extra): 154 questions (28.413%)
partial_match (missing): 68 questions (12.546%)
not found: 69 questions (12.731%)
precision 57.873%, recall 81.857%, F1 60.845%
:: per-entity statistics (micro measure)
precision: 585/3520, 16.619%
recall: 585/721, 81.137%
F1 27.588%
The reference questionDump, but we only take the top 5 global concepts, sorted by dbpedia popularity
:: per-question statistics (macro measure)
exact match: 251 questions (46.310%)
partial_match (extra): 119 questions (21.956%)
partial_match (missing): 75 questions (13.838%)
not found: 97 questions (17.897%)
precision 58.964%, recall 76.076%, F1 62.255%
:: per-entity statistics (micro measure)
precision: 537/1295, 41.467%
recall: 537/721, 74.480%
F1 53.274%
The sqlite label lookup, without capitalization, sorted by dbpedia popularity (we set edit distance to 0), All concepts per clue, take all concepts after sort
dist = 0
score = popularity
all concepts per clue
:: per-question statistics (macro measure)
exact match: 135 questions (24.908%)
partial_match (extra): 172 questions (31.734%)
partial_match (missing): 84 questions (15.498%)
not found: 146 questions (26.937%)
precision 41.930%, recall 72.817%, F1 46.205%
:: per-entity statistics (micro measure)
precision: 524/2892, 18.119%
recall: 524/721, 72.677%
F1 29.006%
The sqlite label lookup, with capitalization, sorted by dbpedia popularity (we set edit distance to 0), All concepts per clue, take all concepts after sort
dist = 0
score = popularity
:: per-question statistics (macro measure)
exact match: 183 questions (33.764%)
partial_match (extra): 198 questions (36.531%)
partial_match (missing): 61 questions (11.255%)
not found: 92 questions (16.974%)
precision 50.895%, recall 88.653%, F1 56.069%
:: per-entity statistics (micro measure)
precision: 643/3379, 19.029%
recall: 643/721, 89.182%
F1 31.366%
The sqlite label lookup, with capitalization, sorted by dbpedia popularity (we set edit distance to 0), All concepts per clue, take top5 concepts after sort
dist = 0
score = popularity
all concepts per clue
:: per-question statistics (macro measure)
exact match: 183 questions (33.764%)
partial_match (extra): 159 questions (29.336%)
partial_match (missing): 81 questions (14.945%)
not found: 111 questions (20.480%)
precision 52.114%, recall 80.658%, F1 57.672%
:: per-entity statistics (micro measure)
precision: 573/1432, 40.014%
recall: 573/721, 79.473%
F1 53.228%
The sqlite label lookup, with capitalization, sorted by dbpedia popularity (we set edit distance to 0), 1 concept per clue, take top5 concepts after sort
1 concept per clue
:: per-question statistics (macro measure)
exact match: 274 questions (50.554%)
partial_match (extra): 69 questions (12.731%)
partial_match (missing): 71 questions (13.100%)
not found: 126 questions (23.247%)
precision 65.293%, recall 70.572%, F1 66.174%
:: per-entity statistics (micro measure)
precision: 506/692, 73.121%
recall: 506/721, 70.180%
F1 71.621%
The sqlite label lookup, with capitalization, sorted by CrossWiki probability (we set edit distance to 0), 1 concept per clue, take top5 concepts after sort
dist = 0
score = probability
1 concept per clue
:: per-question statistics (macro measure)
exact match: 274 questions (50.554%)
partial_match (extra): 69 questions (12.731%)
partial_match (missing): 72 questions (13.284%)
not found: 125 questions (23.063%)
precision 65.252%, recall 70.664%, F1 66.131%
:: per-entity statistics (micro measure)
precision: 506/708, 71.469%
recall: 506/721, 70.180%
F1 70.819%
The sqlite label lookup, with capitalization, sorted by 1 - probability (we set edit distance to 0), 1 concept per clue, take top5 concepts after sort
dist = 0
score = 1 - probability
1 concept per clue
:: per-question statistics (macro measure)
exact match: 274 questions (50.554%)
partial_match (extra): 63 questions (11.624%)
partial_match (missing): 68 questions (12.546%)
not found: 135 questions (24.908%)
precision 64.986%, recall 69.188%, F1 65.689%
:: per-entity statistics (micro measure)
precision: 496/633, 78.357%
recall: 496/721, 68.793%
F1 73.264%
Sqlite + fuzzy label lookup, sorted by logistic regression classifier output, take top 5
:: per-question statistics (macro measure)
exact match: 251 questions (46.310%)
partial_match (extra): 124 questions (22.878%)
partial_match (missing): 79 questions (14.576%)
not found: 88 questions (16.236%)
precision 59.361%, recall 77.552%, F1 62.726%
:: per-entity statistics (micro measure)
precision: 547/1371, 39.898%
recall: 547/721, 75.867%
F1 52.294%
Sqlite + fuzzy label lookup, taking only concepts with classifier probability higher than 0.5
:: per-question statistics (macro measure)
exact match: 224 questions (41.328%)
partial_match (extra): 10 questions (1.845%)
partial_match (missing): 97 questions (17.897%)
not found: 211 questions (38.930%)
precision 59.314%, recall 52.460%, F1 54.125%
:: per-entity statistics (micro measure)
precision: 367/414, 88.647%
recall: 367/721, 50.902%
F1 64.670%
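The two classifier-based strategies above boil down to ranking and thresholding on a logistic regression output; a sketch with an assumed feature set (the actual concept classifier features are not spelled out here):

```python
from sklearn.linear_model import LogisticRegression
import numpy as np

def features(c):
    # Hypothetical per-candidate features: clue edit distance,
    # log popularity, CrossWikis label probability.
    return [c["edit_dist"], np.log1p(c["popularity"]), c["label_prob"]]

# Toy training data standing in for moviesC-train gold concept annotations.
X_train = [[0.0, 8.2, 0.90], [2.0, 3.1, 0.10], [0.5, 6.0, 0.70], [3.0, 1.0, 0.05]]
y_train = [1, 0, 1, 0]
clf = LogisticRegression().fit(X_train, y_train)

def rank_top5(candidates):
    """'sorted by logistic regression classifier output, take top 5'"""
    probs = clf.predict_proba([features(c) for c in candidates])[:, 1]
    return [candidates[i] for i in np.argsort(-probs)[:5]]

def threshold(candidates, p=0.5):
    """'only concepts with classifier probability higher than 0.5'"""
    probs = clf.predict_proba([features(c) for c in candidates])[:, 1]
    return [c for c, pr in zip(candidates, probs) if pr > p]
```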
At this point, we made a few bugfixes and started end-to-end experiments too. The baseline pipeline performance here is:
moviesC-test a770e5f 2015-08-21 Mark: label-lookup 1... 102/168/233 43.8%/72.1% mrr 0.509 avgtime 585.312
moviesC-test ua770e5f 2015-08-21 Mark: label-lookup 1... 95/176/233 40.8%/75.5% mrr 0.494 avgtime 447.181
moviesC-test va770e5f 2015-08-21 Mark: label-lookup 1... 104/168/233 44.6%/72.1% mrr 0.517 avgtime 530.785
moviesC-trai a770e5f 2015-08-21 Mark: label-lookup 1... 313/388/542 57.7%/71.6% mrr 0.629 avgtime 1463.521
moviesC-trai ua770e5f 2015-08-21 Mark: label-lookup 1... 240/399/542 44.3%/73.6% mrr 0.522 avgtime 1176.910
moviesC-trai va770e5f 2015-08-21 Mark: label-lookup 1... 287/388/542 53.0%/71.6% mrr 0.596 avgtime 1351.434
Top-5 per clue, concept score AF contains logPopularity:
moviesC-test 36942ad 2015-09-30 CluesToConcepts: Fil... 106/164/233 45.5%/70.4% mrr 0.525 avgtime 496.340
moviesC-test u36942ad 2015-09-30 CluesToConcepts: Fil... 106/176/233 45.5%/75.5% mrr 0.530 avgtime 361.547
moviesC-test v36942ad 2015-09-30 CluesToConcepts: Fil... 106/164/233 45.5%/70.4% mrr 0.529 avgtime 440.034
moviesC-trai 36942ad 2015-09-30 CluesToConcepts: Fil... 324/399/542 59.8%/73.6% mrr 0.649 avgtime 1198.090
moviesC-trai u36942ad 2015-09-30 CluesToConcepts: Fil... 260/406/542 48.0%/74.9% mrr 0.558 avgtime 907.253
moviesC-trai v36942ad 2015-09-30 CluesToConcepts: Fil... 300/399/542 55.4%/73.6% mrr 0.621 avgtime 1078.025
:: per-question statistics (macro measure)
exact match: 217 questions (40.037%)
partial_match (extra): 184 questions (33.948%)
partial_match (missing): 57 questions (10.517%)
not found: 66 questions (12.177%)
precision 60.901%, recall 95.172%, F2.0 78.279%
:: per-entity statistics (micro measure)
precision: 683/1467, 46.558%
recall: 683/721, 94.730%
F2.0 78.488%
:: answer MRR per entity error type (wΔ is MRR drop against correct weighted by question proportion)
MRR for questions with exact match of concepts: 0.709
MRR for questions with superfluous concepts: 0.588 (wΔ 0.041)
MRR for questions with missing concepts: 0.426 (wΔ 0.030)
MRR for questions with no concepts at all: 0.051 (wΔ 0.080)
total MRR: 0.558
Updating the score feature to contain classifier output:
moviesC-test 5d2c4c4 2015-09-30 Merge remote-trackin... 105/164/233 45.1%/70.4% mrr 0.521 avgtime 475.806
moviesC-test u5d2c4c4 2015-09-30 Merge remote-trackin... 102/176/233 43.8%/75.5% mrr 0.517 avgtime 330.033
moviesC-test v5d2c4c4 2015-09-30 Merge remote-trackin... 108/164/233 46.4%/70.4% mrr 0.524 avgtime 419.815
moviesC-trai 5d2c4c4 2015-09-30 Merge remote-trackin... 324/398/542 59.8%/73.4% mrr 0.645 avgtime 1108.730
moviesC-trai u5d2c4c4 2015-09-30 Merge remote-trackin... 260/406/542 48.0%/74.9% mrr 0.555 avgtime 817.373
moviesC-trai v5d2c4c4 2015-09-30 Merge remote-trackin... 298/398/542 55.0%/73.4% mrr 0.615 avgtime 987.876
:: per-question statistics (macro measure)
exact match: 217 questions (40.037%)
partial_match (extra): 184 questions (33.948%)
partial_match (missing): 57 questions (10.517%)
not found: 66 questions (12.177%)
precision 60.901%, recall 95.172%, F2.0 78.279%
:: per-entity statistics (micro measure)
precision: 683/1467, 46.558%
recall: 683/721, 94.730%
F2.0 78.488%
:: answer MRR per entity error type (wΔ is MRR drop against correct weighted by question proportion)
MRR for questions with exact match of concepts: 0.700
MRR for questions with superfluous concepts: 0.591 (wΔ 0.037)
MRR for questions with missing concepts: 0.415 (wΔ 0.030)
MRR for questions with no concepts at all: 0.056 (wΔ 0.078)
total MRR: 0.555
Top 3 is much worse.
Removing top5 restriction (BEST MRR):
moviesC-test ccf35e4 2015-09-30 CluesToConcepts: Pro... 104/170/233 44.6%/73.0% mrr 0.525 avgtime 630.622
moviesC-test uccf35e4 2015-09-30 CluesToConcepts: Pro... 106/180/233 45.5%/77.3% mrr 0.532 avgtime 485.724
moviesC-test vccf35e4 2015-09-30 CluesToConcepts: Pro... 103/170/233 44.2%/73.0% mrr 0.525 avgtime 575.579
moviesC-trai ccf35e4 2015-09-30 CluesToConcepts: Pro... 326/399/542 60.1%/73.6% mrr 0.651 avgtime 1573.871
moviesC-trai uccf35e4 2015-09-30 CluesToConcepts: Pro... 259/410/542 47.8%/75.6% mrr 0.555 avgtime 1257.217
moviesC-trai vccf35e4 2015-09-30 CluesToConcepts: Pro... 301/399/542 55.5%/73.6% mrr 0.619 avgtime 1453.219
:: per-question statistics (macro measure)
exact match: 217 questions (40.037%)
partial_match (extra): 190 questions (35.055%)
partial_match (missing): 53 questions (9.779%)
not found: 64 questions (11.808%)
precision 57.519%, recall 98.309%, F2.0 71.801%
:: per-entity statistics (micro measure)
precision: 712/3772, 18.876%
recall: 712/721, 98.752%
F2.0 53.486%
:: answer MRR per entity error type (wΔ is MRR drop against correct weighted by question proportion)
MRR for questions with exact match of concepts: 0.698
MRR for questions with superfluous concepts: 0.581 (wΔ 0.041)
MRR for questions with missing concepts: 0.447 (wΔ 0.024)
MRR for questions with no concepts at all: 0.042 (wΔ 0.077)
total MRR: 0.554
Top 5 with max-score per concept (pasky's choice):
moviesC-test c88c801 2015-10-01 CluesToConcepts: Use... 100/166/233 42.9%/71.2% mrr 0.509 avgtime 453.911
moviesC-test uc88c801 2015-10-01 CluesToConcepts: Use... 104/176/233 44.6%/75.5% mrr 0.525 avgtime 325.579
moviesC-test vc88c801 2015-10-01 CluesToConcepts: Use... 104/166/233 44.6%/71.2% mrr 0.522 avgtime 399.907
moviesC-trai c88c801 2015-10-01 CluesToConcepts: Use... 321/397/542 59.2%/73.2% mrr 0.645 avgtime 1107.993
moviesC-trai uc88c801 2015-10-01 CluesToConcepts: Use... 255/406/542 47.0%/74.9% mrr 0.549 avgtime 825.427
moviesC-trai vc88c801 2015-10-01 CluesToConcepts: Use... 298/397/542 55.0%/73.2% mrr 0.615 avgtime 990.393
:: per-question statistics (macro measure)
exact match: 217 questions (40.037%)
partial_match (extra): 184 questions (33.948%)
partial_match (missing): 57 questions (10.517%)
not found: 66 questions (12.177%)
precision 60.901%, recall 95.172%, F2.0 78.279%
:: per-entity statistics (micro measure)
precision: 683/1467, 46.558%
recall: 683/721, 94.730%
F2.0 78.488%
:: answer MRR per entity error type (wΔ is MRR drop against correct weighted by question proportion)
MRR for questions with exact match of concepts: 0.710
MRR for questions with superfluous concepts: 0.567 (wΔ 0.048)
MRR for questions with missing concepts: 0.409 (wΔ 0.032)
MRR for questions with no concepts at all: 0.049 (wΔ 0.081)
total MRR: 0.549
(We prefer this to the best-MRR version, as some of our other yet-unmerged work includes concept-sensitive components for which the lower precision could really hurt; plus we are a lot faster this way.)
Removing top5 restriction now:
moviesC-test fb58877 2015-10-01 CluesToConcepts: Pro... 102/170/233 43.8%/73.0% mrr 0.518 avgtime 623.777
moviesC-test ufb58877 2015-10-01 CluesToConcepts: Pro... 100/180/233 42.9%/77.3% mrr 0.519 avgtime 478.693
moviesC-test vfb58877 2015-10-01 CluesToConcepts: Pro... 103/170/233 44.2%/73.0% mrr 0.526 avgtime 568.874
moviesC-trai fb58877 2015-10-01 CluesToConcepts: Pro... 321/399/542 59.2%/73.6% mrr 0.647 avgtime 1552.976
moviesC-trai ufb58877 2015-10-01 CluesToConcepts: Pro... 257/410/542 47.4%/75.6% mrr 0.551 avgtime 1236.578
moviesC-trai vfb58877 2015-10-01 CluesToConcepts: Pro... 298/399/542 55.0%/73.6% mrr 0.615 avgtime 1432.990
:: per-question statistics (macro measure)
exact match: 217 questions (40.037%)
partial_match (extra): 190 questions (35.055%)
partial_match (missing): 53 questions (9.779%)
not found: 64 questions (11.808%)
precision 57.519%, recall 98.309%, F2.0 71.801%
:: per-entity statistics (micro measure)
precision: 712/3772, 18.876%
recall: 712/721, 98.752%
F2.0 53.486%
:: answer MRR per entity error type (wΔ is MRR drop against correct weighted by question proportion)
MRR for questions with exact match of concepts: 0.687
MRR for questions with superfluous concepts: 0.581 (wΔ 0.037)
MRR for questions with missing concepts: 0.448 (wΔ 0.023)
MRR for questions with no concepts at all: 0.041 (wΔ 0.076)
total MRR: 0.551
Instead of top5, include only concepts with p>0.5:
moviesC-test c78bb40 2015-10-01 CluesToConcepts: Ign... 81/135/233 34.8%/57.9% mrr 0.409 avgtime 298.411
moviesC-test uc78bb40 2015-10-01 CluesToConcepts: Ign... 83/145/233 35.6%/62.2% mrr 0.417 avgtime 185.678
moviesC-test vc78bb40 2015-10-01 CluesToConcepts: Ign... 80/135/233 34.3%/57.9% mrr 0.413 avgtime 246.232
moviesC-trai c78bb40 2015-10-01 CluesToConcepts: Ign... 264/321/542 48.7%/59.2% mrr 0.527 avgtime 676.335
moviesC-trai uc78bb40 2015-10-01 CluesToConcepts: Ign... 202/325/542 37.3%/60.0% mrr 0.437 avgtime 441.667
moviesC-trai vc78bb40 2015-10-01 CluesToConcepts: Ign... 242/321/542 44.6%/59.2% mrr 0.496 avgtime 570.315
:: per-question statistics (macro measure)
exact match: 271 questions (50.000%)
partial_match (extra): 15 questions (2.768%)
partial_match (missing): 88 questions (16.236%)
not found: 166 questions (30.627%)
precision 66.302%, recall 61.654%, F2.0 61.679%
:: per-entity statistics (micro measure)
precision: 433/482, 89.834%
recall: 433/721, 60.055%
F2.0 64.320%
:: answer MRR per entity error type (wΔ is MRR drop against correct weighted by question proportion)
MRR for questions with exact match of concepts: 0.665
MRR for questions with superfluous concepts: 0.641 (wΔ 0.001)
MRR for questions with missing concepts: 0.441 (wΔ 0.036)
MRR for questions with no concepts at all: 0.045 (wΔ 0.190)
total MRR: 0.437
Concepts with p>0.25:
moviesC-test d1d8279 2015-10-01 CluesToConcepts: Ign... 87/146/233 37.3%/62.7% mrr 0.446 avgtime 319.162
moviesC-test ud1d8279 2015-10-01 CluesToConcepts: Ign... 91/157/233 39.1%/67.4% mrr 0.454 avgtime 200.610
moviesC-test vd1d8279 2015-10-01 CluesToConcepts: Ign... 93/146/233 39.9%/62.7% mrr 0.464 avgtime 265.448
moviesC-trai d1d8279 2015-10-01 CluesToConcepts: Ign... 297/355/542 54.8%/65.5% mrr 0.588 avgtime 740.900
moviesC-trai ud1d8279 2015-10-01 CluesToConcepts: Ign... 224/362/542 41.3%/66.8% mrr 0.486 avgtime 489.878
moviesC-trai vd1d8279 2015-10-01 CluesToConcepts: Ign... 270/355/542 49.8%/65.5% mrr 0.555 avgtime 628.950
:: per-question statistics (macro measure)
exact match: 293 questions (54.059%)
partial_match (extra): 38 questions (7.011%)
partial_match (missing): 76 questions (14.022%)
not found: 130 questions (23.985%)
precision 69.843%, recall 69.926%, F2.0 68.834%
:: per-entity statistics (micro measure)
precision: 496/589, 84.211%
recall: 496/721, 68.793%
F2.0 71.408%
:: answer MRR per entity error type (wΔ is MRR drop against correct weighted by question proportion)
MRR for questions with exact match of concepts: 0.678
MRR for questions with superfluous concepts: 0.501 (wΔ 0.012)
MRR for questions with missing concepts: 0.465 (wΔ 0.030)
MRR for questions with no concepts at all: 0.060 (wΔ 0.148)
total MRR: 0.486
Let's come back to pasky's choice. An interesting issue is the reduction in score when changing the concept score AF to the classifier output. Unfortunately, adding the (logPop, labelProb) concept features overfits:
moviesC-test c6fb40e 2015-10-01 AF: +ConceptLogPop, ... 101/165/233 43.3%/70.8% mrr 0.509 avgtime 467.276
moviesC-test uc6fb40e 2015-10-01 AF: +ConceptLogPop, ... 101/176/233 43.3%/75.5% mrr 0.512 avgtime 336.097
moviesC-test vc6fb40e 2015-10-01 AF: +ConceptLogPop, ... 106/165/233 45.5%/70.8% mrr 0.527 avgtime 411.957
moviesC-trai c6fb40e 2015-10-01 AF: +ConceptLogPop, ... 323/399/542 59.6%/73.6% mrr 0.650 avgtime 1139.667
moviesC-trai uc6fb40e 2015-10-01 AF: +ConceptLogPop, ... 252/406/542 46.5%/74.9% mrr 0.547 avgtime 851.701
moviesC-trai vc6fb40e 2015-10-01 AF: +ConceptLogPop, ... 296/399/542 54.6%/73.6% mrr 0.612 avgtime 1016.868
Also, adding RR of the source clue (sorted by produced concept scores) overfits:
moviesC-test 23620f9 2015-10-02 Merge remote-trackin... 98/166/233 42.1%/71.2% mrr 0.502 avgtime 473.172
moviesC-test u23620f9 2015-10-02 Merge remote-trackin... 97/176/233 41.6%/75.5% mrr 0.505 avgtime 338.977
moviesC-test v23620f9 2015-10-02 Merge remote-trackin... 102/166/233 43.8%/71.2% mrr 0.509 avgtime 416.514
moviesC-trai 23620f9 2015-10-02 Merge remote-trackin... 321/398/542 59.2%/73.4% mrr 0.644 avgtime 1112.087
moviesC-trai u23620f9 2015-10-02 Merge remote-trackin... 260/406/542 48.0%/74.9% mrr 0.554 avgtime 819.984
moviesC-trai v23620f9 2015-10-02 Merge remote-trackin... 297/398/542 54.8%/73.4% mrr 0.610 avgtime 989.856
The above plus same-label concepts sharing the same RR rank:
moviesC-test c2adc54 2015-10-02 CluesToConcepts Labe... 108/166/233 46.4%/71.2% mrr 0.520 avgtime 458.249
moviesC-test uc2adc54 2015-10-02 CluesToConcepts Labe... 99/176/233 42.5%/75.5% mrr 0.502 avgtime 325.615
moviesC-test vc2adc54 2015-10-02 CluesToConcepts Labe... 105/166/233 45.1%/71.2% mrr 0.515 avgtime 402.197
moviesC-trai c2adc54 2015-10-02 CluesToConcepts Labe... 314/397/542 57.9%/73.2% mrr 0.636 avgtime 1104.468
moviesC-trai uc2adc54 2015-10-02 CluesToConcepts Labe... 256/406/542 47.2%/74.9% mrr 0.549 avgtime 813.521
moviesC-trai vc2adc54 2015-10-02 CluesToConcepts Labe... 297/397/542 54.8%/73.2% mrr 0.612 avgtime 983.578
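For reference, a sketch of how the source-clue RR with same-label rank sharing might look (field names are assumptions):

```python
def assign_concept_rr(concepts, score_key="score"):
    """Reciprocal rank of each concept after sorting by score, with
    concepts that carry the same label sharing the same rank."""
    ranked = sorted(concepts, key=lambda c: c[score_key], reverse=True)
    rank_of_label = {}
    for c in ranked:
        rank_of_label.setdefault(c["label"], len(rank_of_label) + 1)
        c["rr"] = 1.0 / rank_of_label[c["label"]]
    return ranked
```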
Verdict: Merged into master!
The new baseline is:
moviesC-test fb0865b 2015-10-07 Merge branch 'master... 105/165/233 45.1%/70.8% mrr 0.520 avgtime 590.383
moviesC-test ufb0865b 2015-10-07 Merge branch 'master... 98/176/233 42.1%/75.5% mrr 0.500 avgtime 451.362
moviesC-test vfb0865b 2015-10-07 Merge branch 'master... 107/165/233 45.9%/70.8% mrr 0.525 avgtime 535.190
moviesC-trai fb0865b 2015-10-07 Merge branch 'master... 316/389/542 58.3%/71.8% mrr 0.634 avgtime 1565.001
moviesC-trai ufb0865b 2015-10-07 Merge branch 'master... 243/399/542 44.8%/73.6% mrr 0.524 avgtime 1218.373
moviesC-trai vfb0865b 2015-10-07 Merge branch 'master... 286/389/542 52.8%/71.8% mrr 0.594 avgtime 1438.145
The main entity-linking version is now pasky's choice (top 5 concepts per clue, sorted by score) with same-label concepts sharing the same RR rank and including RR of the source clue:
moviesC-test ab04e7d 2015-10-02 CluesToConcepts Labe... 101/165/233 43.3%/70.8% mrr 0.511 avgtime 435.469
moviesC-test uab04e7d 2015-10-02 CluesToConcepts Labe... 100/175/233 42.9%/75.1% mrr 0.510 avgtime 306.597
moviesC-test vab04e7d 2015-10-02 CluesToConcepts Labe... 101/165/233 43.3%/70.8% mrr 0.512 avgtime 380.044
moviesC-trai ab04e7d 2015-10-02 CluesToConcepts Labe... 328/400/542 60.5%/73.8% mrr 0.657 avgtime 1133.952
moviesC-trai uab04e7d 2015-10-02 CluesToConcepts Labe... 250/406/542 46.1%/74.9% mrr 0.547 avgtime 812.428
moviesC-trai vab04e7d 2015-10-02 CluesToConcepts Labe... 292/400/542 53.9%/73.8% mrr 0.613 avgtime 1003.900
Relative to this baseline, we have retested the following...
Hold-out test with disabled same-label rank-sharing:
moviesC-test c4818d0 2015-10-07 Merge branch 'f/labe... 101/166/233 43.3%/71.2% mrr 0.508 avgtime 435.964
moviesC-test uc4818d0 2015-10-07 Merge branch 'f/labe... 97/175/233 41.6%/75.1% mrr 0.502 avgtime 307.855
moviesC-test vc4818d0 2015-10-07 Merge branch 'f/labe... 101/166/233 43.3%/71.2% mrr 0.511 avgtime 380.892
moviesC-trai c4818d0 2015-10-07 Merge branch 'f/labe... 320/397/542 59.0%/73.2% mrr 0.644 avgtime 1151.758
moviesC-trai uc4818d0 2015-10-07 Merge branch 'f/labe... 260/406/542 48.0%/74.9% mrr 0.556 avgtime 828.326
moviesC-trai vc4818d0 2015-10-07 Merge branch 'f/labe... 296/397/542 54.6%/73.2% mrr 0.612 avgtime 1021.199
Hold-out test with disabled RR of the source clue:
moviesC-test 9de9f9f 2015-10-07 Revert "ConceptRr ->... 96/163/233 41.2%/70.0% mrr 0.495 avgtime 437.958
moviesC-test u9de9f9f 2015-10-07 Revert "ConceptRr ->... 95/175/233 40.8%/75.1% mrr 0.493 avgtime 310.174
moviesC-test v9de9f9f 2015-10-07 Revert "ConceptRr ->... 99/163/233 42.5%/70.0% mrr 0.504 avgtime 382.783
moviesC-trai 9de9f9f 2015-10-07 Revert "ConceptRr ->... 323/398/542 59.6%/73.4% mrr 0.650 avgtime 1147.611
moviesC-trai u9de9f9f 2015-10-07 Revert "ConceptRr ->... 256/406/542 47.2%/74.9% mrr 0.553 avgtime 826.921
moviesC-trai v9de9f9f 2015-10-07 Revert "ConceptRr ->... 294/398/542 54.2%/73.4% mrr 0.611 avgtime 1016.627
Adding the (logPop, labelProb) concept features:
moviesC-test c6d061d 2015-10-01 AF: +ConceptLogPop, ... 97/165/233 41.6%/70.8% mrr 0.507 avgtime 445.602
moviesC-test uc6d061d 2015-10-01 AF: +ConceptLogPop, ... 99/175/233 42.5%/75.1% mrr 0.501 avgtime 316.490
moviesC-test vc6d061d 2015-10-01 AF: +ConceptLogPop, ... 97/165/233 41.6%/70.8% mrr 0.503 avgtime 390.746
moviesC-trai c6d061d 2015-10-01 AF: +ConceptLogPop, ... 319/398/542 58.9%/73.4% mrr 0.646 avgtime 1158.988
moviesC-trai uc6d061d 2015-10-01 AF: +ConceptLogPop, ... 250/406/542 46.1%/74.9% mrr 0.547 avgtime 836.163
moviesC-trai vc6d061d 2015-10-01 AF: +ConceptLogPop, ... 295/398/542 54.4%/73.4% mrr 0.610 avgtime 1027.906
Removing the top-5 restriction:
moviesC-test 1a63f90 2015-10-01 CluesToConcepts: Pro... 101/172/233 43.3%/73.8% mrr 0.516 avgtime 589.359
moviesC-test u1a63f90 2015-10-01 CluesToConcepts: Pro... 97/180/233 41.6%/77.3% mrr 0.501 avgtime 448.405
moviesC-test v1a63f90 2015-10-01 CluesToConcepts: Pro... 108/172/233 46.4%/73.8% mrr 0.534 avgtime 534.349
moviesC-trai 1a63f90 2015-10-01 CluesToConcepts: Pro... 323/400/542 59.6%/73.8% mrr 0.652 avgtime 1628.066
moviesC-trai u1a63f90 2015-10-01 CluesToConcepts: Pro... 250/410/542 46.1%/75.6% mrr 0.544 avgtime 1260.323
moviesC-trai v1a63f90 2015-10-01 CluesToConcepts: Pro... 288/400/542 53.1%/73.8% mrr 0.608 avgtime 1498.734
An extension - inhibiting entity linking on clue labels name|script|music|director|film|voice:
moviesC-test 9fd9e63 2015-10-09 ConceptClassifier: R... 101/162/233 43.3%/69.5% mrr 0.503 avgtime 442.820
moviesC-test u9fd9e63 2015-10-09 ConceptClassifier: R... 102/176/233 43.8%/75.5% mrr 0.514 avgtime 310.597
moviesC-test v9fd9e63 2015-10-09 ConceptClassifier: R... 105/162/233 45.1%/69.5% mrr 0.518 avgtime 387.072
moviesC-trai 9fd9e63 2015-10-09 ConceptClassifier: R... 321/397/542 59.2%/73.2% mrr 0.649 avgtime 1228.061
moviesC-trai u9fd9e63 2015-10-09 ConceptClassifier: R... 250/405/542 46.1%/74.7% mrr 0.543 avgtime 904.139
moviesC-trai v9fd9e63 2015-10-09 ConceptClassifier: R... 294/397/542 54.2%/73.2% mrr 0.611 avgtime 1098.522
curated baseline:
curated-test 542b74d 2015-10-07 Merge branch 'master... 139/287/430 32.3%/66.7% mrr 0.412 avgtime 2506.641
curated-test u542b74d 2015-10-07 Merge branch 'master... 148/332/430 34.4%/77.2% mrr 0.429 avgtime 2265.224
curated-test v542b74d 2015-10-07 Merge branch 'master... 137/287/430 31.9%/66.7% mrr 0.417 avgtime 2434.493
curated-trai 542b74d 2015-10-07 Merge branch 'master... 291/303/430 67.7%/70.5% mrr 0.689 avgtime 3183.894
curated-trai u542b74d 2015-10-07 Merge branch 'master... 181/332/430 42.1%/77.2% mrr 0.505 avgtime 2799.096
curated-trai v542b74d 2015-10-07 Merge branch 'master... 253/303/430 58.8%/70.5% mrr 0.639 avgtime 3072.476
curated main new version:
curated-test 1f0b793 2015-10-08 Merge branch 'f/labe... 153/282/430 35.6%/65.6% mrr 0.433 avgtime 2456.937
curated-test u1f0b793 2015-10-08 Merge branch 'f/labe... 147/333/430 34.2%/77.4% mrr 0.433 avgtime 2226.956
curated-test v1f0b793 2015-10-08 Merge branch 'f/labe... 153/282/430 35.6%/65.6% mrr 0.434 avgtime 2387.664
curated-trai 1f0b793 2015-10-08 Merge branch 'f/labe... 294/305/430 68.4%/70.9% mrr 0.694 avgtime 3316.498
curated-trai u1f0b793 2015-10-08 Merge branch 'f/labe... 173/334/430 40.2%/77.7% mrr 0.493 avgtime 2951.546
curated-trai v1f0b793 2015-10-08 Merge branch 'f/labe... 253/305/430 58.8%/70.9% mrr 0.641 avgtime 3206.849
TODO: webquestions retest baseline, new
On moviesE, d/movies (443920b), using moviesE-train-ovt-u66417f1.tsv:
:: per-question statistics (macro measure)
exact match: 303 questions (26.579%)
partial_match (extra): 745 questions (65.351%)
partial_match (missing): 40 questions (3.509%)
not found: 51 questions (4.474%)
precision 44.901%, recall 95.928%, F2.0 63.612%
:: per-entity statistics (micro measure)
precision: 1305/12209, 10.689%
recall: 1305/1368, 95.395%
F2.0 36.904%
:: answer MRR per entity error type (wΔ is MRR drop against correct weighted by question proportion)
MRR for questions with exact match of concepts: 0.830
MRR for questions with superfluous concepts: 0.756 (wΔ 0.048)
MRR for questions with missing concepts: 0.524 (wΔ 0.011)
MRR for questions with no concepts at all: 0.143 (wΔ 0.031)
total MRR: 0.741
DBpedia Spotlight is apparently a popular de facto reference in simple QA pipelines (e.g. some QALD and BioASQ contestants used it). It can do both NER and NED, either at once or separately.
Using DBpedia Spotlight should be pretty easy and might improve baseline significantly, though manual inspection shows that it also makes a lot of errors.
Outlook: Augmenting DBpedia Spotlight (open source) with our POS-tagging / deep parsing info (so that "win" is not linked to some baseball player with the given name "Win") might be interesting. We might also include word embeddings (almost ready). This should improve both NER and NED performance, but we first need to verify that no better openly available alternative exists.
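A rough sketch of the POS-based filtering idea; the Spotlight endpoint and response fields shown here should be treated as assumptions to verify against the deployed version:

```python
import requests

def spotlight_annotate(text, endpoint="http://localhost:2222/rest/annotate"):
    """Query a DBpedia Spotlight instance (endpoint and response fields are
    assumptions; check the deployed Spotlight version)."""
    resp = requests.get(endpoint,
                        params={"text": text, "confidence": 0.35},
                        headers={"Accept": "application/json"}).json()
    return resp.get("Resources", [])

def filter_by_pos(annotations, pos_tags):
    """Keep only annotations whose surface form overlaps a noun per our own
    POS tagging, so e.g. the verb "win" is not linked to a person "Win"."""
    nouns = {tok.lower() for tok, tag in pos_tags if tag.startswith("NN")}
    return [a for a in annotations
            if any(w.lower() in nouns for w in a["@surfaceForm"].split())]
```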
Performance (spotlight_performance.py):
:: per-question statistics
exact match: 90 questions
partial_match: 270 questions
not found: 182 questions
partial or exact match: 66.421% of questions
precision 40.999%, recall 57.472%, F1 45.425%
:: per-entity statistics
precision: 406/767, 52.934%
recall: 406/721, 56.311%
F1 54.570%
Verdict: Too weak recall, slow.