
Furthermore, one can indeed use an existing archive of query-answer pairs to explicitly handle repeated questions. There is an excellent pre-neural paper on this topic: https://dl.acm.org/doi/pdf/10.1145/1390334.1390416
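For intuition, here is a toy sketch of what such re-use could look like: answer a new question by retrieving the most similar archived question and returning its stored answer. This is just a TF-IDF nearest-neighbor stand-in of my own, not the method from the paper; all names here are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def build_matcher(archive):
    """archive: list of (question, answer) pairs from an existing QA log."""
    questions = [q for q, _ in archive]
    vectorizer = TfidfVectorizer().fit(questions)
    matrix = vectorizer.transform(questions)

    def answer(query):
        # Score the query against every archived question and return the
        # answer attached to the best match, plus the match score.
        sims = cosine_similarity(vectorizer.transform([query]), matrix)[0]
        best = int(sims.argmax())
        return archive[best][1], float(sims[best])

    return answer

# Usage: matcher = build_matcher(qa_pairs); ans, score = matcher("who wrote hamlet")
```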
Finding the data leak caused some people to suggest that we need a fundamentally different evaluation approach, such as human perception evaluation. This will solve no problem. We just need to better understand what the models actually learn from the data, and learn to detect data leaks.
Last but not least, I have recently written a script to detect leaks in my own random split of community QA data. Although a small fraction of questions do repeat (about 4%), the answers are nearly always *VERY* different (Jaccard < 0.25). I hope to run this on NQ & WebQuestions.
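For reference, here is a minimal sketch of what such a leak check might look like (I am not posting my actual script; the normalization, field layout, and the 0.25 threshold below are assumptions):

```python
def jaccard(a, b):
    """Token-level Jaccard similarity between two answer strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if (sa or sb) else 0.0

def find_leaks(train_pairs, test_pairs, answer_threshold=0.25):
    """Yield test questions that repeat in training AND whose answers
    are similar enough to count as a genuine leak."""
    train_index = {q.lower().strip(): a for q, a in train_pairs}
    for q, a in test_pairs:
        train_a = train_index.get(q.lower().strip())
        if train_a is None:
            continue  # question does not repeat across the split
        sim = jaccard(train_a, a)
        if sim >= answer_threshold:
            yield q, train_a, a, sim
```

With this kind of check, a repeated question whose answers have Jaccard < 0.25 would be treated as a benign repeat rather than a leak.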
PS: In conclusion, I would like to reiterate that a collection of existing QA pairs can certainly be re-used to improve matching (see the paper cited earlier in the thread). So @pat_verga is not fully correct IMHO when he says such a solution cannot generalize.