👇This is, indeed, a bit surprising. On second thought, real-world IR engines often see repeated queries (up to 50%), so performing well on these is clearly important. That said, for evaluation purposes we would like to handle the no-repeated-query case separately. https://twitter.com/PSH_Lewis/status/1291739650567045121
Furthermore, one can, indeed, use an existing archive of query-answer pairs to explicitly deal with repeatability. There is an excellent pre-neural paper on this topic: https://dl.acm.org/doi/pdf/10.1145/1390334.1390416
Finding the data leak has led some people to suggest that we need a fundamentally different evaluation approach, such as human perceptual evaluation. This alone will solve no problem. We just need to better understand what the models actually learn from data, and learn to detect data leaks.
Last but not least, I have recently written a script to detect leaks in my own random split of community QA data. Although a small fraction of questions do repeat (about 4%), the answers are nearly always *VERY* different (Jaccard < 0.25). I hope to run this on NQ & WebQuestions.
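For concreteness, here is a minimal Python sketch of the kind of leak check described above: find test questions that repeat a train question and measure the Jaccard similarity of their answers. The (question, answer) pair format, function names, threshold, and whitespace tokenization are my assumptions for illustration, not the author's actual script.

```python
def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the token sets of two strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def find_leaks(train_pairs, test_pairs, answer_threshold=0.25):
    """Count test questions that repeat a train question, and how many of
    those also have a sufficiently similar answer (a likely leak)."""
    train_by_question = {}
    for q, a in train_pairs:
        train_by_question.setdefault(q.lower().strip(), []).append(a)

    repeats, leaks = 0, 0
    for q, a in test_pairs:
        train_answers = train_by_question.get(q.lower().strip())
        if not train_answers:
            continue
        repeats += 1
        # Count a leak only if some train answer overlaps enough.
        if max(jaccard(a, ta) for ta in train_answers) >= answer_threshold:
            leaks += 1
    return repeats, leaks

if __name__ == "__main__":
    train = [("what is the capital of france", "Paris is the capital of France")]
    test = [("what is the capital of france", "The capital city is Paris")]
    print(find_leaks(train, test))  # (1, 1): one repeated question, one leak
```

With stricter matching one could normalize punctuation or use a better tokenizer; the point of the check is simply to separate repeated questions from repeated answers.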
PS: In conclusion, I would like to reiterate that an existing collection of QA pairs can certainly be re-used to improve matching (see the paper cited earlier in the thread). So @pat_verga is not fully correct, IMHO, when he says such a solution cannot generalize.