
Furthermore, one can indeed use an existing archive of query-answer pairs to explicitly handle repeated questions. There is an excellent pre-neural paper on this topic: https://dl.acm.org/doi/pdf/10.1145/1390334.1390416
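For intuition, here is a toy sketch of what such re-use could look like: answer a new question by retrieving the most similar archived question and returning its stored answer. This is just a TF-IDF nearest-neighbor stand-in of my own, not the method from the paper; all names here are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def build_matcher(archive):
    """archive: list of (question, answer) pairs from an existing QA log."""
    questions = [q for q, _ in archive]
    vectorizer = TfidfVectorizer().fit(questions)
    matrix = vectorizer.transform(questions)

    def answer(query):
        # Score the query against every archived question and return the
        # answer attached to the best match, plus the match score.
        sims = cosine_similarity(vectorizer.transform([query]), matrix)[0]
        best = int(sims.argmax())
        return archive[best][1], float(sims[best])

    return answer

# Usage: matcher = build_matcher(qa_pairs); ans, score = matcher("who wrote hamlet")
```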
Finding the data leak caused some people to suggest that we need a fundamentally different evaluation approach, such as human perception evaluation. This will solve no problem. We just need to better understand what the models actually learn from the data, and learn to detect data leaks.
Last but not least, I have recently written a script to detect leaks in my own random split of community QA data. Although a small fraction of questions do repeat (about 4%), the answers are nearly always *VERY* different (Jaccard < 0.25). I hope to run this on NQ & WebQuestions.
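For reference, here is a minimal sketch of what such a leak check might look like (I am not posting my actual script; the normalization, field layout, and the 0.25 threshold below are assumptions):

```python
def jaccard(a, b):
    """Token-level Jaccard similarity between two answer strings."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if (sa or sb) else 0.0

def find_leaks(train_pairs, test_pairs, answer_threshold=0.25):
    """Yield test questions that repeat in training AND whose answers
    are similar enough to count as a genuine leak."""
    train_index = {q.lower().strip(): a for q, a in train_pairs}
    for q, a in test_pairs:
        train_a = train_index.get(q.lower().strip())
        if train_a is None:
            continue  # question does not repeat across the split
        sim = jaccard(train_a, a)
        if sim >= answer_threshold:
            yield q, train_a, a, sim
```

With this kind of check, a repeated question whose answers have Jaccard < 0.25 would be treated as a benign repeat rather than a leak.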
PS: In conclusion, I would like to reiterate that a collection of existing QA pairs can certainly be re-used to improve matching (see the paper cited earlier in the thread). So @pat_verga is not fully correct IMHO when he says such a solution cannot generalize.