Hm. Aren't these just half-truths? At MS Translator we do both: we try to harvest as much free data as possible, and we create custom translations where required. What we have found in MT is that data creation for ML purposes is a surprising dead end, even when paying above-market rates to language service providers. The data is artificial, the translators seem demotivated by the lack of an actual real-world purpose, and quality remains low.
https://twitter.com/emilymbender/status/1353897774840864768

In contrast, harvested translations come from real communicative situations and tend to be high quality when selected from sources where the native language is fairly certain. The result is free-ish data that is actually higher quality than the data we paid so much money for, and orders of magnitude larger. Is that so unlikely to be the case for LM purposes?
The exception here is low-resource languages, where free data, or in fact any data, is so scarce that data creation almost immediately produces the biggest resource in existence for that language. That's usually totally worth it.
You can follow @marian_nmt.