It's been 1.25 years since we kicked off this African Language Dataset program, which to be honest did not start as something fully fledged. We've done a lot of 'feeling around in the dark' and fast forward to now, there's loads of output to share! https://twitter.com/siminyu_kat/status/1191879356370558976?s=20
After crowdsourcing data from pretty much anyone, we selected several from among the outstanding submissions and kicked off a fellowship program to support them to do more. You can read more about the why in this paper which was presented at WiNLP 2020.

https://arxiv.org/abs/2007.11865 
Key outputs are some pretty amazing datasets, research papers in the works and several ongoing challenges on Zindi with US$2000 each up as prize money.
From @davlanade and collaborators, we have MENYO-20k: A Multi-domain English - Yorùbá Corpus for MT. https://twitter.com/davlanade/status/1333758620123754499?s=20
And the Zindi challenge for this is open until April 12th. https://twitter.com/ZindiAfrica/status/1336203864249225216?s=20
The folks at @TakwimuLab also worked on MT datasets, for French-Ewe and French-Fongbe, and that Zindi competition is open till 26th April. https://twitter.com/TakwimuLab/status/1340994670944546816?s=20
The folks at iCompass worked on a Tunisian-Arabizi dataset for Sentiment Analysis and that Zindi competition is open till the 29th of March. https://twitter.com/ZindiAfrica/status/1349612215704322049?s=20
From @AT_poly_AI and her collaborators, we have a Chichewa News Classification dataset and a Zindi competition running till 10th May. https://twitter.com/ZindiAfrica/status/1352617080990797824?s=20
We also have Wolof TTS and ASR datasets courtesy of @bayethiernodiop and @baamtusarl and a Zindi challenge based on the ASR data will be up soon.
In addition to a Kiswahili News Classification dataset by @Davis_McDavid, a Twi MT dataset courtesy of @GhanaNLP, an MT dataset created from SA parliamentary documents for the 11 national languages courtesy of @Legend_Ari and @MaSelinga...
These are all already available or will soon be available on @ZENODO_ORG. Check out the African NLP community here -> https://zenodo.org/communities/africanlp/?page=1&size=20.

...the final key output is that sometimes you try things and they work, so keep trying.
I am keen to see more of a research focus that centres the African experience. Where are the MT works from AfricanLanguageX to AfricanLanguageY?
Who's working on better evaluation metrics, since we have ascertained that existing ones don't perform as expected on our languages?
What about more multi-disciplinary collaboration that can mean our NLP researchers aren't being bogged down by the task of data creation/scraping?
Hello Linguistics, Literature, Language Pedagogy departments, what do you have hiding in your siloed corner of the university?
And multi-disciplinary collaboration that can mean we are working with more contextually and culturally relevant data. Hello African studies, African folklore, Anthropology departments. We could be doing so much more...together.
And let's not forget the economic value. How do we make sure local startups, that pay taxes in our countries, that hire local people, are positioned to reap value from these technological tools? And that the net effect is an improved quality of life for people at home?
How can our governments plug into and support these efforts? You, African Gvt, really don't need to pay a Silicon Valley start-up to give you insights from satellite imagery. I promise, you don't. Not if you have an IndabaX community in your country.

But I digress now.
In conclusion, I repeat, the final key output is that sometimes you try things and they work, so keep trying.

Sending sunshine
🌥️🌥️🌥️🌤️🌤️🌤️☀️☀️☀️
You can follow @siminyu_kat.