1/ Innovation, access and safety demand that technology speak all of the world’s languages.

New work in @sciam with my inspiring friends/collaborators Moussa Doumbouya & @chrispiech. We will present the full paper at the @RealAAAI conference this week:
https://www.scientificamerican.com/article/why-ai-needs-to-be-able-to-understand-all-the-worlds-languages/

2/ Computer systems should adapt to the ways people - all people - use language. West Africans have spoken their languages for thousands of years, creating rich oral history traditions. Computers could easily support this oral tradition.


3/ Speech-based technology does exist; however, popular products do not “speak” any of the 2,000 languages & dialects spoken by Africans. Apple’s Siri, Google Assistant, & Amazon’s Alexa collectively support zero African languages.

4/ Because illiteracy tends to correlate with lack of schooling, and thus the inability to speak a common world language, speech technology is not available to those who need it the most. For them, such technology could bridge the gap between illiteracy and digital contributions.


5/ Why the gap? Languages spoken by smaller populations are often casualties of commercial prioritization.
Furthermore, the groups with power over technological goods tend to speak the same few languages, which makes it easy to overlook people from different backgrounds.

6/ Speakers of languages such as those widely spoken in West Africa are grossly underrepresented in the research labs, companies and universities that have historically developed speech-recognition technologies.


7/ All of these factors exacerbate another critical challenge: lack of data. Languages spoken by illiterate people who would most benefit from voice recognition tech tend to fall into the “low-resource” category: in contrast to “high-resource” languages, they have few available datasets.

8/ Moussa, Chris, and I are tackling this problem. We developed the first speech recognition models for Maninka, Pular, & Susu, languages spoken by a combined 10 million people in seven countries with up to 68 percent illiteracy. Full paper:
https://mdoumbouya.github.io/nicolingua.pdf


9/ We leveraged speech data that is abundantly available, even in low-resource languages: radio broadcasting archives.
We are releasing our datasets, code and models to the research community in hopes of catalyzing further efforts in these areas. https://github.com/mdoumbouya/nicolingua
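
The thread doesn’t spell out the modeling pipeline, but a common recipe for low-resource speech recognition is to learn self-supervised representations from unlabeled audio (such as radio archives) and fine-tune them on a small transcribed set. Below is a minimal sketch of that idea using torchaudio’s pretrained wav2vec 2.0 bundle; the file name "radio_clip.wav" is a hypothetical placeholder, and this illustrates the general technique, not the released nicolingua code.

```python
# Minimal sketch: extract self-supervised speech features from a radio clip
# with a pretrained wav2vec 2.0 model. Features like these can later be
# fine-tuned with a small amount of transcribed Maninka/Pular/Susu speech.
import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_BASE        # pretrained on unlabeled audio
model = bundle.get_model().eval()

# "radio_clip.wav" is a hypothetical segment from a broadcast archive.
waveform, sample_rate = torchaudio.load("radio_clip.wav")
if sample_rate != bundle.sample_rate:              # the model expects 16 kHz input
    waveform = torchaudio.functional.resample(waveform, sample_rate, bundle.sample_rate)

with torch.no_grad():
    features, _ = model.extract_features(waveform) # list of per-layer feature tensors

print(features[-1].shape)  # (batch, frames, 768): one vector per ~20 ms of audio
```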

10/ Computers are not yet sufficiently evolved to be useful in some societies. Our friends should not have to read and write a common language to contribute to scientific research, much less to merely interact with their smartphones. 


11/ Yes, it is challenging to create computers that understand the subtleties of oral communication in thousands of languages, each rich in oral features such as tone and other high-level semantics. But where researchers turn their attention, progress can be made.

12/ Innovation, access and safety demand that technology speak all of the world’s languages.
@StanfordEng @Stanford @PeaceCorps @WHOSTP
@FSIStanford @stanfordio @DigCivSoc @StanfordHAI @StanfordCyber @StanfordPACS @ai4allorg @stanfordnlp @StanfordAILab