Carnegie Mellon University researchers want to bring automatic speech recognition to nearly 2,000 languages

According to UNESCO’s World Atlas of Languages, governments have documented 8,324 languages, spoken and signed. Only about 200 of the spoken languages benefit from modern language technologies such as voice-to-text transcription, automatic captioning, instant translation and speech recognition. Carnegie Mellon University researchers plan to raise that number to about 2,000 with automatic speech recognition tools.

State-of-the-art speech recognition models are trained on large supervised datasets, which are unavailable for many low-resource languages. A team of Carnegie Mellon researchers set out to reduce the data requirements needed to build a speech recognition model for a new language. The team was led by Xinjian Li, a doctoral student at the School of Computer Science’s Language Technologies Institute (LTI), who presented the work, “ASR2K: Speech Recognition for Around 2000 Languages Without Audio,” co-authored with LTI faculty members Shinji Watanabe, Florian Metze, David Mortensen and Alan Black, at Interspeech 2022 in South Korea.

Xinjian Li comments:

“Many people in this world speak different languages, but language technology tools are not being developed for all of them. Developing technology and a good language model for everyone is one of the goals of this research.”

Most speech recognition models require two sets of data: text and audio. While text data is easy to collect for thousands of languages, audio data is much harder to come by. The team hopes to eliminate the need for the latter by focusing on linguistic elements common to many languages.

Aaron Aupperlee, Senior Director of Media Relations at Carnegie Mellon University, explains the study in a ScienceDaily article:

“Historically, speech recognition technologies have focused on a language’s phonemes. These distinct sounds that distinguish one word from another, like the ‘d’ that distinguishes ‘dog’ from ‘log’ and ‘cog,’ are unique to each language. But languages also have phones, which describe how a word physically sounds. One phoneme can correspond to several phones. Thus, although different languages may have different phonemes, their underlying phones can be the same.”
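
To make the phoneme/phone distinction concrete, here is a minimal toy sketch in Python. The inventories below are simplified textbook examples, not data from the paper:

```python
# Illustrative toy example (not from the paper): phonemes are
# language-specific abstract sound categories, while phones are the
# concrete sounds that realize them. One phoneme can correspond to
# several phones, and different languages can share the same phones.

# English /t/ is one phoneme realized as different phones by context:
# aspirated [tʰ] in "top", plain [t] in "stop", flap [ɾ] in "butter".
english = {
    "/t/": {"tʰ", "t", "ɾ"},
    "/d/": {"d", "ɾ"},  # "dog"; flapped between vowels, as in "ladder"
}

# Spanish lacks the aspirated [tʰ], but its plain [t] is the same phone
# English uses in "stop", so acoustic knowledge about [t] transfers.
spanish = {
    "/t/": {"t"},
    "/d/": {"d", "ð"},  # "dar"; softened between vowels, as in "lado"
}

def phones(inventory: dict) -> set:
    """All phones a language uses, pooled across its phonemes."""
    return set().union(*inventory.values())

print(phones(english) & phones(spanish))  # shared phones: {'t', 'd'}
```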

Aupperlee continues:

“The LTI team is developing a speech recognition model that moves away from phonemes and instead relies on information about how phones are shared between languages, reducing the effort of building separate models for each language. In particular, it pairs the model with a phylogenetic tree (a diagram showing the relationships between languages) to help with pronunciation rules. Using their model and the tree, the team can approximate speech models for thousands of languages without any audio data.”
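
The tree idea can be sketched as follows: for a target language with no pronunciation data, climb a language family tree until a documented relative is found and borrow its rules. The tree, the set of modeled languages and the function names below are all hypothetical simplifications, not the authors’ implementation:

```python
# Hypothetical sketch of tree-guided borrowing (not the ASR2K code).
# A child -> parent map encoding a tiny, simplified family tree.
FAMILY_TREE = {
    "Portuguese": "Western Romance",
    "Spanish": "Western Romance",
    "Western Romance": "Romance",
    "Romanian": "Romance",
    "Romance": "Indo-European",
}

# Languages we (hypothetically) have pronunciation rules for.
MODELED = {"Spanish", "Romanian"}

def ancestors(language: str):
    """Yield the language's ancestors, nearest first."""
    node = language
    while node in FAMILY_TREE:
        node = FAMILY_TREE[node]
        yield node

def nearest_modeled_relative(target: str) -> str | None:
    """Return a documented relative, preferring the closest shared ancestor."""
    for ancestor in ancestors(target):
        for candidate in sorted(MODELED):
            if ancestor in ancestors(candidate):
                return candidate
    return None

# Portuguese has no pronunciation rules of its own, so borrow Spanish's:
print(nearest_modeled_relative("Portuguese"))  # Spanish
```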

So the researchers built a speech recognition pipeline for 1,909 languages, and according to Xinjian Li, “This is the first study to target such a large number of languages, and we are the first team to extend language tools to this scope.”
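
A high-level way to picture such a pipeline: a phone recognizer trained once on high-resource languages scores the audio, a pronunciation model approximated without target-language audio maps words to phones, and a language model estimated from text alone ranks the words. The toy decoder below invents all of its data and probabilities purely for illustration; it shows the composition, not the authors’ actual system:

```python
import math

# Toy illustration of a text-only ASR pipeline (all numbers invented).

# 1) Universal phone recognizer: per-frame phone probabilities for some
#    audio clip. In reality a multilingual model trained on
#    high-resource languages; no target-language audio is needed.
frame_probs = [
    {"g": 0.7, "l": 0.2, "k": 0.1},
    {"a": 0.8, "o": 0.2},
    {"t": 0.6, "d": 0.4},
]

# 2) Pronunciation model for the target language, approximated without
#    audio (e.g. borrowed from relatives): word -> phone sequence.
#    (Pronunciations truncated to three phones to match the toy frames.)
pronounce = {"gato": ["g", "a", "t"], "lado": ["l", "a", "d"]}

# 3) Unigram language model estimated from target-language *text* only.
lm = {"gato": 0.6, "lado": 0.4}

def log_score(word: str) -> float:
    """log P(word) + log P(audio | pronunciation of word)."""
    acoustic = sum(
        math.log(frame.get(phone, 1e-9))
        for frame, phone in zip(frame_probs, pronounce[word])
    )
    return math.log(lm[word]) + acoustic

print(max(pronounce, key=log_score))  # gato -- the acoustic evidence wins
```

Only the first component ever touches audio, and it is shared across languages; everything specific to the target language comes from text.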

For him, this research is not only about making language technologies accessible to everyone, but also about preserving culture. He says:

“Each language is a very important factor in its culture. Every language has its own history, and if you don’t try to preserve languages, those histories can be lost. Developing this type of speech recognition system and this tool is a step toward preserving these languages.”

While still in its early stages, the research has already improved existing language approximation tools by a modest 5%, and the team hopes it will serve as inspiration not only for their own future work but also for that of other researchers.

Sources:

Carnegie Mellon University. “Project aims to expand language technologies: Research could expand automatic speech recognition to 2,000 languages.” ScienceDaily, 10 January 2023. Originally written by Aaron Aupperlee.

Li, X., Metze, F., Mortensen, D. R., Black, A. W., Watanabe, S. (2022). “ASR2K: Speech Recognition for Around 2000 Languages Without Audio.” Proc. Interspeech 2022, 4885-4889. doi: 10.21437/Interspeech.2022-10712
