Machine Language

How does one go about teaching a machine a human language? This was the question DARPA (Defense Advanced Research Projects Agency) had in mind when it issued a call for the creation of an organization to support human language technology research and development—a call answered with the establishment of the Linguistic Data Consortium (LDC) at Penn. “Think of it as a semester abroad for computers,” jokes Mark Liberman, the Trustee Professor of Phonetics and Professor of Computer and Information Science who founded the Consortium in 1992. “Learning a language takes a lot of experience.”

In order to gain this “experience,” a computer requires vast amounts of human language data, as well as directions for interpretation. These collections are often too time-consuming and expensive for individual research groups to create, and in providing shared resources to speech and language researchers around the world, the Consortium helps facilitate intellectual exchange. The LDC also acts as an intermediary for intellectual property rights. Its contracts with over 70 data providers allow researchers at more than a thousand institutions to use billions of words of text, and tens of thousands of hours of speech, without violating the copyrights of publishers and broadcasters.

LDC team members are skilled in some combination of linguistics, computer science and project management. Though many of the methods the Consortium uses to gather speech and language data are automated, the majority of data sets depend on human analysis. One of the key research areas the LDC supports is speech recognition (speech-to-text). Speech is collected from a variety of different sources, including satellite dishes and cable feeds. It is also captured from human subjects recruited to participate in telephone conversations and face-to-face interviews. Afterwards, the recordings are transcribed and stored, along with information about the source and the recording process.

In order to learn to “understand” speech or text, machines, just like humans, need information about meaning. To provide these data, LDC annotators are often asked to tag texts for “entities,” such as people or places. Researchers then use these tagged texts to develop and test programs that can extract the same sort of information automatically from new material.
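As a rough illustration of how tagged text can be used to test an extraction program, the short Python sketch below compares a program's output against human tags. Everything in it is an assumption made for the example: the entity lists are invented, the `score` function is hypothetical, and the two measures it reports (precision and recall) are a common way to score such programs rather than a description of the LDC's own tools.

```python
# Invented example data: entities an annotator tagged by hand,
# written as (text, entity type) pairs.
gold = {
    ("Mark Liberman", "PERSON"),
    ("Philadelphia", "PLACE"),
    ("Baghdad", "PLACE"),
}

# Invented output of a hypothetical automatic tagger run on the same text.
# It mislabels "Philadelphia" as a person.
predicted = {
    ("Mark Liberman", "PERSON"),
    ("Philadelphia", "PERSON"),
    ("Baghdad", "PLACE"),
}


def score(gold, predicted):
    """Return (precision, recall) of the predicted tags against the human tags."""
    correct = len(gold & predicted)  # tags that match exactly
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


precision, recall = score(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here the program gets two of its three tags right and finds two of the three human-tagged entities, so both measures come out to about 0.67.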

To develop and test methods for speaker identification, researchers need examples of speakers recorded in multiple places, talking about multiple topics, using multiple recording devices. Otherwise, instead of learning to recognize differences among speakers, machine algorithms would learn to recognize differences among microphones, differences between rooms, or even differences between casual conversations and formal interviews.

Technology derived from this research could eventually give authorities the ability to match threatening phone calls to suspects in custody, or enable a telephone banking service to identify a customer by voice alone.

Some Consortium projects work toward a very different goal, explains Christopher Cieri, LDC Executive Director since 1998. For example, annotator Alyaa Abbood has spent the last two years updating a 1960s-era Iraqi-Arabic dictionary, a U.S. Department of Education–sponsored collaboration between the Consortium and Georgetown University Press. Her work will lead to a new, standardized edition for use in academia and other venues. In all, the Consortium has published data containing material in 75 languages.

In addition to his professorial duties and work at the LDC, Consortium founder Mark Liberman is also Faculty Director of College Houses and Academic Services and founder of Language Log, a blog that presents linguistic research in a popular form and dissects linguistic idiosyncrasies in popular media and literature. “The LDC has played an important role in the last 20 years of progress,” Liberman says. “We continue to be in the middle of exciting new developments. I look forward to an increased impact on speech and language science, and to applications in new areas.”
