Machine Language

How does one go about teaching a machine a human language? This was the question DARPA (Defense Advanced Research Projects Agency) had in mind when it issued a call for the creation of an organization to support human language technology research and development—a call answered with the establishment of the Linguistic Data Consortium (LDC) at Penn. “Think of it as a semester abroad for computers,” jokes Mark Liberman, the Trustee Professor of Phonetics and Professor of Computer and Information Science who founded the Consortium in 1992. “Learning a language takes a lot of experience.”

In order to gain this “experience,” a computer requires vast amounts of human language data, as well as directions for interpretation. These collections are often too time-consuming and expensive for individual research groups to create. By providing shared resources to speech and language researchers around the world, the Consortium helps facilitate intellectual exchange. The LDC also acts as an intermediary for intellectual property rights. Its contracts with over 70 data providers allow researchers at more than a thousand institutions to use billions of words of text, and tens of thousands of hours of speech, without violating the copyrights of publishers and broadcasters.

LDC team members are skilled in some combination of linguistics, computer science and project management. Though many of the methods the Consortium uses to gather speech and language data are automated, the majority of data sets depend on human analysis. One of the key research areas the LDC supports is speech recognition (speech-to-text). Speech is collected from a variety of sources, including satellite dishes and cable feeds. It is also captured from human subjects recruited to participate in telephone conversations and face-to-face interviews. Afterwards, the recordings are transcribed and stored, along with information about the source and the recording process.

In order to learn to “understand” speech or text, machines, just like humans, need information about meaning. To provide these data, LDC annotators are often asked to tag texts for “entities,” such as people or places. Researchers then use these tagged texts to develop and test programs that can extract the same sort of information automatically from new material.

To develop and test methods for speaker identification, researchers need examples of speakers recorded in multiple places, talking about multiple topics, using multiple recording devices. Otherwise, instead of learning to recognize differences among speakers, machine algorithms would learn to recognize differences among microphones, differences between rooms, or even differences between casual conversations and formal interviews.

Technology derived from this research could eventually lend authorities the ability to match threatening phone calls to suspects in custody, or enable a telephone banking service to identify a customer by voice alone.

Some Consortium projects work toward a very different goal, explains Christopher Cieri, LDC Executive Director since 1998. For example, annotator Alyaa Abbood has spent the last two years updating a 1960s-era Iraqi-Arabic dictionary, a U.S. Department of Education–sponsored collaboration between the Consortium and Georgetown University Press. Her work will lead to a new, standardized edition for use in academia and other venues. In all, the Consortium has published data containing material in 75 languages.

In addition to his professorial duties and work at the LDC, Consortium founder Mark Liberman is also Faculty Director of College Houses and Academic Services and founder of Language Log, a blog that presents linguistic research in a popular form and dissects linguistic idiosyncrasies in popular media and literature. “The LDC has played an important role in the last 20 years of progress,” Liberman says. “We continue to be in the middle of exciting new developments. I look forward to an increased impact on speech and language science, and to applications in new areas.”
