Microsoft says speech recognition technology reaches “human parity”
Artificial intelligence just keeps getting smarter and smarter. Now Microsoft researchers say they’ve developed speech recognition technology that can recognize the words in a human conversation as well as people do.
The work out of the Microsoft Artificial Intelligence and Research department was published in a scientific paper this week. It shows that when the speech recognition software “listened” to people talking, it transcribed the conversation with as few errors as professional human transcriptionists, or fewer.
The technology delivered a word error rate (WER) of 5.9 percent, which is roughly the same as that of people who were asked to transcribe the same conversation.
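For readers unfamiliar with the metric, WER is conventionally computed as the number of word substitutions, deletions and insertions needed to turn a transcript into the reference text, divided by the number of words in the reference. The Python sketch below illustrates that standard definition only; it is not Microsoft's scoring code, and the function name and sample sentences are made up for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)


# Invented example: one wrong word out of five gives a WER of 20 percent.
print(word_error_rate("we have reached human parity",
                      "we is reached human parity"))  # 0.2
```

By this measure, a WER of 5.9 percent corresponds to roughly six word errors for every 100 words in the reference transcript.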
“We’ve reached human parity,” Xuedong Huang, Microsoft’s chief speech scientist, said in a press release. “This is an historic achievement.”
The achievement is no small feat. The company says this is the first time that a computer has been shown to equal humans in the ability to recognize words. For some members of the research team, the breakthrough happened sooner than expected.
“Even five years ago, I wouldn’t have thought we could have achieved this. I just wouldn’t have thought it would be possible,” Harry Shum, executive vice president in charge of Microsoft’s Artificial Intelligence and Research group, said in the release.
Like IBM’s Watson cognitive computing system and smartphone assistants such as Apple’s Siri, which have been making waves as they grow more useful and versatile, this technology could have a significant impact. Microsoft could use it to make its own mobile assistant, Cortana, more effective, and to boost the voice command capabilities of its Xbox gaming systems.
Of course, even at the level of “human parity,” the technology isn’t foolproof. The computer did not recognize every word accurately. The researchers found that the computer’s error rate for mishearing “have” for “is,” for example, was the same as you’d expect from a person in a normal conversation.
One of the main ways the team made progress was by using “neural network technology,” in which huge amounts of data serve as training sets. Words were represented as “continuous vectors in space,” with similar words placed close together, to teach the computer to recognize patterns common in actual human speech.
Geoffrey Zweig, manager of the company’s Speech and Dialog research group, said that “this lets the models generalize very well from word to word.”
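To give a rough sense of what “close together” can mean for word vectors, the Python sketch below compares three tiny vectors using cosine similarity, a common closeness measure. Everything in it is an assumption made for illustration: the three-number vectors and the example words are invented, and real systems learn vectors with hundreds of dimensions from training data. It is not the method described in Microsoft's paper.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy vectors, chosen by hand so that two similar words
# point in nearly the same direction and an unrelated word does not.
toy_vectors = {
    "fast": [0.9, 0.1, 0.0],
    "quick": [0.8, 0.2, 0.1],
    "table": [0.0, 0.1, 0.9],
}

similar = cosine_similarity(toy_vectors["fast"], toy_vectors["quick"])
unrelated = cosine_similarity(toy_vectors["fast"], toy_vectors["table"])
print(f"fast vs. quick: {similar:.2f}")
print(f"fast vs. table: {unrelated:.2f}")
```

A score near 1 means two vectors point in almost the same direction, which is one way a model can treat what it has learned about one word as relevant to a similar one.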
The team reached the milestone using the company’s Computational Network Toolkit (CNTK), which ran deep learning algorithms across multiple computers equipped with a specialized chip that sped up processing.
The next phase of the research involves improving speech recognition in real-life settings, like locations with heavy background noise. The team also hopes that the computer could eventually give names to individual speakers to distinguish when certain people are talking.
For those technology alarmists who worry that this could lead to sentient machines, a la “Terminator,” the research team offered some reassurances. While computers are learning to process language better than ever, true comprehension is still a long way off.
“The next frontier is to move from recognition to understanding,” Zweig said.