How do you teach human interaction to robots? Lots of TV
CAMBRIDGE, Mass. -- Remember the Jetsons' robot maid, Rosie? Massachusetts Institute of Technology researchers think her future real-life incarnations can learn a thing or two from Steve Carell and other sitcom stars.
MIT says a computer that binge-watched YouTube videos and TV shows such as "The Office," "Big Bang Theory" and "Desperate Housewives" learned how to predict whether the actors were about to hug, kiss, shake hands or slap high fives - advances that eventually could help the next generation of artificial intelligence function less clumsily.
"It could help a robot move more fluidly through your living space," lead researcher Carl Vondrick told The Associated Press in an interview. "The robot won't want to start pouring milk if it thinks you're about to pull the glass away."
Vondrick also sees potential health-care applications: "If you can predict that someone's about to fall down or start a fire or hurt themselves, it might give you a few seconds' advance notice to intervene."
The findings - two years in the making at MIT's Computer Science and Artificial Intelligence Laboratory - will be presented at next week's International Conference on Computer Vision and Pattern Recognition in Las Vegas.
Vondrick, a doctoral candidate focusing on computer vision and machine learning with grants from Google and the National Science Foundation, worked with MIT professor Antonio Torralba and Hamed Pirsiavash, now at the University of Maryland. The trio wanted to see if they could create an algorithm that could mimic a human being's intuition in anticipating what will happen next after two people meet.
To refine what's known in artificial intelligence studies as "predictive vision," they needed to expose their machine-learning system to video showing humans greeting one another.
Cue what Vondrick acknowledges were "random videos off YouTube." Six hundred hours of them, to be precise.
The researchers downloaded the videos and converted them into visual representations - a sort of numerical interpretation of pixels on a screen that the algorithm could read and search for complex patterns.
They then showed the computer clips from TV sitcoms it had never seen before - interactions between "Big Bang Theory" stars Jim Parsons (Sheldon Cooper) and Kaley Cuoco (Penny), for example - and asked the algorithm to predict one second later whether the two would hug, kiss, shake hands or high-five.
The computer got it right more than 43 percent of the time. That may not sound like much, but it's better than existing algorithms with a 36 percent success rate. Humans make the right call 71 percent of the time.
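To put those figures side by side, here is a minimal Python sketch. The three success rates are the ones reported above ("more than 43 percent" is treated as 43). The 25 percent chance baseline is an assumption, not a figure from the study: it is what uniform random guessing among the four actions would score.

```python
# Success rates as reported in the article (percent of clips predicted correctly).
reported = {
    "MIT algorithm": 43.0,        # "more than 43 percent", rounded down to 43 here
    "existing algorithms": 36.0,
    "humans": 71.0,
}

# The four greetings the algorithm chose between.
ACTIONS = ["hug", "kiss", "handshake", "high-five"]

# Assumption (not from the study): a guesser picking uniformly at random.
chance = 100.0 / len(ACTIONS)


def gap(a, b):
    """Difference in percentage points between two reported rates."""
    return reported[a] - reported[b]


improvement = gap("MIT algorithm", "existing algorithms")
remaining = gap("humans", "MIT algorithm")

print(f"Assumed chance baseline: {chance:.0f} percent")
print(f"Gain over existing algorithms: {improvement:.0f} points")
print(f"Gap still separating the algorithm from humans: {remaining:.0f} points")
```

On these numbers, the gap to human performance is four times the gain over earlier algorithms, which is the sense in which there is "still a long way to go."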
In a video trailer of the study that showed the algorithm blowing it on a clip from "The Office," the researchers quipped: "So it's not perfect ... still a long way to go."
That likely will involve even more binge-watching. Six hundred hours of video sounds like a lot, but it's not really that much. By the time we're 10 years old, we've logged nearly 60,000 hours of waking experience.
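The 60,000-hour figure can be checked with back-of-the-envelope arithmetic. The sketch below assumes about 16 waking hours a day and 365 days a year; both are illustrative assumptions, not numbers from the researchers.

```python
# Assumptions for illustration only (not from the study).
WAKING_HOURS_PER_DAY = 16   # roughly 8 hours of sleep per night
DAYS_PER_YEAR = 365         # leap days ignored
YEARS = 10

# Figure reported in the article.
TRAINING_VIDEO_HOURS = 600

waking_hours = YEARS * DAYS_PER_YEAR * WAKING_HOURS_PER_DAY
ratio = waking_hours / TRAINING_VIDEO_HOURS

print(f"Waking hours by age {YEARS}: {waking_hours:,}")
print(f"That is about {ratio:.0f} times the video the computer watched")
```

Under those assumptions a 10-year-old has roughly 58,400 waking hours behind them, close to 100 times what the computer watched.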
"Humans are really good at predicting the immediate future," Pirsiavash, the team member now based in Baltimore, said Wednesday. "To have robots interact with humans seamlessly, the robot should be able to reason about the immediate future of our actions."
Martial Hebert, director of the robotics institute at Carnegie Mellon University in Pittsburgh, who was not involved in the MIT study, called it "an important work."
"Some argue that prediction is a central part of (artificial) intelligence," Hebert said. "If you have a robot that can predict, you can map a deeper and more complicated understanding of the environment around it."
The researchers' biggest relief? The computer did all the binge-watching.
"We never had to watch the videos," Vondrick said.