Artificial intelligence has broken through a sound barrier. Researchers at the Massachusetts Institute of Technology have developed an AI system that “watches” a silent video clip and generates a sound so convincing that most human viewers cannot tell whether it is computer-generated.
The MIT Computer Science and Artificial Intelligence Lab says its “deep learning algorithm” is the first to pass a “Turing test for sound” — making noises indistinguishable from the real thing.
The Visually Indicated Sounds system, Vis, was trained to analyse the sounds made when a stick hits, scrapes or prods a huge variety of objects, soft and hard, from leaves and water to soil and steel.
The Vis repertoire could be extended in future to many other settings, the researchers say. Future versions could, for example, produce more realistic sound effects for film and television than traditional methods such as dropping salt on aluminium foil to imitate rainfall.
A more significant application may be to help robots understand objects’ physical properties and interact better with their surroundings, says Andrew Owens, leader of the project, which will be presented later this month at the annual Computer Vision and Pattern Recognition conference in Las Vegas.
“A robot could look at a sidewalk and instinctively know that the cement is hard and the grass is soft, and therefore know what would happen if it stepped on either of them,” he said. “Being able to predict sound is an important first step toward being able to predict the consequences of physical interactions with the world.”
The MIT team “trained” Vis by feeding in 1,000 videos including 46,000 sounds made by a drumstick hitting or moving through objects of different consistencies. Then a “deep learning” algorithm, which enables a computer to find patterns within vast quantities of data, deconstructed the sounds.
To predict a new sound from a silent film clip, Vis looks at the audio properties most likely to be associated with each video frame and knits them together into a coherent sound.
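The frame-by-frame matching described above can be sketched in a few lines. This is a minimal illustration, not the MIT team’s code: it assumes a model has already predicted an audio feature vector for each video frame, and it “knits” the output together by retrieving the closest-matching snippet from a bank of recorded sounds. All names, shapes, and the random data are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical bank of short recorded sound snippets, each paired
# with the audio feature vector that summarises it.
bank_features = rng.normal(size=(50, 8))    # 50 snippets, 8-dim features
bank_snippets = rng.normal(size=(50, 441))  # ~10 ms of audio each at 44.1 kHz

def synthesise(predicted_features: np.ndarray) -> np.ndarray:
    """Knit per-frame feature predictions into one waveform by
    nearest-neighbour lookup in the snippet bank."""
    pieces = []
    for feat in predicted_features:
        # Pick the bank entry whose features are closest to the prediction.
        dists = np.linalg.norm(bank_features - feat, axis=1)
        pieces.append(bank_snippets[np.argmin(dists)])
    return np.concatenate(pieces)

# Pretend a network produced features for 30 silent video frames.
predicted = rng.normal(size=(30, 8))
waveform = synthesise(predicted)
print(waveform.shape)  # one snippet's worth of audio per frame
```

Retrieval from real recordings, rather than generating audio samples from scratch, is one simple way a system like this could keep its output natural-sounding.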
Vis can simulate the subtleties of quick and slow sounds, from staccato taps on a rock to gentle rustling through ivy. It manages low-pitched thuds against a cushion and high-pitched clicks against a railing.
To test the realism of the sounds for human listeners, the researchers carried out a survey with 400 viewers who saw video clips twice, once with the real sound and once with Vis’s version. They had to say which was real.
If Vis’s sounds were truly indistinguishable from reality, viewers would have picked them half the time. In fact they were chosen a very creditable 40 per cent of the time.
The system is least successful when the sounds are clean and sharp, such as hitting wood or metal, and best at reproducing softer and more drawn-out sounds made by leaves or dirt. It sometimes also “hallucinates” a false hit if the stick stops just short of its target.
Mr Owens now dreams of simulating sounds where there is no clear visual clue. “From the gentle blowing of the wind to the buzzing of laptops, at any given moment there are so many ambient sounds that aren’t related to what we’re actually looking at,” he says. “What would be really exciting is to somehow simulate sound that is less directly associated to the visuals.”