A new AI system can create natural-sounding speech and music after being prompted with a few seconds of audio.
AudioLM, developed by Google researchers, generates audio that fits the style of the prompt, including complex sounds like piano music or people speaking, in a way that is almost indistinguishable from the original recording. The technique shows promise for speeding up the process of training AI to generate audio, and it could eventually be used to auto-generate music to accompany videos.
AI-generated audio is commonplace: voices on home assistants like Alexa use natural language processing. AI music systems like OpenAI’s Jukebox have already generated impressive results, but most existing techniques need people to prepare transcriptions and label text-based training data, which takes a lot of time and human labor. Jukebox, for example, uses text-based data to generate song lyrics.
AudioLM, described in a non-peer-reviewed paper last month, is different: it doesn’t require transcription or labeling. Instead, sound databases are fed into the program, and machine learning is used to compress the audio files into sound snippets, called “tokens,” without losing too much information. This tokenized training data is then fed into a machine-learning model that uses natural language processing to learn the sound’s patterns.
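To make the idea of audio "tokens" concrete, here is a minimal sketch in Python. It is a toy illustration, not AudioLM's method: the article says the compression is learned by machine learning, whereas this example assumes a tiny hand-written codebook (`CODEBOOK`) and simply replaces each short frame of samples with the index of the closest codebook entry. All names and values here are hypothetical.

```python
def frame_signal(samples, frame_size):
    """Split samples into non-overlapping frames, dropping any leftover tail."""
    return [samples[i:i + frame_size]
            for i in range(0, len(samples) - frame_size + 1, frame_size)]


def nearest_code(frame, codebook):
    """Index of the codebook entry closest to the frame (squared distance)."""
    def dist(entry):
        return sum((a - b) ** 2 for a, b in zip(frame, entry))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def tokenize(samples, codebook):
    """Turn raw samples into a list of integer tokens."""
    frame_size = len(codebook[0])
    return [nearest_code(f, codebook) for f in frame_signal(samples, frame_size)]


def detokenize(tokens, codebook):
    """Rebuild an approximate signal by concatenating codebook entries."""
    out = []
    for t in tokens:
        out.extend(codebook[t])
    return out


# Hypothetical codebook of four two-sample "snippets" (an assumption for
# illustration; a real system would learn a far larger one from data).
CODEBOOK = [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0], [-1.0, -1.0]]

samples = [0.1, -0.1, 0.4, 0.6, 0.9, 1.1, -0.8, -1.2]
tokens = tokenize(samples, CODEBOOK)
approx = detokenize(tokens, CODEBOOK)
print(tokens)
print(approx)
```

The reconstructed signal is only an approximation of the input, which mirrors the trade-off the researchers describe: the audio is compressed into tokens "without losing too much information."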
To generate the audio, a few seconds of sound are fed into AudioLM, which then predicts what comes next. The process is similar to the way language models like GPT-3 predict what sentences and words typically follow one another.
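The "predict what comes next" step can be sketched in the same spirit. The example below is a deliberately simple stand-in, assuming a table of token-pair counts rather than the neural language model the article describes: it counts which token tends to follow which in some hypothetical training sequences, then extends a short prompt one token at a time. The function names and data are invented for illustration.

```python
from collections import Counter, defaultdict


def train_bigram(sequences):
    """Count, for each token, how often each other token follows it."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return counts


def continue_prompt(counts, prompt, n_new):
    """Extend a non-empty prompt by up to n_new tokens, one at a time."""
    out = list(prompt)
    for _ in range(n_new):
        followers = counts.get(out[-1])
        if not followers:
            break  # the last token was never seen with a follower
        # Greedy choice: most frequent follower, ties broken by smaller id.
        best = min(followers, key=lambda t: (-followers[t], t))
        out.append(best)
    return out


# Hypothetical tokenized training data with a repeating 0-1-2-3 pattern.
training = [[0, 1, 2, 3, 0, 1, 2, 3], [0, 1, 2, 3, 0, 1]]
model = train_bigram(training)

continuation = continue_prompt(model, [0, 1], 4)
print(continuation)
```

A model that looks only one token back can reproduce a short loop like this one but not the longer-range structure of real music or speech; capturing that structure calls for a much larger learned model.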
The audio clips released by the team sound pretty natural. In particular, piano music generated using AudioLM sounds more fluid than piano music generated using existing AI techniques, which tends to sound chaotic.
Roger Dannenberg, who researches computer-generated music at Carnegie Mellon University, says AudioLM already has much better sound quality than previous music generation programs. In particular, he says, AudioLM is surprisingly good at re-creating some of the repeating patterns inherent in human-made music. To generate realistic piano music, AudioLM has to capture a lot of the subtle vibrations contained in each note when piano keys are struck. The music also has to sustain its rhythms and harmonies over a period of time.
“That’s really impressive, partly because it indicates that they are learning some kinds of structure at multiple levels,” Dannenberg says.
AudioLM isn’t confined to music. Because it was trained on a library of recordings of humans speaking sentences, the system can also generate speech that continues in the accent and cadence of the original speaker, although at this point those sentences can still seem like non sequiturs. AudioLM is trained to learn what types of sound snippets frequently occur together, and it uses the process in reverse to produce sentences. It also has the advantage of being able to learn the pauses and exclamations that are inherent in spoken languages but not easily translated into text.
Rupal Patel, who researches information and speech science at Northeastern University, says that previous work using AI to generate audio could capture those nuances only if they were explicitly annotated in training data. In contrast, AudioLM learns those characteristics from the input data automatically, which adds to the realistic effect.
“There is a lot of what we could call linguistic information that is not in the words that you pronounce, but it’s another way of communicating based on the way you say things to express a specific intention or specific emotion,” says Neil Zeghidour, a co-creator of AudioLM. For example, someone may laugh after saying something to indicate that it was a joke. “All that makes speech natural,” he says.
Eventually, AI-generated music could be used to provide more natural-sounding background soundtracks for videos and slideshows. Speech generation technology that sounds more natural could help improve internet accessibility tools and bots that work in health care settings, says Patel. The team also hopes to create more sophisticated sounds, like a band with different instruments or sounds that mimic a recording of a tropical rainforest.
However, the technology’s ethical implications need to be considered, Patel says. In particular, it’s important to determine whether the musicians who produce the clips used as training data will get attribution or royalties from the end product, an issue that has cropped up with text-to-image AIs. AI-generated speech that’s indistinguishable from the real thing could also make it easier to spread misinformation.
In the paper, the researchers write that they are already considering and working to mitigate these issues—for example, by developing techniques to distinguish natural sounds from sounds produced using AudioLM. Patel also suggested including audio watermarks in AI-generated products to make them easier to distinguish from natural audio.