Google DeepMind Unveils New Model: Generating Audio for Video
DeepMind's V2A technology generates synchronized soundtracks for silent videos using video pixels and text prompts, enhancing creative possibilities in video generation.
3 minutes

June 17, 2024
This content was generated using AI and curated by humans

DeepMind's latest research focuses on video-to-audio (V2A) technology, which generates synchronized soundtracks for silent videos from video pixels and text prompts. This is a significant advance for video generation models, which have traditionally produced silent output. V2A can be paired with video generation models like Veo to create dramatic scores, realistic sound effects, or dialogue that matches the characters and tone of a video. It can also generate soundtracks for traditional footage, including archival material and silent films, expanding creative possibilities.

Key Features of V2A Technology

One of the key features of V2A is its ability to generate an unlimited number of soundtracks for any video input. Users can define 'positive prompts' to guide the generated output toward desired sounds or 'negative prompts' to steer it away from undesired sounds. This flexibility allows for rapid experimentation with different audio outputs to find the best match.
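As an illustration, the sketch below shows how such prompt-guided requests might be structured. The V2ARequest fields and file name are hypothetical, since DeepMind has not released a public interface for V2A.

```python
from dataclasses import dataclass

# Illustrative only: one way to organize prompt-guided V2A requests.
# The class, its fields, and the file name are assumptions, not a real API.

@dataclass
class V2ARequest:
    video_path: str
    positive_prompt: str = ""   # steer generation toward desired sounds
    negative_prompt: str = ""   # steer generation away from undesired sounds
    num_variants: int = 4       # V2A can produce many soundtracks per clip

# Rapid experimentation: same clip, different audio directions.
clip = "wolf_howling.mp4"
requests = [
    V2ARequest(clip, positive_prompt="cinematic score, tense strings"),
    V2ARequest(clip, positive_prompt="wolf howling at the moon, night wind",
               negative_prompt="music"),
]
for req in requests:
    print(req)
```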

Diffusion-Based Approach

The V2A system employs a diffusion-based approach for audio generation, which has proven to be the most effective in synchronizing video and audio information. The process begins by encoding the video input into a compressed representation. The diffusion model then iteratively refines the audio from random noise, guided by the visual input and natural language prompts. The final audio output is decoded, turned into an audio waveform, and combined with the video data.
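The sketch below mirrors that flow with placeholder components: a video encoder, an iterative denoising loop conditioned on the video representation and a text prompt, and an audio decoder. None of these stand-ins reflect DeepMind's actual models; they only illustrate the shape of a diffusion-style V2A pipeline.

```python
import numpy as np

# Conceptual sketch of the described pipeline: encode video, iteratively
# refine audio latents from random noise conditioned on the video and a text
# prompt, then decode to a waveform. All components are placeholders.

def encode_video(frames: np.ndarray) -> np.ndarray:
    """Placeholder: compress video frames into a conditioning representation."""
    return frames.mean(axis=(1, 2))              # e.g. one vector per frame

def denoise_step(latents, video_cond, text_prompt, step):
    """Placeholder for one diffusion refinement step guided by the conditioning."""
    return latents * 0.9                         # stand-in for a learned update

def decode_audio(latents: np.ndarray) -> np.ndarray:
    """Placeholder: turn refined latents into an audio waveform."""
    return np.tanh(latents).ravel()

def generate_audio(frames, text_prompt, steps=50, latent_shape=(256, 16)):
    video_cond = encode_video(frames)
    latents = np.random.randn(*latent_shape)     # start from random noise
    for step in range(steps):
        latents = denoise_step(latents, video_cond, text_prompt, step)
    return decode_audio(latents)                 # combined with the video downstream

waveform = generate_audio(np.random.rand(24, 64, 64, 3), "footsteps on gravel")
print(waveform.shape)
```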

Enhanced Audio Quality

To enhance audio quality and guide the model towards generating specific sounds, additional information was incorporated into the training process, such as AI-generated annotations with detailed descriptions of sounds and transcripts of spoken dialogue. This allows the technology to associate specific audio events with various visual scenes and to respond to the information provided in the annotations or transcripts.
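As a rough illustration, a training example carrying these extra signals might be structured as below. The field names and values are assumptions made for the sake of the sketch, not DeepMind's actual data schema.

```python
from dataclasses import dataclass, field

# Illustrative only: one way a training example pairing video with
# AI-generated sound annotations and a dialogue transcript could look.

@dataclass
class V2ATrainingExample:
    video_clip: str                               # path or ID of the video segment
    audio_clip: str                               # the ground-truth soundtrack
    sound_annotations: list[str] = field(default_factory=list)
    dialogue_transcript: str = ""                 # spoken dialogue, if any

example = V2ATrainingExample(
    video_clip="street_scene_0042.mp4",
    audio_clip="street_scene_0042.wav",
    sound_annotations=["car passing", "distant siren", "footsteps"],
    dialogue_transcript="Watch out for the bus!",
)
print(example.sound_annotations)
```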

Challenges and Limitations

Despite its advancements, the V2A technology still faces challenges. The quality of the audio output is dependent on the quality of the video input, and artifacts or distortions in the video can lead to a noticeable drop in audio quality. Additionally, improving lip synchronization for videos involving speech remains a challenge. The V2A system attempts to generate speech from input transcripts and synchronize it with characters' lip movements, but mismatches can occur if the paired video generation model is not conditioned on transcripts.

Commitment to Responsible AI Development

DeepMind is committed to developing and deploying AI technologies responsibly. They are gathering diverse perspectives from leading creators and filmmakers to inform their ongoing research and development. The SynthID toolkit has been incorporated into the V2A research to watermark all AI-generated content, safeguarding against potential misuse. Before opening access to the wider public, the V2A technology will undergo rigorous safety assessments and testing. Initial results indicate that this technology holds promise for bringing generated movies to life.

This blog post is AI generated with input from the following sources:

  • Generating Audio for Video: DeepMind's V2A Technology
    Authors: Ankush Gupta, Nick Pezzotti, Pavel Khrushkov, Tobenna Peter Igwe, Kazuya Kawakami, Mateusz Malinowski, Jacob Kelly, Yan Wu, Xinyu Wang, Abhishek Sharma, Ali Razavi, Eric Lau, Serena Zhang, Brendan Shillingford, Yelin Kim, Eleni Shaw, Signe Nørly, Andeep Toor, Irina Blok, Gregory Shaw, Pen Li, Scott Wisdom, Aren Jansen, Zalán Borsos, Brian McWilliams, Salah Zaiem, Marco Tagliasacchi, Ron Weiss, Manoj Plakal, Hakan Erdogan, John Hershey, Jeff Donahue, Vivek Kumar, Matt Sharifi, Benigno Uria, Björn Winckler, Charlie Nash, Conor Durkan, Cătălina Cangea, David Ding, Dawid Górny, Drew Jaegle, Ethan Manilow, Evgeny Gladchenko, Felix Riedel, Florian Stimberg, Henna Nandwani, Jakob Bauer, Junlin Zhang, Luis C. Cobo, Mahyar Bordbar, Miaosen Wang, Mikołaj Bińkowski, Sander Dieleman, Will Grathwohl, Yaroslav Ganin, Yusuf Aytar, Yury Sulsky, Aäron van den Oord, Andrew Zisserman, Tom Hume, RJ Mical, Douglas Eck, Nando de Freitas, Oriol Vinyals, Eli Collins, Koray Kavukcuoglu, Demis Hassabis
    Publish Date: 2024-06-17