MAViD: A Multimodal Framework for Audio-Visual Dialogue Understanding and Generation

1 Tsinghua University 2 Meituan

* Equal contribution, Corresponding authors


Abstract

We propose MAViD, a novel Multimodal framework for Audio-Visual Dialogue understanding and generation. Existing approaches focus primarily on non-interactive systems and are limited to producing constrained, unnatural human speech. The central challenges of this task are effectively integrating understanding and generation capabilities and achieving seamless multimodal audio-video fusion. To address them, we propose a Conductor–Creator architecture that divides the dialogue system into two primary components. The Conductor handles understanding and reasoning, and generates instructions decomposed into motion and speech components, enabling fine-grained control over interactions. The Creator then produces interactive responses from these instructions. Furthermore, because dual-DiT structures struggle to generate long videos with consistent identity, timbre, and tone, the Creator combines an autoregressive (AR) model, responsible for audio generation, with a diffusion model that ensures high-quality video generation. We additionally propose a novel fusion module that strengthens connections across contextually consecutive clips and across modalities, enabling synchronized long-duration audio-visual content generation. Extensive experiments demonstrate that our framework generates vivid, contextually coherent long-duration dialogue interactions and accurately interprets users' multimodal queries.
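The Conductor–Creator split described above can be sketched in code. This is a minimal, illustrative mock-up only: every class, method, and string below is a hypothetical stand-in (the paper does not publish this interface), and the actual system uses a multimodal model for planning, an AR model for audio, and a diffusion model for video where the stubs appear.

```python
from dataclasses import dataclass

@dataclass
class Instruction:
    """Conductor output, decomposed into motion and speech components."""
    motion: str   # fine-grained motion directive for the video model
    speech: str   # speech directive (content, timbre, tone) for the audio model

class Conductor:
    """Understands the user's query, reasons about it, and emits instructions."""
    def plan(self, query: str) -> Instruction:
        # A real system would run a multimodal LLM here; we stub the
        # decomposition into motion and speech components.
        return Instruction(motion=f"gesture for: {query}",
                           speech=f"reply to: {query}")

class Creator:
    """Generates the response: AR model for audio, diffusion model for video."""
    def generate_audio(self, speech: str) -> list:
        # Stand-in for autoregressive, token-by-token audio generation.
        return [f"audio_token_{i}" for i in range(3)]

    def generate_video(self, motion: str, audio_tokens: list) -> list:
        # Stand-in for diffusion-based video generation; the fusion module
        # that conditions video on audio is mocked as simple pairing.
        return [(motion, tok) for tok in audio_tokens]

def respond(query: str):
    """End-to-end pipeline: Conductor plans, Creator renders audio and video."""
    inst = Conductor().plan(query)
    creator = Creator()
    audio = creator.generate_audio(inst.speech)
    video = creator.generate_video(inst.motion, audio)
    return audio, video
```

The key design point the sketch illustrates is the separation of concerns: the Conductor never touches pixels or waveforms, and the Creator never reasons about the dialogue, which is what allows fine-grained instruction-level control over the interaction.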

Multi-Turn Dialogue
Please turn on the sound while watching.

Dialogue 1

Dialogue 2

Dialogue 3
Motion Interactive Dialogue
Comparison with Baselines for Joint Audio-Video Generation (T2AV Task)
UniVerse-1

OVI

MAViD (Ours)

Ethics Statement

We acknowledge that Text-to-Sounding-Video generation technology, like other generative technologies, carries potential risks of misuse. The ability to create realistic and synchronized audio-visual content from text could be exploited to generate convincing disinformation and fraudulent materials. The primary motivation for our research, however, is positive. We believe this technology holds significant potential for beneficial applications. We are committed to the responsible advancement of this field and encourage continued research into synthetic content detection and the establishment of clear ethical guidelines for deployment.
