SpecMaskFoley Demo Page

Abstract

Foley synthesis aims to synthesize high-quality audio that is both semantically and temporally aligned with video frames. Given its broad application in creative industries, the task has gained increasing attention in the research community. To avoid the non-trivial task of training audio generative models from scratch, adapting pretrained audio generative models for video-synchronized foley synthesis presents an attractive direction. ControlNet, a method for adding fine-grained controls to pretrained generative models, has been applied to foley synthesis, but its use has been limited to handcrafted human-readable temporal conditions. In contrast, from-scratch models have achieved success by leveraging high-dimensional deep features extracted using pretrained video encoders. We have observed a performance gap between ControlNet-based and from-scratch foley models. To narrow this gap, we propose SpecMaskFoley, a method that steers the pretrained SpecMaskGIT model toward video-synchronized foley synthesis via ControlNet. To unlock the potential of a single ControlNet branch, we resolve the discrepancy between the temporal video features and the time-frequency nature of the pretrained SpecMaskGIT via a frequency-aware temporal feature aligner, eliminating the need for the complicated conditioning mechanisms widely used in prior art. Evaluations on a common foley synthesis benchmark demonstrate that SpecMaskFoley can even outperform strong from-scratch baselines, substantially advancing the development of ControlNet-based foley synthesis models.

Take Home Message

DeSync vs FAD
Fig. 1: Audio synthesis quality (FAD) and audio-video temporal alignment (DeSync) of different methods. The proposed SpecMaskFoley achieves competitive scores on both axes without non-trivial from-scratch training.

SpecMaskFoley Overview
Fig. 2: Overview of SpecMaskFoley. Ice icons: frozen modules. Fire icons: trainable modules. Following SpecMaskGIT, a CLAP embedding is treated as a conditional mask C that conditions the audio backbone on audio prompts during training and on text prompts during inference. SpecMaskFoley uses only a single ControlNet branch as the conditioning mechanism to inject video features, which is simpler than prior art.

Go to the demo page of SpecMaskGIT, the pretrained backbone of SpecMaskFoley

Update 2025 July 05: For in-domain samples, added few-step results from SpecMaskFoley for example 2, example 6 and extra example 3.

1. In-domain VGGSound Samples

Notice: our model might produce slightly longer samples than the others, as in examples 1 and 6. This is because we feed the full ~10 s (9.85 s) video to SpecMaskFoley without cutting it to 8 s.

Extra VGGSound examples: Moving train; Water splashing (with few-step samples); Skateboarding (with few-step samples); Synchronized clapping

2. Out-of-domain Samples

Videos Generated by MovieGen

BibTeX


@article{zhong2025specmaskfoley,
  title={SpecMaskFoley: Steering Pretrained Spectral Masked Generative Transformer Toward Synchronized Video-to-audio Synthesis via ControlNet},
  author={Zhong, Zhi and Takahashi, Akira and Cui, Shuyang and Toyama, Keisuke and Takahashi, Shusuke and Mitsufuji, Yuki},
  journal={arXiv preprint arXiv:2505.16195},
  year={2025}
}

SpecMaskFoley: Steering Pretrained Spectral Masked Generative Transformer Toward Synchronized Video-to-audio Synthesis via ControlNet

1. In-domain VGGSound Samples

Go to Example 1: Wolf howling (with few-step samples)

Go to Example 2: Striking a golf ball (with few-step samples)

Go to Example 3: Hitting a drum (with few-step samples)

Go to Example 4: Dog barking (with few-step samples)

Go to Example 5: Playing a string instrument (with few-step samples)

Go to Example 6: A group of people playing tambourines (with few-step samples)

Jump to VisualEchoes, our previous work on joint audio-visual generative modelling