SpecMaskGIT: Masked Generative Modeling of Audio Spectrograms for Efficient Audio Synthesis and Beyond

¹Queen Mary University of London, ²Sony Group Corporation, ³Sony AI
*Equal contribution, #Work done during an internship at Sony

Abstract

Recent advances in generative models that synthesize audio clips iteratively have brought great success to text-to-audio synthesis (TTA), but at the cost of slow synthesis and heavy computation. Although there have been attempts to accelerate the iterative procedure, high-quality TTA systems remain inefficient because of the hundreds of iterations required at inference and the large number of model parameters. To address these challenges, we propose SpecMaskGIT, a lightweight, efficient yet effective TTA model based on masked generative modeling of spectrograms. First, SpecMaskGIT synthesizes a realistic 10-second audio clip in fewer than 16 iterations, an order of magnitude fewer than previous iterative TTA methods. As a discrete model, SpecMaskGIT outperforms larger VQ-Diffusion and auto-regressive models on the TTA benchmark, while running in real time with only 4 CPU cores or even 30x faster with a GPU. Next, because it is built on a latent space of Mel-spectrograms, SpecMaskGIT has a wider range of applications (e.g., zero-shot bandwidth extension) than similar methods built on the latent wave domain. Moreover, we interpret SpecMaskGIT as a generative extension of previous discriminative audio masked Transformers, and shed light on its potential for audio representation learning. We hope our work inspires the exploration of masked audio modeling in more diverse scenarios.
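The iterative synthesis described in the abstract can be illustrated with a MaskGIT-style parallel decoding loop. The following is a minimal sketch, not the authors' implementation: it assumes a cosine unmasking schedule (used by the original MaskGIT for images; this page does not state which schedule SpecMaskGIT uses), a hypothetical sequence of 256 tokens with a 1024-entry codebook, and a random `toy_predictor` standing in for the text-conditioned Transformer.

```python
import math
import random

MASK = -1  # hypothetical id for the [MASK] token


def cosine_mask_ratio(step, total_steps):
    """Fraction of tokens that stay masked after `step` of `total_steps` iterations."""
    return math.cos(0.5 * math.pi * step / total_steps)


def toy_predictor(tokens, vocab_size, rng):
    """Stand-in for the Transformer: one (token, confidence) guess per position."""
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def iterative_decode(num_tokens, vocab_size, total_steps=16, seed=0):
    """Start fully masked and commit the most confident predictions at each step."""
    rng = random.Random(seed)
    tokens = [MASK] * num_tokens
    for step in range(1, total_steps + 1):
        preds = toy_predictor(tokens, vocab_size, rng)
        # Number of tokens that must remain masked once this step is done.
        keep_masked = int(cosine_mask_ratio(step, total_steps) * num_tokens)
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        # Rank the still-masked positions, most confident first.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        num_to_fill = max(0, len(masked) - keep_masked)
        for i in masked[:num_to_fill]:
            tokens[i] = preds[i][0]
    return tokens


tokens = iterative_decode(num_tokens=256, vocab_size=1024, total_steps=16)
```

With a cosine schedule only a few tokens are committed in the early steps and most are committed near the end, so every position is filled after the final iteration. The resulting token grid would then be decoded to a Mel-spectrogram and passed to a vocoder.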

Text-to-audio Synthesis

Text prompts | SpecMaskGIT (AS w/ BigVSAN) | SpecMaskGIT (AS w/ HiFiGAN) | SpecMaskGIT (ACFT w/ HiFiGAN) | TANGO
The sound of monster and creatures.
The sound of foley footsteps on the concrete street.
The sound of foley footsteps on the winter street.
Repeated gunfire and screaming in the background.
Alien robot android is speaking.
Birds chirping as well as some clanking.
A man is speaking under the water.
A man is speaking in a huge room.
A man is speaking in a small room.
Chopping tomatos on a wooden table.
Chopping meat on a wooden table.
Chopping potatos on a metal table.
Typing on a typewriter.
The steady crashing of waves against the shore,high fidelity, the whooshing sound of water receding back into the ocean.

Zero-shot Bandwidth Extension (BWE) and Time Inpainting

Unprocessed | SpecMaskGIT AS HiFiGAN - w/ LFR | Ground truth

Unprocessed | SpecMaskGIT AS HiFiGAN | Ground truth

Representation Learning: Linear Probing

ROC-AUC and mAP for multi-label music tagging on the MagnaTagATune dataset. Jukebox and SpecMaskGIT give the top-2 results on both metrics.

Model               ROC-AUC  mAP
CLMR                89.4     36.1
Data2vec-music      90.0     36.2
HuBERT-music        90.2     37.7
MusiCNN             90.6     38.3
SlowFast-NFNet-F0   -        39.5
MERT-330M           91.3     40.2
MULE-contrastive    91.4     40.4
Jukebox             91.5     41.4
SpecMaskGIT (ours)  91.5     40.5
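For readers unfamiliar with the two metrics in the table, the sketch below shows one way they can be computed from the per-tag scores of a linear probe. It assumes both metrics are macro-averaged over tags, which is common for MagnaTagATune but not stated on this page. The function names and the tiny example data are hypothetical.

```python
def average_precision(labels, scores):
    """AP for one tag: mean precision at the rank of each positive clip."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / hits if hits else 0.0


def roc_auc(labels, scores):
    """ROC-AUC for one tag: probability that a positive outranks a negative.

    Assumes the tag has at least one positive and one negative clip.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def macro_average(metric, label_matrix, score_matrix):
    """Average a per-tag metric over all tags (columns of the matrices)."""
    num_tags = len(label_matrix[0])
    values = []
    for t in range(num_tags):
        labels = [row[t] for row in label_matrix]
        scores = [row[t] for row in score_matrix]
        values.append(metric(labels, scores))
    return sum(values) / num_tags


# Hypothetical probe outputs: 4 clips (rows) x 2 tags (columns).
y_true = [[1, 0], [0, 1], [1, 1], [0, 0]]
y_score = [[0.9, 0.2], [0.8, 0.7], [0.7, 0.6], [0.1, 0.3]]
mean_ap = macro_average(average_precision, y_true, y_score)
mean_auc = macro_average(roc_auc, y_true, y_score)
```

In linear probing, the scores would come from a single linear layer trained on frozen SpecMaskGIT features. The table reports both metrics scaled to percentages.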
