Demo: Extend Audio MAE toward Audio Restoration

This is the demo page for the paper "Extending Audio Masked Autoencoders toward Audio Restoration".

It is highly recommended to use headphones or earphones when listening to these samples.




Pretraining Datasets

We provide several samples from AudioSet and LibriTTS to briefly illustrate features of these datasets.


AudioSet

AudioSet contains around 2 million 10-second audio segments (some are shorter than 10 s) taken from YouTube, annotated with 527 diverse classes. AudioSet has been widely used in general audio representation learning.

Sample Sample 1 Sample 2 Sample 3 Sample 4 Sample 5 Sample 6 Sample 7 Sample 8
Audio








LibriTTS

LibriTTS is a large-scale corpus of English audiobook recordings, split at sentence breaks.

Sample Sample 1 Sample 2 Sample 3 Sample 4 Sample 5 Sample 6 Sample 7 Sample 8
Audio








Audio Classification

We provide several test files from AudioSet and SPCv2 to demonstrate how our models behave on these classification tasks.


Speech Commands V2 (SPCv2)

The dataset presents a 35-class single-label speech command recognition task. The top-1 accuracy on the test set is 97.8% for both LibriTTS and AudioSet pretraining.

Label Backward Cat Dog Eight Five Go Happy Learn
Test Audio







Prediction Backward Cat Dog Eight Five Go Happy Learn

AudioSet

AudioSet presents a 527-class multi-label task covering various sound sources, events, and scenarios. Models are fine-tuned on AudioSet under two settings: AudioSet-2M (the union of the unbalanced and balanced subsets) and AudioSet-20k (the balanced subset alone). We use thresholds optimized on the test set itself to produce multi-label decisions.
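To make the decision rule concrete, here is a minimal sketch (not the exact code used in the paper) of how per-class sigmoid scores can be turned into a multi-label prediction with per-class thresholds; the class names, scores, and threshold values below are made up for illustration.

```python
import numpy as np

def multilabel_decision(probs, thresholds, class_names):
    """Predict every class whose sigmoid score reaches its threshold.

    probs and thresholds are arrays of shape (num_classes,); in the
    AudioSet setting num_classes would be 527 and the thresholds would
    be the ones optimized on the test set.
    """
    selected = probs >= thresholds
    return [name for name, keep in zip(class_names, selected) if keep]

# Toy usage with made-up numbers (not actual model outputs):
class_names = ["Speech", "Music", "Dog", "Siren"]
probs = np.array([0.91, 0.40, 0.05, 0.70])
thresholds = np.array([0.50, 0.45, 0.30, 0.60])
print(multilabel_decision(probs, thresholds, class_names))  # ['Speech', 'Siren']
```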

Sample Sample 1 Sample 2 Sample 3 Sample 4 Sample 5 Sample 6 Sample 7 Sample 8
Labels Animal;"Domestic animals, pets";Dog;Squeak Hammer Speech;Female singing;Child singing;Music Speech;"Walk, footsteps";Music Music;Theme music;Christmas music Speech;Animal;Horse;"Neigh, whinny";Music Thunderstorm;Thunder;Rain;Rain on surface Emergency vehicle;Ambulance (siren);Siren
Audio







AS-2M Prediction Whimper;Animal;"Domestic animals, pets";Dog;Whimper (dog) Hammer Speech;"Child speech, kid speaking";Whimper;Children playing;"Inside, large room or hall";"Outside, urban or manmade" Speech;Music Music;New-age music;Background music;Theme music;Sad music;Tender music Speech;Animal;Horse;"Neigh, whinny";Music Thunderstorm;Thunder;Rain;Raindrop;Rain on surface Motor vehicle (road);Emergency vehicle;Ambulance (siren);"Fire engine, fire truck (siren)";Siren
AS-20K Prediction Whimper;Theremin No tag is predicted Speech;"Child speech, kid speaking";Whimper;Children playing;Music;"Inside, large room or hall";"Outside, urban or manmade" Music;Mandolin Music;Background music;Theme music;Soundtrack music;Sad music Speech;Horse;"Neigh, whinny";Music Thunderstorm;Thunder;Rain;Raindrop;Rain on surface;Rustle;White noise Vehicle;Motor vehicle (road);"Vehicle horn, car horn, honking";Truck;Emergency vehicle;Ambulance (siren);Siren

Speech Enhancement: Valentini's Dataset

We provide several test files from the standard Valentini dataset, whose test set consists of 824 noisy utterances without reverberation.


From-scratch vs. Pretrained

Audio samples for Table 2 in the paper.

Pretrain From-scratch LibriTTS AudioSet
Method Noisy Vanilla ViT-AE Additive Multiplicative Vanilla Additive Multiplicative Vanilla Additive Multiplicative Vocoder Oracle Clean
p232_005











p232_010











p257_003











p257_008












ViT-AE-iSTFT vs. Mel-to-mel ViT-AE

Audio samples for Table 3 in the paper.

Pretrain From-scratch LibriTTS AudioSet
Method Noisy ViT decoder with STFT input ViT-AE-iSTFT ViT-AE-iSTFT ViT-AE-iSTFT Multiplicative ViT-AE Clean
p232_005






p232_010






p257_003






p257_008







Comparison with Existing Methods

Audio samples for Table 4 in the paper. As mentioned in Sec. 4.1 of the paper, our system operates at a sampling rate of 22.05 kHz for compatibility with the vocoder. We therefore also offer resampled ViT-AE audio samples so that our models can be compared with the existing diffusion models.
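As an illustration of this resampling step, here is a minimal sketch (not the exact pipeline of the paper; the file names are placeholders and torchaudio is assumed to be available):

```python
import torchaudio
import torchaudio.functional as F

# Load a 22.05 kHz enhanced output (placeholder file name) and resample
# it to 16 kHz so it can be compared with the diffusion baselines.
waveform, sr = torchaudio.load("p232_005_vitae_22k.wav")  # sr == 22050
waveform_16k = F.resample(waveform, orig_freq=sr, new_freq=16000)
torchaudio.save("p232_005_vitae_16k.wav", waveform_16k, 16000)
```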

Please note that these state-of-the-art diffusion models do not work for other tasks such as classification. We leave improving the objective scores of ViT-AE to future work.

Comment ViT-AEs (Pretrained on AudioSet) Diffusion
Method Noisy ViT-AE-iSTFT (ours) Multiplicative ViT-AE (ours) Vocoder Oracle UNIVERSE (unofficial) SGMSE+ Clean
p232_005 22k



- -
p232_005 16k






p257_003 22k



- -
p257_003 16k







Speech Enhancement: Out-of-domain DAPS Dataset

Audio samples for Table 5 in the paper.

It is worth noting that the Multiplicative ViT-AE pretrained on AudioSet is worse than UNIVERSE in Table 4 but better in Table 5, showing its ability to generalize to out-of-domain distortions.

Comment ViT-AEs (Pretrained on AudioSet) Diffusion
Method Noisy ViT-AE-iSTFT (ours) Multiplicative ViT-AE (ours) Vocoder Oracle UNIVERSE (unofficial) SGMSE+ Clean
f10_script5_ipad_balcony1 22k



- -
f10_script5_ipad_balcony1 16k






f10_script5_ipadflat_confroom1 22k



- -
f10_script5_ipadflat_confroom1 16k







Bonus: Music Bandwidth Extension

Bandwidth extension (BWE) is also an important restoration task. As mentioned in the paper, pretraining on AudioSet makes the ViT-AE compatible with non-speech tasks. However, we want to mention that there is no widely adopted benchmark for music BWE, so we defined our own experimental setting, in which a narrowband input at a sampling rate of 11 kHz is converted to a sampling rate of 22 kHz.

We decided to mention this experiment only on this demo page as a bonus section. There are 8 sample tracks in the demo, each of which is 30 seconds long.

We report the log spectral distance (LSD) scores of the samples here: input = 4.02, UNet (our training of [6]) with HiFi-GAN = 2.22, additive ViT-AE (scratch) = 0.96, additive ViT-AE (AudioSet pretrain, proposed) = 0.89 (best).
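For reference, here is a minimal sketch of how LSD can be computed (one common definition from the bandwidth-extension literature; the STFT parameters and file names are assumptions, not necessarily those used in the paper):

```python
import numpy as np
import librosa

def log_spectral_distance(ref, est, n_fft=2048, hop_length=512, eps=1e-10):
    """Frame-averaged log-spectral distance between two waveforms (lower is better)."""
    # Align lengths, then compute power spectrograms of reference and estimate.
    n = min(len(ref), len(est))
    S_ref = np.abs(librosa.stft(ref[:n], n_fft=n_fft, hop_length=hop_length)) ** 2
    S_est = np.abs(librosa.stft(est[:n], n_fft=n_fft, hop_length=hop_length)) ** 2
    diff = np.log10(S_ref + eps) - np.log10(S_est + eps)
    # RMS over frequency bins, then mean over time frames.
    return float(np.mean(np.sqrt(np.mean(diff ** 2, axis=0))))

# Usage (placeholder file names):
# ref, _ = librosa.load("track_22k_original.wav", sr=22050)
# est, _ = librosa.load("track_22k_bwe_output.wav", sr=22050)
# print(log_spectral_distance(ref, est))
```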

Track in FMA 001381 011019 013706 014653 076363 085307 110743 113262
11k input







UNet [6] + HiFi-GAN







Additive ViT-AE (from scratch)







Additive ViT-AE (AudioSet pretrain, ours)