This is the demo page for the paper "Extending Audio Masked Autoencoders toward Audio Restoration".
We highly recommend using headphones or earphones when listening to these samples.
We provide several samples from AudioSet and LibriTTS to briefly illustrate features of these datasets.
AudioSet contains around 2 million 10-second audio segments (some shorter than 10 s) taken from YouTube, annotated with 527 diverse classes. It has been widely used in general audio representation learning.
Sample | Sample 1 | Sample 2 | Sample 3 | Sample 4 | Sample 5 | Sample 6 | Sample 7 | Sample 8 |
Audio |
LibriTTS is a large-scale corpus of English audiobook speech segmented at sentence breaks.
Sample | Sample 1 | Sample 2 | Sample 3 | Sample 4 | Sample 5 | Sample 6 | Sample 7 | Sample 8 |
Audio |
We provide several test files from AudioSet and SPCv2 to demonstrate how our models behave on these classification tasks.
SPCv2 presents a 35-class single-label speech command recognition task. The top-1 accuracy on the test set is 97.8% for both LibriTTS and AudioSet pretraining.
Label | Backward | Cat | Dog | Eight | Five | Go | Happy | Learn |
Test Audio | ||||||||
Prediction | Backward | Cat | Dog | Eight | Five | Go | Happy | Learn |
AudioSet presents a 527-class multi-label task covering various sound sources, events, and scenarios. Models are fine-tuned on AudioSet under two settings: AudioSet-2M (the union of the unbalanced and balanced subsets) and AudioSet-20k (the balanced subset alone). We use thresholds optimized on the test set itself to produce the multi-label decisions (a minimal sketch of this thresholding step follows the table below).
Sample | Sample 1 | Sample 2 | Sample 3 | Sample 4 | Sample 5 | Sample 6 | Sample 7 | Sample 8 |
Labels | Animal;"Domestic animals, pets";Dog;Squeak | Hammer | Speech;Female singing;Child singing;Music | Speech;"Walk, footsteps";Music | Music;Theme music;Christmas music | Speech;Animal;Horse;"Neigh, whinny";Music | Thunderstorm;Thunder;Rain;Rain on surface | Emergency vehicle;Ambulance (siren);Siren |
Audio | ||||||||
AS-2M Prediction | Whimper;Animal;"Domestic animals, pets";Dog;Whimper (dog) | Hammer | Speech;"Child speech, kid speaking";Whimper;Children playing;"Inside, large room or hall";"Outside, urban or manmade" | Speech;Music | Music;New-age music;Background music;Theme music;Sad music;Tender music | Speech;Animal;Horse;"Neigh, whinny";Music | Thunderstorm;Thunder;Rain;Raindrop;Rain on surface | Motor vehicle (road);Emergency vehicle;Ambulance (siren);"Fire engine, fire truck (siren)";Siren |
AS-20K Prediction | Whimper;Theremin | No tag is predicted | Speech;"Child speech, kid speaking";Whimper;Children playing;Music;"Inside, large room or hall";"Outside, urban or manmade" | Music;Mandolin | Music;Background music;Theme music;Soundtrack music;Sad music | Speech;Horse;"Neigh, whinny";Music | Thunderstorm;Thunder;Rain;Raindrop;Rain on surface;Rustle;White noise | Vehicle;Motor vehicle (road);"Vehicle horn, car horn, honking";Truck;Emergency vehicle;Ambulance (siren);Siren |
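To make the decision rule above concrete, the following is a minimal sketch of how sigmoid scores can be turned into multi-label predictions with decision thresholds. The function name, the per-class threshold array, and the placeholder label names are illustrative assumptions, not the exact code used for this demo.

```python
import numpy as np

def multilabel_decisions(probs, thresholds, class_names):
    """Turn per-class sigmoid scores into a list of predicted labels.

    probs:       (num_classes,) sigmoid outputs of the finetuned model
    thresholds:  (num_classes,) decision thresholds (in the demo these are
                 optimized on the test set itself; a single global value
                 would also work)
    class_names: the 527 AudioSet label strings
    """
    active = probs >= thresholds
    return [name for name, is_on in zip(class_names, active) if is_on]

# Illustrative usage with random scores and a global threshold of 0.5.
rng = np.random.default_rng(0)
class_names = [f"class_{i}" for i in range(527)]  # placeholder label names
probs = rng.random(527)
predictions = multilabel_decisions(probs, np.full(527, 0.5), class_names)
print(predictions)
```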
We provide several test files from the standard Valentini dataset, which provides a test set of 824 noisy speech utterances without reverberation.
Audio samples for Table 2 in the paper.
Pretrain | From-scratch | LibriTTS | AudioSet | |||||||||
Method | Noisy | Vanilla ViT-AE | Additive | Multiplicative | Vanilla | Additive | Multiplicative | Vanilla | Additive | Multiplicative | Vocoder Oracle | Clean |
p232_005 | ||||||||||||
p232_010 | ||||||||||||
p257_003 | ||||||||||||
p257_008 |
Audio samples for Table 3 in the paper.
Pretrain | From-scratch | LibriTTS | AudioSet | ||||
Method | Noisy | ViT decoder with STFT input | ViT-AE-iSTFT | ViT-AE-iSTFT | ViT-AE-iSTFT | Multiplicative ViT-AE | Clean |
p232_005 | |||||||
p232_010 | |||||||
p257_003 | |||||||
p257_008 |
Audio samples for Table 4 in the paper. As mentioned in Sec. 4.1 of the paper, our system operates at a sampling rate of 22.05 kHz for compatibility with the vocoder. We therefore offer resampled audio samples of ViT-AE so that our models can be compared with existing diffusion models.
Please note that these state-of-the-art diffusion models do not work for other tasks such as classification. We leave the improvement of ViT-AE’s objective scores to future work.
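The resampled clips can be produced with an off-the-shelf resampler; below is a minimal sketch using librosa and soundfile. The file names are hypothetical, and this is not necessarily the exact resampling procedure used for the demo.

```python
import librosa
import soundfile as sf

# Hypothetical file name; the demo page serves the actual enhanced clips.
enhanced_22k, sr = librosa.load("p232_005_enhanced.wav", sr=None)
assert sr == 22050  # ViT-AE outputs 22.05 kHz audio for vocoder compatibility

# Downsample to 16 kHz so the output can be compared against the
# 16 kHz clips produced by the diffusion baselines.
enhanced_16k = librosa.resample(enhanced_22k, orig_sr=sr, target_sr=16000)
sf.write("p232_005_enhanced_16k.wav", enhanced_16k, samplerate=16000)
```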
Comment | ViT-AEs pretrained on AudioSet | Diffusion | |||||
Method | Noisy | ViT-AE-iSTFT (ours) | Multiplicative ViT-AE (ours) | Vocoder Oracle | UNIVERSE (unofficial) | SGMSE+ | Clean |
p232_005 22k | - | - | |||||
p232_005 16k | |||||||
p257_003 22k | - | - | |||||
p257_003 16k |
Audio samples for Table 5 in the paper.
It is worth noting that the Multiplicative ViT-AE pretrained on AudioSet is worse than UNIVERSE in Table 4 but better in Table 5, showing its ability to generalize to out-of-domain distortions.
Comment | ViT-AEs pretrained on AudioSet | Diffusion | |||||
Method | Noisy | ViT-AE-iSTFT (ours) | Multiplicative ViT-AE (ours) | Vocoder Oracle | UNIVERSE (unofficial) | SGMSE+ | Clean |
f10_script5_ipad_balcony1 22k | - | - | |||||
f10_script5_ipad_balcony1 16k | |||||||
f10_script5_ipadflat_confroom1 22k | - | - | |||||
f10_script5_ipadflat_confroom1 16k |
Bandwidth extension (BWE) is also an important restoration task. As mentioned in the paper, pretraining with AudioSet makes the ViT-AE compatible with non-speech tasks. However, we note that there is no widely adopted benchmark for music BWE, so we defined our own experimental setting, in which a narrowband input at a sampling rate of 11 kHz is converted to a sampling rate of 22 kHz.
We mention this experiment only on this demo page as a bonus section. There are 8 sample tracks in the demo, each of which is 30 seconds long.
We report the log spectral distance (LSD) scores of the samples here: input = 4.02, UNet (our training of [6]) with HiFi-GAN = 2.22, additive ViT-AE (scratch) = 0.96, additive ViT-AE (AudioSet pretrain, proposed) = 0.89 (best).
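For reference, here is a minimal sketch of one common log spectral distance computation: the frame-wise RMS difference of log power spectra, averaged over frames. The STFT parameters and the exact LSD variant are assumptions and may differ from the configuration used in the paper.

```python
import numpy as np
import librosa

def log_spectral_distance(reference, estimate, n_fft=2048, hop_length=512, eps=1e-8):
    """LSD between a reference and an estimated waveform (lower is better)."""
    n = min(len(reference), len(estimate))
    ref_power = np.abs(librosa.stft(reference[:n], n_fft=n_fft, hop_length=hop_length)) ** 2
    est_power = np.abs(librosa.stft(estimate[:n], n_fft=n_fft, hop_length=hop_length)) ** 2
    # Difference of log power spectra, RMS over frequency, mean over frames.
    diff = np.log10(ref_power + eps) - np.log10(est_power + eps)
    return float(np.mean(np.sqrt(np.mean(diff ** 2, axis=0))))
```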
Track in FMA | 001381 | 011019 | 013706 | 014653 | 076363 | 085307 | 110743 | 113262 |
11k input | ||||||||
UNet [6] + HiFi-GAN | ||||||||
Additive ViT-AE (from scratch) | ||||||||
Additive ViT-AE (ours) |