This is the demo page for the paper "Extending Audio Masked Autoencoders toward Audio Restoration".
We highly recommend using headphones or earphones when listening to these samples.
We provide several samples from AudioSet and LibriTTS to briefly illustrate features of these datasets.
AudioSet contains around 2 million 10-second audio segments (some shorter than 10 s) taken from YouTube, annotated with 527 diverse classes. It has been widely used in general audio representation learning.
Sample | Sample 1 | Sample 2 | Sample 3 | Sample 4 | Sample 5 | Sample 6 | Sample 7 | Sample 8 |
Audio |
LibriTTS is a large-scale corpus of English audiobook speech segmented at sentence breaks.
Sample | Sample 1 | Sample 2 | Sample 3 | Sample 4 | Sample 5 | Sample 6 | Sample 7 | Sample 8 |
Audio |
We provide several test files from AudioSet and SPCv2 to demonstrate how our models behave on these classification tasks.
SPCv2 presents a 35-class single-label speech command recognition task. The top-1 accuracy on the test set is 97.8% for both LibriTTS and AudioSet pretraining.
Label | Backward | Cat | Dog | Eight | Five | Go | Happy | Learn |
Test Audio | ||||||||
Prediction | Backward | Cat | Dog | Eight | Five | Go | Happy | Learn |
AudioSet presents a 527-class multi-label task covering various sound sources, events, and scenarios. Models are fine-tuned on AudioSet under two settings: AudioSet-2M (the union of the unbalanced and balanced subsets) and AudioSet-20k (the balanced subset alone). We use thresholds optimized on the test set itself to produce the multi-label decisions (a minimal sketch of this thresholding step follows the table below).
Sample | Sample 1 | Sample 2 | Sample 3 | Sample 4 | Sample 5 | Sample 6 | Sample 7 | Sample 8 |
Labels | Animal;"Domestic animals, pets";Dog;Squeak | Hammer | Speech;Female singing;Child singing;Music | Speech;"Walk, footsteps";Music | Music;Theme music;Christmas music | Speech;Animal;Horse;"Neigh, whinny";Music | Thunderstorm;Thunder;Rain;Rain on surface | Emergency vehicle;Ambulance (siren);Siren |
Audio | ||||||||
AS-2M Prediction | Whimper;Animal;"Domestic animals, pets";Dog;Whimper (dog) | Hammer | Speech;"Child speech, kid speaking";Whimper;Children playing;"Inside, large room or hall";"Outside, urban or manmade" | Speech;Music | Music;New-age music;Background music;Theme music;Sad music;Tender music | Speech;Animal;Horse;"Neigh, whinny";Music | Thunderstorm;Thunder;Rain;Raindrop;Rain on surface | Motor vehicle (road);Emergency vehicle;Ambulance (siren);"Fire engine, fire truck (siren)";Siren |
AS-20K Prediction | Whimper;Theremin | No tag is predicted | Speech;"Child speech, kid speaking";Whimper;Children playing;Music;"Inside, large room or hall";"Outside, urban or manmade" | Music;Mandolin | Music;Background music;Theme music;Soundtrack music;Sad music | Speech;Horse;"Neigh, whinny";Music | Thunderstorm;Thunder;Rain;Raindrop;Rain on surface;Rustle;White noise | Vehicle;Motor vehicle (road);"Vehicle horn, car horn, honking";Truck;Emergency vehicle;Ambulance (siren);Siren |
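To make the decision rule above concrete, the following is a minimal sketch of how sigmoid scores can be turned into multi-label predictions with decision thresholds. The function name, the per-class threshold array, and the placeholder label names are illustrative assumptions, not the exact code used for this demo.

```python
import numpy as np

def multilabel_decisions(probs, thresholds, class_names):
    """Turn per-class sigmoid scores into a list of predicted labels.

    probs:       (num_classes,) sigmoid outputs of the finetuned model
    thresholds:  (num_classes,) decision thresholds (in the demo these are
                 optimized on the test set itself; a single global value
                 would also work)
    class_names: the 527 AudioSet label strings
    """
    active = probs >= thresholds
    return [name for name, is_on in zip(class_names, active) if is_on]

# Illustrative usage with random scores and a global threshold of 0.5.
rng = np.random.default_rng(0)
class_names = [f"class_{i}" for i in range(527)]  # placeholder label names
probs = rng.random(527)
predictions = multilabel_decisions(probs, np.full(527, 0.5), class_names)
print(predictions)
```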
We provide several test files from the standard Valentini dataset, which provides a test set of 824 noisy speech utterances without reverberation.
Audio samples for Table 2 in the paper.
Pretrain | From-scratch | LibriTTS | AudioSet | |||||||||
Method | Noisy | Vanilla ViT-AE | Additive | Multiplicative | Vanilla | Additive | Multiplicative | Vanilla | Additive | Multiplicative | Vocoder Oracle | Clean |
p232_005 | ||||||||||||
p232_010 | ||||||||||||
p257_003 | ||||||||||||
p257_008 |
Audio samples for Table 3 in the paper.
Pretrain | From-scratch | LibriTTS | AudioSet | ||||
Method | Noisy | ViT decoder with STFT input | ViT-AE-iSTFT | ViT-AE-iSTFT | ViT-AE-iSTFT | Multiplicative ViT-AE | Clean |
p232_005 | |||||||
p232_010 | |||||||
p257_003 | |||||||
p257_008 |
Audio samples for Table 4 in the paper. As mentioned in Sec. 4.1 of the paper, our system operates at a sampling rate of 22.05 kHz for compatibility with the vocoder. We therefore offer resampled audio samples of ViT-AE so that our models can be compared with existing diffusion models.
Please note that these state-of-the-art diffusion models do not work for other tasks such as classification. We leave the improvement of ViT-AE’s objective scores to future work.
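The resampled clips can be produced with an off-the-shelf resampler; below is a minimal sketch using librosa and soundfile. The file names are hypothetical, and this is not necessarily the exact resampling procedure used for the demo.

```python
import librosa
import soundfile as sf

# Hypothetical file name; the demo page serves the actual enhanced clips.
enhanced_22k, sr = librosa.load("p232_005_enhanced.wav", sr=None)
assert sr == 22050  # ViT-AE outputs 22.05 kHz audio for vocoder compatibility

# Downsample to 16 kHz so the output can be compared against the
# 16 kHz clips produced by the diffusion baselines.
enhanced_16k = librosa.resample(enhanced_22k, orig_sr=sr, target_sr=16000)
sf.write("p232_005_enhanced_16k.wav", enhanced_16k, samplerate=16000)
```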
Comment | ViT-AEs pretrained on AudioSet | Diffusion | |||||
Method | Noisy | ViT-AE-iSTFT (ours) | Multiplicative ViT-AE (ours) | Vocoder Oracle | UNIVERSE (unofficial) | SGMSE+ | Clean |
p232_005 22k | - | - | |||||
p232_005 16k | |||||||
p257_003 22k | - | - | |||||
p257_003 16k |
Audio samples for Table 5 in the paper.
It is worth noting that the Multiplicative ViT-AE pretrained on AudioSet is worse than UNIVERSE in Table 4 but better in Table 5, showing its ability to generalize to out-of-domain distortions.
Comment | ViT-AEs pretrained on AudioSet | Diffusion | |||||
Method | Noisy | ViT-AE-iSTFT (ours) | Multiplicative ViT-AE (ours) | Vocoder Oracle | UNIVERSE (unofficial) | SGMSE+ | Clean |
f10_script5_ipad_balcony1 22k | - | - | |||||
f10_script5_ipad_balcony1 16k | |||||||
f10_script5_ipadflat_confroom1 22k | - | - | |||||
f10_script5_ipadflat_confroom1 16k |
Bandwidth extension (BWE) is also an important restoration task. As mentioned in the paper, pretraining with AudioSet makes the ViT-AE compatible with non-speech tasks. However, we note that there is no widely adopted benchmark for music BWE, so we defined our own experimental setting, in which a narrowband input at a sampling rate of 11 kHz is converted to a sampling rate of 22 kHz.
We mention this experiment only on this demo page as a bonus section. There are 8 sample tracks in the demo, each of which is 30 seconds long.
We report the log spectral distance (LSD) scores of the samples here: input = 4.02, UNet (our training of [6]) with HiFi-GAN = 2.22, additive ViT-AE (scratch) = 0.96, additive ViT-AE (AudioSet pretrain, proposed) = 0.89 (best).
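For reference, here is a minimal sketch of one common log spectral distance computation: the frame-wise RMS difference of log power spectra, averaged over frames. The STFT parameters and the exact LSD variant are assumptions and may differ from the configuration used in the paper.

```python
import numpy as np
import librosa

def log_spectral_distance(reference, estimate, n_fft=2048, hop_length=512, eps=1e-8):
    """LSD between a reference and an estimated waveform (lower is better)."""
    n = min(len(reference), len(estimate))
    ref_power = np.abs(librosa.stft(reference[:n], n_fft=n_fft, hop_length=hop_length)) ** 2
    est_power = np.abs(librosa.stft(estimate[:n], n_fft=n_fft, hop_length=hop_length)) ** 2
    # Difference of log power spectra, RMS over frequency, mean over frames.
    diff = np.log10(ref_power + eps) - np.log10(est_power + eps)
    return float(np.mean(np.sqrt(np.mean(diff ** 2, axis=0))))
```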
Track in FMA | 001381 | 011019 | 013706 | 014653 | 076363 | 085307 | 110743 | 113262 |
11k input | ||||||||
UNet [6] + HiFi-GAN | ||||||||
Additive ViT-AE (from scratch) | ||||||||
Additive ViT-AE (ours) |