Comparative Study on Noise-Augmented Training and its Effect on Adversarial Robustness in ASR Systems

Reassessing Noise Augmentation Methods in the Context of Adversarial Speech

This project is maintained by matiuste

Adversarial Example Demo

Supplementary material containing a selection of benign, adversarial, and noisy data employed in our paper.

For each sample, we include the word error rate (WER) as an accuracy metric and the segmental signal-to-noise ratio (SNRseg) as a noise-quality metric. An SNRseg above 0 dB indicates that the signal is stronger than the noise. All samples are sourced from the LibriSpeech corpus.
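
For readers who want to reproduce the metrics, both measures can be sketched roughly as follows. This is a minimal NumPy illustration (frame length and normalization details are assumptions), not the exact implementation used in our paper:

```python
import numpy as np

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent, via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    d = np.zeros((len(ref) + 1, len(hyp) + 1), dtype=int)
    d[:, 0] = np.arange(len(ref) + 1)   # all deletions
    d[0, :] = np.arange(len(hyp) + 1)   # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1, j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i, j] = min(sub, d[i - 1, j] + 1, d[i, j - 1] + 1)
    return 100.0 * d[len(ref), len(hyp)] / len(ref)

def snrseg(clean: np.ndarray, noisy: np.ndarray, frame_len: int = 256) -> float:
    """Segmental SNR: the mean of the per-frame SNRs, in dB."""
    noise = noisy - clean
    eps = 1e-10  # guards against log(0) on silent frames
    snrs = []
    for k in range(len(clean) // frame_len):
        s = clean[k * frame_len:(k + 1) * frame_len]
        n = noise[k * frame_len:(k + 1) * frame_len]
        snrs.append(10 * np.log10((np.sum(s ** 2) + eps) / (np.sum(n ** 2) + eps)))
    return float(np.mean(snrs))
```

A WER of 0 means the hypothesis matches the reference exactly; an SNRseg above 0 dB means that, averaged over frames, the signal carries more energy than the noise.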

As outlined in our paper, we investigated three training regimes, resulting in three different models:

  1. Baseline (no augmentation): This model serves as our control; it is trained on the clean dataset without any form of augmentation and establishes the standard level of performance under ideal acoustic conditions.

  2. Augmentation with speed variations: During the training of the second model, temporal variability was introduced into the training data by applying speed perturbations. These augmentations simulate natural variations in speech tempo, which can occur due to speaker differences or recording conditions.

  3. Augmentation with speed variations, background noise, and reverberation: The third model is trained with background noise and reverberation in addition to speed variations. This combination aims to mimic the more challenging and realistic acoustic environments that ASR systems encounter in real-world applications.
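
The speed and noise augmentations above can be sketched as follows. This is a minimal NumPy illustration; the resampling method, speed factors, and noise-mixing details are assumptions for clarity, not our exact training pipeline:

```python
import numpy as np

def speed_perturb(wave: np.ndarray, factor: float) -> np.ndarray:
    """Resample the waveform so that it plays `factor` times faster."""
    n_out = int(round(len(wave) / factor))
    new_t = np.linspace(0, len(wave) - 1, n_out)
    return np.interp(new_t, np.arange(len(wave)), wave)

def add_noise(wave: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix in background noise, scaled to hit a target SNR in dB."""
    noise = np.resize(noise, wave.shape)          # loop/crop noise to length
    p_sig = np.mean(wave ** 2)
    p_noise = np.mean(noise ** 2) + 1e-10
    scale = np.sqrt(p_sig / (p_noise * 10 ** (snr_db / 10)))
    return wave + scale * noise
```

Reverberation would additionally convolve the waveform with a room impulse response (e.g. `np.convolve(wave, rir)`), which is omitted here for brevity.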

The key takeaway from the following samples is that models trained with augmentation methods tend to be more robust against adversarial attacks. This robustness is demonstrated by two main observations:

First, the adversarial transcriptions produced by the augmented models deviate further from the attacker's target transcription, which is reflected in their higher WERs. Additionally, the lower SNRseg values for models trained with augmentation show that more noise is required to create effective adversarial examples against them, indicating a higher degree of robustness.

Please note that the adversarial samples are crafted per model. In the following, we report the WER and SNRseg of the adversarial sample crafted for each model, using the seq2seq model architecture described in our paper. However, we only play the adversarial example generated against model 3 (augmentation with speed variations, background noise, and reverberation).

Experiments for the C&W attack.
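
A toy sketch may help in reading these samples: the C&W (Carlini & Wagner) formulation minimizes the perturbation's L2 norm plus a loss term that drives the model toward the target output, so a more robust model forces a larger (noisier, lower-SNRseg) perturbation. The linear "model" below is purely an illustrative stand-in, not the seq2seq ASR model from our paper:

```python
import numpy as np

# C&W-style objective on a toy classifier:
#   minimize  c * task_loss(x + delta, target) + ||delta||_2^2
rng = np.random.default_rng(0)
x = rng.standard_normal(16)        # stand-in for an audio feature frame
W = rng.standard_normal((4, 16))   # toy linear "model" with 4 output classes
target = 2                         # attacker's chosen target class
c = 5.0                            # trade-off: attack success vs. distortion

def loss_and_grad(delta):
    logits = W @ (x + delta)
    p = np.exp(logits - logits.max())
    p /= p.sum()
    task = -np.log(p[target] + 1e-12)              # cross-entropy to target
    grad = c * (W.T @ (p - np.eye(4)[target])) + 2.0 * delta
    return c * task + np.sum(delta ** 2), grad

delta = np.zeros(16)
loss_start, _ = loss_and_grad(delta)
for _ in range(2000):
    _, g = loss_and_grad(delta)
    delta -= 0.01 * g                              # plain gradient descent
loss_end, _ = loss_and_grad(delta)
```

The real attack replaces the linear map with the full ASR model and the class loss with a sequence loss toward the target transcription, but the tension it illustrates is the same: success on the task term costs perturbation energy.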

Sample 1
Benign transcription:               THEY OF COURSE MUST ALL BE ALTERED
Target Adversarial transcription:   LOOK AT THAT HE HELD OUT HIS HAND
Adversarial transcription model 1:  LOOK AT THAT HE HELD OUT HIS HAND
Adversarial transcription model 2:  LOOK AT THAT HE HELD OUT HIS HAND
Adversarial transcription model 3:  LOOK AT THAT HE'LL BE OUT OF IT

  Benign:                    Benign + Noise: SNRseg= -4.81
[1: WER=00.00],                 [1: WER=71.43]
[2: WER=00.00],                 [2: WER=57.14]
[3: WER=00.00],                 [3: WER=14.29]

C&W adversarial:
[1: WER=00.00, SNRseg=24.47]
[2: WER=00.00, SNRseg=18.84]
[3: WER=50.00, SNRseg=15.93]

Sample 2
Benign transcription:               TO THEIR SORROW THEY WERE SOON UNDECEIVED
Target Adversarial transcription:   ONE COULD HARDLY HOPE FOR ANY UPON SO DRY A DAY
Adversarial transcription model 1:  ONE COULD HARDLY HOPE FOR ANY UPON SO DRY A DAY
Adversarial transcription model 2:  ONE COULD HARDLY HOPE FOR ANY SLEEP A DAY
Adversarial transcription model 3:  ONE COULD HARDLY HOPE FOR ANY ONE WHO HAD

  Benign:                    Benign + Noise: SNRseg= 1.34
[1: WER=00.00],                 [1: WER=57.14]
[2: WER=00.00],                 [2: WER=00.00]
[3: WER=00.00],                 [3: WER=00.00]

C&W adversarial:
[1: WER=00.00, SNRseg=22.20]
[2: WER=27.27, SNRseg=17.04]
[3: WER=45.45, SNRseg=04.13]

Sample 3
Benign transcription:               BUT YOU KNOW MORE ABOUT THAT THAN I DO SIR
Target Adversarial transcription:   YES MY DEAR WATSON I HAVE SOLVED THE MYSTERY
Adversarial transcription model 1:  YES MY DEAR WATSON I HAVE SOLVED THE MYSTERY
Adversarial transcription model 2:  YES MY DEAR ROGER  I HAVE THE MYSTERY
Adversarial transcription model 3:  YES MY DEAR CHILD I AM AFRAID OF THE ELEMENTS

  Benign:                    Benign + Noise: SNRseg= -0.41
[1: WER=00.00],                 [1: WER=30.00]
[2: WER=00.00],                 [2: WER=10.00]
[3: WER=00.00],                 [3: WER=10.00]

C&W adversarial:
[1: WER=00.00, SNRseg=25.78]
[2: WER=22.22, SNRseg=23.94]
[3: WER=55.56, SNRseg=22.30]

Experiments for the Alzantot attack.
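
The Alzantot attack is a black-box genetic search: a population of slightly perturbed inputs is evolved via selection, crossover, and mutation toward the attacker's goal, using only the model's output scores. The sketch below illustrates the idea on a toy fitness function; the population size, mutation rate, and scoring function are illustrative assumptions, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(1)

def genetic_attack(x, fitness, pop_size=20, n_iter=100, eps=0.05, mut_p=0.1):
    """Black-box genetic search for a perturbed input maximising `fitness`."""
    pop = x + rng.uniform(-eps, eps, size=(pop_size, len(x)))
    pop[0] = x                                    # keep the clean input too
    for _ in range(n_iter):
        scores = np.array([fitness(p) for p in pop])
        elite = pop[np.argmax(scores)]            # elitism: best always survives
        probs = np.exp(scores - scores.max())     # fitness-proportional selection
        probs /= probs.sum()
        a = pop[rng.choice(pop_size, size=pop_size - 1, p=probs)]
        b = pop[rng.choice(pop_size, size=pop_size - 1, p=probs)]
        mask = rng.random(a.shape) < 0.5          # uniform crossover
        children = np.where(mask, a, b)
        mut = rng.random(children.shape) < mut_p  # random mutation
        children = children + mut * rng.uniform(-eps, eps, children.shape)
        pop = np.vstack([elite, children])
    scores = np.array([fitness(p) for p in pop])
    return pop[np.argmax(scores)]

# toy example: the "model score" is just closeness to a target pattern
x0 = np.zeros(8)
pattern = np.linspace(-0.2, 0.2, 8)
score = lambda p: -np.sum((p - pattern) ** 2)
best = genetic_attack(x0, score)
```

In the actual attack, `fitness` would query the ASR model for the score of the target transcription, so no gradients are needed.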

Sample 1
Benign transcription:               WHY FADE THESE CHILDREN OF THE SPRING
Adversarial transcription model 1:  LIFE THEY THESE CHILDREN OF THIS EARTH 
Adversarial transcription model 2:  WHY FAITH THESE CHILDREN OF THIS FREELY
Adversarial transcription model 3:  DON'T FEED THESE CHILDREN OF THE SPRING

Alzantot adversarial:
[1: WER=57.14, SNRseg=13.53]
[2: WER=42.86, SNRseg=13.54]
[3: WER=28.57, SNRseg=13.53]

Sample 2
Benign transcription:               THE CLOUD THEN SHEWD HIS GOLDEN HEAD AND HIS BRIGHT FORM EMERG'D
Adversarial transcription model 1:  THE CLOUD THEN SHOWED HE SCOLDED IN HIS RIGHT FORM ALONE
Adversarial transcription model 2:  THE CLOUD THEN SHOWED HE'S GOLDEN IN HIS RIGHT FORMERLY
Adversarial transcription model 3:  THE CLOUDS THEN SHEWED HIS GOLDEN HEAD AND HIS BRIGHT FOREARMS

Alzantot adversarial:
[1: WER=58.33, SNRseg=11.57]
[2: WER=58.33, SNRseg=11.58]
[3: WER=33.33, SNRseg=11.56]

Experiments for the Kenansville attack.
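
The Kenansville attack needs no model access at all: it is a signal-processing attack that discards the lowest-energy spectral components of the utterance, with the amount of discarded energy controlled by a target SNR (a higher SNR means a subtler change). A minimal DFT-based sketch, with details simplified:

```python
import numpy as np

def kenansville_dft(wave: np.ndarray, snr_db: float) -> np.ndarray:
    """Zero out the weakest DFT bins until the removed energy reaches the
    budget implied by the requested SNR, then reconstruct the waveform."""
    spec = np.fft.rfft(wave)
    power = np.abs(spec) ** 2
    budget = np.sum(power) * 10 ** (-snr_db / 10)  # allowed removed energy
    removed, cut = 0.0, []
    for i in np.argsort(power):                    # weakest bins first
        if removed + power[i] > budget:
            break
        removed += power[i]
        cut.append(i)
    spec[cut] = 0.0
    return np.fft.irfft(spec, n=len(wave))
```

Because the perturbation only deletes perceptually weak components, the result often sounds close to the original while still degrading the transcription.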

Sample 1
Benign transcription:               IF A FELLOW'S BEEN A LITTLE BIT WILD HE'S BEELZEBUB AT ONCE
Adversarial transcription model 1:  IF THERE'S BEEN A LITTLE BIT WILD HE'S BEYOND THE BUBB AT ONCE
Adversarial transcription model 2:  IF A FELLOW'S BEEN A LITTLE BIT WILD HE'S BELLS ABOUT AT ONCE
Adversarial transcription model 3:  IF A FELLOW'S BEEN A LITTLE BIT WILD HE'S BEEN ABOUT AT ONCE

Kenansville adversarial:
[1: WER=41.67, SNRseg=22.37]
[2: WER=16.67, SNRseg=22.37]
[3: WER=16.67, SNRseg=30.53]

Sample 2
Benign transcription:               WE ARE QUITE SATISFIED NOW CAPTAIN BATTLEAX SAID MY WIFE
Adversarial transcription model 1:  WE ARE QUITE SATISFIED NOW CAPTAIN BOTTLE AXE SAID MY WIFE
Adversarial transcription model 2:  WE ARE QUITE SATISFIED NOW CATHERINES SAID MY WIFE
Adversarial transcription model 3:  WE ARE QUITE SATISFIED NOW CAPTAIN BOTTLES SAID MY WIFE

Kenansville adversarial:
[1: WER=20.00, SNRseg=18.76]
[2: WER=20.00, SNRseg=11.49]
[3: WER=10.00, SNRseg=8.92]