Reassessing Noise Augmentation Methods in the Context of Adversarial Speech

[SPSC 2024 paper] We investigate whether noise-augmented training can concurrently improve adversarial robustness in ASR systems.

This project is maintained by matiuste

Adversarial Example Demo

Supplementary material containing a selection of benign, adversarial, and noisy data employed in our paper.

For each sample, we report the word error rate (WER) as an accuracy metric and the segmental signal-to-noise ratio (SNRseg) as a noise quality metric. An SNRseg above 0 dB indicates that the signal is stronger than the noise. All samples are drawn from the LibriSpeech corpus.
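SNRseg is commonly computed as the mean of per-frame SNRs between the clean signal and the added perturbation. The following is a minimal sketch; the 512-sample frame length is an illustrative assumption (not taken from the paper), and the per-frame clamping some implementations apply is omitted.

```python
import numpy as np

def snr_seg(clean, noisy, frame_len=512, eps=1e-10):
    """Segmental SNR in dB, averaged over fixed-length frames.

    `clean` and `noisy` are equal-length 1-D float arrays; the
    frame length is a hypothetical choice, not the paper's setting.
    """
    n_frames = len(clean) // frame_len
    snrs = []
    for i in range(n_frames):
        s = clean[i * frame_len:(i + 1) * frame_len]
        # The perturbation is whatever was added on top of the clean signal.
        n = noisy[i * frame_len:(i + 1) * frame_len] - s
        snrs.append(10 * np.log10((np.sum(s ** 2) + eps) / (np.sum(n ** 2) + eps)))
    return float(np.mean(snrs))
```

A louder perturbation lowers the score; values below 0 dB mean the added noise carries more energy than the speech in the average frame.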

As outlined in our paper, we investigated three training regimes, resulting in three different models:

  1. Baseline (no augmentation): This model serves as our control, trained on the clean dataset without any form of augmentation. It establishes the standard level of performance under ideal acoustic conditions.

  2. Augmentation with speed variations: For the second model, temporal variability was introduced into the training data by applying speed perturbations. These augmentations simulate natural variations in speech tempo, which can arise from speaker differences or recording conditions.

  3. Augmentation with speed variations, background noise, and reverberation: The third model is trained with background noise and reverberation in addition to speed variations. This combination aims to mimic the more challenging and realistic acoustic environments that ASR systems encounter in real-world applications.
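The speed perturbation in regime 2 amounts to resampling the waveform. This sketch uses plain linear interpolation as a stand-in for the resampling performed by ASR toolkits; the perturbation factors commonly used (e.g. 0.9, 1.0, 1.1) are an assumption here, not necessarily the paper's settings.

```python
import numpy as np

def speed_perturb(audio, factor):
    """Change playback speed by `factor` via linear-interpolation resampling.

    factor > 1 speeds the utterance up (fewer output samples),
    factor < 1 slows it down. A simplified stand-in for toolkit resamplers.
    """
    n_out = int(round(len(audio) / factor))
    # Fractional positions in the input that each output sample reads from.
    src = np.linspace(0, len(audio) - 1, n_out)
    return np.interp(src, np.arange(len(audio)), audio)
```

In training pipelines, each utterance is typically replaced by (or expanded into) copies at a few such factors, so the model sees the same content at several tempos.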

The key takeaway from the following samples is that models trained with augmentation methods tend to be more robust against adversarial attacks. This robustness shows up in two observations:

  1. A model is more robust if it produces a higher WER when the target transcription is used as the reference: the adversarial attack is less successful at forcing the model to output the target transcription.

  2. Lower SNRseg values for the augmented models show that more adversarial noise is needed to craft an effective adversarial example against them, again indicating a higher degree of robustness.
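The WER values reported below follow the standard word-level Levenshtein computation; a minimal sketch (not the exact scoring tool used for the paper):

```python
def wer(reference, hypothesis):
    """Word error rate in percent:
    (substitutions + deletions + insertions) / reference length * 100,
    via standard Levenshtein distance over whitespace-split word tokens."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,      # deletion
                           dp[i][j - 1] + 1,      # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return 100.0 * dp[len(ref)][len(hyp)] / len(ref)
```

For example, scoring model 3's adversarial transcription from Sample 1 below against the target transcription gives `wer("LOOK AT THAT HE HELD OUT HIS HAND", "LOOK AT THAT HE'LL BE OUT OF IT")` → 50.0, matching the reported value.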

Please note that the adversarial samples are crafted per model. Below, we report the WER and SNRseg of the adversarial sample for each model, using the seq2seq architecture described in our paper. However, we only play the adversarial example generated against model 3 (augmentation with speed variations, background noise, and reverberation).

Experiments for the Carlini & Wagner (C&W) attack.

Sample 1
Benign transcription:               THEY OF COURSE MUST ALL BE ALTERED
Target Adversarial transcription:   LOOK AT THAT HE HELD OUT HIS HAND
Adversarial transcription model 1:  LOOK AT THAT HE HELD OUT HIS HAND
Adversarial transcription model 2:  LOOK AT THAT HE HELD OUT HIS HAND
Adversarial transcription model 3:  LOOK AT THAT HE'LL BE OUT OF IT

  Benign:                    Benign + Noise: SNRseg= -4.81
[1: WER=00.00],                 [1: WER=71.43]
[2: WER=00.00],                 [2: WER=57.14]
[3: WER=00.00],                 [3: WER=14.29]

C&W adversarial:
[1: WER=00.00, SNRseg=24.47]
[2: WER=00.00, SNRseg=18.84]
[3: WER=50.00, SNRseg=15.93]

Sample 2
Benign transcription:               TO THEIR SORROW THEY WERE SOON UNDECEIVED
Target Adversarial transcription:   ONE COULD HARDLY HOPE FOR ANY UPON SO DRY A DAY
Adversarial transcription model 1:  ONE COULD HARDLY HOPE FOR ANY UPON SO DRY A DAY
Adversarial transcription model 2:  ONE COULD HARDLY HOPE FOR ANY SLEEP A DAY
Adversarial transcription model 3:  ONE COULD HARDLY HOPE FOR ANY ONE WHO HAD

  Benign:                    Benign + Noise: SNRseg= 1.34
[1: WER=00.00],                 [1: WER=57.14]
[2: WER=00.00],                 [2: WER=00.00]
[3: WER=00.00],                 [3: WER=00.00]

C&W adversarial:
[1: WER=00.00, SNRseg=22.20]
[2: WER=27.27, SNRseg=17.04]
[3: WER=45.45, SNRseg=04.13]

Sample 3
Benign transcription:               BUT YOU KNOW MORE ABOUT THAT THAN I DO SIR
Target Adversarial transcription:   YES MY DEAR WATSON I HAVE SOLVED THE MYSTERY
Adversarial transcription model 1:  YES MY DEAR WATSON I HAVE SOLVED THE MYSTERY
Adversarial transcription model 2:  YES MY DEAR ROGER  I HAVE THE MYSTERY
Adversarial transcription model 3:  YES MY DEAR CHILD I AM AFRAID OF THE ELEMENTS

  Benign:                    Benign + Noise: SNRseg= -0.41
[1: WER=00.00],                 [1: WER=30.00]
[2: WER=00.00],                 [2: WER=10.00]
[3: WER=00.00],                 [3: WER=10.00]

C&W adversarial:
[1: WER=00.00, SNRseg=25.78]
[2: WER=22.22, SNRseg=23.94]
[3: WER=55.56, SNRseg=22.30]