Adversarial Example Demo

Supplementary material containing a selection of benign, adversarial, and noisy data employed in our paper.

For each sample, we report the word error rate (WER) as an accuracy metric and the segmental signal-to-noise ratio (SNR_seg) as a measure of the noise level. An SNR_seg above 0 dB means that the signal is stronger than the noise. All samples are taken from the LibriSpeech corpus.

As outlined in our paper, we investigated three training regimes, resulting in three different models:

1. Baseline (no augmentation): This model is our control, trained on the clean dataset without any form of augmentation. It establishes the reference level of performance under ideal acoustic conditions.
2. Augmentation with speed variations: The second model is trained on data with speed perturbations, which introduce temporal variability. These augmentations simulate natural variations in speech tempo, which can occur due to speaker differences or recording conditions.
3. Augmentation with speed variations, background noise, and reverberation: The third model is trained with background noise and reverberation in addition to speed variations. This combination aims to mimic the more challenging and realistic acoustic environments that ASR systems may encounter in real-world applications.
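
As a rough illustration of the speed variations used for the second and third models, the sketch below resamples a waveform with linear interpolation. It is not the augmentation pipeline from our paper: the function name speed_perturb and the example factors are assumptions chosen for illustration. Note that this simple form of resampling changes pitch together with tempo.

```python
def speed_perturb(samples, factor):
    """Resample `samples` so playback is `factor` times faster.

    Hypothetical helper using linear interpolation; factor > 1 shortens
    the signal, factor < 1 lengthens it.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    if not samples:
        return []
    last = len(samples) - 1
    n_out = int(len(samples) / factor)
    out = []
    for i in range(n_out):
        pos = i * factor                 # fractional read position in the input
        left = min(int(pos), last)       # clamp to guard against rounding
        right = min(left + 1, last)
        frac = pos - left
        out.append((1.0 - frac) * samples[left] + frac * samples[right])
    return out


waveform = [0.0, 2.0, 4.0, 6.0]
faster = speed_perturb(waveform, 2.0)    # half as many samples
slower = speed_perturb(waveform, 0.5)    # twice as many samples
print(faster)  # [0.0, 4.0]
print(slower)  # [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 6.0]
```

In practice, factors close to 1 (for example 0.9 and 1.1, an assumption here) keep the speech natural while still varying its tempo.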

The key takeaway from the following samples is that models trained with augmentation methods tend to be more robust against adversarial attacks. This robustness is demonstrated by two main observations:

Higher WER for C&W targeted white-box attack: A model is considered more robust if it produces a higher WER when using the target transcription as the reference. This means that the adversarial attack is less successful in forcing the model to recognize the target transcription.
Lower WER for Alzantot and Kenansville untargeted black-box attacks: A model is considered more robust if it produces a lower WER when using the original ground truth as the reference. This indicates that the adversarial attack is less effective at distorting the recognition of the original audio.
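
To make the two readings of WER concrete, here is a minimal sketch of a word-level edit-distance WER. The function name word_error_rate is an assumption for illustration and not the scoring tool used in our paper; the transcriptions are taken from the samples listed on this page.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return the WER in percent: word-level edit distance / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# Targeted attack (C&W, sample 1, model 3): the reference is the TARGET
# transcription, so a higher WER means the attack was less successful.
target = "LOOK AT THAT HE HELD OUT HIS HAND"
cw_output = "LOOK AT THAT HE'LL BE OUT OF IT"
print(round(word_error_rate(target, cw_output), 2))  # 50.0

# Untargeted attack (Alzantot, sample 1, model 3): the reference is the
# ORIGINAL ground truth, so a lower WER means the attack did less damage.
ground_truth = "WHY FADE THESE CHILDREN OF THE SPRING"
alzantot_output = "DON'T FEED THESE CHILDREN OF THE SPRING"
print(round(word_error_rate(ground_truth, alzantot_output), 2))  # 28.57
```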

Additionally, lower SNR_seg values for models trained with augmentation suggest that an attack must add more noise to create effective adversarial examples against these models, indicating a higher degree of robustness.
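
The following sketch shows one common way to compute a segmental SNR: the signal-to-noise ratio in dB is computed per frame and then averaged over frames. The function name segmental_snr, the frame length of 400 samples (25 ms at 16 kHz), and the eps stabiliser are assumptions for illustration; the exact framing used in our paper may differ.

```python
import math


def segmental_snr(clean, perturbed, frame_len=400, eps=1e-10):
    """Average per-frame SNR in dB between a clean signal and its perturbed copy."""
    if len(clean) != len(perturbed):
        raise ValueError("signals must have the same length")
    frame_snrs = []
    # Only full, non-overlapping frames are used; a trailing partial frame is dropped.
    for start in range(0, len(clean) - frame_len + 1, frame_len):
        x = clean[start:start + frame_len]
        y = perturbed[start:start + frame_len]
        signal_energy = sum(a * a for a in x)
        noise_energy = sum((b - a) ** 2 for a, b in zip(x, y))
        frame_snrs.append(10.0 * math.log10((signal_energy + eps) / (noise_energy + eps)))
    if not frame_snrs:
        raise ValueError("signal is shorter than one frame")
    return sum(frame_snrs) / len(frame_snrs)


# Synthetic example: a 440 Hz tone sampled at 16 kHz (not one of the demo samples).
clean = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(1600)]
quiet_noise = [x + 0.1 * x for x in clean]   # noise amplitude is 10x smaller
equal_noise = [x + x for x in clean]         # noise as strong as the signal
print(round(segmental_snr(clean, quiet_noise), 2))  # 20.0
print(round(segmental_snr(clean, equal_noise), 2))  # 0.0
```

The second case illustrates the 0 dB boundary: at 0 dB the noise carries as much energy as the signal, and any value below that means the noise dominates.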

Please note that the adversarial samples are crafted per model. In the following, we report the WER and SNR_seg for the adversarial sample for each model, using the seq2seq model architecture described in our paper. However, we only play the adversarial example generated with respect to model 3 (augmentation with speed variations, background noise, and reverberation).

Experiments for the C&W attack.

Sample 1

Benign transcription:               THEY OF COURSE MUST ALL BE ALTERED
Target Adversarial transcription:   LOOK AT THAT HE HELD OUT HIS HAND
Adversarial transcription model 1:  LOOK AT THAT HE HELD OUT HIS HAND
Adversarial transcription model 2:  LOOK AT THAT HE HELD OUT HIS HAND
Adversarial transcription model 3:  LOOK AT THAT HE'LL BE OUT OF IT

Benign:          Benign + Noise (SNR_seg=-4.81):
[1: WER=00.00]   [1: WER=71.43]
[2: WER=00.00]   [2: WER=57.14]
[3: WER=00.00]   [3: WER=14.29]

C&W adversarial:
[1: WER=00.00, SNR_seg=24.47]
[2: WER=00.00, SNR_seg=18.84]
[3: WER=50.00, SNR_seg=15.93]

Sample 2

Benign transcription:               TO THEIR SORROW THEY WERE SOON UNDECEIVED
Target Adversarial transcription:   ONE COULD HARDLY HOPE FOR ANY UPON SO DRY A DAY
Adversarial transcription model 1:  ONE COULD HARDLY HOPE FOR ANY UPON SO DRY A DAY
Adversarial transcription model 2:  ONE COULD HARDLY HOPE FOR ANY SLEEP A DAY
Adversarial transcription model 3:  ONE COULD HARDLY HOPE FOR ANY ONE WHO HAD

Benign:          Benign + Noise (SNR_seg=1.34):
[1: WER=00.00]   [1: WER=57.14]
[2: WER=00.00]   [2: WER=00.00]
[3: WER=00.00]   [3: WER=00.00]

C&W adversarial:
[1: WER=00.00, SNR_seg=22.20]
[2: WER=27.27, SNR_seg=17.04]
[3: WER=45.45, SNR_seg=04.13]

Sample 3

Benign transcription:               BUT YOU KNOW MORE ABOUT THAT THAN I DO SIR
Target Adversarial transcription:   YES MY DEAR WATSON I HAVE SOLVED THE MYSTERY
Adversarial transcription model 1:  YES MY DEAR WATSON I HAVE SOLVED THE MYSTERY
Adversarial transcription model 2:  YES MY DEAR ROGER I HAVE THE MYSTERY
Adversarial transcription model 3:  YES MY DEAR CHILD I AM AFRAID OF THE ELEMENTS

Benign:          Benign + Noise (SNR_seg=-0.41):
[1: WER=00.00]   [1: WER=30.00]
[2: WER=00.00]   [2: WER=10.00]
[3: WER=00.00]   [3: WER=10.00]

C&W adversarial:
[1: WER=00.00, SNR_seg=25.78]
[2: WER=22.22, SNR_seg=23.94]
[3: WER=55.56, SNR_seg=22.30]

Experiments for the Alzantot attack.

Sample 1

Benign transcription:               WHY FADE THESE CHILDREN OF THE SPRING
Adversarial transcription model 1:  LIFE THEY THESE CHILDREN OF THIS EARTH
Adversarial transcription model 2:  WHY FAITH THESE CHILDREN OF THIS FREELY
Adversarial transcription model 3:  DON'T FEED THESE CHILDREN OF THE SPRING

Benign: Alzantot adversarial:

[1: WER=57.14, SNR_seg=13.53]
[2: WER=42.86, SNR_seg=13.54]
[3: WER=28.57, SNR_seg=13.53]

Sample 2

Benign transcription:               THE CLOUD THEN SHEWD HIS GOLDEN HEAD AND HIS BRIGHT FORM EMERG'D
Adversarial transcription model 1:  THE CLOUD THEN SHOWED HE SCOLDED IN HIS RIGHT FORM ALONE
Adversarial transcription model 2:  THE CLOUD THEN SHOWED HE'S GOLDEN IN HIS RIGHT FORMERLY
Adversarial transcription model 3:  THE CLOUDS THEN SHEWED HIS GOLDEN HEAD AND HIS BRIGHT FOREARMS

Benign: Alzantot adversarial:

[1: WER=58.33, SNR_seg=11.57]
[2: WER=58.33, SNR_seg=11.58]
[3: WER=33.33, SNR_seg=11.56]

Experiments for the Kenansville attack.

Sample 1

Benign transcription:               IF A FELLOW'S BEEN A LITTLE BIT WILD HE'S BEELZEBUB AT ONCE
Adversarial transcription model 1:  IF THERE'S BEEN A LITTLE BIT WILD HE'S BEYOND THE BUBB AT ONCE
Adversarial transcription model 2:  IF A FELLOW'S BEEN A LITTLE BIT WILD HE'S BELLS ABOUT AT ONCE
Adversarial transcription model 3:  IF A FELLOW'S BEEN A LITTLE BIT WILD HE'S BEEN ABOUT AT ONCE

Benign: Kenansville adversarial:

[1: WER=41.67, SNR_seg=22.37]
[2: WER=16.67, SNR_seg=22.37]
[3: WER=16.67, SNR_seg=30.53]

Sample 2

Benign transcription:               WE ARE QUITE SATISFIED NOW CAPTAIN BATTLEAX SAID MY WIFE
Adversarial transcription model 1:  WE ARE QUITE SATISFIED NOW CAPTAIN BOTTLE AXE SAID MY WIFE
Adversarial transcription model 2:  WE ARE QUITE SATISFIED NOW CATHERINES SAID MY WIFE
Adversarial transcription model 3:  WE ARE QUITE SATISFIED NOW CAPTAIN BOTTLES SAID MY WIFE

Benign: Kenansville adversarial:

[1: WER=20.00, SNR_seg=18.76]
[2: WER=20.00, SNR_seg=11.49]
[3: WER=10.00, SNR_seg=08.92]