Reassessing Noise Augmentation Methods in the Context of Adversarial Speech
This project is maintained by matiuste
Supplementary material containing a selection of benign, adversarial, and noisy data employed in our paper.
For each sample, we report the word error rate (WER, in percent) as an accuracy metric and the segmental signal-to-noise ratio (SNRseg, in dB) as a noise quality metric. An SNRseg above 0 dB indicates that the signal is stronger than the noise. All samples are drawn from the LibriSpeech corpus.
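As an illustration of the SNRseg metric, the following is a minimal sketch and not the exact implementation used in the paper. It assumes a common definition: the signal is split into non-overlapping frames, the SNR in dB is computed per frame, and the frame values are averaged. The function name `snr_seg`, the frame length of 320 samples, and the `eps` stabilizer are assumptions made for this example.

```python
import math
from typing import Sequence


def snr_seg(clean: Sequence[float], perturbed: Sequence[float],
            frame_len: int = 320, eps: float = 1e-12) -> float:
    """Mean per-frame SNR in dB between a clean signal and a perturbed copy.

    frame_len and eps are illustrative assumptions, not values from the paper.
    """
    if len(clean) != len(perturbed):
        raise ValueError("signals must have the same length")
    frame_snrs = []
    # Walk over full, non-overlapping frames; a trailing partial frame is ignored.
    for start in range(0, len(clean) - frame_len + 1, frame_len):
        signal_energy = 0.0
        noise_energy = 0.0
        for i in range(start, start + frame_len):
            noise = perturbed[i] - clean[i]
            signal_energy += clean[i] ** 2
            noise_energy += noise ** 2
        ratio = (signal_energy + eps) / (noise_energy + eps)
        frame_snrs.append(10.0 * math.log10(ratio))
    if not frame_snrs:
        raise ValueError("signal is shorter than one frame")
    return sum(frame_snrs) / len(frame_snrs)


# Synthetic one-second tone at 16 kHz (not data from the paper).
clean = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
quiet = [1.1 * x for x in clean]  # perturbation amplitude is 10% of the signal
loud = [3.0 * x for x in clean]   # perturbation amplitude is twice the signal

print(round(snr_seg(clean, quiet), 2))  # positive: signal dominates the noise
print(round(snr_seg(clean, loud), 2))   # negative: noise dominates the signal
```

With this definition, a quieter perturbation yields a higher SNRseg, which is why a lower SNRseg for an adversarial sample means that more noise had to be added.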
As outlined in our paper, we investigated three training regimes, resulting in three different models:
Baseline (no augmentation): This model is our control, trained on the clean dataset without any form of augmentation. It establishes the reference level of performance under ideal acoustic conditions.
Augmentation with speed variations: The second model is trained on data to which speed perturbations have been applied, introducing temporal variability. These augmentations simulate natural variations in speech tempo, which can occur due to speaker differences or recording conditions.
Augmentation with speed variations, background noise, and reverberation: The third model is trained with background noise and reverberation in addition to speed variations. This combination aims to mimic the more challenging and realistic acoustic environments that ASR systems may encounter in real-world applications.
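The three augmentation types named above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the training pipeline from the paper: the function names `speed_perturb`, `add_noise`, and `reverberate` are hypothetical, speed perturbation is approximated by linear-interpolation resampling, and the perturbation factors, target SNR, and impulse response are placeholder values.

```python
import math
import random
from typing import List, Sequence


def speed_perturb(samples: Sequence[float], factor: float) -> List[float]:
    """Resample by linear interpolation; factor > 1 speeds speech up (shorter output)."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    out_len = int(len(samples) / factor)
    out = []
    for n in range(out_len):
        pos = n * factor                      # fractional read position in the input
        lo = min(int(pos), len(samples) - 1)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append((1.0 - frac) * samples[lo] + frac * samples[hi])
    return out


def add_noise(samples: Sequence[float], noise: Sequence[float], snr_db: float) -> List[float]:
    """Mix noise into samples, scaled so the overall SNR equals snr_db."""
    # Loop the noise recording if it is shorter than the speech.
    tiled = [noise[i % len(noise)] for i in range(len(samples))]
    signal_power = sum(x * x for x in samples) / len(samples)
    noise_power = sum(v * v for v in tiled) / len(tiled)
    if noise_power == 0.0:
        raise ValueError("noise must not be silent")
    scale = math.sqrt(signal_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return [x + scale * v for x, v in zip(samples, tiled)]


def reverberate(samples: Sequence[float], impulse_response: Sequence[float]) -> List[float]:
    """Convolve with a room impulse response, truncated to the input length."""
    out = []
    for n in range(len(samples)):
        acc = 0.0
        for k, h in enumerate(impulse_response):
            if n - k < 0:
                break
            acc += h * samples[n - k]
        out.append(acc)
    return out


# Synthetic inputs (placeholders, not data from the paper).
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 220 * n / 16000) for n in range(1000)]
background = [rng.uniform(-1.0, 1.0) for _ in range(300)]
toy_impulse_response = [1.0, 0.0, 0.3, 0.0, 0.1]  # direct path plus two echoes

faster = speed_perturb(speech, 1.1)                  # assumed factor
noisy = add_noise(faster, background, snr_db=10.0)   # assumed target SNR
augmented = reverberate(noisy, toy_impulse_response)
print(len(speech), len(faster), len(augmented))
```

In this sketch the second regime corresponds to `speed_perturb` alone, and the third regime chains all three functions. The order of operations is a choice made for the example.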
The key takeaway from the following samples is that models trained with augmentation methods tend to be more robust against adversarial attacks. This robustness is demonstrated by two main observations:
Higher WER for C&W targeted white-box attack: A model is considered more robust if it produces a higher WER when using the target transcription as the reference. This means that the adversarial attack is less successful in forcing the model to recognize the target transcription.
Lower WER for Alzantot and Kenansville untargeted black-box attacks: A model is considered more robust if it produces a lower WER when using the original ground truth as the reference. This indicates that the adversarial attack is less effective at distorting the recognition of the original audio.
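The two readings of WER above differ only in the reference transcription. The sketch below assumes the standard definition of WER (word-level edit distance divided by the number of reference words, in percent). The function name `wer` is chosen for this example, and the transcriptions are copied from the samples listed on this page.

```python
from typing import List


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent: word-level edit distance / reference length."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return 100.0 * prev[-1] / len(ref)


# Targeted attack (C&W): the reference is the TARGET transcription.
# A higher WER means the attack failed to impose the target.
target = "LOOK AT THAT HE HELD OUT HIS HAND"
cw_model1 = "LOOK AT THAT HE HELD OUT HIS HAND"
cw_model3 = "LOOK AT THAT HE'LL BE OUT OF IT"
print(f"{wer(target, cw_model1):.2f}")  # 0.00
print(f"{wer(target, cw_model3):.2f}")  # 50.00

# Untargeted attack (Alzantot): the reference is the GROUND TRUTH.
# A lower WER means the attack distorted the recognition less.
truth = "WHY FADE THESE CHILDREN OF THE SPRING"
alz_model1 = "LIFE THEY THESE CHILDREN OF THIS EARTH"
alz_model3 = "DON'T FEED THESE CHILDREN OF THE SPRING"
print(f"{wer(truth, alz_model1):.2f}")  # 57.14
print(f"{wer(truth, alz_model3):.2f}")  # 28.57
```

These values match the WER figures reported for the corresponding samples.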
Additionally, lower SNRseg values for models trained with augmentation suggest that these models require more noise to create effective adversarial examples, indicating a higher degree of robustness.
Please note that the adversarial samples are crafted per model. In the following, we report the WER and SNRseg for the adversarial sample for each model, using the seq2seq model architecture described in our paper. However, we only play the adversarial example generated with respect to model 3 (augmentation with speed variations, background noise, and reverberation).
Benign transcription: THEY OF COURSE MUST ALL BE ALTERED
Target adversarial transcription: LOOK AT THAT HE HELD OUT HIS HAND
Adversarial transcription model 1: LOOK AT THAT HE HELD OUT HIS HAND
Adversarial transcription model 2: LOOK AT THAT HE HELD OUT HIS HAND
Adversarial transcription model 3: LOOK AT THAT HE'LL BE OUT OF IT
Benign (left) and Benign + Noise (right, SNRseg = -4.81 dB):
[1: WER=00.00], [1: WER=71.43]
[2: WER=00.00], [2: WER=57.14]
[3: WER=00.00], [3: WER=14.29]
C&W adversarial:
[1: WER=00.00, SNRseg=24.47]
[2: WER=00.00, SNRseg=18.84]
[3: WER=50.00, SNRseg=15.93]
Benign transcription: TO THEIR SORROW THEY WERE SOON UNDECEIVED
Target adversarial transcription: ONE COULD HARDLY HOPE FOR ANY UPON SO DRY A DAY
Adversarial transcription model 1: ONE COULD HARDLY HOPE FOR ANY UPON SO DRY A DAY
Adversarial transcription model 2: ONE COULD HARDLY HOPE FOR ANY SLEEP A DAY
Adversarial transcription model 3: ONE COULD HARDLY HOPE FOR ANY ONE WHO HAD
Benign (left) and Benign + Noise (right, SNRseg = 1.34 dB):
[1: WER=00.00], [1: WER=57.14]
[2: WER=00.00], [2: WER=00.00]
[3: WER=00.00], [3: WER=00.00]
C&W adversarial:
[1: WER=00.00, SNRseg=22.20]
[2: WER=27.27, SNRseg=17.04]
[3: WER=45.45, SNRseg=04.13]
Benign transcription: BUT YOU KNOW MORE ABOUT THAT THAN I DO SIR
Target adversarial transcription: YES MY DEAR WATSON I HAVE SOLVED THE MYSTERY
Adversarial transcription model 1: YES MY DEAR WATSON I HAVE SOLVED THE MYSTERY
Adversarial transcription model 2: YES MY DEAR ROGER I HAVE THE MYSTERY
Adversarial transcription model 3: YES MY DEAR CHILD I AM AFRAID OF THE ELEMENTS
Benign (left) and Benign + Noise (right, SNRseg = -0.41 dB):
[1: WER=00.00], [1: WER=30.00]
[2: WER=00.00], [2: WER=10.00]
[3: WER=00.00], [3: WER=10.00]
C&W adversarial:
[1: WER=00.00, SNRseg=25.78]
[2: WER=22.22, SNRseg=23.94]
[3: WER=55.56, SNRseg=22.30]
Benign transcription: WHY FADE THESE CHILDREN OF THE SPRING
Adversarial transcription model 1: LIFE THEY THESE CHILDREN OF THIS EARTH
Adversarial transcription model 2: WHY FAITH THESE CHILDREN OF THIS FREELY
Adversarial transcription model 3: DON'T FEED THESE CHILDREN OF THE SPRING
Benign: Alzantot adversarial:
[1: WER=57.14, SNRseg=13.53]
[2: WER=42.86, SNRseg=13.54]
[3: WER=28.57, SNRseg=13.53]
Benign transcription: THE CLOUD THEN SHEWD HIS GOLDEN HEAD AND HIS BRIGHT FORM EMERG'D
Adversarial transcription model 1: THE CLOUD THEN SHOWED HE SCOLDED IN HIS RIGHT FORM ALONE
Adversarial transcription model 2: THE CLOUD THEN SHOWED HE'S GOLDEN IN HIS RIGHT FORMERLY
Adversarial transcription model 3: THE CLOUDS THEN SHEWED HIS GOLDEN HEAD AND HIS BRIGHT FOREARMS
Benign: Alzantot adversarial:
[1: WER=58.33, SNRseg=11.57]
[2: WER=58.33, SNRseg=11.58]
[3: WER=33.33, SNRseg=11.56]
Benign transcription: IF A FELLOW'S BEEN A LITTLE BIT WILD HE'S BEELZEBUB AT ONCE
Adversarial transcription model 1: IF THERE'S BEEN A LITTLE BIT WILD HE'S BEYOND THE BUBB AT ONCE
Adversarial transcription model 2: IF A FELLOW'S BEEN A LITTLE BIT WILD HE'S BELLS ABOUT AT ONCE
Adversarial transcription model 3: IF A FELLOW'S BEEN A LITTLE BIT WILD HE'S BEEN ABOUT AT ONCE
Benign: Kenansville adversarial:
[1: WER=41.67, SNRseg=22.37]
[2: WER=16.67, SNRseg=22.37]
[3: WER=16.67, SNRseg=30.53]
Benign transcription: WE ARE QUITE SATISFIED NOW CAPTAIN BATTLEAX SAID MY WIFE
Adversarial transcription model 1: WE ARE QUITE SATISFIED NOW CAPTAIN BOTTLE AXE SAID MY WIFE
Adversarial transcription model 2: WE ARE QUITE SATISFIED NOW CATHERINES SAID MY WIFE
Adversarial transcription model 3: WE ARE QUITE SATISFIED NOW CAPTAIN BOTTLES SAID MY WIFE
Benign: Kenansville adversarial:
[1: WER=20.00, SNRseg=18.76]
[2: WER=20.00, SNRseg=11.49]
[3: WER=10.00, SNRseg=08.92]