AdvGLUE

The Adversarial GLUE Benchmark

What is AdvGLUE?

Adversarial GLUE (AdvGLUE) is a comprehensive benchmark for evaluating the adversarial robustness of language models. It covers five natural language understanding tasks from the well-known GLUE benchmark and serves as an adversarial counterpart to GLUE.


AdvGLUE considers textual adversarial attacks from different perspectives and hierarchies, including word-level transformations, sentence-level manipulations, and human-written adversarial examples, which provide comprehensive coverage of various adversarial linguistic phenomena.
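As a rough illustration of the word-level end of this spectrum, the sketch below applies a toy typo-style perturbation (an adjacent-character swap) to a sentence. It is a generic example of that attack family only, not the actual generation-and-filtering pipeline used to build AdvGLUE, and the function names are invented for this illustration.

```python
# Toy word-level perturbation: swap two adjacent inner characters of some words.
# This is an illustrative sketch of typo-style attacks, not AdvGLUE's method.
import random

def swap_typo(word: str, rng: random.Random) -> str:
    """Swap two adjacent inner characters of a word to simulate a typo."""
    if len(word) < 4:
        return word
    i = rng.randrange(1, len(word) - 2)  # keep first and last characters intact
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def perturb_sentence(sentence: str, rate: float = 0.2, seed: int = 0) -> str:
    """Perturb roughly `rate` of the words in a sentence."""
    rng = random.Random(seed)
    words = sentence.split()
    return " ".join(swap_typo(w, rng) if rng.random() < rate else w for w in words)

print(perturb_sentence("the movie was surprisingly good and well acted"))
```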

Explore Statistics and Examples of AdvGLUE

AdvGLUE paper

The quality of the AdvGLUE benchmark is validated by human annotators: each adversarial example in the AdvGLUE dataset shows high agreement among them. To make sure the annotators fully understand the GLUE tasks, each worker is required to pass a training step before qualifying for the main filtering tasks on the generated adversarial examples.

Explore Human Evaluation

News [1/25/2024] We are excited to release our test set with detailed annotations. Please check out our test dataset here. Note that we also include the benign GLUE dev set, with the method field labeled as glue.

News [3/20/2023] We have included an additional dev set containing detailed annotations of our adversarial dataset, including the adversarial attack method, the corresponding benign examples, and more. Please check out our new dev dataset here.
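For readers who want to explore these annotations, here is a minimal sketch of how one might tally the attack methods per task. It assumes the annotated file is a JSON dict mapping task names to lists of example dicts carrying a method field (as mentioned above); the filename and exact keys are placeholders and may differ from the released files.

```python
# Minimal sketch: count attack methods and benign (method == "glue") examples per task.
# File path and field names are assumptions, not the official schema.
import json
from collections import Counter

with open("dev_ann.json") as f:  # placeholder path to the annotated dev/test file
    data = json.load(f)

for task, examples in data.items():
    methods = Counter(ex.get("method", "unknown") for ex in examples)
    benign = methods.get("glue", 0)  # benign GLUE examples are labeled "glue"
    print(f"{task}: {len(examples)} examples, {benign} benign, methods={dict(methods)}")
```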

Getting Started

We have built a few resources to help you get started with the dataset.

Download a copy of the dataset (distributed under the CC BY-SA 4.0 license):

To evaluate your models, we have also made available the evaluation script we will use for official evaluation, along with a sample prediction file that the script will take as input. To run the evaluation, use python evaluate.py <path_to_dev> <path_to_predictions>.
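For reference, the sketch below approximates what such a scorer typically computes for AdvGLUE: per-task accuracy and its macro-average across tasks. The assumed file layouts (dev file: task name to a list of dicts with "idx" and "label"; predictions: task name to an idx-to-label mapping) are guesses rather than the official format, so use the released evaluate.py for actual results.

```python
# Hedged approximation of an AdvGLUE scorer: per-task accuracy plus the macro-average.
# The JSON layouts below are assumptions; the official evaluate.py may differ.
import json
import sys

def score(dev_path: str, pred_path: str) -> None:
    with open(dev_path) as f:
        dev = json.load(f)      # assumed: {task: [{"idx": ..., "label": ...}, ...]}
    with open(pred_path) as f:
        preds = json.load(f)    # assumed: {task: {"<idx>": predicted_label, ...}}

    accs = {}
    for task, examples in dev.items():
        task_preds = preds.get(task, {})
        correct = sum(
            1 for ex in examples
            if task_preds.get(str(ex["idx"])) == ex["label"]
        )
        accs[task] = correct / max(len(examples), 1)
        print(f"{task}: acc = {accs[task]:.4f}")

    print(f"average: {sum(accs.values()) / len(accs):.4f}")

if __name__ == "__main__":
    score(sys.argv[1], sys.argv[2])
```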

Once you have built a model that works to your expectations on the dev set, you can submit it to get official scores on the dev set and a hidden test set. To preserve the integrity of test results, we do not release the test set to the public. Instead, we require you to submit your model so that we can run it on the test set for you. Here's a tutorial walking you through official evaluation of your model:

Submission Tutorial

Because AdvGLUE is an ongoing effort, we expect the dataset to evolve.

Have Questions?

Ask us questions at boxinw2@illinois.edu and xuchejian@zju.edu.cn.

Acknowledgements

We thank the SQuAD team for allowing us to use their website template and submission tutorials.

Leaderboard

AdvGLUE is an adversarial robustness evaluation benchmark that thoroughly tests and analyzes the vulnerabilities of natural language understanding systems to different adversarial transformations.

Rank | Date         | Model                        | Institution | Score
1    | Jul 24, 2022 | TBD-name (single)            | CASIA       | 0.6545
2    | Mar 31, 2022 | CreAT (single model)         | SJTU        | 0.6249
3    | Aug 29, 2021 | DeBERTa (single model)       | UIUC        | 0.6086
4    | Aug 29, 2021 | ALBERT (single model)        | UIUC        | 0.5922
5    | Aug 29, 2021 | T5 (single model)            | UIUC        | 0.5682
6    | Aug 29, 2021 | SMART_RoBERTa (single model) | UIUC        | 0.5371
7    | Aug 29, 2021 | FreeLB (single model)        | UIUC        | 0.5048
8    | Aug 29, 2021 | RoBERTa (single model)       | UIUC        | 0.5021
9    | Aug 29, 2021 | InfoBERT (single model)      | UIUC        | 0.4603
10   | Aug 29, 2021 | ELECTRA (single model)       | UIUC        | 0.4169
11   | Aug 29, 2021 | BERT (single model)          | UIUC        | 0.3369
12   | Aug 29, 2021 | SMART_BERT (single model)    | UIUC        | 0.3029