Adversarial GLUE (AdvGLUE) is a comprehensive benchmark for evaluating the adversarial robustness of language models. It covers five natural language understanding tasks from the GLUE benchmark and serves as an adversarial version of GLUE.
AdvGLUE considers textual adversarial attacks from different perspectives and hierarchies, including word-level transformations, sentence-level manipulations, and human-written adversarial examples, providing comprehensive coverage of diverse adversarial linguistic phenomena.
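For a quick look at what these perturbed examples look like, one convenient (unofficial) route is the copy of the benchmark hosted on the Hugging Face Hub. The dataset id adv_glue, the per-task configuration adv_sst2, and the field names below reflect that mirror and are not part of the official release, so treat this as a sketch:

```python
# Sketch: browsing AdvGLUE dev examples via the Hugging Face Hub mirror.
# The dataset id "adv_glue", config "adv_sst2", and column names are assumptions
# based on the Hub mirror, not on the official download links on this page.
from datasets import load_dataset

dev = load_dataset("adv_glue", "adv_sst2", split="validation")

print(dev.column_names)              # e.g. ['sentence', 'label', 'idx']
for example in dev.select(range(3)):
    # Each record pairs an adversarially perturbed SST-2 sentence with its label.
    print(example["label"], example["sentence"])
```

The official dev and test files linked below remain the authoritative copies for evaluation and submission.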
The quality of the AdvGLUE benchmark is validated by human annotators, and each adversarial example in the dataset reaches high inter-annotator agreement. To make sure the annotators fully understand the GLUE tasks, each worker is required to pass a training step before qualifying to work on the main filtering tasks for the generated adversarial examples.
News [1/25/2024] We are excited to release our test set with detailed annotations. Please check out our test dataset here. Note that we also include the benign GLUE dev set, with the method labeled as glue.
News [3/20/2023] We included an additional dev set containing detailed annotations of our adversarial dataset, including the adversarial attack method, the corresponding benign examples, and more. Please check out our new dev dataset here.
We have built a few resources to help you get started with the dataset.
Download a copy of the dataset (distributed under the CC BY-SA 4.0 license):
To evaluate your models, we have also made available the evaluation script we will use for official evaluation, along with a sample prediction file that the script will take as input. To run the evaluation, use:
python evaluate.py <path_to_dev> <path_to_predictions>
(see the sketch below for one way to produce a prediction file).
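The released sample prediction file defines the exact input format; as a rough sketch only, the snippet below assumes predictions are gathered into a single JSON object keyed by task name, with one predicted label id per dev example. The task keys, label encoding, and file names here are hypothetical, not official.

```python
# Sketch: writing a prediction file for evaluate.py, assuming (not confirmed by
# the official sample file) a JSON object that maps each task name to a list of
# predicted label ids in dev-set order.
import json

predictions = {
    "sst2": [1, 0, 1],   # hypothetical labels for three dev examples
    "qqp":  [0, 1],
    "mnli": [2, 0, 1],
    # ... remaining AdvGLUE tasks
}

with open("predictions.json", "w") as f:
    json.dump(predictions, f)

# Then score against the released dev file, e.g.:
#   python evaluate.py dev.json predictions.json
```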
Once you have built a model that works to your expectations on the dev set, you can submit it to get official scores on the dev set and a hidden test set. To preserve the integrity of test results, we do not release the test set to the public. Instead, we require you to submit your model so that we can run it on the test set for you. Here's a tutorial walking you through official evaluation of your model:
Submission Tutorial
Because AdvGLUE is an ongoing effort, we expect the dataset to evolve.
Ask us questions at boxinw2@illinois.edu and xuchejian@zju.edu.cn.
We thank the SQuAD team for allowing us to use their website template and submission tutorials.
AdvGLUE is an adversarial robustness evaluation benchmark that thoroughly tests and analyzes the vulnerabilities of natural language understanding systems to different adversarial transformations.
| Rank | Date | Model | Institution | Score |
|---|---|---|---|---|
| 1 | Jul 24, 2022 | TBD-name (single) | CASIA | 0.6545 |
| 2 | Mar 31, 2022 | CreAT (single model) | SJTU | 0.6249 |
| 3 | Aug 29, 2021 | DeBERTa (single model) | UIUC | 0.6086 |
| 4 | Aug 29, 2021 | ALBERT (single model) | UIUC | 0.5922 |
| 5 | Aug 29, 2021 | T5 (single model) | UIUC | 0.5682 |
| 6 | Aug 29, 2021 | SMART_RoBERTa (single model) | UIUC | 0.5371 |
| 7 | Aug 29, 2021 | FreeLB (single model) | UIUC | 0.5048 |
| 8 | Aug 29, 2021 | RoBERTa (single model) | UIUC | 0.5021 |
| 9 | Aug 29, 2021 | InfoBERT (single model) | UIUC | 0.4603 |
| 10 | Aug 29, 2021 | ELECTRA (single model) | UIUC | 0.4169 |
| 11 | Aug 29, 2021 | BERT (single model) | UIUC | 0.3369 |
| 12 | Aug 29, 2021 | SMART_BERT (single model) | UIUC | 0.3029 |