Adversarial GLUE (AdvGLUE) is a comprehensive benchmark for evaluating the adversarial robustness of language models. It covers five natural language understanding tasks from the GLUE benchmark and serves as an adversarial version of GLUE.
AdvGLUE considers textual adversarial attacks from different perspectives and hierarchies, including word-level transformations, sentence-level manipulations, and human-written adversarial examples, providing comprehensive coverage of diverse adversarial linguistic phenomena.
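For a quick look at what these perturbed examples look like, one convenient (unofficial) route is the copy of the benchmark hosted on the Hugging Face Hub. The dataset id adv_glue, the per-task configuration adv_sst2, and the field names below reflect that mirror and are not part of the official release, so treat this as a sketch:

```python
# Sketch: browsing AdvGLUE dev examples via the Hugging Face Hub mirror.
# The dataset id "adv_glue", config "adv_sst2", and column names are assumptions
# based on the Hub mirror, not on the official download links on this page.
from datasets import load_dataset

dev = load_dataset("adv_glue", "adv_sst2", split="validation")

print(dev.column_names)              # e.g. ['sentence', 'label', 'idx']
for example in dev.select(range(3)):
    # Each record pairs an adversarially perturbed SST-2 sentence with its label.
    print(example["label"], example["sentence"])
```

The official dev and test files linked below remain the authoritative copies for evaluation and submission.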
The quality of the AdvGLUE benchmark is validated by human annotators, and each adversarial example in the dataset reaches high inter-annotator agreement. To make sure the annotators fully understand the GLUE tasks, each worker is required to pass a training step before qualifying to work on the main filtering tasks for the generated adversarial examples.
News [1/25/2024] We are excited to release our test set with detailed annotations. Please check out our test dataset here. Note that we also include the benign GLUE dev set, with the method labeled as glue.
News [3/20/2023] We included an additional dev set containing detailed annotations of our adversarial dataset, including the adversarial attack method, the corresponding benign examples, and more. Please check out our new dev dataset here.
We have built a few resources to help you get started with the dataset.
Download a copy of the dataset (distributed under the CC BY-SA 4.0 license):
To evaluate your models, we have also made available the evaluation script we will use for official evaluation, along with a sample prediction file that the script will take as input. To run the evaluation, use:
python evaluate.py <path_to_dev> <path_to_predictions>
(see the sketch below for one way to produce a prediction file).
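The released sample prediction file defines the exact input format; as a rough sketch only, the snippet below assumes predictions are gathered into a single JSON object keyed by task name, with one predicted label id per dev example. The task keys, label encoding, and file names here are hypothetical, not official.

```python
# Sketch: writing a prediction file for evaluate.py, assuming (not confirmed by
# the official sample file) a JSON object that maps each task name to a list of
# predicted label ids in dev-set order.
import json

predictions = {
    "sst2": [1, 0, 1],   # hypothetical labels for three dev examples
    "qqp":  [0, 1],
    "mnli": [2, 0, 1],
    # ... remaining AdvGLUE tasks
}

with open("predictions.json", "w") as f:
    json.dump(predictions, f)

# Then score against the released dev file, e.g.:
#   python evaluate.py dev.json predictions.json
```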
Once you have built a model that works to your expectations on the dev set, you can submit it to get official scores on the dev set and a hidden test set. To preserve the integrity of test results, we do not release the test set to the public. Instead, we require you to submit your model so that we can run it on the test set for you. Here's a tutorial walking you through official evaluation of your model:
Submission Tutorial
Because AdvGLUE is an ongoing effort, we expect the dataset to evolve.
Ask us questions at boxinw2@illinois.edu and xuchejian@zju.edu.cn.
We thank the SQuAD team for allowing us to use their website template and submission tutorials.
AdvGLUE is an adversarial robustness evaluation benchmark that thoroughly tests and analyzes the vulnerabilities of natural language understanding systems to different adversarial transformations.
| Rank | Date | Model | Institution | Score |
|---|---|---|---|---|
| 1 | Jul 24, 2022 | TBD-name (single) | CASIA | 0.6545 |
| 2 | Mar 31, 2022 | CreAT (single model) | SJTU | 0.6249 |
| 3 | Aug 29, 2021 | DeBERTa (single model) | UIUC | 0.6086 |
| 4 | Aug 29, 2021 | ALBERT (single model) | UIUC | 0.5922 |
| 5 | Aug 29, 2021 | T5 (single model) | UIUC | 0.5682 |
| 6 | Aug 29, 2021 | SMART_RoBERTa (single model) | UIUC | 0.5371 |
| 7 | Aug 29, 2021 | FreeLB (single model) | UIUC | 0.5048 |
| 8 | Aug 29, 2021 | RoBERTa (single model) | UIUC | 0.5021 |
| 9 | Aug 29, 2021 | InfoBERT (single model) | UIUC | 0.4603 |
| 10 | Aug 29, 2021 | ELECTRA (single model) | UIUC | 0.4169 |
| 11 | Aug 29, 2021 | BERT (single model) | UIUC | 0.3369 |
| 12 | Aug 29, 2021 | SMART_BERT (single model) | UIUC | 0.3029 |