Kenji Kawaguchi

Unsupervised Concept Discovery Mitigates Spurious Correlations

Md Rifat Arefin

Yang Zhang

Aristide Baratin

Francesco Locatello

Irina Rish

Dianbo Liu

2024-07-08

Proceedings of the 41st International Conference on Machine Learning (published)

Learning diverse attacks on large language models for robust red-teaming and safety tuning

David Dobre

Juho Lee

Sung Ju Hwang

Moksh J. Jain

Red-teaming, or identifying prompts that elicit harmful responses, is a critical step in ensuring the safe and responsible deployment of large language models (LLMs). Developing effective protection against many modes of attack prompts requires discovering diverse attacks. Automated red-teaming typically uses reinforcement learning to fine-tune an attacker language model to generate prompts that elicit undesirable responses from a target LLM, as measured, for example, by an auxiliary toxicity classifier. We show that even with explicit regularization to favor novelty and diversity, existing approaches suffer from mode collapse or fail to generate effective attacks. As a flexible and probabilistically principled alternative, we propose to use GFlowNet fine-tuning, followed by a secondary smoothing phase, to train the attacker model to generate diverse and effective attack prompts. We find that the attacks generated by our method are effective against a wide range of target LLMs, both with and without safety tuning, and transfer well between target LLMs. Finally, we demonstrate that models safety-tuned using a dataset of red-teaming prompts generated by our method are robust to attacks from other RL-based red-teaming approaches.

2024-05-28

ArXiv (preprint)

Discrete Key-Value Bottleneck

Frederik Träuble

Anirudh Goyal

Nasim Rahaman

Michael Curtis Mozer

Yoshua Bengio

Bernhard Schölkopf

2023-07-03

Proceedings of the 40th International Conference on Machine Learning (published)

Adaptive Discrete Communication Bottlenecks with Dynamic Vector Quantization

Dianbo Liu

Alex Lamb

Xu Ji

Pascal Notsawo

Michael Curtis Mozer

Yoshua Bengio

2023-06-26

Proceedings of the AAAI Conference on Artificial Intelligence (published)

Simplicial Embeddings in Self-Supervised Learning and Downstream Classification

Samuel Lavoie

Simplicial Embeddings (SEM) are representations learned through self-supervised learning (SSL), wherein a representation is projected into …

2023-02-01

ICLR.cc/2023/Conference (notable)

GFlowOut: Dropout with Generative Flow Networks

Dianbo Liu

Moksh J. Jain

Bonaventure F. P. Dossou

Qianli Shen

2023-01-01

ICML (published)

MixupE: Understanding and Improving Mixup from Directional Derivative Perspective

Vikas Verma

Yingtian Zou

Sarthak Mittal

Wai Hoh Tang

Hieu Pham

Juho Kannala

Yoshua Bengio

Arno Solin

Mixup is a popular data augmentation technique for training deep neural networks where additional samples are generated by linearly interpolating pairs of inputs and their labels. This technique is known to improve the generalization performance in many learning paradigms and applications. In this work, we first analyze Mixup and show that it implicitly regularizes infinitely many directional derivatives of all orders. Based on this new insight, we propose an improved version of Mixup, theoretically justified to deliver better generalization performance than the vanilla Mixup. To demonstrate the effectiveness of the proposed method, we conduct experiments across various domains such as images, tabular data, speech, and graphs. Our results show that the proposed method improves Mixup across multiple datasets using a variety of architectures, for instance, exhibiting an improvement over Mixup by 0.8% in ImageNet top-1 accuracy.

2023-01-01

UAI (published)