Kenji Kawaguchi

Learning Diverse Attacks on Large Language Models for Robust Red-Teaming and Safety Tuning

Juho Lee

Sung Ju Hwang

Nikolay Malkin

Red-teaming, or identifying prompts that elicit harmful responses, is a critical step in ensuring the safe and responsible deployment of lar… (voir plus)ge language models (LLMs). Developing effective protection against many modes of attack prompts requires discovering diverse attacks. Automated red-teaming typically uses reinforcement learning to fine-tune an attacker language model to generate prompts that elicit undesirable responses from a target LLM, as measured, for example, by an auxiliary toxicity classifier. We show that even with explicit regularization to favor novelty and diversity, existing approaches suffer from mode collapse or fail to generate effective attacks. As a flexible and probabilistically principled alternative, we propose to use GFlowNet fine-tuning, followed by a secondary smoothing phase, to train the attacker model to generate diverse and effective attack prompts. We find that the attacks generated by our method are effective against a wide range of target LLMs, both with and without safety tuning, and transfer well between target LLMs. Finally, we demonstrate that models safety-tuned using a dataset of red-teaming prompts generated by our method are robust to attacks from other RL-based red-teaming approaches.

2025-01-21

International Conference on Learning Representations (poster)

doi.org

openreview.net

Unsupervised Concept Discovery Mitigates Spurious Correlations

Md Rifat Arefin

Francesco Locatello

Dianbo Liu

Models prone to spurious correlations in training data often produce brittle predictions and introduce unintended biases. Addressing this ch… (voir plus)allenge typically involves methods relying on prior knowledge and group annotation to remove spurious correlations, which may not be readily available in many applications. In this paper, we establish a novel connection between unsupervised object-centric learning and mitigation of spurious correlations. Instead of directly inferring subgroups with varying correlations with labels, our approach focuses on discovering concepts: discrete ideas that are shared across input samples. Leveraging existing object-centric representation learning, we introduce CoBalT: a concept balancing technique that effectively mitigates spurious correlations without requiring human labeling of subgroups. Evaluation across the benchmark datasets for sub-population shifts demonstrate superior or competitive performance compared state-of-the-art baselines, without the need for group annotation. Code is available at https://github.com/rarefin/CoBalT.

2024-07-07

Proceedings of the 41st International Conference on Machine Learning (publié)

doi.org

proceedings.mlr.press

Discrete Key-Value Bottleneck

Frederik Träuble

Anirudh Goyal

Nasim Rahaman

Michael Mozer

Kenji Kawaguchi

Yoshua Bengio

Bernhard Schölkopf

Deep neural networks perform well on classification tasks where data streams are i.i.d. and labeled data is abundant. Challenges emerge with… (voir plus) non-stationary training data streams such as continual learning. One powerful approach that has addressed this challenge involves pre-training of large encoders on volumes of readily available data, followed by task-specific tuning. Given a new task, however, updating the weights of these encoders is challenging as a large number of weights needs to be fine-tuned, and as a result, they forget information about the previous tasks. In the present work, we propose a model architecture to address this issue, building upon a discrete bottleneck containing pairs of separate and learnable key-value codes. Our paradigm will be to encode; process the representation via a discrete bottleneck; and decode. Here, the input is fed to the pre-trained encoder, the output of the encoder is used to select the nearest keys, and the corresponding values are fed to the decoder to solve the current task. The model can only fetch and re-use a sparse number of these key-value pairs during inference, enabling localized and context-dependent model updates. We theoretically investigate the ability of the discrete key-value bottleneck to minimize the effect of learning under distribution shifts and show that it reduces the complexity of the hypothesis class. We empirically verify the proposed method under challenging class-incremental learning scenarios and show that the proposed model - without any task boundaries - reduces catastrophic forgetting across a wide variety of pre-trained models, outperforming relevant baselines on this task.

2023-07-02

Proceedings of the 40th International Conference on Machine Learning (publié)

doi.org

proceedings.mlr.press

Adaptive Discrete Communication Bottlenecks with Dynamic Vector Quantization for Heterogeneous Representational Coarseness

Dianbo Liu

Alex Lamb

Xu Ji

Pascal Junior Tikeng Notsawo

Michael Mozer

Yoshua Bengio

Kenji Kawaguchi

Vector Quantization (VQ) is a method for discretizing latent representations and has become a major part of the deep learning toolkit. It ha… (voir plus)s been theoretically and empirically shown that discretization of representations leads to improved generalization, including in reinforcement learning where discretization can be used to bottleneck multi-agent communication to promote agent specialization and robustness. The discretization tightness of most VQ-based methods is defined by the number of discrete codes in the representation vector and the codebook size, which are fixed as hyperparameters. In this work, we propose learning to dynamically select discretization tightness conditioned on inputs, based on the hypothesis that data naturally contains variations in complexity that call for different levels of representational coarseness which is observed in many heterogeneous data sets. We show that dynamically varying tightness in communication bottlenecks can improve model performance on visual reasoning and reinforcement learning tasks with heterogeneity in representations.

2023-06-25

AAAI Conference on Artificial Intelligence (publié)

doi.org

arxiv.org

Simplicial Embeddings in Self-Supervised Learning and Downstream Classification

Samuel Lavoie

Simplicial Embeddings (SEM) are representations learned through self-supervised learning (SSL), wherein a representation is projected into …

2023-01-31

ICLR.cc/2023/Conference (notable)

doi.org

openreview.net

GFlowOut: Dropout with Generative Flow Networks

Dianbo Liu

Moksh Jain

Bonaventure F. P. Dossou

Qianli Shen

Salem Lahlou

Anirudh Goyal

Nikolay Malkin

Chris C. Emezue

Bayesian Inference offers principled tools to tackle many critical problems with modern neural networks such as poor calibration and general… (voir plus)ization, and data inefficiency. However, scaling Bayesian inference to large architectures is challenging and requires restrictive approximations. Monte Carlo Dropout has been widely used as a relatively cheap way for approximate Inference and to estimate uncertainty with deep neural networks. Traditionally, the dropout mask is sampled independently from a fixed distribution. Recent works show that the dropout mask can be viewed as a latent variable, which can be inferred with variational inference. These methods face two important challenges: (a) the posterior distribution over masks can be highly multi-modal which can be difficult to approximate with standard variational inference and (b) it is not trivial to fully utilize sample-dependent information and correlation among dropout masks to improve posterior estimation. In this work, we propose GFlowOut to address these issues. GFlowOut leverages the recently proposed probabilistic framework of Generative Flow Networks (GFlowNets) to learn the posterior distribution over dropout masks. We empirically demonstrate that GFlowOut results in predictive distributions that generalize better to out-of-distribution data, and provide uncertainty estimates which lead to better performance in downstream tasks.

2022-12-31

ICML (publié)

doi.org

proceedings.mlr.press

MixupE: Understanding and improving Mixup from directional derivative perspective

Vikas Verma

Yingtian Zou

Sarthak Mittal

Wai Hoh Tang

Hieu Pham

Juho Kannala

Yoshua Bengio

Arno Solin

Kenji Kawaguchi

2022-12-31

UAI (publié)

doi.org

proceedings.mlr.press

Supplementary Material for MixupE

Yingtian Zou

Vikas Verma

Sarthak Mittal

Wai Hoh Tang

Hieu Pham

Juho Kannala

Yoshua Bengio

Arno Solin

Kenji Kawaguchi

We denote by z = (x,y) the input and output pair where x ∈ X ⊆ R and y ∈ Y ⊆ R . Let fθ(x) ∈ R be the output of the logits (i.e.,… (voir plus) the last layer before the softmax or sigmoid) of the model parameterized by θ. We use l(θ, z) = h(fθ(x)) − yfθ(x) to denote the loss function. Let g(·) be the activation function. We use x(i) to index i-th element of the vector x and xj to represent j-th variable in a set. The notation list is:

2022-12-31

(publié)

www.semanticscholar.org

Discrete Factorial Representations as an Abstraction for Goal Conditioned Reinforcement Learning

Hongyu Zang

Xin Li

Romain Laroche

Yoshua Bengio

Remi Tachet des Combes

2022-10-31

arXiv (prépublication)

doi.org

arxiv.org

Interpolated Adversarial Training: Achieving Robust Neural Networks Without Sacrificing Too Much Accuracy

Alex Lamb

Vikas Verma

Kenji Kawaguchi

Juho Kannala

Alexander Matyasko

Yoshua Bengio

Savya Khosla

Adversarial robustness has become a central goal in deep learning, both in theory and in practice. However, successful methods to improve th… (voir plus)e adversarial robustness (such as adversarial training) greatly hurt generalization performance on the unperturbed data. This could have a major impact on how achieving adversarial robustness affects real world systems (i.e. many may opt to forego robustness if it can improve accuracy on the unperturbed data). We propose Interpolated Adversarial Training, which employs recently proposed interpolation based training methods in the framework of adversarial training. On CIFAR-10, adversarial training increases the standard test error (when there is no adversary) from 4.43% to 12.32%, whereas with our Interpolated adversarial training we retain adversarial robustness while achieving a standard test error of only 6.45%. With our technique, the relative increase in the standard error for the robust model is reduced from 178.1% to just 45.5%.

2022-09-30

Neural Networks (publié)

doi.org

arxiv.org

Discrete Factorial Representations as an Abstraction for Goal Conditioned RL

Hongyu Zang

Xin Li

Romain Laroche

Yoshua Bengio

Remi Tachet des Combes

2021-12-31

Neural Information Processing Systems (publié)

openreview.net

GraphMix: Improved Training of GNNs for Semi-Supervised Learning

Juho Kannala

We present GraphMix, a regularization method for Graph Neural Network based semi-supervised object classification, whereby we propose to tra… (voir plus)in a fully-connected network jointly with the graph neural network via parameter sharing and interpolation-based regularization. Further, we provide a theoretical analysis of how GraphMix improves the generalization bounds of the underlying graph neural network, without making any assumptions about the "aggregation" layer or the depth of the graph neural networks. We experimentally validate this analysis by applying GraphMix to various architectures such as Graph Convolutional Networks, Graph Attention Networks and Graph-U-Net. Despite its simplicity, we demonstrate that GraphMix can consistently improve or closely match state-of-the-art performance using even simpler architectures such as Graph Convolutional Networks, across three established graph benchmarks: Cora, Citeseer and Pubmed citation network datasets, as well as three newly proposed datasets: Cora-Full, Co-author-CS and Co-author-Physics.

2021-05-17

AAAI Conference on Artificial Intelligence (publié)

doi.org

openreview.net

Mila sur Udemy

Publications du Fellowship en politiques de l'IA

La plateforme Mila Ventures

Kenji Kawaguchi

Publications

Mila sur Udemy

Publications du Fellowship en politiques de l'IA

La plateforme Mila Ventures

Mots-clés populaires:

Kenji Kawaguchi

Publications