Publications

Learning diverse attacks on large language models for robust red-teaming and safety tuning

Seanie Lee

Minsu Kim

Lynn Cherif

David Dobre

Juho Lee

Sung Ju Hwang

Moksh J. Jain

Red-teaming, or identifying prompts that elicit harmful responses, is a critical step in ensuring the safe and responsible deployment of large language models (LLMs). Developing effective protection against many modes of attack prompts requires discovering diverse attacks. Automated red-teaming typically uses reinforcement learning to fine-tune an attacker language model to generate prompts that elicit undesirable responses from a target LLM, as measured, for example, by an auxiliary toxicity classifier. We show that even with explicit regularization to favor novelty and diversity, existing approaches suffer from mode collapse or fail to generate effective attacks. As a flexible and probabilistically principled alternative, we propose to use GFlowNet fine-tuning, followed by a secondary smoothing phase, to train the attacker model to generate diverse and effective attack prompts. We find that the attacks generated by our method are effective against a wide range of target LLMs, both with and without safety tuning, and transfer well between target LLMs. Finally, we demonstrate that models safety-tuned using a dataset of red-teaming prompts generated by our method are robust to attacks from other RL-based red-teaming approaches.

2024-05-28

ArXiv (preprint)

doi.org

arxiv.org
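The abstract names GFlowNet fine-tuning of an attacker LM. For left-to-right text generators, a common GFlowNet objective is trajectory balance; the minimal sketch below assumes that objective and a tempered toxicity-classifier probability as the reward, and all function names are hypothetical. It is not the authors' implementation and omits the secondary smoothing phase.

```python
import math


def log_reward(toxicity_prob, temperature=1.0, floor=1e-8):
    """Hypothetical tempered log-reward: log R(x) = log p_toxic(x) / temperature."""
    return math.log(max(toxicity_prob, floor)) / temperature


def trajectory_balance_loss(log_z, token_logprobs, log_r):
    """Squared trajectory-balance residual for one sampled prompt.

    In left-to-right generation every prefix has exactly one parent, so the
    backward-policy term is zero and only forward log-probabilities remain.
    """
    log_pf = sum(token_logprobs)  # log p_theta(x) = sum_t log p_theta(x_t | x_<t)
    return (log_z + log_pf - log_r) ** 2


def batch_loss(log_z, batch):
    """Mean trajectory-balance loss over (token_logprobs, log_r) pairs."""
    return sum(trajectory_balance_loss(log_z, lp, lr) for lp, lr in batch) / len(batch)
```

When this loss reaches zero for all prompts, the attacker samples each prompt with probability proportional to R(x) instead of concentrating on the single highest-reward prompt, which is the intuition behind using GFlowNets for diversity.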

Structured Learning in Time-dependent Cox Models

Guanbo Wang

Yi Lian

Archer Yang

Robert W. Platt

Rui Wang

Sylvie Perreault

Marc Dorais

Mireille E. Schnitzer

2024-05-28

Statistics in Medicine (published)

doi.org

arxiv.org

The Cost of Arbitrariness for Individuals: Examining the Legal and Technical Challenges of Model Multiplicity

Prakhar Ganesh

Ihsan Ibrahim Daldaban

Ignacio Cofone

Golnoosh Farnadi

Model multiplicity, the phenomenon where multiple models achieve similar performance despite different underlying learned functions, introduces arbitrariness in model selection. While this arbitrariness may seem inconsequential in expectation, its impact on individuals can be severe. This paper explores various individual concerns stemming from multiplicity, including the effects of arbitrariness beyond final predictions, disparate arbitrariness for individuals belonging to protected groups, and the challenges associated with the arbitrariness of a single algorithmic system creating a monopoly across various contexts. It provides both an empirical examination of these concerns and a comprehensive analysis from the legal standpoint, addressing how these issues are perceived in Canadian anti-discrimination law. We conclude the discussion with the technical challenges that the current landscape of model multiplicity faces in meeting legal requirements, and with the legal gap between current law and the implications of arbitrariness in model selection, highlighting relevant future research directions for both disciplines.

2024-05-28

ArXiv (preprint)

doi.org

arxiv.org
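To make "arbitrariness in model selection" concrete, one measure used in the multiplicity literature is ambiguity: the share of individuals whose prediction from a reference model is contradicted by at least one competing model that performs about as well as the best. The sketch below assumes binary predictions, an accuracy tolerance `epsilon` defining the set of competing models, and hypothetical function names; it is not the paper's experimental protocol. The per-group variant illustrates the disparate-arbitrariness concern.

```python
def accuracy(preds, labels):
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def competing_models(all_preds, labels, epsilon):
    """Keep models whose accuracy is within epsilon of the best (a Rashomon-style set)."""
    accs = [accuracy(p, labels) for p in all_preds]
    best = max(accs)
    return [p for p, a in zip(all_preds, accs) if a >= best - epsilon]


def ambiguity(preds_set, reference, indices=None):
    """Fraction of individuals whose reference prediction some competing model contradicts."""
    indices = list(range(len(reference)) if indices is None else indices)
    flipped = sum(1 for i in indices if any(p[i] != reference[i] for p in preds_set))
    return flipped / len(indices)


def ambiguity_by_group(preds_set, reference, groups):
    """Ambiguity computed separately for each group label (e.g. a protected attribute)."""
    result = {}
    for g in sorted(set(groups)):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        result[g] = ambiguity(preds_set, reference, idx)
    return result
```

Two groups can face very different ambiguity even when overall accuracy is identical across the competing models, which is the kind of disparity the paper examines.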

Towards a Reliable French Speech Recognition Tool for an Automated Diagnosis of Learning Disabilities

Jihene Rezgui

Félix Jobin

Younes Kechout

Christine Turgeon

Foutse Khomh

Dyslexia, characterized by severe challenges in reading and spelling acquisition, presents a substantial barrier to proficient literacy, resulting in significantly reduced reading speed (2 to 3 times slower) and diminished text comprehension. With a prevalence ranging from 5% to 10% in the population, early intervention by speech and language pathologists (SLPs) can mitigate dyslexia's effects, but the diagnosis bottleneck impedes timely support. To address this, we propose leveraging machine learning tools to expedite the diagnosis process, focusing on automating phonetic transcription, a critical step in dyslexia assessment. We investigated the practicality of two model configurations using Google's speech-to-text API with children's speech in evaluation scenarios and compared their results against transcriptions produced by experts. The first configuration relies on Google's speech-to-text API alone, while the second integrates Phonemizer, a dictionary-based text-to-phoneme tool. Analysis of the results indicates that our Google-Phonemizer model yields reading accuracies comparable to those computed from human-made transcriptions, offering promise for clinical application. These findings underscore the potential of AI-driven solutions to make dyslexia diagnosis more efficient, paving the way for improved access to vital SLP services.

2024-05-28

2024 International Conference on Smart Applications, Communications and Networking (SmartNets) (published)

doi.org
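Comparing automatic transcriptions against expert ones, as this study does, needs a sequence-level agreement measure. The abstract does not define its reading-accuracy metric, so the following is only a sketch under assumptions: a phoneme error rate based on Levenshtein distance, and a hypothetical word-level reading accuracy that counts a word as correctly read when its phoneme string matches the expected one. Inputs are assumed to be already phonemized (for example with a tool such as Phonemizer) and word-aligned.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def phoneme_error_rate(ref_phonemes, hyp_phonemes):
    """Edit distance normalized by the reference length."""
    return edit_distance(ref_phonemes, hyp_phonemes) / len(ref_phonemes)


def reading_accuracy(expected_words, read_words):
    """Hypothetical metric: share of target words whose phoneme string was read exactly."""
    if len(expected_words) != len(read_words):
        raise ValueError("inputs are assumed to be word-aligned")
    correct = sum(e == r for e, r in zip(expected_words, read_words))
    return correct / len(expected_words)
```

Under these assumptions, "comparable reading accuracies" would mean that `reading_accuracy` computed from the automatic transcript is close to the value computed from the expert transcript of the same recording.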

Advancing Cultural Inclusivity: Optimizing Embedding Spaces for Balanced Music Recommendations

Armin Moradi

Nicola Neophytou

Golnoosh Farnadi

2024-05-27

ArXiv (preprint)

doi.org

arxiv.org

Comparative Study of Large Language Model Architectures on Frontier

Junqi Yin

Joey Bose

Guojing Cong

Isaac Lyngaas

Quentin Gregory Anthony

Large language models (LLMs) have garnered significant attention both within the AI community and beyond. Among these, the Generative Pre-trained Transformer (GPT) has emerged as the dominant architecture, spawning numerous variants. However, these variants have undergone pre-training under diverse conditions, including variations in input data, data preprocessing, and training methodologies, resulting in a lack of controlled comparative studies. Here we meticulously examine two prominent open-source GPT architectures, GPT-NeoX and LLaMA, leveraging the computational power of Frontier, the world’s first exascale supercomputer. Employing the same materials science text corpus and a comprehensive end-to-end pipeline, we conduct a comparative analysis of their training and downstream performance. Our efforts culminate in achieving state-of-the-art performance on a challenging materials science benchmark. Furthermore, we investigate the computation and energy efficiency, and propose a computationally efficient method for architecture design. To our knowledge, these pre-trained models represent the largest available for materials science. Our findings provide practical guidance for building LLMs on HPC platforms.

2024-05-27

2024 IEEE International Parallel and Distributed Processing Symposium (IPDPS) (published)

doi.org

arxiv.org

Distilling Privileged Multimodal Information for Expression Recognition using Optimal Transport

Muhammad Haseeb Aslam

Muhammad Osama Zeeshan

Soufiane Belharbi

Marco Pedersoli

Alessandro Lameiras Koerich

Simon Bacon

Eric Granger

Deep learning models for multimodal expression recognition have reached remarkable performance in controlled laboratory environments because of their ability to learn complementary and redundant semantic information. However, these models struggle in the wild, mainly because of the unavailability or poor quality of the modalities used for training. In practice, only a subset of the training-time modalities may be available at test time. Learning with privileged information enables models to exploit data from additional modalities that are only available during training. State-of-the-art knowledge distillation (KD) methods have been proposed to distill information from multiple teacher models (each trained on a modality) to a common student model. These privileged KD methods typically utilize point-to-point matching, yet have no explicit mechanism to capture the structural information in the teacher representation space formed by introducing the privileged modality. We argue that encoding this same structure in the student space may lead to enhanced student performance. This paper introduces a new structural KD mechanism based on optimal transport (OT), where entropy-regularized OT distills the structural dark knowledge. Our privileged KD with OT (PKDOT) method captures the local structures in the multimodal teacher representation by calculating a cosine similarity matrix and selecting the top-k anchors to allow for sparse OT solutions, resulting in a more stable distillation process. Experiments were performed on two challenging problems: pain estimation on the Biovid dataset (ordinal classification) and arousal-valence prediction on the Affwild2 dataset (regression). Results show that our proposed method can outperform state-of-the-art privileged KD methods on these problems. The diversity among modalities and fusion architectures indicates that PKDOT is modality- and model-agnostic.

2024-05-27

2024 IEEE 18th International Conference on Automatic Face and Gesture Recognition (FG) (published)

doi.org

arxiv.org
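The two ingredients named in the abstract, a cosine-similarity cost with top-k anchor selection and entropy-regularized OT, can be illustrated with a plain Sinkhorn iteration. This is a minimal sketch under stated assumptions, not PKDOT itself: uniform marginals, a cost of 1 minus cosine similarity, a top-k mask applied per teacher row, and hypothetical function names. PKDOT builds its structure from multimodal teacher representations, which this toy example does not model.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def topk_kernel(teacher, student, k, eps):
    """Gibbs kernel exp(-C / eps) with C = 1 - cosine similarity; only the k most
    similar student anchors per teacher row are kept (others set to 0, i.e. forbidden)."""
    kernel = []
    for t in teacher:
        sims = [cosine(t, s) for s in student]
        keep = set(sorted(range(len(student)), key=lambda j: -sims[j])[:k])
        kernel.append([math.exp(-(1.0 - sims[j]) / eps) if j in keep else 0.0
                       for j in range(len(student))])
    return kernel


def sinkhorn(kernel, n_iter=200):
    """Entropy-regularized OT plan with uniform marginals via Sinkhorn scaling."""
    n, m = len(kernel), len(kernel[0])
    a, b = [1.0 / n] * n, [1.0 / m] * m
    if any(all(kernel[i][j] == 0.0 for i in range(n)) for j in range(m)):
        raise ValueError("top-k mask leaves a column with no support; increase k")
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iter):
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def ot_cost(plan, teacher, student):
    """Transport cost <P, C>, usable as a structural distillation loss term."""
    return sum(plan[i][j] * (1.0 - cosine(t, s))
               for i, t in enumerate(teacher) for j, s in enumerate(student))
```

The regularization strength `eps` trades plan sharpness against numerical stability: very small values can underflow kernel entries to zero, while the top-k mask makes the plan sparse by construction.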

Estimating Expectations without Sampling: Neural Stein Estimation

Mohsin Hasan

Dinghuai Zhang

Cheikh Ahmed

Awa Khouna

Yoshua Bengio

We propose a method for estimating the expected value of a given function …

2024-05-27

approximateinference.org/AABI/2024/Symposium (accepted)

openreview.net

Guided Interpretable Facial Expression Recognition via Spatial Action Unit Cues

Soufiane Belharbi

Marco Pedersoli

Alessandro Lameiras Koerich

Simon Bacon

Eric Granger

Although state-of-the-art classifiers for facial expression recognition (FER) can achieve a high level of accuracy, they lack interpretability, an important feature for end-users. Experts typically associate spatial action units (AUs) from a codebook with facial regions for the visual interpretation of expressions. In this paper, the same expert steps are followed. A new learning strategy is proposed to explicitly incorporate AU cues into classifier training, allowing deep interpretable models to be trained. During training, this AU codebook is used, along with the input image expression label and facial landmarks, to construct an AU heatmap that indicates the most discriminative image regions of interest w.r.t. the facial expression. This valuable spatial cue is leveraged to train a deep interpretable classifier for FER. This is achieved by constraining the spatial layer features of a classifier to be correlated with AU heatmaps. Using a composite loss, the classifier is trained to correctly classify an image while yielding interpretable visual layer-wise attention correlated with AU maps, simulating the expert decision process. Our strategy relies only on the image expression class for supervision, without additional manual annotations. Our new strategy is generic, and can be applied to any deep CNN- or transformer-based classifier without requiring any architectural change or significant additional training time. Our extensive evaluation on two public benchmarks, RAF-DB and AffectNet, shows that our proposed strategy can improve layer-wise interpretability without degrading classification performance (code is available at https://github.com/sbelharbi/interpretable-fer-aus). In addition, we explore a common type of interpretable classifier that relies on class activation mapping (CAM) methods, and show that our approach can also improve CAM interpretability.

2024-05-27

2024 IEEE 18th International Conference on Automatic Face and Gesture Recognition (FG) (published)

doi.org

arxiv.org
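The abstract describes a composite loss: a classification term plus a term that pushes layer-wise spatial attention to correlate with AU heatmaps. The sketch below assumes cross-entropy for the first term and one minus the Pearson correlation, averaged over layers, for the second; maps are flattened lists, `lam` is a hypothetical weight, and nothing here is taken from the released code.

```python
import math


def cross_entropy(logits, label):
    """Softmax cross-entropy for one example, computed stably via log-sum-exp."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]


def pearson(x, y):
    """Pearson correlation between two flattened maps of equal length."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def composite_loss(logits, label, layer_attention, au_heatmap, lam=1.0):
    """Classification loss plus an alignment penalty averaged over layers.

    layer_attention: list of flattened attention maps, one per layer, assumed
    to be already resized to the AU heatmap resolution.
    """
    align = sum(1.0 - pearson(a, au_heatmap) for a in layer_attention) / len(layer_attention)
    return cross_entropy(logits, label) + lam * align
```

The alignment term is 0 when every layer's attention is perfectly correlated with the heatmap and 2 when it is perfectly anti-correlated, so `lam` sets how strongly interpretability is traded against classification.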

Implicitly Bayesian Prediction Rules in Deep Learning

Bruno Mlodozeniec

David Scott Krueger

Richard E. Turner

The Bayesian approach leads to coherent updates of predictions under new data, which makes adhering to Bayesian principles appealing in decision-making contexts. Traditionally, integrating Bayesian principles into models like deep neural networks involves setting priors on parameters and approximating posteriors. This is done despite the fact that, typically, priors on parameters reflect any prior beliefs only insofar as they dictate function-space behaviour. In this paper, we rethink this approach and consider what properties characterise a prediction rule as being Bayesian. Algorithms meeting such criteria can be deemed implicitly Bayesian: they make the same predictions as some Bayesian model, without explicitly manifesting priors and posteriors. We argue this might be a more fruitful approach towards integrating Bayesian principles into deep learning. We propose a way to measure how close a general prediction rule is to being implicitly Bayesian, and empirically evaluate multiple prediction strategies using our approach. We also show theoretically that agents relying on non-implicitly Bayesian prediction rules can be easily exploited in adversarial betting settings.

2024-05-27

approximateinference.org/AABI/2024/Symposium_Archival_Track (accepted)

proceedings.mlr.press

openreview.net
