Publications

Generative AI in Software Engineering Must Be Human-Centered: The Copenhagen Manifesto

Daniel Russo

Sebastian Baltes

Niels van Berkel

Paris Avgeriou

Fabio Calefato

Beatriz Cabrero-Daniel

Gemma Catolino

Jürgen Cito

Neil Ernst

Thomas Fritz

Hideaki Hata

Reid Holmes

Maliheh Izadi

Foutse Khomh

Mikkel Baun Kjærgaard

Grischa Liebel

Alberto Lluch Lafuente

Stefano Lambiase

Walid Maalej

Gail Murphy … (see 15 more)

Nils Brede Moe

Gabrielle O'Brien

Elda Paja

Mauro Pezzè

John Stouby Persson

Rafael Prikladnicki

Paul Ralph

Martin P. Robillard

Thiago Rocha Silva

Klaas-Jan Stol

Margaret-Anne Storey

Viktoria Stray

Paolo Tell

Christoph Treude

Bogdan Vasilescu

2024-04-30

Journal of Systems and Software (published)

doi.org

Harmony in Diversity: Merging Neural Networks with Canonical Correlation Analysis

Stefan Horoi

Albert Manuel Orozco Camacho

Eugene Belilovsky

Guy Wolf

Combining the predictions of multiple trained models through ensembling is generally a good way to improve accuracy by leveraging the differ… (see more)ent learned features of the models, however it comes with high computational and storage costs. Model fusion, the act of merging multiple models into one by combining their parameters reduces these costs but doesn't work as well in practice. Indeed, neural network loss landscapes are high-dimensional and non-convex and the minima found through learning are typically separated by high loss barriers. Numerous recent works have been focused on finding permutations matching one network features to the features of a second one, lowering the loss barrier on the linear path between them in parameter space. However, permutations are restrictive since they assume a one-to-one mapping between the different models' neurons exists. We propose a new model merging algorithm, CCA Merge, which is based on Canonical Correlation Analysis and aims to maximize the correlations between linear combinations of the model features. We show that our alignment method leads to better performances than past methods when averaging models trained on the same, or differing data splits. We also extend this analysis into the harder setting where more than 2 models are merged, and we find that CCA Merge works significantly better than past methods. Our code is publicly available at https://github.com/shoroi/align-n-merge

2024-04-30

International Conference on Machine Learning (poster)

doi.org

proceedings.mlr.press

Information Complexity of Stochastic Convex Optimization: Applications to Generalization and Memorization

Idan Attias

Gintare Karolina Dziugaite

MAHDI HAGHIFAM

Roi Livni

Daniel M. Roy

In this work, we investigate the interplay between memorization and learning in the context of \emph{stochastic convex optimization} (SCO). … (see more)We define memorization via the information a learning algorithm reveals about its training data points. We then quantify this information using the framework of conditional mutual information (CMI) proposed by Steinke and Zakynthinou (2020). Our main result is a precise characterization of the tradeoff between the accuracy of a learning algorithm and its CMI, answering an open question posed by Livni (2023). We show that, in the

2024-04-30

ICML.cc/2024/Conference (oral)

doi.org

openreview.net

Language-guided Skill Learning with Temporal Variational Inference

Haotian Fu

Pratyusha Sharma

Elias Stengel-Eskin

George Konidaris

Nicolas Le Roux

Marc-Alexandre Côté

Xingdi Yuan

We present an algorithm for skill discovery from expert demonstrations. The algorithm first utilizes Large Language Models (LLMs) to propose… (see more) an initial segmentation of the trajectories. Following that, a hierarchical variational inference framework incorporates the LLM-generated segmentation information to discover reusable skills by merging trajectory segments. To further control the trade-off between compression and reusability, we introduce a novel auxiliary objective based on the Minimum Description Length principle that helps guide this skill discovery process. Our results demonstrate that agents equipped with our method are able to discover skills that help accelerate learning and outperform baseline skill learning approaches on new long-horizon tasks in BabyAI, a grid world navigation environment, as well as ALFRED, a household simulation environment.

2024-04-30

ICML.cc/2024/Conference (poster)

doi.org

proceedings.mlr.press

Learning to Scale Logits for Temperature-Conditional GFlowNets

Joohwan Ko

Woo Chang Kim

Jinkyoo Park

Emmanuel Bengio

Yoshua Bengio

GFlowNets are probabilistic models that sequentially generate compositional structures through a stochastic policy. Among GFlowNets, tempera… (see more)ture-conditional GFlowNets can introduce temperature-based controllability for exploration and exploitation. We propose \textit{Logit-scaling GFlowNets} (Logit-GFN), a novel architectural design that greatly accelerates the training of temperature-conditional GFlowNets. It is based on the idea that previously proposed approaches introduced numerical challenges in the deep network training, since different temperatures may give rise to very different gradient profiles as well as magnitudes of the policy's logits. We find that the challenge is greatly reduced if a learned function of the temperature is used to scale the policy's logits directly. Also, using Logit-GFN, GFlowNets can be improved by having better generalization capabilities in offline learning and mode discovery capabilities in online learning, which is empirically verified in various biological and chemical tasks. Our code is available at https://github.com/dbsxodud-11/logit-gfn

2024-04-30

ICML.cc/2024/Conference (poster)

doi.org

proceedings.mlr.press

Learning the meanings of function words from grounded language using a visual question answering model

Eva Portelance

Michael C. Frank

Dan Jurafsky

Interpreting a seemingly-simple function word like "or", "behind", or "more" can require logical, numerical, and relational reasoning. How a… (see more)re such words learned by children? Prior acquisition theories have often relied on positing a foundation of innate knowledge. Yet recent neural-network based visual question answering models apparently can learn to use function words as part of answering questions about complex visual scenes. In this paper, we study what these models learn about function words, in the hope of better understanding how the meanings of these words can be learnt by both models and children. We show that recurrent models trained on visually grounded language learn gradient semantics for function words requiring spatial and numerical reasoning. Furthermore, we find that these models can learn the meanings of logical connectives and and or without any prior knowledge of logical reasoning, as well as early evidence that they are sensitive to alternative expressions when interpreting language. Finally, we show that word learning difficulty is dependent on frequency in models' input. Our findings offer proof-of-concept evidence that it is possible to learn the nuanced interpretations of function words in visually grounded context by using non-symbolic general statistical learning algorithms, without any prior knowledge of linguistic meaning.

2024-04-30

Cognitive Science (published)

doi.org

arxiv.org

MixEHR-SurG: a joint proportional hazard and guided topic model for inferring mortality-associated topics from electronic health records

Yixuan Li

Ariane Marelli

Archer Y. Yang

Yue Li

Survival models can help medical practitioners to evaluate the prognostic importance of clinical variables to patient outcomes such as morta… (see more)lity or hospital readmission and subsequently design personalized treatment regimes. Electronic Health Records (EHRs) hold the promise for large-scale survival analysis based on systematically recorded clinical features for each patient. However, existing survival models either do not scale to high dimensional and multi-modal EHR data or are difficult to interpret. In this study, we present a supervised topic model called MixEHR-SurG to simultaneously integrate heterogeneous EHR data and model survival hazard. Our contributions are three-folds: (1) integrating EHR topic inference with Cox proportional hazards likelihood; (2) integrating patient-specific topic hyperparameters using the PheCode concepts such that each topic can be identified with exactly one PheCode-associated phenotype; (3) multi-modal survival topic inference. This leads to a highly interpretable survival topic model that can infer PheCode-specific phenotype topics associated with patient mortality. We evaluated MixEHR-SurG using a simulated dataset and two real-world EHR datasets: the Quebec Congenital Heart Disease (CHD) data consisting of 8,211 subjects with 75,187 outpatient claim records of 1,767 unique ICD codes; the MIMIC-III consisting of 1,458 subjects with multi-modal EHR records. Compared to the baselines, MixEHR-SurG achieved a superior dynamic AUROC for mortality prediction, with a mean AUROC score of 0.89 in the simulation dataset and a mean AUROC of 0.645 on the CHD dataset. Qualitatively, MixEHR-SurG associates severe cardiac conditions with high mortality risk among the CHD patients after the first heart failure hospitalization and critical brain injuries with increased mortality among the MIMIC-III patients after their ICU discharge.

2024-04-30

Journal of Biomedical Informatics (published)

doi.org

arxiv.org

Mixtures of Experts Unlock Parameter Scaling for Deep RL

Johan Obando-Ceron

Ghada Sokar

Timon Willi

Clare Lyle

Jesse Farebrother

Jakob Foerster

Karolina Dziugaite

Doina Precup

Pablo Samuel Castro

The recent rapid progress in (self) supervised learning models is in large part predicted by empirical scaling laws: a model's performance s… (see more)cales proportionally to its size. Analogous scaling laws remain elusive for reinforcement learning domains, however, where increasing the parameter count of a model often hurts its final performance. In this paper, we demonstrate that incorporating Mixture-of-Expert (MoE) modules, and in particular Soft MoEs (Puigcerver et al., 2023), into value-based networks results in more parameter-scalable models, evidenced by substantial performance increases across a variety of training regimes and model sizes. This work thus provides strong empirical evidence towards developing scaling laws for reinforcement learning.

2024-04-30

ICML.cc/2024/Conference (spotlight)

doi.org

proceedings.mlr.press

Nash Learning from Human Feedback

Remi Munos

Michal Valko

Daniele Calandriello

Mohammad Gheshlaghi Azar

Mark Rowland

Zhaohan Daniel Guo

Yunhao Tang

Matthieu Geist

Thomas Mesnard

Côme Fiegel

Andrea Michi

Marco Selvi

Sertan Girgin

Nikola Momchev

Olivier Bachem

Daniel J Mankowitz

Doina Precup

Bilal Piot

Reinforcement learning from human feedback (RLHF) has emerged as the main paradigm for aligning large language models (LLMs) with human pref… (see more)erences. Traditionally, RLHF involves the initial step of learning a reward model from pairwise human feedback, i.e., expressed as preferences between pairs of text generations. Subsequently, the LLM’s policy is fine-tuned to maximize the reward through a reinforcement learning algorithm. In this study, we introduce an alternative pipeline for the fine-tuning of LLMs using pairwise human feedback. Our approach entails the initial learning of a pairwise preference model, which is conditioned on two inputs (instead of a single input in the case of a reward model) given a prompt, followed by the pursuit of a policy that consistently generates responses preferred over those generated by any competing policy, thus defining the Nash equilibrium of this preference model. We term this approach Nash learning from human feedback (NLHF). In the context of a tabular policy representation, we present a novel algorithmic solution, Nash-MD, founded on the principles of mirror descent. This algorithm produces a sequence of policies, with the last iteration converging to the regularized Nash equilibrium. Additionally, we explore parametric representations of policies and introduce gradient descent algorithms for deep-learning architectures. We illustrate the effectiveness of our approach by presenting experimental results on a text summarization task. We believe NLHF offers a compelling avenue for fine-tuning LLMs and enhancing the alignment of LLMs with human preferences.

2024-04-30

ICML.cc/2024/Conference (spotlight)

doi.org

proceedings.mlr.press

Nilearn and Big Data Facilitates Transdiagnostic Brain Biomarkers

Hao-Ting Wang

Natasha Clarke

Quentin Dessain

Fraçois Paugam

Lune P Bellec

2024-04-30

Biological Psychiatry (published)

doi.org

Performance of generative pre-trained transformers (GPTs) in Certification Examination of the College of Family Physicians of Canada

Mehdi Mousavi

Shabnam Shafiee

Jason M Harley

Jackie Chi Kit Cheung

Samira Abbasgholizadeh Rahimi

The application of large language models such as generative pre-trained transformers (GPTs) has been promising in medical education, and its… (see more) performance has been tested for different medical exams. This study aims to assess the performance of GPTs in responding to a set of sample questions of short-answer management problems (SAMPs) from the certification exam of the College of Family Physicians of Canada (CFPC). Between August 8th and 25th, 2023, we used GPT-3.5 and GPT-4 in five rounds to answer a sample of 77 SAMPs questions from the CFPC website. Two independent certified family physician reviewers scored AI-generated responses twice: first, according to the CFPC answer key (ie, CFPC score), and second, based on their knowledge and other references (ie, Reviews’ score). An ordinal logistic generalised estimating equations (GEE) model was applied to analyse repeated measures across the five rounds. According to the CFPC answer key, 607 (73.6%) lines of answers by GPT-3.5 and 691 (81%) by GPT-4 were deemed accurate. Reviewer’s scoring suggested that about 84% of the lines of answers provided by GPT-3.5 and 93% of GPT-4 were correct. The GEE analysis confirmed that over five rounds, the likelihood of achieving a higher CFPC Score Percentage for GPT-4 was 2.31 times more than GPT-3.5 (OR: 2.31; 95% CI: 1.53 to 3.47; p0.001). Similarly, the Reviewers’ Score percentage for responses provided by GPT-4 over 5 rounds were 2.23 times more likely to exceed th

2024-04-30

Family Medicine and Community Health (published)

doi.org

Position: Application-Driven Innovation in Machine Learning

David Rolnick

Alán Aspuru-Guzik

Sara Beery

Bistra Dilkina

Priya L. Donti

Marzyeh Ghassemi

Hannah Kerner

Claire Monteleoni

Esther Rolf

Milind Tambe

Adam White

2024-04-30

ICML.cc/2024/Conference (poster)

proceedings.mlr.press

Mila on Udemy

AI Policy Fellowship Publications

Mila Ventures Launchpad

Publications

Mila on Udemy

AI Policy Fellowship Publications

Mila Ventures Launchpad

Popular keywords:

Publications