Publications

Capture the Flag: Uncovering Data Insights with Large Language Models.

Issam H. Laradji

Perouz Taslakian

Sai Rajeswar

Valentina Zantedeschi

Alexandre Lacoste

Nicolas Chapados

David Vázquez

Christopher Pal

Alexandre Drouin

The extraction of a small number of relevant insights from vast amounts of data is a crucial component of data-driven decision-making. However, accomplishing this task requires considerable technical skills, domain expertise, and human labor. This study explores the potential of using Large Language Models (LLMs) to automate the discovery of insights in data, leveraging recent advances in reasoning and code generation techniques. We propose a new evaluation methodology based on a "capture the flag" principle, measuring the ability of such models to recognize meaningful and pertinent information (flags) in a dataset. We further propose two proof-of-concept agents, with different inner workings, and compare their ability to capture such flags in a real-world sales dataset. While the work reported here is preliminary, our results are sufficiently interesting to mandate future exploration by the community.

2023-11-06

NeurIPS.cc/2023/Workshop/FMDM (published)

doi.org

openreview.net

The Unsolved Challenges of LLMs as Generalist Web Agents: A Case Study

Rim Assouel

Tom Marty

Massimo Caccia

Issam Hadj Laradji

Alexandre Drouin

Sai Rajeswar

Hector Palacios

Quentin Cappart

David Vázquez

Nicolas Chapados

Maxime Gasse

Alexandre Lacoste

2023-11-06

NeurIPS.cc/2023/Workshop/FMDM (published)

openreview.net

30×30 biodiversity gains rely on national coordination

Isaac Eckert

Andrea Brown

Dominique Caron

Federico Riva

Laura J. Pollock

2023-11-05

Nature Communications (published)

doi.org

Laplacian Change Point Detection for Single and Multi-view Dynamic Graphs

Samy Coulombe

Dynamic graphs are rich data structures that are used to model complex relationships between entities over time. In particular, anomaly detection in temporal graphs is crucial for many real-world applications such as intrusion identification in network systems, detection of ecosystem disturbances and detection of epidemic outbreaks. In this paper, we focus on change point detection in dynamic graphs and address three main challenges associated with this problem: i) how to compare graph snapshots across time, ii) how to capture temporal dependencies, and iii) how to combine different views of a temporal graph. To solve the above challenges, we first propose Laplacian Anomaly Detection (LAD), which uses the spectrum of the graph Laplacian as the low-dimensional embedding of the graph structure at each snapshot. LAD explicitly models short-term and long-term dependencies by applying two sliding windows. Next, we propose MultiLAD, a simple and effective generalization of LAD to multi-view graphs. MultiLAD provides the first change point detection method for multi-view dynamic graphs. It aggregates the singular values of the normalized graph Laplacian from different views through the scalar power mean operation. Through extensive synthetic experiments, we show that i) LAD and MultiLAD are accurate and outperform state-of-the-art baselines and their multi-view extensions by a large margin, ii) MultiLAD's advantage over contenders significantly increases when additional views are available, and iii) MultiLAD is highly robust to noise from individual views. In five real-world dynamic graphs, we demonstrate that LAD and MultiLAD identify significant events as top anomalies, such as the implementation of government COVID-19 interventions which impacted the population mobility in multi-view traffic networks.

2023-11-05

ACM Transactions on Knowledge Discovery from Data (published)

doi.org

arxiv.org

Coordination among leaf and fine root traits across a strong natural soil fertility gradient

Xavier Guilbeault-Mayers

Hans Lambers

Étienne Laliberté

2023-11-04

bioRxiv (preprint)

doi.org

Player-Guided AI outperforms standard AI in Sequence Alignment Puzzles.

Renata Mutalova

Roman Sarrazin-Gendron

Parham Ghasemloo Gheidari

Eddie Cai

Gabriel Richard

Sébastien Caisse

Rob Knight

Mathieu Blanchette

Attila Szantner

Jérôme Waldispühl

Although Artificial Intelligence (AI) has gained widespread popularity across different fields, it is essential to recognize that AI systems, while impressive, do not consistently exhibit robust generalization, particularly for difficult problems such as the Multiple Sequence Alignment (MSA). In this study, we focus on bridging this performance gap by integrating human solutions into AI training. To illustrate these principles, we leverage data from Borderlands Science, a popular citizen science game in which small instances of the MSA problem are represented as puzzles. Our goal is to leverage the collective intelligence of human players to enhance the capabilities of AI agents. To achieve this, we have developed a Player-guided AI system that enables the AI model to learn from both standard training processes and the solutions provided by players. Our findings demonstrate that incorporating human-annotated information into the AI model improves its performance on puzzle tasks. Furthermore, the Player-guided AI model shows a decrease in noise compared to a pure AI model. This advancement allows for leveraging the model to align new sequences with improved accuracy and effectiveness. Moreover, this research brings attention to the potential of integrating AI and human expertise to address other challenges where the performance of AI models may be unsatisfactory.

2023-11-04

International Conference on Climate Informatics (published)

doi.org

The feature landscape of visual cortex

Rudi Tong

Ronan da Silva

James Wilsenach

Stuart Trenholm

Understanding computations in the visual system requires a characterization of the distinct feature preferences of neurons in different visual cortical areas. However, we know little about how feature preferences of neurons within a given area relate to that area’s role within the global organization of visual cortex. To address this, we recorded from thousands of neurons across six visual cortical areas in mouse and leveraged generative AI methods combined with closed-loop neuronal recordings to identify each neuron’s visual feature preference. First, we discovered that the mouse’s visual system is globally organized to encode features in a manner invariant to the types of image transformations induced by self-motion. Second, we found differences in the visual feature preferences of each area and that these differences generalized across animals. Finally, we observed that a given area’s collection of preferred stimuli (‘own-stimuli’) drive neurons from the same area more effectively through their dynamic range compared to preferred stimuli from other areas (‘other-stimuli’). As a result, feature preferences of neurons within an area are organized to maximally encode differences among own-stimuli while remaining insensitive to differences among other-stimuli. These results reveal how visual areas work together to efficiently encode information about the external world.

2023-11-04

bioRxiv (preprint)

doi.org

Score-Based Likelihood Characterization for Inverse Problems in the Presence of Non-Gaussian Noise

Ronan Legin

Alexandre Adam

Yashar Hezaveh

Laurence Perreault-Levasseur

Likelihood analysis is typically limited to normally distributed noise due to the difficulty of determining the probability density function of complex, high-dimensional, non-Gaussian, and anisotropic noise. This work presents Score-based LIkelihood Characterization (SLIC), a framework that resolves this issue by building a data-driven noise model using a set of noise realizations from observations. We show that the approach produces unbiased and precise likelihoods even in the presence of highly non-Gaussian correlated and spatially varying noise. We use diffusion generative models to estimate the gradient of the probability density of noise with respect to data elements. In combination with the Jacobian of the physical model of the signal, we use Langevin sampling to produce independent samples from the unbiased likelihood. We demonstrate the effectiveness of the method using real data from the Hubble Space Telescope and James Webb Space Telescope.

2023-11-02

NeurIPS.cc/2023/Workshop/Deep_Inverse (poster)

openreview.net

Empowering Clinicians with MeDT: A Framework for Sepsis Treatment

Aamer Abdul Rahman

Pranav Agarwal

Vincent Michalski

Rita Noumeir

S Ebrahimi Kahou

2023-11-01

NeurIPS.cc/2023/Workshop/GCRL (published)

openreview.net

Goal Misgeneralization as Implicit Goal Conditioning

Diego Dorn

Neel Alex

David M. Krueger

2023-11-01

NeurIPS.cc/2023/Workshop/GCRL (published)

openreview.net

How does fine-tuning affect your model? Mechanistic analysis on procedural tasks

Samyak Jain

Robert Kirk

Ekdeep Singh Lubana

Robert P. Dick

Hidenori Tanaka

Tim Rocktäschel

Edward Grefenstette

David M. Krueger

Fine-tuning large pre-trained models has become the *de facto* strategy for developing models that are safe to deploy. However, there has been little work that explains how fine-tuning alters the underlying capabilities learnt by a model during pretraining: does fine-tuning yield entirely novel capabilities or does it just modulate existing ones? We address this question empirically in *synthetic* settings with mechanistic interpretability tools (e.g., network pruning and probing) to understand how the model's underlying capabilities are changing. Our extensive analysis of the effects of fine-tuning shows: (i) fine-tuning rarely alters the underlying model capabilities; (ii) a minimal transformation, which we call a 'wrapper', is typically learned on top of the underlying model capabilities; and (iii) further fine-tuning on a task where such wrapped capabilities are relevant leads to sample-efficient "revival" of the capability, i.e., the model begins reusing this capability in a few gradient steps. *This indicates practitioners can unintentionally remove a model's safety wrapper by merely fine-tuning it on a superficially unrelated task.* We additionally perform analysis on language models trained on the TinyStories dataset to support our claims in a more realistic setup.

2023-11-01

NeurIPS.cc/2023/Workshop/UniReps (poster)

openreview.net

Identification of Acute Myeloid Leukemia Cell Surface Therapeutic Targets Using Single Cell RNA Sequencing Supported By Surface Proteomics

Véronique Lisi

Banafsheh Khakipoor

Azer Farah

Marie-Eve Bordeleau

Éric Audemard

Arnaud Metois

Louis Theret

Jean-François Spinella

Jalila Chagraoui

Ossama Moujaber

Laure Mallinger

Isabel Boivin

Nadine Mayotte

Azadeh Hajmirza

Éric Bonneil

Francois Béliveau

Albert Feghali

Geneviève Boucher

Patrick Gendron

Frederic Barabe

Sébastien Lemieux

Guillaume Richard-Carpentier

Josée Hébert

Philippe Roux

Guy Sauvageau

Vincent-Philippe Lavallee

Background: Acute myeloid leukemia (AML) comprises diverse genomic subgroups and remains hard to treat in most patients. Despite breakthroughs in the therapeutic arsenal in recent years, clinical usage of therapeutic antibodies or chimeric antigen receptor T (CAR-T) cells has been lagging in contrast to other hematological malignancies. In fact, CD33 represents the only antibody-based strategy approved for this disease to date, highlighting the need to identify new promising targets. AML cells span a wide range of aberrant myeloid differentiation programs, complexifying the identification, by bulk genomics, of targets expressed in the most immature leukemic cells. Aims and Methods: To identify the expression landscape of surface proteins in immature leukemic cells, we performed single-cell RNA sequencing (scRNA-seq, 10x 3' Reagent Kits) of primary human AML cells from 20 specimens of the Leucegene cohort enriched in intermediate and adverse genetic backgrounds (KMT2A-rearranged (n=5), chromosome 5 and/or 7 deletions (abn5/7, n=5), complex karyotype (n=4), NPM1/DNMT3A/FLT3-ITD triple-mutant (n=3), and others (n=3)). A Random Forest classifier was developed to unbiasedly classify AML cells into distinct differentiation stages using normal bone marrow-derived scRNA-seq data from the Human Cell Atlas (HCA) consortium. Genes were scored based on their probability of coding for proteins expressed at the cell surface using the SPAT algorithm developed by our group (https://doi.org/10.1101/2023.07.07.547075), retaining those with high scores. To validate surface expression, we concomitantly analyzed the surface proteome (hereafter named surfaceome) of 100 primary human AML samples from the Leucegene cohort, including all 20 samples profiled by scRNA-seq. Results: After quality control, we profiled and characterized 103 690 high-quality cells (mean of 5185 cells/sample).
We trained a Random Forest classifier to annotate cells in a two-step process, first identifying plasma cells based on a restricted list of genes abundantly expressed in these cells and subsequently assigning the remaining cells to one of 33 cell types. We performed a five-fold cross-validation of the model and subsequently determined the accuracy of our classifier to be 92% on the test subset of the HCA data. Applied to our AML cell collection, a total of 35 053 cells (34%) were unbiasedly classified as Hematopoietic Stem Cell (HSC)-like, corresponding to the most phenotypically immature leukemic cells in each patient sample (ranging from 4 to 74%). Accordingly, HSC-like AML cells preferentially express genes associated with normal HSCs, such as CD34, FAM30A, and SPINK2, and globally lack expression of mature lineage-defining genes, further validating our classifier. The proportion of HSC-like cells varied among AML subgroups, and was lowest in KMT2A-r AML (median 19%) and highest in abn5/7 samples (46%). Integration of our AML atlas using the Harmony algorithm preserved differentiation hierarchies across samples, with most cell types, including HSC-like cells, occupying a defined area in the low-dimensional embedding. To identify new surface antigens specifically expressed in immature leukemic cells, we compared the high (≥8) SPAT score gene expression profile of AML HSC-like cells with that of normal HSC cells (HCA), and identified 60 genes significantly overexpressed in AML immature cells. Of those, 39 genes were also detected at the protein level by the surfaceome analysis, supporting their predicted expression at the cell surface in AML samples. 59% of these 39 genes (n=23) were detected in over 80% of the specimens analyzed by the surfaceome, and thus are nearly universally expressed in our AML cohort. To identify targets of therapies that could be repurposed, we next evaluated the relevance of our findings by querying the Thera-SAbDab database.
Most interestingly, 8 of the 39 AML-specific HSC markers are targeted by therapeutic antibodies that are FDA-approved or in clinical trials for the treatment of AML (n=4, IL3RA, FLT3, CD37 and TNFRSF10B) or other indications (n=4). Conclusion: Our genetically diverse AML single-cell atlas, supported by mass spectrometry, enables the identification of both subset-specific and pan-AML surface protein genes. These represent potential targets for antibody-based strategy development or therapy repurposing in AML.

2023-11-01

Blood (published)

doi.org
