Publications

Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences

Shreya Shankar

J.D. Zamfirescu-Pereira

Bjorn Hartmann

Aditya G Parameswaran

Ian Arawjo

Due to the cumbersome nature of human evaluation and limitations of code-based evaluation, Large Language Models (LLMs) are increasingly being used to assist humans in evaluating LLM outputs. Yet LLM-generated evaluators simply inherit all the problems of the LLMs they evaluate, requiring further human validation. We present a mixed-initiative approach to "validate the validators": aligning LLM-generated evaluation functions (be it prompts or code) with human requirements. Our interface, EvalGen, provides automated assistance to users in generating evaluation criteria and implementing assertions. While generating candidate implementations (Python functions, LLM grader prompts), EvalGen asks humans to grade a subset of LLM outputs; this feedback is used to select implementations that better align with user grades. A qualitative study finds overall support for EvalGen but underscores the subjectivity and iterative process of alignment. In particular, we identify a phenomenon we dub criteria drift: users need criteria to grade outputs, but grading outputs helps users define criteria. What is more, some criteria appear dependent on the specific LLM outputs observed (rather than independent criteria that can be defined a priori), raising serious questions for approaches that assume the independence of evaluation from observation of model outputs. We present our interface and implementation details, a comparison of our algorithm with a baseline approach, and implications for the design of future LLM evaluation assistants.

2024-04-18

ArXiv (preprint)

doi.org

arxiv.org

Asynchronous Algorithmic Alignment with Cocycles

Andrew Joseph Dudzik

Tamara von Glehn

Razvan Pascanu

Petar Veličković

State-of-the-art neural algorithmic reasoners make use of message passing in graph neural networks (GNNs). But typical GNNs blur the distinction between the definition and invocation of the message function, forcing a node to send messages to its neighbours at every layer, synchronously. When applying GNNs to learn to execute dynamic programming algorithms, however, on most steps only a handful of the nodes would have meaningful updates to send. One, hence, runs the risk of inefficiencies by sending too much irrelevant data across the graph. But more importantly, many intermediate GNN steps have to learn the identity functions, which is a non-trivial learning problem. In this work, we explicitly separate the concepts of node state update and message function invocation. With this separation, we obtain a mathematical formulation that allows us to reason about asynchronous computation in both algorithms and neural networks. Our analysis yields several practical implementations of synchronous scalable GNN layers that are provably invariant under various forms of asynchrony.

2024-04-17

Proceedings of the Second Learning on Graphs Conference (published)

doi.org

arxiv.org

Effects of gene dosage on cognitive ability: A function-based association study across brain and non-brain processes

Guillaume Huguet

Thomas Renne

Cécile Poulain

Alma Dubuc

Kuldeep Kumar

Sayeh Kazem

Worrawat Engchuan

Omar Shanta

Elise Douard

Catherine Proulx

Martineau Jean-Louis

Zohra Saci

Josephine Mollon

Laura Schultz

Emma E M Knowles

Simon R. Cox

David Porteous

Gail Davies

Paul Redmond

Sarah E. Harris

Gunter Schumann

Guillaume Dumas

Aurélie Labbe

Zdenka Pausova

Tomas Paus

Stephen W Scherer

Jonathan Sebat

Laura Almasy

David C. Glahn

Sébastien Jacquemont

Genomic Copy Number Variants (CNVs) that increase risk for neurodevelopmental disorders are also associated with lower cognitive ability in general population cohorts. Studies have focussed on a small set of recurrent CNVs, but burden analyses suggested that the vast majority of CNVs affecting cognitive ability are too rare to reach variant-level association. As a result, the full range of gene-dosage-sensitive biological processes linked to cognitive ability remains unknown. To investigate this issue, we identified all CNVs >50 kilobases in 258k individuals from 6 general population cohorts with assessments of general cognitive abilities. We performed a CNV-GWAS and functional burden analyses, which tested 6502 gene-sets defined by tissue and cell-type transcriptomics as well as gene ontology disrupted by all rare coding CNVs. CNV-GWAS identified a novel duplication at 2q12.3 associated with higher performance in cognitive ability. Among the 864 gene-sets associated with cognitive ability, only 11% showed significant effects for both deletions and duplications. Accordingly, we systematically observed negative correlations between deletion and duplication effect sizes across all levels of biological observations. We quantified the preferential effects of deletions versus duplications using tagDS, a new normalized metric. Cognitive ability was preferentially affected by cortical, presynaptic, and negative-regulation gene-sets when duplicated. In contrast, preferential effects of deletions were observed for subcortical, post-synaptic, and positive-regulation gene-sets. A large proportion of gene-sets assigned to non-brain organs were associated with cognitive ability due to low tissue specificity genes, which were associated with higher sensitivity to haploinsufficiency. Overall, most biological functions associated with cognitive ability are divided into those sensitive to either deletions or duplications.

2024-04-17

bioRxiv (preprint)

doi.org

Latent Space Representations of Neural Algorithmic Reasoners

Vladimir V. Mirjanić

Razvan Pascanu

Petar Veličković

Neural Algorithmic Reasoning (NAR) is a research area focused on designing neural architectures that can reliably capture classical computation, usually by learning to execute algorithms. A typical approach is to rely on Graph Neural Network (GNN) architectures, which encode inputs in high-dimensional latent spaces that are repeatedly transformed during the execution of the algorithm. In this work we perform a detailed analysis of the structure of the latent space induced by the GNN when executing algorithms. We identify two possible failure modes: (i) loss of resolution, making it hard to distinguish similar values; (ii) inability to deal with values outside the range observed during training. We propose to solve the first issue by relying on a softmax aggregator, and propose to decay the latent space in order to deal with out-of-range values. We show that these changes lead to improvements on the majority of algorithms in the standard CLRS-30 benchmark when using the state-of-the-art Triplet-GMPNN processor. Our code is available at https://github.com/mirjanic/nar-latent-spaces

2024-04-17

Proceedings of the Second Learning on Graphs Conference (published)

doi.org

arxiv.org

Many-Shot In-Context Learning

Rishabh Agarwal

Avi Singh

Lei M Zhang

Bernd Bohnet

Stephanie C.Y. Chan

Ankesh Anand

Zaheer Abbas

Azade Nova

John D Co-Reyes

Eric Chu

Feryal Behbahani

Aleksandra Faust

Hugo Larochelle

Large language models (LLMs) excel at few-shot in-context learning (ICL) -- learning from a few examples provided in context at inference, without any weight updates. Newly expanded context windows allow us to investigate ICL with hundreds or thousands of examples -- the many-shot regime. Going from few-shot to many-shot, we observe significant performance gains across a wide variety of generative and discriminative tasks. While promising, many-shot ICL can be bottlenecked by the available amount of human-generated examples. To mitigate this limitation, we explore two new settings: Reinforced and Unsupervised ICL. Reinforced ICL uses model-generated chain-of-thought rationales in place of human examples. Unsupervised ICL removes rationales from the prompt altogether, and prompts the model only with domain-specific questions. We find that both Reinforced and Unsupervised ICL can be quite effective in the many-shot regime, particularly on complex reasoning tasks. Finally, we demonstrate that, unlike few-shot learning, many-shot learning is effective at overriding pretraining biases, can learn high-dimensional functions with numerical inputs, and performs comparably to fine-tuning. We also find that inference cost increases linearly in the many-shot regime, and frontier LLMs benefit from many-shot ICL to varying degrees. Our analysis also reveals the limitations of next-token prediction loss as an indicator of downstream ICL performance.

2024-04-17

ArXiv (preprint)

doi.org

arxiv.org

On the Scalability of GNNs for Molecular Graphs

Maciej Sypetkowski

Frederik Wenkel

Farimah Poursafaei

Nia Dickson

Karush Suri

Philip Fradkin

Dominique Beaini

Scaling deep learning models has been at the heart of recent revolutions in language modelling and image generation. Practitioners have observed a strong relationship between model size, dataset size, and performance. However, structure-based architectures such as Graph Neural Networks (GNNs) are yet to show the benefits of scale, mainly due to the lower efficiency of sparse operations, large data requirements, and lack of clarity about the effectiveness of various architectures. We address this drawback of GNNs by studying their scaling behavior. Specifically, we analyze message-passing networks, graph Transformers, and hybrid architectures on the largest public collection of 2D molecular graphs. For the first time, we observe that GNNs benefit tremendously from the increasing scale of depth, width, number of molecules, number of labels, and the diversity in the pretraining datasets. We further demonstrate strong finetuning scaling behavior on 38 highly competitive downstream tasks, outclassing previous large models. This gives rise to MolGPS, a new graph foundation model that allows navigating the chemical space, outperforming the previous state of the art on 26 out of the 38 downstream tasks. We hope that our work paves the way for an era where foundational GNNs drive pharmaceutical drug discovery.

2024-04-17

ArXiv (preprint)

doi.org

arxiv.org

Masked Autoencoders for Microscopy are Scalable Learners of Cellular Biology

Oren Kraus

Kian Kenyon-Dean

Saber Saberian

Maryam Fallah

Peter McLean

Jess Leung

Vasudev Sharma

Ayla Khan

Jia Balakrishnan

Safiye Celik

Dominique Beaini

Maciej Sypetkowski

Chi Vicky Cheng

Kristen Morse

Maureen Makes

Ben Mabey

Berton Earnshaw

Featurizing microscopy images for use in biological research remains a significant challenge, especially for large-scale experiments spanning millions of images. This work explores the scaling properties of weakly supervised classifiers and self-supervised masked autoencoders (MAEs) when training with increasingly larger model backbones and microscopy datasets. Our results show that ViT-based MAEs outperform weakly supervised classifiers on a variety of tasks, achieving as much as an 11.5% relative improvement when recalling known biological relationships curated from public databases. Additionally, we develop a new channel-agnostic MAE architecture (CA-MAE) that allows for inputting images of different numbers and orders of channels at inference time. We demonstrate that CA-MAEs effectively generalize by inferring and evaluating on a microscopy image dataset (JUMP-CP) generated under different experimental conditions with a different channel structure than our pretraining data (RPI-93M). Our findings motivate continued research into scaling self-supervised learning on microscopy data in order to create powerful foundation models of cellular biology that have the potential to catalyze advancements in drug discovery and beyond. Relevant code and select models released with this work can be found at: https://github.com/recursionpharma/maes_microscopy.

2024-04-16

ArXiv (preprint)

doi.org

arxiv.org
