Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences
Shreya Shankar
J.D. Zamfirescu-Pereira
Bjorn Hartmann
Aditya G Parameswaran
Due to the cumbersome nature of human evaluation and limitations of code-based evaluation, Large Language Models (LLMs) are increasingly being used to assist humans in evaluating LLM outputs. Yet LLM-generated evaluators simply inherit all the problems of the LLMs they evaluate, requiring further human validation. We present a mixed-initiative approach to "validate the validators" -- aligning LLM-generated evaluation functions (be it prompts or code) with human requirements. Our interface, EvalGen, provides automated assistance to users in generating evaluation criteria and implementing assertions. While generating candidate implementations (Python functions, LLM grader prompts), EvalGen asks humans to grade a subset of LLM outputs; this feedback is used to select implementations that better align with user grades. A qualitative study finds overall support for EvalGen but underscores the subjectivity and iterative process of alignment. In particular, we identify a phenomenon we dub criteria drift: users need criteria to grade outputs, but grading outputs helps users define criteria. What is more, some criteria appear dependent on the specific LLM outputs observed (rather than independent criteria that can be defined a priori), raising serious questions for approaches that assume the independence of evaluation from observation of model outputs. We present our interface and implementation details, a comparison of our algorithm with a baseline approach, and implications for the design of future LLM evaluation assistants.
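The selection step described in this abstract (choosing candidate implementations that align with human grades) can be illustrated with a minimal sketch. This is not EvalGen's actual algorithm, which the abstract does not specify; it assumes a plain agreement rate as the alignment score, and the candidate functions, sample outputs, and grades are all hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical candidate assertions for one criterion ("the response is concise").
def short_by_words(output: str) -> bool:
    return len(output.split()) <= 20

def short_by_chars(output: str) -> bool:
    return len(output) <= 40

def agreement(candidate: Callable[[str], bool],
              outputs: List[str],
              human_grades: List[bool]) -> float:
    """Fraction of graded outputs on which the candidate matches the human grade."""
    matches = sum(candidate(o) == g for o, g in zip(outputs, human_grades))
    return matches / len(outputs)

def select_best(candidates: Dict[str, Callable[[str], bool]],
                outputs: List[str],
                human_grades: List[bool]) -> str:
    """Return the name of the candidate with the highest agreement rate."""
    return max(candidates,
               key=lambda name: agreement(candidates[name], outputs, human_grades))

# Hypothetical LLM outputs and the grades a user gave them (True = acceptable).
outputs = [
    "Paris.",
    "The capital of France is Paris, a city on the Seine.",
    "Well, " + "it depends on many things, " * 5,
]
human_grades = [True, True, False]

candidates = {"short_by_words": short_by_words, "short_by_chars": short_by_chars}
best = select_best(candidates, outputs, human_grades)
print(best)
```

A plain agreement rate treats false passes and false failures equally; a real system might weight them differently, which this sketch does not attempt.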
Asynchronous Algorithmic Alignment with Cocycles
Andrew Joseph Dudzik
Tamara von Glehn
Petar Veličković
State-of-the-art neural algorithmic reasoners make use of message passing in graph neural networks (GNNs). But typical GNNs blur the distinction between the definition and invocation of the message function, forcing a node to send messages to its neighbours at every layer, synchronously. When applying GNNs to learn to execute dynamic programming algorithms, however, on most steps only a handful of the nodes would have meaningful updates to send. One, hence, runs the risk of inefficiencies by sending too much irrelevant data across the graph. But more importantly, many intermediate GNN steps have to learn identity functions, which is a non-trivial learning problem. In this work, we explicitly separate the concepts of node state update and message function invocation. With this separation, we obtain a mathematical formulation that allows us to reason about asynchronous computation in both algorithms and neural networks. Our analysis yields several practical implementations of synchronous scalable GNN layers that are provably invariant under various forms of asynchrony.
Effects of gene dosage on cognitive ability: A function-based association study across brain and non-brain processes
Guillaume Huguet
Thomas Renne
Cécile Poulain
Alma Dubuc
Kuldeep Kumar
Sayeh Kazem
Worrawat Engchuan
Omar Shanta
Elise Douard
Catherine Proulx
Martineau Jean-Louis
Zohra Saci
Josephine Mollon
Laura Schultz
Emma E M Knowles
Simon R. Cox
David Porteous
Gail Davies
Paul Redmond
Sarah E. Harris … (10 more)
Gunter Schumann
Aurélie Labbe
Zdenka Pausova
Tomas Paus
Stephen W Scherer
Jonathan Sebat
Laura Almasy
David C. Glahn
Sébastien Jacquemont
Genomic Copy Number Variants (CNVs) that increase risk for neurodevelopmental disorders are also associated with lower cognitive ability in general population cohorts. Studies have focussed on a small set of recurrent CNVs, but burden analyses suggested that the vast majority of CNVs affecting cognitive ability are too rare to reach variant-level association. As a result, the full range of gene-dosage-sensitive biological processes linked to cognitive ability remains unknown. To investigate this issue, we identified all CNVs >50 kilobases in 258k individuals from 6 general population cohorts with assessments of general cognitive abilities. We performed a CNV-GWAS and functional burden analyses, which tested 6502 gene-sets defined by tissue and cell-type transcriptomics as well as gene ontology disrupted by all rare coding CNVs. CNV-GWAS identified a novel duplication at 2q12.3 associated with higher performance in cognitive ability. Among the 864 gene-sets associated with cognitive ability, only 11% showed significant effects for both deletions and duplications. Accordingly, we systematically observed negative correlations between deletion and duplication effect sizes across all levels of biological observations. We quantified the preferential effects of deletions versus duplications using tagDS, a new normalized metric. Cognitive ability was preferentially affected by cortical, presynaptic, and negative-regulation gene-sets when duplicated. In contrast, preferential effects of deletions were observed for subcortical, post-synaptic, and positive-regulation gene-sets. A large proportion of gene-sets assigned to non-brain organs were associated with cognitive ability due to low tissue specificity genes, which were associated with higher sensitivity to haploinsufficiency. Overall, most biological functions associated with cognitive ability are divided into those sensitive to either deletions or duplications.
Latent Space Representations of Neural Algorithmic Reasoners
Vladimir V. Mirjanić
Petar Veličković
University of Cambridge
Google DeepMind
Neural Algorithmic Reasoning (NAR) is a research area focused on designing neural architectures that can reliably capture classical computation, usually by learning to execute algorithms. A typical approach is to rely on Graph Neural Network (GNN) architectures, which encode inputs in high-dimensional latent spaces that are repeatedly transformed during the execution of the algorithm. In this work we perform a detailed analysis of the structure of the latent space induced by the GNN when executing algorithms. We identify two possible failure modes: (i) loss of resolution, making it hard to distinguish similar values; (ii) inability to deal with values outside the range observed during training. We propose to solve the first issue by relying on a softmax aggregator, and propose to decay the latent space in order to deal with out-of-range values. We show that these changes lead to improvements on the majority of algorithms in the standard CLRS-30 benchmark when using the state-of-the-art Triplet-GMPNN processor. Our code is available at https://github.com/mirjanic/nar-latent-spaces
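To make the softmax aggregator concrete, here is a minimal sketch of softmax-weighted aggregation over scalar messages. It is a generic formulation, not necessarily the exact aggregator used in the paper; the function name and the `temperature` parameter are assumptions made for illustration, and a real GNN layer would apply the same idea per feature dimension.

```python
import math
from typing import List

def softmax_aggregate(messages: List[float], temperature: float = 1.0) -> float:
    """Aggregate messages as a softmax-weighted average.

    Each message is weighted by softmax(message / temperature), so larger
    messages dominate but smaller ones still contribute. As temperature
    approaches 0 the result approaches max(messages).
    """
    m = max(messages)  # subtract the max for numerical stability
    weights = [math.exp((x - m) / temperature) for x in messages]
    total = sum(weights)
    return sum(w * x for w, x in zip(weights, messages)) / total

msgs = [1.0, 2.0, 3.0]
smooth = softmax_aggregate(msgs)                    # between the mean and the max
sharp = softmax_aggregate(msgs, temperature=0.01)   # close to max(msgs)
print(smooth, sharp)
```

Unlike a hard max, every message here influences the result, so gradients reach all neighbours rather than only the largest one.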
Many-Shot In-Context Learning
Avi Singh
Lei M Zhang
Bernd Bohnet
Stephanie C.Y. Chan
Ankesh Anand
Zaheer Abbas
Azade Nova
John D Co-Reyes
Eric Chu
Feryal Behbahani
Aleksandra Faust
Large language models (LLMs) excel at few-shot in-context learning (ICL) -- learning from a few examples provided in context at inference, without any weight updates. Newly expanded context windows allow us to investigate ICL with hundreds or thousands of examples -- the many-shot regime. Going from few-shot to many-shot, we observe significant performance gains across a wide variety of generative and discriminative tasks. While promising, many-shot ICL can be bottlenecked by the available amount of human-generated examples. To mitigate this limitation, we explore two new settings: Reinforced and Unsupervised ICL. Reinforced ICL uses model-generated chain-of-thought rationales in place of human examples. Unsupervised ICL removes rationales from the prompt altogether, and prompts the model only with domain-specific questions. We find that both Reinforced and Unsupervised ICL can be quite effective in the many-shot regime, particularly on complex reasoning tasks. Finally, we demonstrate that, unlike few-shot learning, many-shot learning is effective at overriding pretraining biases, can learn high-dimensional functions with numerical inputs, and performs comparably to fine-tuning. We also find that inference cost increases linearly in the many-shot regime, and frontier LLMs benefit from many-shot ICL to varying degrees. Our analysis also reveals the limitations of next-token prediction loss as an indicator of downstream ICL performance.
On the Scalability of GNNs for Molecular Graphs
Maciej Sypetkowski
Frederik Wenkel
Farimah Poursafaei
Nia Dickson
Karush Suri
Philip Fradkin
Scaling deep learning models has been at the heart of recent revolutions in language modelling and image generation. Practitioners have observed a strong relationship between model size, dataset size, and performance. However, structure-based architectures such as Graph Neural Networks (GNNs) are yet to show the benefits of scale, mainly due to the lower efficiency of sparse operations, large data requirements, and lack of clarity about the effectiveness of various architectures. We address this drawback of GNNs by studying their scaling behavior. Specifically, we analyze message-passing networks, graph Transformers, and hybrid architectures on the largest public collection of 2D molecular graphs. For the first time, we observe that GNNs benefit tremendously from the increasing scale of depth, width, number of molecules, number of labels, and the diversity in the pretraining datasets. We further demonstrate strong finetuning scaling behavior on 38 highly competitive downstream tasks, outclassing previous large models. This gives rise to MolGPS, a new graph foundation model that allows navigation of the chemical space, outperforming the previous state of the art on 26 out of the 38 downstream tasks. We hope that our work paves the way for an era where foundational GNNs drive pharmaceutical drug discovery.
Masked Autoencoders for Microscopy are Scalable Learners of Cellular Biology
Oren Kraus
Kian Kenyon-Dean
Saber Saberian
Maryam Fallah
Peter McLean
Jess Leung
Vasudev Sharma
Ayla Khan
Jia Balakrishnan
Safiye Celik
Maciej Sypetkowski
Chi Vicky Cheng
Kristen Morse
Maureen Makes
Ben Mabey
Berton Earnshaw
Featurizing microscopy images for use in biological research remains a significant challenge, especially for large-scale experiments spanning millions of images. This work explores the scaling properties of weakly supervised classifiers and self-supervised masked autoencoders (MAEs) when training with increasingly larger model backbones and microscopy datasets. Our results show that ViT-based MAEs outperform weakly supervised classifiers on a variety of tasks, achieving as much as an 11.5% relative improvement when recalling known biological relationships curated from public databases. Additionally, we develop a new channel-agnostic MAE architecture (CA-MAE) that allows for inputting images of different numbers and orders of channels at inference time. We demonstrate that CA-MAEs effectively generalize by inferring and evaluating on a microscopy image dataset (JUMP-CP) generated under different experimental conditions with a different channel structure than our pretraining data (RPI-93M). Our findings motivate continued research into scaling self-supervised learning on microscopy data in order to create powerful foundation models of cellular biology that have the potential to catalyze advancements in drug discovery and beyond. Relevant code and select models released with this work can be found at: https://github.com/recursionpharma/maes_microscopy.