Publications

Augmenting Evolutionary Models with Structure-based Retrieval
Yining Huang
Zuobai Zhang
Debora Susan Marks
Pascal Notin
Bias-inducing geometries: exactly solvable data model with fairness implications
Stefano Sarao Mannelli
Federica Gerace
Luca Saglietti
Machine learning (ML) may be oblivious to human bias but it is not immune to its perpetuation. Marginalisation and iniquitous group represen… (see more)tation are often traceable in the very data used for training, and may be reflected or even enhanced by the learning models. In this abstract, we aim to clarify the role played by data geometry in the emergence of ML bias. We introduce an exactly solvable high-dimensional model of data imbalance, where parametric control over the many bias-inducing factors allows for an extensive exploration of the bias inheritance mechanism. Through the tools of statistical physics, we analytically characterise the typical properties of learning models trained in this synthetic framework and obtain exact predictions for the observables that are commonly employed for fairness assessment. Simplifying the nature of the problem to its minimal components, we can retrace and unpack typical unfairness behaviour observed on real-world datasets
Demystifying amortized causal discovery with transformers
Francesco Montagna
Max Cairney-Leeming
Francesco Locatello
Supervised learning approaches for causal discovery from observational data often achieve competitive performance despite seemingly avoiding… (see more) explicit assumptions that traditional methods make for identifiability. In this work, we investigate CSIvA \citep{ke2023learning}, a transformer-based model promising to train on synthetic data and transfer to real data. First, we bridge the gap with existing identifiability theory and show that constraints on the training data distribution implicitly define a prior on the test observations. Consistent with classical approaches, good performance is achieved when we have a good prior on the test data, and the underlying model is identifiable. At the same time, we find new trade-offs. Training on datasets generated from different classes of causal models, unambiguously identifiable in isolation, improves the test generalization. Performance is still guaranteed, as the ambiguous cases resulting from the mixture of identifiable causal models are unlikely to occur (which we formally prove). Overall, our study finds that amortized causal discovery still needs to obey identifiability theory, but it also differs from classical methods in how the assumptions are formulated, trading more reliance on assumptions on the noise type for fewer hypotheses on the mechanisms.
Equivariant Flow Matching for Molecular Conformer Generation
Majdi Hassan
Nikhil Shenoy
Jungyoon Lee
Hannes Stärk
Stephan Thaler
Geometry-Aware Generative Autoencoders for Metric Learning and Generative Modeling on Data Manifolds
Xingzhi Sun
Danqi Liao
Kincaid MacDonald
Yanlei Zhang
Guillaume Huguet
Ian Adelstein
Tim G. J. Rudner
Smita Krishnaswamy
Non-linear dimensionality reduction methods have proven successful at learning low-dimensional representations of high-dimensional point clo… (see more)uds on or near data manifolds. However, existing methods are not easily extensible—that is, for large datasets, it is prohibitively expensive to add new points to these embeddings. As a result, it is very difficult to use existing embeddings generatively, to sample new points on and along these manifolds. In this paper, we propose GAGA (geometry-aware generative autoencoders) a framework which merges the power of generative deep learning with non-linear manifold learning by: 1) learning generalizable geometry-aware neural network embeddings based on non-linear dimensionality reduction methods like PHATE and diffusion maps, 2) deriving a non-euclidean pullback metric on the embedded space to generate points faithfully along manifold geodesics, and 3) learning a flow on the manifold that allows us to transport populations. We provide illustration on easily-interpretable synthetic datasets and showcase results on simulated and real single cell datasets. In particular, we show that the geodesic-based generation can be especially important for scientific datasets where the manifold represents a state space and geodesics can represent dynamics of entities over this space.
Is a Good Description Worth a Thousand Pictures? Reducing Multimodal Alignment to Text-Based, Unimodal Alignment
Amin Memarian
Touraj Laleh
Ardavan S. Nobandegani
Generative AI systems (ChatGPT, Llama, etc.) are increasingly adopted across a range of high-stake domains, including healthcare and crimina… (see more)l justice system. This rapid adoption indeed raises moral and ethical concerns. The emerging field of AI alignment aims to make AI systems that respect human values. In this work, we focus on evaluating the ethics of multimodal AI systems involving both text and images --- a relatively under-explored area, as most alignment work is currently focused on language models. Specifically, here we investigate whether the multimodal alignment problem (i.e., the problem of aligning a multimodal system) could be effectively reduced to the (text-based) unimodal alignment problem, wherein a language model would make a moral judgment purely based on a description of an image. Focusing on GPT-4 and LLaVA as two prominent examples of multimodal systems, here we demonstrate, rather surprisingly, that this reduction can be achieved with a relatively small loss in moral judgment performance in the case of LLaVa, and virtually no loss in the case of GPT-4.
IDs for AI Systems
Alan Chan
Noam Kolt
Peter Wills
Usman Anwar
Christian Schroeder de Witt
Nitarshan Rajkumar
Lewis Hammond
Lennart Heim
Markus Anderljung
Improving Molecular Modeling with Geometric GNNs: an Empirical Study
Ali Ramlaoui
Théo Saulus
Basile Terver
Victor Schmidt
Fragkiskos D. Malliaros
Alexandre AGM Duval
iWISDM: Assessing instruction following in multimodal models at scale
Xiaoxuan Lei
Lucas Gomez
Hao Yuan Bai
The ability to perform complex tasks from detailed instructions is a key to the remarkable achievements of our species. As humans, we are no… (see more)t only capable of performing a wide variety of tasks but also very complex ones that may entail hundreds or thousands of steps to complete. Large language models and their more recent multimodal counterparts that integrate textual and visual inputs have achieved unprecedented success in performing complex tasks. Yet, most existing benchmarks are largely confined to single-modality inputs — either text or vision — and thus, narrowing the scope of multimodal integration assessments, particularly for instruction-following in multimodal contexts. To bridge this gap, we introduce the instructed-Virtual VISual Decision Making (iWISDM) environment engineered to generate a limitless array of vision-language tasks of varying complexity. Using iWISDM, we compiled three distinct benchmarks of instruction following visual tasks across varying complexity levels and evaluated several newly developed multimodal models on these benchmarks. Our findings establish iWISDM as a robust benchmark for assessing the instructional adherence of both existing and emergent multimodal models and highlight a large gap in these models’ ability to precisely follow instructions.
Learning Generative Population Models From Multiple Clinical Datasets Via Probabilistic Programming
João Loula
Katherine M. Collins
Ulrich Schaechtle
Joshua B. Tenenbaum
Adrian Weller
Feras Saad
Vikash Mansinghka
Accurate, efficient generative models of clinical populations could accelerate clinical research and improve patient outcomes. For example, … (see more)such models could infer probable treatment outcomes for different subpopulations, generate high-fidelity synthetic data that can be shared across organizational boundaries, and discover new relationships among clinical variables. Using Bayesian structure learning, we show that it is possible to learn probabilistic program models of clinical populations by combining data from multiple, sparsely overlapping clinical datasets. Through experiments with multiple clinical trials and real-world evidence from census health surveys, we show that our model generates higher quality synthetic data than neural network baselines, supports more accurate inferences across datasets than traditional statistical methods, and can be queried more efficiently than both, opening up new avenues for accessible and efficient AI assistance in clinical research.
Lost in Translation: The Algorithmic Gap Between LMs and the Brain
Tosato Tommaso
Tikeng Notsawo Pascal Junior
Helbling Saskia
Language Models (LMs) have achieved impressive performance on various linguistic tasks, but their relationship to human language processing … (see more)in the brain remains unclear. This paper examines the gaps and overlaps between LMs and the brain at different levels of analysis, emphasizing the importance of looking beyond input-output behavior to examine and compare the internal processes of these systems. We discuss how insights from neuroscience, such as sparsity, modularity, internal states, and interactive learning, can inform the development of more biologically plausible language models. Furthermore, we explore the role of scaling laws in bridging the gap between LMs and human cognition, highlighting the need for efficiency constraints analogous to those in biological systems. By developing LMs that more closely mimic brain function, we aim to advance both artificial intelligence and our understanding of human cognition.
Meta Flow Matching: Integrating Vector Fields on the Wasserstein Manifold
Lazar Atanackovic
Xi Zhang
Brandon Amos
Leo J Lee
Alexander Tong
Kirill Neklyudov
Numerous biological and physical processes can be modeled as systems of interacting entities evolving continuously over time, e.g. the dynam… (see more)ics of communicating cells or physical particles. Learning the dynamics of such systems is essential for predicting the temporal evolution of populations across novel samples and unseen environments. Flow-based models allow for learning these dynamics at the population level - they model the evolution of the entire distribution of samples. However, current flow-based models are limited to a single initial population and a set of predefined conditions which describe different dynamics. We argue that multiple processes in natural sciences have to be represented as vector fields on the Wasserstein manifold of probability densities. That is, the change of the population at any moment in time depends on the population itself due to the interactions between samples. In particular, this is crucial for personalized medicine where the development of diseases and their respective treatment response depends on the microenvironment of cells specific to each patient. We propose Meta Flow Matching (MFM), a practical approach to integrating along these vector fields on the Wasserstein manifold by amortizing the flow model over the initial populations. Namely, we embed the population of samples using a Graph Neural Network (GNN) and use these embeddings to train a Flow Matching model. This gives MFM the ability to generalize over the initial distributions unlike previously proposed methods. We demonstrate the ability of MFM to improve prediction of individual treatment responses on a large scale multi-patient single-cell drug screen dataset.