Publications

A Complexity-Based Theory of Compositionality

Eric Elmoznino

Thomas Jiralerspong

Compositionality is believed to be fundamental to intelligence. In humans, it underlies the structure of thought, language, and higher-level… (see more) reasoning. In AI, compositional representations can enable a powerful form of out-of-distribution generalization, in which a model systematically adapts to novel combinations of known concepts. However, while we have strong intuitions about what compositionality is, there currently exists no formal definition for it that is measurable and mathematical. Here, we propose such a definition, which we call representational compositionality, that accounts for and extends our intuitions about compositionality. The definition is conceptually simple, quantitative, grounded in algorithmic information theory, and applicable to any representation. Intuitively, representational compositionality states that a compositional representation satisfies three properties. First, it must be expressive. Second, it must be possible to re-describe the representation as a function of discrete symbolic sequences with re-combinable parts, analogous to sentences in natural language. Third, the function that relates these symbolic sequences to the representation, analogous to semantics in natural language, must be simple. Through experiments on both synthetic and real world data, we validate our definition of compositionality and show how it unifies disparate intuitions from across the literature in both AI and cognitive science. We also show that representational compositionality, while theoretically intractable, can be readily estimated using standard deep learning tools. Our definition has the potential to inspire the design of novel, theoretically-driven models that better capture the mechanisms of compositional thought.

2024-10-18

ArXiv (preprint)

doi.org

arxiv.org

Convergence of Manifold Filter-Combine Networks

David R. Johnson

Joyce A. Chew

Siddharth Viswanath

Edward De Brouwer

Deanna Needell

Smita Krishnaswamy

Michael Perlmutter

In order to better understand manifold neural networks (MNNs), we introduce Manifold Filter-Combine Networks (MFCNs). The filter-combine fra… (see more)mework parallels the popular aggregate-combine paradigm for graph neural networks (GNNs) and naturally suggests many interesting families of MNNs which can be interpreted as the manifold analog of various popular GNNs. We then propose a method for implementing MFCNs on high-dimensional point clouds that relies on approximating the manifold by a sparse graph. We prove that our method is consistent in the sense that it converges to a continuum limit as the number of data points tends to infinity.

2024-10-18

ArXiv (preprint)

doi.org

arxiv.org

Convergence of Manifold Filter-Combine Networks

David R. Johnson

Joyce Chew

Siddharth Viswanath

Edward De Brouwer

Deanna Needell

Smita Krishnaswamy

Michael Perlmutter

In order to better understand manifold neural networks (MNNs), we introduce Manifold Filter-Combine Networks (MFCNs). The filter-combine fra… (see more)mework parallels the popular aggregate-combine paradigm for graph neural networks (GNNs) and naturally suggests many interesting families of MNNs which can be interpreted as the manifold analog of various popular GNNs. We then propose a method for implementing MFCNs on high-dimensional point clouds that relies on approximating the manifold by a sparse graph. We prove that our method is consistent in the sense that it converges to a continuum limit as the number of data points tends to infinity.

2024-10-18

ArXiv (preprint)

doi.org

arxiv.org

In-context learning and Occam's razor

Eric Elmoznino

Tom Marty

Tejas Kasetty

Leo Gagnon

Sarthak Mittal

Mahan Fathi

Dhanya Sridhar

Guillaume Lajoie

A central goal of machine learning is generalization. While the No Free Lunch Theorem states that we cannot obtain theoretical guarantees fo… (see more)r generalization without further assumptions, in practice we observe that simple models which explain the training data generalize best: a principle called Occam's razor. Despite the need for simple models, most current approaches in machine learning only minimize the training error, and at best indirectly promote simplicity through regularization or architecture design. Here, we draw a connection between Occam's razor and in-context learning: an emergent ability of certain sequence models like Transformers to learn at inference time from past observations in a sequence. In particular, we show that the next-token prediction loss used to train in-context learners is directly equivalent to a data compression technique called prequential coding, and that minimizing this loss amounts to jointly minimizing both the training error and the complexity of the model that was implicitly learned from context. Our theory and the empirical experiments we use to support it not only provide a normative account of in-context learning, but also elucidate the shortcomings of current in-context learning methods, suggesting ways in which they can be improved. We make our code available at https://github.com/3rdCore/PrequentialCode.

2024-10-17

ArXiv (preprint)

doi.org

openreview.net

MixEHR-Nest: Identifying Subphenotypes within Electronic Health Records through Hierarchical Guided-Topic Modeling

Ruohan Wang

Zilong Wang

Ziyang Song

David Buckeridge

Yue Li

Automatic subphenotyping from electronic health records (EHRs)provides numerous opportunities to understand diseases with unique subgroups a… (see more)nd enhance personalized medicine for patients. However, existing machine learning algorithms either focus on specific diseases for better interpretability or produce coarse-grained phenotype topics without considering nuanced disease patterns. In this study, we propose a guided topic model, MixEHR-Nest, to infer sub-phenotype topics from thousands of disease using multi-modal EHR data. Specifically, MixEHR-Nest detects multiple subtopics from each phenotype topic, whose prior is guided by the expert-curated phenotype concepts such as Phenotype Codes (PheCodes) or Clinical Classification Software (CCS) codes. We evaluated MixEHR-Nest on two EHR datasets: (1) the MIMIC-III dataset consisting of over 38 thousand patients from intensive care unit (ICU) from Beth Israel Deaconess Medical Center (BIDMC) in Boston, USA; (2) the healthcare administrative database PopHR, comprising 1.3 million patients from Montreal, Canada. Experimental results demonstrate that MixEHR-Nest can identify subphenotypes with distinct patterns within each phenotype, which are predictive for disease progression and severity. Consequently, MixEHR-Nest distinguishes between type 1 and type 2 diabetes by inferring subphenotypes using CCS codes, which do not differentiate these two subtype concepts. Additionally, MixEHR-Nest not only improved the prediction accuracy of short-term mortality of ICU patients and initial insulin treatment in diabetic patients but also revealed the contributions of subphenotypes. For longitudinal analysis, MixEHR-Nest identified subphenotypes of distinct age prevalence under the same phenotypes, such as asthma, leukemia, epilepsy, and depression. The MixEHR-Nest software is available at GitHub: https://github.com/li-lab-mcgill/MixEHR-Nest.

2024-10-17

ArXiv (preprint)

doi.org

arxiv.org

A Simulation System Towards Solving Societal-Scale Manipulation

Maximilian Puelma Touzel

Sneheel Sarangi

Austin Welch

Gayatri Krishnakumar

Dan Zhao

Zachary Yang

Hao Yu

Ethan Kosak-Hine

Tom Gibbs

Andreea Musulan

Camille Thibault

Busra Tugce Gurbuz

Reihaneh Rabbany

Jean-François Godbout

Kellin Pelrine

The rise of AI-driven manipulation poses significant risks to societal trust and democratic processes. Yet, studying these effects in real-w… (see more)orld settings at scale is ethically and logistically impractical, highlighting a need for simulation tools that can model these dynamics in controlled settings to enable experimentation with possible defenses. We present a simulation environment designed to address this. We elaborate upon the Concordia framework that simulates offline, `real life' activity by adding online interactions to the simulation through social media with the integration of a Mastodon server. We improve simulation efficiency and information flow, and add a set of measurement tools, particularly longitudinal surveys. We demonstrate the simulator with a tailored example in which we track agents' political positions and show how partisan manipulation of agents can affect election results.

2024-10-17

ArXiv (preprint)

doi.org

arxiv.org

A Simulation System Towards Solving Societal-Scale Manipulation

Maximilian Puelma Touzel

Sneheel Sarangi

Austin Welch

Gayatri K

Dan Zhao

Zachary Yang

Hao Yu

Ethan Kosak-Hine

Tom Gibbs

Andreea Musulan

Camille Thibault

Busra Tugce Gurbuz

Reihaneh Rabbany

Jean-François Godbout

Kellin Pelrine

The rise of AI-driven manipulation poses significant risks to societal trust and democratic processes. Yet, studying these effects in real-w… (see more)orld settings at scale is ethically and logistically impractical, highlighting a need for simulation tools that can model these dynamics in controlled settings to enable experimentation with possible defenses. We present a simulation environment designed to address this. We elaborate upon the Concordia framework that simulates offline, `real life' activity by adding online interactions to the simulation through social media with the integration of a Mastodon server. We improve simulation efficiency and information flow, and add a set of measurement tools, particularly longitudinal surveys. We demonstrate the simulator with a tailored example in which we track agents' political positions and show how partisan manipulation of agents can affect election results.

2024-10-17

ArXiv (preprint)

doi.org

arxiv.org

Training Compute-Optimal Vision Transformers for Brain Encoding

Sana Ahmadi

Fraçois Paugam

Tristan Glatard

Lune Bellec

The optimal training of a vision transformer for brain encoding depends on three factors: model size, data size, and computational resources… (see more). This study investigates these three pillars, focusing on the effects of data scaling, model scaling, and high-performance computing on brain encoding results. Using VideoGPT to extract efficient spatiotemporal features from videos and training a Ridge model to predict brain activity based on these features, we conducted benchmark experiments with varying data sizes (10k, 100k, 1M, 6M) and different model configurations of GPT-2, including hidden layer dimensions, number of layers, and number of attention heads. We also evaluated the effects of training models with 32-bit vs 16-bit floating point representations. Our results demonstrate that increasing the hidden layer dimensions significantly improves brain encoding performance, as evidenced by higher Pearson correlation coefficients across all subjects. In contrast, the number of attention heads does not have a significant effect on the encoding results. Additionally, increasing the number of layers shows some improvement in brain encoding correlations, but the trend is not as consistent as that observed with hidden layer dimensions. The data scaling results show that larger training datasets lead to improved brain encoding performance, with the highest Pearson correlation coefficients observed for the largest dataset size (6M). These findings highlight that the effects of data scaling are more significant compared to model scaling in enhancing brain encoding performance. Furthermore, we explored the impact of floating-point precision by comparing 32-bit and 16-bit representations. Training with 16-bit precision yielded the same brain encoding accuracy as 32-bit, while reducing training time by 1.17 times, demonstrating its efficiency for high-performance computing tasks.

2024-10-17

ArXiv (preprint)

arxiv.org

Training Compute-Optimal Vision Transformers for Brain Encoding

Sana Ahmadi

Fraçois Paugam

Tristan Glatard

Lune Bellec

The optimal training of a vision transformer for brain encoding depends on three factors: model size, data size, and computational resources… (see more). This study investigates these three pillars, focusing on the effects of data scaling, model scaling, and high-performance computing on brain encoding results. Using VideoGPT to extract efficient spatiotemporal features from videos and training a Ridge model to predict brain activity based on these features, we conducted benchmark experiments with varying data sizes (10k, 100k, 1M, 6M) and different model configurations of GPT-2, including hidden layer dimensions, number of layers, and number of attention heads. We also evaluated the effects of training models with 32-bit vs 16-bit floating point representations. Our results demonstrate that increasing the hidden layer dimensions significantly improves brain encoding performance, as evidenced by higher Pearson correlation coefficients across all subjects. In contrast, the number of attention heads does not have a significant effect on the encoding results. Additionally, increasing the number of layers shows some improvement in brain encoding correlations, but the trend is not as consistent as that observed with hidden layer dimensions. The data scaling results show that larger training datasets lead to improved brain encoding performance, with the highest Pearson correlation coefficients observed for the largest dataset size (6M). These findings highlight that the effects of data scaling are more significant compared to model scaling in enhancing brain encoding performance. Furthermore, we explored the impact of floating-point precision by comparing 32-bit and 16-bit representations. Training with 16-bit precision yielded the same brain encoding accuracy as 32-bit, while reducing training time by 1.17 times, demonstrating its efficiency for high-performance computing tasks.

2024-10-17

ArXiv (preprint)

doi.org

arxiv.org

BlabberSeg: Real-Time Embedded Open-Vocabulary Aerial Segmentation

Haechan Mark Bong

Ricardo de Azambuja

Giovanni Beltrame

Real-time aerial image segmentation plays an important role in the environmental perception of Uncrewed Aerial Vehicles (UAVs). We introduce… (see more) BlabberSeg, an optimized Vision-Language Model built on CLIPSeg for on-board, real-time processing of aerial images by UAVs. BlabberSeg improves the efficiency of CLIPSeg by reusing prompt and model features, reducing computational overhead while achieving real-time open-vocabulary aerial segmentation. We validated BlabberSeg in a safe landing scenario using the Dynamic Open-Vocabulary Enhanced SafE-Landing with Intelligence (DOVESEI) framework, which uses visual servoing and open-vocabulary segmentation. BlabberSeg reduces computational costs significantly, with a speed increase of 927.41% (16.78 Hz) on a NVIDIA Jetson Orin AGX (64GB) compared with the original CLIPSeg (1.81Hz), achieving real-time aerial segmentation with negligible loss in accuracy (2.1% as the ratio of the correctly segmented area with respect to CLIPSeg). BlabberSeg's source code is open and available online.

2024-10-16

ArXiv (preprint)

doi.org

arxiv.org

Mechanistic Unlearning: Robust Knowledge Unlearning and Editing via Mechanistic Localization

Phillip Huang Guo

Aaquib Syed

Abhay Sheshadri

Aidan Ewart

Gintare Karolina Dziugaite

2024-10-16

ArXiv (preprint)

doi.org

openreview.net

The Non-Local Model Merging Problem: Permutation Symmetries and Variance Collapse

Ekansh Sharma

Daniel M. Roy

Gintare Karolina Dziugaite

Model merging aims to efficiently combine the weights of multiple expert models, each trained on a specific task, into a single multi-task m… (see more)odel, with strong performance across all tasks. When applied to all but the last layer of weights, existing methods -- such as Task Arithmetic, TIES-merging, and TALL mask merging -- work well to combine expert models obtained by fine-tuning a common foundation model, operating within a"local"neighborhood of the foundation model. This work explores the more challenging scenario of"non-local"merging, which we find arises when an expert model changes significantly during pretraining or where the expert models do not even share a common foundation model. We observe that standard merging techniques often fail to generalize effectively in this non-local setting, even when accounting for permutation symmetries using standard techniques. We identify that this failure is, in part, due to"variance collapse", a phenomenon identified also in the setting of linear mode connectivity by Jordan et al. (2023). To address this, we propose a multi-task technique to re-scale and shift the output activations of the merged model for each task, aligning its output statistics with those of the corresponding task-specific expert models. Our experiments demonstrate that this correction significantly improves the performance of various model merging approaches in non-local settings, providing a strong baseline for future research on this problem.

2024-10-16

ArXiv (preprint)

doi.org

arxiv.org

Rising to the Occasion

Mila Techaide 2025

The Development of the UN Scientific Panel on AI

Transition in Mila's Scientific Direction

AI Insights for Policymakers

Rising to the Occasion

Mila Techaide 2025

Publications

Rising to the Occasion

Mila Techaide 2025

The Development of the UN Scientific Panel on AI

Transition in Mila's Scientific Direction

AI Insights for Policymakers

Rising to the Occasion

Mila Techaide 2025

Popular keywords:

Publications