Publications

Reducing Two-Way Ranging Variance by Signal-Timing Optimization
Mohammed Ayman Shalaby
Charles Champagne Cossette
Jerome Le Ny
Time-of-flight-based ranging among transceivers with different clocks requires protocols that accommodate varying clock rates. Double-sided two-way ranging (DS-TWR) is widely adopted as a standard protocol due to its accuracy; however, the precision of DS-TWR has not been clearly addressed. In this paper, an analytical model of the variance of DS-TWR is derived as a function of the user-programmed response delays, which is then compared to the Cramér-Rao Lower Bound (CRLB). This model is then used to formulate an optimization problem over the response delays in order to maximize the information gained from range measurements. The derived analytical variance model and optimized protocol are validated experimentally with two ranging UWB transceivers, collecting 29 million range measurements.
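As context for the abstract above, the standard asymmetric DS-TWR time-of-flight estimator can be sketched as follows. The variable names (Ra, Rb, Da, Db) are illustrative and may differ from the paper's notation; the formula is the commonly used one, not necessarily the paper's optimized variant.

```python
# Sketch of the standard double-sided two-way ranging (DS-TWR)
# time-of-flight estimator. Ra/Rb are the round-trip times measured at
# each node; Da/Db are the user-programmed response delays that the
# paper optimizes over. Names are illustrative, not the paper's notation.
C = 299_792_458.0  # speed of light, m/s

def ds_twr_range(Ra, Rb, Da, Db):
    """Asymmetric DS-TWR time-of-flight estimate, returned as a distance in m."""
    tof = (Ra * Rb - Da * Db) / (Ra + Rb + Da + Db)
    return tof * C

# Example: a true range of 10 m with ~1 ms response delays.
tof = 10.0 / C
Da, Db = 1e-3, 1.2e-3
Ra = 2 * tof + Db   # round trip observed at node A
Rb = 2 * tof + Da   # round trip observed at node B
print(round(ds_twr_range(Ra, Rb, Da, Db), 3))  # 10.0
```

Note that with ideal (noise-free) timestamps the estimator recovers the true range exactly regardless of the delays; the paper's contribution concerns how the delays shape the variance once clock noise enters.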
A responsible framework for applying artificial intelligence on medical images and signals at the point-of-care: the PACS-AI platform.
Pascal Thériault-Lauzier
Denis Cobin
Olivier Tastet
Élodie Labrecque Langlais
B. Taji
Guson Kang
A. Chong
Derek So
An Tang
Judy Wawira Gichoya
A. Chandar
Pierre-Luc Deziel
Julie G Hussin
Samuel Kadoury
Robert Avram
Revisiting the 2023 wildfire season in Canada
Flavie Pelletier
Jeffrey A. Cardille
Michael A. Wulder
Joanne C. White
Txomin Hermosilla
State Soup: In-Context Skill Learning, Retrieval and Mixing
Maciej Pióro
Maciej Wolczyk
Johannes Von Oswald
João Sacramento
A new breed of gated-linear recurrent neural networks has reached state-of-the-art performance on a range of sequence modeling problems. Such models naturally handle long sequences efficiently, as the cost of processing a new input is independent of sequence length. Here, we explore another advantage of these stateful sequence models, inspired by the success of model merging through parameter interpolation. Building on parallels between fine-tuning and in-context learning, we investigate whether we can treat internal states as task vectors that can be stored, retrieved, and then linearly combined, exploiting the linearity of recurrence. We study this form of fast model merging on Mamba-2.8b, a pretrained recurrent model, and present preliminary evidence that simple linear state interpolation methods suffice to improve next-token perplexity as well as downstream in-context learning task performance.
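The state-mixing idea in the abstract can be sketched minimally: cache the recurrent state after processing each task's examples, then form a convex combination before continuing. The state shape and mixing rule below are placeholder assumptions, not Mamba-2.8b's actual state layout or API.

```python
import numpy as np

# Hypothetical sketch of "state soup": treat cached recurrent states as
# task vectors and linearly combine them before continuing generation.
# Shapes and values are placeholders, not Mamba's real internals.
rng = np.random.default_rng(0)

state_task_a = rng.normal(size=(16,))  # state after task-A examples
state_task_b = rng.normal(size=(16,))  # state after task-B examples

def mix_states(states, weights):
    """Convex combination of stored states (exploits the linear recurrence)."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    return sum(w * s for w, s in zip(weights, states))

soup = mix_states([state_task_a, state_task_b], [0.5, 0.5])
print(soup.shape)  # (16,)
```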
Stimulus information guides the emergence of behavior-related signals in primary somatosensory cortex during learning
Mariangela Panniello
Colleen J. Gillon
Roberto Maffulli
Marco Celotto
Blake A. Richards
Stefano Panzeri
Michael M. Kohl
Neurons in the primary cortex carry sensory- and behavior-related information, but it remains an open question how this information emerges and intersects during learning. Current evidence points to two possible learning-related changes: sensory information increases in the primary cortex, or sensory information remains stable but its readout efficiency in association cortices increases. We investigated this question by imaging neuronal activity in mouse primary somatosensory cortex before, during, and after learning of an object localization task. We quantified sensory- and behavior-related information and estimated how much sensory information was used to instruct perceptual choices as learning progressed. We find that sensory information increases from the start of training, while choice information is mostly present in the later stages of learning. Additionally, the readout of sensory information becomes more efficient with learning as early as in the primary sensory cortex. Together, our results highlight the importance of primary cortical neurons in perceptual learning.
Transformers meet Neural Algorithmic Reasoners
Wilfried Bounsi
Borja Ibarz
Andrew Joseph Dudzik
Jessica B. Hamrick
Larisa Markeeva
Alex Vitvitskyi
Transformers have revolutionized machine learning with their simple yet effective architecture. Pre-training Transformers on massive text datasets from the Internet has led to unmatched generalization for natural language understanding (NLU) tasks. However, such language models remain fragile when tasked with algorithmic forms of reasoning, where computations must be precise and robust. To address this limitation, we propose a novel approach that combines the Transformer's language understanding with the robustness of graph neural network (GNN)-based neural algorithmic reasoners (NARs). Such NARs proved effective as generic solvers for algorithmic tasks, when specified in graph form. To make their embeddings accessible to a Transformer, we propose a hybrid architecture with a two-phase training procedure, allowing the tokens in the language model to cross-attend to the node embeddings from the NAR. We evaluate our resulting TransNAR model on CLRS-Text, the text-based version of the CLRS-30 benchmark, and demonstrate significant gains over Transformer-only models for algorithmic reasoning, both in and out of distribution.
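The cross-attention mechanism described above, where token embeddings query NAR node embeddings, can be sketched as a single unmasked attention head. This is a generic illustration of token-to-node cross-attention, not TransNAR's actual implementation; all dimensions and weight matrices here are made up.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(tokens, nodes, Wq, Wk, Wv):
    """Language-model tokens (queries) attend to GNN node embeddings
    (keys/values). Single head, no masking; purely illustrative."""
    Q, K, V = tokens @ Wq, nodes @ Wk, nodes @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    return softmax(scores) @ V  # one attended vector per token

rng = np.random.default_rng(0)
d = 8
tokens = rng.normal(size=(5, d))   # hypothetical token embeddings
nodes = rng.normal(size=(4, d))    # hypothetical NAR node embeddings
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = cross_attention(tokens, nodes, Wq, Wk, Wv)
print(out.shape)  # (5, 8)
```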
Transformers need glasses! Information over-squashing in language tasks
Federico Barbero
Andrea Banino
Steven Kapturowski
Dharshan Kumaran
João Guilherme Madeira Araújo
Alex Vitvitskyi
We study how information propagates in decoder-only Transformers, which are the architectural backbone of most existing frontier large language models (LLMs). We rely on a theoretical signal propagation analysis -- specifically, we analyse the representations of the last token in the final layer of the Transformer, as this is the representation used for next-token prediction. Our analysis reveals a representational collapse phenomenon: we prove that certain distinct sequences of inputs to the Transformer can yield arbitrarily close representations in the final token. This effect is exacerbated by the low-precision floating-point formats frequently used in modern LLMs. As a result, the model is provably unable to respond to these sequences in different ways -- leading to errors in, e.g., tasks involving counting or copying. Further, we show that decoder-only Transformer language models can lose sensitivity to specific tokens in the input, which relates to the well-known phenomenon of over-squashing in graph neural networks. We provide empirical evidence supporting our claims on contemporary LLMs. Our theory also points to simple solutions towards ameliorating these issues.
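The role of low precision in the collapse described above can be illustrated with a toy example: two representations that are distinct in float32 become bit-identical once cast to float16, so any downstream computation is provably unable to distinguish them. This is a generic numerical demonstration, not an analysis from the paper.

```python
import numpy as np

# Two "representations" that differ by less than the float16 spacing
# near 1.0 (about 9.8e-4) collapse to the same value after casting.
rep_a = np.float32(1.0)
rep_b = np.float32(1.0 + 1e-4)   # distinct from rep_a in float32

print(rep_a == rep_b)                          # False: distinguishable
print(np.float16(rep_a) == np.float16(rep_b))  # True: collapsed
```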
Climate Variable Downscaling with Conditional Normalizing Flows
Predictions of global climate models typically operate on coarse spatial scales due to the large computational costs of climate simulations. This has led to a considerable interest in methods for statistical downscaling, a similar process to super-resolution in the computer vision context, to provide more local and regional climate information. In this work, we apply conditional normalizing flows to the task of climate variable downscaling. We showcase its successful performance on an ERA5 water content dataset for different upsampling factors. Additionally, we show that the method allows us to assess the predictive uncertainty in terms of standard deviation from the fitted conditional distribution mean.
How well do models of visual cortex generalize to out of distribution samples?
Yifei Ren
On shallow planning under partial observability
On the Costs and Benefits of Adopting Lifelong Learning for Software Analytics -- Empirical Study on Brown Build and Risk Prediction
Doriane Olewicki
Sarra Habchi
Mathieu Nayrolles
A. Chandar
Bram Adams
Nowadays, software analytics tools that use machine learning (ML) models to, for example, predict the risk of a code change are well established. However, as the goals of a project shift over time, and developers and their habits change, the performance of said models tends to degrade (drift) over time. Current retraining practices typically require retraining a new model from scratch on a large updated dataset when performance decay is observed, thus incurring a computational cost; moreover, there is no continuity between the models, as the past model is discarded and ignored during the new model's training. Even though the literature has taken an interest in online learning approaches, those have rarely been integrated and evaluated in industrial environments. This paper evaluates the use of lifelong learning (LL) for industrial use cases at Ubisoft, evaluating both the performance and the required computational effort in comparison to the retraining-from-scratch approaches commonly used by the industry. LL is used to continuously build and maintain ML-based software analytics tools using an incremental learner that progressively updates the old model using new data. To avoid so-called "catastrophic forgetting" of important older data points, we adopt a replay buffer of older data, which still allows us to drastically reduce the size of the overall training dataset, and hence the model training time.
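The replay-buffer idea mentioned above can be sketched as follows: keep a bounded, uniform sample of past examples (here via reservoir sampling, one common choice, not necessarily the paper's) and train each incremental update on the new batch plus a small replayed sample.

```python
import random

# Hypothetical sketch of a replay buffer for incremental (lifelong)
# learning: a bounded uniform sample of the data stream, mixed with
# each new batch at update time. Not the paper's actual implementation.
class ReplayBuffer:
    def __init__(self, capacity=1000, seed=0):
        self.capacity = capacity
        self.buffer = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, item):
        """Reservoir sampling: keeps a uniform sample of everything seen."""
        self.seen += 1
        if len(self.buffer) < self.capacity:
            self.buffer.append(item)
        else:
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.buffer[j] = item

    def sample(self, k):
        return self.rng.sample(self.buffer, min(k, len(self.buffer)))

buf = ReplayBuffer(capacity=100)
for example in range(10_000):      # stream of historical examples
    buf.add(example)

new_batch = list(range(10_000, 10_032))           # 32 fresh examples
training_set = new_batch + buf.sample(64)         # plus 64 replayed ones
print(len(training_set))  # 96
```

The point of the design is that each update touches at most `len(new_batch) + 64` examples rather than the full historical dataset, which is where the training-time savings come from.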
Deep Grokking: Would Deep Neural Networks Generalize Better?
Simin Fan
Martin Jaggi
Recent research on the grokking phenomenon has illuminated the intricacies of neural networks' training dynamics and their generalization behaviors. Grokking refers to a sharp rise of the network's generalization accuracy on the test set, which occurs long after an extended overfitting phase during which the network perfectly fits the training set. While existing research primarily focuses on shallow networks such as 2-layer MLPs and 1-layer Transformers, we explore grokking on deep networks (e.g., a 12-layer MLP). We empirically replicate the phenomenon and find that deep neural networks can be more susceptible to grokking than their shallower counterparts. Meanwhile, we observe an intriguing multi-stage generalization phenomenon when increasing the depth of the MLP model, where the test accuracy exhibits a secondary surge that is scarcely seen in shallow models. We further uncover compelling correspondences between the decrease of feature ranks and the phase transition from overfitting to generalization during grokking. Additionally, we find that the multi-stage generalization phenomenon often aligns with a double-descent pattern in feature ranks. These observations suggest that internal feature rank could serve as a more promising indicator of the model's generalization behavior than the weight norm. We believe our work is the first to investigate grokking in deep neural networks and to examine the relationship between feature rank and generalization performance.
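A feature-rank measure of the kind discussed above can be sketched with the "effective rank" (the exponential of the entropy of normalized singular values) of a layer's activation matrix. This is one common definition chosen for illustration; the paper may use a different variant.

```python
import numpy as np

# Illustrative feature-rank measure: effective rank of an activation
# matrix, i.e. exp of the entropy of its normalized singular values.
# Low-rank features yield a small value; spread-out spectra a large one.
def effective_rank(features, eps=1e-12):
    s = np.linalg.svd(features, compute_uv=False)
    p = s / (s.sum() + eps)
    entropy = -(p * np.log(p + eps)).sum()
    return float(np.exp(entropy))

rng = np.random.default_rng(0)
full = rng.normal(size=(256, 64))                           # ~full-rank features
low = rng.normal(size=(256, 4)) @ rng.normal(size=(4, 64))  # rank-4 features

print(effective_rank(full) > effective_rank(low))  # True
```

Tracking this scalar over training is one way to operationalize the claim that feature rank is a more informative generalization indicator than the weight norm.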