Publications

Training Dynamics Underlying Language Model Scaling Laws: Loss Deceleration and Zero-Sum Learning

Andrei Mircea

Supriyo Chakraborty

Nima Chitsazan

Ekaterina Lobacheva

This work aims to understand how scaling improves language models, specifically in terms of training dynamics. We find that language models undergo loss deceleration early in training: an abrupt slowdown in the rate of loss improvement, resulting in piecewise linear behaviour of the loss curve in log-log space. Scaling up the model mitigates this transition by (1) decreasing the loss at which deceleration occurs, and (2) improving the log-log rate of loss improvement after deceleration. We attribute loss deceleration to a type of degenerate training dynamics we term zero-sum learning (ZSL). In ZSL, per-example gradients become systematically opposed, leading to destructive interference in per-example changes in loss. As a result, improving loss on one subset of examples degrades it on another, bottlenecking overall progress. Loss deceleration and ZSL provide new insights into the training dynamics underlying language model scaling laws, and could potentially be targeted directly to improve language models independent of scale. We make our code and artefacts available at: https://github.com/mirandrom/zsl

2025-06-01

arXiv (published)

doi.org

arxiv.org

Unpacking Softmax: How Temperature Drives Representation Collapse, Compression, and Generalization

Wojciech Masarczyk

Mateusz Ostaszewski

Tin Sum Cheng

Tomasz Trzciński

Aurélien Lucchi

Razvan Pascanu

The softmax function is a fundamental building block of deep neural networks, commonly used to define output distributions in classification tasks or attention weights in transformer architectures. Despite its widespread use and proven effectiveness, its influence on learning dynamics and learned representations remains poorly understood, limiting our ability to optimize model behavior. In this paper, we study the pivotal role of the softmax function in shaping the model's representation. We introduce the concept of rank deficit bias, a phenomenon in which softmax-based deep networks find solutions of rank much lower than the number of classes. This bias depends on the softmax function's logits norm, which is implicitly influenced by hyperparameters or directly modified by softmax temperature. Furthermore, we demonstrate how to exploit the softmax dynamics to learn compressed representations or to enhance their performance on out-of-distribution data. We validate our findings across diverse architectures and real-world datasets, highlighting the broad applicability of temperature tuning in improving model performance. Our work provides new insights into the mechanisms of softmax, enabling better control over representation learning in deep neural networks.

2025-06-01

arXiv (published)

doi.org

arxiv.org

Veracity: An Open-Source AI Fact-Checking System

Taylor Lynn Curtis

Maximilian Puelma Touzel

William Garneau

Manon Gruaz

Mike Pinder

Li Wei Wang

Sukanya Krishna

Luda Cohen

Jean-François Godbout

Reihaneh Rabbany

Kellin Pelrine

The proliferation of misinformation poses a significant threat to society, exacerbated by the capabilities of generative AI. This demo paper introduces Veracity, an open-source AI system designed to empower individuals to combat misinformation through transparent and accessible fact-checking. Veracity leverages the synergy between Large Language Models (LLMs) and web retrieval agents to analyze user-submitted claims and provide grounded veracity assessments with intuitive explanations. Key features include multilingual support, numerical scoring of claim veracity, and an interactive interface inspired by familiar messaging applications. This paper will showcase Veracity's ability to not only detect misinformation but also explain its reasoning, fostering media literacy and promoting a more informed society.

2025-06-01

arXiv (published)

doi.org

arxiv.org

Weak Supervision for Real World Graphs

Pratheeksha Nair

Reihaneh Rabbany

2025-06-01

arXiv (published)

doi.org

arxiv.org

Graph Representation Learning for the Prediction of Medication Usage in the UK Biobank Based on Pharmacogenetic Variants

Bill Qi

Yannis Trakadis

2025-05-31

Bioengineering (published)

doi.org

Continual Learning in Vision-Language Models via Aligned Model Merging

Ghada Sokar

Gintare Karolina Dziugaite

Anurag Arnab

Ahmet Iscen

Pablo Samuel Castro

Cordelia Schmid

Continual learning is conventionally tackled through sequential fine-tuning, a process that, while enabling adaptation, inherently favors plasticity over the stability needed to retain prior knowledge. While existing approaches attempt to mitigate catastrophic forgetting, a bias towards recent tasks persists as they build upon this sequential nature. In this work we present a new perspective based on model merging to maintain stability while still retaining plasticity. Rather than just sequentially updating the model weights, we propose merging newly trained task parameters with previously learned ones, promoting a better balance. To maximize the effectiveness of the merging process, we propose a simple mechanism that promotes learning aligned weights with previous ones, thereby avoiding interference when merging. We evaluate this approach on large Vision-Language Models (VLMs), and demonstrate its effectiveness in reducing forgetting, increasing robustness to various task orders and similarities, and improving generalization.

2025-05-30

arXiv (preprint)

arxiv.org

Ctrl-Crash: Controllable Diffusion for Realistic Car Crashes

Anthony Gosselin

Ge Ya Luo

Luis Lara

Florian Golemo

Derek Nowrouzezahrai

Liam Paull

Alexia Jolicoeur-Martineau

Chris Pal

Video diffusion techniques have advanced significantly in recent years; however, they struggle to generate realistic imagery of car crashes due to the scarcity of accident events in most driving datasets. Improving traffic safety requires realistic and controllable accident simulations. To tackle the problem, we propose Ctrl-Crash, a controllable car crash video generation model that conditions on signals such as bounding boxes, crash types, and an initial image frame. Our approach enables counterfactual scenario generation where minor variations in input can lead to dramatically different crash outcomes. To support fine-grained control at inference time, we leverage classifier-free guidance with independently tunable scales for each conditioning signal. Ctrl-Crash achieves state-of-the-art performance across quantitative video quality metrics (e.g., FVD and JEDi) and qualitative measurements based on a human evaluation of physical realism and video quality compared to prior diffusion-based methods.

2025-05-30

arXiv (preprint)

arxiv.org

Long Range Navigator (LRN): Extending robot planning horizons beyond metric maps

Matt Schmittle

Rohan Baijal

Nathan Hatch

Rosario Scalise

Mateo Guaman Castro

Sidharth Talia

Khimya Khetarpal

Byron Boots

Siddhartha Srinivasa

2025-05-30

roboticsfoundation.org/RSS/2025/Workshop/ROAR (oral presentation)

doi.org

openreview.net

Gravitational-Wave Parameter Estimation in non-Gaussian noise using Score-Based Likelihood Characterization

Ronan Legin

Maximiliano Isi

Kaze W. K. Wong

Yashar Hezaveh

Laurence Perreault-Levasseur

2025-05-29

The Astrophysical Journal Letters (published)

doi.org

arxiv.org

Nuclear Patterning of Developing Cells in Murine Ventricular Heart Walls

Tabish A Syed

Sophie Ellwood

Drisya Dileep

S. Subha

Minhajuddin Sirajuddin

Kaleem Siddiqi

2025-05-29

Lecture Notes in Computer Science (published)

doi.org

Calibrated Value-Aware Model Learning with Stochastic Environment Models

Claas Voelcker

Anastasiia Pedan

Arash Ahmadian

Romina Abachi

Igor Gilitschenski

Amir-massoud Farahmand

The idea of value-aware model learning, that models should produce accurate value estimates, has gained prominence in model-based reinforcement learning. The MuZero loss, which penalizes a model's value function prediction compared to the ground-truth value function, has been utilized in several prominent empirical works in the literature. However, theoretical investigation into its strengths and weaknesses is limited. In this paper, we analyze the family of value-aware model learning losses, which includes the popular MuZero loss. We show that these losses, as normally used, are uncalibrated surrogate losses, which means that they do not always recover the correct model and value function. Building on this insight, we propose corrections to solve this issue. Furthermore, we investigate the interplay between the loss calibration, latent model architectures, and auxiliary losses that are commonly employed when training MuZero-style agents. We show that while deterministic models can be sufficient to predict accurate values, learning calibrated stochastic models is still advantageous.

2025-05-28

arXiv (preprint)

doi.org

arxiv.org

Calibrated Value-Aware Model Learning with Probabilistic Environment Models

Claas Voelcker

Anastasiia Pedan

Arash Ahmadian

Romina Abachi

Igor Gilitschenski

Amir-massoud Farahmand

The idea of value-aware model learning, that models should produce accurate value estimates, has gained prominence in model-based reinforcement learning. The MuZero loss, which penalizes a model's value function prediction compared to the ground-truth value function, has been utilized in several prominent empirical works in the literature. However, theoretical investigation into its strengths and weaknesses is limited. In this paper, we analyze the family of value-aware model learning losses, which includes the popular MuZero loss. We show that these losses, as normally used, are uncalibrated surrogate losses, which means that they do not always recover the correct model and value function. Building on this insight, we propose corrections to solve this issue. Furthermore, we investigate the interplay between the loss calibration, latent model architectures, and auxiliary losses that are commonly employed when training MuZero-style agents. We show that while deterministic models can be sufficient to predict accurate values, learning calibrated stochastic models is still advantageous.

2025-05-28

arXiv (preprint)

arxiv.org
