Sarath Chandar

Biographie

Sarath Chandar est professeur associé au départment de génie informatique et génie logiciel de Polytechnique Montréal, où il dirige le laboratoire de recherche Chandar. Il est également membre académique principal à Mila – Institut québécois d’intelligence artificielle, et titulaire d'une chaire en IA Canada-CIFAR et d'une Chaire de recherche du Canada en apprentissage machine permanent.

Ses recherches portent sur l'apprentissage tout au long de la vie, l'apprentissage profond, l'optimisation, l'apprentissage par renforcement et le traitement du langage naturel. Pour promouvoir la recherche sur l'apprentissage tout au long de la vie, Sarath Chandar a créé la Conférence sur les agents d'apprentissage tout au long de la vie (CoLLAs) en 2022 et a présidé le programme en 2022 et en 2023. Il est titulaire d'un doctorat de l'Université de Montréal et d'une maîtrise en recherche de l'Indian Institute of Technology Madras.

Étudiants actuels

Ista Abbes

Maîtrise recherche - UdeM

Alex Aselstyne

Maîtrise recherche - Polytechnique

Davide Baldelli

Doctorat - Polytechnique

Co-superviseur⋅e :

Milan Bhan

Collaborateur·rice de recherche

Diego Cerda Mardini

Maîtrise recherche - McGill

Antoine Clavaud

Maîtrise recherche - Polytechnique

Naga Karthik Enamundram

Doctorat - Polytechnique

Superviseur⋅e principal⋅e :

Prashant Govindarajan

Doctorat - Polytechnique

Simon Guiroy

Doctorat - UdeM

Superviseur⋅e principal⋅e :

Doctorat - UdeM

David Heurtel--Depeiges

Doctorat - Polytechnique

Jerry Huang

Doctorat - UdeM

Saurav Jha

Postdoctorat - Polytechnique

Amir Kalantari Dehaghi

Collaborateur·rice alumni

Lola Le Breton

Doctorat - Polytechnique

Aidan Li

Maîtrise recherche - UdeM

Co-superviseur⋅e :

Postdoctorat - UdeM

Doctorat - Polytechnique

Roshan Munirathinam Sankaran Balaji

Stagiaire de recherche - Polytechnique

Rayen Nacef

Collaborateur·rice de recherche - Polytechnique

Hadi NekoeiQachkanloo

Doctorat - UdeM

Nilaksh Nilaksh

Doctorat - Polytechnique

Doctorat - UdeM

Collaborateur·rice de recherche - Polytechnique Montreal

Doctorat - UdeM

Postdoctorat

Shaipranesh Senthilkumar

Doctorat - Polytechnique

Arjun Vaithilingam Sudhakar

Nour Shaheen

Maîtrise recherche - Polytechnique

Superviseur⋅e principal⋅e :

Doctorat - Polytechnique

Megh Thakkar

Maîtrise recherche - UdeM

Doctorat - Polytechnique

Shawn Whitfield

Collaborateur·rice de recherche

Kowen Woo

Stagiaire de recherche - Polytechnique

Anabel XL

Postdoctorat - UdeM

Abdelrahman Zayed

Doctorat - Polytechnique

Xutong Zhao

Doctorat - Polytechnique

Artem Zholus

Doctorat - Polytechnique

NeoBERT: une nouvelle frontière pour les modèles de langage encodeurs open-source

Billets de blogue

A digital picture of Bert from Sesame street, wering black trench coat and sunglasses

3 mars 2025

par

Lola Le Breton

Quentin Fournier

Sarath Chandar

Lire l'article

1 octobre 2024

Comment expliquer l’IA et s’assurer que cette explication est vraie? Les modèles mesurables de fidélité vous indiquent comment y parvenir

par

Andrea Madsen

Siva Reddy

Sarath Chandar

Lire l'article

Publications

The Markovian Thinker

Milad Aghajohari

Kamran Chitsaz

Amirhossein Kazemnejad

Reinforcement learning (RL) has recently become a strong recipe for training reasoning LLMs that produce long chains of thought (LongCoT). Y… (voir plus)et the standard RL"thinking environment", where the state is the prompt plus all prior reasoning tokens, makes the state unbounded and forces attention-based policies to pay quadratic compute as thoughts lengthen. We revisit the environment itself. We propose Markovian Thinking, a paradigm in which the policy advances reasoning while conditioning on a constant-size state, decoupling thinking length from context size. As an immediate consequence this yields linear compute with constant memory. We instantiate this idea with Delethink, an RL environment that structures reasoning into fixed-size chunks. Within each chunk, the model thinks as usual; at the boundary, the environment resets the context and reinitializes the prompt with a short carryover. Through RL, the policy learns to write a textual state near the end of each chunk sufficient for seamless continuation of reasoning after reset. Trained in this environment, an R1-Distill 1.5B model reasons in 8K-token chunks yet thinks up to 24K tokens, matching or surpassing LongCoT-RL trained with a 24K budget. With test-time scaling, Delethink continues to improve where LongCoT plateaus. The effect of linear compute is substantial: we empirically estimate at 96K average thinking length LongCoT-RL costs 27 H100-months vs. 7 for Delethink. Analysis at RL initialization shows off-the-shelf reasoning models (1.5B-120B) often sample Markovian traces zero-shot across diverse benchmarks, providing positive samples that make RL effective at scale. Our results show that redesigning the thinking environment is a powerful lever: it enables very long reasoning without quadratic overhead and opens a path toward efficient, scalable reasoning LLMs.

2025-10-08

ArXiv (prépublication)

The Markovian Thinker

Milad Aghajohari

Kamran Chitsaz

Amirhossein Kazemnejad

2025-10-08

ArXiv (prépublication)

Just-in-time Episodic Feedback Hinter: Leveraging Offline Knowledge to Improve LLM Agents Adaptation

Hadi Nekoei

Aman Jaiswal

Patrice Béchard

Oleh Shliazhko

Orlando Marquez Ayala

Massimo Caccia

Alexandre Drouin

Alexandre Lacoste

Large language model (LLM) agents perform well in sequential decision-making tasks, but improving them on unfamiliar domains often requires … (voir plus)costly online interactions or fine-tuning on large expert datasets. These strategies are impractical for closed-source models and expensive for open-source ones, with risks of catastrophic forgetting. Offline trajectories offer reusable knowledge, yet demonstration-based methods struggle because raw traces are long, noisy, and tied to specific tasks. We present Just-in-time Episodic Feedback Hinter (JEF Hinter), an agentic system that distills offline traces into compact, context-aware hints. A zooming mechanism highlights decisive steps in long trajectories, capturing both strategies and pitfalls. Unlike prior methods, JEF Hinter leverages both successful and failed trajectories, extracting guidance even when only failure data is available, while supporting parallelized hint generation and benchmark-independent prompting. At inference, a retriever selects relevant hints for the current state, providing targeted guidance with transparency and traceability. Experiments on MiniWoB++, WorkArena-L1, and WebArena-Lite show that JEF Hinter consistently outperforms strong baselines, including human- and document-based hints.

2025-10-05

ArXiv (prépublication)

Just-in-time Episodic Feedback Hinter: Leveraging Offline Knowledge to Improve LLM Agents Adaptation

Hadi Nekoei

Aman Jaiswal

Patrice Béchard

Oleh Shliazhko

Orlando Marquez Ayala

Massimo Caccia

Alexandre Drouin

Alexandre Lacoste

2025-10-05

ArXiv (prépublication)

GRPO-$\lambda$: Credit Assignment improves LLM Reasoning

Prasanna Parthasarathi

Boxing Chen

Yufei Cui

Large language models (LLMs) are increasingly deployed for tasks requiring complex reasoning, prompting significant interest in improving th… (voir plus)eir reasoning abilities through post-training. Especially RL based methods using verifiable reward, like the state-of-the-art GRPO, have shown to tremendously improve reasoning behaviors when applied as post-training methods. However, the lack of an explicit reward or critic model limits GRPO's ability to assign fine-grained credit across token sequences. In this work, we present GRPO-

2025-09-30

ArXiv (prépublication)

GRPO-$\lambda$: Credit Assignment improves LLM Reasoning

Prasanna Parthasarathi

Boxing Chen

Yufei Cui

2025-09-30

ArXiv (prépublication)

GRPO-$\lambda$: Credit Assignment improves LLM Reasoning

Prasanna Parthasarathi

Boxing Chen

Yufei Cui

2025-09-30

ArXiv (prépublication)

CrystalGym: A New Benchmark for Materials Discovery Using Reinforcement Learning

Prashant Govindarajan

Antoine Clavaud

Mariano Phielipp

Santiago Miret

In silico design and optimization of new materials primarily relies on high-accuracy atomic simulators that perform density functional theor… (voir plus)y (DFT) calculations. While recent works showcase the strong potential of machine learning to accelerate the material design process, they mostly consist of generative approaches that do not use direct DFT signals as feedback to improve training and generation mainly due to DFT's high computational cost. To aid the adoption of direct DFT signals in the materials design loop through online reinforcement learning (RL), we propose CrystalGym, an open-source RL environment for crystalline material discovery. Using CrystalGym, we benchmark common value- and policy-based reinforcement learning algorithms for designing various crystals conditioned on target properties. Concretely, we optimize for challenging properties like the band gap, bulk modulus, and density, which are directly calculated from DFT in the environment. While none of the algorithms we benchmark solve all CrystalGym tasks, our extensive experiments and ablations show different sample efficiencies and ease of convergence to optimality for different algorithms and environment settings. Additionally, we include a case study on the scope of fine-tuning large language models with reinforcement learning for improving DFT-based rewards. Our goal is for CrystalGym to serve as a test bed for reinforcement learning researchers and material scientists to address these real-world design problems with practical applications. We therefore introduce a novel class of challenges for reinforcement learning methods dealing with time-consuming reward signals, paving the way for future interdisciplinary research for machine learning motivated by real-world applications.

2025-09-27

ArXiv (prépublication)

Benchmarking Machine Learning Potentials for Crystal Structure Relaxation

Kowen Woo

Prashant Govindarajan

High-throughput materials discovery workflows require rapid and accurate relaxation of crystal structures to identify thermodynamically stab… (voir plus)le phases among thousands to millions of candidate structures. Yet current machine learning interatomic potential (MLIP) benchmarks focus predominantly on energy prediction rather than structure relaxation, creating a critical evaluation gap for models designed to accelerate optimization. Additionally, these benchmarks are trained on datasets consisting mainly of known stable or near-stable materials, thus failing to capture the challenges of unexplored chemical spaces. We address these limitations by introducing a benchmark that evaluates state-of-the-art MLIPs and a one-shot relaxation model on structure relaxation with crystals generated via a reinforcement learning pipeline. We compare energy lowering and average maximum force computed via DFT, as well as relaxation runtime. We also contrast direct force-prediction strategies against conservative energy-differentiation approaches to determine which paradigm delivers superior relaxation performance. Our results indicate that there is a clear disconnect between MLIP energy prediction and force convergence in relaxation, challenging current benchmarking approaches.

2025-09-24

NeurIPS.cc/2025/Workshop/AI4Science (poster)

openreview.net

Revisiting Replay and Gradient Alignment for Continual Pre-Training of Large Language Models

Istabrak Abbes

Gopeshh Subbaraj

Matthew D Riemer

Nizar Islah

Benjamin Therien

Tsuguchika Tabaru

Hiroaki Kingetsu

Irina Rish

2025-09-22

NeurIPS.cc/2025/Workshop/WiML (publié)

doi.org

openreview.net

Parity Requires Unified Input Dependence and Negative Eigenvalues in SSMs

Jayesh Khullar

Franccois Rivest

Recent work has shown that LRNN models such as S4D, Mamba, and DeltaNet lack state-tracking capability due to either time-invariant transiti… (voir plus)on matrices or restricted eigenvalue ranges. To address this, input-dependent transition matrices, particularly those that are complex or non-triangular, have been proposed to enhance SSM performance on such tasks. While existing theorems demonstrate that both input-independent and non-negative SSMs are incapable of solving simple state-tracking tasks, such as parity, regardless of depth, they do not explore whether combining these two types in a multilayer SSM could help. We investigate this question for efficient SSMs with diagonal transition matrices and show that such combinations still fail to solve parity. This implies that a recurrence layer must both be input-dependent and include negative eigenvalues. Our experiments support this conclusion by analyzing an SSM model that combines S4D and Mamba layers.

2025-08-10

ArXiv (prépublication)

Revisiting Replay and Gradient Alignment for Continual Pre-Training of Large Language Models

Istabrak Abbes

Gopeshh Subbaraj

Matthew D Riemer

Nizar Islah

Benjamin Therien

Tsuguchika Tabaru

Hiroaki Kingetsu

Irina Rish

Training large language models (LLMs) typically involves pre-training on massive corpora, only to restart the process entirely when new data… (voir plus) becomes available. A more efficient and resource-conserving approach would be continual pre-training, where models are updated with new data rather than retraining from scratch. However, the introduction of new data often causes distribution shifts, leading to performance degradation on previously learned tasks. In this paper, we take a deeper look at two popular proposals for addressing this distribution shift within the continual learning literature: experience replay and gradient alignment. We consider continual pre-training of models within the Llama family of architectures at a large scale across languages with 100 billion tokens of training data in each language, finding that both replay and gradient alignment lead to more stable learning without forgetting. This conclusion holds both as we vary the model scale and as we vary the number and diversity of tasks. Moreover, we are the first to demonstrate the effectiveness of gradient alignment techniques in the context of LLM pre-training and propose an efficient implementation of meta-experience replay (MER) that imbues experience replay with the benefits of gradient alignment despite negligible compute and memory overhead. Our scaling analysis across model sizes and replay rates indicates that small rates of replaying old examples are definitely a more valuable use of compute than investing in model size, but that it is more compute efficient to scale the size of the model than invest in high rates of replaying old examples.

2025-08-03

ArXiv (prépublication)