Sarath Chandar

Biography

Sarath Chandar is an associate professor at Polytechnique Montreal's Department of Computer and Software Engineering, where he leads the Chandar Research Lab. He is also a Core Academic Member at Mila – Quebec Artificial Intelligence Institute and holds a Canada CIFAR AI Chair and the Canada Research Chair in Lifelong Machine Learning.

Chandar’s research interests include lifelong learning, deep learning, optimization, reinforcement learning and natural language processing. To promote research in lifelong learning, Chandar created the Conference on Lifelong Learning Agents (CoLLAs) in 2022, for which he served as program chair in 2022 and 2023.

He has a PhD from Université de Montréal and an MSc (By Research) from the Indian Institute of Technology Madras.

Current Students

Ista Abbes

Master's Research - Université de Montréal

Alex Aselstyne

Master's Research - Polytechnique Montréal

Davide Baldelli

PhD - Polytechnique Montréal

Co-supervisor :

Milan Bhan

Collaborating researcher

Diego Cerda Mardini

Master's Research - McGill University

Antoine Clavaud

Master's Research - Polytechnique Montréal

Naga Karthik Enamundram

PhD - Polytechnique Montréal

Principal supervisor :

Prashant Govindarajan

PhD - Polytechnique Montréal

Simon Guiroy

PhD - Université de Montréal

Principal supervisor :

PhD - Université de Montréal

David Heurtel-Depeiges

PhD - Polytechnique Montréal

Jerry Huang

PhD - Université de Montréal

Saurav Jha

Postdoctorate - Polytechnique Montréal

Amir Kalantari Dehaghi

Collaborating Alumni

Lola Le Breton

PhD - Polytechnique Montréal

Aidan Li

Master's Research - Université de Montréal

Co-supervisor :

Postdoctorate - Université de Montréal

PhD - Polytechnique Montréal

Roshan Munirathinam Sankaran Balaji

Research Intern - Polytechnique Montréal

Rayen Nacef

Collaborating researcher - Polytechnique Montréal

Hadi NekoeiQachkanloo

PhD - Université de Montréal

Nilaksh Nilaksh

PhD - Polytechnique Montréal

PhD - Université de Montréal

Linda Peinthiere

Collaborating researcher - Polytechnique Montréal

Gabriele Prato

PhD - Université de Montréal

Postdoctorate

Shaipranesh Senthilkumar

PhD - Polytechnique Montréal

Arjun Vaithilingam Sudhakar

Nour Shaheen

Master's Research - Polytechnique Montréal

Principal supervisor :

PhD - Polytechnique Montréal

Megh Thakkar

Master's Research - Université de Montréal

PhD - Polytechnique Montréal

Shawn Whitfield

Collaborating researcher

Kowen Woo

Research Intern - Polytechnique Montréal

Anabel XL

Postdoctorate - Université de Montréal

Abdelrahman Zayed

PhD - Polytechnique Montréal

Xutong Zhao

PhD - Polytechnique Montréal

Artem Zholus

PhD - Polytechnique Montréal

Blog Posts

December 19, 2025

Improving CAD Design With LLMs

Prashant Govindarajan

Davide Baldelli

Quentin Fournier

Sarath Chandar

A digital picture of Bert from Sesame Street, wearing a black trench coat and sunglasses

March 3, 2025

NeoBERT: A New Frontier for Open-Source Encoder Language Models

Lola Le Breton

Quentin Fournier

Sarath Chandar

October 1, 2024

How Do We Explain AI and Ensure the Explanation Is True? Faithfulness Measurable Models Tell You How

Andrea Madsen

Siva Reddy

Sarath Chandar

Publications

The Markovian Thinker

Milad Aghajohari

Kamran Chitsaz

Amirhossein Kazemnejad

Reinforcement learning (RL) has recently become a strong recipe for training reasoning LLMs that produce long chains of thought (LongCoT). Yet the standard RL "thinking environment", where the state is the prompt plus all prior reasoning tokens, makes the state unbounded and forces attention-based policies to pay quadratic compute as thoughts lengthen. We revisit the environment itself. We propose Markovian Thinking, a paradigm in which the policy advances reasoning while conditioning on a constant-size state, decoupling thinking length from context size. As an immediate consequence this yields linear compute with constant memory. We instantiate this idea with Delethink, an RL environment that structures reasoning into fixed-size chunks. Within each chunk, the model thinks as usual; at the boundary, the environment resets the context and reinitializes the prompt with a short carryover. Through RL, the policy learns to write a textual state near the end of each chunk sufficient for seamless continuation of reasoning after reset. Trained in this environment, an R1-Distill 1.5B model reasons in 8K-token chunks yet thinks up to 24K tokens, matching or surpassing LongCoT-RL trained with a 24K budget. With test-time scaling, Delethink continues to improve where LongCoT plateaus. The effect of linear compute is substantial: we empirically estimate at 96K average thinking length LongCoT-RL costs 27 H100-months vs. 7 for Delethink. Analysis at RL initialization shows off-the-shelf reasoning models (1.5B-120B) often sample Markovian traces zero-shot across diverse benchmarks, providing positive samples that make RL effective at scale. Our results show that redesigning the thinking environment is a powerful lever: it enables very long reasoning without quadratic overhead and opens a path toward efficient, scalable reasoning LLMs.

2025-10-08

ArXiv (preprint)

Just-in-time Episodic Feedback Hinter: Leveraging Offline Knowledge to Improve LLM Agents Adaptation

Hadi Nekoei

Aman Jaiswal

Patrice Bechard

Oleh Shliazhko

Orlando Marquez Ayala

Massimo Caccia

Alexandre Drouin

Alexandre Lacoste

Large language model (LLM) agents perform well in sequential decision-making tasks, but improving them on unfamiliar domains often requires costly online interactions or fine-tuning on large expert datasets. These strategies are impractical for closed-source models and expensive for open-source ones, with risks of catastrophic forgetting. Offline trajectories offer reusable knowledge, yet demonstration-based methods struggle because raw traces are long, noisy, and tied to specific tasks. We present Just-in-time Episodic Feedback Hinter (JEF Hinter), an agentic system that distills offline traces into compact, context-aware hints. A zooming mechanism highlights decisive steps in long trajectories, capturing both strategies and pitfalls. Unlike prior methods, JEF Hinter leverages both successful and failed trajectories, extracting guidance even when only failure data is available, while supporting parallelized hint generation and benchmark-independent prompting. At inference, a retriever selects relevant hints for the current state, providing targeted guidance with transparency and traceability. Experiments on MiniWoB++, WorkArena-L1, and WebArena-Lite show that JEF Hinter consistently outperforms strong baselines, including human- and document-based hints.

2025-10-05

ArXiv (preprint)

GRPO-$\lambda$: Credit Assignment improves LLM Reasoning

Prasanna Parthasarathi

Boxing Chen

Yufei Cui

Large language models (LLMs) are increasingly deployed for tasks requiring complex reasoning, prompting significant interest in improving their reasoning abilities through post-training. Especially RL based methods using verifiable reward, like the state-of-the-art GRPO, have shown to tremendously improve reasoning behaviors when applied as post-training methods. However, the lack of an explicit reward or critic model limits GRPO's ability to assign fine-grained credit across token sequences. In this work, we present GRPO-$\lambda$.

2025-09-30

ArXiv (preprint)

CrystalGym: A New Benchmark for Materials Discovery Using Reinforcement Learning

Prashant Govindarajan

Antoine Clavaud

Mariano Phielipp

Santiago Miret

In silico design and optimization of new materials primarily relies on high-accuracy atomic simulators that perform density functional theory (DFT) calculations. While recent works showcase the strong potential of machine learning to accelerate the material design process, they mostly consist of generative approaches that do not use direct DFT signals as feedback to improve training and generation mainly due to DFT's high computational cost. To aid the adoption of direct DFT signals in the materials design loop through online reinforcement learning (RL), we propose CrystalGym, an open-source RL environment for crystalline material discovery. Using CrystalGym, we benchmark common value- and policy-based reinforcement learning algorithms for designing various crystals conditioned on target properties. Concretely, we optimize for challenging properties like the band gap, bulk modulus, and density, which are directly calculated from DFT in the environment. While none of the algorithms we benchmark solve all CrystalGym tasks, our extensive experiments and ablations show different sample efficiencies and ease of convergence to optimality for different algorithms and environment settings. Additionally, we include a case study on the scope of fine-tuning large language models with reinforcement learning for improving DFT-based rewards. Our goal is for CrystalGym to serve as a test bed for reinforcement learning researchers and material scientists to address these real-world design problems with practical applications. We therefore introduce a novel class of challenges for reinforcement learning methods dealing with time-consuming reward signals, paving the way for future interdisciplinary research for machine learning motivated by real-world applications.

2025-09-27

ArXiv (preprint)

Benchmarking Machine Learning Potentials for Crystal Structure Relaxation

Kowen Woo

Prashant Govindarajan

High-throughput materials discovery workflows require rapid and accurate relaxation of crystal structures to identify thermodynamically stable phases among thousands to millions of candidate structures. Yet current machine learning interatomic potential (MLIP) benchmarks focus predominantly on energy prediction rather than structure relaxation, creating a critical evaluation gap for models designed to accelerate optimization. Additionally, these benchmarks are trained on datasets consisting mainly of known stable or near-stable materials, thus failing to capture the challenges of unexplored chemical spaces. We address these limitations by introducing a benchmark that evaluates state-of-the-art MLIPs and a one-shot relaxation model on structure relaxation with crystals generated via a reinforcement learning pipeline. We compare energy lowering and average maximum force computed via DFT, as well as relaxation runtime. We also contrast direct force-prediction strategies against conservative energy-differentiation approaches to determine which paradigm delivers superior relaxation performance. Our results indicate that there is a clear disconnect between MLIP energy prediction and force convergence in relaxation, challenging current benchmarking approaches.

2025-09-24

NeurIPS.cc/2025/Workshop/AI4Science (poster)

openreview.net

Revisiting Replay and Gradient Alignment for Continual Pre-Training of Large Language Models

Istabrak Abbes

Gopeshh Subbaraj

Matthew D Riemer

Nizar Islah

Benjamin Therien

Tsuguchika Tabaru

Hiroaki Kingetsu

Irina Rish

2025-09-22

NeurIPS.cc/2025/Workshop/WiML (published)

doi.org

openreview.net

Parity Requires Unified Input Dependence and Negative Eigenvalues in SSMs

Behnoush Khavari

Jayesh Khullar

François Rivest

Recent work has shown that LRNN models such as S4D, Mamba, and DeltaNet lack state-tracking capability due to either time-invariant transition matrices or restricted eigenvalue ranges. To address this, input-dependent transition matrices, particularly those that are complex or non-triangular, have been proposed to enhance SSM performance on such tasks. While existing theorems demonstrate that both input-independent and non-negative SSMs are incapable of solving simple state-tracking tasks, such as parity, regardless of depth, they do not explore whether combining these two types in a multilayer SSM could help. We investigate this question for efficient SSMs with diagonal transition matrices and show that such combinations still fail to solve parity. This implies that a recurrence layer must both be input-dependent and include negative eigenvalues. Our experiments support this conclusion by analyzing an SSM model that combines S4D and Mamba layers.

2025-08-10

ArXiv (preprint)

Revisiting Replay and Gradient Alignment for Continual Pre-Training of Large Language Models

Istabrak Abbes

Gopeshh Subbaraj

Matthew D Riemer

Nizar Islah

Benjamin Therien

Tsuguchika Tabaru

Hiroaki Kingetsu

Irina Rish

Training large language models (LLMs) typically involves pre-training on massive corpora, only to restart the process entirely when new data becomes available. A more efficient and resource-conserving approach would be continual pre-training, where models are updated with new data rather than retraining from scratch. However, the introduction of new data often causes distribution shifts, leading to performance degradation on previously learned tasks. In this paper, we take a deeper look at two popular proposals for addressing this distribution shift within the continual learning literature: experience replay and gradient alignment. We consider continual pre-training of models within the Llama family of architectures at a large scale across languages with 100 billion tokens of training data in each language, finding that both replay and gradient alignment lead to more stable learning without forgetting. This conclusion holds both as we vary the model scale and as we vary the number and diversity of tasks. Moreover, we are the first to demonstrate the effectiveness of gradient alignment techniques in the context of LLM pre-training and propose an efficient implementation of meta-experience replay (MER) that imbues experience replay with the benefits of gradient alignment despite negligible compute and memory overhead. Our scaling analysis across model sizes and replay rates indicates that small rates of replaying old examples are definitely a more valuable use of compute than investing in model size, but that it is more compute efficient to scale the size of the model than invest in high rates of replaying old examples.

2025-08-03

ArXiv (preprint)