
Sarath Chandar

Core Academic Member
Canada CIFAR AI Chair
Associate Professor, Polytechnique Montréal, Department of Computer Engineering and Software Engineering
Adjunct Professor, Université de Montréal, Department of Computer Science and Operations Research
Indian Institute of Technology Madras
Research Topics
AI Alignment
Deep Learning
Foundation Models
Interpretability
Large Language Models (LLM)
Lifelong Learning
Medical Machine Learning
Multi-Agent Systems
Natural Language Processing
Online Learning
Optimization
Recurrent Neural Networks
Reinforcement Learning
Representation Learning
Transfer Learning
Trustworthy AI
XAI (Explainable AI)

Biography

Sarath Chandar is an Associate Professor in the Department of Computer Engineering and Software Engineering at Polytechnique Montréal, where he leads the Chandar Research Lab. He is also a Core Academic Member at Mila – Quebec Artificial Intelligence Institute, and holds a Canada CIFAR AI Chair and the Canada Research Chair in Lifelong Machine Learning.

His research interests include lifelong learning, deep learning, optimization, reinforcement learning, and natural language processing. To promote research on lifelong learning, he created the Conference on Lifelong Learning Agents (CoLLAs) in 2022 and served as its program chair in 2022 and 2023. He holds a PhD from Université de Montréal and an M.S. (by research) from the Indian Institute of Technology Madras.

Current Students

Master's Research - UdeM
Master's Research - Polytechnique
PhD - Polytechnique
Co-supervisor:
Research collaborator
Master's Research - McGill
Master's Research - Polytechnique
PhD - Polytechnique
Principal supervisor:
PhD - Polytechnique
PhD - UdeM
Principal supervisor:
Research collaborator
Principal supervisor:
PhD - UdeM
Postdoctorate - Polytechnique
Alumni collaborator
PhD - Polytechnique
Master's Research - UdeM
Co-supervisor:
PhD - Polytechnique
Research collaborator - Polytechnique
PhD - UdeM
PhD - Polytechnique
PhD - UdeM
Research collaborator - Polytechnique Montréal
Master's Research - Polytechnique
Alumni collaborator
PhD - Polytechnique
Master's Research - Polytechnique
Principal supervisor:
PhD - Polytechnique
Postdoctorate - UdeM
Master's Research - UdeM
PhD - Polytechnique
Research collaborator
PhD - Polytechnique
PhD - Polytechnique
PhD - Polytechnique

Publications

Squeezing More from the Stream: Learning Representation Online for Streaming Reinforcement Learning
Nilaksh
François Rivest
Monitoring morphometric drift in lifelong learning segmentation of the spinal cord
Enamundram Naga Karthik
Christoph Stefan Aigner
Elise Bannier
Josef Bednařík
Virginie Callot
Anna Combes
Armin Curt
Gergely Dávid
Falk Eippert
Lynn Farner
Michael G. Fehlings
Patrick Freund
Tobias Granberg
Cristina Granziera
RHSCIR Network Imaging Group
Ulrike Horn
Tomáš Horák
Suzanne Humphreys
Markus Hupp
Anne Kerbrat
Nawal Kinany
Shannon Kolind
Petr Kudlička
Anna Lebret
Lisa Eunyoung Lee
Allan R. Martin
Govind Nair
Megan McGrath
Kristin P. O’Grady
Jiwon Oh
Russell Ouellette
Nikolai Pfender
Dario Pfyffer
Pierre‐François Pradat
Alexandre Prat
Daniel S. Reich
Ilaria Ricchi
Naama Rotem‐Kohavi
Simon Schading-Sassenhausen
Maryam Seif
Andrew Smith
Seth A. Smith
Grace Sweeney
Roger Tam
Anthony Traboulsee
Constantina A. Treaba
Charidimos Tsagkas
Dimitri Van De Ville
Zachary Vavasour
Kenneth A. Weber
Morphometric measures derived from spinal cord segmentations can serve as diagnostic and prognostic biomarkers in neurological diseases and injuries affecting the spinal cord. For instance, the spinal cord cross-sectional area can be used to monitor cord atrophy in multiple sclerosis and to characterize compression in degenerative cervical myelopathy. While automatic segmentation methods robust to a wide variety of contrasts and pathologies have been developed over the past few years, whether their predictions are stable as the model is updated using new datasets has not been assessed. This is particularly important for deriving normative values from healthy participants. In this study, we present a spinal cord segmentation model trained on a multisite (n=75) dataset, including 9 different MRI contrasts and several spinal cord pathologies. We also introduce a lifelong learning framework to automatically monitor the morphometric drift as the model is updated using additional datasets. The framework is triggered by an automatic GitHub Actions workflow every time a new model is created, recording the morphometric values derived from the model's predictions over time. As a real-world application of the proposed framework, we employed the spinal cord segmentation model to update a recently introduced normative database of healthy participants containing commonly used measures of spinal cord morphometry. Results showed that: (i) our model performs well compared to its previous versions and existing pathology-specific models on the lumbar spinal cord, on images with severe compression, and in the presence of intramedullary lesions and/or atrophy, achieving an average Dice score of 0.95 ± 0.03; (ii) the automatic workflow for monitoring morphometric drift provides a quick feedback loop for developing future segmentation models; and (iii) the scaling factor required to update the database of morphometric measures is nearly constant among slices across the given vertebral levels, showing minimal drift between the current and previous versions of the model monitored by the framework. The model is freely available in Spinal Cord Toolbox v7.0.
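The drift check at the heart of this workflow is simple to illustrate. Below is a minimal sketch of the kind of comparison such a pipeline could run when a new model version is released; the vertebral levels, cross-sectional-area values, and 2% tolerance are illustrative assumptions, not taken from the paper or from Spinal Cord Toolbox.

```python
def morphometric_drift(prev, curr, rel_tol=0.02):
    """Compare per-vertebral-level morphometric measures (e.g. cord
    cross-sectional area in mm^2) computed from two model versions
    and flag levels whose relative change exceeds the tolerance."""
    flagged = {}
    for level, prev_val in prev.items():
        rel_change = abs(curr[level] - prev_val) / prev_val
        if rel_change > rel_tol:
            flagged[level] = round(rel_change, 4)
    return flagged

# Illustrative cross-sectional areas from a previous and a new model.
prev = {"C2": 72.1, "C3": 70.4, "C4": 68.9}
curr = {"C2": 72.3, "C3": 70.1, "C4": 66.2}
print(morphometric_drift(prev, curr))  # {'C4': 0.0392}: only C4 drifts > 2%
```

In the paper's setup, a check of this sort would run inside the GitHub Actions workflow each time a new model is created, turning drift into an automatic pass/fail signal.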
LLMs Can't Play Hangman: On the Necessity of a Private Working Memory for Language Agents
Investigating the Multilingual Calibration Effects of Language Model Instruction-Tuning
Peng Lu
Qiuhao Zeng
Yusuke Iwasawa
Yutaka Matsuo
Edison Marrese-Taylor
Irene Li
Ensuring that deep learning models are well-calibrated in terms of their predictive uncertainty is essential to maintaining their trustworthiness and reliability, yet despite increasing advances in foundation model research, the relationship between such large language models (LLMs) and their calibration remains an open area of research. In this work, we look at a critical gap in the calibration of LLMs within multilingual settings, in an attempt to better understand how data scarcity can lead to different calibration effects and how commonly used techniques apply in these settings. Our analysis on two multilingual benchmarks, covering 29 and 42 languages respectively, reveals that even in low-resource languages, model confidence can increase significantly after instruction-tuning on high-resource-language SFT datasets. However, improvements in accuracy are marginal or non-existent, resulting in miscalibration and highlighting a critical shortcoming of standard SFT in multilingual settings. Furthermore, we observe label smoothing to be a reasonable method to alleviate this concern, again without any need for low-resource SFT data, maintaining better calibration across all languages. Overall, this highlights the importance of multilingual considerations when training and tuning LLMs in order to improve their reliability and fairness in downstream use.
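Calibration here has a precise meaning: a model's confidence should match its empirical accuracy. As a rough, self-contained sketch (not the paper's code), the snippet below computes the standard expected calibration error and a label-smoothed cross-entropy of the kind the abstract points to as a remedy.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by confidence; ECE is the bin-weighted gap
    between mean confidence and empirical accuracy in each bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(confidences[mask].mean() - correct[mask].mean())
    return ece

def label_smoothed_loss(logits, gold, eps=0.1):
    """Cross-entropy against a smoothed target distribution:
    1 - eps on the gold class, eps spread uniformly over all classes."""
    logits = np.asarray(logits, dtype=float)
    z = logits - logits.max()                  # numerical stability
    log_probs = z - np.log(np.exp(z).sum())
    target = np.full(len(logits), eps / len(logits))
    target[gold] += 1.0 - eps
    return -(target * log_probs).sum()
```

A perfectly calibrated model has ECE 0; the abstract's finding is that instruction-tuning raises confidence without raising accuracy, which this metric would register as a growing gap.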
Effect of Document Packing on the Latent Multi-Hop Reasoning Capabilities of Large Language Models
The standard practice for training large language models involves packing multiple documents together to optimize computational efficiency. However, the impact of this process on the models' capabilities remains largely unexplored. To address this gap, we investigate how different document-packing strategies influence the latent multi-hop reasoning abilities of LLMs. Our findings indicate that packing can improve model performance compared to training on individual documents, at the expense of more compute. To further understand the underlying mechanisms, we conduct an ablation study, identifying key factors that explain the advantages of packing. Ultimately, our research deepens the understanding of LLM training dynamics and provides practical insights for optimizing model development.
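For readers unfamiliar with packing: documents are tokenized, concatenated with a separator token, and the resulting stream is sliced into fixed-length training sequences. A minimal sketch of the standard practice the abstract refers to (the token IDs and EOS id are made up):

```python
def pack_documents(docs, seq_len, eos_id=0):
    """Concatenate tokenized documents, separated by an EOS token,
    then slice the stream into fixed-length training sequences."""
    stream = []
    for doc in docs:
        stream.extend(doc)
        stream.append(eos_id)
    n_full = len(stream) // seq_len   # drop (or pad) the ragged tail
    return [stream[i * seq_len:(i + 1) * seq_len] for i in range(n_full)]

docs = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]
print(pack_documents(docs, seq_len=4))
# [[5, 6, 7, 0], [8, 9, 0, 10], [11, 12, 13, 0]]
```

Note how the second sequence mixes the end of one document with the start of the next; which documents end up co-packed like this is exactly the kind of choice the paper varies.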
Neural Coherence: Find higher performance to out-of-distribution tasks from few samples
Probing the effectiveness of World Models for Spatial Reasoning through Test-time Scaling
Muhammad Jehanzeb Mirza
Wei Lin
Shiqi Yang
Vision-Language Models (VLMs) remain limited in spatial reasoning tasks that require multi-view understanding and embodied perspective shifts. Recent approaches such as MindJourney attempt to mitigate this gap through test-time scaling where a world model imagines action-conditioned trajectories and a heuristic verifier selects helpful views from such trajectories. In this work, we systematically examine how such test-time verifiers behave across benchmarks, uncovering both their promise and their pitfalls. Our uncertainty-based analyses show that MindJourney's verifier provides little meaningful calibration, and that random scoring often reduces answer entropy equally well, thus exposing systematic action biases and unreliable reward signals. To mitigate these, we introduce a Verification through Spatial Assertions (ViSA) framework that grounds the test-time reward in verifiable, frame-anchored micro-claims. This principled verifier consistently improves spatial reasoning on the SAT-Real benchmark and corrects trajectory-selection biases through more balanced exploratory behavior. However, on the challenging MMSI-Bench, none of the verifiers, including ours, achieve consistent scaling, suggesting that the current world models form an information bottleneck where imagined views fail to enrich fine-grained reasoning. Together, these findings chart the bad, good, and ugly aspects of test-time verification for world-model-based reasoning. Our code is available at https://github.com/chandar-lab/visa-for-mindjourney.
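One quantity behind the abstract's analysis is answer entropy: if a verifier merely makes the model's answer distribution more peaked, random scoring can do that too, so entropy reduction alone is not evidence of signal. A minimal sketch of the measurement (the sampled answers are made up):

```python
import math
from collections import Counter

def answer_entropy(answers):
    """Shannon entropy (in bits) of a sampled answer distribution;
    comparing it before/after verifier selection shows whether the
    verifier adds signal beyond simply sharpening outputs."""
    counts = Counter(answers)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

print(answer_entropy(["A", "B", "A", "C"]))  # 1.5 bits
print(answer_entropy(["A", "A", "A", "B"]))  # ~0.81 bits: more peaked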
Shielded Controller Units for RL with Operational Constraints Applied to Remote Microgrids
Alexandre Blondin Massé
Rachid Hassani
Vincent Mai
Reinforcement learning (RL) is a powerful framework for optimizing decision-making in complex systems under uncertainty, an essential challenge in real-world settings, particularly in the context of the energy transition. A representative example is remote microgrids that supply power to communities disconnected from the main grid. Enabling the energy transition in such systems requires coordinated control of renewable sources like wind turbines, alongside fuel generators and batteries, to meet demand while minimizing fuel consumption and battery degradation under exogenous and intermittent load and wind conditions. These systems must often conform to extensive regulations and complex operational constraints. To ensure that RL agents respect these constraints, it is crucial to provide interpretable guarantees. In this paper, we introduce Shielded Controller Units (SCUs), a systematic and interpretable approach that leverages prior knowledge of system dynamics to ensure constraint satisfaction. Our shield synthesis methodology, designed for real-world deployment, decomposes the environment into a hierarchical structure where each SCU explicitly manages a subset of constraints. We demonstrate the effectiveness of SCUs on a remote microgrid optimization task with strict operational requirements. The RL agent, equipped with SCUs, achieves a 24% reduction in fuel consumption without increasing battery degradation, outperforming other baselines while satisfying all constraints. We hope SCUs contribute to the safe application of RL to the many decision-making challenges linked to the energy transition.
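The paper's shield synthesis is not reproduced here, but the general shielding idea is easy to sketch: the agent's action passes through only if it satisfies a constraint check, otherwise a known-safe action is substituted. The battery constraint below is a toy stand-in, not the paper's microgrid model.

```python
def shielded_step(state, proposed_action, is_safe, fallback):
    """Generic action shield: forward the agent's action only if it
    passes the constraint check, otherwise substitute a known-safe
    fallback action."""
    return proposed_action if is_safe(state, proposed_action) else fallback(state)

# Toy battery constraint (illustrative only): never let the state of
# charge drop below a hard floor within one time step.
SOC_FLOOR = 0.2

def is_safe(state, action):
    return state["soc"] - action["discharge"] * state["dt"] >= SOC_FLOOR

def fallback(state):
    # Largest discharge that keeps the state of charge at the floor.
    return {"discharge": max(0.0, (state["soc"] - SOC_FLOOR) / state["dt"])}

state = {"soc": 0.3, "dt": 1.0}
print(shielded_step(state, {"discharge": 0.5}, is_safe, fallback))
# {'discharge': 0.1}: the unsafe action is clipped to the safe maximum.
```

Because the check and the fallback are explicit, a shield of this form gives the interpretable, per-constraint guarantee the abstract emphasizes, regardless of what the learned policy proposes.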
The Markovian Thinker
Reinforcement learning (RL) has recently become a strong recipe for training reasoning LLMs that produce long chains of thought (LongCoT). Yet the standard RL "thinking environment", where the state is the prompt plus all prior reasoning tokens, makes the state unbounded and forces attention-based policies to pay quadratic compute as thoughts lengthen. We revisit the environment itself. We propose Markovian Thinking, a paradigm in which the policy advances reasoning while conditioning on a constant-size state, decoupling thinking length from context size. As an immediate consequence this yields linear compute with constant memory. We instantiate this idea with Delethink, an RL environment that structures reasoning into fixed-size chunks. Within each chunk, the model thinks as usual; at the boundary, the environment resets the context and reinitializes the prompt with a short carryover. Through RL, the policy learns to write a textual state near the end of each chunk sufficient for seamless continuation of reasoning after reset. Trained in this environment, an R1-Distill 1.5B model reasons in 8K-token chunks yet thinks up to 24K tokens, matching or surpassing LongCoT-RL trained with a 24K budget. With test-time scaling, Delethink continues to improve where LongCoT plateaus. The effect of linear compute is substantial: we empirically estimate at 96K average thinking length LongCoT-RL costs 27 H100-months vs. 7 for Delethink. Analysis at RL initialization shows off-the-shelf reasoning models (1.5B-120B) often sample Markovian traces zero-shot across diverse benchmarks, providing positive samples that make RL effective at scale. Our results show that redesigning the thinking environment is a powerful lever: it enables very long reasoning without quadratic overhead and opens a path toward efficient, scalable reasoning LLMs.
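Conceptually, the Delethink loop is easy to sketch: generate a chunk, reset the context, and carry over only a short textual state. The skeleton below is illustrative, not the paper's implementation; `generate` stands in for any LLM call, and the stop marker is invented.

```python
def markovian_generate(prompt, generate, chunk_tokens=8192,
                       carryover_chars=512, max_chunks=4):
    """Chunked reasoning loop: the model conditions only on the prompt
    plus a short carryover, so the state stays constant-size and
    per-chunk compute does not grow with total thinking length."""
    carryover, trace = "", []
    for _ in range(max_chunks):
        chunk = generate(prompt + carryover, max_tokens=chunk_tokens)
        trace.append(chunk)
        if "FINAL ANSWER" in chunk:      # invented stop marker
            break
        # Context reset: keep only the chunk's tail as the textual
        # state the policy learns (via RL) to make sufficient.
        carryover = chunk[-carryover_chars:]
    return "".join(trace)
```

The key design point is that nothing outside `prompt + carryover` reaches the model, which is what converts quadratic attention cost over a growing trace into linear cost over fixed-size chunks.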
Just-in-time Episodic Feedback Hinter: Leveraging Offline Knowledge to Improve LLM Agents Adaptation
Aman Jaiswal
Oleh Shliazhko
Orlando Marquez Ayala
Massimo Caccia
Alexandre Lacoste
Large language model (LLM) agents perform well in sequential decision-making tasks, but improving them on unfamiliar domains often requires costly online interactions or fine-tuning on large expert datasets. These strategies are impractical for closed-source models and expensive for open-source ones, with risks of catastrophic forgetting. Offline trajectories offer reusable knowledge, yet demonstration-based methods struggle because raw traces are long, noisy, and tied to specific tasks. We present Just-in-time Episodic Feedback Hinter (JEF Hinter), an agentic system that distills offline traces into compact, context-aware hints. A zooming mechanism highlights decisive steps in long trajectories, capturing both strategies and pitfalls. Unlike prior methods, JEF Hinter leverages both successful and failed trajectories, extracting guidance even when only failure data is available, while supporting parallelized hint generation and benchmark-independent prompting. At inference, a retriever selects relevant hints for the current state, providing targeted guidance with transparency and traceability. Experiments on MiniWoB++, WorkArena-L1, and WebArena-Lite show that JEF Hinter consistently outperforms strong baselines, including human- and document-based hints.
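The retrieval step at inference is simple to picture. Below is a deliberately minimal sketch that scores stored hints by word overlap with the current state; both the hints and the overlap scorer are illustrative stand-ins for whatever retriever JEF Hinter actually uses.

```python
def retrieve_hints(state, hints, k=2):
    """Score stored hints by word overlap with the current state
    description and return the top-k; a real system would use an
    embedding-based retriever instead."""
    state_words = set(state.lower().split())
    scored = sorted(
        hints,
        key=lambda h: len(state_words & set(h.lower().split())),
        reverse=True,
    )
    return scored[:k]

hints = [
    "When a form rejects input, check required fields before retrying.",
    "Prefer the search bar over pagination for long product lists.",
    "Log out before switching accounts to avoid stale sessions.",
]
print(retrieve_hints("the form submission failed with empty required fields",
                     hints, k=1))
# ['When a form rejects input, check required fields before retrying.']
```

The retrieved hints are then prepended to the agent's context, which is what makes the guidance targeted yet traceable back to the offline traces it was distilled from.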