Sarath Chandar

Core Academic Member
Canada CIFAR AI Chair
Associate Professor, Polytechnique Montréal, Department of Computer Engineering and Software Engineering
Adjunct Professor, Université de Montréal, Department of Computer Science and Operations Research
Indian Institute of Technology Madras
Research Topics
AI Alignment
Deep Learning
Explainable AI (XAI)
Foundation Models
Interpretability
Large Language Models (LLM)
Lifelong Learning
Medical Machine Learning
Multi-Agent Systems
Natural Language Processing
Online Learning
Optimization
Recurrent Neural Networks
Reinforcement Learning
Representation Learning
Transfer Learning
Trustworthy AI

Biography

Sarath Chandar is an associate professor at Polytechnique Montréal's Department of Computer Engineering and Software Engineering, where he leads the Chandar Research Lab. He is also a Core Academic Member at Mila – Quebec Artificial Intelligence Institute and holds a Canada CIFAR AI Chair and the Canada Research Chair in Lifelong Machine Learning.

Chandar’s research interests include lifelong learning, deep learning, optimization, reinforcement learning and natural language processing. To promote research in lifelong learning, Chandar created the Conference on Lifelong Learning Agents (CoLLAs) in 2022, for which he served as program chair in 2022 and 2023.

He has a PhD from Université de Montréal and an MSc (By Research) from the Indian Institute of Technology Madras.

Current Students

Master's Research - Université de Montréal
PhD - Polytechnique Montréal
Co-supervisor :
Master's Research - Polytechnique Montréal
PhD - Polytechnique Montréal
Principal supervisor :
PhD - Polytechnique Montréal
PhD - Université de Montréal
Principal supervisor :
Collaborating researcher - Université de Montréal
Principal supervisor :
PhD - Université de Montréal
PhD - Polytechnique Montréal
PhD - Université de Montréal
Master's Research - Polytechnique Montréal
PhD - Polytechnique Montréal
Co-supervisor :
PhD - Polytechnique Montréal
Master's Research - Polytechnique Montréal
Postdoctorate - Polytechnique Montréal
Principal supervisor :
PhD - Université de Montréal
PhD - Université de Montréal
PhD - Université de Montréal
Collaborating Alumni - Université de Montréal
Co-supervisor :
Independent visiting researcher
Master's Research - Université de Montréal
Master's Research - Université de Montréal
PhD - Polytechnique Montréal
PhD - Polytechnique Montréal
PhD - Polytechnique Montréal
PhD - Polytechnique Montréal

Publications

Steering Large Language Model Activations in Sparse Spaces
Reza Bayat
Ali Rahimi-Kalahroudi
Mohammad Pezeshki
A key challenge in AI alignment is guiding large language models (LLMs) to follow desired behaviors at test time. Activation steering, which modifies internal model activations during inference, offers a potential solution. However, prior work in dense activation spaces struggles with superposition, wherein multiple features become entangled, limiting interpretability and precise control. In contrast, sparse representations provide an untapped opportunity for more interpretable behavior modulation. In this work, we introduce sparse activation steering (SAS), a method that leverages sparse autoencoders (SAEs) to steer LLM behavior in sparse spaces. By isolating behavior-specific features through a contrastive prompt-pairing approach, we define a set of features that can selectively reinforce or suppress behaviors. Experiments on Gemma 2 LLMs show that SAS vectors enable nuanced behavioral modulation and finer-grained control. Furthermore, scaling SAEs improves monosemanticity of SAS vectors, suggesting more reliable and interpretable interventions.
NeoBERT: A Next-Generation BERT
Lola Le Breton
Quentin Fournier
Mariam El Mezouar
Recent innovations in architecture, pre-training, and fine-tuning have led to the remarkable in-context learning and reasoning abilities of large auto-regressive language models such as LLaMA and DeepSeek. In contrast, encoders like BERT and RoBERTa have not seen the same level of progress despite being foundational for many downstream NLP applications. To bridge this gap, we introduce NeoBERT, a next-generation encoder that redefines the capabilities of bidirectional models by integrating state-of-the-art advancements in architecture, modern data, and optimized pre-training methodologies. NeoBERT is designed for seamless adoption: it serves as a plug-and-play replacement for existing base models, relies on an optimal depth-to-width ratio, and leverages an extended context length of 4,096 tokens. Despite its compact 250M parameter footprint, it achieves state-of-the-art results on the massive MTEB benchmark, outperforming BERT large, RoBERTa large, NomicBERT, and ModernBERT under identical fine-tuning conditions. In addition, we rigorously evaluate the impact of each modification on GLUE and design a uniform fine-tuning and evaluation framework for MTEB. We release all code, data, checkpoints, and training scripts to accelerate research and real-world adoption.
Sub-goal Distillation: A Method to Improve Small Language Agents
Maryam Hashemzadeh
Elias Stengel-Eskin
Marc-Alexandre Côté
While Large Language Models (LLMs) have demonstrated significant promise as agents in interactive tasks, their substantial computational requirements and restricted number of calls constrain their practical utility, especially in long-horizon interactive tasks such as decision-making or in scenarios involving continuous ongoing tasks. To address these constraints, we propose a method for transferring the performance of an LLM with billions of parameters to a much smaller language model (770M parameters). Our approach involves constructing a hierarchical agent comprising a planning module, which learns through Knowledge Distillation from an LLM to generate sub-goals, and an execution module, which learns to accomplish these sub-goals using elementary actions. In detail, we leverage an LLM to annotate an oracle path with a sequence of sub-goals towards completing a goal. Subsequently, we utilize this annotated data to fine-tune both the planning and execution modules. Importantly, neither module relies on real-time access to an LLM during inference, significantly reducing the overall cost associated with LLM interactions to a fixed cost. In ScienceWorld, a challenging and multi-task interactive text environment, our method surpasses standard imitation learning based solely on elementary actions by 16.7% (absolute). Our analysis highlights the efficiency of our approach compared to other LLM-based methods. Our code and annotated data for distillation can be found on GitHub.
A Generalist Hanabi Agent
Arjun V Sudhakar
Hadi Nekoei
Mathieu Reymond
Miao Liu
Janarthanan Rajendran
Traditional multi-agent reinforcement learning (MARL) systems can develop cooperative strategies through repeated interactions. However, these systems are unable to perform well on any other setting than the one they have been trained on, and struggle to successfully cooperate with unfamiliar collaborators. This is particularly visible in the Hanabi benchmark, a popular 2-to-5 player cooperative card-game which requires complex reasoning and precise assistance to other agents. Current MARL agents for Hanabi can only learn one specific game-setting (e.g., 2-player games), and play with the same algorithmic agents. This is in stark contrast to humans, who can quickly adjust their strategies to work with unfamiliar partners or situations. In this paper, we introduce Recurrent Replay Relevance Distributed DQN (R3D2), a generalist agent for Hanabi, designed to overcome these limitations. We reformulate the task using text, as language has been shown to improve transfer. We then propose a distributed MARL algorithm that copes with the resulting dynamic observation- and action-space. In doing so, our agent is the first that can play all game settings concurrently, and extend strategies learned from one setting to other ones. As a consequence, our agent also demonstrates the ability to collaborate with different algorithmic agents, agents that are themselves unable to do so.
Torque-Aware Momentum
Pranshu Malviya
Goncalo Mordido
Aristide Baratin
Reza Babanezhad Harikandeh
Efficiently exploring complex loss landscapes is key to the performance of deep neural networks. While momentum-based optimizers are widely used in state-of-the-art setups, classical momentum can still struggle with large, misaligned gradients, leading to oscillations. To address this, we propose Torque-Aware Momentum (TAM), which introduces a damping factor based on the angle between the new gradients and previous momentum, stabilizing the update direction during training. Empirical results show that TAM, which can be combined with both SGD and Adam, enhances exploration, handles distribution shifts more effectively, and improves generalization performance across various tasks, including image classification and large language model fine-tuning, when compared to classical momentum-based optimizers.
Too Big to Fool: Resisting Deception in Language Models
Mohammad Reza Samsami
Mats Leon Richter
Juan A. Rodriguez
Megh Thakkar
Large language models must balance their weight-encoded knowledge with in-context information from prompts to generate accurate responses. This paper investigates this interplay by analyzing how models of varying capacities within the same family handle intentionally misleading in-context information. Our experiments demonstrate that larger models exhibit higher resilience to deceptive prompts, showcasing an advanced ability to interpret and integrate prompt information with their internal knowledge. Furthermore, we find that larger models outperform smaller ones in following legitimate instructions, indicating that their resilience is not due to disregarding in-context information. We also show that this phenomenon is likely not a result of memorization but stems from the models' ability to better leverage implicit task-relevant information from the prompt alongside their internally stored knowledge.