Mathieu Reymond

Squeezing More from the Stream: Learning Representation Online for Streaming Reinforcement Learning

Nilaksh

Antoine Clavaud

François Rivest

A. Chandar

AI Institute

Polytechnique Montréal

2026-02-09

arXiv (preprint)

Squeezing More from the Stream: Learning Representation Online for Streaming Reinforcement Learning

Nilaksh

In streaming Reinforcement Learning (RL), transitions are observed and discarded immediately after a single update. While this minimizes resource usage for on-device applications, it makes agents notoriously sample-inefficient, since value-based losses alone struggle to extract meaningful representations from transient data. We propose extending Self-Predictive Representations (SPR) to the streaming pipeline to maximize the utility of every observed frame. However, due to the highly correlated samples induced by the streaming regime, naively applying this auxiliary loss results in training instabilities. Thus, we introduce orthogonal gradient updates relative to the momentum target and resolve gradient conflicts arising from streaming-specific optimizers. Validated across the Atari, MinAtar, and Octax suites, our approach systematically outperforms existing streaming baselines. Latent-space analysis, including t-SNE visualizations and effective-rank measurements, confirms that our method learns significantly richer representations, bridging the performance gap caused by the absence of a replay buffer, while remaining efficient enough to train on just a few CPU cores.

2025-12-31

International Conference on Machine Learning (Accept (regular))
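The abstract above mentions effective-rank measurements as evidence of richer representations. As a rough illustration only, the sketch below uses one common definition of effective rank (the exponential of the Shannon entropy of the normalized singular values); whether the paper uses this exact definition is an assumption, and the function name `effective_rank` is hypothetical. The singular values are assumed to come from an SVD of a matrix of latent features computed elsewhere.

```python
import math


def effective_rank(singular_values):
    """Effective rank as exp(entropy of normalized singular values).

    Assumption: this is one common definition; the paper may use another.
    """
    # Zero singular values contribute nothing to the entropy, so drop them.
    positive = [s for s in singular_values if s > 0]
    total = sum(positive)
    if total == 0:
        return 0.0
    probs = [s / total for s in positive]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)


# A collapsed representation: a single dominant direction.
collapsed = effective_rank([5.0, 0.0, 0.0, 0.0])
# A well-spread representation: all four directions used equally.
spread = effective_rank([1.0, 1.0, 1.0, 1.0])
```

Under this definition a higher value means the features occupy more directions of the latent space, which is the sense in which a representation would be called richer.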

Just-in-time Episodic Feedback Hinter: Leveraging Offline Knowledge to Improve LLM Agents Adaptation

Hadi Nekoei

A. Jaiswal

Patrice Béchard

Oleh Shliazhko

Orlando Marquez Ayala

Massimo Caccia

Alexandre Drouin

A. Chandar

Alexandre Lacoste

2025-10-04

arXiv (preprint)

GRPO-λ: Credit Assignment improves LLM Reasoning

Prasanna Parthasarathi

Boxing Chen

Yufei Cui

A. Chandar

Large language models (LLMs) are increasingly deployed for tasks requiring complex reasoning, prompting significant interest in improving their reasoning abilities through post-training. In particular, RL-based methods using verifiable rewards, such as the state-of-the-art GRPO, have been shown to tremendously improve reasoning behaviors when applied as post-training methods. However, the lack of an explicit reward or critic model limits GRPO's ability to assign fine-grained credit across token sequences. In this work, we present GRPO-

2025-09-29

arXiv (preprint)
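To make the credit-assignment limitation mentioned in the abstract above concrete, the sketch below shows the group-relative advantage used by standard GRPO, not the GRPO-λ method itself (whose details are not given in this listing). The function names, the use of the population standard deviation, and the `eps` constant are illustrative assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each sampled response's reward against its group.

    Assumption: population standard deviation and eps are illustrative choices.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast_to_tokens(advantage, num_tokens):
    """Every token in a response receives the same sequence-level advantage."""
    return [advantage] * num_tokens


# Four sampled responses to one prompt, scored by a verifiable 0/1 reward.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
# A 5-token response gets one shared value, with no per-token credit.
token_advantages = broadcast_to_tokens(advantages[0], 5)
```

Because the advantage is computed once per response and copied to every token, no token is credited more than any other, which is the coarse-grained behavior the abstract describes.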

Revisiting Laplacian Representations for Value Function Approximation in Deep RL

Priyesh Vijayan

Padideh Nouri

Rishav

A. Chandar

Yash Chandak

S Ebrahimi Kahou

Doina Precup

Proto-value functions (PVFs) introduced Laplacian embeddings as an effective feature basis for value-function approximation; however, their utility remained limited to small, fully known state spaces. Recent work has scaled Laplacian embeddings to high-dimensional inputs, using them for reward shaping and option discovery in goal-directed tasks, yet only as auxiliary signals, rather than directly using them as features for value functions. In this paper, we learn Laplacian eigenvectors online and employ them as features for Q-learning in 23 Atari games. We empirically demonstrate that these online-learned embeddings substantially improve model-free RL in large, high-dimensional domains. We demonstrate that enriching state representations with action embeddings yields additional gains under both behavior-policy and uniform-random policies. Additionally, we introduce the Fusion architecture, which augments the representation with useful inductive bias at the embedding level. To assess the usefulness of each embedding used in the Fusion architecture, we use Shapley values analysis.

2025-06-21

rl-conference.cc/RLC/2025/Workshop/IBRL (published)

CrystalGym: A New Benchmark for Materials Discovery Using Reinforcement Learning

Prashant Govindarajan

Antoine Clavaud

Mariano Phielipp

Santiago Miret

A. Chandar

*In silico* design and optimization of new materials primarily relies on high-accuracy atomic simulators that perform density functional theory (DFT) calculations. While recent works showcase the strong potential of machine learning to accelerate the material design process, they mostly consist of generative approaches that do not use direct DFT signals as feedback to improve training and generation mainly due to DFT's high computational cost. To aid the adoption of direct DFT signals in the materials design loop through online reinforcement learning (RL), we propose **CrystalGym**, an open-source RL environment for crystalline material discovery. Using CrystalGym, we benchmark value- and policy-based reinforcement learning algorithms for designing various crystals conditioned on target properties. Concretely, we optimize for challenging properties like the band gap, bulk modulus, and density, which are directly calculated from DFT in the environment. While none of the algorithms we benchmark solve all CrystalGym tasks, our extensive experiments and ablations show different sample efficiencies and ease of convergence to optimality for different algorithms and environment settings. Our goal is for CrystalGym to serve as a test bed for reinforcement learning researchers and material scientists to address these real-world design problems with practical applications. Furthermore, we introduce a novel class of challenges for reinforcement learning methods dealing with time-consuming reward signals, paving the way for future interdisciplinary research for machine learning motivated by real-world applications.

2025-03-02

AI4MAT @ International Conference on Learning Representations (spotlight)

Arjun Vaithilingam Sudhakar

A Generalist Hanabi Agent

Hadi Nekoei

Miao Liu

Janarthanan Rajendran

Sarath Chandar

Traditional multi-agent reinforcement learning (MARL) systems can develop cooperative strategies through repeated interactions. However, these systems are unable to perform well on any other setting than the one they have been trained on, and struggle to successfully cooperate with unfamiliar collaborators. This is particularly visible in the Hanabi benchmark, a popular 2-to-5 player cooperative card-game which requires complex reasoning and precise assistance to other agents. Current MARL agents for Hanabi can only learn one specific game-setting (e.g., 2-player games), and play with the same algorithmic agents. This is in stark contrast to humans, who can quickly adjust their strategies to work with unfamiliar partners or situations. In this paper, we introduce Recurrent Replay Relevance Distributed DQN (R3D2), a generalist agent for Hanabi, designed to overcome these limitations. We reformulate the task using text, as language has been shown to improve transfer. We then propose a distributed MARL algorithm that copes with the resulting dynamic observation- and action-space. In doing so, our agent is the first that can play all game settings concurrently, and extend strategies learned from one setting to other ones. As a consequence, our agent also demonstrates the ability to collaborate with different algorithmic agents -- agents that are themselves unable to do so. The implementation code is available at:

2025-01-21

ICLR.cc/2025/Conference (poster)

Crystal Design Amidst Noisy DFT Signals: A Reinforcement Learning Approach

Prashant Govindarajan

Santiago Miret

Mariano Phielipp

A. Chandar

2024-11-02

NeurIPS.cc/2024/Workshop/AI4Mat (published)

A Reinforcement Learning Pipeline for Band Gap-directed Crystal Generation

Prashant Govindarajan

Santiago Miret

Antoine Clavaud

Mariano Phielipp

Sarath Chandar

Property-driven AI-automated material discovery presents unique challenges owing to the complex nature of the chemical structural space and computationally expensive simulations. For crystalline solids, the band gap is an important property for designing semiconductors and batteries. However, optimizing crystals for a target band gap is difficult and not well-explored. Reinforcement learning (RL) shows promise towards optimizing crystals, as it can freely explore the chemical space. However, it relies on regular band gap evaluations, which can only be accurately computed through expensive Density Functional Theory (DFT) simulations. In this study, we propose an active learning-inspired pipeline that combines RL and DFT simulations for optimizing crystal compositions given a target band gap. The pipeline includes an RL policy for predicting atom types and a band gap network that is fine-tuned with DFT data. Preliminary results indicate the need for furthering the state-of-the-art to address the inherent challenges of the problem.

2024-07-07

AI4Mat @ University of Natural Resources and Life Sciences, Vienna (poster)