Aaron Courville

Alan Alan

PhD - Université de Montréal

Principal supervisor :

Simon Lacoste-Julien

Reza Bayat

PhD - Université de Montréal

Co-supervisor :

Pascal Vincent

Anirudh Buvanesh

PhD - Université de Montréal

Principal supervisor :

Laurent Charlin

Abhranil Chandra

Collaborating researcher - University of Waterloo

Master's Research - Université de Montréal

Juan Duque

PhD - Université de Montréal

PhD - Université de Montréal

Arian Hosseini

PhD - Université de Montréal

Amr Khalifa

PhD - Université de Montréal

Samuel Lavoie

PhD - Université de Montréal

Zhixuan Lin

PhD - Université de Montréal

Ahmed Masry

Collaborating researcher - N/A

Andjela Mladenovic

PhD - Université de Montréal

Principal supervisor :

PhD - Université de Montréal

PhD - Université de Montréal

Co-supervisor :

Rishabh Agarwal

Andrei Nicolicioiu

PhD - Université de Montréal

Evgenii Nikishin

Collaborating Alumni - Université de Montréal

Principal supervisor :

PhD - Université de Montréal

Johan Samir Obando Ceron

PhD - Université de Montréal

Co-supervisor :

Collaborating researcher - Université de Montréal

Dereck Piché

Master's Research - Université de Montréal

Khaled Rouissi

Master's Research - Université de Montréal

Esra'a Saleh

PhD - Université de Montréal

Principal supervisor :

Glen Berseth

Vedant Shah

PhD - Université de Montréal

PhD - Université de Montréal

Yusong Wu

PhD - Université de Montréal

Principal supervisor :

Anna (Cheng-Zhi) Huang

Sujin Yun

PhD - Université de Montréal

Xiaofeng Zhang

PhD - Université de Montréal

Dinghuai Zhang

PhD - Université de Montréal

Co-supervisor :

Yoshua Bengio

Hattie Zhou

PhD - Université de Montréal

Principal supervisor :

Hugo Larochelle

Publications

VinePPO: Refining Credit Assignment in RL Training of LLMs

Amirhossein Kazemnejad

Large language models (LLMs) are increasingly applied to complex reasoning tasks that require executing several complex steps before receiving any reward. Properly assigning credit to these steps is essential for enhancing model performance. Proximal Policy Optimization (PPO), a common reinforcement learning (RL) algorithm used for LLM finetuning, employs value networks to tackle credit assignment. However, recent approaches achieve strong results without them, raising questions about the efficacy of value networks in practice. In this work, we systematically evaluate the efficacy of value networks and reveal their significant shortcomings in reasoning-heavy LLM tasks, showing that they often produce poor estimates of expected return and barely outperform a random baseline when comparing alternative steps. This motivates our key question: Can improved credit assignment enhance RL training for LLMs? To address this, we propose VinePPO, a straightforward approach that leverages the flexibility of language environments to compute unbiased Monte Carlo-based estimates. Our method consistently outperforms PPO and other baselines across MATH and GSM8K datasets in less wall-clock time (up to 3.0x). Crucially, it achieves higher test accuracy for a given training accuracy, capturing more generalization signal per sample. These results emphasize the importance of accurate credit assignment in RL training of LLMs.

2025-10-06

Proceedings of the 42nd International Conference on Machine Learning (published)

proceedings.mlr.press

openreview.net

Asymmetric Proximal Policy Optimization: mini-critics boost LLM reasoning

Jiashun Liu

Johan S. Obando-Ceron

Han Lu

Yancheng He

Weixun Wang

Wenbo Su

Bo Zheng

Most recent RL for LLMs (RL4LLM) methods avoid explicit critics, replacing them with average advantage baselines. This shift is largely pragmatic: conventional value functions are computationally expensive to train at LLM scale and often fail under sparse rewards and long reasoning horizons. We revisit this bottleneck from an architectural perspective and introduce Asymmetric Proximal Policy Optimization (AsyPPO), a simple and scalable framework that restores the critic's role while remaining efficient in large-model settings. AsyPPO employs a set of lightweight mini-critics, each trained on disjoint prompt shards. This design encourages diversity while preserving calibration, reducing value-estimation bias. Beyond robust estimation, AsyPPO leverages inter-critic uncertainty to refine the policy update: (i) masking advantages in states where critics agree and gradients add little learning signal, and (ii) filtering high-divergence states from entropy regularization, suppressing spurious exploration. After training on open-source data with only 5,000 samples, AsyPPO consistently improves learning stability and performance across multiple benchmarks over strong baselines, such as GRPO, achieving performance gains of more than six percent on Qwen3-4b-Base and about three percent on Qwen3-8b-Base and Qwen3-14b-Base over classic PPO, without additional tricks. These results highlight the importance of architectural innovations for scalable, efficient algorithms.

2025-10-02

ArXiv (preprint)

BiXSE: Improving Dense Retrieval via Probabilistic Graded Relevance Distillation

Joao Monteiro

Neural sentence embedding models for dense retrieval typically rely on binary relevance labels, treating query-document pairs as either relevant or irrelevant. However, real-world relevance often exists on a continuum, and recent advances in large language models (LLMs) have made it feasible to scale the generation of fine-grained graded relevance labels. In this work, we propose BiXSE, a simple and effective pointwise training method that optimizes binary cross-entropy (BCE) over LLM-generated graded relevance scores. BiXSE interprets these scores as probabilistic targets, enabling granular supervision from a single labeled query-document pair per query. Unlike pairwise or listwise losses that require multiple annotated comparisons per query, BiXSE achieves strong performance with reduced annotation and compute costs by leveraging in-batch negatives. Extensive experiments across sentence embedding (MMTEB) and retrieval benchmarks (BEIR, TREC-DL) show that BiXSE consistently outperforms softmax-based contrastive learning (InfoNCE), and matches or exceeds strong pairwise ranking baselines when trained on LLM-supervised data. BiXSE offers a robust, scalable alternative for training dense retrieval models as graded relevance supervision becomes increasingly accessible.

2025-08-09

ArXiv (preprint)

doi.org

BiXSE: Improving Dense Retrieval via Probabilistic Graded Relevance Distillation

Joao Monteiro

2025-08-01

arXiv (published)

doi.org

Sample, Predict, then Proceed: Self-Verification Sampling for Tool Use of LLMs

Shangmin Guo

Omar Darwiche Domingues

Raphaël Avalos

Florian Strub

Tool use in stateful environments presents unique challenges for large language models (LLMs), where existing test-time compute strategies relying on repeated trials in the environment are impractical. We propose dynamics modelling (DyMo), a method that augments LLMs with a state prediction capability alongside function calling during post-training. This enables LLMs to predict the future states of their actions through an internal environment model. On the Berkeley Function Calling Leaderboard V2, DyMo improves success rates and significantly reduces hallucinations. We further integrate the internal environment model into self-verification sampling (SVS), and show that this substantially improves pass^k over number of trials k, and allows the model to refuse unreliable outputs. Together, DyMo and SVS greatly enhance the effectiveness and reliability of LLMs for tool use. We believe this work charts a path towards scalable planning RL methods for LLM inference without repeatedly querying the oracle environment.

2025-07-17

EWRL/2025/Workshop (poster)

openreview.net

Compositional Discrete Latent Code for High Fidelity, Productive Diffusion Models

Samuel Lavoie

Michael Noukhovitch

We argue that diffusion models' success in modeling complex distributions is, for the most part, coming from their input conditioning. This paper investigates the representation used to condition diffusion models from the perspective that ideal representations should improve sample fidelity, be easy to generate, and be compositional to allow out-of-training sample generation. We introduce Discrete Latent Code (DLC), an image representation derived from Simplicial Embeddings trained with a self-supervised learning objective. DLCs are sequences of discrete tokens, as opposed to the standard continuous image embeddings. They are easy to generate and their compositionality enables sampling of novel images beyond the training distribution. Diffusion models trained with DLCs have improved generation fidelity, establishing a new state-of-the-art for unconditional image generation on ImageNet. Additionally, we show that composing DLCs allows the image generator to produce out-of-distribution samples that coherently combine the semantics of images in diverse ways. Finally, we showcase how DLCs can enable text-to-image generation by leveraging large-scale pretrained language models. We efficiently finetune a text diffusion language model to generate DLCs that produce novel samples outside of the image generator training distribution.

2025-07-16

ArXiv (preprint)

doi.org

Adaptive Computation Pruning for the Forgetting Transformer

Zhixuan Lin

Johan Samir Obando Ceron

Xu Owen He