Amir-massoud Farahmand

Core Academic Member
Associate Professor, Polytechnique Montréal
University of Toronto
Research Topics
Deep Learning
Machine Learning Theory
Reasoning
Reinforcement Learning

Biography

Amir-massoud Farahmand is an associate professor at the Department of Computer and Software Engineering, Polytechnique Montréal and a core academic member at Mila - Quebec Artificial Intelligence Institute, as well as an associate professor (status-only) at the Department of Computer Science, University of Toronto. He was a research scientist and CIFAR AI Chair at the Vector Institute in Toronto from 2018 to 2024, and principal research scientist at Mitsubishi Electric Research Laboratories (MERL) in Cambridge, USA from 2014 to 2018. He received his PhD from the University of Alberta in 2011, followed by postdoctoral fellowships at McGill University (2011–2014) and Carnegie Mellon University (CMU) (2014).

Amir-massoud’s research vision is to understand the computational and statistical mechanisms required to design efficient AI agents that interact with their environment and adaptively improve their long-term performance. He also has experience developing reinforcement learning and machine learning methods for industrially motivated problems.

Current Students

Collaborating researcher - McGill University
Collaborating researcher - University of Toronto
Collaborating researcher - Polytechnique Montréal
Master's Research - Polytechnique Montréal

Publications

Press Start to Charge: Videogaming the Online Centralized Charging Scheduling Problem
Alireza Ghahtarani
Martin Cousineau
Jorge E. Mendoza
Majority of the Bests: Improving Best-of-N via Bootstrapping
Amin Rakhsha
Amir Khasahmadi
Sampling multiple outputs from a Large Language Model (LLM) and selecting the most frequent (Self-consistency) or highest-scoring (Best-of-N) candidate is a popular approach to achieve higher accuracy in tasks with discrete final answers. Best-of-N (BoN) selects the output with the highest reward, and with perfect rewards, it often achieves near-perfect accuracy. With imperfect rewards from reward models, however, BoN fails to reliably find the correct answer and its performance degrades drastically. We consider the distribution of BoN's outputs and highlight that, although the correct answer does not usually have a probability close to one under imperfect rewards, it is often the most likely outcome. This suggests that the mode of this distribution can be more reliably correct than a sample from it. Based on this idea, we propose Majority-of-the-Bests (MoB), a novel selection mechanism that estimates the output distribution of BoN via bootstrapping and selects its mode. Experimental results across five benchmarks, three different base LLMs, and two reward models demonstrate consistent improvements over BoN in 25 out of 30 setups. We also provide theoretical results for the consistency of the bootstrapping. MoB serves as a simple, yet strong alternative to BoN and self-consistency, and more broadly, motivates further research in more nuanced selection mechanisms.
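The selection rule described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `(answer, reward)` tuple format, and the choice to resample subsets smaller than the full pool (`subset_size`) are assumptions made for illustration only.

```python
import random
from collections import Counter


def best_of_n(candidates):
    """Plain BoN: return the answer of the highest-reward candidate."""
    return max(candidates, key=lambda c: c[1])[0]


def majority_of_the_bests(candidates, subset_size, num_bootstrap=500, seed=0):
    """Estimate the distribution of BoN's output by bootstrapping, return its mode.

    candidates: list of (answer, reward) pairs (hypothetical format).
    subset_size: size of each resample, drawn with replacement. It is kept
    smaller than len(candidates) here (an assumption of this sketch), because
    a full-size resample contains the top-reward candidate most of the time
    and the mode would then simply reproduce BoN.
    """
    rng = random.Random(seed)
    winners = Counter()
    for _ in range(num_bootstrap):
        resample = [rng.choice(candidates) for _ in range(subset_size)]
        winners[best_of_n(resample)] += 1
    # The most frequent BoN winner across resamples is the estimated mode.
    return winners.most_common(1)[0][0]


# Toy pool: "42" is sampled often with decent rewards, while a single
# wrong answer "41" is overrated by the (imperfect) reward model.
pool = [("42", 0.70), ("42", 0.68), ("42", 0.66), ("42", 0.64), ("42", 0.60),
        ("41", 0.95), ("40", 0.30), ("40", 0.20)]
print(best_of_n(pool))
print(majority_of_the_bests(pool, subset_size=2))
```

In this toy pool, BoN returns the overrated answer "41", whereas the bootstrapped mode is "42", because "42" wins most of the small resamples.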
Categorical Distributional Reinforcement Learning with Kullback-Leibler Divergence: Convergence and Asymptotics
Mark Rowland
Yunhao Tang
Murat A Erdogdu
PANDAS: Improving Many-shot Jailbreaking via Positive Affirmation, Negative Demonstration, and Adaptive Sampling
Avery Ma
Yangchen Pan
Many-shot jailbreaking circumvents the safety alignment of large language models by exploiting their ability to process long input sequences. To achieve this, the malicious target prompt is prefixed with hundreds of fabricated conversational turns between the user and the model. These fabricated exchanges are randomly sampled from a pool of malicious questions and responses, making it appear as though the model has already complied with harmful instructions. In this paper, we present PANDAS: a hybrid technique that improves many-shot jailbreaking by modifying these fabricated dialogues with positive affirmations, negative demonstrations, and an optimized adaptive sampling method tailored to the target prompt's topic. Extensive experiments on AdvBench and HarmBench, using state-of-the-art LLMs, demonstrate that PANDAS significantly outperforms baseline methods in long-context scenarios. Through an attention analysis, we provide insights on how long-context vulnerabilities are exploited and show how PANDAS further improves upon many-shot jailbreaking.
Relative Entropy Pathwise Policy Optimization
Claas Voelcker
Axel Brunnbauer
Marcel Hussing
Michal Nauman
Pieter Abbeel
Eric R. Eaton
Radu Grosu
Igor Gilitschenski
Score-function policy gradients have delivered strong results in game-playing, robotics and language-model fine-tuning. Yet their high variance often undermines training stability. On the other hand, pathwise policy gradients alleviate the training variance, but are reliable only when driven by an accurate action-conditioned value function which is notoriously hard to train without relying on past off-policy data. In this paper, we discuss how to construct a value-gradient driven, on-policy algorithm that allows training Q-value models purely from on-policy data, unlocking the possibility of using pathwise policy updates in the context of on-policy learning. We show how to balance stochastic policies for exploration with constrained policy updates for stable training, and evaluate important architectural components that facilitate accurate value function learning. Building on these insights, we propose Relative Entropy Pathwise Policy Optimization (REPPO), an efficient on-policy algorithm that combines the sample-efficiency of pathwise policy gradients with the simplicity and minimal memory footprint of standard on-policy learning. We demonstrate that REPPO provides strong empirical performance at decreased sample requirements, wall-clock time, memory footprint as well as high hyperparameter robustness in a set of experiments on two standard GPU-parallelized benchmarks.
Calibrated Value-Aware Model Learning with Probabilistic Environment Models
Claas Voelcker
Anastasiia Pedan
Arash Ahmadian
Igor Gilitschenski
The idea of value-aware model learning, that models should produce accurate value estimates, has gained prominence in model-based reinforcement learning. The MuZero loss, which penalizes a model's value function prediction compared to the ground-truth value function, has been utilized in several prominent empirical works in the literature. However, theoretical investigation into its strengths and weaknesses is limited. In this paper, we analyze the family of value-aware model learning losses, which includes the popular MuZero loss. We show that these losses, as normally used, are uncalibrated surrogate losses, which means that they do not always recover the correct model and value function. Building on this insight, we propose corrections to solve this issue. Furthermore, we investigate the interplay between the loss calibration, latent model architectures, and auxiliary losses that are commonly employed when training MuZero-style agents. We show that while deterministic models can be sufficient to predict accurate values, learning calibrated stochastic models is still advantageous.
A Truncated Newton Method for Optimal Transport
Mete Kemertas
Allan D. Jepson
Deflated Dynamics Value Iteration
Jongmin Lee
Amin Rakhsha
Ernest K. Ryu
The Value Iteration (VI) algorithm is an iterative procedure to compute the value function of a Markov decision process, and is the basis of many reinforcement learning (RL) algorithms as well. As the error convergence rate of VI as a function of iteration
Efficient and Accurate Optimal Transport with Mirror Descent and Conjugate Gradients
Mete Kemertas
Allan Jepson
MAD-TD: Model-Augmented Data stabilizes High Update Ratio RL
Claas Voelcker
Marcel Hussing
Eric R. Eaton
Igor Gilitschenski
Building deep reinforcement learning (RL) agents that find a good policy with few samples has proven notoriously challenging. To achieve sample efficiency, recent work has explored updating neural networks with large numbers of gradient steps for every new sample. While such high update-to-data (UTD) ratios have shown strong empirical performance, they also introduce instability to the training process. Previous approaches need to rely on periodic neural network parameter resets to address this instability, but restarting the training process is infeasible in many real-world applications and requires tuning the resetting interval. In this paper, we focus on one of the core difficulties of stable training with limited samples: the inability of learned value functions to generalize to unobserved on-policy actions. We mitigate this issue directly by augmenting the off-policy RL training process with a small amount of data generated from a learned world model. Our method, Model-Augmented Data for Temporal Difference learning (MAD-TD), uses small amounts of generated data to stabilize high UTD training and achieve competitive performance on the most challenging tasks in the DeepMind control suite. Our experiments further highlight the importance of employing a good model to generate data, MAD-TD's ability to combat value overestimation, and its practical stability gains for continued learning.
Improving Adversarial Transferability via Model Alignment
Avery Ma
Yangchen Pan
Philip Torr
Jindong Gu
Neural networks are susceptible to adversarial perturbations that are transferable across different models. In this paper, we introduce a novel model alignment technique aimed at improving a given source model's ability in generating transferable adversarial perturbations. During the alignment process, the parameters of the source model are fine-tuned to minimize an alignment loss. This loss measures the divergence in the predictions between the source model and another, independently trained model, referred to as the witness model. To understand the effect of model alignment, we conduct a geometric analysis of the resulting changes in the loss landscape. Extensive experiments on the ImageNet dataset, using a variety of model architectures, demonstrate that perturbations generated from aligned source models exhibit significantly higher transferability than those from the original source model.
Dissecting Deep RL with High Update Ratios: Combatting Value Overestimation and Divergence
Marcel Hussing
Claas Voelcker
Igor Gilitschenski
Eric R. Eaton
We show that deep reinforcement learning can maintain its ability to learn without resetting network parameters in settings where the number of gradient updates greatly exceeds the number of environment samples. Under such large update-to-data ratios, a recent study by Nikishin et al. (2022) suggested the emergence of a primacy bias, in which agents overfit early interactions and downplay later experience, impairing their ability to learn. In this work, we dissect the phenomena underlying the primacy bias. We inspect the early stages of training that ought to cause the failure to learn and find that a fundamental challenge is a long-standing acquaintance: value overestimation. Overinflated Q-values are found not only on out-of-distribution but also in-distribution data and can be traced to unseen action prediction propelled by optimizer momentum. We employ a simple unit-ball normalization that enables learning under large update ratios, show its efficacy on the widely used dm_control suite, and obtain strong performance on the challenging dog tasks, competitive with model-based approaches. Our results question, in parts, the prior explanation for sub-optimal learning due to overfitting on early data.
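To give a rough intuition for why a unit-norm constraint can limit value overestimation, here is a small sketch. The abstract does not say where the normalization is applied, so this sketch assumes it rescales a feature vector to unit L2 norm before a linear value head; the function names and the linear head are hypothetical and not taken from the paper.

```python
import math


def unit_norm(features, eps=1e-8):
    """Rescale a feature vector to (approximately) unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in features))
    return [x / (norm + eps) for x in features]


def q_value(features, weights):
    """Hypothetical linear value head applied to normalized features."""
    phi = unit_norm(features)
    return sum(w * p for w, p in zip(weights, phi))


weights = [1.0, 1.0]
# Scaling the raw features by 100 leaves the value estimate essentially
# unchanged, since only the direction of the feature vector matters.
print(q_value([3.0, 4.0], weights))
print(q_value([300.0, 400.0], weights))
```

Under this assumption, the Cauchy-Schwarz inequality bounds the estimate in magnitude by the L2 norm of the weights, so growth in the feature scale alone cannot inflate the predicted values.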