Portrait of Amir-massoud Farahmand

Amir-massoud Farahmand

Core Academic Member
Associate Professor, Polytechnique Montréal
University of Toronto
Research Topics
Deep Learning
Machine Learning Theory
Reasoning
Reinforcement Learning

Biography

Amir-massoud Farahmand is an associate professor at the Department of Computer and Software Engineering, Polytechnique Montréal and a core academic member at Mila - Quebec Artificial Intelligence Institute, as well as an associate professor (status-only) at the Department of Computer Science, University of Toronto. He was a research scientist and CIFAR AI Chair at the Vector Institute in Toronto between 2018–2024, and principal research scientist at Mitsubishi Electric Research Laboratories (MERL) in Cambridge, USA between 2014-2018. He received his PhD from the University of Alberta in 2011, followed by postdoctoral fellowships at McGill University (2011–2014) and Carnegie Mellon University (CMU) (2014).

Amir-massoud’s research vision is to understand the computational and statistical mechanisms required to design efficient AI agents that interact with their environment and adaptively improve their long-term performance. He has experience in developing Reinforcement Learning and Machine Learning methods to solve industrially-motivated problems as well.

Current Students

Collaborating researcher - McGill University University
Collaborating researcher - University of Toronto

Publications

Relative Entropy Pathwise Policy Optimization
Claas Voelcker
Axel Brunnbauer
Marcel Hussing
Michal Nauman
Pieter Abbeel
Eric R. Eaton
Radu Grosu
Igor Gilitschenski
Score-function policy gradients have delivered strong results in game-playing, robotics and language-model fine-tuning. Yet its high-varianc… (see more)e often undermines training stability. On the other hand, pathwise policy gradients alleviate the training variance, but are reliable only when driven by an accurate action-conditioned value function which is notoriously hard to train without relying on past off-policy data. In this paper, we discuss how to construct a value-gradient driven, on-policy algorithm that allow training Q-value models purely from on-policy data, unlocking the possibility of using pathwise policy updates in the context of on-policy learning. We show how to balance stochastic policies for exploration with constrained policy updates for stable training, and evaluate important architectural components that facilitate accurate value function learning. Building on these insights, we propose Relative Entropy Pathwise Policy Optimization (REPPO), an efficient on-policy algorithm that combines the sample-efficiency of pathwise policy gradients with the simplicity and minimal memory footprint of standard on-policy learning. We demonstrate that REPPO provides strong empirical performance at decreased sample requirements, wall-clock time, memory footprint as well as high hyperparameter robustness in a set of experiments on two standard GPU-parallelized benchmarks.
Calibrated Value-Aware Model Learning with Stochastic Environment Models
Claas Voelcker
Anastasiia Pedan
Arash Ahmadian
Romina Abachi
Igor Gilitschenski
The idea of value-aware model learning, that models should produce accurate value estimates, has gained prominence in model-based reinforcem… (see more)ent learning. The MuZero loss, which penalizes a model's value function prediction compared to the ground-truth value function, has been utilized in several prominent empirical works in the literature. However, theoretical investigation into its strengths and weaknesses is limited. In this paper, we analyze the family of value-aware model learning losses, which includes the popular MuZero loss. We show that these losses, as normally used, are uncalibrated surrogate losses, which means that they do not always recover the correct model and value function. Building on this insight, we propose corrections to solve this issue. Furthermore, we investigate the interplay between the loss calibration, latent model architectures, and auxiliary losses that are commonly employed when training MuZero-style agents. We show that while deterministic models can be sufficient to predict accurate values, learning calibrated stochastic models is still advantageous.
Calibrated Value-Aware Model Learning with Probabilistic Environment Models
Claas Voelcker
Anastasiia Pedan
Arash Ahmadian
Romina Abachi
Igor Gilitschenski
The idea of value-aware model learning, that models should produce accurate value estimates, has gained prominence in model-based reinforcem… (see more)ent learning. The MuZero loss, which penalizes a model's value function prediction compared to the ground-truth value function, has been utilized in several prominent empirical works in the literature. However, theoretical investigation into its strengths and weaknesses is limited. In this paper, we analyze the family of value-aware model learning losses, which includes the popular MuZero loss. We show that these losses, as normally used, are uncalibrated surrogate losses, which means that they do not always recover the correct model and value function. Building on this insight, we propose corrections to solve this issue. Furthermore, we investigate the interplay between the loss calibration, latent model architectures, and auxiliary losses that are commonly employed when training MuZero-style agents. We show that while deterministic models can be sufficient to predict accurate values, learning calibrated stochastic models is still advantageous.
Calibrated Value-Aware Model Learning with Probabilistic Environment Models
Claas Voelcker
Anastasiia Pedan
Arash Ahmadian
Romina Abachi
Igor Gilitschenski
The idea of value-aware model learning, that models should produce accurate value estimates, has gained prominence in model-based reinforcem… (see more)ent learning. The MuZero loss, which penalizes a model's value function prediction compared to the ground-truth value function, has been utilized in several prominent empirical works in the literature. However, theoretical investigation into its strengths and weaknesses is limited. In this paper, we analyze the family of value-aware model learning losses, which includes the popular MuZero loss. We show that these losses, as normally used, are uncalibrated surrogate losses, which means that they do not always recover the correct model and value function. Building on this insight, we propose corrections to solve this issue. Furthermore, we investigate the interplay between the loss calibration, latent model architectures, and auxiliary losses that are commonly employed when training MuZero-style agents. We show that while deterministic models can be sufficient to predict accurate values, learning calibrated stochastic models is still advantageous.
A Truncated Newton Method for Optimal Transport
Mete Kemertas
Allan D. Jepson
A Truncated Newton Method for Optimal Transport
Mete Kemertas
Allan D. Jepson
PANDAS: Improving Many-shot Jailbreaking via Positive Affirmation, Negative Demonstration, and Adaptive Sampling
Avery Ma
Yangchen Pan
Many-shot jailbreaking circumvents the safety alignment of large language models by exploiting their ability to process long input sequences… (see more). To achieve this, the malicious target prompt is prefixed with hundreds of fabricated conversational turns between the user and the model. These fabricated exchanges are randomly sampled from a pool of malicious questions and responses, making it appear as though the model has already complied with harmful instructions. In this paper, we present PANDAS: a hybrid technique that improves many-shot jailbreaking by modifying these fabricated dialogues with positive affirmations, negative demonstrations, and an optimized adaptive sampling method tailored to the target prompt's topic. Extensive experiments on AdvBench and HarmBench, using state-of-the-art LLMs, demonstrate that PANDAS significantly outperforms baseline methods in long-context scenarios. Through an attention analysis, we provide insights on how long-context vulnerabilities are exploited and show how PANDAS further improves upon many-shot jailbreaking.
PANDAS: Improving Many-shot Jailbreaking via Positive Affirmation, Negative Demonstration, and Adaptive Sampling
Avery Ma
Yangchen Pan
Many-shot jailbreaking circumvents the safety alignment of large language models by exploiting their ability to process long input sequences… (see more). To achieve this, the malicious target prompt is prefixed with hundreds of fabricated conversational turns between the user and the model. These fabricated exchanges are randomly sampled from a pool of malicious questions and responses, making it appear as though the model has already complied with harmful instructions. In this paper, we present PANDAS: a hybrid technique that improves many-shot jailbreaking by modifying these fabricated dialogues with positive affirmations, negative demonstrations, and an optimized adaptive sampling method tailored to the target prompt's topic. Extensive experiments on AdvBench and HarmBench, using state-of-the-art LLMs, demonstrate that PANDAS significantly outperforms baseline methods in long-context scenarios. Through an attention analysis, we provide insights on how long-context vulnerabilities are exploited and show how PANDAS further improves upon many-shot jailbreaking.
Deflated Dynamics Value Iteration
Jongmin Lee
Amin Rakhsha
Ernest K. Ryu
The Value Iteration (VI) algorithm is an iterative procedure to compute the value function of a Markov decision process, and is the basis of… (see more) many reinforcement learning (RL) algorithms as well. As the error convergence rate of VI as a function of iteration
MAD-TD: Model-Augmented Data stabilizes High Update Ratio RL
Claas Voelcker
Marcel Hussing
Eric R. Eaton
Igor Gilitschenski
Building deep reinforcement learning (RL) agents that find a good policy with few samples has proven notoriously challenging. To achieve sam… (see more)ple efficiency, recent work has explored updating neural networks with large numbers of gradient steps for every new sample. While such high update-to-data (UTD) ratios have shown strong empirical performance, they also introduce instability to the training process. Previous approaches need to rely on periodic neural network parameter resets to address this instability, but restarting the training process is infeasible in many real-world applications and requires tuning the resetting interval. In this paper, we focus on one of the core difficulties of stable training with limited samples: the inability of learned value functions to generalize to unobserved on-policy actions. We mitigate this issue directly by augmenting the off-policy RL training process with a small amount of data generated from a learned world model. Our method, Model-Augmented Data for Temporal Difference learning (MAD-TD) uses small amounts of generated data to stabilize high UTD training and achieve competitive performance on the most challenging tasks in the DeepMind control suite. Our experiments further highlight the importance of employing a good model to generate data, MAD-TD's ability to combat value overestimation, and its practical stability gains for continued learning.
Improving Adversarial Transferability via Model Alignment
Avery Ma
Yangchen Pan
Philip Torr
Jindong Gu
Neural networks are susceptible to adversarial perturbations that are transferable across different models. In this paper, we introduce a no… (see more)vel model alignment technique aimed at improving a given source model's ability in generating transferable adversarial perturbations. During the alignment process, the parameters of the source model are fine-tuned to minimize an alignment loss. This loss measures the divergence in the predictions between the source model and another, independently trained model, referred to as the witness model. To understand the effect of model alignment, we conduct a geometric analysis of the resulting changes in the loss landscape. Extensive experiments on the ImageNet dataset, using a variety of model architectures, demonstrate that perturbations generated from aligned source models exhibit significantly higher transferability than those from the original source model.
When does Self-Prediction help? Understanding Auxiliary Tasks in Reinforcement Learning
Claas Voelcker
Igor Gilitschenski
We investigate the impact of auxiliary learning tasks such as observation reconstruction and latent self-prediction on the representation le… (see more)arning problem in reinforcement learning. We also study how they interact with distractions and observation functions in the MDP. We provide a theoretical analysis of the learning dynamics of observation reconstruction, latent self-prediction, and TD learning in the presence of distractions and observation functions under linear model assumptions. With this formalization, we are able to explain why latent-self prediction is a helpful \emph{auxiliary task}, while observation reconstruction can provide more useful features when used in isolation. Our empirical analysis shows that the insights obtained from our learning dynamics framework predicts the behavior of these loss functions beyond the linear model assumption in non-linear neural networks. This reinforces the usefulness of the linear model framework not only for theoretical analysis, but also practical benefit for applied problems.