Portrait of Amir-massoud Farahmand

Amir-massoud Farahmand

Core Academic Member
Associate Professor, Polytechnique Montréal
University of Toronto
Research Topics
Reinforcement Learning
Deep Learning
Reasoning
Machine Learning Theory

Biography

Amir-massoud Farahmand is an Associate Professor in the Department of Computer Engineering and Software Engineering at Polytechnique Montréal and a Core Academic Member at Mila - Quebec Artificial Intelligence Institute, as well as an Associate Professor (status-only) in the Department of Computer Science at the University of Toronto. He was a research scientist and CIFAR AI Chair at the Vector Institute in Toronto from 2018 to 2024, and a principal researcher at Mitsubishi Electric Research Laboratories (MERL) in Cambridge, USA, from 2014 to 2018. He received his PhD from the University of Alberta in 2011, followed by postdoctoral fellowships at McGill University (2011-2014) and Carnegie Mellon University (CMU) (2014).

Amir-massoud's research vision is to understand the computational and statistical mechanisms required to design efficient artificial intelligence agents that interact with their environment and adaptively improve their long-term performance. He has experience developing reinforcement learning and machine learning methods to solve industrial problems.

Current Students

Research Collaborator - McGill University
Research Collaborator - University of Toronto
PhD - Polytechnique

Publications

Calibrated Value-Aware Model Learning with Probabilistic Environment Models
Claas Voelcker
Anastasiia Pedan
Arash Ahmadian
Igor Gilitschenski
The idea of value-aware model learning, that models should produce accurate value estimates, has gained prominence in model-based reinforcement learning. The MuZero loss, which penalizes a model’s value function prediction compared to the ground-truth value function, has been utilized in several prominent empirical works in the literature. However, theoretical investigation into its strengths and weaknesses is limited. In this paper, we analyze the family of value-aware model learning losses, which includes the popular MuZero loss. We show that these losses, as normally used, are uncalibrated surrogate losses, which means that they do not always recover the correct model and value function. Building on this insight, we propose corrections to solve this issue. Furthermore, we investigate the interplay between the loss calibration, latent model architectures, and auxiliary losses that are commonly employed when training MuZero-style agents. We show that while deterministic models can be sufficient to predict accurate values, learning calibrated stochastic models is still advantageous.
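
A minimal numeric illustration (not the paper's construction) of the calibration issue described above: with a nonlinear value function and stochastic dynamics, the deterministic next-state prediction that minimizes a MuZero-style value-matching loss can induce the right value while drifting away from the environment's actual mean transition. The value function V and the Gaussian transition below are arbitrary toy choices.

    import numpy as np

    # Toy sketch of an uncalibrated value-aware loss: a deterministic prediction
    # s_hat is scored by the squared value error (V(s_hat) - V(s'))^2, averaged
    # over the random true next state s'.
    rng = np.random.default_rng(0)

    V = lambda s: s ** 2                        # arbitrary nonlinear "value function"
    next_states = rng.normal(1.0, 1.0, 10_000)  # stochastic next states: mean 1, std 1

    mean_state = next_states.mean()             # what a calibrated dynamics model targets (~1.0)
    target_value = V(next_states).mean()        # E[V(s')] = mean^2 + var ~= 2.0

    # Deterministic prediction minimizing the value-aware loss (grid search).
    candidates = np.linspace(0.0, 4.0, 2001)
    losses = [np.mean((V(c) - V(next_states)) ** 2) for c in candidates]
    s_hat = candidates[int(np.argmin(losses))]

    print(f"mean next state   : {mean_state:.2f}")    # ~1.00
    print(f"E[V(s')]          : {target_value:.2f}")  # ~2.00
    print(f"value-aware s_hat : {s_hat:.2f}")         # ~1.41, not the mean next state
    print(f"V(s_hat)          : {V(s_hat):.2f}")      # ~2.00: the value matches, the model does not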
Categorical Distributional Reinforcement Learning with Kullback-Leibler Divergence: Convergence and Asymptotics
Mark Rowland
Yunhao Tang
Murat A Erdogdu
We study the problem of distributional reinforcement learning using categorical parametrisations and a KL divergence loss. Previous work analyzing categorical distributional RL has done so using a Cramér distance-based loss, simplifying the analysis but creating a theory-practice gap. We introduce a preconditioned version of the algorithm, and prove that it is guaranteed to converge. We further derive the asymptotic variance of the categorical estimates under different learning rate regimes, and compare to that of classical reinforcement learning. We finally validate our theoretical results empirically, investigate the relative strengths of using KL losses, and derive a number of actionable insights for practitioners.
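
A minimal tabular sketch of the categorical parametrisation with a KL (cross-entropy) loss on a fixed support, in the spirit of C51, for a single-state chain with reward 1 and discount 0.9 (true return 1/(1-gamma) = 10). This only illustrates the setting analyzed above; it is not the preconditioned algorithm proposed in the paper, and the support, learning rate, and chain are arbitrary toy choices.

    import numpy as np

    # Categorical distributional TD with a KL / cross-entropy loss on a fixed support.
    N_ATOMS, V_MIN, V_MAX = 11, 0.0, 10.0
    z = np.linspace(V_MIN, V_MAX, N_ATOMS)   # support (atoms)
    dz = z[1] - z[0]
    gamma, reward, lr = 0.9, 1.0, 0.1

    def project(probs_next):
        """Project the distribution of reward + gamma * Z(s') back onto the support z."""
        target = np.zeros(N_ATOMS)
        tz = np.clip(reward + gamma * z, V_MIN, V_MAX)  # shifted/scaled atoms
        b = (tz - V_MIN) / dz                           # fractional grid positions
        lo, hi = np.floor(b).astype(int), np.ceil(b).astype(int)
        for j in range(N_ATOMS):
            if lo[j] == hi[j]:
                target[lo[j]] += probs_next[j]
            else:
                target[lo[j]] += probs_next[j] * (hi[j] - b[j])
                target[hi[j]] += probs_next[j] * (b[j] - lo[j])
        return target

    logits = np.zeros(N_ATOMS)               # logits parametrise the return distribution
    for _ in range(2000):
        probs = np.exp(logits) / np.exp(logits).sum()
        target = project(probs)              # bootstrap target (single recurrent state)
        # Gradient of KL(target || softmax(logits)) w.r.t. logits equals probs - target.
        logits -= lr * (probs - target)

    probs = np.exp(logits) / np.exp(logits).sum()
    print("estimated mean return:", float(probs @ z))  # increases toward the true return of 10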
PANDAS: Improving Many-shot Jailbreaking via Positive Affirmation, Negative Demonstration, and Adaptive Sampling
Avery Ma
Yangchen Pan
Many-shot jailbreaking circumvents the safety alignment of LLMs by exploiting their ability to process long input sequences. To achieve this, the malicious target prompt is prefixed with hundreds of fabricated conversational exchanges between the user and the model. These exchanges are randomly sampled from a pool of unsafe question-answer pairs, making it appear as though the model has already complied with harmful instructions. In this paper, we present PANDAS: a hybrid technique that improves many-shot jailbreaking by modifying these fabricated dialogues with Positive Affirmations, Negative Demonstrations, and an optimized Adaptive Sampling method tailored to the target prompt’s topic. We also introduce ManyHarm, a dataset of harmful question-answer pairs, and demonstrate through extensive experiments that PANDAS significantly outperforms baseline methods in long-context scenarios. Through attention analysis, we provide insights into how long-context vulnerabilities are exploited and show how PANDAS further improves upon many-shot jailbreaking.
Relative Entropy Pathwise Policy Optimization
Claas Voelcker
Axel Brunnbauer
Marcel Hussing
Michal Nauman
Pieter Abbeel
Eric R. Eaton
Radu Grosu
Igor Gilitschenski
Score-function based methods for policy learning, such as REINFORCE and PPO, have delivered strong results in game-playing and robotics, yet their high variance often undermines training stability. Using pathwise policy gradients, i.e. computing a derivative by differentiating the objective function, alleviates the variance issues. However, they require an accurate action-conditioned value function, which is notoriously hard to learn without relying on replay buffers for reusing past off-policy data. We present an on-policy algorithm that trains Q-value models purely from on-policy trajectories, unlocking the possibility of using pathwise policy updates in the context of on-policy learning. We show how to combine stochastic policies for exploration with constrained updates for stable training, and evaluate important architectural components that stabilize value function learning. The result, Relative Entropy Pathwise Policy Optimization (REPPO), is an efficient on-policy algorithm that combines the stability of pathwise policy gradients with the simplicity and minimal memory footprint of standard on-policy learning. Compared to state-of-the-art on two standard GPU-parallelized benchmarks, REPPO provides strong empirical performance at superior sample efficiency, wall-clock time, memory footprint, and hyperparameter robustness.
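
A toy contrast (illustrative only) between the score-function and pathwise gradient estimators discussed above, for the mean of a 1-D Gaussian policy and a known differentiable Q. The quadratic Q, the policy parameters, and the sample size are assumptions made for this example; none of REPPO's on-policy Q-learning or relative-entropy machinery is modelled here.

    import numpy as np

    # Compare two unbiased estimators of d/dmu E_{a ~ N(mu, sigma^2)}[Q(a)]:
    #   score-function (REINFORCE-style): Q(a) * d/dmu log pi(a) = Q(a) * (a - mu) / sigma^2
    #   pathwise (reparameterised):       dQ/da evaluated at a = mu + sigma * eps
    rng = np.random.default_rng(0)

    mu, sigma = 0.0, 1.0
    Q = lambda a: -(a - 2.0) ** 2           # toy "critic": best action is a = 2
    dQ = lambda a: -2.0 * (a - 2.0)         # its action-gradient, used by the pathwise estimator

    eps = rng.standard_normal(10_000)
    a = mu + sigma * eps                    # reparameterised action samples

    score_grad = Q(a) * (a - mu) / sigma**2
    pathwise_grad = dQ(a)

    true_grad = 2.0 * (2.0 - mu)            # closed form: d/dmu E[-(a - 2)^2] = 2 * (2 - mu)
    print(f"true gradient  : {true_grad:.2f}")
    print(f"score-function : mean {score_grad.mean():.2f}, std {score_grad.std():.2f}")
    print(f"pathwise       : mean {pathwise_grad.mean():.2f}, std {pathwise_grad.std():.2f}")

Both estimators agree in expectation, but the pathwise estimator's variance is far lower, which is the property pathwise policy updates exploit; the trade-off noted in the abstract is that they require an accurate, differentiable action-conditioned value function.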
Calibrated Value-Aware Model Learning with Stochastic Environment Models
Claas Voelcker
Anastasiia Pedan
Arash Ahmadian
Igor Gilitschenski
The idea of value-aware model learning, that models should produce accurate value estimates, has gained prominence in model-based reinforcement learning. The MuZero loss, which penalizes a model's value function prediction compared to the ground-truth value function, has been utilized in several prominent empirical works in the literature. However, theoretical investigation into its strengths and weaknesses is limited. In this paper, we analyze the family of value-aware model learning losses, which includes the popular MuZero loss. We show that these losses, as normally used, are uncalibrated surrogate losses, which means that they do not always recover the correct model and value function. Building on this insight, we propose corrections to solve this issue. Furthermore, we investigate the interplay between the loss calibration, latent model architectures, and auxiliary losses that are commonly employed when training MuZero-style agents. We show that while deterministic models can be sufficient to predict accurate values, learning calibrated stochastic models is still advantageous.
A Truncated Newton Method for Optimal Transport
Mete Kemertas
Allan D. Jepson