
Hossein Aboutalebi

Alumni

Publications

Membership Inference Attacks Against Temporally Correlated Data in Deep Reinforcement Learning
While significant research advances have been made in the field of deep reinforcement learning, there have been no concrete adversarial attack strategies in the literature tailored for studying the vulnerability of deep reinforcement learning algorithms to membership inference attacks. In such attacks, the adversary targets the set of collected input data on which the deep reinforcement learning algorithm has been trained. To address this gap, we propose an adversarial attack framework designed for testing the vulnerability of a state-of-the-art deep reinforcement learning algorithm to a membership inference attack. In particular, we design a series of experiments to investigate the impact of temporal correlation, which naturally exists in reinforcement learning training data, on the probability of information leakage. Moreover, we compare the performance of collective and individual membership attacks against the deep reinforcement learning algorithm. Experimental results show that the proposed adversarial attack framework is surprisingly effective at inferring data, with an accuracy exceeding …
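The distinction between individual and collective attacks can be illustrated with a toy sketch. This is not the paper's attack: we simply assume a trained policy behaves more consistently (lower score variance) on trajectories it was trained on ("members") than on unseen ones, and threshold a behavioral score either per trajectory (individual) or averaged over a batch (collective). All names, distributions, and the score function are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical familiarity signal: lower variance on member trajectories.
def behavioral_score(trajectory):
    return -np.var(trajectory)

members = np.array([behavioral_score(rng.normal(0.0, 0.5, 20)) for _ in range(200)])
nonmembers = np.array([behavioral_score(rng.normal(0.0, 1.0, 20)) for _ in range(200)])

# Individual attack: threshold each trajectory's score separately.
threshold = (members.mean() + nonmembers.mean()) / 2
individual_acc = (np.sum(members > threshold) + np.sum(nonmembers <= threshold)) / 400

# Collective attack: average scores over batches of 10 trajectories first,
# exploiting shared membership status within the batch before thresholding.
batched = lambda xs: xs.reshape(20, 10).mean(axis=1)
collective_acc = (np.sum(batched(members) > threshold)
                  + np.sum(batched(nonmembers) <= threshold)) / 40
print(f"individual: {individual_acc:.2f}, collective: {collective_acc:.2f}")
```

Averaging over a batch shrinks the score noise, which is why a collective decision is typically easier than an individual one.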
Locally Persistent Exploration in Continuous Control Tasks with Sparse Rewards
A major challenge in reinforcement learning is the design of exploration strategies, especially for environments with sparse reward structures and continuous state and action spaces. Intuitively, if the reinforcement signal is very scarce, the agent should rely on some form of short-term memory in order to cover its environment efficiently. We propose a new exploration method, based on two intuitions: (1) the choice of the next exploratory action should depend not only on the (Markovian) state of the environment, but also on the agent's trajectory so far, and (2) the agent should utilize a measure of spread in the state space to avoid getting stuck in a small region. Our method leverages concepts often used in statistical physics to provide explanations for the behavior of simplified (polymer) chains in order to generate persistent (locally self-avoiding) trajectories in state space. We discuss the theoretical properties of locally self-avoiding walks and their ability to provide a kind of short-term memory through a decaying temporal correlation within the trajectory. We provide empirical evaluations of our approach in a simulated 2D navigation task, as well as higher-dimensional MuJoCo continuous control locomotion tasks with sparse rewards.
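The polymer-physics intuition can be sketched with a persistent 2D random walk: the next step direction is correlated with the previous one (a decaying temporal correlation), and spread is measured by the radius of gyration, a standard polymer quantity. This is an illustrative sketch, not the paper's algorithm; the `persistence` parameter and the walk model are assumptions for demonstration.

```python
import numpy as np

rng = np.random.default_rng(1)

def walk(steps, persistence):
    # Persistent walk: blend the previous direction with a fresh random one.
    pos = np.zeros(2)
    direction = rng.normal(size=2)
    direction /= np.linalg.norm(direction)
    traj = [pos.copy()]
    for _ in range(steps):
        proposal = rng.normal(size=2)
        proposal /= np.linalg.norm(proposal)
        direction = persistence * direction + (1 - persistence) * proposal
        direction /= np.linalg.norm(direction)
        pos = pos + direction
        traj.append(pos.copy())
    return np.array(traj)

def radius_of_gyration(traj):
    # RMS distance of trajectory points from their centroid.
    center = traj.mean(axis=0)
    return np.sqrt(((traj - center) ** 2).sum(axis=1).mean())

# Average spread over a few runs: memoryless vs. persistent exploration.
plain = np.mean([radius_of_gyration(walk(500, 0.0)) for _ in range(5)])
persistent = np.mean([radius_of_gyration(walk(500, 0.9)) for _ in range(5)])
print(f"spread without memory: {plain:.1f}, with persistence: {persistent:.1f}")
```

A walk with directional memory covers far more of the space per step than a memoryless one, which is the short-term-memory effect the abstract describes.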
Where Did You Learn That From? Surprising Effectiveness of Membership Inference Attacks Against Temporally Correlated Data in Deep Reinforcement Learning
Learning Modular Safe Policies in the Bandit Setting with Application to Adaptive Clinical Trials
The stochastic multi-armed bandit problem is a well-known model for studying the exploration-exploitation trade-off. It has significant possible applications in adaptive clinical trials, which allow for dynamic changes in the treatment allocation probabilities of patients. However, most bandit learning algorithms are designed with the goal of minimizing the expected regret. While this approach is useful in many areas, in clinical trials, it can be sensitive to outlier data, especially when the sample size is small. In this paper, we define and study a new robustness criterion for bandit problems. Specifically, we consider optimizing a function of the distribution of returns as a regret measure. This provides practitioners more flexibility to define an appropriate regret measure. The learning algorithm we propose to solve this type of problem is a modification of the BESA algorithm [Baransi et al., 2014], which considers a more general version of regret. We present a regret bound for our approach and evaluate it empirically both on synthetic problems as well as on a dataset from the clinical trial literature. Our approach compares favorably to a suite of standard bandit algorithms.
Learning Reliable Policies in the Bandit Setting with Application to Adaptive Clinical Trials
The stochastic multi-armed bandit problem is a well-known model for studying the exploration-exploitation trade-off. It has significant possible applications in adaptive clinical trials, which allow for a dynamic change of patient allocation ratios. However, most bandit learning algorithms are designed with the goal of minimizing the expected regret. While this approach is useful in many areas, in clinical trials, it can be sensitive to outlier data, especially when the sample size is small. In this article, we propose a modification of the BESA algorithm [Baransi, Maillard, and Mannor, 2014] which takes into account the variance in the action outcomes in addition to the mean. We present a regret bound for our approach and evaluate it empirically both on synthetic problems as well as on a dataset from the clinical trial literature. Our approach compares favorably to a suite of standard bandit algorithms.
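Both papers build on the BESA sub-sampling idea, which can be sketched for two arms: the arm with more observations is sub-sampled down to the size of the other, and the arm with the higher sub-sampled mean is pulled. This minimal sketch shows only the vanilla comparison; the modifications above (variance- or distribution-aware regret) are not implemented here, and the Bernoulli reward rates are toy values.

```python
import random

random.seed(0)

def besa_choose(obs_a, obs_b):
    # Compare both arms on equal-sized sub-samples (core BESA step).
    n = min(len(obs_a), len(obs_b))
    sub_mean = lambda xs: sum(random.sample(xs, n)) / n
    return 0 if sub_mean(obs_a) >= sub_mean(obs_b) else 1

def pull(arm):
    # Hypothetical Bernoulli arms with success rates 0.6 and 0.5.
    return 1.0 if random.random() < [0.6, 0.5][arm] else 0.0

obs = [[pull(0)], [pull(1)]]   # one forced initial pull per arm
for _ in range(2000):
    arm = besa_choose(obs[0], obs[1])
    obs[arm].append(pull(arm))

print(f"pulls of better arm: {len(obs[0])}, worse arm: {len(obs[1])}")
```

Over time the comparison concentrates pulls on the better arm while the equal-sized sub-sampling keeps the comparison fair to the less-explored arm.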
Learning Predictive State Representations From Non-Uniform Sampling
Yuri Grinberg
Melanie Lyman-Abramovitch
Borja Balle
Predictive state representations (PSR) have emerged as a powerful method for modelling partially observable environments. PSR learning algorithms can build models for predicting all observable variables, or predicting only some of them conditioned on others (e.g., actions or exogenous variables). In the latter case, which we call conditional modelling, the accuracy of different estimates of the conditional probabilities for a fixed dataset can vary significantly, due to the limited sampling of certain conditions. This can have negative consequences on the PSR parameter estimation process, which are not taken into account by the current state-of-the-art PSR spectral learning algorithms. In this paper, we closely examine conditional modelling within the PSR framework. We first establish a new positive but surprisingly non-trivial result: a conditional model can never be larger than the complete model. Then, we address the core shortcoming of existing PSR spectral learning methods for conditional models by incorporating an additional step in the process, which can be seen as a type of matrix denoising. We further refine this objective by adding penalty terms for violations of the system dynamics matrix structure, which improves the PSR predictive performance. Empirical evaluations on both synthetic and real datasets highlight the advantages of the proposed approach.
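The matrix-denoising step mentioned above can be illustrated in its simplest assumed form (not the paper's exact objective, which adds structural penalty terms): spectral PSR methods estimate a low-rank system-dynamics matrix from data, and projecting a noisy empirical estimate onto its top singular directions suppresses noise from poorly-sampled conditions. The matrix sizes, rank, and noise level below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

# Rank-3 ground-truth matrix standing in for a system-dynamics matrix.
rank = 3
true = rng.normal(size=(20, rank)) @ rng.normal(size=(rank, 20))
noisy = true + rng.normal(scale=0.3, size=true.shape)  # noisy empirical estimate

# Truncated SVD: keep only the top `rank` singular directions.
U, s, Vt = np.linalg.svd(noisy, full_matrices=False)
denoised = U[:, :rank] * s[:rank] @ Vt[:rank]

err = lambda M: np.linalg.norm(M - true)
print(f"error before: {err(noisy):.2f}, after truncated SVD: {err(denoised):.2f}")
```

Because the noise spreads across all singular directions while the signal lives in only a few, the truncated reconstruction is closer to the true matrix than the raw estimate.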