Haque Ishfaq

Alumni Collaborator - McGill
Principal supervisor
Research Topics
Online Learning
Reinforcement Learning

Publications

Langevin Soft Actor-Critic: Efficient Exploration through Uncertainty-Driven Critic Learning
Mohammad Sami Nur Islam
Existing actor-critic algorithms, which are popular for continuous control reinforcement learning (RL) tasks, suffer from poor sample efficiency due to the lack of a principled exploration mechanism. Motivated by the success of Thompson sampling for efficient exploration in RL, we propose a novel model-free RL algorithm, Langevin Soft Actor Critic (LSAC), which prioritizes enhancing critic learning through uncertainty estimation over policy optimization. LSAC employs three key innovations: approximate Thompson sampling through distributional Langevin Monte Carlo (LMC) based …
Offline Multitask Representation Learning for Reinforcement Learning
Thanh Nguyen-Tang
Songtao Feng
Raman Arora
Mengdi Wang
Ming Yin
More Efficient Randomized Exploration for Reinforcement Learning via Approximate Sampling
Yixin Tan
Yu Yang
Qingfeng Lan
Jianfeng Lu
A. Rupam Mahmood
Pan Xu
Provable and Practical: Efficient Exploration in Reinforcement Learning via Langevin Monte Carlo
Qingfeng Lan
Pan Xu
A. Rupam Mahmood
Animashree Anandkumar
Kamyar Azizzadenesheli
We present a scalable and effective exploration strategy based on Thompson sampling for reinforcement learning (RL). One of the key shortcomings of existing Thompson sampling algorithms is the need to perform a Gaussian approximation of the posterior distribution, which is not a good surrogate in most practical settings. We instead directly sample the Q function from its posterior distribution by using Langevin Monte Carlo, an efficient type of Markov Chain Monte Carlo (MCMC) method. Our method only needs to perform noisy gradient descent updates to learn the exact posterior distribution of the Q function, which makes our approach easy to deploy in deep RL. We provide a rigorous theoretical analysis for the proposed method and demonstrate that, in the linear Markov decision process (linear MDP) setting, it has a regret bound of …
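As a rough illustration of the "noisy gradient descent" idea mentioned in this abstract, the sketch below runs Langevin Monte Carlo on a regularized least-squares loss with linear features. It is not the paper's algorithm: the function name `lmc_sample`, the quadratic loss, and the parameters `step_size`, `inv_temp`, and `reg` are all assumptions chosen for illustration.

```python
import numpy as np

def lmc_sample(features, targets, n_steps=500, step_size=1e-3,
               inv_temp=1.0, reg=1.0, seed=0):
    """Draw one approximate posterior sample of linear weights via LMC.

    Hypothetical helper: each step is a gradient step on a ridge loss
    plus Gaussian noise scaled by sqrt(2 * step_size / inv_temp).
    """
    rng = np.random.default_rng(seed)
    d = features.shape[1]
    w = np.zeros(d)
    for _ in range(n_steps):
        residual = features @ w - targets
        # Gradient of 0.5 * ||X w - y||^2 + 0.5 * reg * ||w||^2.
        grad = features.T @ residual + reg * w
        noise = rng.standard_normal(d)
        w = w - step_size * grad + np.sqrt(2.0 * step_size / inv_temp) * noise
    return w
```

In a Thompson-sampling style loop, one would presumably draw a fresh sample of the weights before acting and pick the action that is greedy with respect to the sampled value estimate. A larger `inv_temp` shrinks the injected noise, so the sample concentrates around the ridge solution.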
Randomized Exploration for Reinforcement Learning with General Value Function Approximation
Qiwen Cui
Viet Huy Nguyen
Alex Ayoub
Zhuoran Yang
Zhaoran Wang
Lin F. Yang
We propose a model-free reinforcement learning algorithm inspired by the popular randomized least squares value iteration (RLSVI) algorithm as well as the optimism principle. Unlike existing upper-confidence-bound (UCB) based approaches, which are often computationally intractable, our algorithm drives exploration by simply perturbing the training data with judiciously chosen i.i.d. scalar noises. To attain optimistic value function estimation without resorting to a UCB-style bonus, we introduce an optimistic reward sampling procedure. When the value functions can be represented by a function class …
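The idea of perturbing the training data with i.i.d. scalar noise can be sketched, under strong simplifying assumptions, as a ridge regression fit on noisy targets. The code below uses linear features rather than a general function class, and the max-over-samples step is only one plausible reading of "optimistic sampling"; the names `perturbed_ridge_fit` and `optimistic_prediction` and all parameters are hypothetical and not taken from the paper.

```python
import numpy as np

def perturbed_ridge_fit(features, targets, noise_std=1.0, reg=1.0, rng=None):
    """Fit ridge regression after adding i.i.d. Gaussian noise to the targets."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = features.shape
    noisy_targets = targets + noise_std * rng.standard_normal(n)
    gram = features.T @ features + reg * np.eye(d)
    return np.linalg.solve(gram, features.T @ noisy_targets)

def optimistic_prediction(features, targets, query, n_samples=10,
                          noise_std=1.0, reg=1.0, seed=0):
    """Assumed optimism heuristic: take the largest prediction at `query`
    over several independently perturbed fits."""
    rng = np.random.default_rng(seed)
    estimates = [
        float(query @ perturbed_ridge_fit(features, targets, noise_std, reg, rng))
        for _ in range(n_samples)
    ]
    return max(estimates)
```

Setting `noise_std=0` recovers the ordinary ridge estimate, which makes the role of the perturbation easy to isolate when experimenting.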
Randomized Exploration in Reinforcement Learning with General Value Function Approximation
Qiwen Cui
Viet Bang Nguyen
Alex Ayoub
Zhuoran Yang
Zhaoran Wang
Lin Yang
Randomized Least Squares Policy Optimization
Zhuoran Yang
Andrei-Stefan Lupu
Viet Bang Nguyen
Lewis Liu
Zhaoran Wang
Policy Optimization (PO) methods with function approximation are one of the most popular classes of Reinforcement Learning (RL) algorithms. However, designing provably efficient policy optimization algorithms remains a challenge. Recent work in this area has focused on incorporating upper confidence bound (UCB)-style bonuses to drive exploration in policy optimization. In this paper, we present Randomized Least Squares Policy Optimization (RLSPO) which is inspired by Thompson Sampling. We prove that, in an episodic linear kernel MDP setting, RLSPO achieves $\tilde{O}(d^{3/2} H^{3/2} \sqrt{T})$ worst-case (frequentist) regret, where H is the number of episodes, T is the total number of steps and d is the feature dimension. Finally, we evaluate RLSPO empirically and show that it is competitive with existing provably efficient PO algorithms.