Haque Ishfaq

Langevin Soft Actor-Critic: Efficient Exploration Through Uncertainty-Driven Critic Learning

Sami Nur Islam

Existing actor-critic algorithms, which are popular for continuous control reinforcement learning (RL) tasks, suffer from poor sample efficiency due to the lack of a principled exploration mechanism. Motivated by the success of Thompson sampling for efficient exploration in RL, we propose a novel model-free RL algorithm, Langevin Soft Actor Critic (LSAC), which prioritizes enhancing critic learning through uncertainty estimation over policy optimization. LSAC employs three key innovations: approximate Thompson sampling through distributional Langevin Monte Carlo (LMC) based

2025-04-22

International Conference on Learning Representations (Accept (Poster))

doi.org

openreview.net

More Efficient Randomized Exploration for Reinforcement Learning via Approximate Sampling

Haque Ishfaq

Yixin Tan

Yu Yang

Qingfeng Lan

Jianfeng Lu

A. Rupam Mahmood

Doina Precup

Pan Xu

2024-05-14

Reinforcement Learning Conference (published)

doi.org

openreview.net

Provable and Practical: Efficient Exploration in Reinforcement Learning via Langevin Monte Carlo

Haque Ishfaq

Qingfeng Lan

Pan Xu

A. Rupam Mahmood

Doina Precup

Anima Anandkumar

Kamyar Azizzadenesheli

We present a scalable and effective exploration strategy based on Thompson sampling for reinforcement learning (RL). One of the key shortcomings of existing Thompson sampling algorithms is the need to perform a Gaussian approximation of the posterior distribution, which is not a good surrogate in most practical settings. We instead directly sample the Q function from its posterior distribution, by using Langevin Monte Carlo, an efficient type of Markov Chain Monte Carlo (MCMC) method. Our method only needs to perform noisy gradient descent updates to learn the exact posterior distribution of the Q function, which makes our approach easy to deploy in deep RL. We provide a rigorous theoretical analysis for the proposed method and demonstrate that, in the linear Markov decision process (linear MDP) setting, it has a regret bound of

2024-01-15

International Conference on Learning Representations 2024 (poster)

doi.org

openreview.net

Offline Multitask Representation Learning for Reinforcement Learning

Raman Arora

Haque Ishfaq

Songtao Feng

Thanh Nguyen-Tang

Doina Precup

Mengdi Wang

Ming Yin

We study offline multitask representation learning in reinforcement learning (RL), where a learner is provided with an offline dataset from different tasks that share a common representation and is asked to learn the shared representation. We theoretically investigate offline multitask low-rank RL, and propose a new algorithm called MORL for offline multitask representation learning. Furthermore, we examine downstream RL in reward-free, offline and online scenarios, where a new task is introduced to the agent that shares the same representation as the upstream offline tasks. Our theoretical results demonstrate the benefits of using the learned representation from the upstream offline task instead of directly learning the representation of the low-rank model.

2023-12-31

Advances in Neural Information Processing Systems 37 (published)

doi.org

openreview.net

Randomized Exploration for Reinforcement Learning with General Value Function Approximation

Haque Ishfaq

Qiwen Cui

Viet Nguyen

Alex Ayoub

Zhuoran Yang

Zhaoran Wang

Doina Precup

Lin F. Yang

2021-06-14

ArXiv (preprint)

proceedings.mlr.press

Randomized Least Squares Policy Optimization

Haque Ishfaq

Zhuoran Yang

Andrei Lupu

Viet Bang Nguyen

Lewis Liu

Riashat Islam

Zhaoran Wang

Doina Precup

Policy Optimization (PO) methods with function approximation are one of the most popular classes of Reinforcement Learning (RL) algorithms. However, designing provably efficient policy optimization algorithms remains a challenge. Recent work in this area has focused on incorporating upper confidence bound (UCB)-style bonuses to drive exploration in policy optimization. In this paper, we present Randomized Least Squares Policy Optimization (RLSPO) which is inspired by Thompson Sampling. We prove that, in an episodic linear kernel MDP setting, RLSPO achieves $\tilde{O}(d^{3/2} H^{3/2} \sqrt{T})$ worst-case (frequentist) regret, where H is the number of episodes, T is the total number of steps and d is the feature dimension. Finally, we evaluate RLSPO empirically and show that it is competitive with existing provably efficient PO algorithms.

2020-12-31

(published)

www.semanticscholar.org
