Publications

Rethinking Learning Dynamics in RL using Adversarial Networks

Recent years have seen tremendous progress in methods of reinforcement learning. However, most of these approaches have been trained in a straightforward fashion and are generally not robust to adversity, especially in the meta-RL setting. To the best of our knowledge, our work is the first to propose an adversarial training regime for Multi-Task Reinforcement Learning, which requires no manual intervention or domain knowledge of the environments. Our experiments on multiple environments in the Multi-Task Reinforcement Learning domain demonstrate that the adversarial process leads to a better exploration of numerous solutions and a deeper understanding of the environment. We also adapt existing measures of causal attribution to draw insights from the skills learned, facilitating easier re-purposing of skills for adaptation to unseen environments and tasks.

2022-12-08

NeurIPS.cc/2022/Workshop/DeepRL (unknown)

openreview.net

On The Fragility of Learned Reward Functions

Lev E McKinney

Yawen Duan

David M. Krueger

Adam Gleave

Reward functions are notoriously difficult to specify, especially for tasks with complex goals. Reward learning approaches attempt to infer reward functions from human feedback and preferences. Prior works on reward learning have mainly focused on the performance of policies trained alongside the reward function. This practice, however, may fail to detect learned rewards that are not capable of training new policies from scratch and thus do not capture the intended behavior. Our work focuses on demonstrating and studying the causes of these relearning failures in the domain of preference-based reward learning. We demonstrate with experiments in tabular and continuous control environments that the severity of relearning failures can be sensitive to changes in reward model design and the trajectory dataset composition. Based on our findings, we emphasize the need for more retraining-based evaluations in the literature.
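
For readers unfamiliar with preference-based reward learning, the following is a minimal sketch of the loss that such methods commonly minimize, assuming a Bradley-Terry preference model over summed segment rewards. The abstract does not state which model the authors use, so the tabular reward vector, the function name `preference_loss`, and all array shapes here are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def preference_loss(reward, seg_a, seg_b, prefers_a):
    """Cross-entropy loss of a Bradley-Terry preference model (illustrative).

    reward:    (S,) tabular reward, one value per state (hypothetical reward model).
    seg_a/b:   (K, T) integer arrays of visited states for K pairs of segments.
    prefers_a: (K,) array, 1.0 if the labeller preferred segment A, else 0.0.
    """
    # Return of each segment under the current reward estimate.
    ret_a = reward[seg_a].sum(axis=1)
    ret_b = reward[seg_b].sum(axis=1)
    # Model assumption: P(A preferred) = sigmoid(ret_a - ret_b).
    logits = ret_a - ret_b
    # Numerically stable binary cross-entropy with logits.
    return (np.logaddexp(0.0, logits) - prefers_a * logits).mean()

# Two states; segment A stays in state 1, segment B stays in state 0.
seg_a = np.ones((3, 4), dtype=int)
seg_b = np.zeros((3, 4), dtype=int)
prefers_a = np.ones(3)

uninformed = preference_loss(np.zeros(2), seg_a, seg_b, prefers_a)
informed = preference_loss(np.array([0.0, 1.0]), seg_a, seg_b, prefers_a)
```

A reward that agrees with the labels (`informed`) gives a lower loss than an all-zero reward (`uninformed`). A low loss on the preference dataset does not by itself show that the learned reward can train a new policy from scratch, which is the gap the retraining-based evaluations above are meant to expose.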

2022-12-08

NeurIPS.cc/2022/Workshop/DeepRL (unknown)

doi.org

openreview.net

Training Equilibria in Reinforcement Learning

Lauro Langosco

David M. Krueger

Adam Gleave

In partially observable environments, reinforcement learning algorithms such as policy gradient and Q-learning may have multiple equilibria (policies that are stable under further training) and can converge to equilibria that are strictly suboptimal. Prior work blames insufficient exploration, but suboptimal equilibria can arise despite full exploration and other favorable circumstances like a flexible policy parametrization. We show theoretically that the core problem is that in partially observed environments, an agent's past actions induce a distribution on hidden states. Equipping the policy with memory helps it model the hidden state and leads to convergence to a higher reward equilibrium, even when there exists a memoryless optimal policy. Experiments show that policies with insufficient memory tend to learn to use the environment as auxiliary memory, and parameter noise helps policies escape suboptimal equilibria.

2022-12-08

NeurIPS.cc/2022/Workshop/DeepRL (unknown)

openreview.net

Unleashing The Potential of Data Sharing in Ensemble Deep Reinforcement Learning

This work studies a crucial but often overlooked element of ensemble methods in deep reinforcement learning: data sharing between ensemble members. We show that data sharing enables peer learning, a powerful learning process in which individual agents learn from each other's experience to significantly improve their performance. When given access to the experience of other ensemble members, even the worst agent can match or outperform the previously best agent, triggering a virtuous circle. However, we show that peer learning can be unstable when the agents' ability to learn is impaired due to overtraining on early data. We thus employ the recently proposed solution of periodic resets and show that it ensures effective peer learning. We perform extensive experiments on continuous control tasks from both dense states and pixels to demonstrate the strong effect of peer learning and its interaction with resets.

2022-12-08

NeurIPS.cc/2022/Workshop/DeepRL (unknown)

openreview.net

PyNM: a Lightweight Python implementation of Normative Modeling

Annabelle Harvey

Guillaume Dumas

The majority of studies in neuroimaging and psychiatry are focussed on case-control analysis (Marquand et al., 2019). However, case-control analysis relies on well-defined groups, which are more the exception than the rule in biology. Psychiatric conditions are diagnosed based on symptoms alone, which makes for heterogeneity at the biological level (Marquand et al., 2016). Relying on mean differences obscures this heterogeneity, and the resulting loss of information can produce unreliable results or misleading conclusions (Loth et al., 2021).

2022-12-07

Journal of Open Source Software (published)

doi.org

Consistency and Rate of Convergence of Switched Least Squares System Identification for Autonomous Markov Jump Linear Systems

Borna Sayedana

Mohammad Afshari

Peter E. Caines

Aditya Mahajan

In this paper, we investigate the problem of system identification for autonomous Markov jump linear systems (MJS) with complete state observations. We propose a switched least squares method for identification of MJS, show that this method is strongly consistent, and derive data-dependent and data-independent rates of convergence. In particular, our data-dependent rate of convergence shows that, almost surely, the system identification error is

2022-12-05

IEEE Conference on Decision and Control (published)

doi.org

A modified Thompson sampling-based learning algorithm for unknown linear systems

Mukul Gagrani

Sagar Sudhakara

Aditya Mahajan

Rahul Jain

Ashutosh Nayyar

Yi Ouyang

We revisit the Thompson sampling-based learning algorithm for controlling an unknown linear system with quadratic cost proposed in [1]. This algorithm operates in episodes of dynamic length and it is shown to have a regret bound of

2022-12-05

2022 IEEE 61st Conference on Decision and Control (CDC) (published)

doi.org

arxiv.org

Partially observable restless bandits with restarts: indexability and computation of Whittle index

Nima Akbarzadeh

Aditya Mahajan

We consider restless bandits with restarts, where the state of the active arms resets according to a known probability distribution while the state of the passive arms evolves in a Markovian manner. We assume that the state of the arm is observed after it is reset but not observed otherwise. We show that the model is indexable and propose an efficient algorithm to compute the Whittle index by exploiting the qualitative properties of the optimal policy. A detailed numerical study of machine repair models shows that the Whittle index policy outperforms the myopic policy and is close to the optimal policy.

2022-12-05

IEEE Conference on Decision and Control (published)

doi.org

Pitfalls of conditional computation for multi-modal learning

Ivaxi Sheth

Mohammad Havaei

S Ebrahimi Kahou

Humans have perfected the art of learning from multiple modalities through sensory organs. Despite impressive predictive performance on a single modality, neural networks cannot reach human-level accuracy with respect to multiple modalities. This is a particularly challenging task due to variations in the structure of the respective modalities. A popular method, Conditional Batch Normalization (CBN), was proposed to learn contextual features to aid a deep learning task. It uses auxiliary data to improve representational power by learning an affine transformation for Convolutional Neural Networks. Despite the boost in performance from using a CBN layer, our work reveals that the visual features learned by introducing auxiliary data via CBN deteriorate. We perform comprehensive experiments to evaluate the brittleness of a dataset to CBN. We show the sensitivity of CBN to the dataset, suggesting that learning from visual features could often be superior for generalization. We perform exhaustive experiments on natural images for bird classification and histology images for cancer type classification. We observe that the CBN network learns close to no visual features on the bird classification dataset and partial visual features on the histology dataset. Our experiments reveal that CBN may encourage shortcut learning between the auxiliary data and labels.
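
To make the affine transformation mentioned above concrete, here is a minimal NumPy sketch of a CBN-style forward pass. It is an illustration under stated assumptions, not the authors' code: the linear maps `w_gamma` and `w_beta`, the choice to predict offsets around a scale of 1 and a shift of 0, and all shapes are hypothetical.

```python
import numpy as np

def conditional_batch_norm(x, aux, w_gamma, w_beta, eps=1e-5):
    """CBN-style forward pass (illustrative sketch).

    x:       (N, C, H, W) convolutional feature maps.
    aux:     (N, D) auxiliary vectors, one per sample.
    w_gamma: (D, C) hypothetical linear map predicting a per-channel scale offset.
    w_beta:  (D, C) hypothetical linear map predicting a per-channel shift.
    """
    # Standard batch-norm statistics: one mean and variance per channel.
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    # Scale and shift are functions of the auxiliary data only.
    gamma = 1.0 + aux @ w_gamma  # shape (N, C)
    beta = aux @ w_beta          # shape (N, C)
    return gamma[:, :, None, None] * x_hat + beta[:, :, None, None]

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3, 5, 5))
aux = rng.normal(size=(4, 2))

# With zero weights the layer reduces to plain batch normalization.
plain = conditional_batch_norm(x, aux, np.zeros((2, 3)), np.zeros((2, 3)))
# With a non-zero shift map, the auxiliary data alone moves every activation.
shifted = conditional_batch_norm(x, aux, np.zeros((2, 3)), np.ones((2, 3)))
```

In this sketch the shift `beta` reaches the output without passing through the visual features at all, which is one plausible route for the shortcut between auxiliary data and labels that the abstract describes.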

2022-12-05

NeurIPS.cc/2022/Workshop/ICBINB (poster)

openreview.net

Thompson-Sampling Based Reinforcement Learning for Networked Control of Unknown Linear Systems

Borna Sayedana

Mohammad Afshari

Peter E. Caines

Aditya Mahajan

In recent years, there has been considerable interest in reinforcement learning for linear quadratic Gaussian (LQG) systems. In this paper, we consider a generalization of such systems where the controller and the plant are connected over an unreliable packet drop channel. Packet drops cause the system dynamics to switch between controlled and uncontrolled modes. This switching phenomenon introduces new challenges in designing learning algorithms. We identify a sufficient condition under which the regret of the Thompson sampling-based reinforcement learning algorithm with dynamic episodes (TSDE) at horizon T is bounded by

2022-12-05

2022 IEEE 61st Conference on Decision and Control (CDC) (published)

doi.org

Computational brain dynamics in prosopagnosia

Simon Faghel-Soubeyrand

Anne-Raphaelle Richoz

Delphine Waeber

Jessica Woodhams

Frédéric Gosselin

Roberto Caldara

Ian Charest

2022-12-04

Journal of Vision (published)

doi.org

GRAND for Rayleigh Fading Channels

Syed Mohsin Abbas

Marwan Jalaleddine

Warren J. Gross

Guessing Random Additive Noise Decoding (GRAND) is a code-agnostic decoding technique for short-length and high-rate channel codes. GRAND attempts to guess the channel-induced noise by generating Test Error Patterns (TEPs), and the sequence of TEP generation is the primary distinction between GRAND variants. In this work, we extend the application of GRAND to multipath frequency non-selective Rayleigh fading communication channels, and we refer to this GRAND variant as Fading-GRAND. The proposed Fading-GRAND adapts its TEP generation to the fading conditions of the underlying communication channel, outperforming traditional channel code decoders in scenarios with L spatial diversity branches as well as scenarios with no diversity. Numerical simulation results show that the Fading-GRAND outperforms the traditional Berlekamp-Massey (B-M) decoder for decoding BCH code (127, 106) and BCH code (127, 113) by

2022-12-03

2022 IEEE Globecom Workshops (GC Wkshps) (published)

doi.org

arxiv.org
