Audrey Durand

Maxime Thibault

2021-05-05

J. Am. Medical Informatics Assoc. (publié)

Routine Bandits: Minimizing Regret on Recurring Problems

Hassan Saber

L'eo Saci

Odalric-Ambrym Maillard

2020-12-31

ECML/PKDD (publié)

Handling Black Swan Events in Deep Learning with Diversely Extrapolated Neural Networks

Maxime Wabartha

Vincent François-Lavet

By virtue of their expressive power, neural networks (NNs) are well suited to fitting large, complex datasets, yet they are also known to … (voir plus)produce similar predictions for points outside the training distribution. As such, they are, like humans, under the influence of the Black Swan theory: models tend to be extremely "surprised" by rare events, leading to potentially disastrous consequences, while justifying these same events in hindsight. To avoid this pitfall, we introduce DENN, an ensemble approach building a set of Diversely Extrapolated Neural Networks that fits the training data and is able to generalize more diversely when extrapolating to novel data points. This leads DENN to output highly uncertain predictions for unexpected inputs. We achieve this by adding a diversity term in the loss function used to train the model, computed at specific inputs. We first illustrate the usefulness of the method on a low-dimensional regression problem. Then, we show how the loss can be adapted to tackle anomaly detection during classification, as well as safe imitation learning problems.

2020-06-30

International Joint Conference on Artificial Intelligence (publié)

Old Dog Learns New Tricks: Randomized UCB for Bandit Problems

Sharan Vaswani

Abbas Mehrabian

Branislav Kveton

2020-06-02

Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics (publié)

proceedings.mlr.press

Leveraging exploration in off-policy algorithms via normalizing flows

Bogdan Mazoure

Thang Doan

R Devon Hjelm

The ability to discover approximately optimal policies in domains with sparse rewards is crucial to applying reinforcement learning (RL) in … (voir plus)many real-world scenarios. Approaches such as neural density models and continuous exploration (e.g., Go-Explore) have been proposed to maintain the high exploration rate necessary to find high performing and generalizable policies. Soft actor-critic(SAC) is another method for improving exploration that aims to combine efficient learning via off-policy updates while maximizing the policy entropy. In this work, we extend SAC to a richer class of probability distributions (e.g., multimodal) through normalizing flows (NF) and show that this significantly improves performance by accelerating the discovery of good policies while using much smaller policy representations. Our approach, which we call SAC-NF, is a simple, efficient,easy-to-implement modification and improvement to SAC on continuous control baselines such as MuJoCo and PyBullet Roboschool domains. Finally, SAC-NF does this while being significantly parameter efficient, using as few as 5.5% the parameters for an equivalent SAC model.

2020-05-11

Proceedings of the Conference on Robot Learning (publié)

proceedings.mlr.press

Literature Mining for Incorporating Inductive Bias in Biomedical Prediction Tasks (Student Abstract)

Qizhen Zhang

2020-04-02

AAAI Conference on Artificial Intelligence (publié)

Deep interpretability for GWAS

Deepak Sharma

Marc-Andr Legault

Louis-Philippe Lemieux Perreault

Audrey Lemaon

Marie-Pierre Dub

Genome-Wide Association Studies are typically conducted using linear models to find genetic variants associated with common diseases. In the… (voir plus)se studies, association testing is done on a variant-by-variant basis, possibly missing out on non-linear interaction effects between variants. Deep networks can be used to model these interactions, but they are difficult to train and interpret on large genetic datasets. We propose a method that uses the gradient based deep interpretability technique named DeepLIFT to show that known diabetes genetic risk factors can be identified using deep models along with possibly novel associations.

2019-12-31

arXiv (prépublication)

arxiv.org

Leveraging Observations in Bandits: Between Risks and Benefits.

Andrei Lupu

Doina Precup

Imitation learning has been widely used to speed up learning in novice agents, by allowing them to leverage existing data from experts. Allo… (voir plus)wing an agent to be influenced by external observations can benefit to the learning process, but it also puts the agent at risk of following sub-optimal behaviours. In this paper, we study this problem in the context of bandits. More specifically, we consider that an agent (learner) is interacting with a bandit-style decision task, but can also observe a target policy interacting with the same environment. The learner observes only the target’s actions, not the rewards obtained. We introduce a new bandit optimism modifier that uses conditional optimism contingent on the actions of the target in order to guide the agent’s exploration. We analyze the effect of this modification on the well-known Upper Confidence Bound algorithm by proving that it preserves a regret upper-bound of order O(lnT), even in the presence of a very poor target, and we derive the dependency of the expected regret on the general target policy. We provide empirical results showing both great benefits as well as certain limitations inherent to observational learning in the multi-armed bandit setting. Experiments are conducted using targets satisfying theoretical assumptions with high probability, thus narrowing the gap between theory and application.

2019-07-16

Proceedings of the AAAI Conference on Artificial Intelligence (publié)

On-line Adaptative Curriculum Learning for GANs

Thang Doan

Joao Monteiro

Isabela Albuquerque

Bogdan Mazoure

R Devon Hjelm

Generative Adversarial Networks (GANs) can successfully approximate a probability distribution and produce realistic samples. However, open … (voir plus)questions such as sufficient convergence conditions and mode collapse still persist. In this paper, we build on existing work in the area by proposing a novel framework for training the generator against an ensemble of discriminator networks, which can be seen as a one-student/multiple-teachers setting. We formalize this problem within the full-information adversarial bandit framework, where we evaluate the capability of an algorithm to select mixtures of discriminators for providing the generator with feedback during learning. To this end, we propose a reward function which reflects the progress made by the generator and dynamically update the mixture weights allocated to each discriminator. We also draw connections between our algorithm and stochastic optimization methods and then show that existing approaches using multiple discriminators in literature can be recovered from our framework. We argue that less expressive discriminators are smoother and have a general coarse grained view of the modes map, which enforces the generator to cover a wide portion of the data distribution support. On the other hand, highly expressive discriminators ensure samples quality. Finally, experimental results show that our approach improves samples quality and diversity over existing baselines by effectively learning a curriculum. These results also support the claim that weaker discriminators have higher entropy improving modes coverage. Keywords: multiple discriminators, curriculum learning, multiple resolutions discriminators, multi-armed bandits, generative adversarial networks, smooth discriminators, multi-discriminator gan training, multiple experts.

2019-07-16

Proceedings of the AAAI Conference on Artificial Intelligence (publié)

arxiv.org

Attraction-Repulsion Actor-Critic for Continuous Control Reinforcement Learning

Thang Doan

Bogdan Mazoure

R Devon Hjelm

Continuous control tasks in reinforcement learning are important because they provide an important framework for learning in high-dimensiona… (voir plus)l state spaces with deceptive rewards, where the agent can easily become trapped into suboptimal solutions. One way to avoid local optima is to use a population of agents to ensure coverage of the policy space, yet learning a population with the "best" coverage is still an open problem. In this work, we present a novel approach to population-based RL in continuous control that leverages properties of normalizing flows to perform attractive and repulsive operations between current members of the population and previously observed policies. Empirical results on the MuJoCo suite demonstrate a high performance gain for our algorithm compared to prior work, including Soft-Actor Critic (SAC).

2018-12-31

arXiv (prépublication)

openreview.net

Contextual Bandits for Adapting Treatment in a Mouse Model of de Novo Carcinogenesis

Charis Achilleos

Demetris Iacovides

Katerina Strati

Georgios D. Mitsis