
Alexandre Piché

PhD - Université de Montréal
Supervisor

Publications

Self-evaluation and self-prompting to improve the reliability of LLMs
Alexandre Piché
Aristides Milios
To be safely deployed, Large Language Models (LLMs) must be capable of dynamically adapting their behavior based on their level of knowledge and the uncertainty associated with specific topics. This adaptive behavior, which we refer to as self-restraint, is non-trivial to teach since it depends on the internal knowledge of an LLM. By default, LLMs are trained to maximize the next-token likelihood, which does not teach the model to modulate its answers based on its level of uncertainty. To learn self-restraint, we devise a simple objective that encourages the model to produce generations it is confident in. To optimize this objective, we introduce ReSearch, an iterative search algorithm based on self-evaluation and self-prompting. Our method results in fewer hallucinations overall, both for known and unknown topics, as the model learns to selectively restrain itself. In addition, it elegantly incorporates the ability to decline when the model assesses that it cannot provide a response without a high proportion of hallucination.
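A minimal sketch of the kind of iterative self-evaluation and self-prompting loop the abstract describes, assuming a hypothetical model interface with generate and score methods; the confidence threshold and prompt format are illustrative assumptions, not the paper's actual ReSearch procedure.

def research_loop(model, question, n_candidates=4, n_rounds=3, threshold=0.8):
    """Generate answers, self-evaluate them, and re-prompt the model with its
    best attempt so far; decline if confidence never clears the threshold."""
    best_answer, best_score = None, float("-inf")
    prompt = question
    for _ in range(n_rounds):
        # Sample several candidate answers for the current prompt.
        candidates = [model.generate(prompt) for _ in range(n_candidates)]
        # Self-evaluation: the model scores its own candidates (0 to 1).
        scored = [(model.score(question, c), c) for c in candidates]
        score, answer = max(scored, key=lambda pair: pair[0])
        if score > best_score:
            best_score, best_answer = score, answer
        # Self-prompting: feed the best answer back as a draft to refine.
        prompt = (f"{question}\n"
                  f"Draft answer (keep only statements you are sure of):\n"
                  f"{best_answer}")
    if best_score < threshold:
        # Self-restraint: decline rather than risk heavy hallucination.
        return "I cannot answer this reliably."
    return best_answer

In this sketch, declining is simply the fallback when no candidate ever clears the confidence threshold, mirroring the abstract's notion of self-restraint.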
Bridging the Gap Between Target Networks and Functional Regularization
Alexandre Piché
Valentin Thomas
Joseph Marino
Rafael Pardinas
Gian Maria Marconi
Mohammad Emtiyaz Khan
Causal Discovery with Language Models as Imperfect Experts
Stephanie Long
Alexandre Piché
Valentina Zantedeschi
Tibor Schuster
Understanding the causal relationships that underlie a system is a fundamental prerequisite to accurate decision-making. In this work, we explore how expert knowledge can be used to improve the data-driven identification of causal graphs, beyond Markov equivalence classes. In doing so, we consider a setting where we can query an expert about the orientation of causal relationships between variables, but where the expert may provide erroneous information. We propose strategies for amending such expert knowledge based on consistency properties, e.g., acyclicity and conditional independencies in the equivalence class. We then report a case study, on real data, where a large language model is used as an imperfect expert.
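A hedged sketch of one simple consistency rule in this setting: accept an imperfect expert's orientation for each undirected edge only if it keeps the graph acyclic, and flip it otherwise. The query_expert callable stands in for an LLM query and is a hypothetical placeholder; the paper's amendment strategies also use conditional independencies, which are omitted here.

import networkx as nx

def orient_with_expert(skeleton_edges, oriented_edges, query_expert):
    """skeleton_edges: undirected edges (u, v) left to orient.
    oriented_edges: edges already oriented from data, (u, v) means u -> v.
    query_expert(u, v): True if the expert believes u -> v, else False."""
    g = nx.DiGraph(oriented_edges)
    g.add_nodes_from(node for edge in skeleton_edges for node in edge)
    for u, v in skeleton_edges:
        src, dst = (u, v) if query_expert(u, v) else (v, u)
        g.add_edge(src, dst)
        if not nx.is_directed_acyclic_graph(g):
            # The expert's answer contradicts acyclicity: amend it by
            # flipping the edge. Since the graph was acyclic before this
            # addition, at most one orientation can create a cycle, so the
            # flipped edge is always safe to add.
            g.remove_edge(src, dst)
            g.add_edge(dst, src)
    return list(g.edges)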
Mastering the Unsupervised Reinforcement Learning Benchmark from Pixels
Sai Rajeswar
Pietro Mazzaglia
Tim Verbelen
Alexandre Piché
Bart Dhoedt
Alexandre Lacoste
Controlling artificial agents from visual sensory data is an arduous task. Reinforcement learning (RL) algorithms can succeed but require large amounts of interaction between the agent and the environment. To alleviate the issue, unsupervised RL proposes to employ self-supervised interaction and learning, for adapting faster to future tasks. Yet, as shown in the Unsupervised RL Benchmark (URLB; Laskin et al. 2021), whether current unsupervised strategies can improve generalization capabilities is still unclear, especially in visual control settings. In this work, we study the URLB and propose a new method to solve it, using unsupervised model-based RL, for pre-training the agent, and a task-aware fine-tuning strategy combined with a new proposed hybrid planner, Dyna-MPC, to adapt the agent for downstream tasks. On URLB, our method obtains 93.59% overall normalized performance, surpassing previous baselines by a staggering margin. We evaluate the approach through a large-scale empirical study, which we use to validate our design choices and analyze our models. We also show robust performance on the Real-World RL benchmark, hinting at resiliency to environment perturbations during adaptation. Project website: https://masteringurlb.github.io/
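A rough numpy sketch of the hybrid-planning idea behind a Dyna-style MPC step: candidate action sequences come partly from random shooting and partly from a learned policy, are rolled out in a learned world model, and are scored by imagined rewards plus a terminal value estimate. All interfaces (world_model, policy, value) and shapes are assumptions for illustration, not the paper's actual Dyna-MPC implementation.

import numpy as np

def dyna_mpc_step(world_model, policy, value, state, horizon=5,
                  n_random=64, n_policy=64, action_dim=4, gamma=0.99):
    """Return the first action of the best plan found by hybrid planning.
    Assumed batched callables: world_model(states, actions) -> (next_states,
    rewards); policy(states) -> actions; value(states) -> state values."""
    # Candidate plans from random shooting.
    random_plans = np.random.uniform(-1.0, 1.0, (n_random, horizon, action_dim))
    # Candidate plans proposed by the learned policy, rolled out in imagination.
    policy_plans = np.empty((n_policy, horizon, action_dim))
    s = np.repeat(state[None], n_policy, axis=0)
    for t in range(horizon):
        policy_plans[:, t] = policy(s)
        s, _ = world_model(s, policy_plans[:, t])
    plans = np.concatenate([random_plans, policy_plans], axis=0)
    # Score every plan with imagined rewards plus a terminal value estimate.
    s = np.repeat(state[None], len(plans), axis=0)
    returns = np.zeros(len(plans))
    for t in range(horizon):
        s, r = world_model(s, plans[:, t])
        returns += gamma ** t * r
    returns += gamma ** horizon * value(s)
    return plans[np.argmax(returns), 0]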
Exploring validation metrics for offline model-based optimisation
Christopher Beckham
Alexandre Piché
David Vazquez
In offline model-based optimisation (MBO) we are interested in using machine learning to design candidates that maximise some measure of desirability through an expensive but real-world scoring process. Offline MBO tries to approximate this expensive scoring function and use it to evaluate generated designs; however, this evaluation is non-exact because one approximation is being evaluated with another. Instead, we ask ourselves: if we did have the real-world scoring function at hand, what cheap-to-compute validation metrics would correlate best with it? Since the real-world scoring function is available for simulated MBO datasets, insights obtained from these can be transferred to real-world offline MBO tasks where the real-world scoring function is expensive to compute. To address this, we propose a conceptual evaluation framework that is amenable to measuring extrapolation, and we apply it to conditional denoising diffusion models. Empirically, we find that two validation metrics, agreement and Fréchet distance, correlate quite well with the ground truth. When there is high variability in conditional generation, feedback is required in the form of an approximated version of the real-world scoring function. Furthermore, we find that generating high-scoring samples may require heavily weighting the generative model in favour of sample quality, potentially at the cost of sample diversity.
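A minimal sketch of the two validation metrics the abstract names, under two assumptions: that "agreement" can be read as rank agreement between an approximate scorer and the ground-truth scorer, and that the Fréchet distance is the standard Gaussian-fit form (as in FID) between two sample sets. The paper's exact definitions may differ.

import numpy as np
from scipy.linalg import sqrtm
from scipy.stats import spearmanr

def frechet_distance(x, y):
    """Fréchet distance between Gaussians fitted to two sample sets (n, d)."""
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    cov_x = np.cov(x, rowvar=False)
    cov_y = np.cov(y, rowvar=False)
    covmean = sqrtm(cov_x @ cov_y).real  # matrix square root; drop numerical noise
    diff = mu_x - mu_y
    return float(diff @ diff + np.trace(cov_x + cov_y - 2.0 * covmean))

def agreement(approx_scores, true_scores):
    """Rank agreement between an approximate scorer and the true scorer."""
    return spearmanr(approx_scores, true_scores).correlation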
Implicit Offline Reinforcement Learning via Supervised Learning
Alexandre Piché
Rafael Pardinas
David Vazquez
Igor Mordatch
Offline Reinforcement Learning (RL) via Supervised Learning is a simple and effective way to learn robotic skills from a dataset of varied behaviors. It is as simple as supervised learning and Behavior Cloning (BC) but takes advantage of the return information. On BC tasks, implicit models have been shown to match or outperform explicit ones. Despite the benefits of using implicit models to learn robotic skills via BC, Offline RL via Supervised Learning algorithms have been limited to explicit models. We show how implicit models leverage return information and match or outperform explicit algorithms to acquire robotic skills from fixed datasets. Furthermore, we show how closely related our implicit methods are to other popular RL via Supervised Learning algorithms.
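A hedged sketch of the implicit-model idea: rather than an explicit policy that outputs an action directly, an implicit model assigns an energy to (state, action, return) triples, and acting means searching for the lowest-energy action, here by simple sampling. The energy callable is a placeholder for a learned network, and conditioning on a high target return is one way to exploit the return information the abstract mentions; this is an illustration, not the paper's method.

import numpy as np

def implicit_act(energy, state, target_return, n_candidates=256, action_dim=4):
    """Pick the sampled action the implicit model scores best when
    conditioned on a high target return. energy(states, actions, returns)
    -> per-candidate energies, lower meaning a better fit to the data."""
    actions = np.random.uniform(-1.0, 1.0, size=(n_candidates, action_dim))
    states = np.repeat(state[None], n_candidates, axis=0)
    returns = np.full((n_candidates, 1), target_return)
    return actions[np.argmin(energy(states, actions, returns))]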