Publications
Implicit Offline Reinforcement Learning via Supervised Learning
Offline Reinforcement Learning (RL) via Supervised Learning is a simple and effective way to learn robotic skills from a dataset of varied behaviors. It is as simple as supervised learning and Behavior Cloning (BC) but takes advantage of the return information. On BC tasks, implicit models have been shown to match or outperform explicit ones. Despite the benefits of using implicit models to learn robotic skills via BC, Offline RL via Supervised Learning algorithms have been limited to explicit models. We show how implicit models leverage return information and match or outperform explicit algorithms to acquire robotic skills from fixed datasets. Furthermore, we show how closely related our implicit methods are to other popular RL via Supervised Learning algorithms.
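As a concrete illustration of the idea, here is a minimal sketch (PyTorch assumed) of one way an implicit, return-conditioned model can be trained and queried: an energy function over (state, action, return-to-go) triples fit with an InfoNCE-style contrastive loss against sampled negative actions, with inference by sampling candidate actions and taking the lowest-energy one. All module names, shapes, and hyperparameters are illustrative, not the paper's.

```python
import torch
import torch.nn as nn

class EnergyModel(nn.Module):
    """Scores (state, action, return-to-go) triples; lower energy = better."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s, a, g):
        return self.net(torch.cat([s, a, g], dim=-1)).squeeze(-1)

def info_nce_loss(model, s, a, g, n_neg=64):
    """The dataset action is the positive; random actions are negatives.
    Assumes actions normalized to [-1, 1]. s: (B, sd), a: (B, ad), g: (B, 1)."""
    B, A = a.shape
    neg = torch.rand(B, n_neg, A) * 2 - 1
    e_pos = model(s, a, g)                                      # (B,)
    e_neg = model(s.unsqueeze(1).expand(B, n_neg, -1).reshape(B * n_neg, -1),
                  neg.reshape(B * n_neg, -1),
                  g.unsqueeze(1).expand(B, n_neg, -1).reshape(B * n_neg, -1)
                  ).reshape(B, n_neg)
    logits = torch.cat([-e_pos.unsqueeze(1), -e_neg], dim=1)    # index 0 = positive
    return nn.functional.cross_entropy(logits, torch.zeros(B, dtype=torch.long))

@torch.no_grad()
def act(model, s, g_target, n_samples=512, action_dim=2):
    """Derivative-free inference: condition on a high target return and pick
    the lowest-energy sampled action. s: (1, sd), g_target: (1, 1)."""
    cand = torch.rand(n_samples, action_dim) * 2 - 1
    e = model(s.expand(n_samples, -1), cand, g_target.expand(n_samples, -1))
    return cand[e.argmin()]
```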
Informing the development of an outcome set and banks of items to measure mobility among individuals with acquired brain injury using natural language processing
The banks of items for the mobility domains represent a first step toward establishing a comprehensive outcome set and a common language of mobility from which to develop the ontology. They enable researchers and healthcare professionals to begin exposing the content of mobility measures as a way to assess mobility comprehensively.
The deep reinforcement learning (RL) framework has shown great promise for tackling sequential decision-making problems, where the agent learns to behave optimally by interacting with the environment and receiving rewards. The ability of an RL agent to learn different reward functions concurrently has many benefits, such as the decomposition of task rewards and the promotion of skill reuse. In this paper, we consider the problem of continuous control for robot manipulation tasks with an explicit representation that promotes skill reuse while learning multiple tasks with similar reward functions. Our approach relies on two key concepts: successor features (SFs), a value function representation that decouples the dynamics of the environment from the rewards, and an actor-critic framework that incorporates the learned SF representation.
SFs form a natural bridge between model-based and model-free RL methods. We first show how to learn a decomposable representation required by SFs as a pre-training stage. The proposed architecture is able to learn decoupled state and reward feature representations for non-linear reward functions. We then evaluate the feasibility of integrating SFs into an actor-critic framework, which is more tailored for tasks solved with deep RL algorithms. The approach is empirically tested on non-trivial continuous control problems with compositional structure built into the reward functions of the tasks.
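As a concrete rendering of the decomposition described above, the following sketch (illustrative names, PyTorch assumed) learns a successor-feature head psi such that Q(s, a) = psi(s, a) · w, with phi playing the role of the pre-trained state features and w fit by reward regression; transferring to a task with the same dynamics then amounts to re-fitting w alone.

```python
import torch
import torch.nn as nn

class SFCritic(nn.Module):
    """Critic with Q(s, a) = psi(s, a) . w; psi estimates discounted phi sums."""
    def __init__(self, state_dim, action_dim, feat_dim, hidden=256):
        super().__init__()
        self.psi = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, feat_dim),
        )
        self.w = nn.Parameter(torch.zeros(feat_dim))   # task-specific weights

    def q_value(self, s, a):
        return self.psi(torch.cat([s, a], dim=-1)) @ self.w

def sf_losses(critic, target_critic, phi, batch, gamma=0.99):
    """Vector-valued TD loss on psi, plus a regression loss fitting r = phi . w.
    Using phi(s) as the instantaneous feature is one common SF convention."""
    s, a, r, s2, a2 = batch                     # a2: next action from the actor
    psi_sa = critic.psi(torch.cat([s, a], dim=-1))
    with torch.no_grad():
        target = phi(s) + gamma * target_critic.psi(torch.cat([s2, a2], dim=-1))
    sf_loss = ((psi_sa - target) ** 2).mean()
    w_loss = ((phi(s).detach() @ critic.w - r) ** 2).mean()
    return sf_loss + w_loss
```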
The success of Reinforcement Learning (RL) heavily relies on the ability to learn robust representations from the observations of the environment. In most cases, the representations learned purely from the reinforcement learning loss can differ vastly across states depending on how the value functions change. However, the representations learned need not be specific to the task at hand. Relying only on the RL objective may yield representations that vary greatly across successive time steps. In addition, since the RL loss has a changing target, the representations learned depend on how good the current values and policies are. Thus, disentangling the representations from the main task allows them to capture not only task-specific features but also the environment dynamics. To this end, we propose locally constrained representations, where an auxiliary loss forces the state representations to be predictable from the representations of neighboring states. This encourages the representations to be driven not only by value and policy learning but also by an additional loss that prevents the representations from overfitting to the value loss. We evaluate the proposed method on several known benchmarks and observe strong performance. Especially in continuous control tasks, our experiments show a significant performance improvement.
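A minimal sketch of how such an auxiliary loss can be wired in, assuming an MLP encoder and taking the next state as the neighbor; the stop-gradient on the prediction target is one common anti-collapse choice, not necessarily the authors'.

```python
import torch
import torch.nn as nn

obs_dim, rep_dim = 8, 64                          # illustrative sizes
encoder = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(),
                        nn.Linear(128, rep_dim))
predictor = nn.Sequential(nn.Linear(rep_dim, rep_dim), nn.ReLU(),
                          nn.Linear(rep_dim, rep_dim))

def local_constraint_loss(s_t, s_next):
    """Push the representation of s_t to be predictable from the
    representation of its neighbor s_next."""
    z_t = encoder(s_t)
    z_next = encoder(s_next)
    return ((predictor(z_next) - z_t.detach()) ** 2).mean()

# total_loss = rl_loss + aux_weight * local_constraint_loss(s_t, s_next)
```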
One of the key behavioral characteristics used in neuroscience to determine whether the subject of study (be it a rodent or a human) exhibits model-based learning is effective adaptation to local changes in the environment, a particular form of adaptivity that is the focus of this work. In reinforcement learning, however, recent work has shown that modern deep model-based reinforcement learning (MBRL) methods adapt poorly to local environment changes. An explanation for this mismatch is that MBRL methods are typically designed with sample efficiency on a single task in mind, while the requirements for effective adaptation are substantially higher, both in terms of the learned world model and the planning routine. One particularly challenging requirement is that the learned world model has to be sufficiently accurate throughout the relevant parts of the state space. This is challenging for deep-learning-based world models due to catastrophic forgetting. And while a replay buffer can mitigate the effects of catastrophic forgetting, the traditional first-in-first-out replay buffer precludes effective adaptation because it maintains stale data. In this work, we show that a conceptually simple variation of this traditional replay buffer is able to overcome this limitation. By removing from the buffer only those samples that lie in the local neighbourhood of newly observed samples, deep world models can be built that maintain their accuracy across the state space while also adapting effectively to local changes in the reward function. We demonstrate this by applying our replay-buffer variation to a deep version of the classical Dyna method, as well as to recent methods such as PlaNet and DreamerV2, showing that deep model-based methods can likewise adapt effectively to local changes in the environment.
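A minimal sketch of the replay-buffer variation described above: before a new transition is stored, stale entries whose states fall inside a small neighborhood of the incoming state are evicted, so the buffer keeps broad state-space coverage while dropping outdated local data. The Euclidean metric and the radius are illustrative assumptions.

```python
import numpy as np

class LocalForgettingBuffer:
    def __init__(self, capacity, radius=0.1):
        self.capacity, self.radius = capacity, radius
        self.data = []                            # list of (s, a, r, s2) tuples

    def add(self, s, a, r, s2):
        s = np.asarray(s, dtype=float)
        # Evict stale samples whose state lies within `radius` of the new state.
        self.data = [t for t in self.data
                     if np.linalg.norm(t[0] - s) > self.radius]
        if len(self.data) >= self.capacity:       # FIFO fallback if still full
            self.data.pop(0)
        self.data.append((s, a, r, np.asarray(s2, dtype=float)))

    def sample(self, batch_size, rng=np.random):
        idx = rng.choice(len(self.data), size=batch_size)
        return [self.data[i] for i in idx]
```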
Recent years have seen tremendous progress in methods of reinforcement learning. However, most of these approaches have been trained in a straightforward fashion and are generally not robust to adversity, especially in the meta-RL setting. To the best of our knowledge, our work is the first to propose an adversarial training regime for Multi-Task Reinforcement Learning that requires no manual intervention or domain knowledge of the environments. Our experiments on multiple environments in the Multi-Task Reinforcement Learning domain demonstrate that the adversarial process leads to better exploration of numerous solutions and a deeper understanding of the environment. We also adapt existing measures of causal attribution to draw insights from the skills learned, facilitating easier re-purposing of skills for adaptation to unseen environments and tasks.
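The abstract does not spell out the adversarial mechanism, so the sketch below shows one generic way such a regime is often realized (not necessarily the authors' scheme): an adversary reweights the task distribution toward the tasks where the agent currently performs worst, with no manual curriculum.

```python
import numpy as np

def adversarial_task_probs(avg_returns, temperature=1.0):
    """Softmax over negative per-task returns: the worse the agent currently
    does on a task, the more often that task is sampled."""
    scores = -np.asarray(avg_returns, dtype=float) / temperature
    scores -= scores.max()                        # numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()

# task = np.random.choice(len(avg_returns), p=adversarial_task_probs(avg_returns))
```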
Reward functions are notoriously difficult to specify, especially for tasks with complex goals. Reward learning approaches attempt to infer reward functions from human feedback and preferences. Prior works on reward learning have mainly focused on the performance of policies trained alongside the reward function. This practice, however, may fail to detect learned rewards that are not capable of training new policies from scratch and thus do not capture the intended behavior. Our work focuses on demonstrating and studying the causes of these relearning failures in the domain of preference-based reward learning. We demonstrate with experiments in tabular and continuous control environments that the severity of relearning failures can be sensitive to changes in reward model design and the trajectory dataset composition. Based on our findings, we emphasize the need for more retraining-based evaluations in the literature.
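A brief sketch of the two pieces this paragraph implies, with placeholder names for the training stack: the standard Bradley-Terry preference loss used in preference-based reward learning, and the retraining-based evaluation the authors call for, in which a fresh policy is trained from scratch under the frozen learned reward.

```python
import torch

def preference_loss(reward_model, segs_a, segs_b, prefs):
    """Bradley-Terry loss over segment pairs; prefs[i] = 1.0 if the human
    preferred segment a of pair i, else 0.0. `reward_model` is assumed to map
    a batch of segments (B, T, obs_dim) to per-step rewards (B, T)."""
    logits = reward_model(segs_a).sum(dim=1) - reward_model(segs_b).sum(dim=1)
    return torch.nn.functional.binary_cross_entropy_with_logits(logits, prefs)

def retraining_evaluation(learned_reward, env, train_policy, true_return):
    """Retraining-based evaluation: freeze the learned reward, train a fresh
    policy from scratch against it, and score that policy on the ground-truth
    objective. `train_policy` and `true_return` are placeholders."""
    for p in learned_reward.parameters():
        p.requires_grad_(False)                   # freeze the reward model
    fresh_policy = train_policy(env, reward_fn=learned_reward)  # new initialization
    return true_return(env, fresh_policy)
```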
In partially observable environments, reinforcement learning algorithms such as policy gradient and Q-learning may have multiple equilibria (policies that are stable under further training) and can converge to equilibria that are strictly suboptimal. Prior work blames insufficient exploration, but suboptimal equilibria can arise despite full exploration and other favorable circumstances such as a flexible policy parametrization. We show theoretically that the core problem is that, in partially observed environments, an agent's past actions induce a distribution over hidden states. Equipping the policy with memory helps it model the hidden state and leads to convergence to a higher-reward equilibrium, even when a memoryless optimal policy exists. Experiments show that policies with insufficient memory tend to learn to use the environment as auxiliary memory, and that parameter noise helps policies escape suboptimal equilibria.
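A minimal sketch of the two ingredients highlighted above, with illustrative sizes and noise scale: a recurrent policy whose hidden state provides memory over the observation history, and parameter noise, which perturbs policy weights at exploration time to help escape a suboptimal equilibrium.

```python
import copy
import torch
import torch.nn as nn

class MemoryPolicy(nn.Module):
    """Recurrent policy: the GRU hidden state summarizes past observations,
    giving the agent the memory discussed above."""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.rnn = nn.GRUCell(obs_dim, hidden)
        self.head = nn.Linear(hidden, n_actions)

    def forward(self, obs, h):
        h = self.rnn(obs, h)                      # update the memory
        dist = torch.distributions.Categorical(logits=self.head(h))
        return dist, h

def perturbed_copy(policy, sigma=0.05):
    """Parameter noise: act with a weight-perturbed clone for a stretch of
    interaction to help leave a suboptimal equilibrium."""
    noisy = copy.deepcopy(policy)
    with torch.no_grad():
        for p in noisy.parameters():
            p.add_(sigma * torch.randn_like(p))
    return noisy
```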
This work studies a crucial but often overlooked element of ensemble methods in deep reinforcement learning: data sharing between ensemble members. We show that data sharing enables peer learning, a powerful learning process in which individual agents learn from each other's experience to significantly improve their performance. When given access to the experience of other ensemble members, even the worst agent can match or outperform the previously best agent, triggering a virtuous circle. However, we show that peer learning can be unstable when the agents' ability to learn is impaired due to overtraining on early data. We thus employ the recently proposed solution of periodic resets and show that it ensures effective peer learning. We perform extensive experiments on continuous control tasks, from both dense states and pixels, to demonstrate the strong effect of peer learning and its interaction with resets.
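A minimal sketch of the training-loop structure this suggests, assuming a classic Gym-style step API; `make_agent`, the environments, and `shared_buffer` are placeholders. Each member learns from the pooled experience of the whole ensemble, and periodic resets re-initialize the agents while keeping the shared buffer.

```python
def train_ensemble(make_agent, env_fns, shared_buffer, steps,
                   reset_every=200_000, batch_size=256):
    """Ensemble training with one shared replay buffer and periodic resets."""
    agents = [make_agent() for _ in env_fns]
    envs = [fn() for fn in env_fns]
    obs = [env.reset() for env in envs]
    for t in range(steps):
        for i, agent in enumerate(agents):
            a = agent.act(obs[i])
            obs2, r, done, _ = envs[i].step(a)
            shared_buffer.add(obs[i], a, r, obs2, done)  # everyone's data pools here
            obs[i] = envs[i].reset() if done else obs2
            agent.update(shared_buffer.sample(batch_size))  # peer learning
        if (t + 1) % reset_every == 0:
            agents = [make_agent() for _ in agents]  # periodic reset, buffer kept
    return agents
```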
The majority of studies in neuroimaging and psychiatry are focused on case-control analysis (Marquand et al., 2019). However, case-control analysis relies on well-defined groups, which are more the exception than the rule in biology. Psychiatric conditions are diagnosed based on symptoms alone, which makes for heterogeneity at the biological level (Marquand et al., 2016). Relying on mean differences obscures this heterogeneity, and the resulting loss of information can produce unreliable results or misleading conclusions (Loth et al., 2021).
In this paper, we investigate the problem of system identification for autonomous Markov jump linear systems (MJS) with complete state observations. We propose the switched least squares method for identification of MJS, show that this method is strongly consistent, and derive data-dependent and data-independent rates of convergence. In particular, our data-dependent rate of convergence shows that, almost surely, the system identification error is …
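A minimal numpy sketch of the switched least squares idea, assuming the mode sequence is observed along with the state: transitions are grouped by the active mode, and each mode's dynamics matrix is estimated by an ordinary least-squares fit over its own transitions.

```python
import numpy as np

def switched_least_squares(states, modes, n_modes):
    """states: (T+1, n) state trajectory; modes: (T,) active mode at each
    step. Returns one estimated A_m per mode for x_{t+1} = A_{m_t} x_t + w_t."""
    modes = np.asarray(modes)
    estimates = []
    for m in range(n_modes):
        idx = np.where(modes == m)[0]
        X = states[idx]              # regressors x_t for steps in mode m
        Y = states[idx + 1]          # targets x_{t+1}
        B, *_ = np.linalg.lstsq(X, Y, rcond=None)   # solves X @ B ~= Y
        estimates.append(B.T)        # x_{t+1} = A x_t  =>  A = B^T
    return estimates
```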
We revisit the Thompson sampling-based learning algorithm for controlling an unknown linear system with quadratic cost proposed in [1]. This algorithm operates in episodes of dynamic length, and it is shown to have a regret bound of …
2022-12-05
2022 IEEE 61st Conference on Decision and Control (CDC) (published)
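For concreteness, here is a compact sketch of the two core steps of such an episodic Thompson sampling controller: drawing system matrices from a conjugate-Gaussian posterior fit by least squares, and acting with the certainty-equivalent LQR gain for the sampled model. The dynamic episode-length schedule of [1] is omitted, and the priors and noise model are simplifying assumptions.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def lqr_gain(A, B, Q, R):
    """Optimal gain for x' = Ax + Bu with cost x'Qx + u'Ru; control u = -K x."""
    P = solve_discrete_are(A, B, Q, R)
    return np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)

def sample_model(Z, X_next, n, noise_var=1.0, prior_var=1.0, rng=np.random):
    """Z: (T, n+m) rows [x_t, u_t]; X_next: (T, n). Draw [A B] from the
    Gaussian posterior of the least-squares model x_{t+1} = [A B] z_t + w_t."""
    d = Z.shape[1]
    V = Z.T @ Z + (noise_var / prior_var) * np.eye(d)       # posterior precision
    mean = np.linalg.solve(V, Z.T @ X_next)                 # (d, n)
    L = np.linalg.cholesky(noise_var * np.linalg.inv(V))
    theta = (mean + L @ rng.standard_normal(mean.shape)).T  # (n, d) = [A B]
    return theta[:, :n], theta[:, n:]

# Per episode: A_s, B_s = sample_model(Z, X_next, n); K = lqr_gain(A_s, B_s, Q, R)
```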