Paul Barde

Modelling Customer Trajectories with Reinforcement Learning for Practical Retail Insights

Maxime C. Cohen

Understanding customer movement within retail spaces is essential for optimizing store layouts. Real-world trajectory data can provide highl… (see more)y accurate insights, but collecting it is costly and often infeasible for many retailers. Heuristics such as Travelling Salesman Problem (TSP) and Probabilistic Nearest Neighbours (PNN) are commonly used as inexpensive approximations, but actual customer trajectories deviate by an average of 28% from shortest paths, highlighting a tradeoff between accuracy and practicality. We propose an agent-based modelling framework that casts customer trajectory prediction as a maximum entropy reinforcement learning (RL) problem, balancing reward maximization with stochasticity to better reflect customers with bounded rationality. Using real-world trajectory data from a convenience store, we show that RL-generated trajectories align more closely with customer behaviour than TSP and PNN, providing more accurate estimates of impulse purchase rates and shelf traffic densities. Furthermore, only RL-based predictions yield repositioning decisions for impulse products that align with those derived from actual trajectory data, resulting in comparable estimated profit gains. Our work demonstrates that RL provides a practical, behaviourally grounded alternative that bridges the gap between oversimplified heuristics and data-intensive approaches, making accurate layout optimization more accessible. To encourage further research, the source code is available on GitHub.

2026-05-17

arXiv (preprint)

doi.org

arxiv.org

A Model-Based Solution to the Offline Multi-Agent Reinforcement Learning Coordination Problem

Paul Barde

Jakob Foerster

Derek Nowrouzezahrai

Amy Zhang

Training multiple agents to coordinate is an essential problem with applications in robotics, game theory, economics, and social sciences. H… (see more)owever, most existing Multi-Agent Reinforcement Learning (MARL) methods are online and thus impractical for real-world applications in which collecting new interactions is costly or dangerous. While these algorithms should leverage offline data when available, doing so gives rise to what we call the offline coordination problem. Specifically, we identify and formalize the strategy agreement (SA) and the strategy fine-tuning (SFT) coordination challenges, two issues at which current offline MARL algorithms fail. Concretely, we reveal that the prevalent model-free methods are severely deficient and cannot handle coordination-intensive offline multi-agent tasks in either toy or MuJoCo domains. To address this setback, we emphasize the importance of inter-agent interactions and propose the very first model-based offline MARL method. Our resulting algorithm, Model-based Offline Multi-Agent Proximal Policy Optimization (MOMA-PPO) generates synthetic interaction data and enables agents to converge on a strategy while fine-tuning their policies accordingly. This simple model-based solution solves the coordination-intensive offline tasks, significantly outperforming the prevalent model-free methods even under severe partial observability and with learned world models.

2023-12-31

AAMAS (published)

doi.org

arxiv.org

From Words to Blocks: Building Objects by Grounding Language Models with Reinforcement Learning

Michael Ahn

Anthony Brohan

Noah Brown

liang Dai

Dan Su

Holy Lovenia Ziwei Bryan Wilie

Tiezheng Yu

Willy Chung

Quyet V. Do

Paul Barde

Tristan Karch

D. Nowrouzezahrai

C. Bonial

Mitchell Abrams

David R. Traum

Hyung Won

Le Hou

Shayne Longpre

Yi Zoph

William Tay … (see 32 more)

Eric Fedus

Xuezhi Li

Lasse Espeholt

Hubert Soyer

Remi Munos

Karen Si-801

Vlad Mnih

Tom Ward

Yotam Doron

Wenlong Huang

Pieter Abbeel

Deepak Pathak

Julia Kiseleva

Ziming Li

Mohammad Aliannejadi

Shrestha Mohanty

Maartje Ter Hoeve

Mikhail Burtsev

Alexey Skrynnik

Artem Zholus

A. Panov

Kavya Srinet

A. Szlam

Yuxuan Sun

Katja Hofmann

Marc-Alexandre Côté

Ahmed Hamid Awadallah

Linar Abdrazakov

Igor Churin

Putra Manggala

Kata Naszádi

Michiel Van Der Meer

Leveraging pre-trained language models to gen-001 erate action plans for embodied agents is an 002 emerging research direction. However, exe… (see more)-003 cuting instructions in real or simulated envi-004 ronments necessitates verifying the feasibility 005 of actions and their relevance in achieving a 006 goal. We introduce a novel method that in-007 tegrates a language model and reinforcement 008 learning for constructing objects in a Minecraft-009 like environment, based on natural language 010 instructions. Our method generates a set of 011 consistently achievable sub-goals derived from 012 the instructions and subsequently completes the 013 associated sub-tasks using a pre-trained RL pol-014 icy. We employ the IGLU competition, which 015 is based on the Minecraft-like simulator, as our 016 test environment, and compare our approach 017 to the competition’s top-performing solutions. 018 Our approach outperforms existing solutions in 019 terms of both the quality of the language model 020 and the quality of the structures built within the 021 IGLU environment. 022

2022-12-31

(published)

www.semanticscholar.org

Learning to Guide and to Be Guided in the Architect-Builder Problem

Paul Barde

Tristan Karch

Derek Nowrouzezahrai

Clément Moulin-Frier

Christopher Pal

Pierre-Yves Oudeyer

We are interested in interactive agents that learn to coordinate, namely, a …

2022-01-27

ICLR.cc/2022/Conference (poster)

doi.org

openreview.net

Adversarial Soft Advantage Fitting: Imitation Learning without Policy Optimization

Christopher Pal

Adversarial Imitation Learning alternates between learning a discriminator -- which tells apart expert's demonstrations from generated ones … (see more)-- and a generator's policy to produce trajectories that can fool this discriminator. This alternated optimization is known to be delicate in practice since it compounds unstable adversarial training with brittle and sample-inefficient reinforcement learning. We propose to remove the burden of the policy optimization steps by leveraging a novel discriminator formulation. Specifically, our discriminator is explicitly conditioned on two policies: the one from the previous generator's iteration and a learnable policy. When optimized, this discriminator directly learns the optimal generator's policy. Consequently, our discriminator's update solves the generator's optimization problem for free: learning a policy that imitates the expert does not require an additional optimization loop. This formulation effectively cuts by half the implementation and computational burden of Adversarial Imitation Learning algorithms by removing the Reinforcement Learning phase altogether. We show on a variety of tasks that our simpler approach is competitive to prevalent Imitation Learning methods.

2019-12-31

Advances in Neural Information Processing Systems 33 (NeurIPS 2020) (published)

doi.org

arxiv.org

AI Policy Fellowship Publications

Mila Ventures Launchpad

AI Policy Compass

Publications

AI Policy Fellowship Publications

Mila Ventures Launchpad

AI Policy Compass

Popular keywords:

Paul Barde

Publications