Erick Delage

Offline reinforcement learning learns policies from fixed datasets without further environment interaction. A key challenge in this setting is epistemic uncertainty, arising from limited or biased data coverage, particularly when the behavior policy systematically avoids certain actions. This can lead to inaccurate value estimates and unreliable generalization. Ensemble-based methods like SAC-N mitigate this by conservatively estimating Q-values using the ensemble minimum, but they require large ensembles and often conflate epistemic with aleatoric uncertainty. To address these limitations, we propose a unified and generalizable framework that replaces discrete ensembles with compact uncertainty sets over Q-values. We further introduce an Epinet-based model that directly shapes the uncertainty sets to optimize the cumulative reward under the robust Bellman objective without relying on ensembles. We also introduce a benchmark for evaluating offline RL algorithms under risk-sensitive behavior policies, and demonstrate that our method achieves improved robustness and generalization over ensemble-based baselines across both tabular and continuous state domains.
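
As a rough illustration of the ensemble-minimum idea mentioned in this abstract, the sketch below takes the element-wise minimum over a small set of Q-estimates and compares the resulting greedy action with the one obtained from the ensemble mean. The array values, the function name `pessimistic_q`, and the ensemble size are hypothetical; a full SAC-N implementation also involves target networks and an entropy term, which are omitted here, and nothing below reproduces the uncertainty-set method the paper proposes.

```python
import numpy as np

def pessimistic_q(q_ensemble: np.ndarray) -> np.ndarray:
    """Element-wise minimum over ensemble members.

    q_ensemble has shape (N, num_actions): one row per ensemble member.
    """
    return q_ensemble.min(axis=0)

# Hypothetical Q-estimates from N = 3 ensemble members for 2 actions.
q_ensemble = np.array([
    [1.0, 2.5],  # members roughly agree on action 0 ...
    [1.1, 0.2],  # ... but disagree strongly on action 1
    [0.9, 3.0],
])

q_min = pessimistic_q(q_ensemble)    # conservative estimate per action
q_mean = q_ensemble.mean(axis=0)     # non-conservative estimate per action

greedy_min = int(q_min.argmax())     # action chosen under pessimism
greedy_mean = int(q_mean.argmax())   # action chosen under the mean
```

In this toy case the mean prefers action 1, while the minimum prefers action 0 because the members disagree about action 1.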

2026-04-07

arXiv (preprint)

Joint Rolling Stock and Crew Scheduling with Multi-train Composition in Urban Rail Networks

Entai Wang

Lixing Yang

Jean-François Cordeau

Ziyou Gao

Yossiri Adulyasak

Rolling stock scheduling and crew scheduling are two fundamental problems that arise in the planning of urban rail operations and that are especially important in the case of flexible operations in real-world networks. These problems are often solved separately and sequentially in different planning stages, resulting in limited options to adjust crew schedules after rolling stock decisions have been made. To better adjust these two decision-making processes and achieve better solutions, this paper studies a joint rolling stock and crew scheduling problem in urban rail networks. A novel optimization model is formulated with the aim of reducing the operational cost of rolling stock units and crew members. In addition, the multi-train composition mode is considered to adequately match different frequency requirements and rolling stock transport capacities. To solve the model, a customized branch-and-price-and-cut solution algorithm is proposed to find the optimal schedule schemes, in which Benders decomposition is used to solve the linear programming relaxation of the path-based reformulation. Two customized column generation methods with label correcting are embedded to solve the master problem and pricing subproblem for generating paths (columns) corresponding to rolling stock units and crew groups, respectively. Finally, a branch-and-bound procedure with several acceleration techniques is proposed to find integer solutions. To demonstrate the computational performance and the robustness of the proposed approaches, a series of numerical experiments are performed in real-world instances of the Beijing urban rail network under different settings. The computational results confirm the high efficiency of the solution methodology and the benefits of the flexible operation schemes based on the solutions found by the proposed methods.
Funding: This work was supported by National Natural Science Foundation of China [Grants 72288101, 72322022, 72371015]. The first author sincerely thanks the China Scholarship Council for supporting his visiting PhD program [Grant 202407090173]. Supplemental Material: The electronic companion is available at https://doi.org/10.1287/trsc.2024.0905 .

2026-02-16

Transportation Science (published)

Reward Redistribution for CVaR MDPs using a Bellman Operator on L-infinity

Tail-end risk measures such as static conditional value-at-risk (CVaR) are used in safety-critical applications to prevent rare, yet catastrophic events. Unlike risk-neutral objectives, the static CVaR of the return depends on entire trajectories without admitting a recursive Bellman decomposition in the underlying Markov decision process. A classical resolution relies on state augmentation with a continuous variable. However, unless restricted to a specialized class of admissible value functions, this formulation induces sparse rewards and degenerate fixed points. In this work, we propose a novel formulation of the static CVaR objective based on augmentation. Our alternative approach leads to a Bellman operator with: (1) dense per-step rewards; (2) contracting properties on the full space of bounded value functions. Building on this theoretical foundation, we develop risk-averse value iteration and model-free Q-learning algorithms that rely on discretized augmented states. We further provide convergence guarantees and approximation error bounds due to discretization. Empirical results demonstrate that our algorithms successfully learn CVaR-sensitive policies and achieve effective performance-safety trade-offs.
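
For background on why a continuous auxiliary variable appears in CVaR problems, the sketch below numerically checks the well-known variational (Rockafellar-Uryasev) representation of the lower-tail CVaR of a return X, namely the maximum over s of s - E[(s - X)^+] / alpha, against the plain average of the worst outcomes. The abstract does not state which representation the authors build on, so treat this only as an assumed illustration; the sample returns and function names are hypothetical.

```python
import numpy as np

def cvar_tail_mean(returns, alpha):
    """Mean of the worst alpha-fraction of returns (alpha * n assumed integer)."""
    x = np.sort(np.asarray(returns, dtype=float))
    k = int(round(alpha * len(x)))
    return x[:k].mean()

def cvar_variational(returns, alpha):
    """Maximum over s of  s - E[(s - X)^+] / alpha, searched over sample values."""
    x = np.asarray(returns, dtype=float)
    # The objective is concave and piecewise linear in s with kinks at the
    # samples, so a maximizer can always be found among the sample values.
    values = [s - np.maximum(s - x, 0.0).mean() / alpha for s in x]
    return max(values)

# Hypothetical equally likely returns; higher is better.
returns = [-10.0, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
alpha = 0.2

tail = cvar_tail_mean(returns, alpha)
variational = cvar_variational(returns, alpha)
```

Both quantities agree on this sample (the average of the two worst returns), and the maximizing s plays the role of the continuous variable that an augmented state would track.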

2026-02-02

arXiv (preprint)

Boosting CVaR Policy Optimization with Quantile Gradients

Yudong Luo

Optimizing Conditional Value-at-Risk (CVaR) using policy gradient (a.k.a. CVaR-PG) faces significant challenges of sample inefficiency. This inefficiency stems from the fact that it focuses on tail-end performance and overlooks many sampled trajectories. We address this problem by augmenting CVaR with an expected quantile term. Quantile optimization admits a dynamic programming formulation that leverages all sampled data, thus improving sample efficiency. This does not alter the CVaR objective since CVaR corresponds to the expectation of quantile over the tail. Empirical results in domains with verifiable risk-averse behavior show that our algorithm within the Markovian policy class substantially improves upon CVaR-PG and consistently outperforms other existing methods.
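
The statement that CVaR corresponds to the expectation of the quantile over the tail can be checked numerically on a small sample. The sketch below averages empirical quantiles over levels u in (0, alpha) and compares the result with the mean of the worst outcomes. The sample, the grid size, and the function name are hypothetical, and this is only the identity itself, not the policy-gradient algorithm described in the abstract.

```python
import numpy as np

def tail_quantile_average(returns, alpha, grid_size=1000):
    """Approximate (1 / alpha) * integral over (0, alpha) of the u-quantile."""
    x = np.sort(np.asarray(returns, dtype=float))
    n = len(x)
    # Midpoint grid of quantile levels u in (0, alpha).
    levels = (np.arange(grid_size) + 0.5) / grid_size * alpha
    # Empirical u-quantile: the ceil(u * n)-th smallest sample.
    idx = np.ceil(levels * n).astype(int) - 1
    return x[idx].mean()

# Hypothetical equally likely returns; higher is better.
returns = [-10.0, 0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
alpha = 0.2

via_quantiles = tail_quantile_average(returns, alpha)
via_tail_mean = np.sort(returns)[:2].mean()  # mean of the worst 20% (2 of 10)
```

Every quantile level in the tail contributes to the average, which is the sense in which a quantile-based term can make use of more than the few worst trajectories.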

2026-01-28

arXiv (preprint)

Planning and Learning in Average Risk-aware MDPs

Weikai Wang

For continuing tasks, average cost Markov decision processes have well-documented value and can be solved using efficient algorithms. However, this framework explicitly assumes that the agent is risk-neutral. In this work, we extend risk-neutral algorithms to accommodate the more general class of dynamic risk measures. Specifically, we propose a relative value iteration (RVI) algorithm for planning and design two model-free Q-learning algorithms, namely a generic algorithm based on the multi-level Monte Carlo (MLMC) method, and an off-policy algorithm dedicated to utility-based shortfall risk measures. Both the RVI and MLMC-based Q-learning algorithms are proven to converge to optimality. Numerical experiments validate our analysis, confirm empirically the convergence of the off-policy algorithm, and demonstrate that our approach enables the identification of policies that are finely tuned to the intricate risk-awareness of the agent that they serve.
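
For readers unfamiliar with relative value iteration, here is a minimal sketch of the standard risk-neutral version for an average-cost MDP, which is the baseline this abstract says it extends. The two-state, two-action transition tensor `P`, the cost matrix `c`, and the function name are hypothetical. The risk-aware variant would replace the expectation inside the update with a risk measure; how the authors do so is not stated here and is not attempted below.

```python
import numpy as np

def relative_value_iteration(P, c, ref_state=0, iters=500):
    """Risk-neutral RVI. P: (A, S, S) transition tensor, c: (S, A) costs."""
    h = np.zeros(c.shape[0])
    for _ in range(iters):
        # q[s, a] = c[s, a] + sum over n of P[a, s, n] * h[n]
        q = c + np.einsum("asn,n->sa", P, h)
        th = q.min(axis=1)
        gain = th[ref_state]   # running estimate of the optimal average cost
        h = th - gain          # keep h anchored at the reference state
    return gain, h, q.argmin(axis=1)

# Hypothetical MDP with strictly positive transitions (aperiodic, one class).
P = np.array([
    [[0.9, 0.1], [0.1, 0.9]],  # action 0
    [[0.5, 0.5], [0.5, 0.5]],  # action 1
])
c = np.array([
    [1.0, 2.0],   # costs in state 0 for actions 0 and 1
    [3.0, 2.5],   # costs in state 1 for actions 0 and 1
])

gain, h, policy = relative_value_iteration(P, c)

# Residual of the average-cost optimality equation gain + h = min_a (c + P h).
residual = (c + np.einsum("asn,n->sa", P, h)).min(axis=1) - (gain + h)
```

Subtracting the value at a reference state each iteration keeps the iterates bounded, which plain value iteration does not do in the average-cost setting.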

2025-09-17

NeurIPS.cc/2025/Conference (poster)

openreview.net

What Matters when Modeling Human Behavior using Imitation Learning?

As AI systems become increasingly embedded in human decision-making processes, aligning their behavior with human values is critical to ensuring safe and trustworthy deployment. A central approach to AI alignment, called Imitation Learning (IL), trains a learner to directly mimic desirable human behaviors from expert demonstrations. However, standard IL methods assume that (1) experts act to optimize expected returns; (2) expert policies are Markovian. Both assumptions are inconsistent with empirical findings from behavioral economics, according to which humans (1) are risk-sensitive; and (2) make decisions based on past experience. In this work, we examine the implications of risk sensitivity for IL and show that standard approaches do not capture all optimal policies under risk-sensitive decision criteria. By characterizing these expert policies, we identify key limitations of existing IL algorithms in replicating expert performance in risk-sensitive settings. Our findings underscore the need for new IL frameworks that account for both risk-aware preferences and temporal dependencies to faithfully align AI behavior with human experts.

2025-06-09

ICML.cc/2025/Workshop/MoFA (poster)

openreview.net

Fair Resource Allocation in Weakly Coupled Markov Decision Processes

Xiaohui Tu

Yossiri Adulyasak

Nima Akbarzadeh

We consider fair resource allocation in sequential decision-making environments modeled as weakly coupled Markov decision processes, where resource constraints couple the action spaces of

2025-04-22

Proceedings of The 28th International Conference on Artificial Intelligence and Statistics (published)

proceedings.mlr.press

Planning and Learning in Risk-Aware Restless Multi-Arm Bandits

Nima Akbarzadeh

Yossiri Adulyasak

2025-04-22

Proceedings of The 28th International Conference on Artificial Intelligence and Statistics (published)

proceedings.mlr.press

Q-learning for Quantile MDPs: A Decomposition, Performance, and Convergence Analysis

Jia Lin Hau

Esther Derman

Mohammad Ghavamzadeh

Marek Petrik

In Markov decision processes (MDPs), quantile risk measures such as Value-at-Risk are a standard metric for modeling RL agents' preferences for certain outcomes. This paper proposes a new Q-learning algorithm for quantile optimization in MDPs with strong convergence and performance guarantees. The algorithm leverages a new, simple dynamic program (DP) decomposition for quantile MDPs. Compared with prior work, our DP decomposition requires neither known transition probabilities nor solving complex saddle point equations and serves as a suitable foundation for other model-free RL algorithms. Our numerical results in tabular domains show that our Q-learning algorithm converges to its DP variant and outperforms earlier algorithms.

2025-04-22

Proceedings of The 28th International Conference on Artificial Intelligence and Statistics (published)

proceedings.mlr.press

Learning-to-Optimize for Consolidation and Transshipment in Multi-store Order Delivery

Xin Wang

Okan Arslan

Jean-François Cordeau

2024-12-31

SSRN Electronic Journal (accepted)

A Survey of Contextual Optimization Methods for Decision-Making under Uncertainty

Utsav Sadana

Abhilash Chenreddy

Alexandre Forel

Emma Frejinger

Thibaut Vidal

2024-12-31

European Journal of Operational Research (published)

Conformal Inverse Optimization

Bo Lin