Xujie Si

Understanding Behavioral Metric Learning: A Large-Scale Study on Distracting Reinforcement Learning Environments

Ziyan Luo

Tianwei Ni

Pierre-Luc Bacon

Doina Precup

Xujie Si

2025-05-31

ArXiv (preprint)

arxiv.org

Understanding the Effectiveness of Learning Behavioral Metrics in Deep Reinforcement Learning

Ziyan Luo

Tianwei Ni

Pierre-Luc Bacon

Doina Precup

Xujie Si

A key approach to state abstraction is approximating behavioral metrics (notably, bisimulation metrics) in the observation space and embedding these learned distances in the representation space. While promising for robustness to task-irrelevant noise shown in prior work, accurately estimating these metrics remains challenging, requiring various design choices that create gaps between theory and practice. Prior evaluations focus mainly on final returns, leaving the quality of learned metrics and the source of performance gains unclear. To systematically assess how metric learning works in deep RL, we evaluate five recent approaches. We unify them under isometric embedding, identify key design choices, and benchmark them with baselines across 20 state-based and 14 pixel-based tasks, spanning 250+ configurations with diverse noise settings. Beyond final returns, we introduce the denoising factor to quantify the encoder’s ability to filter distractions. To further isolate the effect of metric learning, we propose an isolated metric estimation setting, where the encoder is influenced solely by the metric loss. Our results show that metric learning improves return and denoising only marginally, as its benefits fade when key design choices, such as layer normalization and self-prediction loss, are incorporated into the baseline. We also find that commonly used benchmarks (e.g., grayscale videos, varying state-based Gaussian noise dimensions) add little difficulty, while Gaussian noise with random projection and pixel-based Gaussian noise remain challenging even for the best methods. Finally, we release an open-source, modular codebase to improve reproducibility and support future research on metric learning in deep RL.

2025-05-09

rl-conference.cc/RLC/2025/Conference (accepted)

openreview.net

Understanding the Effectiveness of Learning Behavioral Metrics in Deep Reinforcement Learning

Ziyan Luo

Tianwei Ni

Pierre-Luc Bacon

Doina Precup

Xujie Si

A key approach to state abstraction is approximating behavioral metrics (notably, bisimulation metrics) in the observation space and embedding these learned distances in the representation space. While promising for robustness to task-irrelevant noise shown in prior work, accurately estimating these metrics remains challenging, requiring various design choices that create gaps between theory and practice. Prior evaluations focus mainly on final returns, leaving the quality of learned metrics and the source of performance gains unclear. To systematically assess how metric learning works in deep RL, we evaluate five recent approaches. We unify them under isometric embedding, identify key design choices, and benchmark them with baselines across 20 state-based and 14 pixel-based tasks, spanning 250+ configurations with diverse noise settings. Beyond final returns, we introduce the denoising factor to quantify the encoder’s ability to filter distractions. To further isolate the effect of metric learning, we propose an isolated metric estimation setting, where the encoder is influenced solely by the metric loss. Our results show that metric learning improves return and denoising only marginally, as its benefits fade when key design choices, such as layer normalization and self-prediction loss, are incorporated into the baseline. We also find that commonly used benchmarks (e.g., grayscale videos, varying state-based Gaussian noise dimensions) add little difficulty, while Gaussian noise with random projection and pixel-based Gaussian noise remain challenging even for the best methods. Finally, we release an open-source, modular codebase to improve reproducibility and support future research on metric learning in deep RL.

2025-05-09

rl-conference.cc/RLC/2025/Conference (published)

openreview.net

TypyBench: Evaluating LLM Type Inference for Untyped Python Repositories

Honghua Dong

Jiacheng Yang

Xun Deng

Yuhe Jiang

Gennady Pekhimenko

Fan Long

Xujie Si

2025-05-01

ICML.cc/2025/Conference (poster)

openreview.net

TypyBench: Evaluating LLM Type Inference for Untyped Python Repositories

Yuhe Jiang

Xun Deng

Jiacheng Yang

Honghua Dong

Gennady Pekhimenko

Fan Long

Xujie Si

Type inference for dynamic languages like Python is a persistent challenge in software engineering. While large language models (LLMs) have shown promise in code understanding, their type inference capabilities remain underexplored. We introduce `TypyBench`, a benchmark designed to evaluate LLMs' type inference across entire Python repositories. `TypyBench` features two novel metrics: `TypeSim`, which captures nuanced semantic relationships between predicted and ground truth types, and `TypeCheck`, which assesses type consistency across codebases. Our evaluation of various LLMs on a curated dataset of 50 high-quality Python repositories reveals that, although LLMs achieve decent `TypeSim` scores, they struggle with complex nested types and exhibit significant type consistency errors. These findings suggest that future research should shift focus from improving type similarity to addressing repository-level consistency. `TypyBench` provides a foundation for this new direction, offering insights into model performance across different type complexities and usage contexts.

2025-03-05

ICLR.cc/2025/Workshop/DL4C (published)

openreview.net

Proving Olympiad Inequalities by Synergizing LLMs and Symbolic Reasoning

Zenan Li

Zhaoyu Li

Wen Tang

Xian Zhang

Yuan Yao

Xujie Si

Fan Yang

Kaiyu Yang

Xiaoxing Ma

Large language models (LLMs) can prove mathematical theorems formally by generating proof steps (a.k.a. tactics) within a proof system. However, the space of possible tactics is vast and complex, while the available training data for formal proofs is limited, posing a significant challenge to LLM-based tactic generation. To address this, we introduce a neuro-symbolic tactic generator that synergizes the mathematical intuition learned by LLMs with domain-specific insights encoded by symbolic methods. The key aspect of this integration is identifying which parts of mathematical reasoning are best suited to LLMs and which to symbolic methods. While the high-level idea of neuro-symbolic integration is broadly applicable to various mathematical problems, in this paper, we focus specifically on Olympiad inequalities (Figure 1). We analyze how humans solve these problems and distill the techniques into two types of tactics: (1) scaling, handled by symbolic methods, and (2) rewriting, handled by LLMs. In addition, we combine symbolic tools with LLMs to prune and rank the proof goals for efficient proof search. We evaluate our framework on 161 challenging inequalities from multiple mathematics competitions, achieving state-of-the-art performance and significantly outperforming existing LLM and symbolic approaches without requiring additional training data.

2025-01-22

ICLR.cc/2025/Conference (poster)

openreview.net

Library Learning Doesn’t: The Curious Case of the Single-Use “Library”

Ian Berlot-Attwell

Frank Rudzicz

Xujie Si

Advances in Large Language Models (LLMs) have spurred a wave of LLM library learning systems for mathematical reasoning. These systems aim to learn a reusable library of *tools*, such as formal Isabelle lemmas or Python programs that are tailored to a family of tasks. Many of these systems are inspired by the human structuring of knowledge into reusable and extendable concepts, but do current methods actually learn reusable libraries of tools? We study two library learning systems for mathematics which both reported increased accuracy: LEGO-Prover and TroVE. We find that function reuse is extremely infrequent on miniF2F and MATH. Our follow-up ablation experiments suggest that, rather than reuse, self-correction and self-consistency are the primary drivers of the observed performance gains.

2024-10-09

NeurIPS.cc/2024/Workshop/MATH-AI (accepted)

openreview.net

LogiCity: Advancing Neuro-Symbolic AI with Abstract Urban Simulation

Bowen Li

Zhaoyu Li

Qiwei Du

Jinqi Luo

Wenshan Wang

Yaqi Xie

Simon Stepputtis

Chen Wang

Katia P. Sycara

Pradeep Kumar Ravikumar

Alexander G. Gray

Xujie Si

Sebastian Scherer

Recent years have witnessed the rapid development of Neuro-Symbolic (NeSy) AI systems, which integrate symbolic reasoning into deep neural networks. However, most of the existing benchmarks for NeSy AI fail to provide long-horizon reasoning tasks with complex multi-agent interactions. Furthermore, they are usually constrained by fixed and simplistic logical rules over limited entities, making them far from real-world complexities. To address these crucial gaps, we introduce LogiCity, the first simulator based on customizable first-order logic (FOL) for an urban-like environment with multiple dynamic agents. LogiCity models diverse urban elements using semantic and spatial concepts, such as

2024-09-26

NeurIPS.cc/2024/Datasets_and_Benchmarks_Track (poster)

openreview.net

Code Repair with LLMs gives an Exploration-Exploitation Tradeoff

Hao Tang

Keya Hu

Jin Peng Zhou

Si Cheng Zhong

Wei-Long Zheng

Xujie Si

Kevin Ellis

2024-09-25

NeurIPS.cc/2024/Conference (poster)

openreview.net

Towards Robust Saliency Maps

Nham Le

Arie Gurfinkel

Xujie Si

Chuqin Geng

Saliency maps are one of the most popular tools to interpret the operation of a neural network: they compute input features deemed relevant to the final prediction, which are often subsets of pixels that are easily understandable by a human being. However, it is known that relying solely on human assessment to judge a saliency map method can be misleading. In this work, we propose a new neural network verification specification called saliency-robustness, which aims to use formal methods to prove a relationship between Vanilla Gradient (VG) -- a simple yet surprisingly effective saliency map method -- and the network's prediction: given a network, if an input

2024-09-05

ACML.org/2024/Conference (published)

proceedings.mlr.press

openreview.net

Chronosymbolic Learning: Efficient CHC Solving with Symbolic Reasoning and Inductive Learning

Ziyan Luo

Xujie Si

Solving Constrained Horn Clauses (CHCs) is a fundamental challenge behind a wide range of verification and analysis tasks. Data-driven approaches show great promise in improving CHC solving without the painstaking manual effort of creating and tuning various heuristics. However, a large performance gap exists between data-driven CHC solvers and symbolic reasoning-based solvers. In this work, we develop a simple but effective framework, "Chronosymbolic Learning", which unifies symbolic information and numerical data points to solve a CHC system efficiently. We also present a simple instance of Chronosymbolic Learning with a data-driven learner and a BMC-styled reasoner. Despite its great simplicity, experimental results show the efficacy and robustness of our tool. It outperforms state-of-the-art CHC solvers on a dataset consisting of 288 benchmarks, including many instances with non-linear integer arithmetic.

2024-07-17

Lecture Notes in Computer Science (published)

doi.org

arxiv.org

A Survey on Deep Learning for Theorem Proving

Zhaoyu Li

Jialiang Sun

Logan Murphy

Qidong Su

Zenan Li

Xian Zhang

Kaiyu Yang

Xujie Si

2024-07-10

colmweb.org/COLM/2024/Conference (accepted)

doi.org

openreview.net
