Xujie Si

TypyBench: Evaluating LLM Type Inference for Untyped Python Repositories

Honghua Dong

Jiacheng Yang

Xun Deng

Yuhe Jiang

Gennady Pekhimenko

Fan Long

Type inference for dynamic languages like Python is a persistent challenge in software engineering. While large language models (LLMs) have … (see more)shown promise in code understanding, their type inference capabilities remain underexplored. We introduce TypyBench, a benchmark designed to evaluate LLMs’ type inference across entire Python repositories. TypyBench features two novel metrics: TypeSim, which captures nuanced semantic relationships between predicted and ground truth types, and TypeCheck, which assesses type consistency across codebases. Our evaluation of various LLMs on a curated dataset of 50 high-quality Python repositories reveals that, although LLMs achieve decent TypeSim scores, they struggle with complex nested types and exhibit significant type consistency errors. These findings suggest that future research should shift focus from improving type similarity to addressing repository-level consistency. TypyBench provides a foundation for this new direction, offering insights into model performance across different type complexities and usage contexts. Our code and data are available at https://github.com/typybench/typybench

2025-10-06

Proceedings of the 42nd International Conference on Machine Learning (published)

proceedings.mlr.press

Learning Minimal Neural Specifications

Chuqin Geng

Zhaoyue Wang

Haolin Ye

Xujie Si

2025-07-08

Proceedings of the International Conference on Neuro-symbolic Systems (published)

proceedings.mlr.press

Understanding Behavioral Metric Learning: A Large-Scale Study on Distracting Reinforcement Learning Environments

Ziyan Luo

2025-05-09

rl-conference.cc/RLC/2025/Conference (accepted)

doi.org

openreview.net

Understanding the Effectiveness of Learning Behavioral Metrics in Deep Reinforcement Learning

Ziyan Luo

A key approach to state abstraction is approximating behavioral metrics (notably, bisimulation metrics) in the observation space, and embed … (see more)these learned distances in the representation space. While promising for robustness to task-irrelevant noise shown in prior work, accurately estimating these metrics remains challenging, requiring various design choices that create gaps between theory and practice. Prior evaluations focus mainly on final returns, leaving the quality of learned metrics and the source of performance gains unclear. To systematically assess how metric learning works in deep RL, we evaluate five recent approaches. We unify them under isometric embedding, identify key design choices, and benchmark them with baselines across 20 state-based and 14 pixel-based tasks, spanning 250+ configurations with diverse noise settings. Beyond final returns, we introduce the denoising factor to quantify the encoder’s ability to filter distractions. To further isolate the effect of metric learning, we propose an isolated metric estimation setting, where the encoder is influenced solely by the metric loss. Our results show that metric learning improves return and denoising only marginally, as its benefits fade when key design choices, such as layer normalization and self-prediction loss, are incorporated into the baseline. We also find that commonly used benchmarks (e.g., grayscale videos, varying state-based Gaussian noise dimensions) add little difficulty, while Gaussian noise with random projection and pixel-based Gaussian noise remain challenging even for the best methods. Finally, we release an open-source, modular codebase to improve reproducibility and support future research on metric learning in deep RL.

2025-05-09

rl-conference.cc/RLC/2025/Conference (published)

openreview.net

Understanding the Effectiveness of Learning Behavioral Metrics in Deep Reinforcement Learning

Ziyan Luo

A key approach to state abstraction is approximating behavioral metrics (notably, bisimulation metrics) in the observation space, and embed … (see more)these learned distances in the representation space. While promising for robustness to task-irrelevant noise shown in prior work, accurately estimating these metrics remains challenging, requiring various design choices that create gaps between theory and practice. Prior evaluations focus mainly on final returns, leaving the quality of learned metrics and the source of performance gains unclear. To systematically assess how metric learning works in deep RL, we evaluate five recent approaches. We unify them under isometric embedding, identify key design choices, and benchmark them with baselines across 20 state-based and 14 pixel-based tasks, spanning 250+ configurations with diverse noise settings. Beyond final returns, we introduce the denoising factor to quantify the encoder’s ability to filter distractions. To further isolate the effect of metric learning, we propose an isolated metric estimation setting, where the encoder is influenced solely by the metric loss. Our results show that metric learning improves return and denoising only marginally, as its benefits fade when key design choices, such as layer normalization and self-prediction loss, are incorporated into the baseline. We also find that commonly used benchmarks (e.g., grayscale videos, varying state-based Gaussian noise dimensions) add little difficulty, while Gaussian noise with random projection and pixel-based Gaussian noise remain challenging even for the best methods. Finally, we release an open-source, modular codebase to improve reproducibility and support future research on metric learning in deep RL.

2025-05-09

rl-conference.cc/RLC/2025/Conference (accepted)

openreview.net

TypyBench: Evaluating LLM Type Inference for Untyped Python Repositories

Honghua Dong

Jiacheng Yang

Xun Deng

Yuhe Jiang

Gennady Pekhimenko

Fan Long

Xujie Si

2025-05-01

ICML.cc/2025/Conference (poster)

proceedings.mlr.press

openreview.net

TypyBench: Evaluating LLM Type Inference for Untyped Python Repositories

Yuhe Jiang

Xun Deng

Jiacheng Yang

Honghua Dong

Gennady Pekhimenko

Fan Long

Xujie Si

Type inference for dynamic languages like Python is a persistent challenge in software engineering. While large language models (LLMs) have … (see more)shown promise in code understanding, their type inference capabilities remain underexplored. We introduce `TypyBench`, a benchmark designed to evaluate LLMs' type inference across entire Python repositories. `TypyBench` features two novel metrics: `TypeSim`, which captures nuanced semantic relationships between predicted and ground truth types, and `TypeCheck`, which assesses type consistency across codebases. Our evaluation of various LLMs on a curated dataset of 50 high-quality Python repositories reveals that, although LLMs achieve decent `TypeSim` scores, they struggle with complex nested types and exhibit significant type consistency errors. These findings suggest that future research should shift focus from improving type similarity to addressing repository-level consistency. `TypyBench` provides a foundation for this new direction, offering insights into model performance across different type complexities and usage contexts.

2025-03-05

ICLR.cc/2025/Workshop/DL4C (published)

openreview.net

Proving Olympiad Inequalities by Synergizing LLMs and Symbolic Reasoning

Zenan Li

Zhaoyu Li

Wen Tang

Xian Zhang

Yuan Yao

Xujie Si

Fan Yang

Kaiyu Yang

Xiaoxing Ma

Large language models (LLMs) can prove mathematical theorems formally by generating proof steps (\textit{a.k.a.} tactics) within a proof sys… (see more)tem. However, the space of possible tactics is vast and complex, while the available training data for formal proofs is limited, posing a significant challenge to LLM-based tactic generation. To address this, we introduce a neuro-symbolic tactic generator that synergizes the mathematical intuition learned by LLMs with domain-specific insights encoded by symbolic methods. The key aspect of this integration is identifying which parts of mathematical reasoning are best suited to LLMs and which to symbolic methods. While the high-level idea of neuro-symbolic integration is broadly applicable to various mathematical problems, in this paper, we focus specifically on Olympiad inequalities (Figure~1). We analyze how humans solve these problems and distill the techniques into two types of tactics: (1) scaling, handled by symbolic methods, and (2) rewriting, handled by LLMs. In addition, we combine symbolic tools with LLMs to prune and rank the proof goals for efficient proof search. We evaluate our framework on 161 challenging inequalities from multiple mathematics competitions, achieving state-of-the-art performance and significantly outperforming existing LLM and symbolic approaches without requiring additional training data.

2025-01-22

ICLR.cc/2025/Conference (poster)

openreview.net

Library Learning Doesn’t: The Curious Case of the Single-Use “Library”

Ian Berlot-Attwell

Frank Rudzicz

Xujie Si

Advances in Large Language Models (LLMs) have spurred a wave of LLM library learning systems for mathematical reasoning. These systems aim … (see more)to learn a reusable library of *tools*, such as formal Isabelle lemmas or Python programs that are tailored to a family of tasks. Many of these systems are inspired by the human structuring of knowledge into reusable and extendable concepts, but do current methods actually learn reusable libraries of tools? We study two library learning systems for mathematics which both reported increased accuracy: LEGO-Prover and TroVE. We find that function reuse is extremely infrequent on miniF2F and MATH. Our followup ablation experiments suggest that, rather than reuse, self-correction and self-consistency are the primary drivers of the observed performance gains.

2024-10-09

NeurIPS.cc/2024/Workshop/MATH-AI (accepted)

openreview.net

LogiCity: Advancing Neuro-Symbolic AI with Abstract Urban Simulation

Bowen Li

Zhaoyu Li

Qiwei Du

Jinqi Luo

Wenshan Wang

Yaqi Xie

Simon Stepputtis

Chen Wang

Katia P. Sycara

Pradeep Kumar Ravikumar

Alexander G. Gray

Xujie Si

Sebastian Scherer

Recent years have witnessed the rapid development of Neuro-Symbolic (NeSy) AI systems, which integrate symbolic reasoning into deep neural n… (see more)etworks. However, most of the existing benchmarks for NeSy AI fail to provide long-horizon reasoning tasks with complex multi-agent interactions. Furthermore, they are usually constrained by fixed and simplistic logical rules over limited entities, making them far from real-world complexities. To address these crucial gaps, we introduce LogiCity, the first simulator based on customizable first-order logic (FOL) for an urban-like environment with multiple dynamic agents. LogiCity models diverse urban elements using semantic and spatial concepts, such as

2024-09-26

NeurIPS.cc/2024/Datasets_and_Benchmarks_Track (poster)

openreview.net

Code Repair with LLMs gives an Exploration-Exploitation Tradeoff

Hao Tang

Keya Hu

Jin Peng Zhou

Si Cheng Zhong

Wei-Long Zheng

Xujie Si

Kevin Ellis

2024-09-25

NeurIPS.cc/2024/Conference (poster)

openreview.net

Towards Robust Saliency Maps

Nham Le

Arie Gurfinkel

Xujie Si

Chuqin Geng

Saliency maps are one of the most popular tools to interpret the operation of a neural network: they compute input features deemed relevant … (see more)to the final prediction, which are often subsets of pixels that are easily understandable by a human being. However, it is known that relying solely on human assessment to judge a saliency map method can be misleading. In this work, we propose a new neural network verification specification called saliency-robustness, which aims to use formal methods to prove a relationship between Vanilla Gradient (VG) -- a simple yet surprisingly effective saliency map method -- and the network's prediction: given a network, if an input

2024-09-05

ACML.org/2024/Conference (published)

proceedings.mlr.press

openreview.net

Hackathon | Building safer AI for youth mental health

Indigenous Pathfinders in AI

AI Advantage

Biography

Current Students

Publications

Hackathon | Building safer AI for youth mental health

Indigenous Pathfinders in AI

AI Advantage

Popular keywords:

Xujie Si

Biography

Current Students

Publications