Portrait of Xujie Si

Xujie Si

Affiliate Member
Canada CIFAR AI Chair
Assistant Professor, University of Toronto, Department of Computer Science
Research Topics
Learning to Program
Reasoning
Representation Learning

Biography

Xujie Si is an assistant professor in the Department of Computer Science, University of Toronto. He is also an affiliate member at Vector Institute and an affiliate member at Mila – Quebec Artificial Intelligence Institute, where he holds a Canada CIFAR AI Chair.

Si received his PhD from the University of Pennsylvania in 2020, his master's degree from Vanderbilt University, and his bachelor's degree (with Honors) from Nankai University.

Si’s research lies at the intersection of programming languages and AI. He is generally interested in developing learning-based techniques to help programmers build better software with less effort, integrating logic programming with differentiable learning systems for interpretable and scalable reasoning, and leveraging programming abstractions for reliable and data-efficient learning.

His work has been honoured with an ACM Special Interest Group on Programming Languages (SIGPLAN) Distinguished Paper Award, as well as highlighted at top conferences on programming languages and machine learning.

Current Students

Postdoctorate - McGill University
Principal supervisor :
PhD - McGill University
PhD - McGill University
Co-supervisor :
Master's Research - McGill University
Master's Research - McGill University

Publications

TypyBench: Evaluating LLM Type Inference for Untyped Python Repositories
Yuhe Jiang
Xun Deng
Jiacheng Yang
Honghua Dong
Gennady Pekhimenko
Fan Long
Type inference for dynamic languages like Python is a persistent challenge in software engineering. While large language models (LLMs) have … (see more)shown promise in code understanding, their type inference capabilities remain underexplored. We introduce `TypyBench`, a benchmark designed to evaluate LLMs' type inference across entire Python repositories. `TypyBench` features two novel metrics: `TypeSim`, which captures nuanced semantic relationships between predicted and ground truth types, and `TypeCheck`, which assesses type consistency across codebases. Our evaluation of various LLMs on a curated dataset of 50 high-quality Python repositories reveals that, although LLMs achieve decent `TypeSim` scores, they struggle with complex nested types and exhibit significant type consistency errors. These findings suggest that future research should shift focus from improving type similarity to addressing repository-level consistency. `TypyBench` provides a foundation for this new direction, offering insights into model performance across different type complexities and usage contexts.
Proving Olympiad Inequalities by Synergizing LLMs and Symbolic Reasoning
Zenan Li
Zhaoyu Li
Wen Tang
Xian Zhang
Yuan Yao
Fan Yang
Kaiyu Yang
Xiaoxing Ma
Large language models (LLMs) can prove mathematical theorems formally by generating proof steps (\textit{a.k.a.} tactics) within a proof sys… (see more)tem. However, the space of possible tactics is vast and complex, while the available training data for formal proofs is limited, posing a significant challenge to LLM-based tactic generation. To address this, we introduce a neuro-symbolic tactic generator that synergizes the mathematical intuition learned by LLMs with domain-specific insights encoded by symbolic methods. The key aspect of this integration is identifying which parts of mathematical reasoning are best suited to LLMs and which to symbolic methods. While the high-level idea of neuro-symbolic integration is broadly applicable to various mathematical problems, in this paper, we focus specifically on Olympiad inequalities (Figure~1). We analyze how humans solve these problems and distill the techniques into two types of tactics: (1) scaling, handled by symbolic methods, and (2) rewriting, handled by LLMs. In addition, we combine symbolic tools with LLMs to prune and rank the proof goals for efficient proof search. We evaluate our framework on 161 challenging inequalities from multiple mathematics competitions, achieving state-of-the-art performance and significantly outperforming existing LLM and symbolic approaches without requiring additional training data.
Library Learning Doesn’t: The Curious Case of the Single-Use “Library”
Ian Berlot-Attwell
Frank Rudzicz
Advances in Large Language Models (LLMs) have spurred a wave of LLM library learning systems for mathematical reasoning. These systems aim … (see more)to learn a reusable library of *tools*, such as formal Isabelle lemmas or Python programs that are tailored to a family of tasks. Many of these systems are inspired by the human structuring of knowledge into reusable and extendable concepts, but do current methods actually learn reusable libraries of tools? We study two library learning systems for mathematics which both reported increased accuracy: LEGO-Prover and TroVE. We find that function reuse is extremely infrequent on miniF2F and MATH. Our followup ablation experiments suggest that, rather than reuse, self-correction and self-consistency are the primary drivers of the observed performance gains.
LogiCity: Advancing Neuro-Symbolic AI with Abstract Urban Simulation
Bowen Li
Zhaoyu Li
Qiwei Du
Jinqi Luo
Wenshan Wang
Yaqi Xie
Simon Stepputtis
Chen Wang
Katia P. Sycara
Pradeep Kumar Ravikumar
Alexander G. Gray
Sebastian Scherer
Recent years have witnessed the rapid development of Neuro-Symbolic (NeSy) AI systems, which integrate symbolic reasoning into deep neural n… (see more)etworks. However, most of the existing benchmarks for NeSy AI fail to provide long-horizon reasoning tasks with complex multi-agent interactions. Furthermore, they are usually constrained by fixed and simplistic logical rules over limited entities, making them far from real-world complexities. To address these crucial gaps, we introduce LogiCity, the first simulator based on customizable first-order logic (FOL) for an urban-like environment with multiple dynamic agents. LogiCity models diverse urban elements using semantic and spatial concepts, such as
Code Repair with LLMs gives an Exploration-Exploitation Tradeoff
Hao Tang
Keya Hu
Jin Peng Zhou
Si Cheng Zhong
Wei-Long Zheng
Kevin Ellis
Towards Robust Saliency Maps
Nham Le
Arie Gurfinkel
Chuqin Geng
Saliency maps are one of the most popular tools to interpret the operation of a neural network: they compute input features deemed relevant … (see more)to the final prediction, which are often subsets of pixels that are easily understandable by a human being. However, it is known that relying solely on human assessment to judge a saliency map method can be misleading. In this work, we propose a new neural network verification specification called saliency-robustness, which aims to use formal methods to prove a relationship between Vanilla Gradient (VG) -- a simple yet surprisingly effective saliency map method -- and the network's prediction: given a network, if an input
Chronosymbolic Learning: Efficient CHC Solving with Symbolic Reasoning and Inductive Learning
Ziyan Luo
Solving Constrained Horn Clauses (CHCs) is a fundamental challenge behind a wide range of verification and analysis tasks. Data-driven appro… (see more)aches show great promise in improving CHC solving without the painstaking manual effort of creating and tuning various heuristics. However, a large performance gap exists between data-driven CHC solvers and symbolic reasoning-based solvers. In this work, we develop a simple but effective framework,"Chronosymbolic Learning", which unifies symbolic information and numerical data points to solve a CHC system efficiently. We also present a simple instance of Chronosymbolic Learning with a data-driven learner and a BMC-styled reasoner. Despite its great simplicity, experimental results show the efficacy and robustness of our tool. It outperforms state-of-the-art CHC solvers on a dataset consisting of 288 benchmarks, including many instances with non-linear integer arithmetics.
A Survey on Deep Learning for Theorem Proving
Zhaoyu Li
Jialiang Sun
Logan Murphy
Qidong Su
Zenan Li
Xian Zhang
Kaiyu Yang
Autoformalizing Euclidean Geometry
Logan Murphy
Kaiyu Yang
Jialiang Sun
Zhaoyu Li
Animashree Anandkumar
Autoformalization involves automatically translating informal math into formal theorems and proofs that are machine-verifiable. Euclidean ge… (see more)ometry provides an interesting and controllable domain for studying autoformalization. In this paper, we introduce a neuro-symbolic framework for autoformalizing Euclidean geometry, which combines domain knowledge, SMT solvers, and large language models (LLMs). One challenge in Euclidean geometry is that informal proofs rely on diagrams, leaving gaps in texts that are hard to formalize. To address this issue, we use theorem provers to fill in such diagrammatic information automatically, so that the LLM only needs to autoformalize the explicit textual steps, making it easier for the model. We also provide automatic semantic evaluation for autoformalized theorem statements. We construct LeanEuclid, an autoformalization benchmark consisting of problems from Euclid’s Elements and the UniGeo dataset formalized in the Lean proof assistant. Experiments with GPT-4 and GPT-4V show the capability and limitations of state-of-the-art LLMs on autoformalizing geometry problems. The data and code are available at https://github.com/loganrjmurphy/LeanEuclid.
APPL: A Prompt Programming Language for Harmonious Integration of Programs and Large Language Model Prompts
Honghua Dong
Qidong Su
Yubo Gao
Zhaoyu Li
Yangjun Ruan
Gennady G. Pekhimenko
Chris J. Maddison
Large Language Models (LLMs) have become increasingly capable of handling diverse tasks with the aid of well-crafted prompts and integration… (see more) of external tools, but as task complexity rises, the workflow involving LLMs can be complicated and thus challenging to implement and maintain. To address this challenge, we propose APPL, A Prompt Programming Language that acts as a bridge between computer programs and LLMs, allowing seamless embedding of prompts into Python functions, and vice versa. APPL provides an intuitive and Python-native syntax, an efficient parallelized runtime with asynchronous semantics, and a tracing module supporting effective failure diagnosis and replaying without extra costs. We demonstrate that APPL programs are intuitive, concise, and efficient through three representative scenarios: Chain-of-Thought with self-consistency (CoT-SC), ReAct tool use agent, and multi-agent chat. Experiments on three parallelizable workflows further show that APPL can effectively parallelize independent LLM calls, with a significant speedup ratio that almost matches the estimation.
Learning Minimal NAP Specifications for Neural Network Verification
Chuqin Geng
Zhaoyue Wang
Haolin Ye
Saifei Liao
Specifications play a crucial role in neural network verification. They define the precise input regions we aim to verify, typically represe… (see more)nted as L-infinity norm balls. While recent research suggests using neural activation patterns (NAPs) as specifications for verifying unseen test set data, it focuses on computing the most refined NAPs, often limited to very small regions in the input space. In this paper, we study the following problem: Given a neural network, find a minimal (coarsest) NAP that is sufficient for formal verification of the network's robustness. Finding the minimal NAP specification not only expands verifiable bounds but also provides insights into which neurons contribute to the model's robustness. To address this problem, we propose several exact and approximate approaches. Our exact approaches leverage the verification tool to find minimal NAP specifications in either a deterministic or statistical manner. Whereas the approximate methods efficiently estimate minimal NAPs using adversarial examples and local gradients, without making calls to the verification tool. This allows us to inspect potential causal links between neurons and the robustness of state-of-the-art neural networks, a task for which existing verification frameworks fail to scale. Our experimental results suggest that minimal NAP specifications require much smaller fractions of neurons compared to the most refined NAP specifications, yet they can significantly expand the verifiable boundaries to several orders of magnitude larger.
SAT-DIFF: A Tree Diffing Framework Using SAT Solving
Chuqin Geng
Haolin Ye
Yihan Zhang
Brigitte Pientka
Computing differences between tree-structured data is a critical but challenging problem in software analysis. In this paper, we propose a n… (see more)ovel tree diffing approach called SatDiff, which reformulates the structural diffing problem into a MaxSAT problem. By encoding the necessary transformations from the source tree to the target tree, SatDiff generates correct, minimal, and type safe low-level edit scripts with formal guarantees. We then synthesize concise high-level edit scripts by effectively merging low-level edits in the appropriate topological order. Our empirical results demonstrate that SatDiff outperforms existing heuristic-based approaches by a significant margin in terms of conciseness while maintaining a reasonable runtime.