Jianan Zhao

arxiv.org

GeneZip: Region-Aware Compression for Long Context DNA Modeling

Hongyu Guo

Genomic sequences span billions of base pairs (bp), posing a fundamental challenge for genome-scale foundation models. Existing approaches l… (voir plus)argely sidestep this barrier by either scaling relatively small models to long contexts or relying on heavy multi-GPU parallelism. Here we introduce GeneZip, a DNA compression model that leverages a key biological prior: genomic information is highly imbalanced. Coding regions comprise only a small fraction (about 2 percent) yet are information-dense, whereas most non-coding sequence is comparatively information-sparse. GeneZip couples HNet-style dynamic routing with a region-aware compression-ratio objective, enabling adaptive allocation of representation budget across genomic regions. As a result, GeneZip learns region-aware compression and achieves 137.6x compression with only 0.31 perplexity increase. On downstream long-context benchmarks, GeneZip achieves comparable or better performance on contact map prediction, expression quantitative trait loci prediction, and enhancer-target gene prediction. By reducing effective sequence length, GeneZip unlocks simultaneous scaling of context and capacity: compared to the prior state-of-the-art model JanusDNA, it enables training models 82.6x larger at 1M-bp context, supporting a 636M-parameter GeneZip model at 1M-bp context. All experiments in this paper can be trained on a single A100 80GB GPU.

2026-02-18

arXiv (prépublication)

Fast Proteome-Scale Protein Interaction Retrieval via Residue-Level Factorization

Narendra Chaudhary

Qian Cong

Jian Zhou

Sanchit Misra

Protein-protein interactions (PPIs) are mediated at the residue level. Most sequence-based PPI models consider residue-residue interactions … (voir plus)across two proteins, which can yield accurate interaction scores but are too slow to scale. At proteome scale, identifying candidate PPIs requires evaluating nearly *all possible protein pairs*. For

2025-12-31

International Conference on Learning Representations (Accept (Poster))

Overcoming Long-Context Limitations of State-Space Models via Context-Dependent Sparse Attention

Efficient long-context modeling remains a critical challenge for natural language processing (NLP), as the time complexity of the predominan… (voir plus)t Transformer architecture scales quadratically with the sequence length. While state-space models (SSMs) offer alternative sub-quadratic solutions, they struggle to capture long-range dependencies effectively. In this work, we focus on analyzing and improving the long-context modeling capabilities of SSMs. We show that the widely used synthetic task, associative recall, which requires a model to recall a value associated with a single key without context, insufficiently represents the complexities of real-world long-context modeling. To address this limitation, we extend the associative recall to a novel synthetic task, \emph{joint recall}, which requires a model to recall the value associated with a key given in a specified context. Theoretically, we prove that SSMs do not have the expressiveness to solve multi-query joint recall in sub-quadratic time complexity. To resolve this issue, we propose a solution based on integrating SSMs with Context-Dependent Sparse Attention (CDSA), which has the expressiveness to solve multi-query joint recall with sub-quadratic computation. To bridge the gap between theoretical analysis and real-world applications, we propose locality-sensitive Hashing Attention with sparse Key Selection (HAX), which instantiates the theoretical solution and is further tailored to natural language domains. Extensive experiments on both synthetic and real-world long-context benchmarks show that HAX consistently outperforms SSM baselines and SSMs integrated with context-independent sparse attention (CISA).

2025-09-17

NeurIPS.cc/2025/Conference (poster)

Fully-Inductive Node Classification on Arbitrary Graphs

Hesham Mostafa

Michael Bronstein

One fundamental challenge in graph machine learning is generalizing to new graphs. Many existing methods following the inductive setup can g… (voir plus)eneralize to test graphs with new structures, but assuming the feature and label spaces remain the same as the training ones. This paper introduces a fully-inductive setup, where models should perform inference on arbitrary test graphs with new structures, feature and label spaces. We propose GraphAny as the first attempt at this challenging setup. GraphAny models inference on a new graph as an analytical solution to a LinearGNN, which can be naturally applied to graphs with any feature and label spaces. To further build a stronger model with learning capacity, we fuse multiple LinearGNN predictions with learned inductive attention scores. Specifically, the attention module is carefully parameterized as a function of the entropy-normalized distance features between pairs of LinearGNN predictions to ensure generalization to new graphs. Empirically, GraphAny trained on a single Wisconsin dataset with only 120 labeled nodes can generalize to 30 new graphs with an average accuracy of 67.26%, surpassing not only all inductive baselines, but also strong transductive methods trained separately on each of the 30 test graphs.

2025-01-21

ICLR.cc/2025/Conference (poster)

Cell ontology guided transcriptome foundation model

Manqi Zhou

Boyu Han

Transcriptome foundation models (TFMs) hold great promises of deciphering the transcriptomic language that dictate diverse cell functions by… (voir plus) self-supervised learning on large-scale single-cell gene expression data, and ultimately unraveling the complex mechanisms of human diseases. However, current TFMs treat cells as independent samples and ignore the taxonomic relationships between cell types, which are available in cell ontology graphs. We argue that effectively leveraging this ontology information during the TFM pre-training can improve learning biologically meaningful gene co-expression patterns while preserving TFM as a general purpose foundation model for downstream zero-shot and fine-tuning tasks. To this end, we present **s**ingle **c**ell, **Cell**-**o**ntology guided TFM (scCello). We introduce cell-type coherence loss and ontology alignment loss, which are minimized along with the masked gene expression prediction loss during the pre-training. The novel loss component guide scCello to learn the cell-type-specific representation and the structural relation between cell types from the cell ontology graph, respectively. We pre-trained scCello on 22 million cells from CellxGene database leveraging their cell-type labels mapped to the cell ontology graph from Open Biological and Biomedical Ontology Foundry. Our TFM demonstrates competitive generalization and transferability performance over the existing TFMs on biologically important tasks including identifying novel cell types of unseen cells, prediction of cell-type-specific marker genes, and cancer drug responses. Source code and model weights are available at https://github.com/DeepGraphLearning/scCello.

2024-12-10

Neural Information Processing Systems (Accept (spotlight))

GraphText: Graph Reasoning in Text Space

Jianan Zhao

Le Zhuo

Yikang Shen

Meng Qu

Kai Liu

Michael M. Bronstein

Zhaocheng Zhu

2024-10-09

NeurIPS.cc/2024/Workshop/AFM (poster)

The 1st International Workshop on Graph Foundation Models (GFM).

Haitao Mao

Jianan Zhao

Xiaoxin He

Zhikai Chen

Qian Huang

Zhaocheng Zhu

Micheal Bronstein

Xavier Bresson

Bryan Hooi

Haiyang Zhang

Xianfeng Tang

Luo Chen

Jiliang Tang

Foundation models such as GPT-4 for natural language processing (NLP), Flamingo for computer vision (CV), have set new benchmarks in AI by d… (voir plus)elivering state-of-the-art results across various tasks with minimal task-specific data. Despite their success, the application of these models to the graph domain is challenging due to the relational nature of graph-structured data. To address this gap, we propose the Graph Foundation Model (GFM) Workshop, the first workshop for GFMs, dedicated to exploring the adaptation and development of foundation models specifically designed for graph data. The GFM workshop focuses on two critical questions: (1) How can the underlying capabilities of existing foundation models be effectively applied to graph data? (2) What foundational principles should guide the creation of models tailored to the graph domain? Through a curated set of panel sections, keynote talks, and paper presentations, our workshop intends to catalyze innovative approaches and theoretical frameworks for Graph Foundation Models (GFMs). We target a broad audience, encompassing researchers, practitioners, and students, and aim to lay the groundwork for the next wave of breakthroughs in integrating graph data with foundation models.

2024-05-12

Companion Proceedings of the ACM on Web Conference 2024 (publié)

Learning on Large-Scale Text-Attributed Graphs via Variational Inference

Jianan Zhao

Meng Qu

Chaozhuo Li

Hao Yan

Qian Liu

Rui Li

Xing Xie

This paper studies learning on text-attributed graphs (TAGs), where each node is associated with a text description. An ideal solution for s… (voir plus)uch a problem would be integrating both the text and graph structure information with large language models and graph neural networks (GNNs). However, the problem becomes very challenging when graphs are large due to the high computational complexity brought by training large language models and GNNs together. In this paper, we propose an efficient and effective solution to learning on large text-attributed graphs by fusing graph structure and language learning with a variational Expectation-Maximization (EM) framework, called GLEM. Instead of simultaneously training large language models and GNNs on big graphs, GLEM proposes to alternatively update the two modules in the E-step and M-step. Such a procedure allows training the two modules separately while simultaneously allowing the two modules to interact and mutually enhance each other. Extensive experiments on multiple data sets demonstrate the efficiency and effectiveness of the proposed approach.

2023-01-31

ICLR.cc/2023/Conference (notable)