Smita Krishnaswamy

Tianyang Wang

Parameter-efficient fine-tuning (PEFT) has become the standard approach for adapting large language models under limited compute and memory budgets. Although previous methods improve efficiency through low-rank updates, quantization, or heuristic budget reallocation, they often decouple the allocation of capacity from the way updates evolve during training. In this work, we introduce CTR-LoRA, a framework guided by a curvature trust region that integrates rank scheduling with stability-aware optimization. CTR-LoRA allocates parameters based on marginal utility derived from lightweight second-order proxies and constrains updates using a Fisher/Hessian-metric trust region. Experiments on multiple open-source backbones (7B-13B), evaluated on both in-distribution and out-of-distribution benchmarks, show consistent improvements over strong PEFT baselines. In addition to increased accuracy, CTR-LoRA enhances training stability, reduces memory requirements, and achieves higher throughput, positioning it on the Pareto frontier of performance and efficiency. These results highlight a principled path toward more robust and deployable PEFT.

2025-10-11

ArXiv (preprint)

Equivariant Geometric Scattering Networks via Vector Diffusion Wavelets

David R. Johnson

Rishabh Anand

Michael Perlmutter

We introduce a novel version of the geometric scattering transform for geometric graphs containing scalar and vector node features. This new scattering transform has desirable symmetries with respect to rigid-body roto-translations (i.e.,

2025-10-01

ArXiv (preprint)

HEIST: A Graph Foundation Model for Spatial Transcriptomics and Proteomics Data

Hiren Madhu

João Felipe Rocha

Tinglin Huang

Rex Ying

2025-09-25

ArXiv (preprint)

A Graph Laplacian Eigenvector-based Pre-training Method for Graph Neural Networks

Howard Dai

Nyambura Njenga

Hiren Madhu

Ryan Pellico

Ian Adelstein

The development of self-supervised graph pre-training methods is a crucial ingredient in recent efforts to design robust graph foundation models (GFMs). Structure-based pre-training methods are under-explored yet crucial for downstream applications which rely on underlying graph structure. In addition, pre-training traditional message passing GNNs to capture global and regional structure is often challenging due to the risk of oversmoothing as network depth increases. We address these gaps by proposing the Laplacian Eigenvector Learning Module (LELM), a novel pre-training module for graph neural networks (GNNs) based on predicting the low-frequency eigenvectors of the graph Laplacian. Moreover, LELM introduces a novel architecture that overcomes oversmoothing, allowing the GNN model to learn long-range interdependencies. Empirically, we show that models pre-trained via our framework outperform baseline models on downstream molecular property prediction tasks.

2025-09-02

ArXiv (preprint)

Learning Laplacian Eigenvectors: a Pre-training Method for Graph Neural Networks

Howard Dai

Nyambura Njenga

Benjamin Whitsett

Catherine Ma

Darwin Deng

Sara de Ángel

Alexandre Van Tassel

Ryan Pellico

Ian Adelstein

2025-09-02

ArXiv (preprint)

Low-dimensional embeddings of high-dimensional data

Cyril de Bodt

Alex Diaz-Papkovich

Michael Bleher

Kerstin Bunte

Corinna Coupette

Sebastian Damrich

Enrique Fita Sanmartin

Fred A. Hamprecht

Emőke-Ágnes Horvát

Dhruv Kohli

John A. Lee

Boudewijn P. F. Lelieveldt

Leland McInnes

Ian T. Nabney

Maximilian Noichl

Pavlin G. Poličar

Bastian Rieck

Guy Wolf

Gal Mishne

Dmitry Kobak

Large collections of high-dimensional data have become nearly ubiquitous across many academic fields and application domains, ranging from biology to the humanities. Since working directly with high-dimensional data poses challenges, the demand for algorithms that create low-dimensional representations, or embeddings, for data visualization, exploration, and analysis is now greater than ever. In recent years, numerous embedding algorithms have been developed, and their usage has become widespread in research and industry. This surge of interest has resulted in a large and fragmented research field that faces technical challenges alongside fundamental debates, and it has left practitioners without clear guidance on how to effectively employ existing methods. Aiming to increase coherence and facilitate future work, in this review we provide a detailed and critical overview of recent developments, derive a list of best practices for creating and using low-dimensional embeddings, evaluate popular approaches on a variety of datasets, and discuss the remaining challenges and open problems in the field.

2025-08-21

ArXiv (preprint)

Revealing dynamic temporal trajectories and underlying regulatory networks with Cflows

Alexander Tong

Manik Kuchroo

Shabarni Gupta

Aarthi Venkat

Beatriz P. San Juan

Laura Rangel

Brandon Zhu

John G. Lock

Christine L. Chaffer

While single-cell technologies provide snapshots of tumor states, building continuous trajectories and uncovering causative gene regulatory networks remains a significant challenge. We present Cflows, an AI framework that combines neural ODE networks with Granger causality to infer continuous cell state transitions and gene regulatory interactions from static scRNA-seq data. In a new 5-time point dataset capturing tumorsphere development over 30 days, Cflows reconstructs two types of trajectories leading to tumorsphere formation or apoptosis. Trajectory-based cell-of-origin analysis delineated a novel cancer stem cell profile characterized by CD44hiEPCAM+CAV1+, and uncovered a cell cycle–dependent enrichment of tumorsphere-initiating potential in G2/M or S-phase cells. Cflows uncovers ESRRA as a crucial causal driver of the tumor-forming gene regulatory network. Indeed, ESRRA inhibition significantly reduces tumor growth and metastasis in vivo. Cflows offers a powerful framework for uncovering cellular transitions and dynamic regulatory networks from static single-cell data.

2025-08-07

bioRxiv (preprint)

CellForge: Agentic Design of Virtual Cell Models

Xiangru Tang

Zhuoyun Yu

Jiapeng Chen

Yan Cui

Yanjun Shao

Weixu Wang

Fang Wu

Yuchen Zhuang

Wenqi Shi

Zhi Huang

Arman Cohan

Xihong Lin

Fabian Theis

Mark B. Gerstein

Virtual cell modeling represents an emerging frontier at the intersection of artificial intelligence and biology, aiming to predict quantities such as responses to diverse perturbations quantitatively. However, autonomously building computational models for virtual cells is challenging due to the complexity of biological systems, the heterogeneity of data modalities, and the need for domain-specific expertise across multiple disciplines. Here, we introduce CellForge, an agentic system that leverages a multi-agent framework that transforms presented biological datasets and research objectives directly into optimized computational models for virtual cells. More specifically, given only raw single-cell multi-omics data and task descriptions as input, CellForge outputs both an optimized model architecture and executable code for training virtual cell models and inference. The framework integrates three core modules: Task Analysis for presented dataset characterization and relevant literature retrieval, Method Design, where specialized agents collaboratively develop optimized modeling strategies, and Experiment Execution for automated generation of code. The agents in the Design module are separated into experts with differing perspectives and a central moderator, and have to collaboratively exchange solutions until they achieve a reasonable consensus. We demonstrate CellForge's capabilities in single-cell perturbation prediction, using six diverse datasets that encompass gene knockouts, drug treatments, and cytokine stimulations across multiple modalities. CellForge consistently outperforms task-specific state-of-the-art methods. Overall, CellForge demonstrates how iterative interaction between LLM agents with differing perspectives provides better solutions than directly addressing a modeling challenge. Our code is publicly available at https://github.com/gersteinlab/CellForge.

2025-08-04

ArXiv (preprint)
