Portrait of Smita Krishnaswamy

Smita Krishnaswamy

Affiliate Member
Associate Professor, Yale University
Université de Montréal
Yale
Research Topics
Representation Learning
Deep Learning
Geometric Deep Learning
Spectral Learning
Manifold Learning
Computational Biology
Data Geometry
AI in Health
Brain-Computer Interfaces
Generative Models
Molecular Modeling
Computational Neuroscience
Data Sparsity
Graph Neural Networks
Cognitive Science
Data Science
Dynamical Systems
Information Theory

Biography

Our lab develops foundational mathematical methods for machine learning and deep learning that integrate graph-based learning, signal processing, information theory, data geometry and topology, optimal transport, and dynamics modeling. These methods support exploratory analysis, scientific inference, interpretation, and hypothesis generation on large biomedical datasets, ranging from single-cell data to brain imaging to molecular structural datasets, drawn from neuroscience, psychology, stem cell biology, cancer biology, healthcare, and biochemistry. Our work has been instrumental in learning dynamic trajectories from static snapshot data, data denoising, visualization, network inference, molecular structure modeling, and more.

Current Students

Research Collaborator - Yale University
Principal supervisor:

Publications

Human learning of noninvasive brain–computer interfaces via manifold geometry
Erica L. Busch
E. Chandra Fincke
Nicholas B. Turk‐Browne
ImmunoFoundation: A Multimodal Foundation Model for Immunogenicity Prediction and Peptide Optimization
João Felipe Rocha
Hiren Madhu
Jenny Yongjia Liu
Apurva Mishra
Chen Liu
Rishabh Anand
Rex Ying
Peptide immunogenicity, whether a peptide presented by an MHC molecule elicits a T-cell response, is central to designing vaccines, cancer immunotherapy, and therapeutic proteins. Existing tools rely on a single modality, such as peptide sequences or peptide-MHC interactions, and often ignore the T-cell response that depends on the TCR-peptide-MHC complex (TCR-pMHC) and its three-dimensional structure. The scarcity of labeled TCR-pMHC data with known structures makes it difficult to build a model that captures how all components of the TCR-pMHC contribute to immunogenicity. However, a foundation model of TCR-pMHCs can learn transferable representations across components, which can be adapted to immunogenicity, binding, and TCR specificity tasks, even with limited labeled data. We introduce ImmunoFoundation, a self-supervised multimodal backbone for protein-complex representation, fine-tuned for peptide-MHC immunogenicity. The model couples an ESM-2 sequence encoder with a graph transformer over structure, fused via cross-modal attention. Pretraining follows a curriculum that progressively introduces structural inductive bias. ImmunoFoundation outperforms prior multimodal class-I predictors on cancer neoepitope and infectious-disease tasks.
scShapeBench: Discovering geometry from high dimensional scRNAseq data
Andrew J. Steindl
João Felipe Rocha
Brian Tshilengi Di Bassinga
Zachary Warren
Shabarni Gupta
Leire Torices
Daniel Neumann
Timothy J. Mann
Ihuan Gunawan
Dhananjay Bhaskar
John G. Lock
Christine L. Chaffer
High-dimensional point cloud data arise across many scientific domains, especially single-cell biology. The shapes or topologies of these datasets determine the types of information that can be extracted. For example, clustered data supports cell-type identification, trajectory structures support transition analysis, and archetypal structures capture continua of cellular behaviors. Existing analysis pipelines often assume a specific shape. The standard Seurat pipeline combines UMAP visualization with Louvain clustering and therefore assumes clustered data, while tools such as Monocle and SPADE assume tree-like structures, and flow-based models such as MIOFlow and Conditional Flow Matching target trajectories. Choosing which pipeline to apply is therefore often left to bioinformaticians who visually inspect datasets before selecting an analysis strategy. With the rise of agentic AI scientists, automating shape detection is increasingly important for selecting downstream analysis pipelines. To address this problem, we introduce scShapeBench, a benchmark dataset for shape detection containing both synthetic and expert-annotated single-cell datasets. Synthetic datasets are sampled from ground-truth skeleton graphs with controlled variance. Real single-cell datasets are curated from diverse sources and annotated by experts into four categories: clusters, single trajectory, multi-branching, and archetypal. We additionally introduce scReebTower, a baseline method that uses diffusion geometry to extract Reeb graphs and connect visualization with pipeline selection. We provide topology-aware evaluation metrics and compare scReebTower against PAGA and Mapper on synthetic and real data. Our results indicate that scReebTower outperforms existing baselines. Overall, our contributions span benchmarks, evaluation metrics, and a baseline for automated shape detection in single-cell data.
RNAGenScape: Property-Guided, Optimized Generation of mRNA Sequences with Manifold Langevin Dynamics.
Danqi Liao
Chen Liu
Xingzhi Sun
Dié Tang
Haochen Wang
Scott Youlten
Srikar Krishna Gopinath
Haejeong Lee
Ethan C. Strayer
Antonio J. Giraldez
Generating property-optimized mRNA sequences is central to applications such as vaccine design and protein replacement therapy, but remains challenging due to limited data, complex sequence-function relationships, and the narrow space of biologically viable sequences. Generative methods that drift away from the data manifold can yield sequences that fail to fold, translate poorly, or are otherwise nonfunctional. We present RNAGenScape, a property-guided manifold Langevin dynamics framework for mRNA sequence generation that operates directly on a learned manifold of real data. By performing iterative local optimization constrained to this manifold, RNAGenScape preserves biological viability, accesses reliable guidance, and avoids excursions into nonfunctional regions of the ambient sequence space. The framework integrates three components: (1) an autoencoder jointly trained with a property predictor to learn a property-organized latent manifold, (2) a denoising autoencoder that projects updates back onto the manifold, and (3) a property-guided Langevin dynamics procedure that performs optimization along the manifold. Across three real-world mRNA datasets spanning two orders of magnitude in size, RNAGenScape increases median property gain by up to 148% and success rate by up to 30% while ensuring biological viability of generated sequences, and achieves competitive inference efficiency relative to existing generative approaches.
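The loop described in this abstract (take a property-guided Langevin step, then project back onto the manifold) can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and none of it is the RNAGenScape implementation: the "manifold" is the unit circle, `project_to_circle` stands in for the denoising autoencoder, and `property_gradient` stands in for the gradient of the learned property predictor, here the toy property f(z) = z[0].

```python
import math
import random


def project_to_circle(z):
    """Hypothetical stand-in for the denoising autoencoder: snap z onto the unit circle."""
    norm = math.hypot(z[0], z[1])
    return (z[0] / norm, z[1] / norm)


def property_gradient(z):
    """Gradient of the toy property f(z) = z[0]; a real predictor would be learned."""
    return (1.0, 0.0)


def manifold_langevin(z, steps=200, step_size=0.05, temperature=1e-3, seed=0):
    """Run projected Langevin ascent on the toy property, staying on the circle."""
    rng = random.Random(seed)
    noise_scale = math.sqrt(2.0 * step_size * temperature)
    for _ in range(steps):
        g = property_gradient(z)
        # Langevin proposal: gradient drift plus Gaussian noise in the ambient space.
        proposal = (
            z[0] + step_size * g[0] + noise_scale * rng.gauss(0.0, 1.0),
            z[1] + step_size * g[1] + noise_scale * rng.gauss(0.0, 1.0),
        )
        # Projection keeps every iterate on the manifold.
        z = project_to_circle(proposal)
    return z
```

Starting from `(0.0, 1.0)`, the iterate drifts along the circle toward the point that maximizes the toy property while never leaving the circle, which mirrors the abstract's point that optimization is constrained to the data manifold rather than the ambient space.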
MIOFlow 2.0: A unified framework for inferring cellular stochastic dynamics from single cell and spatial transcriptomics data
Xingzhi Sun
João Felipe Rocha
Brett Phelan
Dhananjay Bhaskar
Yanlei Zhang
D. S. Magruder
Ke Xu
Oluwadamilola Fasina
Mark Gerstein
Natalia Ivanova
Christine L. Chaffer
Understanding cellular trajectories via time-resolved single-cell transcriptomics is vital for studying development, regeneration, and disease. A key challenge is inferring continuous trajectories from discrete snapshots. Biological complexity stems from stochastic cell fate decisions, temporal proliferation changes, and spatial environmental influences. Current methods often use deterministic interpolations treating cells in isolation, failing to capture the probabilistic branching, population shifts, and niche-dependent signaling driving real biological processes. We introduce Manifold Interpolating Optimal-Transport Flow (MIOFlow) 2.0. This framework learns biologically informed cellular trajectories by integrating manifold learning, optimal transport, and neural differential equations. It models three core processes: (1) stochasticity and branching via Neural Stochastic Differential Equations; (2) non-conservative population changes using a learned growth-rate model initialized with unbalanced optimal transport; and (3) environmental influence through a joint latent space unifying gene expression with spatial features like local cell type composition and signaling. By operating in a PHATE-distance matching autoencoder latent space, MIOFlow 2.0 ensures trajectories respect the data's intrinsic geometry. Empirical comparisons show expressive trajectory learning via neural differential equations outperforms existing generative models, including simulation-free flow matching. Validated on synthetic datasets, embryoid body differentiation, and spatially resolved axolotl brain regeneration, MIOFlow 2.0 improves trajectory accuracy and reveals hidden drivers of cellular transitions, like specific signaling niches. MIOFlow 2.0 thus bridges single-cell and spatial transcriptomics to uncover tissue-scale trajectories.
DYMAG: Rethinking Message Passing Using Dynamical-systems-based Waveforms
Dhananjay Bhaskar
Xingzhi Sun
Yanlei Zhang
Charles Xu
Oluwadamilola Fasina
Michael Perlmutter
We present DYMAG, a graph neural network based on a novel form of message aggregation. Standard message-passing neural networks, which often aggregate local neighbors via mean-aggregation, can be regarded as convolving with a simple rectangular waveform which is non-zero only on 1-hop neighbors of every vertex. Here, we go beyond such local averaging. We will convolve the node features with more sophisticated waveforms generated using dynamics such as the heat equation, wave equation, and the Sprott model (an example of chaotic dynamics). Furthermore, we use snapshots of these dynamics at different time points to create waveforms at many effective scales. Theoretically, we show that these dynamic waveforms can capture salient information about the graph, including connected components, connectivity, and cycle structures. Empirically, we test DYMAG on both real and synthetic benchmarks to establish that DYMAG outperforms baseline models on recovery of graph persistence, generating parameters of random graphs, as well as property prediction for proteins, molecules and materials. Our code is available at https://github.com/KrishnaswamyLab/DYMAG.
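The idea of using heat-equation snapshots at several time points as multi-scale node features can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the DYMAG code (see the repository above for that): the function name `heat_snapshots`, the adjacency-list input, and the forward-Euler integrator with step `dt` are all choices made here for brevity.

```python
def heat_snapshots(adj, x, times, dt=0.01):
    """Integrate du/dt = -L u on an undirected graph and return u at each time.

    adj   : list of neighbour lists, one per node (hypothetical input format)
    x     : one scalar feature per node, used as the initial condition u(0)
    times : non-negative time points at which to record a snapshot
    dt    : forward-Euler step; must be small relative to 2 / (max degree * 2)
    """
    u = [float(v) for v in x]
    snapshots = []
    steps_done = 0
    for target in sorted(times):
        while steps_done < round(target / dt):
            # (L u)_i = sum over neighbours j of (u_i - u_j), with L = D - A.
            lap_u = [sum(u[i] - u[j] for j in adj[i]) for i in range(len(u))]
            u = [ui - dt * li for ui, li in zip(u, lap_u)]
            steps_done += 1
        snapshots.append(list(u))
    return snapshots
```

For a node `i`, the values `[snap[i] for snap in snapshots]` form a feature vector whose early entries reflect only nearby nodes and whose late entries reflect the whole connected component, in contrast to a single 1-hop mean.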
HypRAG: Hyperbolic Dense Retrieval for Retrieval Augmented Generation
Hiren Madhu
Ngoc Bui
Ali Maatouk
Leandros Tassiulas
Menglin Yang
Sukanta Ganguly
Kiran Srinivasan
Rex Ying
Embedding geometry plays a fundamental role in retrieval quality, yet dense retrievers for retrieval-augmented generation (RAG) remain largely confined to Euclidean space. However, natural language exhibits hierarchical structure from broad topics to specific entities that Euclidean embeddings fail to preserve, causing semantically distant documents to appear spuriously similar and increasing hallucination risk. To address these limitations, we introduce hyperbolic dense retrieval, developing two model variants in the Lorentz model of hyperbolic space: HyTE-FH, a fully hyperbolic transformer, and HyTE-H, a hybrid architecture projecting pre-trained Euclidean embeddings into hyperbolic space. To prevent representational collapse during sequence aggregation, we introduce the Outward Einstein Midpoint, a geometry-aware pooling operator that provably preserves hierarchical structure. On MTEB, HyTE-FH outperforms equivalent Euclidean baselines, while on RAGBench, HyTE-H achieves up to 29% gains over Euclidean baselines in context relevance and answer relevance using substantially smaller models than current state-of-the-art retrievers. Our analysis also reveals that hyperbolic representations encode document specificity through norm-based separation, with over 20% radial increase from general to specific concepts, a property absent in Euclidean embeddings, underscoring the critical role of geometric inductive bias in faithful RAG systems.
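For readers unfamiliar with the Lorentz model mentioned here, the standard geometry (curvature -1) fits in a short sketch. It shows only the textbook lift onto the hyperboloid and the geodesic distance; it does not implement HyTE-FH, HyTE-H, or the Outward Einstein Midpoint, and the function names are chosen here for illustration.

```python
import math


def lift_to_hyperboloid(v):
    """Map a Euclidean vector v to the point (sqrt(1 + |v|^2), v) on the unit hyperboloid."""
    return [math.sqrt(1.0 + sum(c * c for c in v))] + list(v)


def lorentz_inner(x, y):
    """Lorentzian inner product: minus the time components plus the spatial dot product."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))


def lorentz_distance(x, y):
    """Geodesic distance arccosh(-<x, y>_L), clamped at 1 to absorb rounding error."""
    return math.acosh(max(1.0, -lorentz_inner(x, y)))
```

Because the time coordinate grows with the norm of `v`, points far from the origin are also far from each other, which is the room that hierarchical, tree-like structure needs; a retriever in this geometry would rank documents by `lorentz_distance` rather than by cosine similarity.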
Dispersion Loss Counteracts Embedding Condensation and Improves Generalization in Small Language Models
Chen Liu
Xingzhi Sun
Xi Xiao
Alexandre Van Tassel
Ke Xu
Kristof Reimann
Danqi Liao
Mark B. Gerstein
Tianyang Wang
Xiao Wang
Large language models (LLMs) achieve remarkable performance through ever-increasing parameter counts, but scaling incurs steep computational costs. To better understand LLM scaling, we study representational differences between LLMs and their smaller counterparts, with the goal of replicating the representational qualities of larger models in the smaller models. We observe a geometric phenomenon which we term
Self-Supervised Visual Prompting for Cross-Domain Road Damage Detection
Xi Xiao
Zhuxuanzi Wang
Mingqiao Mo
Chen Liu
Chenrui Ma
Yanshu Li
Xiao Wang
Tianyang Wang
The deployment of automated pavement defect detection is often hindered by poor cross-domain generalization. Supervised detectors achieve strong in-domain accuracy but require costly re-annotation for new environments, while standard self-supervised methods capture generic features and remain vulnerable to domain shift. We propose a self-supervised framework that visually probes target domains without labels. The framework introduces a Self-supervised Prompt Enhancement Module (SPEM), which derives defect-aware prompts from unlabeled target data to guide a frozen ViT backbone, and a Domain-Aware Prompt Alignment (DAPA) objective, which aligns prompt-conditioned source and target representations. Experiments on four challenging benchmarks show that the framework consistently outperforms strong supervised, self-supervised, and adaptation baselines, achieving robust zero-shot transfer, improved resilience to domain variations, and high data efficiency in few-shot adaptation. These results highlight self-supervised prompting as a practical direction for building scalable and adaptive visual inspection systems. Source code is publicly available: https://github.com/xixiaouab/PROBE/tree/main
Graph topological property recovery with heat and wave dynamics-based features on graphs
Dhananjay Bhaskar
Yanlei Zhang
Charles Xu
Xingzhi Sun
Oluwadamilola Fasina
Maximilian Nickel
Michael Perlmutter
Neural FIM: Bridging Statistical Manifolds and Generative Modeling through Fisher Geometry
Yanlei Zhang
Edward De Brouwer
Danqi Liao
Oluwadamilola Fasina
Ricky T. Q. Chen
Maximilian Nickel
Ian Adelstein
While data diffusion-based embeddings are widely used in unsupervised learning to reveal the intrinsic geometry of data, they are fundamentally constrained by their discrete nature and inability to generalize beyond training points. This limitation ob
RNAGenScape: Property-guided Optimization and Interpolation of mRNA Sequences with Manifold Langevin Dynamics
Danqi Liao
Chen Liu
Xingzhi Sun
Di'e Tang
Haochen Wang
Scott E. Youlten
Srikar Krishna Gopinath
Haejeong Lee
Ethan C. Strayer
Antonio J. Giraldez