Publications

Chromatin landscape and enhancer-gene interaction differences between three cardiac cell types
Chukwuemeka George Anene-Nzelu
Yan Zhu
Jean‐Christophe Grenier
Raphaël Poujol
Svenja Koslowski
Olivier Tastet
Chang Jie Mick Lee
Matthew Ackers‐Johnson
Roger Foo
Genome-wide association studies (GWAS) have identified numerous single nucleotide polymorphisms (SNPs) associated with specific traits and diseases; however, uncovering the true disease-relevant SNPs remains challenging. One limitation for prioritizing true disease-relevant SNPs from GWAS is that most of the identified SNPs are non-coding, making it difficult to unravel their mechanism of action. Nevertheless, mapping non-coding SNPs to enhancers is a validated approach to link SNPs to their target genes through the analysis of enhancer-gene interactions (EGIs) and thus provide insight into their mechanism of action. While previous studies linking cardiac disease-relevant SNPs to enhancers and their target genes have focused on the principal cardiac cell type, cardiomyocytes (CMs), the analysis of other non-CM cell types has been largely ignored and has only gained attention recently. We hypothesize that characterizing cell-type-specific EGIs for these non-CMs, namely cardiac fibroblasts (CFs), endothelial cells (ECs), and smooth muscle cells (SMCs), followed by mapping cardiac-disease-associated non-coding SNPs to those enhancers, will identify novel disease-relevant genes and provide insights for future mechanistic research. To identify the landscape of cell-type-specific EGIs in these cardiac cells, we have employed the Activity-by-Contact (ABC) model. It integrates assay for transposase-accessible chromatin sequencing (ATAC-seq), H3K27ac chromatin immunoprecipitation with sequencing (ChIP-seq), and high-throughput chromosome conformation capture with H3K27ac immunoprecipitation (H3K27ac HiChIP) data to identify EGIs. We have identified the landscape of cell-type-specific EGIs in these cardiac cells. Furthermore, a higher similarity of the chromatin accessibility profile (ATAC-seq) was observed between CFs and SMCs than between CFs and ECs or between SMCs and ECs. Finally, overlapping identified EGIs with cardiac-disease-associated non-coding variants has allowed the identification of a QT-interval-associated SNP that is mapped to the enhancer region of an EC-specific EGI.
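The abstract above names the Activity-by-Contact (ABC) model without spelling out the score. As a rough illustration only, the sketch below follows the commonly published form of the ABC score: an element's activity (geometric mean of accessibility and H3K27ac signal) times its contact frequency with the gene, normalized over all candidate elements for that gene. The function names, dictionary keys, and input values are hypothetical and are not taken from the paper's pipeline.

```python
from math import sqrt


def activity(atac_signal, h3k27ac_signal):
    """Activity of one element: geometric mean of ATAC-seq and H3K27ac signal."""
    return sqrt(atac_signal * h3k27ac_signal)


def abc_scores(elements):
    """Compute ABC scores for all candidate elements of a single gene.

    `elements` is a list of dicts with hypothetical keys:
    'name', 'atac', 'h3k27ac', and 'contact' (element-gene contact frequency,
    e.g. derived from HiChIP). Returns a dict mapping element name to score.
    """
    # Activity x Contact for each element.
    weighted = {
        e["name"]: activity(e["atac"], e["h3k27ac"]) * e["contact"]
        for e in elements
    }
    total = sum(weighted.values())
    if total == 0:
        # No signal at any element: every score is zero.
        return {name: 0.0 for name in weighted}
    # Normalize so the scores for one gene sum to 1.
    return {name: value / total for name, value in weighted.items()}


# Illustrative, made-up inputs for two candidate enhancers of one gene.
example = [
    {"name": "enh1", "atac": 4.0, "h3k27ac": 9.0, "contact": 0.5},
    {"name": "enh2", "atac": 1.0, "h3k27ac": 1.0, "contact": 1.0},
]
scores = abc_scores(example)
```

In practice an element-gene pair is usually called an EGI when its score exceeds a chosen threshold; the threshold used in this study is not stated in the abstract.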
HEIST: A Graph Foundation Model for Spatial Transcriptomics and Proteomics Data
Hiren Madhu
João Felipe Rocha
Tinglin Huang
Rex Ying
Neither Valid Nor Reliable? Investigating the Use of LLMs as Judges
Mohammed Haddou
Jackie CK Cheung
Rigor in AI: Doing Rigorous AI Work Requires a Broader, Responsible AI-Informed Conception of Rigor
A.R. Olteanu
Agathe Balayn
Angelina Wang
Flavio Calmon
Margaret Mitchell
Michael Ekstrand
Reuben Binns
Solon Barocas
In AI research and practice, rigor remains largely understood in terms of methodological rigor -- such as whether mathematical, statistical, or computational methods are correctly applied. We argue that this narrow conception of rigor has contributed to the concerns raised by the responsible AI community, including overblown claims about AI capabilities. Our position is that a broader conception of what rigorous AI research and practice should entail is needed. We believe such a conception -- in addition to a more expansive understanding of (1) methodological rigor -- should include aspects related to (2) what background knowledge informs what to work on (epistemic rigor); (3) how disciplinary, community, or personal norms, standards, or beliefs influence the work (normative rigor); (4) how clearly articulated the theoretical constructs under use are (conceptual rigor); (5) what is reported and how (reporting rigor); and (6) how well-supported the inferences from existing evidence are (interpretative rigor). In doing so, we also aim to provide useful language and a framework for much-needed dialogue about the AI community's work by researchers, policymakers, journalists, and other stakeholders.
Benchmarking Machine Learning Potentials for Crystal Structure Relaxation
High-throughput materials discovery workflows require rapid and accurate relaxation of crystal structures to identify thermodynamically stable phases among thousands to millions of candidate structures. Yet current machine learning interatomic potential (MLIP) benchmarks focus predominantly on energy prediction rather than structure relaxation, creating a critical evaluation gap for models designed to accelerate optimization. Additionally, these benchmarks are trained on datasets consisting mainly of known stable or near-stable materials, thus failing to capture the challenges of unexplored chemical spaces. We address these limitations by introducing a benchmark that evaluates state-of-the-art MLIPs and a one-shot relaxation model on structure relaxation with crystals generated via a reinforcement learning pipeline. We compare energy lowering and average maximum force computed via DFT, as well as relaxation runtime. We also contrast direct force-prediction strategies against conservative energy-differentiation approaches to determine which paradigm delivers superior relaxation performance. Our results indicate that there is a clear disconnect between MLIP energy prediction and force convergence in relaxation, challenging current benchmarking approaches.
Consistent Synthetic Sequences Unlock Structural Diversity in Fully Atomistic De Novo Protein Design
Danny Reidenbach
Zhonglin Cao
Kieran Didi
Tomas Geffner
Guoqing Zhou
Christian Dallago
Arash Vahdat
Emine Kucukbenli
Karsten Kreis
High-quality training datasets are crucial for the development of effective protein design models, but existing synthetic datasets often include unfavorable sequence-structure pairs, impairing generative model performance. We leverage ProteinMPNN, whose sequences are experimentally favorable as well as amenable to folding, together with structure prediction models to align high-quality synthetic structures with recoverable synthetic sequences. In that way, we create a new dataset designed specifically for training expressive, fully atomistic protein generators. By retraining La-Proteína, which models discrete residue type and side chain structure in a continuous latent space, on this dataset, we achieve new state-of-the-art results, with improvements of +54% in structural diversity and +27% in co-designability. To validate the broad utility of our approach, we further introduce Proteína-Atomística, a unified flow-based framework that jointly learns the distribution of protein backbone structure, discrete sequences, and atomistic side chains without latent variables. We again find that training on our new sequence-structure data dramatically boosts benchmark performance, improving Proteína-Atomística’s structural diversity by +73% and co-designability by +5%. Our work highlights the critical importance of aligned sequence-structure data for training high-performance de novo protein design models. All data will be publicly released.
Localized-Attention-Guided Concept Erasure for Text-to-Image Diffusion Models
Source-free cross-modality medical image synthesis with diffusion priors
Jia Chen
Kai Yang
Xinrong Hu
Stochastic Chameleons: Irrelevant Context Hallucinations Reveal Class-Based (Mis)Generalization in LLMs
Meng Cao
Marc-Antoine Rondeau
Jackie CK Cheung
The widespread success of LLMs on NLP benchmarks has been accompanied by concerns that LLMs function primarily as stochastic parrots that reproduce texts similar to what they saw during pre-training, often erroneously. But what is the nature of their errors, and do these errors exhibit any regularities? In this work, we examine irrelevant context hallucinations, in which models integrate misleading contextual cues into their predictions. Through behavioral analysis, we show that these errors result from a structured yet flawed mechanism that we term _class-based (mis)generalization_, in which models combine abstract class cues with features extracted from the query or context to derive answers. Furthermore, mechanistic interpretability experiments on Llama-3, Mistral, and Pythia across 39 factual recall relation types reveal that this behavior is reflected in the model's internal computations: (i) abstract class representations are constructed in lower layers before being refined into specific answers in higher layers, (ii) feature selection is governed by two competing circuits (one prioritizing direct query-based reasoning, the other incorporating contextual cues) whose relative influences determine the final output. Our findings provide a more nuanced perspective on the stochastic parrot argument: through form-based training, LLMs can exhibit generalization leveraging abstractions, albeit in unreliable ways based on contextual cues, which we term _stochastic chameleons_.
Beyond Naive Prompting: Strategies for Improved Zero-shot Context-aided Forecasting with LLMs
Andrew Robert Williams
Vincent Zhihao Zheng
Étienne Marcotte
Valentina Zantedeschi
Forecasting in real-world settings requires models to integrate not only historical data but also relevant contextual information, often available in textual form. While recent work has shown that large language models (LLMs) can be effective context-aided forecasters via naïve direct prompting, their full potential remains underexplored. We address this gap with 4 strategies, providing new insights into the zero-shot capabilities of LLMs in this setting. ReDP improves interpretability by eliciting explicit reasoning traces, allowing us to assess the model's reasoning over the context independently from its forecast accuracy. CorDP leverages LLMs solely to refine existing forecasts with context, enhancing their applicability in real-world forecasting pipelines. IC-DP proposes embedding historical examples of context-aided forecasting tasks in the prompt, substantially improving accuracy even for the largest models. Finally, RouteDP optimizes resource efficiency by using LLMs to estimate task difficulty, and routing the most challenging tasks to larger models. Evaluated on different kinds of context-aided forecasting tasks from the CiK benchmark, our strategies demonstrate distinct benefits over naïve prompting across LLMs of different sizes and families. These results open the door to further simple yet effective improvements in LLM-based context-aided forecasting.
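The routing idea described for RouteDP (estimate task difficulty, then send the hardest tasks to a larger model) can be sketched as below. This is only an illustration of difficulty-based routing under assumed inputs, not the paper's implementation: the function name, the `large_model_fraction` budget parameter, and the difficulty scores are all hypothetical, and in the paper the difficulty estimates come from an LLM rather than being supplied by hand.

```python
def route_tasks(difficulty, large_model_fraction):
    """Assign each task to a 'large' or 'small' model by estimated difficulty.

    `difficulty` maps a task id to an estimated difficulty (higher = harder).
    `large_model_fraction` is the share of tasks the larger model may handle.
    Both names are hypothetical and used for illustration only.
    """
    if not 0.0 <= large_model_fraction <= 1.0:
        raise ValueError("large_model_fraction must be between 0 and 1")
    # Number of tasks the budget allows on the larger model (rounded down).
    n_large = int(len(difficulty) * large_model_fraction)
    # Rank tasks from hardest to easiest and pick the hardest ones.
    ranked = sorted(difficulty, key=difficulty.get, reverse=True)
    hardest = set(ranked[:n_large])
    return {
        task: ("large" if task in hardest else "small")
        for task in difficulty
    }


# Made-up difficulty estimates for four forecasting tasks.
estimates = {"task_a": 0.9, "task_b": 0.1, "task_c": 0.5, "task_d": 0.7}
assignment = route_tasks(estimates, large_model_fraction=0.5)
```

With half of the budget given to the larger model, the two tasks with the highest estimated difficulty are routed to it and the rest stay on the smaller model.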
Conscious Data Contribution via Community-Driven Chain-of-Thought Distillation
Rushabh Solanki
Elliot Creager
Ulrich Matchi Aïvodji
Context-Aware World Models for Task-Agnostic Control
Busra Tugce Gurbuz
Christopher C. Pack
Eilif Benjamin Muller