Crystal Design Amidst Noisy DFT Signals: A Reinforcement Learning Approach
Prashant Govindarajan
Mathieu Reymond
Santiago Miret
Mariano Phielipp
ImmunoStruct: Integration of protein sequence, structure, and biochemical properties for immunogenicity prediction and interpretation
Kevin Bijan Givechian
João Felipe Rocha
Edward Yang
Chen Liu
Kerrie Greene
Rex Ying
Etienne Caron
Akiko Iwasaki
Epitope-based vaccines are promising therapeutic modalities for infectious diseases and cancer, but identifying immunogenic epitopes is challenging. The vast majority of prediction methods are sequence-based and do not incorporate wide-scale structure data and biochemical properties across each peptide-MHC (pMHC) complex. We present ImmunoStruct, a deep-learning model that integrates sequence, structural, and biochemical information to predict multi-allele class-I pMHC immunogenicity. By leveraging a multimodal dataset of ~27,000 peptide-MHC complexes that we generated with AlphaFold, we demonstrate that ImmunoStruct improves immunogenicity prediction performance and interpretability beyond existing methods, across infectious disease epitopes and cancer neoepitopes. We further show strong alignment with in vitro assay results for a set of SARS-CoV-2 epitopes. This work also presents a new architecture that incorporates equivariant graph processing and multimodal data integration for this long-standing task in immunotherapy.
Enhancing Neural Network Interpretability with Feature-Aligned Sparse Autoencoders
Luke Marks
Alasdair Paren
Fazl Barez
Sparse Autoencoders (SAEs) have shown promise in improving the interpretability of neural network activations, but can learn features that are not features of the input, limiting their effectiveness. We propose Mutual Feature Regularization (MFR), a regularization technique for improving feature learning by encouraging SAEs trained in parallel to learn similar features. We motivate MFR by showing that features learned by multiple SAEs are more likely to correlate with features of the input. By training on synthetic data with known features of the input, we show that MFR can help SAEs learn those features, as we can directly compare the features learned by the SAE with the input features for the synthetic data. We then scale MFR to SAEs that are trained to denoise electroencephalography (EEG) data and SAEs that are trained to reconstruct GPT-2 Small activations. We show that MFR can improve the reconstruction loss of SAEs by up to 21.21% on GPT-2 Small, and 6.67% on EEG data. Our results suggest that the similarity between features learned by different SAEs can be leveraged to improve SAE training, thereby enhancing performance and the usefulness of SAEs for model interpretability.
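The idea of encouraging SAEs trained in parallel to learn similar features can be illustrated with a small sketch. This is not the paper's formulation: it assumes, purely for illustration, that each SAE's features are the rows of its decoder matrix and that the penalty is one minus the mean best-match cosine similarity between every pair of SAEs. The function name `mutual_feature_penalty` and the array shapes are hypothetical.

```python
import numpy as np


def mutual_feature_penalty(decoders, eps=1e-12):
    """Illustrative penalty that is small when parallel SAEs share features.

    decoders: list of arrays, each of shape (n_features, d_model), one per SAE.
    Returns 1 - mean best-match cosine similarity over all SAE pairs
    (an assumed form, not taken from the paper).
    """
    if len(decoders) < 2:
        raise ValueError("need at least two SAEs to compare features")

    # Normalise each feature direction to (approximately) unit length.
    unit = [w / (np.linalg.norm(w, axis=1, keepdims=True) + eps) for w in decoders]

    pair_scores = []
    for i in range(len(unit)):
        for j in range(i + 1, len(unit)):
            cos = unit[i] @ unit[j].T  # shape (n_features_i, n_features_j)
            # Match each feature to its closest counterpart, in both directions,
            # so the score does not depend on feature ordering.
            score = 0.5 * (cos.max(axis=1).mean() + cos.max(axis=0).mean())
            pair_scores.append(score)

    return 1.0 - float(np.mean(pair_scores))


rng = np.random.default_rng(0)
sae_a = rng.normal(size=(8, 16))
sae_b = sae_a[rng.permutation(8)]   # same features, different order
sae_c = rng.normal(size=(8, 16))    # unrelated features

same = mutual_feature_penalty([sae_a, sae_b])
different = mutual_feature_penalty([sae_a, sae_c])
```

In a training loop, a term of this kind would be added to each SAE's reconstruction and sparsity losses with some weight; the choice of weight and the exact similarity measure are left open here.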
MolPhenix: A Multimodal Foundation Model for PhenoMolecular Retrieval
Philip Fradkin
Puria Azadi Moghadam
Karush Suri
Frederik Wenkel
Maciej Sypetkowski
Predicting molecular impact on cellular function is a core challenge in therapeutic design. Phenomic experiments, designed to capture cellular morphology, utilize microscopy-based techniques and demonstrate a high-throughput solution for uncovering molecular impact on the cell. In this work, we learn a joint latent space between molecular structures and microscopy phenomic experiments, aligning paired samples with contrastive learning. Specifically, we study the problem of Contrastive PhenoMolecular Retrieval, which consists of zero-shot molecular structure identification conditioned on phenomic experiments. We assess challenges in multimodal learning of phenomics and molecular modalities such as experimental batch effect, inactive molecule perturbations, and encoding perturbation concentration. We demonstrate improved multimodal learner retrieval through (1) a unimodal pre-trained phenomics model, (2) a novel inter-sample similarity-aware loss, and (3) models conditioned on a representation of molecular concentration. Following this recipe, we propose MolPhenix, a molecular phenomics model. MolPhenix leverages a pre-trained phenomics model to demonstrate significant performance gains across perturbation concentrations, molecular scaffolds, and activity thresholds. In particular, we demonstrate an 8.1
Trained Without My Consent: Detecting Code Inclusion In Language Models Trained on Code
Vahid Majdinasab
Amin Nikanjam
Code auditing ensures that the developed code adheres to standards, regulations, and copyright protection by verifying that it does not contain code from protected sources. The recent advent of Large Language Models (LLMs) as coding assistants in the software development process poses new challenges for code auditing. The datasets for training these models are mainly collected from publicly available sources. This raises the issue of intellectual property infringement, as developers' code is already included in these datasets. Therefore, auditing code developed using LLMs is challenging, as it is difficult to reliably assert whether an LLM used during development has been trained on specific copyrighted code, given that we do not have access to the training datasets of these models. Given the non-disclosure of the training datasets, traditional approaches such as code clone detection are insufficient for asserting copyright infringement. To address this challenge, we propose a new approach, TraWiC, a model-agnostic and interpretable method based on membership inference for detecting code inclusion in an LLM's training dataset. We extract syntactic and semantic identifiers unique to each program to train a classifier for detecting code inclusion. In our experiments, we observe that TraWiC is capable of detecting 83.87% of the code that was used to train an LLM. In comparison, the prevalent clone detection tool NiCad is only capable of detecting 47.64%. In addition to its remarkable performance, TraWiC has low resource overhead in contrast to the pair-wise clone detection conducted during the auditing process of tools like the CodeWhisperer reference tracker, across thousands of code snippets.
AI-EDI-SPACE: A Co-designed Dataset for Evaluating the Quality of Public Spaces
S. Gowaikar
Hugo Berard
Rashid A. Mushkani
Emmanuel Beaudry Marchand
Toumadher Ammar
Advancements in AI heavily rely on large-scale datasets meticulously curated and annotated for training. However, concerns persist regarding the transparency and context of data collection methodologies, especially when sourced through crowdsourcing platforms. Crowdsourcing often employs low-wage workers with poor working conditions and lacks consideration for the representativeness of annotators, leading to algorithms that fail to represent diverse views and perpetuate biases against certain groups. To address these limitations, we propose a methodology involving a co-design model that actively engages stakeholders at key stages, integrating principles of Equity, Diversity, and Inclusion (EDI) to ensure diverse viewpoints. We apply this methodology to develop a dataset and AI model for evaluating public space quality using street view images, demonstrating its effectiveness in capturing diverse perspectives and fostering higher-quality data.
Association Between Circulating Vitamin K Levels, Gut Microbiome, and Type 1 Diabetes: A Mendelian Randomization Study
Samuel De La Barrera
Benjamin De La Barrera
Isabel Gamache
Despoina Manousaki
Community-based reconstruction and simulation of a full-scale model of the rat hippocampus CA1 region
Armando Romani
Alberto Antonietti
Davide Bella
Julian Budd
Elisabetta Giacalone
Kerem Kurban
Sára Sáray
Marwan Abdellah
Alexis Arnaudon
Elvis Boci
Cristina Colangelo
Jean-Denis Courcol
Thomas Delemontex
András Ecker
Joanne Falck
Cyrille Favreau
Michael Gevaert
Juan B. Hernando
Joni Herttuainen
Genrich Ivaska
Lida Kanari
Anna-Kristin Kaufmann
James King
Pramod Kumbhar
Sigrun Lange
Huanxiang Lu
Carmen Alina Lupascu
Rosanna Migliore
Fabien Petitjean
Judit Planas
Pranav Rai
Srikanth Ramaswamy
Michael W. Reimann
Juan Luis Riquelme
Nadir Román Guerrero
Ying Shi
Vishal Sood
Mohameth François Sy
Werner Van Geit
Liesbeth Vanherpe
Tamás F. Freund
Audrey Mercer
Felix Schürmann
Alex M. Thomson
Michele Migliore
Szabolcs Káli
Henry Markram
The CA1 region of the hippocampus is one of the most studied regions of the rodent brain, thought to play an important role in cognitive functions such as memory and spatial navigation. Despite a wealth of experimental data on its structure and function, it has been challenging to integrate information obtained from diverse experimental approaches. To address this challenge, we present a community-based, full-scale in silico model of the rat CA1 that integrates a broad range of experimental data, from synapse to network, including the reconstruction of its principal afferents, the Schaffer collaterals, and a model of the effects that acetylcholine has on the system. We tested and validated each model component and the final network model, and made input data, assumptions, and strategies explicit and transparent. The unique flexibility of the model allows scientists to potentially address a range of scientific questions. In this article, we describe the methods used to set up simulations to reproduce in vitro and in vivo experiments. Among several applications in the article, we focus on theta rhythm, a prominent hippocampal oscillation associated with various behavioral correlates, and use our computer model to reproduce experimental findings. Finally, we make data, code, and model available through the hippocampushub.eu portal, which also provides an extensive set of analyses of the model and a user-friendly interface to facilitate adoption and usage. This community-based model represents a valuable tool for integrating diverse experimental data and provides a foundation for further research into the complex workings of the hippocampal CA1 region.