Publications

Guided-topic modelling of single-cell transcriptomes enables sub-cell-type and disease-subtype deconvolution of bulk transcriptomes
Lakshmipuram Seshadri Swapna
Michael Huang
Cell-type composition is an important indicator of health. We present Guided Topic Model for deconvolution (GTM-decon) to automatically infe… (see more)r cell-type-specific gene topic distributions from single-cell RNA-seq data for deconvolving bulk transcriptomes. GTM-decon performs competitively on deconvolving simulated and real bulk data compared with the state-of-the-art methods. Moreover, as demonstrated in deconvolving disease transcriptomes, GTM-decon can infer multiple cell-type-specific gene topic distributions per cell type, which captures sub-cell-type variations. GTM-decon can also use phenotype labels from single-cell or bulk data as a guide to infer phenotype-specific gene distributions. In a nested-guided design, GTM-decon identified cell-type-specific differentially expressed genes from bulk breast cancer transcriptomes.
Hidden Symmetries of ReLU Networks
J. Grigsby
Elisenda Grigsby
Kathryn Lindsey
High-Probability Bounds for Stochastic Optimization and Variational Inequalities: the Case of Unbounded Variance
Abdurakhmon Sadiev
Marina Danilova
Eduard Gorbunov
Samuel Horváth
Pavel Dvurechensky
Alexander Gasnikov
Peter Richtárik
During the recent years the interest of optimization and machine learning communities in high-probability convergence of stochastic optimiza… (see more)tion methods has been growing. One of the main reasons for this is that high-probability complexity bounds are more accurate and less studied than in-expectation ones. However, SOTA high-probability non-asymptotic convergence results are derived under strong assumptions such as boundedness of the gradient noise variance or of the objective’s gradient itself. In this paper, we propose several algorithms with high-probability convergence results under less restrictive assumptions. In particular, we derive new high-probability convergence results under the assumption that the gradient/operator noise has bounded central
Maximal Initial Learning Rates in Deep ReLU Networks
Gaurav Iyer
Boris Hanin
Mechanistic Mode Connectivity
Ekdeep Singh Lubana
Eric J Bigelow
Robert P. Dick
Hidenori Tanaka
Neural FIM for learning Fisher information metrics from point cloud data
Oluwadamilola Fasina
Guillaume Huguet
Alexander Tong
Yanlei Zhang
Maximilian Nickel
Ian Adelstein
Smita Krishnaswamy
Although data diffusion embeddings are ubiquitous in unsupervised learning and have proven to be a viable technique for uncovering the under… (see more)lying intrinsic geometry of data, diffusion embeddings are inherently limited due to their discrete nature. To this end, we propose neural FIM, a method for computing the Fisher information metric (FIM) from point cloud data - allowing for a continuous manifold model for the data. Neural FIM creates an extensible metric space from discrete point cloud data such that information from the metric can inform us of manifold characteristics such as volume and geodesics. We demonstrate Neural FIM’s utility in selecting parameters for the PHATE visualization method as well as its ability to obtain information pertaining to local volume illuminating branching points and cluster centers embeddings of a toy dataset and two single-cell datasets of IPSC reprogramming and PBMCs (immune cells).
PAC-Bayesian Generalization Bounds for Adversarial Generative Models
Sokhna Diarra Mbacke
Florence Clerc
Privacy-Aware Compression for Federated Learning Through Numerical Mechanism Design
Chuan Guo
Kamalika Chaudhuri
Pierre Stock
ProtST: Multi-Modality Learning of Protein Sequences and Biomedical Texts
Minghao Xu
Xinyu Yuan
Santiago Miret
Current protein language models (PLMs) learn protein representations mainly based on their sequences, thereby well capturing co-evolutionary… (see more) information, but they are unable to explicitly acquire protein functions, which is the end goal of protein representation learning. Fortunately, for many proteins, their textual property descriptions are available, where their various functions are also described. Motivated by this fact, we first build the ProtDescribe dataset to augment protein sequences with text descriptions of their functions and other important properties. Based on this dataset, we propose the ProtST framework to enhance Protein Sequence pre-training and understanding by biomedical Texts. During pre-training, we design three types of tasks, i.e., unimodal mask prediction, multimodal representation alignment and multimodal mask prediction, to enhance a PLM with protein property information with different granularities and, at the same time, preserve the PLM’s original representation power. On downstream tasks, ProtST enables both supervised learning and zero-shot prediction. We verify the superiority of ProtST-induced PLMs over previous ones on diverse representation learning benchmarks. Under the zero-shot setting, we show the effectiveness of ProtST on zero-shot protein classification, and ProtST also enables functional protein retrieval from a large-scale database without any function annotation.
Regions of Reliability in the Evaluation of Multivariate Probabilistic Forecasts
Étienne Marcotte
Valentina Zantedeschi
Multivariate probabilistic time series forecasts are commonly evaluated via proper scoring rules, i.e., functions that are minimal in expect… (see more)ation for the ground-truth distribution. However, this property is not sufficient to guarantee good discrimination in the non-asymptotic regime. In this paper, we provide the first systematic finite-sample study of proper scoring rules for time-series forecasting evaluation. Through a power analysis, we identify the"region of reliability"of a scoring rule, i.e., the set of practical conditions where it can be relied on to identify forecasting errors. We carry out our analysis on a comprehensive synthetic benchmark, specifically designed to test several key discrepancies between ground-truth and forecast distributions, and we gauge the generalizability of our findings to real-world tasks with an application to an electricity production problem. Our results reveal critical shortcomings in the evaluation of multivariate probabilistic forecasts as commonly performed in the literature.
Robust Perception through Equivariance
Lingyu Zhang
Abhishek Vaibhav Joshi
Junfeng Yang
Hao Wang
Carl Vondrick
R-U-SURE? Uncertainty-Aware Code Suggestions By Maximizing Utility Across Random User Intents
Daniel D. Johnson
Danny Tarlow
Christian Walder