Publications

Guided-topic modelling of single-cell transcriptomes enables sub-cell-type and disease-subtype deconvolution of bulk transcriptomes

Lakshmipuram Seshadri Swapna

Michael Huang

Yue Li

Cell-type composition is an important indicator of health. We present Guided Topic Model for deconvolution (GTM-decon) to automatically infer cell-type-specific gene topic distributions from single-cell RNA-seq data for deconvolving bulk transcriptomes. GTM-decon performs competitively on deconvolving simulated and real bulk data compared with the state-of-the-art methods. Moreover, as demonstrated in deconvolving disease transcriptomes, GTM-decon can infer multiple cell-type-specific gene topic distributions per cell type, which captures sub-cell-type variations. GTM-decon can also use phenotype labels from single-cell or bulk data as a guide to infer phenotype-specific gene distributions. In a nested-guided design, GTM-decon identified cell-type-specific differentially expressed genes from bulk breast cancer transcriptomes.

2023-07-03

bioRxiv (preprint)

doi.org

Hidden Symmetries of ReLU Networks

J. Elisenda Grigsby

Kathryn Lindsey

David Rolnick

2023-07-03

Proceedings of the 40th International Conference on Machine Learning (published)

doi.org

openreview.net

High-Probability Bounds for Stochastic Optimization and Variational Inequalities: the Case of Unbounded Variance

Abdurakhmon Sadiev

Marina Danilova

Eduard Gorbunov

Samuel Horváth

Gauthier Gidel

Pavel Dvurechensky

Alexander Gasnikov

Peter Richtárik

During the recent years the interest of optimization and machine learning communities in high-probability convergence of stochastic optimization methods has been growing. One of the main reasons for this is that high-probability complexity bounds are more accurate and less studied than in-expectation ones. However, SOTA high-probability non-asymptotic convergence results are derived under strong assumptions such as boundedness of the gradient noise variance or of the objective’s gradient itself. In this paper, we propose several algorithms with high-probability convergence results under less restrictive assumptions. In particular, we derive new high-probability convergence results under the assumption that the gradient/operator noise has bounded central

2023-07-03

Proceedings of the 40th International Conference on Machine Learning (published)

doi.org

openreview.net

Maximal Initial Learning Rates in Deep ReLU Networks

Gaurav Iyer

Boris Hanin

David Rolnick

2023-07-03

Proceedings of the 40th International Conference on Machine Learning (published)

doi.org

openreview.net

Mechanistic Mode Connectivity

Ekdeep Singh Lubana

Eric J Bigelow

Robert P. Dick

David Scott Krueger

Hidenori Tanaka

2023-07-03

Proceedings of the 40th International Conference on Machine Learning (published)

doi.org

openreview.net

Neural FIM for learning Fisher information metrics from point cloud data

Oluwadamilola Fasina

Guillaume Huguet

Alexander Tong

Yanlei Zhang

Guy Wolf

Maximilian Nickel

Ian Adelstein

Smita Krishnaswamy

Although data diffusion embeddings are ubiquitous in unsupervised learning and have proven to be a viable technique for uncovering the underlying intrinsic geometry of data, diffusion embeddings are inherently limited due to their discrete nature. To this end, we propose neural FIM, a method for computing the Fisher information metric (FIM) from point cloud data - allowing for a continuous manifold model for the data. Neural FIM creates an extensible metric space from discrete point cloud data such that information from the metric can inform us of manifold characteristics such as volume and geodesics. We demonstrate Neural FIM’s utility in selecting parameters for the PHATE visualization method as well as its ability to obtain information pertaining to local volume illuminating branching points and cluster centers embeddings of a toy dataset and two single-cell datasets of IPSC reprogramming and PBMCs (immune cells).

2023-07-03

Proceedings of the 40th International Conference on Machine Learning (published)

doi.org

openreview.net

PAC-Bayesian Generalization Bounds for Adversarial Generative Models

Sokhna Diarra Mbacke

Florence Clerc

Pascal Germain

2023-07-03

Proceedings of the 40th International Conference on Machine Learning (published)

doi.org

openreview.net

Privacy-Aware Compression for Federated Learning Through Numerical Mechanism Design

Chuan Guo

Kamalika Chaudhuri

Pierre Stock

Michael Rabbat

2023-07-03

Proceedings of the 40th International Conference on Machine Learning (published)

proceedings.mlr.press

openreview.net

ProtST: Multi-Modality Learning of Protein Sequences and Biomedical Texts

Minghao Xu

Xinyu Yuan

Santiago Miret

Jian Tang

Current protein language models (PLMs) learn protein representations mainly based on their sequences, thereby well capturing co-evolutionary information, but they are unable to explicitly acquire protein functions, which is the end goal of protein representation learning. Fortunately, for many proteins, their textual property descriptions are available, where their various functions are also described. Motivated by this fact, we first build the ProtDescribe dataset to augment protein sequences with text descriptions of their functions and other important properties. Based on this dataset, we propose the ProtST framework to enhance Protein Sequence pre-training and understanding by biomedical Texts. During pre-training, we design three types of tasks, i.e., unimodal mask prediction, multimodal representation alignment and multimodal mask prediction, to enhance a PLM with protein property information with different granularities and, at the same time, preserve the PLM’s original representation power. On downstream tasks, ProtST enables both supervised learning and zero-shot prediction. We verify the superiority of ProtST-induced PLMs over previous ones on diverse representation learning benchmarks. Under the zero-shot setting, we show the effectiveness of ProtST on zero-shot protein classification, and ProtST also enables functional protein retrieval from a large-scale database without any function annotation.

2023-07-03

Proceedings of the 40th International Conference on Machine Learning (published)

doi.org

openreview.net

Regions of Reliability in the Evaluation of Multivariate Probabilistic Forecasts

Étienne Marcotte

Valentina Zantedeschi

Alexandre Drouin

Nicolas Chapados

Multivariate probabilistic time series forecasts are commonly evaluated via proper scoring rules, i.e., functions that are minimal in expectation for the ground-truth distribution. However, this property is not sufficient to guarantee good discrimination in the non-asymptotic regime. In this paper, we provide the first systematic finite-sample study of proper scoring rules for time-series forecasting evaluation. Through a power analysis, we identify the "region of reliability" of a scoring rule, i.e., the set of practical conditions where it can be relied on to identify forecasting errors. We carry out our analysis on a comprehensive synthetic benchmark, specifically designed to test several key discrepancies between ground-truth and forecast distributions, and we gauge the generalizability of our findings to real-world tasks with an application to an electricity production problem. Our results reveal critical shortcomings in the evaluation of multivariate probabilistic forecasts as commonly performed in the literature.

2023-07-03

Proceedings of the 40th International Conference on Machine Learning (published)

doi.org

openreview.net

Robust Perception through Equivariance

Chengzhi Mao

Lingyu Zhang

Abhishek Vaibhav Joshi

Junfeng Yang

Hao Wang

Carl Vondrick

2023-07-03