Elliot Layne

Sparsity regularization via tree-structured environments for disentangled representations

Jason Hartford

Many causal systems such as biological processes in cells can only be observed indirectly via measurements, such as gene expression. Causal representation learning, the task of correctly mapping low-level observations to latent causal variables, could advance scientific understanding by enabling inference of latent variables such as pathway activation. In this paper, we develop methods for inferring latent variables from multiple related datasets (environments) and tasks. As a running example, we consider the task of predicting a phenotype from gene expression, where we often collect data from multiple cell types or organisms that are related in known ways. The key insight is that the mapping from latent variables driven by gene expression to the phenotype of interest changes sparsely across closely related environments. To model sparse changes, we introduce Tree-Based Regularization (TBR), an objective that minimizes both prediction error and regularizes closely related environments to learn similar predictors. We prove that under assumptions about the degree of sparse changes, TBR identifies the true latent variables up to some simple transformations. We evaluate the theory empirically with both simulations and ground-truth gene expression data. We find that TBR recovers the latent causal variables better than related methods across these settings, even under settings that violate some assumptions of the theory.

2025-07-26

TMLR (accepted)

openreview.net

Sparsity regularization via tree-structured environments for disentangled representations

Jason Hartford

2025-01-01

Trans. Mach. Learn. Res. (published)

openreview.net

Sparsity regularization via tree-structured environments for disentangled representations

Elliot Layne

Dhanya Sridhar

Jason Hartford

Mathieu Blanchette

2024-05-30

ArXiv (preprint)

arxiv.org

Multi-ancestry polygenic risk scores using phylogenetic regularization

2024-02-17

bioRxiv (preprint)

doi.org

PhyloGFN: Phylogenetic inference with generative flow networks

Ming Yang Zhou

Moksh J. Jain

2024-01-16

ICLR.cc/2024/Conference (poster)

doi.org

openreview.net

PhyloGFN: Phylogenetic inference with generative flow networks

Ming Yang Zhou

Moksh J. Jain

Phylogenetics is a branch of computational biology that studies the evolutionary relationships among biological entities. Its long history and numerous applications notwithstanding, inference of phylogenetic trees from sequence data remains challenging: the high complexity of tree space poses a significant obstacle for the current combinatorial and probabilistic techniques. In this paper, we adopt the framework of generative flow networks (GFlowNets) to tackle two core problems in phylogenetics: parsimony-based and Bayesian phylogenetic inference. Because GFlowNets are well-suited for sampling complex combinatorial structures, they are a natural choice for exploring and sampling from the multimodal posterior distribution over tree topologies and evolutionary distances. We demonstrate that our amortized posterior sampler, PhyloGFN, produces diverse and high-quality evolutionary hypotheses on real benchmark datasets. PhyloGFN is competitive with prior works in marginal likelihood estimation and achieves a closer fit to the target distribution than state-of-the-art variational inference methods. Our code is available at https://github.com/zmy1116/phylogfn.

2023-10-12

ArXiv (preprint)

doi.org

arxiv.org

Leveraging Structure Between Environments: Phylogenetic Regularization Incentivizes Disentangled Representations

Jason Hartford

Recently, learning invariant predictors across varying environments has been shown to improve the generalization of supervised learning methods. This line of investigation holds great potential for application to biological problem settings, where data is often naturally heterogeneous. Biological samples often originate from different distributions, or environments. However, in biological contexts, the standard "invariant prediction" setting may not completely fit: the optimal predictor may in fact vary across biological environments. There also exists strong domain knowledge about the relationships between environments, such as the evolutionary history of a set of species, or the differentiation process of cell types. Most work on generic invariant predictors has not assumed the existence of structured relationships between environments. However, this prior knowledge about environments themselves has already been shown to improve prediction through a particular form of regularization applied when learning a set of predictors. In this work, we empirically evaluate whether a regularization strategy that exploits environment-based prior information can be used to learn representations that better disentangle causal factors that generate observed data. We find evidence that these methods do in fact improve the disentanglement of latent embeddings. We also show a setting where these methods can leverage phylogenetic information to estimate the number of latent causal features.

2022-07-09

auai.org/UAI/2022/Workshop/CRL (poster)

doi.org

openreview.net
