Publications

GFETM: Genome Foundation-based Embedded Topic Model for scATAC-seq Modeling

Yimin Fan

Shi Han

Single-cell Assay for Transposase-Accessible Chromatin with sequencing (scATAC-seq) has emerged as a powerful technique for investigating open chromatin landscapes at single-cell resolution. However, analyzing scATAC-seq data remains challenging due to its sparsity and noise. Genome Foundation Models (GFMs), pre-trained on massive DNA sequences, have proven effective at genome analysis. Given that open chromatin regions (OCRs) harbour salient sequence features, we hypothesize that leveraging GFMs’ sequence embeddings can improve the accuracy and generalizability of scATAC-seq modeling. Here, we introduce the Genome Foundation Embedded Topic Model (GFETM), an interpretable deep learning framework that combines GFMs with the Embedded Topic Model (ETM) for scATAC-seq data analysis. By integrating the DNA sequence embeddings extracted by a GFM from OCRs, GFETM demonstrates superior accuracy and generalizability and captures cell-state-specific TF activity through both zero-shot inference and attention mechanism analysis. Finally, the topic mixtures inferred by GFETM reveal biologically meaningful epigenomic signatures of kidney diabetes.

2023-12-31

Lecture Notes in Computer Science (published)

doi.org

GIST: Generated Inputs Sets Transferability in Deep Learning

Florian Tambon

Foutse Khomh

Giuliano Antoniol

2023-12-31

ACM Trans. Softw. Eng. Methodol. (published)

doi.org

arxiv.org

GrowSpace: A reinforcement learning environment for plant architecture

Mark Lefsrud

2023-12-31

Computers and Electronics in Agriculture (published)

doi.org

Harnessing small projectors and multiple views for efficient vision pretraining

Arna Ghosh

Kumar Krishna Agrawal

Shagun Sodhani

Adam M. Oberman

Blake A. Richards

Recent progress in self-supervised (SSL) visual representation learning has led to the development of several different proposed frameworks that rely on augmentations of images but use different loss functions. However, there are few theoretically grounded principles to guide practice, so practical implementation of each SSL framework requires several heuristics to achieve competitive performance. In this work, we build on recent analytical results to design practical recommendations for competitive and efficient SSL that are grounded in theory. Specifically, recent theory tells us that existing SSL frameworks are minimizing the same idealized loss, which is to learn features that best match the data similarity kernel defined by the augmentations used. We show how this idealized loss can be reformulated to a functionally equivalent loss that is more efficient to compute. We study the implicit bias of using gradient descent to minimize our reformulated loss function and find that using a stronger orthogonalization constraint with a reduced projector dimensionality should yield good representations. Furthermore, the theory tells us that approximating the reformulated loss should be improved by increasing the number of augmentations, and as such using multiple augmentations should lead to improved convergence. We empirically verify our findings on CIFAR, STL and ImageNet datasets, wherein we demonstrate an improved linear readout performance when training a ResNet-backbone using our theoretically grounded recommendations. Remarkably, we also demonstrate that by leveraging these insights, we can reduce the pretraining dataset size by up to 2

2023-12-31

NeurIPS (published)

doi.org

openreview.net

High-Probability Convergence for Composite and Distributed Stochastic Minimization and Variational Inequalities with Heavy-Tailed Noise

Eduard Gorbunov

Abdurakhmon Sadiev

Marina Danilova

Samuel Horváth

Gauthier Gidel

Pavel Dvurechensky

Alexander Gasnikov

Peter Richtárik

2023-12-31

International Conference on Machine Learning (published)

doi.org

proceedings.mlr.press

Hint Marginalization for Improved Reasoning in Large Language Models

Soumyasundar Pal

Didier Chételat

Yingxue Zhang

Mark J. Coates

Large Language Models (LLMs) have exhibited an impressive capability to perform reasoning tasks, especially if they are encouraged to generate a sequence of intermediate steps. Reasoning performance can be improved by suitably combining multiple LLM responses, generated either in parallel in a single query, or via sequential interactions with LLMs throughout the reasoning process. Existing strategies for combination, such as self-consistency and progressive-hint-prompting, make inefficient use of the LLM responses. We present Hint Marginalization, a novel and principled algorithmic framework to enhance the reasoning capabilities of LLMs. Our approach can be viewed as an iterative sampling strategy for forming a Monte Carlo approximation of an underlying distribution of answers, with the goal of identifying the mode, that is, the most likely answer. Empirical evaluation on several benchmark datasets for arithmetic reasoning demonstrates the superiority of the proposed approach.

2023-12-31

arXiv.org (preprint)

doi.org

openreview.net

How Should We Extract Discrete Audio Tokens from Self-Supervised Models?

Jarod Duret

Yusuf Cem Sübakan

Mirco Ravanelli

2023-12-31

INTERSPEECH (published)

doi.org

arxiv.org

Hybrid Simulator-Based Mechanism and Data-Driven for Multidemand Dioxin Emissions Intelligent Prediction in the MSWI Process

Heng Xia

Jian Tang

Wen Yu

JunFei Qiao

2023-12-31

IEEE Transactions on Industrial Electronics (published)

doi.org

HyperFusion: A Hypernetwork Approach to Multimodal Integration of Tabular and Medical Imaging Data for Predictive Modeling

Daniel Duenias

Brennan Nichyporuk

Tal Arbel

Tammy Riklin Raviv

The integration of diverse clinical modalities such as medical imaging and the tabular data extracted from patients' Electronic Health Records (EHRs) is a crucial aspect of modern healthcare. Integrative analysis of multiple sources can provide a comprehensive understanding of the clinical condition of a patient, improving diagnosis and treatment decisions. Deep Neural Networks (DNNs) consistently demonstrate outstanding performance in a wide range of multimodal tasks in the medical domain. However, the complex endeavor of effectively merging medical imaging with clinical, demographic and genetic information represented as numerical tabular data remains a highly active and ongoing research pursuit. We present a novel framework based on hypernetworks to fuse clinical imaging and tabular data by conditioning the image processing on the EHR's values and measurements. This approach aims to leverage the complementary information present in these modalities to enhance the accuracy of various medical applications. We demonstrate the strength and generality of our method on two different brain Magnetic Resonance Imaging (MRI) analysis tasks, namely, brain age prediction conditioned on the subject's sex and multi-class Alzheimer's Disease (AD) classification conditioned on tabular data. We show that our framework outperforms both single-modality models and state-of-the-art MRI tabular data fusion methods. A link to our code can be found at https://github.com/daniel4725/HyperFusion

2023-12-31

arXiv (preprint)

doi.org

arxiv.org

IDEA-DAC: Integrity-Driven Editing for Accountable Decentralized Anonymous Credentials via ZK-JSON

Shuhao Zheng

Zonglun Li

Junliang Luo

Ziyue Xin

Xue Liu

Decentralized Anonymous Credential (DAC) systems are increasingly relevant, especially when enhancing revocation mechanisms in the face of complex traceability challenges. This paper introduces IDEA-DAC, a paradigm shift from the conventional revoke-and-reissue methods, promoting direct and Integrity-Driven Editing (IDE) for Accountable DACs, which results in better integrity accountability, traceability, and system simplicity. We further incorporate an Edit-bound Conformity Check that ensures tailored integrity standards during credential amendments using R1CS-based ZK-SNARKs. Delving deeper, we propose ZK-JSON, a unique R1CS circuit design tailored for IDE over generic JSON documents. This design imposes strictly O(N) rank-1 constraints for variable-length JSON documents of up to N bytes in length, encompassing serialization, encryption, and edit-bound conformity checks. Additionally, our circuits only necessitate a one-time compilation, setup, and smart contract deployment for homogeneous JSON documents up to a specified size. While preserving core DAC features such as selective disclosure, anonymity, and predicate provability, IDEA-DAC achieves precise data modification checks without revealing private content, ensuring only authorized edits are permitted. In summary, IDEA-DAC offers an enhanced methodology for large-scale JSON-formatted credential systems, setting a new standard in decentralized identity management efficiency and precision.

2023-12-31

ACM Web Conference (published)

doi.org

An improved column-generation-based matheuristic for learning classification trees

Krunal Kishor Patel

Guy Desaulniers

Andrea Lodi

2023-12-31

Comput. Oper. Res. (published)

doi.org

arxiv.org

An Improved Neuro-Symbolic Architecture to Fine-Tune Generative AI Systems