Guy Wolf

Biography

Guy Wolf is an associate professor in the Department of Mathematics and Statistics at Université de Montréal.

His research interests lie at the intersection of machine learning, data science and applied mathematics. He is particularly interested in data mining methods that use manifold learning and deep geometric learning, as well as applications for the exploratory analysis of biomedical data.

Wolf’s research focuses on exploratory data analysis and its applications in bioinformatics. His approaches are multidisciplinary and bring together machine learning, signal processing and applied math tools. His recent work has used a combination of diffusion geometries and deep learning to find emergent patterns, dynamics, and structure in big, high-dimensional data (e.g., in single-cell genomics and proteomics).

Current Students

Ria Arora

Master's Research - Université de Montréal

Co-supervisor :

Liam Paull

Adrien Aumon

PhD - Université de Montréal

Semih Cantürk

PhD - Université de Montréal

semihcanturk00@gmail.com

Collaborating Alumni

Enrique Fita Sanmartin

Collaborating Alumni - Université de Montréal

Kameron Harris

Collaborating researcher - Western Washington University (faculty; assistant prof)

Co-supervisor :

PhD - Université de Montréal

Will Hua

Collaborating Alumni - McGill University

Xiaolong Huang

Master's Research - Concordia University

Principal supervisor :

Guillaume Huguet

PhD - Université de Montréal

Paul Janson

PhD - Concordia University

Principal supervisor :

Charles-Etienne Joseph

Master's Research - Université de Montréal

Principal supervisor :

M. Elyes Kanoun

Research Intern - Université de Montréal

Vincent Létourneau

Postdoctorate - Université de Montréal

Myriam Lizotte

PhD - Université de Montréal

Philippe Martin

PhD - Université de Montréal

Co-supervisor :

Paul François

Paria Mehrbod

Master's Research - Concordia University

Principal supervisor :

Lydia Mezrag

PhD - Université de Montréal

Sacha Morin

PhD - Université de Montréal

Co-supervisor :

Postdoctorate - Concordia University

Principal supervisor :

geraldin.nanfack@mila.quebec

Amine Natik

PhD - Université de Montréal

Principal supervisor :

Guillaume Lajoie

Shuang Ni

PhD - Université de Montréal

Albert Orozco Camacho

PhD - Concordia University

Principal supervisor :

Master's Research - Université de Montréal

Matthew Scicluna

PhD - Université de Montréal

Principal supervisor :

Collaborating researcher - Université de Montréal

Co-supervisor :

PhD - Université de Montréal

Research Intern - Western Washington University

Principal supervisor :

Postdoctorate - Université de Montréal

stephanie.zandee@mcgill.ca

Stephanie Zandee

Collaborating researcher - McGill University (assistant professor)

Exploring the COVID-19 Interferon Paradox with Dimensionality Reduction and Clustering

Blog Posts

Graph and representation of working methodology, and graph of data on deaths 60 days after onset of symptoms.

February 19, 2025

Sacha Morin

Elsa Brunet-Ratnasingham

Guy Wolf

Publications

Generalization of deep learning models for hepatic steatosis grading using B-mode ultrasound images

Pedro Vianna

Yijun Qi

Michael Chassé

An Tang

Guy Cloutier

2024-03-01

The Journal of the Acoustical Society of America (published)

Channel-Selective Normalization for Label-Shift Robust Test-Time Adaptation

Pedro Vianna

Muawiz Chaudhary

Paria Mehrbod

An Tang

Guy Cloutier

Michael Eickenberg

Deep neural networks have useful applications in many different tasks, however their performance can be severely affected by changes in the data distribution. For example, in the biomedical field, their performance can be affected by changes in the data (different machines, populations) between training and test datasets. To ensure robustness and generalization to real-world scenarios, test-time adaptation has been recently studied as an approach to adjust models to a new data distribution during inference. Test-time batch normalization is a simple and popular method that achieved compelling performance on domain shift benchmarks. It is implemented by recalculating batch normalization statistics on test batches. Prior work has focused on analysis with test data that has the same label distribution as the training data. However, in many practical applications this technique is vulnerable to label distribution shifts, sometimes producing catastrophic failure. This presents a risk in applying test time adaptation methods in deployment. We propose to tackle this challenge by only selectively adapting channels in a deep network, minimizing drastic adaptation that is sensitive to label shifts. Our selection scheme is based on two principles that we empirically motivate: (1) later layers of networks are more sensitive to label shift (2) individual features can be sensitive to specific classes. We apply the proposed technique to three classification tasks, including CIFAR10-C, Imagenet-C, and diagnosis of fatty liver, where we explore both covariate and label distribution shifts. We find that our method allows to bring the benefits of TTA while significantly reducing the risk of failure common in other methods, while being robust to choice in hyperparameters.

2024-02-07

ArXiv (preprint)
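
The abstract above describes test-time batch normalization as recalculating batch normalization statistics on test batches, and proposes adapting only some channels. A minimal sketch of that idea in plain Python follows. It is an illustration under stated assumptions, not the authors' implementation: the function names, the toy data, and the `adapt_mask` are all hypothetical, and the paper's actual channel-selection scheme is not reproduced here.

```python
from statistics import fmean, pvariance


def batch_statistics(batch):
    """Per-channel mean and population variance over a batch of samples."""
    channels = list(zip(*batch))
    return [fmean(c) for c in channels], [pvariance(c) for c in channels]


def batch_norm(batch, mean, var, eps=1e-5):
    """Normalize every sample channel-wise with the given statistics."""
    return [
        [(x - m) / (v + eps) ** 0.5 for x, m, v in zip(row, mean, var)]
        for row in batch
    ]


def select_statistics(train_stats, test_stats, adapt_mask):
    """Use test-batch statistics only for channels flagged in adapt_mask.

    adapt_mask is a hypothetical per-channel flag list; how to choose it
    is exactly what the paper studies and is NOT implemented here.
    """
    (tr_mean, tr_var), (te_mean, te_var) = train_stats, test_stats
    mean = [te if a else tr for a, tr, te in zip(adapt_mask, tr_mean, te_mean)]
    var = [te if a else tr for a, tr, te in zip(adapt_mask, tr_var, te_var)]
    return mean, var


# Hypothetical running statistics stored at training time (2 channels).
train_stats = ([0.0, 0.0], [1.0, 1.0])

# Hypothetical test batch drawn from a shifted distribution.
test_batch = [[4.0, 10.0], [6.0, 14.0], [5.0, 12.0], [5.0, 12.0]]
test_stats = batch_statistics(test_batch)

# Standard inference: normalize with training statistics.
standard = batch_norm(test_batch, *train_stats)

# Test-time batch normalization: normalize with test-batch statistics.
adapted = batch_norm(test_batch, *test_stats)

# Selective variant: adapt channel 0 only, keep training statistics for channel 1.
mixed_mean, mixed_var = select_statistics(train_stats, test_stats, [True, False])
selective = batch_norm(test_batch, mixed_mean, mixed_var)
```

With full adaptation, each channel of the test batch is re-centered around zero regardless of the shift; the selective variant leaves unflagged channels normalized exactly as they were at training time.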

Effective Protein-Protein Interaction Exploration with PPIretrieval

Chenqing Hua

Connor W. Coley

Doina Precup

Shuangjia Zheng

Protein-protein interactions (PPIs) are crucial in regulating numerous cellular functions, including signal transduction, transportation, and immune defense. As the accuracy of multi-chain protein complex structure prediction improves, the challenge has shifted towards effectively navigating the vast complex universe to identify potential PPIs. Herein, we propose PPIretrieval, the first deep learning-based model for protein-protein interaction exploration, which leverages existing PPI data to effectively search for potential PPIs in an embedding space, capturing rich geometric and chemical information of protein surfaces. When provided with an unseen query protein with its associated binding site, PPIretrieval effectively identifies a potential binding partner along with its corresponding binding site in an embedding space, facilitating the formation of protein-protein complexes.

2024-02-06

ArXiv (preprint)

Gaining Biological Insights through Supervised Data Visualization

Jake S. Rhodes

Adrien Aumon

Sacha Morin

Marc Girard

Catherine Larochelle

Boaz Lahav

Elsa Brunet-Ratnasingham

Amélie Pagliuzza

Lorie Marchitto

Wei Zhang

Adele Cutler

F. Grand'Maison

Anhong Zhou

Andrés Finzi

Nicolas Chomont

Daniel E. Kaufmann

Stephanie Zandee

Alexandre Prat

Kevin R. Moon

Dimensionality reduction-based data visualization is pivotal in comprehending complex biological data. The most common methods, such as PHATE, t-SNE, and UMAP, are unsupervised and therefore reflect the dominant structure in the data, which may be independent of expert-provided labels. Here we introduce a supervised data visualization method called RF-PHATE, which integrates expert knowledge for further exploration of the data. RF-PHATE leverages random forests to capture intricate feature-label relationships. Extracting information from the forest, RF-PHATE generates low-dimensional visualizations that highlight relevant data relationships while disregarding extraneous features. This approach scales to large datasets and applies to classification and regression. We illustrate RF-PHATE’s prowess through three case studies. In a multiple sclerosis study using longitudinal clinical and imaging data, RF-PHATE unveils a sub-group of patients with non-benign relapsing-remitting Multiple Sclerosis, demonstrating its aptitude for time-series data. In the context of Raman spectral data, RF-PHATE effectively showcases the impact of antioxidants on diesel exhaust-exposed lung cells, highlighting its proficiency in noisy environments. Furthermore, RF-PHATE aligns established geometric structures with COVID-19 patient outcomes, enriching interpretability in a hierarchical manner. RF-PHATE bridges expert insights and visualizations, promising knowledge generation. Its adaptability, scalability, and noise tolerance underscore its potential for widespread adoption.

2024-01-21

bioRxiv (preprint)

Towards Foundational Models for Molecular Learning on Large-Scale Multi-Task Datasets

Dominique Beaini

Shenyang Huang

Joao Alex Cunha

Zhiyi Li

Gabriela Moisescu-Pareja

Oleksandr Dymov

Samuel Maddrell-Mander

Callum McLean

Frederik Wenkel

Luis Müller

Jama Hussein Mohamud

Ali Parviz

Michael Craig

Michał Koziarski

Jiarui Lu

Zhaocheng Zhu

Cristian Gabellini

Kerstin Klaser

Josef Dean

Cas Wognum

Maciej Sypetkowski

Guillaume Rabusseau

Reihaneh Rabbany

Jian Tang

Christopher Morris

Ioannis Koutis

Mirco Ravanelli

Prudencio Tossou

Hadrien Mary

Therence Bois

Andrew William Fitzgibbon

Blazej Banaszewski

Chad Martin

Dominic Masters

Recently, pre-trained foundation models have enabled significant advancements in multiple fields. In molecular machine learning, however, where datasets are often hand-curated, and hence typically small, the lack of datasets with labeled features, and codebases to manage those datasets, has hindered the development of foundation models. In this work, we present seven novel datasets categorized by size into three distinct categories: ToyMix, LargeMix and UltraLarge. These datasets push the boundaries in both the scale and the diversity of supervised labels for molecular learning. They cover nearly 100 million molecules and over 3000 sparsely defined tasks, totaling more than 13 billion individual labels of both quantum and biological nature. In comparison, our datasets contain 300 times more data points than the widely used OGB-LSC PCQM4Mv2 dataset, and 13 times more than the quantum-only QM1B dataset. In addition, to support the development of foundational models based on our proposed datasets, we present the Graphium graph machine learning library which simplifies the process of building and training molecular machine learning models for multi-task and multi-level molecular datasets. Finally, we present a range of baseline results as a starting point of multi-task and multi-level training on these datasets. Empirically, we observe that performance on low-resource biological datasets show improvement by also training on large amounts of quantum data. This indicates that there may be potential in multi-task and multi-level training of a foundation model and fine-tuning it to resource-constrained downstream tasks. The Graphium library is publicly available on Github and the dataset links are available in Part 1 and Part 2.

2024-01-16

ICLR.cc/2024/Conference (poster)

openreview.net

Assessing Neural Network Representations During Training Using Noise-Resilient Diffusion Spectral Entropy

Danqi Liao

Chen Liu

Benjamin W Christensen

Alexander Tong

Guillaume Huguet

Maximilian Nickel

Ian Adelstein

Smita Krishnaswamy

Entropy and mutual information in neural networks provide rich information on the learning process, but they have proven difficult to compute reliably in high dimensions. Indeed, in noisy and high-dimensional data, traditional estimates in ambient dimensions approach a fixed entropy and are prohibitively hard to compute. To address these issues, we leverage data geometry to access the underlying manifold and reliably compute these information-theoretic measures. Specifically, we define diffusion spectral entropy (DSE) in neural representations of a dataset as well as diffusion spectral mutual information (DSMI) between different variables representing data. First, we show that they form noise-resistant measures of intrinsic dimensionality and relationship strength in high-dimensional simulated data that outperform classic Shannon entropy, nonparametric estimation, and mutual information neural estimation (MINE). We then study the evolution of representations in classification networks with supervised learning, self-supervision, or overfitting. We observe that (1) DSE of neural representations increases during training; (2) DSMI with the class label increases during generalizable learning but stays stagnant during overfitting; (3) DSMI with the input signal shows differing trends: on MNIST it increases, while on CIFAR-10 and STL-10 it decreases. Finally, we show that DSE can be used to guide better network initialization and that DSMI can be used to predict downstream classification accuracy across 962 models on ImageNet.

2024-01-01

CISS (published)

openreview.net

Enhancing Supervised Visualization through Autoencoder and Random Forest Proximities for Out-of-Sample Extension

Shuang Ni

Adrien Aumon

Kevin R. Moon

Jake S. Rhodes

The value of supervised dimensionality reduction lies in its ability to uncover meaningful connections between data features and labels. Common dimensionality reduction methods embed a set of fixed, latent points, but are not capable of generalizing to an unseen test set. In this paper, we provide an out-of-sample extension method for the random forest-based supervised dimensionality reduction method, RF-PHATE, combining information learned from the random forest model with the function-learning capabilities of autoencoders. Through quantitative assessment of various autoencoder architectures, we identify that networks that reconstruct random forest proximities are more robust for the embedding extension problem. Furthermore, by leveraging proximity-based prototypes, we achieve a 40% reduction in training time without compromising extension quality. Our method does not require label information for out-of-sample points, thus serving as a semi-supervised method, and can achieve consistent quality using only 10% of the training data.

2024-01-01

MLSP (published)