Publications

Towards Foundational Models for Molecular Learning on Large-Scale Multi-Task Datasets

Dominique Beaini

Shenyang Huang

Joao Alex Cunha

Zhiyi Li

Gabriela Moisescu-Pareja

Oleksandr Dymov

Samuel Maddrell-Mander

Callum McLean

Jama Hussein Mohamud

Michael Craig

Cristian Gabellini

Kerstin Klasers

Josef Dean

Cas Wognum

Maciej Sypetkowski

Ioannis Koutis

Hadrien Mary

Therence Bois

Andrew Fitzgibbon

Błażej Banaszewski

Chad Martin

Dominic Masters

Recently, pre-trained foundation models have shown significant advancements in multiple fields. However, the lack of datasets with labeled features and codebases has hindered the development of a supervised foundation model for molecular tasks. Here, we have carefully curated seven datasets specifically tailored for node- and graph-level prediction tasks to facilitate supervised learning on molecules. Moreover, to support the development of multi-task learning on our proposed datasets, we created the Graphium graph machine learning library. Our dataset collection encompasses two distinct categories. Firstly, the TOYMIX category modifies three small existing datasets with additional data for multi-task learning. Secondly, the LARGEMIX category includes four large-scale datasets with 344M graph-level data points and 409M node-level data points from ∼5M unique molecules. Finally, the ultra-large dataset contains 2,210M graph-level data points and 2,031M node-level data points coming from 86M molecules. Hence our datasets represent an order of magnitude increase in data volume compared to other 2D-GNN datasets. In addition, recognizing that molecule-related tasks often span multiple levels, we have designed our library to explicitly support multi-tasking, offering a diverse range of multi-level representations, i.e., representations at the graph, node, edge, and node-pair level. We equipped the library with an extensive collection of models and features to cover different levels of molecule analysis. By combining our curated datasets with this versatile library, we aim to accelerate the development of molecule foundation models. Datasets and code are available at https://github.com/datamol-io/graphium.

2024-01-15

ICLR.cc/2024/Conference (poster)

doi.org

openreview.net

BCG immunization induces CX3CR1hi effector memory T cells to provide cross-protection via IFN-γ-mediated trained immunity.

Kim A. Tran

Erwan Pernet

Mina Sadeghi

Jeffrey Downey

Julia Chronopoulos

Elizabeth Lapshina

Oscar Tsai

Eva Kaufmann

Jun Ding

Maziar Divangahi

2024-01-14

Nature Immunology (published)

doi.org

Self-Supervised Anomaly Detection: A Survey and Outlook

Hadi Hojjati

Thi Kieu Khanh Ho

Narges Armanfard

Anomaly detection (AD) plays a crucial role in various domains, including cybersecurity, finance, and healthcare, by identifying patterns or events that deviate from normal behaviour. In recent years, significant progress has been made in this field due to the remarkable growth of deep learning models. Notably, the advent of self-supervised learning has sparked the development of novel AD algorithms that outperform the existing state-of-the-art approaches by a considerable margin. This paper aims to provide a comprehensive review of the current methodologies in self-supervised anomaly detection. We present technical details of the standard methods and discuss their strengths and drawbacks. We also compare the performance of these models against each other and other state-of-the-art anomaly detection models. Finally, the paper concludes with a discussion of future directions for self-supervised anomaly detection, including the development of more effective and efficient algorithms and the integration of these techniques with other related fields, such as multi-modal learning.

2024-01-14

Neural Networks (unknown)

doi.org

arxiv.org

Computational pathology: A survey review and the way forward

Mahdi S. Hosseini

Babak Ehteshami Bejnordi

Vincent Quoc-Huy Trinh

Danial Hasan

Xingwen Li

Taehyo Kim

Haochen Zhang

Theodore Wu

Kajanan Chinniah

Sina Maghsoudlou

Ryan Zhang

Stephen Yang

Jiadai Zhu

Lyndon Chan

Samir Khaki

Andrei Buin

Fatemeh Chaji

Ala Salehi

Alejandra Zambrano Luna

Bich Ngoc Nguyen

Dimitris Samaras

Konstantinos N. Plataniotis

2024-01-13

Journal of Pathology Informatics (published)

doi.org

arxiv.org

Assessing the quality and value of metabolic chart data for capturing core outcomes for pediatric medium-chain acyl-CoA dehydrogenase (MCAD) deficiency

Ryan Iverson

Monica Taljaard

Michael T. Geraghty

Michael Pugliese

Kylie Tingley

Doug Coyle

Jonathan B. Kronick

Kumanan Wilson

Valerie Austin

Catherine Brunel-Guitton

Daniela Buhas

Nancy J. Butcher

Alicia K. J. Chan

Sarah Dyack

Sharan Goobie

Cheryl Greenberg

Shailly Jain-Ghai

Michal Inbar-Feigenberg

Natalya Karp

Mariya Kozenko

Erica Langley

Matthew Lines

Julian Little

Jennifer MacKenzie

Bruno Maranda

Saadet Mercimek-Andrews

Aizeddin Mhanni

John J. Mitchell

Laura Nagy

Martin Offringa

Amy Pender

Murray Potter

Chitra Prasad

Suzanne Ratko

Ramona Salvarinova

Andreas Schulze

Komudi Siriwardena

Neal Sondheimer

Rebecca Sparkes

Sylvia Stockler-Ipsiroglu

Kendra Tapscott

Yannis Trakadis

Lesley Turner

Clara Van Karnebeek

Anthony Vandersteen

Jagdeep S. Walia

Brenda J. Wilson

Andrea C. Yu

Beth K. Potter

Pranesh Chakraborty

2024-01-12

BMC Pediatrics (published)

doi.org

Combining Confidence Elicitation and Sample-based Methods for Uncertainty Quantification in Misinformation Mitigation

Mauricio Rivera

Jean-François Godbout

Reihaneh Rabbany

Kellin Pelrine

2024-01-12

ArXiv (preprint)

doi.org

arxiv.org

Comparing GPT-4 and Open-Source Language Models in Misinformation Mitigation

Tyler Vergho

Jean-François Godbout

Reihaneh Rabbany

Kellin Pelrine

Recent large language models (LLMs) have been shown to be effective for misinformation detection. However, the choice of LLMs for experiments varies widely, leading to uncertain conclusions. In particular, GPT-4 is known to be strong in this domain, but it is closed source, potentially expensive, and can show instability between different versions. Meanwhile, alternative LLMs have given mixed results. In this work, we show that Zephyr-7b presents a consistently viable alternative, overcoming key limitations of commonly used approaches like Llama-2 and GPT-3.5. This provides the research community with a solid open-source option and shows open-source models are gradually catching up on this task. We then highlight how GPT-3.5 exhibits unstable performance, such that this very widely used model could provide misleading results in misinformation detection. Finally, we validate new tools including approaches to structured output and the latest version of GPT-4 (Turbo), showing they do not compromise performance, thus unlocking them for future research and potentially enabling more complex pipelines for misinformation mitigation.

2024-01-11

ArXiv (preprint)

doi.org

arxiv.org

Personalized inference for neurostimulation with meta-learning: a case study of vagus nerve stimulation

Ximeng Mao

Yao-Chuan Chang

Stavros Zanos

Guillaume Lajoie

2024-01-11

Journal of Neural Engineering (published)

doi.org

DINOv2: Learning Robust Visual Features without Supervision

Maxime Oquab

Timothée Darcet

Théo Moutakanni

Huy V. Vo

Marc Szafraniec

Vasil Khalidov

Pierre Fernandez

Daniel Haziza

Francisco Massa

Alaaeldin El-Nouby

Mahmoud Assran

Nicolas Ballas

Wojciech Galuba

Russell Howes

Po-Yao Huang

Shang-Wen Li

Ishan Misra

Michael G. Rabbat

Vasu Sharma

Gabriel Synnaeve

Huijiao Xu

Hu Xu

Herve Jegou

Julien Mairal

Patrick Labatut

Armand Joulin

Piotr Bojanowski

The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning. This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources. We revisit existing approaches and combine different techniques to scale our pretraining in terms of data and model size. Most of the technical contributions aim at accelerating and stabilizing the training at scale. In terms of data, we propose an automatic pipeline to build a dedicated, diverse, and curated image dataset instead of uncurated data, as typically done in the self-supervised literature. In terms of models, we train a ViT model with 1B parameters and distill it into a series of smaller models that surpass the best available all-purpose features, OpenCLIP, on most of the benchmarks at image and pixel levels.

2024-01-10

TMLR (accepted)

doi.org

openreview.net

A database of the healthy human spinal cord morphometry in the PAM50 template space

Jan Valosek

Sandrine Bédard

Miloš Keřkovský

Tomáš Rohan

Julien Cohen-Adad

Measures of spinal cord morphometry computed from magnetic resonance images serve as relevant prognostic biomarkers for a range of spinal cord pathologies, including traumatic and non-traumatic spinal cord injury and neurodegenerative diseases. However, interpreting these imaging biomarkers is difficult due to considerable intra- and inter-subject variability. Yet, there is no clear consensus on a normalization method that would help reduce this variability and more insights into the distribution of these morphometrics are needed. In this study, we computed a database of normative values for six commonly used measures of spinal cord morphometry: cross-sectional area, anteroposterior diameter, transverse diameter, compression ratio, eccentricity, and solidity. Normative values were computed from a large open-access dataset of healthy adult volunteers (N = 203) and were brought to the common space of the PAM50 spinal cord template using a newly proposed normalization method based on linear interpolation. Compared to traditional image-based registration, the proposed normalization approach does not involve image transformations and, therefore, does not introduce distortions of spinal cord anatomy. This is a crucial consideration in preserving the integrity of the spinal cord anatomy in conditions such as spinal cord injury. This new morphometric database allows researchers to normalize based on sex and age, thereby minimizing inter-subject variability associated with demographic and biological factors. The proposed methodology is open-source and accessible through the Spinal Cord Toolbox (SCT) v6.0 and higher.

2024-01-09

Imaging Neuroscience (published)

doi.org

Nonparametric Partial Disentanglement via Mechanism Sparsity: Sparse Actions, Interventions and Sparse Temporal Dependencies

Sébastien Lachapelle

Pau Rodríguez

Yash Sharma

Katie Everett

Rémi Le Priol

Alexandre Lacoste

Simon Lacoste-Julien

2024-01-09

ArXiv (preprint)

doi.org

arxiv.org

DyG2Vec: Efficient Representation Learning for Dynamic Graphs

Mohammad Alomrani

Mahdi Biparva

Yingxue Zhang

Mark J. Coates

Temporal graph neural networks have shown promising results in learning inductive representations by automatically extracting temporal patterns. However, previous works often rely on complex memory modules or inefficient random walk methods to construct temporal representations. To address these limitations, we present an efficient yet effective attention-based encoder that leverages temporal edge encodings and window-based subgraph sampling to generate task-agnostic embeddings. Moreover, we propose a joint-embedding architecture using non-contrastive SSL to learn rich temporal embeddings without labels. Experimental results on 7 benchmark datasets indicate that on average, our model outperforms SoTA baselines on the future link prediction task by 4.23% for the transductive setting and 3.30% for the inductive setting while only requiring 5-10x less training/inference time. Lastly, different aspects of the proposed framework are investigated through experimental analysis and ablation studies. The code is publicly available at https://github.com/huawei-noah/noah-research/tree/master/graph_atlas.

2024-01-07

TMLR (accepted)

openreview.net
