Publications

Towards Foundation Models for Knowledge Graph Reasoning
Mikhail Galkin
Xinyu Yuan
Hesham Mostafa
Zhaocheng Zhu
Foundation models in language and vision have the ability to run inference on any textual and visual inputs thanks to transferable representations such as a vocabulary of tokens in language. Knowledge graphs (KGs) have different entity and relation vocabularies that generally do not overlap. The key challenge of designing foundation models on KGs is to learn such transferable representations that enable inference on any graph with arbitrary entity and relation vocabularies. In this work, we make a step towards such foundation models and present ULTRA, an approach for learning universal and transferable graph representations. ULTRA builds relational representations as a function conditioned on their interactions. Such a conditioning strategy allows a pre-trained ULTRA model to inductively generalize to any unseen KG with any relation vocabulary and to be fine-tuned on any graph. Conducting link prediction experiments on 57 different KGs, we find that the zero-shot inductive inference performance of a single pre-trained ULTRA model on unseen graphs of various sizes is often on par with or better than strong baselines trained on specific graphs. Fine-tuning further boosts the performance.
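A minimal sketch of the conditioning idea described above, under toy assumptions (a list of (head, relation, tail) triples; the function name and edge labels are illustrative, not the authors' code): relations are related to one another by whether they share head or tail entities, and that interaction structure, rather than a learned per-relation vocabulary, is what a conditional graph encoder would consume, which is why it transfers across KGs with disjoint relation sets.

```python
# Sketch only: build a graph over relations from how they interact in a KG.
# The four interaction types follow the paper's description; everything else
# (names, toy data) is illustrative.
from collections import defaultdict
from itertools import combinations

def relation_interaction_graph(triples):
    """triples: iterable of (head, relation, tail). Returns relation-to-relation
    edges labelled by how the two relations interact."""
    heads, tails = defaultdict(set), defaultdict(set)
    for h, r, t in triples:
        heads[r].add(h)
        tails[r].add(t)
    edges = []
    for r1, r2 in combinations(heads.keys() | tails.keys(), 2):
        if heads[r1] & heads[r2]:
            edges.append((r1, "head-head", r2))
        if tails[r1] & tails[r2]:
            edges.append((r1, "tail-tail", r2))
        if heads[r1] & tails[r2]:
            edges.append((r1, "head-tail", r2))
        if tails[r1] & heads[r2]:
            edges.append((r1, "tail-head", r2))
    return edges

# Toy KG: the relation vocabulary is arbitrary, but the interaction structure
# is still well defined and comparable across graphs.
kg = [("alice", "works_at", "mila"), ("mila", "located_in", "montreal")]
print(relation_interaction_graph(kg))  # [('works_at', 'tail-head', 'located_in')]
```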
Towards Foundational Models for Molecular Learning on Large-Scale Multi-Task Datasets
Shenyang Huang
Joao Alex Cunha
Zhiyi Li
Gabriela Moisescu-Pareja
Oleksandr Dymov
Samuel Maddrell-Mander
Callum McLean
Frederik Wenkel
Luis Müller
Jama Hussein Mohamud
Ali Parviz
Michael Craig
Michał Koziarski
Jiarui Lu
Zhaocheng Zhu
Cristian Gabellini
Kerstin Klaser
Josef Dean
Cas Wognum
Maciej Sypetkowski
Christopher Morris
Ioannis Koutis
Prudencio Tossou
Hadrien Mary
Therence Bois
Andrew William Fitzgibbon
Blazej Banaszewski
Chad Martin
Dominic Masters
Recently, pre-trained foundation models have enabled significant advancements in multiple fields. In molecular machine learning, however, where datasets are often hand-curated, and hence typically small, the lack of datasets with labeled features, and of codebases to manage those datasets, has hindered the development of foundation models. In this work, we present seven novel datasets categorized by size into three distinct categories: ToyMix, LargeMix and UltraLarge. These datasets push the boundaries in both the scale and the diversity of supervised labels for molecular learning. They cover nearly 100 million molecules and over 3000 sparsely defined tasks, totaling more than 13 billion individual labels of both quantum and biological nature. In comparison, our datasets contain 300 times more data points than the widely used OGB-LSC PCQM4Mv2 dataset, and 13 times more than the quantum-only QM1B dataset. In addition, to support the development of foundational models based on our proposed datasets, we present the Graphium graph machine learning library which simplifies the process of building and training molecular machine learning models for multi-task and multi-level molecular datasets. Finally, we present a range of baseline results as a starting point for multi-task and multi-level training on these datasets. Empirically, we observe that performance on low-resource biological datasets improves when also training on large amounts of quantum data. This indicates that there may be potential in multi-task and multi-level training of a foundation model and fine-tuning it to resource-constrained downstream tasks. The Graphium library is publicly available on GitHub and the dataset links are available in Part 1 and Part 2.
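A minimal PyTorch sketch (not the Graphium API; all names, shapes, and the toy model here are illustrative) of the masked multi-task loss that makes sparsely defined tasks workable: each molecule contributes only to the tasks for which it actually has a label, so densely labeled quantum tasks can be trained jointly with sparsely labeled biological ones.

```python
# Sketch: multi-task regression with a label mask over sparsely defined tasks.
import torch
import torch.nn as nn

n_tasks, d_in = 4, 16
model = nn.Sequential(nn.Linear(d_in, 64), nn.ReLU(), nn.Linear(64, n_tasks))

x = torch.randn(8, d_in)             # stand-in for molecular features
y = torch.randn(8, n_tasks)          # per-task targets (garbage where unlabeled)
mask = torch.rand(8, n_tasks) > 0.7  # True where a label actually exists

pred = model(x)
per_label = nn.functional.mse_loss(pred, y, reduction="none")
# Average the loss only over labels that exist; missing entries contribute nothing.
loss = (per_label * mask).sum() / mask.sum().clamp(min=1)
loss.backward()
```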
Tree Cross Attention
Leo Feng
Frederick Tung
Hossein Hajimirsadeghi
Mohamed Osama Ahmed
Cross Attention is a popular method for retrieving information from a set of context tokens for making predictions. At inference time, for each prediction, Cross Attention scans the full set of context tokens.
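A minimal sketch of the standard Cross Attention retrieval step described above (illustrative only; it does not reproduce the paper's tree-structured variant): every query attends over the full set of context tokens, so the per-prediction cost grows linearly with the number of tokens retained.

```python
# Sketch: vanilla cross attention, scanning all context tokens per query.
import torch

def cross_attention(query, context):
    """query: (n_q, d); context: (n_ctx, d). Cost is linear in n_ctx."""
    scores = query @ context.T / context.shape[-1] ** 0.5  # (n_q, n_ctx)
    weights = torch.softmax(scores, dim=-1)
    return weights @ context                               # (n_q, d)

out = cross_attention(torch.randn(2, 32), torch.randn(1000, 32))
print(out.shape)  # torch.Size([2, 32])
```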
Würstchen: An Efficient Architecture for Large-Scale Text-to-Image Diffusion Models
Pablo Pernias
Dominic Rampas
Mats Leon Richter
Marc Aubreville
Are self-explanations from Large Language Models faithful?
BCG immunization induces CX3CR1hi effector memory T cells to provide cross-protection via IFN-γ-mediated trained immunity.
Kim A. Tran
Erwan Pernet
Mina Sadeghi
Jeffrey Downey
Julia Chronopoulos
Elizabeth Lapshina
Oscar Tsai
Eva Kaufmann
Maziar Divangahi
Assessing the quality and value of metabolic chart data for capturing core outcomes for pediatric medium-chain acyl-CoA dehydrogenase (MCAD) deficiency
Ryan Iverson
Monica Taljaard
Michael T. Geraghty
Michael Pugliese
Kylie Tingley
Doug Coyle
Jonathan B. Kronick
Kumanan Wilson
Valerie Austin
Catherine Brunel-Guitton
Daniela Buhas
Nancy J. Butcher
Alicia K. J. Chan
Sarah Dyack
Sharan Goobie
Cheryl Greenberg
Shailly Jain-Ghai
Michal Inbar-Feigenberg
Natalya Karp
Mariya Kozenko
Erica Langley
Matthew Lines
Julian Little
Jennifer MacKenzie
Bruno Maranda
Saadet Mercimek-Andrews
Aizeddin Mhanni
John J. Mitchell
Laura Nagy
Martin Offringa
Amy Pender
Murray Potter
Chitra Prasad
Suzanne Ratko
Ramona Salvarinova
Andreas Schulze
Komudi Siriwardena
Neal Sondheimer
Rebecca Sparkes
Sylvia Stockler-Ipsiroglu
Kendra Tapscott
Lesley Turner
Clara Van Karnebeek
Anthony Vandersteen
Jagdeep S. Walia
Brenda J. Wilson
Andrea C. Yu
Beth K. Potter
Pranesh Chakraborty
Combining Confidence Elicitation and Sample-based Methods for Uncertainty Quantification in Misinformation Mitigation
Mauricio Rivera
Jean-François Godbout
Kellin Pelrine
Comparing GPT-4 and Open-Source Language Models in Misinformation Mitigation
Tyler Vergho
Jean-François Godbout
Kellin Pelrine
Recent large language models (LLMs) have been shown to be effective for misinformation detection. However, the choice of LLMs for experiments varies widely, leading to uncertain conclusions. In particular, GPT-4 is known to be strong in this domain, but it is closed source, potentially expensive, and can show instability between different versions. Meanwhile, alternative LLMs have given mixed results. In this work, we show that Zephyr-7b presents a consistently viable alternative, overcoming key limitations of commonly used approaches like Llama-2 and GPT-3.5. This provides the research community with a solid open-source option and shows open-source models are gradually catching up on this task. We then highlight how GPT-3.5 exhibits unstable performance, such that this very widely used model could provide misleading results in misinformation detection. Finally, we validate new tools including approaches to structured output and the latest version of GPT-4 (Turbo), showing they do not compromise performance, thus unlocking them for future research and potentially enabling more complex pipelines for misinformation mitigation.
Laplacian Change Point Detection for Single and Multi-view Dynamic Graphs
Shenyang Huang
Samy Coulombe
Yasmeen Hitti
Dynamic graphs are rich data structures that are used to model complex relationships between entities over time. In particular, anomaly detection in temporal graphs is crucial for many real-world applications such as intrusion identification in network systems, detection of ecosystem disturbances, and detection of epidemic outbreaks. In this article, we focus on change point detection in dynamic graphs and address three main challenges associated with this problem: (i) how to compare graph snapshots across time, (ii) how to capture temporal dependencies, and (iii) how to combine different views of a temporal graph. To solve the above challenges, we first propose Laplacian Anomaly Detection (LAD), which uses the spectrum of the graph Laplacian as a low-dimensional embedding of the graph structure at each snapshot. LAD explicitly models short-term and long-term dependencies by applying two sliding windows. Next, we propose MultiLAD, a simple and effective generalization of LAD to multi-view graphs. MultiLAD provides the first change point detection method for multi-view dynamic graphs. It aggregates the singular values of the normalized graph Laplacian from different views through the scalar power mean operation. Through extensive synthetic experiments, we show that (i) LAD and MultiLAD are accurate and outperform state-of-the-art baselines and their multi-view extensions by a large margin, (ii) MultiLAD's advantage over contenders significantly increases when additional views are available, and (iii) MultiLAD is highly robust to noise from individual views. On five real-world dynamic graphs, we demonstrate that LAD and MultiLAD identify significant events as top anomalies, such as the implementation of government COVID-19 interventions which impacted population mobility in multi-view traffic networks.
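A minimal, single-window sketch of the LAD idea under simplifying assumptions (dense NumPy Laplacians, and a plain Euclidean deviation standing in for the paper's sliding-window scores; not the authors' implementation): embed each snapshot by the top singular values of its Laplacian, then flag snapshots whose embedding deviates most from the recent window.

```python
# Sketch: Laplacian-spectrum snapshot embeddings with a single sliding window.
import numpy as np

def snapshot_embedding(adj, k=4):
    """adj: (n, n) symmetric adjacency of one snapshot; top-k Laplacian singular values."""
    lap = np.diag(adj.sum(axis=1)) - adj
    sigma = np.linalg.svd(lap, compute_uv=False)
    return sigma[:k]

def change_scores(adjs, window=3, k=4):
    embs = [snapshot_embedding(a, k) for a in adjs]
    scores = [0.0]  # no history for the first snapshot
    for t in range(1, len(embs)):
        context = np.mean(embs[max(0, t - window):t], axis=0)
        scores.append(float(np.linalg.norm(embs[t] - context)))
    return scores

# Toy example: sparse random snapshots with a much denser graph injected at t = 5.
rng = np.random.default_rng(0)
adjs = [(rng.random((20, 20)) < 0.1).astype(float) for _ in range(10)]
adjs[5] = (rng.random((20, 20)) < 0.5).astype(float)
adjs = [np.triu(a, 1) + np.triu(a, 1).T for a in adjs]  # symmetrize, drop self-loops
scores = change_scores(adjs)
print(scores.index(max(scores)))  # the injected change at t = 5 should score highest
```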
Personalized inference for neurostimulation with meta-learning: a case study of vagus nerve stimulation
Ximeng Mao
Yao-Chuan Chang
Stavros Zanos
DINOv2: Learning Robust Visual Features without Supervision
Maxime Oquab
Timothée Darcet
Théo Moutakanni
Huy V. Vo
Marc Szafraniec
Vasil Khalidov
Pierre Fernandez
Daniel Haziza
Francisco Massa
Alaaeldin El-Nouby
Mahmoud Assran
Nicolas Ballas
Wojciech Galuba
Russell Howes
Po-Yao Huang
Shang-Wen Li
Ishan Misra
Vasu Sharma
Gabriel Synnaeve
Hu Xu
Huijiao Xu
Herve Jegou
Julien Mairal
Patrick Labatut
Armand Joulin
Piotr Bojanowski
The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning. This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources. We revisit existing approaches and combine different techniques to scale our pretraining in terms of data and model size. Most of the technical contributions aim at accelerating and stabilizing the training at scale. In terms of data, we propose an automatic pipeline to build a dedicated, diverse, and curated image dataset instead of uncurated data, as typically done in the self-supervised literature. In terms of models, we train a ViT model with 1B parameters and distill it into a series of smaller models that surpass the best available all-purpose features, OpenCLIP, on most of the benchmarks at image and pixel levels.
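A minimal sketch of using such a pretrained backbone as an all-purpose feature extractor without finetuning, assuming the torch.hub entry points published in the facebookresearch/dinov2 repository (the dinov2_vits14 model name and the 384-dimensional output are taken from that project and may change):

```python
# Sketch: extract general-purpose image features from a pretrained DINOv2
# backbone for downstream use (e.g., k-NN classification or a linear probe).
# Assumes the torch.hub entry point from the facebookresearch/dinov2 repo.
import torch

model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
model.eval()

# ViT-S/14 expects image sides divisible by the 14-pixel patch size.
images = torch.randn(2, 3, 224, 224)  # stand-in for a preprocessed batch
with torch.no_grad():
    features = model(images)          # expected shape (2, 384): per-image features
print(features.shape)
```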