Publications

Towards Foundational Models for Molecular Learning on Large-Scale Multi-Task Datasets
Joao Alex Cunha
Zhiyi Li
Samuel Maddrell-Mander
Callum McLean
Jama Hussein Mohamud
Michael Craig
Cristian Gabellini
Kerstin Klasers
Josef Dean
Maciej Sypetkowski
Ioannis Koutis
Hadrien Mary
Therence Bois
Andrew Fitzgibbon
Błażej Banaszewski
Chad Martin
Dominic Masters
Recently, pre-trained foundation models have shown significant advancements in multiple fields. However, the lack of datasets with labeled features and codebases has hindered the development of a supervised foundation model for molecular tasks. Here, we have carefully curated seven datasets specifically tailored for node- and graph-level prediction tasks to facilitate supervised learning on molecules. Moreover, to support the development of multi-task learning on our proposed datasets, we created the Graphium graph machine learning library. Our dataset collection encompasses two distinct categories. Firstly, the TOYMIX category modifies three small existing datasets with additional data for multi-task learning. Secondly, the LARGEMIX category includes four large-scale datasets with 344M graph-level data points and 409M node-level data points from ∼5M unique molecules. Finally, the ultra-large dataset contains 2,210M graph-level data points and 2,031M node-level data points coming from 86M molecules. Hence our datasets represent an order of magnitude increase in data volume compared to other 2D-GNN datasets. In addition, recognizing that molecule-related tasks often span multiple levels, we have designed our library to explicitly support multi-tasking, offering a diverse range of multi-level representations, i.e., representations at the graph, node, edge, and node-pair level. We equipped the library with an extensive collection of models and features to cover different levels of molecule analysis. By combining our curated datasets with this versatile library, we aim to accelerate the development of molecule foundation models. Datasets and code are available at https://github.com/datamol-io/graphium.
BCG immunization induces CX3CR1hi effector memory T cells to provide cross-protection via IFN-γ-mediated trained immunity.
Kim A. Tran
Erwan Pernet
Mina Sadeghi
Jeffrey Downey
Julia Chronopoulos
Elizabeth Lapshina
Oscar Tsai
Eva Kaufmann
Maziar Divangahi
Self-Supervised Anomaly Detection: A Survey and Outlook
Thi Kieu Khanh Ho
Narges Armanfard
Anomaly detection (AD) plays a crucial role in various domains, including cybersecurity, finance, and healthcare, by identifying patterns or events that deviate from normal behaviour. In recent years, significant progress has been made in this field due to the remarkable growth of deep learning models. Notably, the advent of self-supervised learning has sparked the development of novel AD algorithms that outperform the existing state-of-the-art approaches by a considerable margin. This paper aims to provide a comprehensive review of the current methodologies in self-supervised anomaly detection. We present technical details of the standard methods and discuss their strengths and drawbacks. We also compare the performance of these models against each other and other state-of-the-art anomaly detection models. Finally, the paper concludes with a discussion of future directions for self-supervised anomaly detection, including the development of more effective and efficient algorithms and the integration of these techniques with other related fields, such as multi-modal learning.
Computational pathology: A survey review and the way forward
Mahdi S. Hosseini
Babak Ehteshami Bejnordi
Vincent Quoc-Huy Trinh
Danial Hasan
Xingwen Li
Taehyo Kim
Haochen Zhang
Theodore Wu
Kajanan Chinniah
Sina Maghsoudlou
Ryan Zhang
Stephen Yang
Jiadai Zhu
Lyndon Chan
Samir Khaki
Andrei Buin
Fatemeh Chaji
Ala Salehi
Alejandra Zambrano Luna
Bich Ngoc Nguyen
Dimitris Samaras
Konstantinos N. Plataniotis
Assessing the quality and value of metabolic chart data for capturing core outcomes for pediatric medium-chain acyl-CoA dehydrogenase (MCAD) deficiency
Ryan Iverson
Monica Taljaard
Michael T. Geraghty
Michael Pugliese
Kylie Tingley
Doug Coyle
Jonathan B. Kronick
Kumanan Wilson
Valerie Austin
Catherine Brunel-Guitton
Daniela Buhas
Nancy J. Butcher
Alicia K. J. Chan
Sarah Dyack
Sharan Goobie
Cheryl Greenberg
Shailly Jain-Ghai
Michal Inbar-Feigenberg
Natalya Karp
Mariya Kozenko
Erica Langley
Matthew Lines
Julian Little
Jennifer MacKenzie
Bruno Maranda
Saadet Mercimek-Andrews
Aizeddin Mhanni
John J. Mitchell
Laura Nagy
Martin Offringa
Amy Pender
Murray Potter
Chitra Prasad
Suzanne Ratko
Ramona Salvarinova
Andreas Schulze
Komudi Siriwardena
Neal Sondheimer
Rebecca Sparkes
Sylvia Stockler-Ipsiroglu
Kendra Tapscott
Lesley Turner
Clara Van Karnebeek
Anthony Vandersteen
Jagdeep S. Walia
Brenda J. Wilson
Andrea C. Yu
Beth K. Potter
Pranesh Chakraborty
Combining Confidence Elicitation and Sample-based Methods for Uncertainty Quantification in Misinformation Mitigation
Comparing GPT-4 and Open-Source Language Models in Misinformation Mitigation
Recent large language models (LLMs) have been shown to be effective for misinformation detection. However, the choice of LLMs for experiments varies widely, leading to uncertain conclusions. In particular, GPT-4 is known to be strong in this domain, but it is closed source, potentially expensive, and can show instability between different versions. Meanwhile, alternative LLMs have given mixed results. In this work, we show that Zephyr-7b presents a consistently viable alternative, overcoming key limitations of commonly used approaches like Llama-2 and GPT-3.5. This provides the research community with a solid open-source option and shows open-source models are gradually catching up on this task. We then highlight how GPT-3.5 exhibits unstable performance, such that this very widely used model could provide misleading results in misinformation detection. Finally, we validate new tools including approaches to structured output and the latest version of GPT-4 (Turbo), showing they do not compromise performance, thus unlocking them for future research and potentially enabling more complex pipelines for misinformation mitigation.
Personalized inference for neurostimulation with meta-learning: a case study of vagus nerve stimulation
Yao-Chuan Chang
Stavros Zanos
DINOv2: Learning Robust Visual Features without Supervision
Maxime Oquab
Timothée Darcet
Théo Moutakanni
Huy V. Vo
Marc Szafraniec
Vasil Khalidov
Pierre Fernandez
Daniel Haziza
Francisco Massa
Alaaeldin El-Nouby
Mahmoud Assran
Wojciech Galuba
Russell Howes
Po-Yao Huang
Shang-Wen Li
Ishan Misra
Michael G. Rabbat
Vasu Sharma
Gabriel Synnaeve
Huijiao Xu
Hu Xu
Herve Jegou
Julien Mairal
Patrick Labatut
Armand Joulin
Piotr Bojanowski
The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning. This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources. We revisit existing approaches and combine different techniques to scale our pretraining in terms of data and model size. Most of the technical contributions aim at accelerating and stabilizing the training at scale. In terms of data, we propose an automatic pipeline to build a dedicated, diverse, and curated image dataset instead of uncurated data, as typically done in the self-supervised literature. In terms of models, we train a ViT model with 1B parameters and distill it into a series of smaller models that surpass the best available all-purpose features, OpenCLIP, on most of the benchmarks at image and pixel levels.
A database of the healthy human spinal cord morphometry in the PAM50 template space
Miloš Keřkovský
Tomáš Rohan
Measures of spinal cord morphometry computed from magnetic resonance images serve as relevant prognostic biomarkers for a range of spinal cord pathologies, including traumatic and non-traumatic spinal cord injury and neurodegenerative diseases. However, interpreting these imaging biomarkers is difficult due to considerable intra- and inter-subject variability. Yet, there is no clear consensus on a normalization method that would help reduce this variability and more insights into the distribution of these morphometrics are needed. In this study, we computed a database of normative values for six commonly used measures of spinal cord morphometry: cross-sectional area, anteroposterior diameter, transverse diameter, compression ratio, eccentricity, and solidity. Normative values were computed from a large open-access dataset of healthy adult volunteers (N = 203) and were brought to the common space of the PAM50 spinal cord template using a newly proposed normalization method based on linear interpolation. Compared to traditional image-based registration, the proposed normalization approach does not involve image transformations and, therefore, does not introduce distortions of spinal cord anatomy. This is a crucial consideration in preserving the integrity of the spinal cord anatomy in conditions such as spinal cord injury. This new morphometric database allows researchers to normalize based on sex and age, thereby minimizing inter-subject variability associated with demographic and biological factors. The proposed methodology is open-source and accessible through the Spinal Cord Toolbox (SCT) v6.0 and higher.
Nonparametric Partial Disentanglement via Mechanism Sparsity: Sparse Actions, Interventions and Sparse Temporal Dependencies
Pau Rodríguez
Yash Sharma
Rémi Le Priol
Alexandre Lacoste
DyG2Vec: Efficient Representation Learning for Dynamic Graphs
Mohammad Alomrani
Mahdi Biparva
Yingxue Zhang
Mark J. Coates
Temporal graph neural networks have shown promising results in learning inductive representations by automatically extracting temporal patterns. However, previous works often rely on complex memory modules or inefficient random walk methods to construct temporal representations. To address these limitations, we present an efficient yet effective attention-based encoder that leverages temporal edge encodings and window-based subgraph sampling to generate task-agnostic embeddings. Moreover, we propose a joint-embedding architecture using non-contrastive SSL to learn rich temporal embeddings without labels. Experimental results on 7 benchmark datasets indicate that on average, our model outperforms SoTA baselines on the future link prediction task by 4.23% for the transductive setting and 3.30% for the inductive setting while only requiring 5-10x less training/inference time. Lastly, different aspects of the proposed framework are investigated through experimental analysis and ablation studies. The code is publicly available at https://github.com/huawei-noah/noah-research/tree/master/graph_atlas.