Prudencio Tossou

Collaborating researcher - Valence


Towards Foundational Models for Molecular Learning on Large-Scale Multi-Task Datasets
Shenyang Huang
Joao Alex Cunha
Zhiyi Li
Gabriela Moisescu-Pareja
Oleksandr Dymov
Samuel Maddrell-Mander
Callum McLean
Frederik Wenkel
Luis Müller
Jama Hussein Mohamud
Ali Parviz
Michael Craig
Michał Koziarski
Jiarui Lu
Zhaocheng Zhu
Cristian Gabellini
Kerstin Klaser
Josef Dean
Cas Wognum
Maciej Sypetkowski
Christopher Morris
Ioannis Koutis
Prudencio Tossou
Hadrien Mary
Therence Bois
Andrew William Fitzgibbon
Blazej Banaszewski
Chad Martin
Dominic Masters
Recently, pre-trained foundation models have enabled significant advancements in multiple fields. In molecular machine learning, however, where datasets are often hand-curated and hence typically small, the lack of datasets with labeled features, and of codebases to manage those datasets, has hindered the development of foundation models. In this work, we present seven novel datasets categorized by size into three distinct categories: ToyMix, LargeMix and UltraLarge. These datasets push the boundaries in both the scale and the diversity of supervised labels for molecular learning. They cover nearly 100 million molecules and over 3000 sparsely defined tasks, totaling more than 13 billion individual labels of both quantum and biological nature. In comparison, our datasets contain 300 times more data points than the widely used OGB-LSC PCQM4Mv2 dataset, and 13 times more than the quantum-only QM1B dataset. In addition, to support the development of foundational models based on our proposed datasets, we present the Graphium graph machine learning library, which simplifies the process of building and training molecular machine learning models for multi-task and multi-level molecular datasets. Finally, we present a range of baseline results as a starting point for multi-task and multi-level training on these datasets. Empirically, we observe that performance on low-resource biological datasets improves when the model is also trained on large amounts of quantum data. This indicates that there may be potential in multi-task and multi-level training of a foundation model and fine-tuning it to resource-constrained downstream tasks. The Graphium library is publicly available on GitHub and the dataset links are available in Part 1 and Part 2.
Role of Structural and Conformational Diversity for Machine Learning Potentials
Nikhil Shenoy
Prudencio Tossou
Emmanuel Noutahi
Hadrien Mary
Jiarui Ding
In the field of Machine Learning Interatomic Potentials (MLIPs), understanding the intricate relationship between data biases, specifically conformational and structural diversity, and model generalization is critical in improving the quality of Quantum Mechanics (QM) data generation efforts. We investigate these dynamics through two distinct experiments: a fixed-budget experiment, where the dataset size remains constant, and a fixed-molecular-set experiment, which holds structural diversity fixed while varying conformational diversity. Our results reveal nuanced patterns in generalization metrics. Notably, for optimal structural and conformational generalization, a careful balance between structural and conformational diversity is required, but existing QM datasets do not meet that trade-off. Additionally, our results highlight the limitations of MLIP models at generalizing beyond their training distribution, emphasizing the importance of defining the applicability domain during model deployment. These findings provide valuable insights and guidelines for QM data generation efforts.
MOT: A Multi-Omics Transformer for Multiclass Classification Tumour Types Predictions
Mazid Osseni
Prudencio Tossou
François Laviolette
Jacques Corbeil