Towards Foundational Models for Molecular Learning on Large-Scale Multi-Task Datasets
Shenyang Huang
Joao Alex Cunha
Zhiyi Li
Gabriela Moisescu-Pareja
Oleksandr Dymov
Samuel Maddrell-Mander
Callum McLean
Frederik Wenkel
Luis Müller
Jama Hussein Mohamud
Ali Parviz
Michael Craig
Michał Koziarski
Jiarui Lu
Zhaocheng Zhu
Cristian Gabellini
Kerstin Klaser
Josef Dean
Cas Wognum
Maciej Sypetkowski
Christopher Morris
Ioannis Koutis
Prudencio Tossou
Hadrien Mary
Therence Bois
Andrew William Fitzgibbon
Blazej Banaszewski
Chad Martin
Dominic Masters
Recently, pre-trained foundation models have enabled significant advancements in multiple fields. In molecular machine learning, however, where datasets are often hand-curated, and hence typically small, the lack of datasets with labeled features, and codebases to manage those datasets, has hindered the development of foundation models. In this work, we present seven novel datasets categorized by size into three distinct categories: ToyMix, LargeMix and UltraLarge. These datasets push the boundaries in both the scale and the diversity of supervised labels for molecular learning. They cover nearly 100 million molecules and over 3000 sparsely defined tasks, totaling more than 13 billion individual labels of both quantum and biological nature. In comparison, our datasets contain 300 times more data points than the widely used OGB-LSC PCQM4Mv2 dataset, and 13 times more than the quantum-only QM1B dataset. In addition, to support the development of foundational models based on our proposed datasets, we present the Graphium graph machine learning library, which simplifies the process of building and training molecular machine learning models for multi-task and multi-level molecular datasets. Finally, we present a range of baseline results as a starting point for multi-task and multi-level training on these datasets. Empirically, we observe that performance on low-resource biological datasets improves when the model is also trained on large amounts of quantum data. This indicates that there may be potential in multi-task and multi-level training of a foundation model and fine-tuning it to resource-constrained downstream tasks. The Graphium library is publicly available on GitHub and the dataset links are available in Part 1 and Part 2.
Tree Cross Attention
Leo Feng
Frederick Tung
Hossein Hajimirsadeghi
Mohamed Osama Ahmed
Cross Attention is a popular method for retrieving information from a set of context tokens for making predictions. At inference time, for each prediction, Cross Attention scans the full set of
Würstchen: An Efficient Architecture for Large-Scale Text-to-Image Diffusion Models
Pablo Pernias
Dominic Rampas
Mats Leon Richter
Marc Aubreville
BCG immunization induces CX3CR1hi effector memory T cells to provide cross-protection via IFN-γ-mediated trained immunity.
Kim A. Tran
Erwan Pernet
Mina Sadeghi
Jeffrey Downey
Julia Chronopoulos
Elizabeth Lapshina
Oscar Tsai
Eva Kaufmann
Maziar Divangahi
Computational pathology: A survey review and the way forward
Mahdi S. Hosseini
Babak Ehteshami Bejnordi
Vincent Quoc-Huy Trinh
Danial Hasan
Xingwen Li
Taehyo Kim
Haochen Zhang
Theodore Wu
Kajanan Chinniah
Sina Maghsoudlou
Ryan Zhang
Stephen Yang
Jiadai Zhu
Lyndon Chan
Samir Khaki
Andrei Buin
Fatemeh Chaji
Ala Salehi
Alejandra Zambrano Luna
Bich Ngoc Nguyen
Dimitris Samaras
Konstantinos N. Plataniotis
Assessing the quality and value of metabolic chart data for capturing core outcomes for pediatric medium-chain acyl-CoA dehydrogenase (MCAD) deficiency
Ryan Iverson
Monica Taljaard
Michael T. Geraghty
Michael Pugliese
Kylie Tingley
Doug Coyle
Jonathan B. Kronick
Kumanan Wilson
Valerie Austin
Catherine Brunel-Guitton
Daniela Buhas
Nancy J. Butcher
Alicia K. J. Chan
Sarah Dyack
Sharan Goobie
Cheryl Greenberg
Shailly Jain-Ghai
Michal Inbar-Feigenberg
Natalya Karp
Mariya Kozenko
Erica Langley
Matthew Lines
Julian Little
Jennifer MacKenzie
Bruno Maranda
Saadet Mercimek-Andrews
Aizeddin Mhanni
John J. Mitchell
Laura Nagy
Martin Offringa
Amy Pender
Murray Potter
Chitra Prasad
Suzanne Ratko
Ramona Salvarinova
Andreas Schulze
Komudi Siriwardena
Neal Sondheimer
Rebecca Sparkes
Sylvia Stockler-Ipsiroglu
Kendra Tapscott
Lesley Turner
Clara Van Karnebeek
Anthony Vandersteen
Jagdeep S. Walia
Brenda J. Wilson
Andrea C. Yu
Beth K. Potter
Pranesh Chakraborty
Combining Confidence Elicitation and Sample-based Methods for Uncertainty Quantification in Misinformation Mitigation
Mauricio Rivera
Kellin Pelrine
Large Language Models have emerged as prime candidates to tackle misinformation mitigation. However, existing approaches struggle with hallucinations and overconfident predictions. We propose an uncertainty quantification framework that leverages both direct confidence elicitation and sample-based consistency methods to provide better calibration for NLP misinformation mitigation solutions. We first investigate the calibration of sample-based consistency methods that exploit distinct features of consistency across sample sizes and stochastic levels. Next, we evaluate the performance and distributional shift of a robust numeric verbalization prompt across single-step vs. two-step confidence elicitation procedures. We also compare the performance of the same prompt with different versions of GPT and different numerical scales. Finally, we combine the sample-based consistency and verbalized methods to propose a hybrid framework that yields better uncertainty estimation for GPT models. Overall, our work proposes novel uncertainty quantification methods that will improve the reliability of Large Language Models in misinformation mitigation applications.
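The abstract above does not say how the two confidence signals are combined, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes that sample-based consistency is measured as the fraction of sampled answers agreeing with the majority label, and that the hybrid score is a weighted average of that fraction and a verbalized confidence in [0, 1]. The function names and the `weight` parameter are hypothetical.

```python
from collections import Counter


def consistency_confidence(samples):
    """Return (majority_label, share of samples that agree with it).

    `samples` holds the labels obtained by querying a model several times
    with sampling enabled. Agreement share is one simple, assumed measure
    of sample-based consistency.
    """
    if not samples:
        raise ValueError("need at least one sampled answer")
    label, count = Counter(samples).most_common(1)[0]
    return label, count / len(samples)


def hybrid_confidence(samples, verbalized, weight=0.5):
    """Blend agreement share with a verbalized confidence score.

    `verbalized` is the confidence the model states for its own answer,
    rescaled to [0, 1]. The weighted average is an assumption made for
    illustration; the paper's actual combination rule is not given here.
    """
    if not 0.0 <= verbalized <= 1.0:
        raise ValueError("verbalized confidence must lie in [0, 1]")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    label, agreement = consistency_confidence(samples)
    return label, weight * agreement + (1.0 - weight) * verbalized


if __name__ == "__main__":
    # Four hypothetical sampled verdicts for one claim, plus a stated
    # confidence of 0.9 from a separate verbalization prompt.
    print(hybrid_confidence(["false", "false", "true", "false"], 0.9))
```

Setting `weight` to 1.0 or 0.0 recovers the purely sample-based or purely verbalized estimate, which makes the blended score easy to compare against each component on a calibration set.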
Comparing GPT-4 and Open-Source Language Models in Misinformation Mitigation
Tyler Vergho
Kellin Pelrine
Recent large language models (LLMs) have been shown to be effective for misinformation detection. However, the choice of LLMs for experiments varies widely, leading to uncertain conclusions. In particular, GPT-4 is known to be strong in this domain, but it is closed source, potentially expensive, and can show instability between different versions. Meanwhile, alternative LLMs have given mixed results. In this work, we show that Zephyr-7b presents a consistently viable alternative, overcoming key limitations of commonly used approaches like Llama-2 and GPT-3.5. This provides the research community with a solid open-source option and shows open-source models are gradually catching up on this task. We then highlight how GPT-3.5 exhibits unstable performance, such that this very widely used model could provide misleading results in misinformation detection. Finally, we validate new tools including approaches to structured output and the latest version of GPT-4 (Turbo), showing they do not compromise performance, thus unlocking them for future research and potentially enabling more complex pipelines for misinformation mitigation.