Publications

An Empirical Study of Retrieval-Enhanced Graph Neural Networks

Dingmin Wang

Shengchao Liu

Hanchen Wang

Bernardo Cuenca Grau

Linfeng Song

Jian Tang

Le Song

Qi Liu

Graph Neural Networks (GNNs) are effective tools for graph representation learning. Most GNNs rely on a recursive neighborhood aggregation scheme, named message passing, and thus their theoretical expressive power is limited to that of the first-order Weisfeiler-Lehman test (1-WL). An effective approach to this challenge is to explicitly retrieve some annotated examples used to enhance GNN models. While retrieval-enhanced models have proven effective in many language and vision domains, it remains an open question how effective retrieval-enhanced GNNs are when applied to graph datasets. Motivated by this, we want to explore how the retrieval idea can help augment the useful information learned in the graph neural networks, and we design a retrieval-enhanced scheme called GRAPHRETRIEVAL, which is agnostic to the choice of graph neural network models. In GRAPHRETRIEVAL, for each input graph, similar graphs together with their ground-truth labels are retrieved from an existing database. Thus they can act as a potential enhancement to complete various graph property predictive tasks. We conduct comprehensive experiments over 13 datasets, and we observe that GRAPHRETRIEVAL is able to reach substantial improvements over existing GNNs. Moreover, our empirical study also illustrates that retrieval enhancement is a promising remedy for alleviating the long-tailed label distribution problem.

2023-09-27

Frontiers in Artificial Intelligence and Applications (published)

doi.org

arxiv.org

Influence of preprocessing, distortion correction and cardiac triggering on the quality of diffusion MR images of spinal cord

Kurt G. Schilling

Anna Combes

Karthik Ramadass

François Rheault

Grace Sweeney

Logan Prock

Subramaniam Sriram

Julien Cohen‐Adad

John C. Gore

Bennett A. Landman

Seth A. Smith

Kristin P. O’Grady

Diffusion MRI of the spinal cord (SC) is susceptible to geometric distortion caused by field inhomogeneities, and prone to misalignment across time series and signal dropout caused by biological motion. Several modifications of image acquisition and image processing techniques have been introduced to overcome these artifacts, but their specific benefits are largely unproven and warrant further investigations. We aim to evaluate two specific aspects of image acquisition and processing that address image quality in diffusion studies of the spinal cord: susceptibility corrections to reduce geometric distortions, and cardiac triggering to minimize motion artifacts. First, we evaluate 4 distortion preprocessing strategies on 7 datasets of the cervical and lumbar SC and find that while distortion correction techniques increase geometric similarity to structural images, they are largely driven by the high-contrast cerebrospinal fluid, and do not consistently improve the geometry within the cord nor improve white-to-gray matter contrast. We recommend at a minimum to perform bulk-motion correction in preprocessing and posit that improvements/adaptations are needed for spinal cord distortion preprocessing algorithms, which are currently optimized and designed for brain imaging. Second, we design experiments to evaluate the impact of removing cardiac triggering. We show that when triggering is foregone, images are qualitatively similar to triggered sequences, do not have increased prevalence of artifacts, and result in similar diffusion tensor indices with similar reproducibility to triggered acquisitions. When triggering is removed, much shorter acquisitions are possible, which are also qualitatively and quantitatively similar to triggered sequences. We suggest that removing cardiac triggering for cervical SC diffusion can be a reasonable option to save time with minimal sacrifice to image quality.

2023-09-26

bioRxiv (preprint)

doi.org

Time Delay Cosmography with a Neural Ratio Estimator

Ève Campeau-Poirier

Laurence Perreault-Levasseur

Adam Coogan

Yashar Hezaveh

We explore the use of a Neural Ratio Estimator (NRE) to determine the Hubble constant (…

2023-09-26

ArXiv (preprint)

arxiv.org

ClimateSet: A Large-Scale Climate Model Dataset for Machine Learning

Julia Kaltenborn

Charlotte Emilie Elektra Lange

Venkatesh Ramesh

Philippe Brouillard

Yaniv Gurwicz

Chandni Nagda

Jakob Runge

Peer Nowack

David Rolnick

Climate models have been key for assessing the impact of climate change and simulating future climate scenarios. The machine learning (ML) community has taken an increased interest in supporting climate scientists' efforts on various tasks such as climate model emulation, downscaling, and prediction tasks. Many of those tasks have been addressed on datasets created with single climate models. However, both the climate science and ML communities have suggested that to address those tasks at scale, we need large, consistent, and ML-ready climate model datasets. Here, we introduce ClimateSet, a dataset containing the inputs and outputs of 36 climate models from the Input4MIPs and CMIP6 archives. In addition, we provide a modular dataset pipeline for retrieving and preprocessing additional climate models and scenarios. We showcase the potential of our dataset by using it as a benchmark for ML-based climate model emulation. We gain new insights about the performance and generalization capabilities of the different ML models by analyzing their performance across different climate models. Furthermore, the dataset can be used to train an ML emulator on several climate models instead of just one. Such a "super emulator" can quickly project new climate change scenarios, complementing existing scenarios already provided to policymakers. We believe ClimateSet will create the basis needed for the ML community to tackle climate-related tasks at scale.

2023-09-24

NeurIPS.cc/2023/Track/Datasets_and_Benchmarks (poster)

doi.org

openreview.net

Evaluating Self-Supervised Learning for Molecular Graph Embeddings

Hanchen Wang

Jean Kaddour

Shengchao Liu

Jian Tang

Matt J. Kusner

Joan Lasenby

Qi Liu

Graph Self-Supervised Learning (GSSL) provides a robust pathway for acquiring embeddings without expert labelling, a capability that carries profound implications for molecular graphs due to the staggering number of potential molecules and the high cost of obtaining labels. However, GSSL methods are designed not for optimisation within a specific domain but rather for transferability across a variety of downstream tasks. This broad applicability complicates their evaluation. Addressing this challenge, we present "Molecular Graph Representation Evaluation" (MOLGRAPHEVAL), generating detailed profiles of molecular graph embeddings with interpretable and diversified attributes. MOLGRAPHEVAL offers a suite of probing tasks grouped into three categories: (i) generic graph, (ii) molecular substructure, and (iii) embedding space properties. By leveraging MOLGRAPHEVAL to benchmark existing GSSL methods against both current downstream datasets and our suite of tasks, we uncover significant inconsistencies between inferences drawn solely from existing datasets and those derived from more nuanced probing. These findings suggest that current evaluation methodologies fail to capture the entirety of the landscape.

2023-09-24

NeurIPS.cc/2023/Track/Datasets_and_Benchmarks (poster)

doi.org

openreview.net

GEO-Bench: Toward Foundation Models for Earth Monitoring

Alexandre Lacoste

Nils Lehmann

Pau Rodríguez

Evan David Sherwin

Hannah Kerner

Björn Lütjens

Jeremy Irvin

David Dao

Hamed Alemohammad

Mehmet Gunturkun

Dava Newman

Stefano Ermon

Xiao Xiang Zhu

Recent progress in self-supervision has shown that pre-training large neural networks on vast amounts of unsupervised data can lead to substantial increases in generalization to downstream tasks. Such models, recently coined foundation models, have been transformational to the field of natural language processing. Variants have also been proposed for image data, but their applicability to remote sensing tasks is limited. To stimulate the development of foundation models for Earth monitoring, we propose a benchmark comprised of six classification and six segmentation tasks, which were carefully curated and adapted to be both relevant to the field and well-suited for model evaluation. We accompany this benchmark with a robust methodology for evaluating models and reporting aggregated results to enable a reliable assessment of progress. Finally, we report results for 20 baselines to gain information about the performance of existing models. We believe that this benchmark will be a driver of progress across a variety of Earth monitoring tasks.

2023-09-24

NeurIPS.cc/2023/Track/Datasets_and_Benchmarks (poster)

doi.org

openreview.net

Minigrid & Miniworld: Modular & Customizable Reinforcement Learning Environments for Goal-Oriented Tasks

Maxime Chevalier-Boisvert

Bolun Dai

Mark Towers

Rodrigo De Lazcano Perez-Vicente

Lucas Willems

Salem Lahlou

Suman Pal

Pablo Samuel Castro

J K Terry

We present the Minigrid and Miniworld libraries which provide a suite of goal-oriented 2D and 3D environments. The libraries were explicitly created with a minimalistic design paradigm to allow users to rapidly develop new environments for a wide range of research-specific needs. As a result, both have received widescale adoption by the RL community, facilitating research in a wide range of areas. In this paper, we outline the design philosophy, environment details, and their world generation API. We also showcase the additional capabilities brought by the unified API between Minigrid and Miniworld through case studies on transfer learning (for both RL agents and humans) between the different observation spaces. The source code of Minigrid and Miniworld can be found at https://github.com/Farama-Foundation/Minigrid and https://github.com/Farama-Foundation/Miniworld along with their documentation at https://minigrid.farama.org/ and https://miniworld.farama.org/.

2023-09-24

NeurIPS.cc/2023/Track/Datasets_and_Benchmarks (poster)

doi.org

openreview.net

PUG: Photorealistic and Semantically Controllable Synthetic Data for Representation Learning

Florian Bordes

Shashank Shekhar

Mark Ibrahim

Diane Bouchacourt

Pascal Vincent

Ari S. Morcos

Synthetic image datasets offer unmatched advantages for designing and evaluating deep neural networks: they make it possible to (i) render as many data samples as needed, (ii) precisely control each scene and yield granular ground truth labels (and captions), (iii) precisely control distribution shifts between training and testing to isolate variables of interest for sound experimentation. Despite such promise, the use of synthetic image data is still limited, and often played down, mainly due to their lack of realism. Most works therefore rely on datasets of real images, which have often been scraped from public images on the internet, and may have issues with regard to privacy, bias, and copyright, while offering little control over how objects precisely appear. In this work, we present a path to democratize the use of photorealistic synthetic data: we develop a new generation of interactive environments for representation learning research, that offer both controllability and realism. We use the Unreal Engine, a powerful game engine well known in the entertainment industry, to produce PUG (Photorealistic Unreal Graphics) environments and datasets for representation learning. Using PUG for evaluation and fine-tuning, we demonstrate the potential of PUG to both enable more rigorous evaluations and to improve model training.

2023-09-24

NeurIPS.cc/2023/Track/Datasets_and_Benchmarks (poster)

doi.org

openreview.net

Symmetry-Informed Geometric Representation for Molecules, Proteins, and Crystalline Materials

Shengchao Liu

Weitao Du

Yanjing Li

Zhuoxinran Li

Zhiling Zheng

Chenru Duan

Zhiming Ma

Omar Yaghi

Anima Anandkumar

Christian Borgs

Jennifer Chayes

Hongyu Guo

Jian Tang

Artificial intelligence for scientific discovery has recently generated significant interest within the machine learning and scientific communities, particularly in the domains of chemistry, biology, and material discovery. For these scientific problems, molecules serve as the fundamental building blocks, and machine learning has emerged as a highly effective and powerful tool for modeling their geometric structures. Nevertheless, due to the rapidly evolving process of the field and the knowledge gap between science (e.g., physics, chemistry, & biology) and machine learning communities, a benchmarking study on geometrical representation for such data has not been conducted. To address such an issue, in this paper, we first provide a unified view of the current symmetry-informed geometric methods, classifying them into three main categories: invariance, equivariance with spherical frame basis, and equivariance with vector frame basis. Then we propose a platform, coined Geom3D, which enables benchmarking the effectiveness of geometric strategies. Geom3D contains 16 advanced symmetry-informed geometric representation models and 14 geometric pretraining methods over 46 diverse datasets, including small molecules, proteins, and crystalline materials. We hope that Geom3D can, on the one hand, eliminate barriers for machine learning researchers interested in exploring scientific problems; and, on the other hand, provide valuable guidance for researchers in computational chemistry, structural biology, and materials science, aiding in the informed selection of representation techniques for specific applications.

2023-09-24

NeurIPS.cc/2023/Track/Datasets_and_Benchmarks (poster)

doi.org

openreview.net

Temporal Graph Benchmark for Machine Learning on Temporal Graphs

Shenyang Huang

Farimah Poursafaei

Jacob Danovitch

Matthias Fey

Weihua Hu

Emanuele Rossi

Jure Leskovec

Michael M. Bronstein

Guillaume Rabusseau

Reihaneh Rabbany

We present the Temporal Graph Benchmark (TGB), a collection of challenging and diverse benchmark datasets for realistic, reproducible, and robust evaluation of machine learning models on temporal graphs. TGB datasets are of large scale, spanning years in duration, incorporate both node and edge-level prediction tasks and cover a diverse set of domains including social, trade, transaction, and transportation networks. For both tasks, we design evaluation protocols based on realistic use-cases. We extensively benchmark each dataset and find that the performance of common models can vary drastically across datasets. In addition, on dynamic node property prediction tasks, we show that simple methods often achieve superior performance compared to existing temporal graph models. We believe that these findings open up opportunities for future research on temporal graphs. Finally, TGB provides an automated machine learning pipeline for reproducible and accessible temporal graph research, including data loading, experiment setup and performance evaluation. TGB will be maintained and updated on a regular basis and welcomes community feedback. TGB datasets, data loaders, example codes, evaluation setup, and leaderboards are publicly available at https://tgb.complexdatalab.com/.

2023-09-24

NeurIPS.cc/2023/Track/Datasets_and_Benchmarks (poster)

doi.org

openreview.net

Substituting Data Annotation with Balanced Updates and Collective Loss in Multi-label Text Classification

Muberra Ozmen

Joseph Cotnareanu

Mark J. Coates

Multi-label text classification (MLTC) is the task of assigning multiple labels to a given text, and has a wide range of application domains. Most existing approaches require an enormous amount of annotated data to learn a classifier and/or a set of well-defined constraints on the label space structure, such as hierarchical relations, which may be complicated to provide as the number of labels increases. In this paper, we study the MLTC problem in annotation-free and scarce-annotation settings in which the magnitude of available supervision signals is linear in the number of labels. Our method follows three steps: (1) mapping input text into a set of preliminary label likelihoods by natural language inference using a pre-trained language model, (2) calculating a signed label dependency graph by label descriptions, and (3) updating the preliminary label likelihoods with message passing along the label dependency graph, driven with a collective loss function that injects the information of expected label frequency and average multi-label cardinality of predictions. The experiments show that the proposed framework achieves effective performance under low supervision settings with almost imperceptible computational and memory overheads added to the usage of the pre-trained language model, outperforming its initial performance by 70% in terms of example-based F1 score.

2023-09-23

ArXiv (preprint)

doi.org

arxiv.org

Disorganized Communication and Social Dysfunction in Schizophrenia: Emerging Concepts and Methods

Emmanuel Olarewaju

Guillaume Dumas

L. Palaniyappan

2023-09-22

Current Psychiatry Reports (published)

doi.org
