Publications

An Empirical Study of Retrieval-Enhanced Graph Neural Networks
Dingmin Wang
Hanchen Wang
Bernardo Cuenca Grau
Linfeng Song
Le Song
Qi Liu
Graph Neural Networks (GNNs) are effective tools for graph representation learning. Most GNNs rely on a recursive neighborhood aggregation scheme, named message passing, so their theoretical expressive power is limited to the first-order Weisfeiler-Lehman test (1-WL). An effective approach to this challenge is to explicitly retrieve some annotated examples used to enhance GNN models. While retrieval-enhanced models have proven effective in many language and vision domains, it remains an open question how effective retrieval-enhanced GNNs are when applied to graph datasets. Motivated by this, we explore how the retrieval idea can help augment the useful information learned in graph neural networks, and we design a retrieval-enhanced scheme called GRAPHRETRIEVAL, which is agnostic to the choice of graph neural network model. In GRAPHRETRIEVAL, for each input graph, similar graphs together with their ground-truth labels are retrieved from an existing database. They can thus act as a potential enhancement for completing various graph property prediction tasks. We conduct comprehensive experiments over 13 datasets, and we observe that GRAPHRETRIEVAL achieves substantial improvements over existing GNNs. Moreover, our empirical study also illustrates that retrieval enhancement is a promising remedy for alleviating the long-tailed label distribution problem.
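The retrieval idea in the abstract above (fetch similar graphs with their labels, then use them to enhance the prediction) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the GRAPHRETRIEVAL implementation: graphs are represented by precomputed embedding vectors, similarity is cosine similarity, and the retrieved labels are mixed into the base model's class probabilities with a hypothetical weight `alpha`.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(query_emb, database, k=2):
    """Return the k (similarity, label) pairs closest to the query.

    `database` is a list of (embedding, label) pairs; this layout is
    a hypothetical stand-in for a real graph database.
    """
    scored = [(cosine(query_emb, emb), label) for emb, label in database]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


def retrieval_enhanced_predict(model_probs, query_emb, database,
                               num_classes, k=2, alpha=0.5):
    """Mix base-model probabilities with a similarity-weighted label vote.

    The mixing rule (a convex combination) is an assumption; the abstract
    does not specify how retrieved labels are combined with the GNN output.
    """
    votes = [0.0] * num_classes
    for sim, label in retrieve(query_emb, database, k):
        votes[label] += max(sim, 0.0)  # ignore negatively correlated neighbours
    total = sum(votes)
    if total:
        retrieved = [v / total for v in votes]
    else:
        retrieved = [1.0 / num_classes] * num_classes
    return [alpha * p + (1 - alpha) * r for p, r in zip(model_probs, retrieved)]


# Toy usage: two stored graphs of class 0 sit near the query, one of class 1 is far.
database = [([1.0, 0.0], 0), ([0.9, 0.1], 0), ([0.0, 1.0], 1)]
probs = retrieval_enhanced_predict([0.4, 0.6], [1.0, 0.05], database, num_classes=2)
```

In this toy case the base model slightly prefers class 1, while both retrieved neighbours carry label 0, so the mixed prediction shifts toward class 0.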
Influence of preprocessing, distortion correction and cardiac triggering on the quality of diffusion MR images of spinal cord
Kurt G. Schilling
Anna Combes
Karthik Ramadass
François Rheault
Grace Sweeney
Logan Prock
Subramaniam Sriram
Julien Cohen‐Adad
John C. Gore
Bennett A. Landman
Seth A. Smith
Kristin P. O’Grady
Diffusion MRI of the spinal cord (SC) is susceptible to geometric distortion caused by field inhomogeneities, and prone to misalignment across time series and signal dropout caused by biological motion. Several modifications of image acquisition and image processing techniques have been introduced to overcome these artifacts, but their specific benefits are largely unproven and warrant further investigations. We aim to evaluate two specific aspects of image acquisition and processing that address image quality in diffusion studies of the spinal cord: susceptibility corrections to reduce geometric distortions, and cardiac triggering to minimize motion artifacts. First, we evaluate 4 distortion preprocessing strategies on 7 datasets of the cervical and lumbar SC and find that while distortion correction techniques increase geometric similarity to structural images, they are largely driven by the high-contrast cerebrospinal fluid, and do not consistently improve the geometry within the cord nor improve white-to-gray matter contrast. We recommend at a minimum to perform bulk-motion correction in preprocessing and posit that improvements/adaptations are needed for spinal cord distortion preprocessing algorithms, which are currently optimized and designed for brain imaging. Second, we design experiments to evaluate the impact of removing cardiac triggering. We show that when triggering is foregone, images are qualitatively similar to triggered sequences, do not have increased prevalence of artifacts, and result in similar diffusion tensor indices with similar reproducibility to triggered acquisitions. When triggering is removed, much shorter acquisitions are possible, which are also qualitatively and quantitatively similar to triggered sequences. We suggest that removing cardiac triggering for cervical SC diffusion can be a reasonable option to save time with minimal sacrifice to image quality.
Time Delay Cosmography with a Neural Ratio Estimator
We explore the use of a Neural Ratio Estimator (NRE) to determine the Hubble constant (…
ClimateSet: A Large-Scale Climate Model Dataset for Machine Learning
Charlotte Emilie Elektra Lange
Yaniv Gurwicz
Jakob Runge
Peer Nowack
Climate models have been key for assessing the impact of climate change and simulating future climate scenarios. The machine learning (ML) community has taken an increased interest in supporting climate scientists' efforts on various tasks such as climate model emulation, downscaling, and prediction tasks. Many of those tasks have been addressed on datasets created with single climate models. However, both the climate science and ML communities have suggested that to address those tasks at scale, we need large, consistent, and ML-ready climate model datasets. Here, we introduce ClimateSet, a dataset containing the inputs and outputs of 36 climate models from the Input4MIPs and CMIP6 archives. In addition, we provide a modular dataset pipeline for retrieving and preprocessing additional climate models and scenarios. We showcase the potential of our dataset by using it as a benchmark for ML-based climate model emulation. We gain new insights about the performance and generalization capabilities of the different ML models by analyzing their performance across different climate models. Furthermore, the dataset can be used to train an ML emulator on several climate models instead of just one. Such a "super emulator" can quickly project new climate change scenarios, complementing existing scenarios already provided to policymakers. We believe ClimateSet will create the basis needed for the ML community to tackle climate-related tasks at scale.
Evaluating Self-Supervised Learning for Molecular Graph Embeddings
Hanchen Wang
Jean Kaddour
Matt J. Kusner
Joan Lasenby
Qi Liu
Graph Self-Supervised Learning (GSSL) provides a robust pathway for acquiring embeddings without expert labelling, a capability that carries profound implications for molecular graphs due to the staggering number of potential molecules and the high cost of obtaining labels. However, GSSL methods are designed not for optimisation within a specific domain but rather for transferability across a variety of downstream tasks. This broad applicability complicates their evaluation. Addressing this challenge, we present "Molecular Graph Representation Evaluation" (MOLGRAPHEVAL), generating detailed profiles of molecular graph embeddings with interpretable and diversified attributes. MOLGRAPHEVAL offers a suite of probing tasks grouped into three categories: (i) generic graph, (ii) molecular substructure, and (iii) embedding space properties. By leveraging MOLGRAPHEVAL to benchmark existing GSSL methods against both current downstream datasets and our suite of tasks, we uncover significant inconsistencies between inferences drawn solely from existing datasets and those derived from more nuanced probing. These findings suggest that current evaluation methodologies fail to capture the entirety of the landscape.
GEO-Bench: Toward Foundation Models for Earth Monitoring
Alexandre Lacoste
Nils Lehmann
Pau Rodríguez
Evan David Sherwin
Hannah Kerner
Björn Lütjens
Jeremy Irvin
David Dao
Hamed Alemohammad
Mehmet Gunturkun
Dava Newman
Stefano Ermon
Xiao Xiang Zhu
Recent progress in self-supervision has shown that pre-training large neural networks on vast amounts of unsupervised data can lead to substantial increases in generalization to downstream tasks. Such models, recently coined foundation models, have been transformational to the field of natural language processing. Variants have also been proposed for image data, but their applicability to remote sensing tasks is limited. To stimulate the development of foundation models for Earth monitoring, we propose a benchmark comprised of six classification and six segmentation tasks, which were carefully curated and adapted to be both relevant to the field and well-suited for model evaluation. We accompany this benchmark with a robust methodology for evaluating models and reporting aggregated results to enable a reliable assessment of progress. Finally, we report results for 20 baselines to gain information about the performance of existing models. We believe that this benchmark will be a driver of progress across a variety of Earth monitoring tasks.
Minigrid & Miniworld: Modular & Customizable Reinforcement Learning Environments for Goal-Oriented Tasks
Maxime Chevalier-Boisvert
Bolun Dai
Mark Towers
Rodrigo De Lazcano Perez-Vicente
Suman Pal
J K Terry
We present the Minigrid and Miniworld libraries which provide a suite of goal-oriented 2D and 3D environments. The libraries were explicitly created with a minimalistic design paradigm to allow users to rapidly develop new environments for a wide range of research-specific needs. As a result, both have received widescale adoption by the RL community, facilitating research in a wide range of areas. In this paper, we outline the design philosophy, environment details, and their world generation API. We also showcase the additional capabilities brought by the unified API between Minigrid and Miniworld through case studies on transfer learning (for both RL agents and humans) between the different observation spaces. The source code of Minigrid and Miniworld can be found at https://github.com/Farama-Foundation/Minigrid and https://github.com/Farama-Foundation/Miniworld along with their documentation at https://minigrid.farama.org/ and https://miniworld.farama.org/.
PUG: Photorealistic and Semantically Controllable Synthetic Data for Representation Learning
Shashank Shekhar
Mark Ibrahim
Diane Bouchacourt
Ari S. Morcos
Synthetic image datasets offer unmatched advantages for designing and evaluating deep neural networks: they make it possible to (i) render as many data samples as needed, (ii) precisely control each scene and yield granular ground truth labels (and captions), (iii) precisely control distribution shifts between training and testing to isolate variables of interest for sound experimentation. Despite such promise, the use of synthetic image data is still limited -- and often played down -- mainly due to their lack of realism. Most works therefore rely on datasets of real images, which have often been scraped from public images on the internet, and may have issues with regards to privacy, bias, and copyright, while offering little control over how objects precisely appear. In this work, we present a path to democratize the use of photorealistic synthetic data: we develop a new generation of interactive environments for representation learning research, that offer both controllability and realism. We use the Unreal Engine, a powerful game engine well known in the entertainment industry, to produce PUG (Photorealistic Unreal Graphics) environments and datasets for representation learning. Using PUG for evaluation and fine-tuning, we demonstrate the potential of PUG to both enable more rigorous evaluations and to improve model training.
Symmetry-Informed Geometric Representation for Molecules, Proteins, and Crystalline Materials
Weitao Du
Yanjing Li
Zhuoxinran Li
Zhiling Zheng
Chenru Duan
Zhiming Ma
Omar Yaghi
Anima Anandkumar
Christian Borgs
Jennifer Chayes
Hongyu Guo
Artificial intelligence for scientific discovery has recently generated significant interest within the machine learning and scientific communities, particularly in the domains of chemistry, biology, and material discovery. For these scientific problems, molecules serve as the fundamental building blocks, and machine learning has emerged as a highly effective and powerful tool for modeling their geometric structures. Nevertheless, due to the rapidly evolving process of the field and the knowledge gap between science (e.g., physics, chemistry, & biology) and machine learning communities, a benchmarking study on geometrical representation for such data has not been conducted. To address such an issue, in this paper, we first provide a unified view of the current symmetry-informed geometric methods, classifying them into three main categories: invariance, equivariance with spherical frame basis, and equivariance with vector frame basis. Then we propose a platform, coined Geom3D, which enables benchmarking the effectiveness of geometric strategies. Geom3D contains 16 advanced symmetry-informed geometric representation models and 14 geometric pretraining methods over 46 diverse datasets, including small molecules, proteins, and crystalline materials. We hope that Geom3D can, on the one hand, eliminate barriers for machine learning researchers interested in exploring scientific problems; and, on the other hand, provide valuable guidance for researchers in computational chemistry, structural biology, and materials science, aiding in the informed selection of representation techniques for specific applications.
Temporal Graph Benchmark for Machine Learning on Temporal Graphs
Matthias Fey
Weihua Hu
Emanuele Rossi
Jure Leskovec
Michael M. Bronstein
We present the Temporal Graph Benchmark (TGB), a collection of challenging and diverse benchmark datasets for realistic, reproducible, and robust evaluation of machine learning models on temporal graphs. TGB datasets are of large scale, spanning years in duration, incorporate both node and edge-level prediction tasks and cover a diverse set of domains including social, trade, transaction, and transportation networks. For both tasks, we design evaluation protocols based on realistic use-cases. We extensively benchmark each dataset and find that the performance of common models can vary drastically across datasets. In addition, on dynamic node property prediction tasks, we show that simple methods often achieve superior performance compared to existing temporal graph models. We believe that these findings open up opportunities for future research on temporal graphs. Finally, TGB provides an automated machine learning pipeline for reproducible and accessible temporal graph research, including data loading, experiment setup and performance evaluation. TGB will be maintained and updated on a regular basis and welcomes community feedback. TGB datasets, data loaders, example codes, evaluation setup, and leaderboards are publicly available at https://tgb.complexdatalab.com/.
Substituting Data Annotation with Balanced Updates and Collective Loss in Multi-label Text Classification
Muberra Ozmen
Mark J. Coates
Multi-label text classification (MLTC) is the task of assigning multiple labels to a given text, and has a wide range of application domains. Most existing approaches require an enormous amount of annotated data to learn a classifier and/or a set of well-defined constraints on the label space structure, such as hierarchical relations, which may be complicated to provide as the number of labels increases. In this paper, we study the MLTC problem in annotation-free and scarce-annotation settings in which the magnitude of available supervision signals is linear in the number of labels. Our method follows three steps: (1) mapping input text into a set of preliminary label likelihoods by natural language inference using a pre-trained language model, (2) calculating a signed label dependency graph from label descriptions, and (3) updating the preliminary label likelihoods with message passing along the label dependency graph, driven by a collective loss function that injects the information of expected label frequency and average multi-label cardinality of predictions. The experiments show that the proposed framework achieves effective performance under low supervision settings, adding almost imperceptible computational and memory overhead to the use of the pre-trained language model and outperforming its initial performance by 70% in terms of example-based F1 score.
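Step (3) of the method above, updating preliminary label likelihoods along a signed label dependency graph, can be illustrated with a minimal sketch. The update rule, the step size `eta`, and the toy graph are all assumptions chosen for illustration; the abstract does not give the actual update, and the collective loss that drives it in the paper is not modelled here.

```python
def propagate(likelihoods, signed_edges, eta=0.5):
    """Run one round of message passing over a signed label graph.

    likelihoods:  dict mapping label -> preliminary likelihood in [0, 1]
    signed_edges: dict mapping (label_a, label_b) -> weight in [-1, 1];
                  positive means the labels tend to co-occur, negative
                  means they tend to exclude each other (hypothetical format)
    eta:          hypothetical step size controlling how far one round moves
    """
    updated = dict(likelihoods)
    for (a, b), weight in signed_edges.items():
        # Each endpoint is nudged by how far its neighbour is from 0.5:
        # a confident neighbour pulls along positive edges and pushes
        # along negative ones. Messages use the pre-update likelihoods.
        updated[a] += eta * weight * (likelihoods[b] - 0.5)
        updated[b] += eta * weight * (likelihoods[a] - 0.5)
    # Clip back to valid likelihoods.
    return {label: min(1.0, max(0.0, p)) for label, p in updated.items()}


# Toy usage: "sports" is confidently predicted; "football" and "politics" are undecided.
prelim = {"sports": 0.9, "football": 0.5, "politics": 0.5}
edges = {("sports", "football"): 1.0, ("sports", "politics"): -1.0}
refined = propagate(prelim, edges)
```

In this toy case the confident "sports" prediction raises the positively linked "football" and lowers the negatively linked "politics", while "sports" itself is unchanged because both neighbours sit at the neutral value 0.5.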
Disorganized Communication and Social Dysfunction in Schizophrenia: Emerging Concepts and Methods
Emmanuel Olarewaju
L. Palaniyappan