Publications

ClimateSet: A Large-Scale Climate Model Dataset for Machine Learning

Julia Kaltenborn

Charlotte Emilie Elektra Lange

Venkatesh Ramesh

Philippe Brouillard

Yaniv Gurwicz

Chandni Nagda

Jakob Runge

Peer Nowack

David Rolnick

Climate models have been key for assessing the impact of climate change and simulating future climate scenarios. The machine learning (ML) community has taken an increased interest in supporting climate scientists’ efforts on various tasks such as climate model emulation, downscaling, and prediction tasks. Many of those tasks have been addressed on datasets created with single climate models. However, both the climate science and ML communities have suggested that to address those tasks at scale, we need large, consistent, and ML-ready climate model datasets. Here, we introduce ClimateSet, a dataset containing the inputs and outputs of 36 climate models from the Input4MIPs and CMIP6 archives. In addition, we provide a modular dataset pipeline for retrieving and preprocessing additional climate models and scenarios. We showcase the potential of our dataset by using it as a benchmark for ML-based climate model emulation. We gain new insights about the performance and generalization capabilities of the different ML models by analyzing their performance across different climate models. Furthermore, the dataset can be used to train an ML emulator on several climate models instead of just one. Such a “super-emulator” can quickly project new climate change scenarios, complementing existing scenarios already provided to policymakers. We believe ClimateSet will create the basis needed for the ML community to tackle climate-related tasks at scale.

openreview.net

Evaluating Self-Supervised Learning for Molecular Graph Embeddings

Hanchen Wang

Jean Kaddour

Shengchao Liu

Jian Tang

Matt J. Kusner

Joan Lasenby

Qi Liu

Graph Self-Supervised Learning (GSSL) provides a robust pathway for acquiring embeddings without expert labelling, a capability that carries profound implications for molecular graphs due to the staggering number of potential molecules and the high cost of obtaining labels. However, GSSL methods are designed not for optimisation within a specific domain but rather for transferability across a variety of downstream tasks. This broad applicability complicates their evaluation. Addressing this challenge, we present "Molecular Graph Representation Evaluation" (MOLGRAPHEVAL), generating detailed profiles of molecular graph embeddings with interpretable and diversified attributes. MOLGRAPHEVAL offers a suite of probing tasks grouped into three categories: (i) generic graph, (ii) molecular substructure, and (iii) embedding space properties. By leveraging MOLGRAPHEVAL to benchmark existing GSSL methods against both current downstream datasets and our suite of tasks, we uncover significant inconsistencies between inferences drawn solely from existing datasets and those derived from more nuanced probing. These findings suggest that current evaluation methodologies fail to capture the entirety of the landscape.

openreview.net

GEO-Bench: Toward Foundation Models for Earth Monitoring

Alexandre Lacoste

Nils Lehmann

Pau Rodriguez

Evan David Sherwin

Hannah Kerner

Björn Lütjens

Jeremy Andrew Irvin

David Dao

Hamed Alemohammad

Alexandre Drouin

Mehmet Gunturkun

Gabriel Huang

David Vazquez

Dava Newman

Yoshua Bengio

Stefano Ermon

Xiao Xiang Zhu

Recent progress in self-supervision has shown that pre-training large neural networks on vast amounts of unsupervised data can lead to substantial increases in generalization to downstream tasks. Such models, recently coined foundation models, have been transformational to the field of natural language processing. Variants have also been proposed for image data, but their applicability to remote sensing tasks is limited. To stimulate the development of foundation models for Earth monitoring, we propose a benchmark comprised of six classification and six segmentation tasks, which were carefully curated and adapted to be both relevant to the field and well-suited for model evaluation. We accompany this benchmark with a robust methodology for evaluating models and reporting aggregated results to enable a reliable assessment of progress. Finally, we report results for 20 baselines to gain information about the performance of existing models. We believe that this benchmark will be a driver of progress across a variety of Earth monitoring tasks.

openreview.net

Minigrid & Miniworld: Modular & Customizable Reinforcement Learning Environments for Goal-Oriented Tasks

Maxime Chevalier-Boisvert

Bolun Dai

Mark Towers

Rodrigo De Lazcano Perez-Vicente

Lucas Willems

Salem Lahlou

Suman Pal

Pablo Samuel Castro

J K Terry

We present the Minigrid and Miniworld libraries which provide a suite of goal-oriented 2D and 3D environments. The libraries were explicitly created with a minimalistic design paradigm to allow users to rapidly develop new environments for a wide range of research-specific needs. As a result, both have received widescale adoption by the RL community, facilitating research in a wide range of areas. In this paper, we outline the design philosophy, environment details, and their world generation API. We also showcase the additional capabilities brought by the unified API between Minigrid and Miniworld through case studies on transfer learning (for both RL agents and humans) between the different observation spaces. The source code of Minigrid and Miniworld can be found at https://github.com/Farama-Foundation/Minigrid and https://github.com/Farama-Foundation/Miniworld along with their documentation at https://minigrid.farama.org/ and https://miniworld.farama.org/.

2023-09-25

NeurIPS.cc/2023/Track/Datasets_and_Benchmarks (poster)

doi.org

openreview.net

PUG: Photorealistic and Semantically Controllable Synthetic Data for Representation Learning

Florian Bordes

Shashank Shekhar

Mark Ibrahim

Diane Bouchacourt

Pascal Vincent

Ari S. Morcos

Synthetic image datasets offer unmatched advantages for designing and evaluating deep neural networks: they make it possible to (i) render as many data samples as needed, (ii) precisely control each scene and yield granular ground truth labels (and captions), and (iii) precisely control distribution shifts between training and testing to isolate variables of interest for sound experimentation. Despite such promise, the use of synthetic image data is still limited -- and often played down -- mainly due to their lack of realism. Most works therefore rely on datasets of real images, which have often been scraped from public images on the internet, and may have issues with regard to privacy, bias, and copyright, while offering little control over how objects precisely appear. In this work, we present a path to democratize the use of photorealistic synthetic data: we develop a new generation of interactive environments for representation learning research that offer both controllability and realism. We use the Unreal Engine, a powerful game engine well known in the entertainment industry, to produce PUG (Photorealistic Unreal Graphics) environments and datasets for representation learning. In this paper, we demonstrate the potential of PUG to enable more rigorous evaluations of vision models.

openreview.net

SatBird: a Dataset for Bird Species Distribution Modeling using Remote Sensing and Citizen Science Data

Mélisande Teng

Amna Elmustafa

Benjamin Akera

Hager Radi

Symmetry-Informed Geometric Representation for Molecules, Proteins, and Crystalline Materials

Shengchao Liu

Weitao Du

Yanjing Li

Zhuoxinran Li

Zhiling Zheng

Chenru Duan

Zhi-Ming Ma

Omar M. Yaghi

Animashree Anandkumar

Christian Borgs

Jennifer T Chayes

Hongyu Guo

Jian Tang

Artificial intelligence for scientific discovery has recently generated significant interest within the machine learning and scientific communities, particularly in the domains of chemistry, biology, and material discovery. For these scientific problems, molecules serve as the fundamental building blocks, and machine learning has emerged as a highly effective and powerful tool for modeling their geometric structures. Nevertheless, due to the rapidly evolving process of the field and the knowledge gap between science (e.g., physics, chemistry, and biology) and machine learning communities, a benchmarking study on geometrical representation for such data has not been conducted. To address such an issue, in this paper, we first provide a unified view of the current symmetry-informed geometric methods, classifying them into three main categories: invariance, equivariance with spherical frame basis, and equivariance with vector frame basis. Then we propose a platform, coined Geom3D, which enables benchmarking the effectiveness of geometric strategies. Geom3D contains 16 advanced symmetry-informed geometric representation models and 14 geometric pretraining methods over 52 diverse tasks, including small molecules, proteins, and crystalline materials. We hope that Geom3D can, on the one hand, eliminate barriers for machine learning researchers interested in exploring scientific problems; and, on the other hand, provide valuable guidance for researchers in computational chemistry, structural biology, and materials science, aiding in the informed selection of representation techniques for specific applications. The source code is available in the GitHub repository at https://github.com/chao1224/Geom3D.

openreview.net

Temporal Graph Benchmark for Machine Learning on Temporal Graphs

Shenyang Huang

Farimah Poursafaei

Jacob Danovitch

Matthias Fey

Weihua Hu

Emanuele Rossi

Jure Leskovec

Michael M. Bronstein

Guillaume Rabusseau

Reihaneh Rabbany

We present the Temporal Graph Benchmark (TGB), a collection of challenging and diverse benchmark datasets for realistic, reproducible, and robust evaluation of machine learning models on temporal graphs. TGB datasets are of large scale, spanning years in duration, incorporate both node and edge-level prediction tasks and cover a diverse set of domains including social, trade, transaction, and transportation networks. For both tasks, we design evaluation protocols based on realistic use-cases. We extensively benchmark each dataset and find that the performance of common models can vary drastically across datasets. In addition, on dynamic node property prediction tasks, we show that simple methods often achieve superior performance compared to existing temporal graph models. We believe that these findings open up opportunities for future research on temporal graphs. Finally, TGB provides an automated machine learning pipeline for reproducible and accessible temporal graph research, including data loading, experiment setup and performance evaluation. TGB will be maintained and updated on a regular basis and welcomes community feedback. TGB datasets, data loaders, example codes, evaluation setup, and leaderboards are publicly available at https://tgb.complexdatalab.com/.

openreview.net

Substituting Data Annotation with Balanced Updates and Collective Loss in Multi-label Text Classification

Muberra Ozmen

Joseph Cotnareanu

Mark Coates

Multi-label text classification (MLTC) is the task of assigning multiple labels to a given text, and has a wide range of application domains. Most existing approaches require an enormous amount of annotated data to learn a classifier and/or a set of well-defined constraints on the label space structure, such as hierarchical relations, which may be complicated to provide as the number of labels increases. In this paper, we study the MLTC problem in annotation-free and scarce-annotation settings in which the magnitude of available supervision signals is linear in the number of labels. Our method follows three steps: (1) mapping input text into a set of preliminary label likelihoods by natural language inference using a pre-trained language model, (2) calculating a signed label dependency graph from label descriptions, and (3) updating the preliminary label likelihoods with message passing along the label dependency graph, driven by a collective loss function that injects the information of expected label frequency and average multi-label cardinality of predictions. Experiments show that the proposed framework performs effectively in low-supervision settings, adds almost imperceptible computational and memory overhead to the use of the pre-trained language model, and outperforms that model's initial performance by 70% in terms of example-based F1 score.

2023-09-24

ArXiv (preprint)

doi.org

arxiv.org
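
Step (3) of the multi-label text classification abstract above, updating preliminary label likelihoods by message passing along a signed label dependency graph, can be illustrated with a minimal sketch. This is not the authors' update rule: the function `propagate`, the `strength` parameter, the logit-space update, and the toy labels are all hypothetical choices made for illustration, and the collective loss described in the abstract is not modelled here.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def logit(p):
    return math.log(p / (1.0 - p))


def propagate(likelihoods, signed_edges, strength=0.5, steps=1):
    """Illustrative message passing over a signed label graph (assumed form).

    likelihoods: dict mapping label -> probability strictly inside (0, 1).
    signed_edges: dict mapping (source, target) -> weight in [-1, 1];
        positive weights mean the labels tend to co-occur, negative that
        they tend to exclude each other.
    """
    current = dict(likelihoods)
    for _ in range(steps):
        messages = {label: 0.0 for label in current}
        for (src, dst), sign in signed_edges.items():
            # Rescale the source probability to [-1, 1] so that a confident
            # "present" pushes positively linked labels up and negatively
            # linked labels down.
            messages[dst] += sign * (2.0 * current[src] - 1.0)
        # Apply the aggregated message in logit space, then map back.
        current = {
            label: sigmoid(logit(p) + strength * messages[label])
            for label, p in current.items()
        }
    return current


# Toy example with hypothetical labels and preliminary likelihoods.
prelim = {"sports": 0.9, "football": 0.5, "cooking": 0.5}
edges = {("sports", "football"): 1.0, ("sports", "cooking"): -1.0}
updated = propagate(prelim, edges)
```

In this toy graph, the confident "sports" prediction raises the likelihood of the positively linked "football" label and lowers that of the negatively linked "cooking" label, while "sports" itself, which receives no messages, is left unchanged.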

Disorganized Communication and Social Dysfunction in Schizophrenia: Emerging Concepts and Methods

Emmanuel Olarewaju

Guillaume Dumas

L. Palaniyappan

2023-09-23

Current Psychiatry Reports (published)

doi.org

Autonomic nervous system modulation during self-induced non-ordinary states of consciousness

Victor Oswald

Audrey Vanhaudenhuyse

Jitka Annen

Charlotte Martial

Aminata Bicego

Floriane Rousseaux

Corine Sombrun

Yann Harel

Marie-Elisabeth Faymonville

Steven Laureys

Karim Jerbi

Olivia Gosseries

2023-09-22

Scientific Reports (published)

doi.org

Additive Decoders for Latent Variables Identification and Cartesian-Product Extrapolation

Sébastien Lachapelle

Divyat Mahajan

Ioannis Mitliagkas

Simon Lacoste-Julien

We tackle the problems of latent variables identification and "out-of-support" image generation in representation learning. We show that both are possible for a class of decoders that we call additive, which are reminiscent of decoders used for object-centric representation learning (OCRL) and well suited for images that can be decomposed as a sum of object-specific images. We provide conditions under which exactly solving the reconstruction problem using an additive decoder is guaranteed to identify the blocks of latent variables up to permutation and block-wise invertible transformations. This guarantee relies only on very weak assumptions about the distribution of the latent factors, which might present statistical dependencies and have an almost arbitrarily shaped support. Our result provides a new setting where nonlinear independent component analysis (ICA) is possible and adds to our theoretical understanding of OCRL methods. We also show theoretically that additive decoders can generate novel images by recombining observed factors of variations in novel ways, an ability we refer to as Cartesian-product extrapolation. We show empirically that additivity is crucial for both identifiability and extrapolation on simulated data.

openreview.net
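
The idea of an additive decoder described in the abstract above, an image formed as a sum of block-specific images, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the helper names `make_block_decoder` and `additive_decode`, the random tanh maps standing in for learned per-block networks, the flat 8-pixel "image", and the two-block partition are all hypothetical.

```python
import math
import random

IMAGE_DIM = 8  # hypothetical flattened image size


def make_block_decoder(rng, block_dim, image_dim=IMAGE_DIM):
    """Random one-layer tanh map standing in for a learned block decoder."""
    weights = [
        [rng.gauss(0.0, 1.0) for _ in range(block_dim)]
        for _ in range(image_dim)
    ]

    def decode(z_block):
        return [
            math.tanh(sum(w * z for w, z in zip(row, z_block)))
            for row in weights
        ]

    return decode


def additive_decode(z, blocks, decoders):
    """Decode each latent block separately and sum the resulting images."""
    image = [0.0] * IMAGE_DIM
    for indices, decode in zip(blocks, decoders):
        part = decode([z[i] for i in indices])
        image = [a + b for a, b in zip(image, part)]
    return image


rng = random.Random(0)
blocks = [(0, 1), (2, 3)]  # assumed partition of a 4-dimensional latent
decoders = [make_block_decoder(rng, len(b)) for b in blocks]

# Recombine block 0 from one latent vector with block 1 from another.
z_a = [0.3, -1.2, 0.8, 0.1]
z_b = [-0.5, 0.4, -0.9, 1.5]
z_mix = z_a[:2] + z_b[2:]
x_mix = additive_decode(z_mix, blocks, decoders)
```

Because each block contributes independently to the sum, the image decoded from `z_mix` equals the first block's image from `z_a` plus the second block's image from `z_b`. That structural property is what makes recombining observed factors well defined in this sketch.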
