Publications

CVQA: Culturally-diverse Multilingual Visual Question Answering Benchmark

David LE MEUR

David Orlando Romero Mogrovejo

Chenyang Lyu

Haryo Akbarianto Wibowo

Teresa Lynn

Injy Hamed

Aditya Nanda Kishore Khandavally

Aishik Mandal

Alina Dragonetti

Artem Abzaliev

Atnafu Lambebo Tonja

Bontu Fufa Balcha

Chenxi Whitehouse

Christian Salamea-Palacios

Dan John Velasco

David Ifeoluwa Adelani

D. Meur

Emilio Villa Cueva

Fajri Koto

Fauzan Farooqui

Frederico Belcavello

Ganzorig Batnasan

Gisela Vallejo

Gráinne Caulfield

Guido Ivetta

Haiyue Song

Henok Biadglign Ademtew

Hernán Maina

Holy Lovenia

Israel Abebe Azime

Jan Christian Blaise Cruz

Jay Gala

Jiahui Geng

Jesus-German Ortiz-Barajas

Jinheon Baek

Jocelyn Dunstan

Laura Alonso Alemany

Teresa Clifford

Kumaranage Ravindu Yasas Nagasinghe

Luciana Benotti

Luis Fernando D'Haro

Marcelo Viridiano

Marcos Estecha-Garitagoitia

Maria Camila Buitrago Cabrera

Mario Rodríguez-Cantelar

Mélanie Jouitteau

Mihail Minkov Mihaylov

Mohamed Fazli Mohamed Imam

Muhammad Farid Adilazuarda

Munkhjargal Gochoo

Munkh-Erdene Otgonbold

Naome Etori

Olivier NIYOMUGISHA

Paula Mónica Silva

Pranjal A Chitale

Raj Dabre

Rendi Chevi

Ruochen Zhang

Ryandito Diandaru

Samuel Cahyawijaya

Santiago Góngora

Soyeong Jeong

Sukannya Purkayastha

Tatsuki Kuribayashi

Thanmay Jayakumar

Tiago Timponi Torrent

Toqeer Ehsan

Vladimir Araujo

Yova Kementchedjhieva

Zara Burzo

Zheng Wei Lim

Zheng Xin Yong

Oana Ignat

Joan Nwatu

Rada Mihalcea

Thamar Solorio

Alham Fikri Aji

2024-09-25

NeurIPS.cc/2024/Datasets_and_Benchmarks_Track (oral)

doi.org

openreview.net

Expecting The Unexpected: Towards Broad Out-Of-Distribution Detection

Charles Guille-escuret

Pierre-Andre Noel

Ioannis Mitliagkas

David Vázquez

Joao Monteiro

Improving the reliability of deployed machine learning systems often involves developing methods to detect out-of-distribution (OOD) inputs. However, existing research often narrowly focuses on samples from classes that are absent from the training set, neglecting other types of plausible distribution shifts. This limitation reduces the applicability of these methods in real-world scenarios, where systems encounter a wide variety of anomalous inputs. In this study, we categorize five distinct types of distribution shifts and critically evaluate the performance of recent OOD detection methods on each of them. We publicly release our benchmark under the name BROAD (Benchmarking Resilience Over Anomaly Diversity). Our findings reveal that while these methods excel in detecting unknown classes, their performance is inconsistent when encountering other types of distribution shifts. In other words, they only reliably detect unexpected inputs that they have been specifically designed to expect. As a first step toward broad OOD detection, we learn a generative model of existing detection scores with a Gaussian mixture. By doing so, we present an ensemble approach that offers a more consistent and comprehensive solution for broad OOD detection, demonstrating superior performance compared to existing methods. Our code to download BROAD and reproduce our experiments is publicly available.

2024-09-25

NeurIPS.cc/2024/Datasets_and_Benchmarks_Track (poster)

doi.org

openreview.net

Learning Action and Reasoning-Centric Image Editing from Videos and Simulation

Dheeraj Vattikonda

Varun Jampani

Christopher Pal

2024-09-25

NeurIPS.cc/2024/Datasets_and_Benchmarks_Track (spotlight)

openreview.net

LogiCity: Advancing Neuro-Symbolic AI with Abstract Urban Simulation

Bowen Li

Zhaoyu Li

Qiwei Du

Jinqi Luo

Wenshan Wang

Yaqi Xie

Simon Stepputtis

Chen Wang

Katia P. Sycara

Pradeep Kumar Ravikumar

Alexander G. Gray

Xujie Si

Sebastian Scherer

Recent years have witnessed the rapid development of Neuro-Symbolic (NeSy) AI systems, which integrate symbolic reasoning into deep neural networks. However, most of the existing benchmarks for NeSy AI fail to provide long-horizon reasoning tasks with complex multi-agent interactions. Furthermore, they are usually constrained by fixed and simplistic logical rules over limited entities, making them far from real-world complexities. To address these crucial gaps, we introduce LogiCity, the first simulator based on customizable first-order logic (FOL) for an urban-like environment with multiple dynamic agents. LogiCity models diverse urban elements using semantic and spatial concepts, such as

2024-09-25

Datasets and Benchmarks Track @ Neural Information Processing Systems (poster)

doi.org

openreview.net

ReactZyme: A Benchmark for Enzyme-Reaction Prediction

Bozitao Zhong

Liang Hong

Shuangjia Zheng

Enzymes, with their specific catalyzed reactions, are necessary for all aspects of life, enabling diverse biological processes and adaptations. Predicting enzyme functions is essential for understanding biological pathways, guiding drug development, enhancing bioproduct yields, and facilitating evolutionary studies. Addressing the inherent complexities, we introduce a new approach to annotating enzymes based on their catalyzed reactions. This method provides detailed insights into specific reactions and is adaptable to newly discovered reactions, diverging from traditional classifications by protein family or expert-derived reaction classes. We employ machine learning algorithms to analyze enzyme reaction datasets, delivering a much more refined view on the functionality of enzymes. Our evaluation leverages the largest enzyme-reaction dataset to date, derived from the SwissProt and Rhea databases with entries up to January 8, 2024. We frame the enzyme-reaction prediction as a retrieval problem, aiming to rank enzymes by their catalytic ability for specific reactions. With our model, we can recruit proteins for novel reactions and predict reactions in novel proteins, facilitating enzyme discovery and function annotation (https://github.com/WillHua127/ReactZyme).

2024-09-25

NeurIPS.cc/2024/Datasets_and_Benchmarks_Track (poster)

doi.org

openreview.net

Reconstructing Spatio-Temporal Trajectories of Visual Object Memories in the Human Brain

Julia Lifanov

Benjamin J. Griffiths

Juan Linde-Domingo

Catarina S. Ferreira

Martin Wilson

Stephen D. Mayhew

Ian Charest

Maria Wimber

2024-09-25

eNeuro (published)

doi.org

RedPajama: an Open Dataset for Training Large Language Models

Maurice Weber

Daniel Y Fu

Quentin Gregory Anthony

Yonatan Oren

Shane Adams

Anton Alexandrov

Xiaozhong Lyu

Huu Nguyen

Xiaozhe Yao

Virginia Adams

Ben Athiwaratkun

Rahul Chalamala

Kezhen Chen

Max Ryabinin

Tri Dao

Percy Liang

Christopher Re

Irina Rish

Ce Zhang

2024-09-25

NeurIPS.cc/2024/Datasets_and_Benchmarks_Track (spotlight)

doi.org

openreview.net

RepLiQA: A Question-Answering Dataset for Benchmarking LLMs on Unseen Reference Content

Joao Monteiro

Pierre-Andre Noel

Étienne Marcotte

Sai Rajeswar

Valentina Zantedeschi

David Vázquez

Nicolas Chapados

Christopher Pal

Perouz Taslakian

Large Language Models (LLMs) are trained on vast amounts of data, most of which is automatically scraped from the internet. This data includes encyclopedic documents that harbor a vast amount of general knowledge (e.g., Wikipedia) but also potentially overlap with benchmark datasets used for evaluating LLMs. Consequently, evaluating models on test splits that might have leaked into the training set is prone to misleading conclusions. To foster sound evaluation of language models, we introduce a new test dataset named RepLiQA, suited for question-answering and topic retrieval tasks. RepLiQA is a collection of five splits of test sets, four of which have not been released to the internet or exposed to LLM APIs prior to this publication. Each sample in RepLiQA comprises (1) a reference document crafted by a human annotator and depicting an imaginary scenario (e.g., a news article) absent from the internet; (2) a question about the document's topic; (3) a ground-truth answer derived directly from the information in the document; and (4) the paragraph extracted from the reference document containing the answer. As such, accurate answers can only be generated if a model can find relevant content within the provided document. We run a large-scale benchmark comprising several state-of-the-art LLMs to uncover differences in performance across models of various types and sizes in a context-conditional language modeling setting. Released splits of RepLiQA can be found here: https://huggingface.co/datasets/ServiceNow/repliqa.

2024-09-25

NeurIPS.cc/2024/Datasets_and_Benchmarks_Track (poster)

doi.org

openreview.net

TGB 2.0: A Benchmark for Learning on Temporal Knowledge Graphs and Heterogeneous Graphs

Erfan Loghmani

Emanuele Rossi

Ioannis Koutis

Heiner Stuckenschmidt

Reihaneh Rabbany

Guillaume Rabusseau

Multi-relational temporal graphs are powerful tools for modeling real-world data, capturing the evolving and interconnected nature of entities over time. Recently, many novel models have been proposed for ML on such graphs, intensifying the need for robust evaluation and standardized benchmark datasets. However, the availability of such resources remains scarce and evaluation faces added complexity due to reproducibility issues in experimental protocols. To address these challenges, we introduce Temporal Graph Benchmark 2.0 (TGB 2.0), a novel benchmarking framework tailored for evaluating methods for predicting future links on Temporal Knowledge Graphs and Temporal Heterogeneous Graphs with a focus on large-scale datasets, extending the Temporal Graph Benchmark. TGB 2.0 facilitates comprehensive evaluations by presenting eight novel datasets spanning five domains with up to 53 million edges. TGB 2.0 datasets are significantly larger than existing datasets in terms of number of nodes, edges, or timestamps. In addition, TGB 2.0 provides a reproducible and realistic evaluation pipeline for multi-relational temporal graphs. Through extensive experimentation, we observe that 1) leveraging edge-type information is crucial to obtain high performance, 2) simple heuristic baselines are often competitive with more complex methods, and 3) most methods fail to run on our largest datasets, highlighting the need for research on more scalable methods.

2024-09-25

Datasets and Benchmarks Track @ Neural Information Processing Systems (poster)

doi.org

openreview.net

The State of Data Curation at NeurIPS: An Assessment of Dataset Development Practices in the Datasets and Benchmarks Track

Eshta Bhardwaj

Harshit Gujral

Siyi Wu

Ciara Zogheib

Tegan Maharaj

Christoph Becker

Data curation is a field with origins in librarianship and archives, whose scholarship and thinking on data issues go back centuries, if not millennia. The field of machine learning is increasingly observing the importance of data curation to the advancement of both applications and fundamental understanding of machine learning models - evidenced not least by the creation of the Datasets and Benchmarks track itself. This work provides an analysis of dataset development practices at NeurIPS through the lens of data curation. We present an evaluation framework for dataset documentation, consisting of a rubric and toolkit developed through a literature review of data curation principles. We use the framework to assess the strengths and weaknesses in current dataset development practices of 60 datasets published in the NeurIPS Datasets and Benchmarks track from 2021-2023. We summarize key findings and trends. Results indicate greater need for documentation about environmental footprint, ethical considerations, and data management. We suggest targeted strategies and resources to improve documentation in these areas and provide recommendations for the NeurIPS peer-review process that prioritize rigorous data curation in ML. Finally, we provide results in the format of a dataset that showcases aspects of recommended data curation practices. Our rubric and results are of interest for improving data curation practices broadly in the field of ML as well as to data curation and science and technology studies scholars studying practices in ML. Our aim is to support continued improvement in interdisciplinary research on dataset practices, ultimately improving the reusability and reproducibility of new datasets and benchmarks, enabling standardized and informed human oversight, and strengthening the foundation of rigorous and responsible ML research.

2024-09-25

NeurIPS.cc/2024/Datasets_and_Benchmarks_Track (spotlight)

doi.org

openreview.net

Using Unity to Help Solve Reinforcement Learning

Connor Brennan

Andrew Robert Williams

Omar G. Younis

Vedant Vyas

Daria Yasafova

Irina Rish

Leveraging the depth and flexibility of XLand as well as the rapid prototyping features of the Unity engine, we present the United Unity Universe, an open-source toolkit designed to accelerate the creation of innovative reinforcement learning environments. This toolkit includes a robust implementation of XLand 2.0 complemented by a user-friendly interface which allows users to modify the details of procedurally generated terrains and task rules with ease. Additionally, we provide a curated selection of terrains and rule sets, accompanied by implementations of reinforcement learning baselines to facilitate quick experimentation with novel architectural designs for adaptive agents. Furthermore, we illustrate how the United Unity Universe serves as a high-level language that enables researchers to develop diverse and endlessly variable 3D environments within a unified framework. This functionality establishes the United Unity Universe (U3) as an essential tool for advancing the field of reinforcement learning, especially in the development of adaptive and generalizable learning systems.

2024-09-25

NeurIPS.cc/2024/Datasets_and_Benchmarks_Track (poster)

openreview.net

WorkArena++: Towards Compositional Planning and Reasoning-based Common Knowledge Work Tasks

Léo Boisvert

Megh Thakkar

Maxime Gasse

Massimo Caccia

Thibault Le Sellier De Chezelles

Quentin Cappart

Nicolas Chapados

Alexandre Lacoste

Alexandre Drouin

The ability of large language models (LLMs) to mimic human-like intelligence has led to a surge in LLM-based autonomous agents. Though recent LLMs seem capable of planning and reasoning given user instructions, their effectiveness in applying these capabilities for autonomous task solving remains underexplored. This is especially true in enterprise settings, where automated agents hold the promise of a high impact. To fill this gap, we propose WorkArena++, a novel benchmark consisting of 682 tasks corresponding to realistic workflows routinely performed by knowledge workers. WorkArena++ is designed to evaluate the planning, problem-solving, logical/arithmetic reasoning, retrieval, and contextual understanding abilities of web agents. Our empirical studies across state-of-the-art LLMs and vision-language models (VLMs), as well as human workers, reveal several challenges for such models to serve as useful assistants in the workplace. In addition to the benchmark, we provide a mechanism to effortlessly generate thousands of ground-truth observation/action traces, which can be used for fine-tuning existing models. Overall, we expect this work to serve as a useful resource to help the community progress toward capable autonomous agents. The benchmark can be found at https://github.com/ServiceNow/WorkArena.

2024-09-25

Datasets and Benchmarks Track @ Neural Information Processing Systems (poster)

doi.org

openreview.net
