Publications

The State of Data Curation at NeurIPS: An Assessment of Dataset Development Practices in the Datasets and Benchmarks Track

Eshta Bhardwaj

Harshit Gujral

Siyi Wu

Ciara Zogheib

Tegan Maharaj

Christoph Becker

Data curation is a field with origins in librarianship and archives, whose scholarship and thinking on data issues go back centuries, if not… (voir plus) millennia. The field of machine learning is increasingly observing the importance of data curation to the advancement of both applications and fundamental understanding of machine learning models - evidenced not least by the creation of the Datasets and Benchmarks track itself. This work provides an analysis of dataset development practices at NeurIPS through the lens of data curation. We present an evaluation framework for dataset documentation, consisting of a rubric and toolkit developed through a literature review of data curation principles. We use the framework to assess the strengths and weaknesses in current dataset development practices of 60 datasets published in the NeurIPS Datasets and Benchmarks track from 2021-2023. We summarize key findings and trends. Results indicate greater need for documentation about environmental footprint, ethical considerations, and data management. We suggest targeted strategies and resources to improve documentation in these areas and provide recommendations for the NeurIPS peer-review process that prioritize rigorous data curation in ML. Finally, we provide results in the format of a dataset that showcases aspects of recommended data curation practices. Our rubric and results are of interest for improving data curation practices broadly in the field of ML as well as to data curation and science and technology studies scholars studying practices in ML. Our aim is to support continued improvement in interdisciplinary research on dataset practices, ultimately improving the reusability and reproducibility of new datasets and benchmarks, enabling standardized and informed human oversight, and strengthening the foundation of rigorous and responsible ML research.

2024-10-29

ArXiv (prépublication)

doi.org

arxiv.org

Doctoral Symposium Committee

Anthony Cleve

Christian Lange

Silvia Breu

Manar H. Alalfi

Mario Luca Bernardi

Cornelia Boldyreff

Marco D'Ambros

Simon Denier

Natalia Dragan

Ekwa Duala-Ekoko

Fausto Fasano

Adnane Ghannem

Carmine Gravino

Maen Hammad

Imed Hammouda

Salima Hassaine

Yue Jia

Zhen Ming Jiang

Foutse Khomh

Adam Kiezun … (voir 11 de plus)

Jay Kothari

Jonathan Memaitre

Naouel Moha

Rocco Oliveto

Denys Poshyvanyk

Michele Risi

Giuseppe Scanniello

Bonita Sharif

Andrew Sutton

Anis Yousefi

Eugenio Zimeo

Manar H. Alalfi Mario Luca Bernardi Cornelia Boldyreff Anthony Cleve Marco D'Ambros Simon Denier Natalia Dragan Ekwa Duala-Ekoko Fausto Fasa… (voir plus)no Adnane Ghannem Carmine Gravino Maen Hammad Imed Hammouda Salima Hassaine Yue Jia Zhen Ming Jiang Foutse Khomh Adam Kiezun Jay Kothari Jonathan Memaitre Naouel Moha Rocco Oliveto Denys Poshyvanyk Michele Risi Giuseppe Scanniello Bonita Sharif Andrew Sutton Anis Yousefi Eugenio Zimeo

2024-10-28

2024 IEEE 35th International Symposium on Software Reliability Engineering Workshops (ISSREW) (publié)

doi.org

Doctoral Symposium Committee

Anthony Cleve

Christian Lange

Silvia Breu

Manar H. Alalfi

Mario Luca Bernardi

Cornelia Boldyreff

Marco D'Ambros

Simon Denier

Natalia Dragan

Ekwa Duala-Ekoko

Fausto Fasano

Adnane Ghannem

Carmine Gravino

Maen Hammad

Imed Hammouda

Salima Hassaine

Yue Jia

Zhen Ming (Jack) Jiang

Foutse Khomh

Adam Kiezun … (voir 11 de plus)

Jay Kothari

Jonathan Memaitre

Naouel Moha

Rocco Oliveto

Denys Poshyvanyk

Michele Risi

Giuseppe Scanniello

Bonita Sharif

Andrew Sutton

Anis Yousefi

Eugenio Zimeo

Manar H. Alalfi Mario Luca Bernardi Cornelia Boldyreff Anthony Cleve Marco D'Ambros Simon Denier Natalia Dragan Ekwa Duala-Ekoko Fausto Fasa… (voir plus)no Adnane Ghannem Carmine Gravino Maen Hammad Imed Hammouda Salima Hassaine Yue Jia Zhen Ming Jiang Foutse Khomh Adam Kiezun Jay Kothari Jonathan Memaitre Naouel Moha Rocco Oliveto Denys Poshyvanyk Michele Risi Giuseppe Scanniello Bonita Sharif Andrew Sutton Anis Yousefi Eugenio Zimeo

2024-10-28

2024 IEEE 35th International Symposium on Software Reliability Engineering Workshops (ISSREW) (publié)

doi.org

General Causal Imputation via Synthetic Interventions

Marco Jiralerspong

Thomas Jiralerspong

Vedant Shah

Dhanya Sridhar

Gauthier Gidel

2024-10-28

ArXiv (prépublication)

arxiv.org

General Causal Imputation via Synthetic Interventions

Marco Jiralerspong

Thomas Jiralerspong

Vedant Shah

Dhanya Sridhar

Gauthier Gidel

Given two sets of elements (such as cell types and drug compounds), researchers typically only have access to a limited subset of their inte… (voir plus)ractions. The task of causal imputation involves using this subset to predict unobserved interactions. Squires et al. (2022) have proposed two estimators for this task based on the synthetic interventions (SI) estimator: SI-A (for actions) and SI-C (for contexts). We extend their work and introduce a novel causal imputation estimator, generalized synthetic interventions (GSI). We prove the identifiability of this estimator for data generated from a more complex latent factor model. On synthetic and real data we show empirically that it recovers or outperforms their estimators.

2024-10-28

ArXiv (prépublication)

doi.org

arxiv.org

Investigating the Benefits of Nonlinear Action Maps in Data-Driven Teleoperation

Michael Przystupa

Gauthier Gidel

Matthew E. Taylor

Martin Jägersand

Justus Piater

Samuele Tosatto

As robots become more common for both able-bodied individuals and those living with a disability, it is increasingly important that lay peop… (voir plus)le be able to drive multi-degree-of-freedom platforms with low-dimensional controllers. One approach is to use state-conditioned action mapping methods to learn mappings between low-dimensional controllers and high DOF manipulators -- prior research suggests these mappings can simplify the teleoperation experience for users. Recent works suggest that neural networks predicting a local linear function are superior to the typical end-to-end multi-layer perceptrons because they allow users to more easily undo actions, providing more control over the system. However, local linear models assume actions exist on a linear subspace and may not capture nuanced actions in training data. We observe that the benefit of these mappings is being an odd function concerning user actions, and propose end-to-end nonlinear action maps which achieve this property. Unfortunately, our experiments show that such modifications offer minimal advantages over previous solutions. We find that nonlinear odd functions behave linearly for most of the control space, suggesting architecture structure improvements are not the primary factor in data-driven teleoperation. Our results suggest other avenues, such as data augmentation techniques and analysis of human behavior, are necessary for action maps to become practical in real-world applications, such as in assistive robotics to improve the quality of life of people living with w disability.

2024-10-28

ArXiv (prépublication)

doi.org

arxiv.org

Trajectory Flow Matching with Applications to Clinical Time Series Modeling

Xi Zhang

Yuan Pu

Yuki Kawamura

Andrew Loza

Yoshua Bengio

Dennis Shung

Alexander Tong

Modeling stochastic and irregularly sampled time series is a challenging problem found in a wide range of applications, especially in medici… (voir plus)ne. Neural stochastic differential equations (Neural SDEs) are an attractive modeling technique for this problem, which parameterize the drift and diffusion terms of an SDE with neural networks. However, current algorithms for training Neural SDEs require backpropagation through the SDE dynamics, greatly limiting their scalability and stability. To address this, we propose Trajectory Flow Matching (TFM), which trains a Neural SDE in a simulation-free manner, bypassing backpropagation through the dynamics. TFM leverages the flow matching technique from generative modeling to model time series. In this work we first establish necessary conditions for TFM to learn time series data. Next, we present a reparameterization trick which improves training stability. Finally, we adapt TFM to the clinical time series setting, demonstrating improved performance on three clinical time series datasets both in terms of absolute performance and uncertainty prediction.

2024-10-28

ArXiv (prépublication)

doi.org

arxiv.org

Trajectory Flow Matching with Applications to Clinical Time Series Modeling

Xi Zhang

Yuan Pu

Yuki Kawamura

Andrew Loza

Yoshua Bengio

Dennis Shung

Alexander Tong

Modeling stochastic and irregularly sampled time series is a challenging problem found in a wide range of applications, especially in medici… (voir plus)ne. Neural stochastic differential equations (Neural SDEs) are an attractive modeling technique for this problem, which parameterize the drift and diffusion terms of an SDE with neural networks. However, current algorithms for training Neural SDEs require backpropagation through the SDE dynamics, greatly limiting their scalability and stability. To address this, we propose Trajectory Flow Matching (TFM), which trains a Neural SDE in a simulation-free manner, bypassing backpropagation through the dynamics. TFM leverages the flow matching technique from generative modeling to model time series. In this work we first establish necessary conditions for TFM to learn time series data. Next, we present a reparameterization trick which improves training stability. Finally, we adapt TFM to the clinical time series setting, demonstrating improved performance on three clinical time series datasets both in terms of absolute performance and uncertainty prediction.

2024-10-28

ArXiv (prépublication)

doi.org

arxiv.org

In-Simulation Testing of Deep Learning Vision Models in Autonomous Robotic Manipulators

Dmytro Humeniuk

Houssem Ben Braiek

Thomas Reid

Foutse Khomh

2024-10-27

Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering (publié)

doi.org

arxiv.org

ProtSCAPE: Mapping the landscape of protein conformations in molecular dynamics

Siddharth Viswanath

Dhananjay Bhaskar

David R. Johnson

João F. Rocha

Egbert Castro

Jackson Grady

Alex T. Grigas

Michael Perlmutter

Corey S. O'Hern

Smita Krishnaswamy

Understanding the dynamic nature of protein structures is essential for comprehending their biological functions. While significant progress… (voir plus) has been made in predicting static folded structures, modeling protein motions on microsecond to millisecond scales remains challenging. To address these challenges, we introduce a novel deep learning architecture, Protein Transformer with Scattering, Attention, and Positional Embedding (ProtSCAPE), which leverages the geometric scattering transform alongside transformer-based attention mechanisms to capture protein dynamics from molecular dynamics (MD) simulations. ProtSCAPE utilizes the multi-scale nature of the geometric scattering transform to extract features from protein structures conceptualized as graphs and integrates these features with dual attention structures that focus on residues and amino acid signals, generating latent representations of protein trajectories. Furthermore, ProtSCAPE incorporates a regression head to enforce temporally coherent latent representations.

2024-10-27

ArXiv (prépublication)

doi.org

arxiv.org

Brain-like learning with exponentiated gradients

Jonathan Cornford

Roman Pogodin

Arna Ghosh

Kaiwen Sheng

Brendan A. Bicknell

Olivier Codol

Beverley A. Clark

Guillaume Lajoie

Blake Richards

2024-10-26

bioRxiv (prépublication)

doi.org

Efficient Biological Data Acquisition through Inference Set Design

Ihor Neporozhnii

Julien Roy

Emmanuel Bengio

Jason Hartford

In drug discovery, highly automated high-throughput laboratories are used to screen a large number of compounds in search of effective drugs… (voir plus). These experiments are expensive, so one might hope to reduce their cost by only experimenting on a subset of the compounds, and predicting the outcomes of the remaining experiments. In this work, we model this scenario as a sequential subset selection problem: we aim to select the smallest set of candidates in order to achieve some desired level of accuracy for the system as a whole. Our key observation is that, if there is heterogeneity in the difficulty of the prediction problem across the input space, selectively obtaining the labels for the hardest examples in the acquisition pool will leave only the relatively easy examples to remain in the inference set, leading to better overall system performance. We call this mechanism inference set design, and propose the use of a confidence-based active learning solution to prune out these challenging examples. Our algorithm includes an explicit stopping criterion that interrupts the acquisition loop when it is sufficiently confident that the system has reached the target performance. Our empirical studies on image and molecular datasets, as well as a real-world large-scale biological assay, show that active learning for inference set design leads to significant reduction in experimental cost while retaining high system performance.

2024-10-25

ArXiv (prépublication)

doi.org

arxiv.org

Avantage IA

Mettre à profit l'IA pour un avenir durable

Bourse Mila en politiques de l'IA

Avantage IA

Mettre à profit l'IA pour un avenir durable

Publications

Avantage IA

Mettre à profit l'IA pour un avenir durable

Bourse Mila en politiques de l'IA

Avantage IA

Mettre à profit l'IA pour un avenir durable

Mots-clés populaires:

Publications