Publications

An empirical study of testing machine learning in the wild
Moses Openja
Armstrong Foundjem
Zhen Ming (Jack) Jiang
Zhenyou Jiang
Mouna Abidi
Ahmed E. Hassan
Background: Recently, machine and deep learning (ML/DL) algorithms have been increasingly adopted in many software systems. Due to their inductive nature, ensuring the quality of these systems remains a significant challenge for the research community. Traditionally, software systems were constructed deductively, by writing explicit rules that govern the behavior of the system as program code. However, ML/DL systems infer rules from training data (i.e., they are generated inductively). Recent research in ML/DL quality assurance has adapted concepts from traditional software testing, such as mutation testing, to improve reliability. However, it is unclear if these proposed testing techniques are adopted in practice, or if new testing strategies have emerged from real-world ML deployments. There is little empirical evidence about the testing strategies. Aims: To fill this gap, we perform the first fine-grained empirical study on ML testing in the wild to identify the ML properties being tested, the testing strategies, and their implementation throughout the ML workflow. Method: We conducted a mixed-methods study to understand ML software testing practices. We analyzed test files and cases from 11 open-source ML/DL projects on GitHub. Using open coding, we manually examined the testing strategies, tested ML properties, and implemented testing methods to understand their practical application in building and releasing ML/DL software systems. Results: Our findings reveal several key insights: 1) The most common testing strategies, accounting for less than 40%, are Grey-box and White-box methods, such as Negative Testing, Oracle Approximation, and Statistical Testing. 2) A wide range of 17 ML properties are tested, out of which only 20% to 30% are frequently tested, including Consistency, Correctness, and Efficiency. 3) Bias and Fairness is tested more often in Recommendation (6%) and CV (3.9%) systems, while Security & Privacy is tested in CV (2%), Application Platforms (0.9%), and NLP (0.5%). 4) We identified 13 types of testing methods, such as Unit Testing, Input Testing, and Model Testing. Conclusions: This study sheds light on the current adoption of software testing techniques and highlights gaps and limitations in existing ML testing practices.
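To make three of the strategy names in this abstract concrete, here is a minimal Python sketch of what a negative test, an oracle-approximation test, and a statistical test might look like. The `softmax` function and the three test functions are hypothetical illustrations, not code from the study or from any of the projects it analyzed, and the study's own definitions of these strategies may be broader.

```python
import math
import random
import statistics


def softmax(scores):
    """Hypothetical function under test: turn raw scores into probabilities."""
    if not scores:
        raise ValueError("scores must be non-empty")
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def test_negative():
    # Negative testing: invalid input should be rejected, not silently accepted.
    try:
        softmax([])
    except ValueError:
        return True
    return False


def test_oracle_approximation():
    # Oracle approximation: compare against an expected value within a tolerance
    # instead of demanding exact floating-point equality.
    probs = softmax([0.0, 0.0])
    return all(math.isclose(p, 0.5, rel_tol=1e-9) for p in probs)


def test_statistical(trials=200, seed=0):
    # Statistical testing: check a property aggregated over many random inputs.
    rng = random.Random(seed)
    sums = [
        sum(softmax([rng.gauss(0.0, 1.0) for _ in range(5)]))
        for _ in range(trials)
    ]
    return math.isclose(statistics.fmean(sums), 1.0, abs_tol=1e-9)
```

Each function returns a boolean so it can be called directly or wrapped in an `assert` by a test runner.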
SCIsegV2: A Universal Tool for Segmentation of Intramedullary Lesions in Spinal Cord Injury
Enamundram Naga Karthik
Jan Valošek
Lynn Farner
Dario Pfyffer
Simon Schading-Sassenhausen
A. Lebret
Gergely David
Andrew Smith
Kenneth A. Weber
Maryam Seif
Rhscir Network Imaging Group
Patrick Freund
Tackling the Problem of Distributional Shifts: Correcting Misspecified, High-Dimensional Data-Driven Priors for Inverse Problems
Gabriel Missael Barco
Alexandre Adam
Connor Stone
Bayesian inference for inverse problems hinges critically on the choice of priors. In the absence of specific prior information, population-level distributions can serve as effective priors for parameters of interest. With the advent of machine learning, the use of data-driven population-level distributions (encoded, e.g., in a trained deep neural network) as priors is emerging as an appealing alternative to simple parametric priors in a variety of inverse problems. However, in many astrophysical applications, it is often difficult or even impossible to acquire independent and identically distributed samples from the underlying data-generating process of interest to train these models. In these cases, corrupted data or a surrogate, e.g. a simulator, is often used to produce training samples, meaning that there is a risk of obtaining misspecified priors. This, in turn, can bias the inferred posteriors in ways that are difficult to quantify, which limits the potential applicability of these models in real-world scenarios. In this work, we propose addressing this issue by iteratively updating the population-level distributions by retraining the model with posterior samples from different sets of observations and showcase the potential of this method on the problem of background image reconstruction in strong gravitational lensing when score-based models are used as data-driven priors. We show that starting from a misspecified prior distribution, the updated distribution becomes progressively closer to the underlying population-level distribution, and the resulting posterior samples exhibit reduced bias after several updates.
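The iterative update described in this abstract can be illustrated with a toy one-dimensional sketch. This is not the authors' method: it swaps the score-based prior for a single Gaussian, assumes Gaussian measurement noise so the posterior is available in closed form, and "retraining" is reduced to refitting a mean and variance. All names and numeric values (`TRUE_MU`, `TRUE_VAR`, `NOISE_VAR`, `iterate_prior`) are assumptions chosen for illustration.

```python
import random
import statistics

# Toy values, assumed for illustration only.
TRUE_MU, TRUE_VAR = 5.0, 1.0   # underlying population-level distribution
NOISE_VAR = 0.25               # measurement noise variance


def posterior_sample(y, prior_mu, prior_var, rng):
    """Draw x ~ p(x | y) for a Gaussian prior and Gaussian noise (conjugate case)."""
    post_var = 1.0 / (1.0 / prior_var + 1.0 / NOISE_VAR)
    post_mu = post_var * (prior_mu / prior_var + y / NOISE_VAR)
    return rng.gauss(post_mu, post_var ** 0.5)


def iterate_prior(prior_mu, prior_var, n_updates=10, n_obs=2000, seed=0):
    """Repeatedly refit the prior to posterior samples; return the (mu, var) history."""
    rng = random.Random(seed)
    history = [(prior_mu, prior_var)]
    for _ in range(n_updates):
        # A fresh set of noisy observations for every update.
        ys = [
            rng.gauss(TRUE_MU, TRUE_VAR ** 0.5) + rng.gauss(0.0, NOISE_VAR ** 0.5)
            for _ in range(n_obs)
        ]
        # One posterior sample per observation under the current prior.
        samples = [posterior_sample(y, prior_mu, prior_var, rng) for y in ys]
        # "Retraining" here just refits the Gaussian's two parameters.
        prior_mu = statistics.fmean(samples)
        prior_var = statistics.pvariance(samples)
        history.append((prior_mu, prior_var))
    return history
```

Calling `iterate_prior(0.0, 4.0)` starts from a deliberately misspecified prior; in this toy setting the fitted mean and variance drift toward `TRUE_MU` and `TRUE_VAR` over the updates, mirroring the qualitative behavior the abstract reports.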
VisMin: Visual Minimal-Change Understanding
Rabiul Awal
Saba Ahmadi
Le Zhang
Fine-grained understanding of objects, attributes, and relationships between objects is crucial for visual-language models (VLMs). Existing benchmarks primarily focus on evaluating VLMs' capability to distinguish between two very similar captions given an image. In this paper, we introduce a new, challenging benchmark termed Visual Minimal-Change Understanding (VisMin), which requires models to predict the correct image-caption match given two images and two captions. The image pair and caption pair contain minimal changes, i.e., only one aspect changes at a time from among the following: object, attribute, count, and spatial relation. These changes test the models' understanding of objects, attributes (such as color, material, shape), counts, and spatial relationships between objects. We built an automatic framework using large language models and diffusion models, followed by a rigorous 4-step verification process by human annotators. Empirical experiments reveal that current VLMs exhibit notable deficiencies in understanding spatial relationships and counting abilities. We also generate a large-scale training dataset to finetune CLIP and Idefics2, showing significant improvements in fine-grained understanding across benchmarks and in CLIP's general image-text alignment. We release all resources, including the benchmark, training data, and finetuned model checkpoints, at https://vismin.net/.
Wasserstein Distributionally Robust Shallow Convex Neural Networks
Julien Pallage
A Rapid Method for Impact Analysis of Grid-Edge Technologies on Power Distribution Networks
Feng Li
Ilhan Kocar
This paper presents a novel rapid estimation method (REM) to perform stochastic impact analysis of grid-edge technologies (GETs) on power distribution networks. The evolution of the probability density functions (PDFs) of network states with respect to GET penetration levels is characterized by the Fokker-Planck equation (FPE). The FPE is numerically solved to compute the PDFs of network states, and a calibration process is also proposed such that the accuracy of the REM is maintained for large-scale distribution networks. The approach is illustrated on a large-scale realistic distribution network using a modified version of the IEEE 8500 feeder, where electric vehicles (EVs) or photovoltaic systems (PVs) are installed at various penetration rates. Quantitative analyses demonstrate that the results from our proposed approach have negligible errors compared with those obtained from Monte Carlo simulations.
Improving Context-Aware Preference Modeling for Language Models
Silviu Pitis
Ziang Xiao
While finetuning language models from pairwise preferences has proven remarkably effective, the underspecified nature of natural language presents critical challenges. Direct preference feedback is uninterpretable, difficult to provide where multidimensional criteria may apply, and often inconsistent, either because it is based on incomplete instructions or provided by diverse principals. To address these challenges, we consider the two-step preference modeling procedure that first resolves the under-specification by selecting a context, and then evaluates preference with respect to the chosen context. We decompose reward modeling error according to these two steps, which suggests that supervising context in addition to context-specific preference may be a viable approach to aligning models with diverse human preferences. For this to work, the ability of models to evaluate context-specific preference is critical. To this end, we contribute context-conditioned preference datasets and accompanying experiments that investigate the ability of language models to evaluate context-specific preference. We use our datasets to (1) show that existing preference models benefit from, but fail to fully consider, added context, (2) finetune a context-aware reward model with context-specific performance exceeding that of GPT-4 and Llama 3 70B on tested datasets, and (3) investigate the value of context-aware preference modeling.
Open Problems in Technical AI Governance
Anka Reuel
Benjamin Bucknall
Stephen Casper
Tim Fist
Lisa Soder
Onni Aarne
Lewis Hammond
Lujain Ibrahim
Alan Chan
Peter Wills
Markus Anderljung
Ben Garfinkel
Lennart Heim
Andrew Trask
Gabriel Mukobi
Rylan Schaeffer
Mauricio Baker
Sara Hooker
Irene Solaiman
Alexandra Luccioni
Nitarshan Rajkumar
Nicolas Moes
Jeffrey Ladish
Neel Guha
Jessica Newman
Tobin South
Alex Pentland
Sanmi Koyejo
Mykel Kochenderfer
Robert F. Trager
AI progress is creating a growing range of risks and opportunities, but it is often unclear how they should be navigated. In many cases, the barriers and uncertainties faced are at least partly technical. Technical AI governance, referring to technical analysis and tools for supporting the effective governance of AI, seeks to address such challenges. It can help to (a) identify areas where intervention is needed, (b) identify and assess the efficacy of potential governance actions, and (c) enhance governance options by designing mechanisms for enforcement, incentivization, or compliance. In this paper, we explain what technical AI governance is, why it is important, and present a taxonomy and incomplete catalog of its open problems. This paper is intended as a resource for technical researchers or research funders looking to contribute to AI governance.
T2VIndexer: A Generative Video Indexer for Efficient Text-Video Retrieval
Yili Li
Jing Yu
Keke Gai
Gang Xiong
Qi Wu
Current text-video retrieval methods mainly rely on cross-modal matching between queries and videos to calculate their similarity scores, which are then sorted to obtain retrieval results. This method considers the matching between each candidate video and the query, but it incurs a significant time cost that grows notably as the number of candidates increases. Generative models are common in natural language processing and computer vision, and have been successfully applied in document retrieval, but their application in multimodal retrieval remains unexplored. To enhance retrieval efficiency, in this paper, we introduce a model-based video indexer named T2VIndexer, which is a sequence-to-sequence generative model directly generating video identifiers and retrieving candidate videos with constant time complexity. T2VIndexer aims to reduce retrieval time while maintaining high accuracy. To achieve this goal, we propose video identifier encoding and query-identifier augmentation approaches to represent videos as short sequences while preserving their semantic information. Our method consistently enhances the retrieval efficiency of current state-of-the-art models on four standard datasets. It enables baselines with only 30%-50% of the original retrieval time to achieve better retrieval performance on MSR-VTT (+1.0%), MSVD (+1.8%), ActivityNet (+1.5%), and DiDeMo (+0.2%). The code is available at https://anonymous.4open.science/r/T2VIndexer-40BE.
Temporal Residual Jacobians For Rig-free Motion Transfer
Sanjeev Muralikrishnan
Niladri Shekhar Dutt
Siddhartha Chaudhuri
Vladimir Kim
Matthew Fisher
Niloy J. Mitra
We introduce Temporal Residual Jacobians as a novel representation to enable data-driven motion transfer. Our approach does not assume access to any rigging or intermediate shape keyframes, produces geometrically and temporally consistent motions, and can be used to transfer long motion sequences. Central to our approach are two coupled neural networks that individually predict local geometric and temporal changes that are subsequently integrated, spatially and temporally, to produce the final animated meshes. The two networks are jointly trained, complement each other in producing spatial and temporal signals, and are supervised directly with 3D positional information. During inference, in the absence of keyframes, our method essentially solves a motion extrapolation problem. We test our setup on diverse meshes (synthetic and scanned shapes) to demonstrate its superiority in generating realistic and natural-looking animations on unseen body shapes against SoTA alternatives. Supplemental video and code are available at https://temporaljacobians.github.io/.
Myelin basic protein mRNA levels affect myelin sheath dimensions, architecture, plasticity, and density of resident glial cells
Hooman Bagheri
Hana Friedman
Amanda Hadwen
Celia Jarweh
Ellis Cooper
Lawrence Oprea
Claire Guerrier
Anmar Khadra
Armand Collin
Amanda Young
Gerardo Mendez Victoriano
Matthew Swire
Andrew Jarjour
Marie E. Bechler
Rachel S. Pryce
Pierre Chaurand
Lise Cougnaud
Dajana Vuckovic
Elliott Wilion
Owen Greene
Akiko Nishiyama
Anouk Benmamar‐Badel
Trevor Owens
Vladimir Grouza
Marius Tuznik
Hanwen Liu
David A. Rudko
Jinyi Zhang
Katherine A. Siminovitch
Alan C. Peterson
The Madness of Multiple Entries in March Madness
Jeff Decary
David Bergman
Carlos Henrique Cardonha
Jason Imbrogno
This paper explores multi-entry strategies for betting pools related to single-elimination tournaments. In such betting pools, participants select winners of games, and their respective score is a weighted sum of the number of correct selections. Most betting pools have a top-heavy payoff structure, so the paper focuses on strategies that maximize the expected score of the best-performing entry. There is no known closed-form expression for the estimation of this metric, so the paper investigates the challenges associated with the estimation and the optimization of multi-entry solutions. We present an exact dynamic programming approach for calculating the maximum expected score of any given fixed solution, which is exponential in the number of entries. We explore the structural properties of the problem to develop several solution techniques. In particular, by extracting insights from the solutions produced by one of our algorithms, we design a simple yet effective problem-specific heuristic that was the best-performing technique in our experiments, which were based on real-world data extracted from recent March Madness tournaments. In particular, our results show that the best 100-entry solution identified by our heuristic had a 2.2% likelihood of winning a