A Picture is Worth More Than 77 Text Tokens: Evaluating CLIP-Style Models on Dense Captions
Jack Urbanek
Florian Bordes
Pietro Astolfi
Mary Williamson
Vasu Sharma
Curation methods for massive vision-language datasets trade off between dataset size and quality. However, even the highest quality of avail… (see more)able curated captions are far too short to capture the rich visual detail in an image. To show the value of dense and highly-aligned image-text pairs, we collect the Densely Captioned Images (DCI) dataset, containing 7805 natural images human-annotated with mask-aligned descriptions averaging above 1000 words each. With precise and reliable captions associated with specific parts of an image, we can evaluate vision-language models' (VLMs) understanding of image content with a novel task that matches each caption with its corresponding subcrop. As current models are often limited to 77 text tokens, we also introduce a summarized version (sDCI) in which each caption length is limited. We show that modern techniques that make progress on standard benchmarks do not correspond with significant improvement on our sDCI based benchmark. Lastly, we finetune CLIP using sDCI and show significant improvements over the baseline despite a small training set. By releasing the first human annotated dense image captioning dataset, we hope to enable the development of new benchmarks or finetuning recipes for the next generation of VLMs to come.
A Picture is Worth More Than 77 Text Tokens: Evaluating CLIP-Style Models on Dense Captions
Jack Urbanek
Florian Bordes
Pietro Astolfi
Mary Williamson
Vasu Sharma
Curation methods for massive vision-language datasets trade off between dataset size and quality. However, even the highest quality of avail… (see more)able curated captions are far too short to capture the rich visual detail in an image. To show the value of dense and highly-aligned image-text pairs, we collect the Densely Captioned Images (DCI) dataset, containing 7805 natural images human-annotated with mask-aligned descriptions averaging above 1000 words each. With precise and reliable captions associated with specific parts of an image, we can evaluate vision-language models' (VLMs) understanding of image content with a novel task that matches each caption with its corresponding subcrop. As current models are often limited to 77 text tokens, we also introduce a summarized version (sDCI) in which each caption length is limited. We show that modern techniques that make progress on standard benchmarks do not correspond with significant improvement on our sDCI based benchmark. Lastly, we finetune CLIP using sDCI and show significant improvements over the baseline despite a small training set. By releasing the first human annotated dense image captioning dataset, we hope to enable the development of new benchmarks or finetuning recipes for the next generation of VLMs to come.
A Picture is Worth More Than 77 Text Tokens: Evaluating CLIP-Style Models on Dense Captions
Jack Urbanek
Florian Bordes
Pietro Astolfi
Mary Williamson
Vasu Sharma
Curation methods for massive vision-language datasets trade off between dataset size and quality. However, even the highest quality of avail… (see more)able curated captions are far too short to capture the rich visual detail in an image. To show the value of dense and highly-aligned image-text pairs, we collect the Densely Captioned Images (DCI) dataset, containing 7805 natural images human-annotated with mask-aligned descriptions averaging above 1000 words each. With precise and reliable captions associated with specific parts of an image, we can evaluate vision-language models' (VLMs) understanding of image content with a novel task that matches each caption with its corresponding subcrop. As current models are often limited to 77 text tokens, we also introduce a summarized version (sDCI) in which each caption length is limited. We show that modern techniques that make progress on standard benchmarks do not correspond with significant improvement on our sDCI based benchmark. Lastly, we finetune CLIP using sDCI and show significant improvements over the baseline despite a small training set. By releasing the first human annotated dense image captioning dataset, we hope to enable the development of new benchmarks or finetuning recipes for the next generation of VLMs to come.
Symmetry Breaking and Equivariant Neural Networks
Sékou-Oumar Kaba
Using symmetry as an inductive bias in deep learning has been proven to be a principled approach for sample-efficient model design. However,… (see more) the relationship between symmetry and the imperative for equivariance in neural networks is not always obvious. Here, we analyze a key limitation that arises in equivariant functions: their incapacity to break symmetry at the level of individual data samples. In response, we introduce a novel notion of 'relaxed equivariance' that circumvents this limitation. We further demonstrate how to incorporate this relaxation into equivariant multilayer perceptrons (E-MLPs), offering an alternative to the noise-injection method. The relevance of symmetry breaking is then discussed in various application domains: physics, graph representation learning, combinatorial optimization and equivariant decoding.
Asymmetric Actor-Critic with Approximate Information State
Amit Sinha
Reinforcement learning (RL) for partially observable Markov decision processes (POMDPs) is a challenging problem because decisions need to b… (see more)e made based on the entire history of observations and actions. However, in several scenarios, state information is available during the training phase. We are interested in exploiting the availability of this state information during the training phase to efficiently learn a history-based policy using RL. Specifically, we consider actor-critic algorithms, where the actor uses only the history information but the critic uses both history and state. Such algorithms are called asymmetric actor-critic, to highlight the fact that the actor and critic have asymmetric information. Motivated by the recent success of using representation losses in RL for POMDPs [1], we derive similar theoretical results for the asymmetric actor-critic case and evaluate the effectiveness of adding such auxiliary losses in experiments. In particular, we learn a history representation-called an approximate information state (AIS)-and bound the performance loss when acting using AIS.
Current Practices in Voice Data Collection and Limitations to Voice AI Research: A National Survey.
Emily Evangelista
Rohan Kale
Desiree McCutcheon
Anais Rameau
Alexander H. Gelbard
Maria Powell
Michael Johns
Anthony Law
Phillip C Song
M. Naunheim
Stephanie Watts
Paul C. Bryson
Matthew G. Crowson
Jeremy M. Pinto
Yael Bensoussan
INTRODUCTION Accuracy and validity of voice AI algorithms rely on substantial quality voice data. Although commensurable amounts of voice da… (see more)ta are captured daily in voice centers across North America, there is no standardized protocol for acoustic data management, which limits the usability of these datasets for voice artificial intelligence (AI) research. OBJECTIVE The aim was to capture current practices of voice data collection, storage, analysis, and perceived limitations to collaborative voice research. METHODS A 30-question online survey was developed with expert guidance from the voicecollab.ai members, an international collaborative of voice AI researchers. The survey was disseminated via REDCap to an estimated 200 practitioners at North American voice centers. Survey questions assessed respondents' current practices in terms of acoustic data collection, storage, and retrieval as well as limitations to collaborative voice research. RESULTS Seventy-two respondents completed the survey of which 81.7% were laryngologists and 18.3% were speech language pathologists (SLPs). Eighteen percent of respondents reported seeing 40%-60% and 55% reported seeing >60 patients with voice disorders weekly (conservative estimate of over 4000 patients/week). Only 28% of respondents reported utilizing standardized protocols for collection and storage of acoustic data. Although, 87% of respondents conduct voice research, only 38% of respondents report doing so on a multi-institutional level. Perceived limitations to conducting collaborative voice research include lack of standardized methodology for collection (30%) and lack of human resources to prepare and label voice data adequately (55%). CONCLUSION To conduct large-scale multi-institutional voice research with AI, there is a pertinent need for standardization of acoustic data management, as well as an infrastructure for secure and efficient data sharing. LEVEL OF EVIDENCE Level 5 Laryngoscope, 2023.
Current Practices in Voice Data Collection and Limitations to Voice AI Research: A National Survey.
Emily Evangelista
Rohan Kale
Desiree McCutcheon
Anais Rameau
Alexander Gelbard
Maria Powell
Michael Johns
Anthony Law
Phillip Song
Matthew Naunheim
Stephanie Watts
Paul C. Bryson
Matthew G. Crowson
Jeremy Pinto
Yael Bensoussan
INTRODUCTION Accuracy and validity of voice AI algorithms rely on substantial quality voice data. Although commensurable amounts of voice da… (see more)ta are captured daily in voice centers across North America, there is no standardized protocol for acoustic data management, which limits the usability of these datasets for voice artificial intelligence (AI) research. OBJECTIVE The aim was to capture current practices of voice data collection, storage, analysis, and perceived limitations to collaborative voice research. METHODS A 30-question online survey was developed with expert guidance from the voicecollab.ai members, an international collaborative of voice AI researchers. The survey was disseminated via REDCap to an estimated 200 practitioners at North American voice centers. Survey questions assessed respondents' current practices in terms of acoustic data collection, storage, and retrieval as well as limitations to collaborative voice research. RESULTS Seventy-two respondents completed the survey of which 81.7% were laryngologists and 18.3% were speech language pathologists (SLPs). Eighteen percent of respondents reported seeing 40%-60% and 55% reported seeing >60 patients with voice disorders weekly (conservative estimate of over 4000 patients/week). Only 28% of respondents reported utilizing standardized protocols for collection and storage of acoustic data. Although, 87% of respondents conduct voice research, only 38% of respondents report doing so on a multi-institutional level. Perceived limitations to conducting collaborative voice research include lack of standardized methodology for collection (30%) and lack of human resources to prepare and label voice data adequately (55%). CONCLUSION To conduct large-scale multi-institutional voice research with AI, there is a pertinent need for standardization of acoustic data management, as well as an infrastructure for secure and efficient data sharing. LEVEL OF EVIDENCE Level 5 Laryngoscope, 2023.
A deep learning benchmark for first break detection from hardrock seismic reflection data
Pierre-Luc St-Charles
Bruno Rousseau
Joumana Ghosn
Gilles Bellefleur
Ernst Schetselaar
Privacy-preserving analysis of time-to-event data under nested case-control sampling
Lamin Juwara
Ana M Velly
Paramita Saha-Chaudhuri
Q-learners Can Provably Collude in the Iterated Prisoner's Dilemma
Quentin Bertrand
Juan Duque
Emilio Calvano
The deployment of machine learning systems in the market economy has triggered academic and institutional fears over potential tacit collusi… (see more)on between fully automated agents. Multiple recent economics studies have empirically shown the emergence of collusive strategies from agents guided by machine learning algorithms. In this work, we prove that multi-agent Q-learners playing the iterated prisoner's dilemma can learn to collude. The complexity of the cooperative multi-agent setting yields multiple fixed-point policies for
Relative Almost Sure Regret Bounds for Certainty Equivalence Control of Markov Jump Systems
Borna Sayedana
Mohammad Afshari
Peter E. Caines
In this paper, we consider learning and control problem in an unknown Markov jump linear system (MJLS) with perfect state observations. We f… (see more)irst establish a generic upper bound on regret for any learning based algorithm. We then propose a certainty equivalence-based learning alagrithm and show that this algorithm achieves a regret of
Weighted-Norm Bounds on Model Approximation in MDPs with Unbounded Per-Step Cost
Berk Bozkurt
Ashutosh Nayyar
Yi Ouyang
We consider the problem of designing a control policy for an infinite-horizon discounted cost Markov Decision Process (MDP) …