Publications

StarVector: Generating Scalable Vector Graphics Code from Images
Juan A. Rodriguez
Shubham Agarwal
Issam Hadj Laradji
Pau Rodriguez
David Vazquez
Marco Pedersoli
Scalable Vector Graphics (SVGs) have become integral in modern image rendering applications due to their infinite scalability in resolution,… (see more) versatile usability, and editing capabilities. SVGs are particularly popular in the fields of web development and graphic design. Existing approaches for SVG modeling using deep learning often struggle with generating complex SVGs and are restricted to simpler ones that require extensive processing and simplification. This paper introduces StarVector, a multimodal SVG generation model that effectively integrates Code Generation Large Language Models (CodeLLMs) and vision models. Our approach utilizes a CLIP image encoder to extract visual representations from pixel-based images, which are then transformed into visual tokens via an adapter module. These visual tokens are pre-pended to the SVG token embeddings, and the sequence is modeled by the StarCoder model using next-token prediction, effectively learning to align the visual and code tokens. This enables StarVector to generate unrestricted SVGs that accurately represent pixel images. To evaluate StarVector's performance, we present SVG-Bench, a comprehensive benchmark for evaluating SVG methods across multiple datasets and relevant metrics. Within this benchmark, we introduce novel datasets including SVG-Stack, a large-scale dataset of real-world SVG examples, and use it to pre-train StarVector as a large foundation model for SVGs. Our results demonstrate significant enhancements in visual quality and complexity handling over current methods, marking a notable advancement in SVG generation technology. Code and models: https://github.com/joanrod/star-vector
Rescuespeech: A German Corpus for Speech Recognition in Search and Rescue Domain
Sangeet Sagar
Bernd Kiefer
Ivana Kruijff-Korbayová
Josef van Genabith
Despite the recent advancements in speech recognition, there are still difficulties in accurately transcribing conversational and emotional … (see more)speech in noisy and reverberant acoustic environments. This poses a particular challenge in the search and rescue (SAR) domain, where transcribing conversations among rescue team members is crucial to support real-time decision-making. The scarcity of speech data and associated background noise in SAR scenarios make it difficult to deploy robust speech recognition systems.To address this issue, we have created and made publicly available a German speech dataset called RescueSpeech. This dataset includes real speech recordings from simulated rescue exercises. Additionally, we have released competitive training recipes and pre-trained models. Our study highlights that the performance attained by state-of-the-art methods in this challenging scenario is still far from reaching an acceptable level.
Self-supervised multimodal learning for group inferences from MRI data: Discovering disorder-relevant brain regions and multimodal links
Alex Fedorov
Eloy Geenjaar
Lei Wu
Tristan Sylvain
Thomas P. DeRamus
Margaux Luck
Maria Misiura
Girish Mittapalle
Sergey M. Plis
Vince D. Calhoun
Speech Emotion Diarization: Which Emotion Appears When?
Yingzhi Wang
Alaa Nfissi
Alya Yacoubi
Speech Emotion Recognition (SER) typically relies on utterance-level solutions. However, emotions conveyed through speech should be consider… (see more)ed as discrete speech events with definite temporal boundaries, rather than attributes of the entire utterance. To reflect the fine-grained nature of speech emotions and to unify various fine-grained methods under a single objective, we propose a new task: Speech Emotion Diarization (SED). Just as Speaker Diarization answers the question of “Who speaks when?”, Speech Emotion Diarization answers the question of “Which emotion appears when?”. To facilitate the evaluation of the performance and establish a common benchmark, we introduce the Zaion Emotion Dataset (ZED), an openly accessible speech emotion dataset that includes non-acted emotions recorded in real-life conditions, along with manually annotated boundaries of emotion segments within the utterance. We provide competitive baselines and open-source the code and the pre-trained models.
TorchAudio 2.1: Advancing Speech Recognition, Self-Supervised Learning, and Audio Processing Components for Pytorch
Jeff Hwang
Moto Hira
Caroline Chen
Xiaohui Zhang
Zhaoheng Ni
Guangzhi Sun
Pingchuan Ma
Ruizhe Huang
Vineel Pratap
Yuekai Zhang
Anurag Kumar
Chin-Yun Yu
Chuang Zhu
Chunxi Liu
Jacob Kahn
Peng Sun
Shinji Watanabe
Yangyang Shi
Yumeng Tao … (see 4 more)
Robin Scheibler
Samuele Cornell
Sean Kim
Stavros Petridis
TorchAudio is an open-source audio and speech processing library built for PyTorch. It aims to accelerate the research and development of au… (see more)dio and speech technologies by providing well-designed, easy-to-use, and performant PyTorch components. Its contributors routinely engage with users to understand their needs and fulfill them by developing impactful features. Here, we survey TorchAudio’s development principles and contents and highlight key features we include in its latest version (2.1): self-supervised learning pre-trained pipelines and training recipes, high-performance CTC decoders, speech recognition models and training recipes, advanced media I/O capabilities, and tools for performing forced alignment, multi-channel speech enhancement, and reference-less speech assessment. For a selection of these features, through empirical studies, we demonstrate their efficacy and show that they achieve competitive or state-of-the-art performance.
FoMo-Bench: a multi-modal, multi-scale and multi-task Forest Monitoring Benchmark for remote sensing foundation models
Nikolaos Ioannis Bountos
Arthur Ouaknine
Genetic landscape of an in vivo protein interactome
Savandara Besse
Tatsuya Sakaguchi
Louis Gauthier
Zahra Sahaf
Olivier Péloquin
Lidice Gonzalez
Xavier Castellanos-Girouard
Nazli Koçatug
Chloé Matta
Stephen W. Michnick
Adrian W.R. Serohijos
Temporal encoding in deep reinforcement learning agents
Dongyan Lin
Ann Zixiang Huang
Cone-Traced Supersampling with Subpixel Edge Reconstruction.
Andrei Chubarau
Yangyang Zhao
Ruby Rao
Paul Kry
While signed distance fields (SDFs) in theory offer infinite level of detail, they are typically rendered using the sphere tracing algorithm… (see more) at finite resolutions, which causes the common rasterized image synthesis problem of aliasing. Most existing optimized antialiasing solutions rely on polygon mesh representations; SDF-based geometry can only be directly antialiased with the computationally expensive supersampling or with post-processing filters that may produce undesirable blurriness and ghosting. In this work, we present cone-traced supersampling (CTSS), an efficient and robust spatial antialiasing solution that naturally complements the sphere tracing algorithm, does not require casting additional rays per pixel or offline prefiltering, and can be easily implemented in existing real-time SDF renderers. CTSS performs supersampling along the traced ray near surfaces with partial visibility – object contours – identified by evaluating cone intersections within a pixel's view frustum. We further introduce subpixel edge reconstruction (SER), a technique that extends CTSS to locate and resolve complex pixels with geometric edges in relatively flat regions, which are otherwise undetected by cone intersections. Our combined solution relies on a specialized sampling strategy to minimize the number of shading computations and correlates sample visibility to aggregate the samples. With comparable antialiasing quality at significantly lower computational cost, CTSS is a reliable practical alternative to conventional supersampling.
Feasibility of cognitive neuroscience data collection during a speleological expedition
Anita Paas
Hugo R. Jourde
Arnaud Brignol
Marie-Anick Savard
Zseyvfin Eyqvelle
Samuel Bassetto
Emily B.J. Coffey
Global Rewards in Multi-Agent Deep Reinforcement Learning for Autonomous Mobility on Demand Systems
Heiko Hoppe
Tobias Enders
Maximilian Schiffer
Asymmetric Actor-Critic with Approximate Information State
Amit Sinha
Reinforcement learning (RL) for partially observable Markov decision processes (POMDPs) is a challenging problem because decisions need to b… (see more)e made based on the entire history of observations and actions. However, in several scenarios, state information is available during the training phase. We are interested in exploiting the availability of this state information during the training phase to efficiently learn a history-based policy using RL. Specifically, we consider actor-critic algorithms, where the actor uses only the history information but the critic uses both history and state. Such algorithms are called asymmetric actor-critic, to highlight the fact that the actor and critic have asymmetric information. Motivated by the recent success of using representation losses in RL for POMDPs [1], we derive similar theoretical results for the asymmetric actor-critic case and evaluate the effectiveness of adding such auxiliary losses in experiments. In particular, we learn a history representation-called an approximate information state (AIS)-and bound the performance loss when acting using AIS.