Fleurs-SLU: A Massively Multilingual Benchmark for Spoken Language Understanding
Fabian David Schmidt
Ivan Vulić
Goran Glavaš
Spoken language understanding (SLU) is indispensable for half of all living languages that lack a formal writing system, since these languages cannot pair automatic speech recognition (ASR) with language models to benefit from language technology. Even if low-resource languages possess a writing system, ASR for these languages remains unreliable due to limited bimodal speech and text training data. Better SLU can strengthen the robustness of massively multilingual ASR by leveraging language semantics to disambiguate utterances via context or exploiting semantic similarities across languages. However, the evaluation of multilingual SLU remains limited to shallow tasks such as intent classification or language identification. To address this, we present Fleurs-SLU, a multilingual SLU benchmark that encompasses (i) 692 hours of speech for topical utterance classification in 102 languages and (ii) multiple-choice question answering through listening comprehension spanning 944 hours of speech across 92 languages. We extensively evaluate both end-to-end speech classification models and cascaded systems that combine speech-to-text transcription with subsequent classification by large language models on Fleurs-SLU. Our results show that cascaded systems exhibit greater robustness in multilingual SLU tasks, though speech encoders can achieve competitive performance in topical speech classification when appropriately pre-trained. We further find a strong correlation between robust multilingual ASR, effective speech-to-text translation, and strong multilingual SLU, highlighting the mutual benefits between acoustic and semantic speech representations.
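As a hedged illustration of the cascaded systems evaluated on Fleurs-SLU, the sketch below chains off-the-shelf speech-to-text with a text classifier. All model names and the topic label set are assumptions, not the benchmark's actual configuration; the paper's cascade uses large language models for the classification step, for which the zero-shot NLI classifier here is only a lightweight stand-in.

```python
# Minimal sketch of a cascaded SLU pipeline: ASR transcription followed by
# text-based topic classification. Models and labels are placeholders.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3")
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

TOPICS = ["politics", "sports", "science", "health"]  # hypothetical label set

def classify_utterance(audio_path: str) -> str:
    """Transcribe an utterance, then classify the transcript's topic."""
    transcript = asr(audio_path)["text"]           # speech-to-text step
    scores = classifier(transcript, candidate_labels=TOPICS)
    return scores["labels"][0]                     # highest-scoring topic
```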
Open Problems in Machine Unlearning for AI Safety
Fazl Barez
Tingchen Fu
Ameya Prabhu
Stephen Casper
Amartya Sanyal
Adel Bibi
Aidan O'Gara
Robert Kirk
Benjamin Bucknall
Tim Fist
Luke Ong
Philip H. S. Torr
Kwok-Yan Lam
Robert F. Trager
Sören Mindermann
Jose Hernandez-Orallo
Mor Geva
Yarin Gal
As AI systems become more capable, widely deployed, and increasingly autonomous in critical areas such as cybersecurity, biological research, and healthcare, ensuring their safety and alignment with human values is paramount. Machine unlearning -- the ability to selectively forget or suppress specific types of knowledge -- has shown promise for privacy and data removal tasks, which has been the primary focus of existing research. More recently, its potential application to AI safety has gained attention. In this paper, we identify key limitations that prevent unlearning from serving as a comprehensive solution for AI safety, particularly in managing dual-use knowledge in sensitive domains like cybersecurity and chemical, biological, radiological, and nuclear (CBRN) safety. In these contexts, information can be both beneficial and harmful, and models may combine seemingly harmless information for harmful purposes -- unlearning this information could strongly affect beneficial uses. We provide an overview of inherent constraints and open problems, including the broader side effects of unlearning dangerous knowledge, as well as previously unexplored tensions between unlearning and existing safety mechanisms. Finally, we investigate challenges related to evaluation, robustness, and the preservation of safety features during unlearning. By mapping these limitations and open challenges, we aim to guide future research toward realistic applications of unlearning within a broader AI safety framework, acknowledging its limitations and highlighting areas where alternative approaches may be required.
Soup to go: mitigating forgetting during continual learning with model averaging
Anat Kleiman
Jonathan Frankle
Sham M. Kakade
Mansheej Paul
In continual learning, where task data arrives in a sequence, fine-tuning on later tasks will often lead to performance degradation on earlier tasks. This is especially pronounced when these tasks come from diverse domains. In this setting, how can we mitigate catastrophic forgetting of earlier tasks and retain what the model has learned with minimal computational expense? Inspired by other merging methods and L2 regression, we propose Sequential Fine-tuning with Averaging (SFA), a method that merges the currently training model with earlier checkpoints during the course of training. SOTA approaches typically maintain a data buffer of past tasks or impose a penalty at each gradient step. In contrast, our method achieves comparable results without the need to store past data or multiple copies of parameters for each gradient step. Furthermore, our method outperforms common merging techniques such as Task Arithmetic, TIES Merging, and WiSE-FT, as well as other penalty methods like L2 and Elastic Weight Consolidation. In turn, our method offers insight into the benefits of merging partially-trained models during training across both image and language domains.
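A minimal sketch of the averaging idea behind SFA, assuming a standard PyTorch training loop; the merge interval and the averaging weight beta are illustrative hyperparameters, not the paper's exact recipe.

```python
import copy
import torch

def sfa_finetune(model, loader, optimizer, loss_fn, merge_every=100, beta=0.5):
    """Fine-tune on the current task, periodically averaging the live weights
    with the pre-task checkpoint. merge_every and beta are illustrative."""
    checkpoint = copy.deepcopy(model.state_dict())  # weights before this task
    for step, (x, y) in enumerate(loader, start=1):
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()
        if step % merge_every == 0:
            with torch.no_grad():
                for name, param in model.named_parameters():
                    # interpolate live weights toward the earlier checkpoint
                    param.mul_(beta).add_(checkpoint[name], alpha=1 - beta)
    return model
```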
GNN-based Decentralized Perception in Multirobot Systems for Predicting Worker Actions
Ali Imran
David St-Onge
In industrial environments, predicting human actions is essential for ensuring safe and effective collaboration between humans and robots. This paper introduces a perception framework that enables mobile robots to understand and share information about human actions in a decentralized way. The framework first allows each robot to build a spatial graph representing its surroundings, which it then shares with other robots. This shared spatial data is combined with temporal information to track human behavior over time. A swarm-inspired decision-making process is used to ensure all robots agree on a unified interpretation of the human's actions. Results show that adding more robots and incorporating longer time sequences improve prediction accuracy. Additionally, the consensus mechanism increases system resilience, making the multi-robot setup more reliable in dynamic industrial settings.
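The agreement step could look like the following sketch, which substitutes simple averaging of per-robot class distributions for the paper's swarm-inspired decision-making process; the robot count and action classes are invented for illustration.

```python
import numpy as np

def consensus_action(local_predictions):
    """local_predictions: one softmax vector over action classes per robot."""
    pooled = np.mean(local_predictions, axis=0)  # average the robots' distributions
    return int(np.argmax(pooled))                # team-level action label

# e.g., three robots voting over four hypothetical action classes
votes = [np.array([0.6, 0.2, 0.1, 0.1]),
         np.array([0.5, 0.3, 0.1, 0.1]),
         np.array([0.2, 0.5, 0.2, 0.1])]
print(consensus_action(votes))  # -> 0
```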
Top-down feedback matters: Functional impact of brainlike connectivity motifs on audiovisual integration
Mashbayar Tugsbayar
Mingze Li
Artificial neural networks (ANNs) are an important tool for studying neural computation, but many features of the brain are not captured by standard ANN architectures. One notable missing feature in most ANN models is top-down feedback, i.e. projections from higher-order layers to lower-order layers in the network. Top-down feedback is ubiquitous in the brain, and it has a unique modulatory impact on activity in neocortical pyramidal neurons. However, we still do not understand its computational role. Here we develop a deep neural network model that captures the core functional properties of top-down feedback in the neocortex, allowing us to construct hierarchical recurrent ANN models that more closely reflect the architecture of the brain. We use this to explore the impact of different hierarchical recurrent architectures on an audiovisual integration task. We find that certain hierarchies, namely those that mimic the architecture of the human brain, impart ANN models with a light visual bias similar to that seen in humans. This bias does not impair performance on the audiovisual tasks. The results further suggest that different configurations of top-down feedback make otherwise identically connected models functionally distinct from each other, and from traditional feedforward-only models. Altogether, our findings demonstrate that modulatory top-down feedback is a computationally relevant feature of biological brains, and that incorporating it into ANNs can affect their behavior and help to determine the solutions that the network can discover.
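One way to realize modulatory top-down feedback in code is to let the top-down signal rescale feedforward activity rather than drive it directly; the sketch below is an assumption-laden illustration of that gating idea, not the paper's architecture, and the layer sizes and gating form are invented.

```python
import torch
import torch.nn as nn

class ModulatedLayer(nn.Module):
    """A layer whose top-down input gates, but does not drive, activity."""
    def __init__(self, in_dim, hidden_dim, top_dim):
        super().__init__()
        self.ff = nn.Linear(in_dim, hidden_dim)   # feedforward drive
        self.fb = nn.Linear(top_dim, hidden_dim)  # top-down projection

    def forward(self, x, top_down=None):
        drive = torch.relu(self.ff(x))
        if top_down is None:
            return drive                           # feedforward-only behavior
        gain = torch.sigmoid(self.fb(top_down))    # multiplicative gate in (0, 1)
        return drive * (1 + gain)                  # feedback amplifies existing drive
```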
Adaptive Experiments Under High-Dimensional and Data Sparse Settings: Applications for Educational Platforms
Haochen Song
Ilya Musabirov
Ananya Bhattacharjee
Meredith Franklin
Anna Rafferty
Joseph Jay Williams
In online educational platforms, adaptive experiment designs play a critical role in personalizing learning pathways, instructional sequencing, and content recommendations. Traditional adaptive policies, such as Thompson Sampling, struggle with scalability in high-dimensional and sparse settings, such as when there is a large number of treatments (arms) but limited resources, including funding, time, and classroom-constrained student sample sizes. Furthermore, under-exploration in large-scale educational interventions can lead to suboptimal learning recommendations. To address these challenges, we build upon the concept of lenient regret, which tolerates limited suboptimal selections to enhance exploratory learning, and propose a framework for determining the feasible number of treatments given a sample size. We illustrate these ideas with a case study in online educational learnersourcing, where adaptive algorithms dynamically allocate peer-crafted interventions to other students during active recall exercises. Our proposed Weighted Allocation Probability Adjusted Thompson Sampling (WAPTS) algorithm enhances the efficiency of treatment allocation by adjusting sampling weights to balance exploration and exploitation in data-sparse environments. We present comparative evaluations of WAPTS across various sample sizes (N=50, 300, 1000) and treatment conditions, demonstrating its ability to mitigate under-exploration while optimizing learning outcomes.
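A hedged sketch of the sampling-weight adjustment in a Beta-Bernoulli bandit follows; the tempering rule applied to the Thompson Sampling allocation probabilities is an illustrative stand-in, since the abstract does not spell out WAPTS's exact update.

```python
import numpy as np

rng = np.random.default_rng(0)

def wapts_choose(successes, failures, n_mc=1000, weight=0.5):
    """Pick an arm; successes/failures are per-arm count arrays.
    weight < 1 flattens the allocation so under-explored arms keep traffic."""
    n_arms = len(successes)
    draws = rng.beta(1 + successes, 1 + failures, size=(n_mc, n_arms))
    # Monte Carlo estimate of standard Thompson Sampling allocation probabilities
    alloc = np.bincount(draws.argmax(axis=1), minlength=n_arms) / n_mc
    adjusted = alloc ** weight        # temper toward uniform (assumed rule)
    adjusted /= adjusted.sum()
    return int(rng.choice(n_arms, p=adjusted))

# e.g., three treatments with sparse data so far
print(wapts_choose(np.array([3, 1, 0]), np.array([2, 1, 0])))
```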
Galaxy cluster characterization with machine learning techniques
Maria Sadikov
Julie Hlavacek-Larrondo
C. Rhea
Michael McDonald
Michelle Ntampaka
John ZuHone
We present an analysis of the X-ray properties of the galaxy cluster population in the z=0 snapshot of the IllustrisTNG simulations, utilizing machine learning techniques to perform clustering and regression tasks. We examine five properties of the hot gas (the central cooling time, the central electron density, the central entropy excess, the concentration parameter, and the cuspiness) which are commonly used as classification metrics to identify cool core (CC), weak cool core (WCC), and non-cool core (NCC) clusters of galaxies. Using mock Chandra X-ray images as inputs, we first explore an unsupervised clustering scheme to see how the resulting groups correlate with the CC/WCC/NCC classification based on the different criteria. We observe that the groups replicate almost exactly the separation of the galaxy cluster images when classifying them based on the concentration parameter. We then move on to a regression task, utilizing a ResNet model to predict the value of all five properties. The network is able to achieve a mean percentage error of 1.8% for the central cooling time, and a balanced accuracy of 0.83 on the concentration parameter, making them the best-performing metrics. Finally, we use simulation-based inference (SBI) to extract posterior distributions for the network predictions. Our neural network simultaneously predicts all five classification metrics using only mock Chandra X-ray images. This study demonstrates that machine learning is a viable approach for analyzing and classifying the large galaxy cluster datasets that will soon become available through current and upcoming X-ray surveys, such as eROSITA.
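The regression component might be configured as in the sketch below, which adapts a stock ResNet to single-channel X-ray inputs and a five-dimensional output head; the backbone depth and input resolution are assumptions, not the paper's reported setup.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

model = resnet18(weights=None)
model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)  # 1-channel X-ray input
model.fc = nn.Linear(model.fc.in_features, 5)  # one output per cluster property

x = torch.randn(8, 1, 224, 224)  # batch of mock Chandra-like images
preds = model(x)                 # shape: (8, 5)
```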