Publications

AfriHate: A Multilingual Collection of Hate Speech and Abusive Language Datasets for African Languages
Shamsuddeen Hassan Muhammad
Idris Abdulmumin
Abinew Ayele
Ibrahim Ahmad
Saminu Mohammad Aliyu
Nelson Odhiambo Onyango
Lilian D. A. Wanzare
Samuel Rutunda
Lukman Jibril Aliyu
Esubalew Alemneh
Oumaima Hourrane
Hagos Gebremichael
Elyas Abdi Ismail
Meriem Beloucif
Ebrahim Chekol Jibril
Andiswa Bukula
Rooweither Mabuya
Salomey Osei
Abigail Oppong
Tadesse Belay
Tadesse Kebede Guge
Tesfa Tegegne Asfaw
Chiamaka Ijeoma Chukwuneke
Paul Rottger
Seid Muhie Yimam
Nedjma OUSIDHOUM
Hate speech and abusive language are global phenomena that need socio-cultural background knowledge to be understood, identified, and moderated. However, in many regions of the Global South, there have been several documented occurrences of (1) absence of moderation and (2) censorship due to the reliance on keyword spotting out of context. Further, high-profile individuals have frequently been at the center of the moderation process, while large and targeted hate speech campaigns against minorities have been overlooked. These limitations are mainly due to the lack of high-quality data in the local languages and the failure to include local communities in the collection, annotation, and moderation processes. To address this issue, we present AfriHate: a multilingual collection of hate speech and abusive language datasets in 15 African languages. Each instance in AfriHate is annotated by native speakers familiar with the local culture. We report the challenges related to the construction of the datasets and present various classification baseline results with and without using LLMs. The datasets, individual annotations, and hate speech and offensive language lexicons are available at https://github.com/AfriHate/AfriHate
EarthView: A Large Scale Remote Sensing Dataset for Self-Supervision
Diego Velazquez
Pau Rodriguez
Sergio Alonso
Josep M. Gonfaus
Jordi Gonzàlez
Gerardo Richarte
Javier Marin
Alexandre Lacoste
This paper presents EarthView, a comprehensive dataset specifically designed for self-supervision on remote sensing data, intended to enhance deep learning applications on Earth monitoring tasks. The dataset spans 15 terapixels of global remote-sensing data, combining imagery from a diverse range of sources, including NEON, Sentinel, and a novel release of 1m spatial resolution data from Satellogic. Our dataset provides a wide spectrum of image data with varying resolutions, harnessed from different sensors and organized coherently into an accessible HuggingFace dataset in parquet format. This data spans five years, from 2017 to 2022. Accompanying the dataset, we introduce EarthMAE, a tailored Masked Autoencoder, developed to tackle the distinct challenges of remote sensing data. Trained in a self-supervised fashion, EarthMAE effectively processes different data modalities such as hyperspectral, multispectral, topographical data, segmentation maps, and temporal structure. This model helps us show that pre-training on Satellogic data improves performance on downstream tasks. While there is still a gap to fill in MAE for heterogeneous data, we regard this innovative combination of an expansive, diverse dataset and a versatile model adapted for self-supervised learning as a stride forward in deep learning for Earth monitoring.
Integrating food webs in species distribution models can improve ecological niche estimation and predictions
Giovanni Poggiato
Jérémy Andréoletti
Wilfried Thuiller
Biotic interactions play a fundamental role in shaping multitrophic species communities, yet incorporating these interactions into species distribution models (SDMs) remains challenging. With the growing availability of species interaction networks, it is now feasible to integrate these interactions into SDMs for more comprehensive predictions. Here, we propose a novel framework that combines trophic interaction networks with Bayesian structural equation models, enabling each species to be modeled based on its interactions with predators or prey alongside environmental factors. This framework addresses issues of multicollinearity and error propagation, making it possible to predict species distributions in unobserved locations or under future environmental conditions, even when prey or predator distributions are unknown. We tested and validated our framework on realistic simulated communities spanning different theoretical models and ecological setups. Our approach significantly improved the estimation of both potential and realized niches compared to single SDMs, with mean performance gains of 8% and 6%, respectively. These improvements were especially notable for species strongly regulated by biotic factors, thereby enhancing model predictive accuracy. Our framework supports integration with various SDM extensions, such as occupancy and integrated models, offering flexibility and adaptability for future developments. While not a universal solution that consistently outperforms single SDMs, our approach provides a valuable new tool for modeling multitrophic community distributions when biotic interactions are known or assumed.
Symmetry-Aware Generative Modeling through Learned Canonicalization
Kusha Sareen
Daniel Levy
Arnab Kumar Mondal
Sékou-Oumar Kaba
Tara Akhound-Sadegh
Generative modeling of symmetric densities has a range of applications in AI for science, from drug discovery to physics simulations. The existing generative modeling paradigm for invariant densities combines an invariant prior with an equivariant generative process. However, we observe that this technique is not necessary and has several drawbacks resulting from the limitations of equivariant networks. Instead, we propose to model a learned slice of the density so that only one representative element per orbit is learned. To accomplish this, we learn a group-equivariant canonicalization network that maps training samples to a canonical pose and train a non-equivariant generative model over these canonicalized samples. We implement this idea in the context of diffusion models. Our preliminary experimental results on molecular modeling are promising, demonstrating improved sample quality and faster inference time.
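The idea of keeping one representative per orbit can be illustrated with a toy example. The sketch below is not the paper's method: it assumes 2D point clouds under rotation, and it swaps the learned canonicalization network for a fixed hand-written rule (rotate the centered cloud so that the point farthest from the centroid lies on the positive x-axis). The function names `rotate` and `canonicalize` are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def rotate(points: List[Point], angle: float) -> List[Point]:
    """Rotate every point counter-clockwise by `angle` radians about the origin."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y) for x, y in points]


def canonicalize(points: List[Point]) -> List[Point]:
    """Map a cloud to a canonical pose (assumption: a fixed rule, not a learned network).

    The cloud is centered on its centroid, then rotated so that the point
    farthest from the centroid lands on the positive x-axis. The rule is only
    well defined when that farthest point is unique.
    """
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    centered = [(x - cx, y - cy) for x, y in points]
    fx, fy = max(centered, key=lambda p: p[0] ** 2 + p[1] ** 2)
    return rotate(centered, -math.atan2(fy, fx))


cloud = [(0.0, 0.0), (1.0, 0.0), (0.0, 3.0)]
pose_a = canonicalize(cloud)
pose_b = canonicalize(rotate(cloud, 1.1))  # same cloud, different orientation
```

Both `pose_a` and `pose_b` agree up to floating-point error, so a non-equivariant model trained on canonicalized samples would only ever see one orientation of each cloud.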
The oneirogen hypothesis: modeling the hallucinatory effects of classical psychedelics in terms of replay-dependent plasticity mechanisms
Colin Bredenberg
Fabrice Normandin
Classical psychedelics induce complex visual hallucinations in humans, generating percepts that are coherent at a low level, but which have surreal, dream-like qualities at a high level. While there are many hypotheses as to how classical psychedelics could induce these effects, there are no concrete mechanistic models that capture the variety of observed effects in humans, while remaining consistent with the known pharmacological effects of classical psychedelics on neural circuits. In this work, we propose the “oneirogen hypothesis”, which posits that the perceptual effects of classical psychedelics are a result of their pharmacological actions inducing neural activity states that truly are more similar to dream-like states. We simulate classical psychedelics’ effects via manipulating neural network models trained on perceptual tasks with the Wake-Sleep algorithm. This established machine learning algorithm leverages two activity phases, a perceptual phase (wake) where sensory inputs are encoded, and a generative phase (dream) where the network internally generates activity consistent with stimulus-evoked responses. We simulate the action of psychedelics by partially shifting the model to the ‘Sleep’ state, which entails a greater influence of top-down connections, in line with the impact of psychedelics on apical dendrites. The effects resulting from this manipulation capture a number of experimentally observed phenomena including the emergence of hallucinations, increases in stimulus-conditioned variability, and large increases in synaptic plasticity. We further provide a number of testable predictions which could be used to validate or invalidate our oneirogen hypothesis.
AFRIDOC-MT: Document-level MT Corpus for African Languages
Jesujoba Oluwadara Alabi
Israel Abebe Azime
Miaoran Zhang
Cristina España-Bonet
Rachel Bawden
Dawei Zhu
Clement Odoje
Idris Akinade
Iffat Maab
Davis David
Shamsuddeen Hassan Muhammad
Neo Putini
David O. Ademuyiwa
Andrew Caines
Dietrich Klakow
This paper introduces AFRIDOC-MT, a document-level multi-parallel translation dataset covering English and five African languages: Amharic, Hausa, Swahili, Yorùbá, and Zulu. The dataset comprises 334 health and 271 information technology news documents, all human-translated from English to these languages. We conduct document-level translation benchmark experiments by evaluating neural machine translation (NMT) models and large language models (LLMs) for translations between English and these languages, at both the sentence and pseudo-document levels. These outputs are realigned to form complete documents for evaluation. Our results indicate that NLLB-200 achieved the best average performance among the standard NMT models, while GPT-4o outperformed general-purpose LLMs. Fine-tuning selected models led to substantial performance gains, but models trained on sentences struggled to generalize effectively to longer documents. Furthermore, our analysis reveals that some LLMs exhibit issues such as under-generation, repetition of words or phrases, and off-target translations, especially for African languages.
EPISeg: Automated segmentation of the spinal cord on echo planar images using open-access multi-center data
Rohan Banerjee
Merve Kaptan
Alexandra Tinnermann
Ali Khatibi
Alice Dabbagh
Christian W. Kündig
Csw Law
Dario Pfyffer
David J. Lythgoe
Dimitra Tsivaka
Dimitri Van De Ville
Falk Eippert
Fauziyya Muhammad
Gary H. Glover
Gergely David
Grace Haynes
Jan Haaker
Jonathan C. W. Brooks
Jürgen Finsterbusch
Katherine T. Martucci
Kimberly J. Hemmerling
Mahdi Mobarak-Abadi
Mark A. Hoggarth
Matthew A. Howard
Molly G. Bright
Nawal Kinany
O. Kowalczyk
Patrick Freund
Robert L. Barry
Sean Mackey
Shahabeddin Vahdat
Simon Schading
Stephen B McMahon
Todd Parish
Véronique Marchand-Pauvert
Yufen Chen
Zachary A. Smith
KA Weber
Benjamin De Leener
Functional magnetic resonance imaging (fMRI) of the spinal cord is relevant for studying sensation, movement, and autonomic function. Preprocessing of spinal cord fMRI data involves segmentation of the spinal cord on gradient-echo echo planar imaging (EPI) images. Current automated segmentation methods do not work well on these data, due to the low spatial resolution, susceptibility artifacts causing distortions and signal drop-out, ghosting, and motion-related artifacts. Consequently, this segmentation task demands a considerable amount of manual effort which takes time and is prone to user bias. In this work, we (i) gathered a multi-center dataset of spinal cord gradient-echo EPI with ground-truth segmentations and shared it on OpenNeuro https://openneuro.org/datasets/ds005143/versions/1.3.0, and (ii) developed a deep learning-based model, EPISeg, for the automatic segmentation of the spinal cord on gradient-echo EPI data. We observe a significant improvement in terms of segmentation quality compared to other available spinal cord segmentation models. Our model is resilient to different acquisition protocols as well as commonly observed artifacts in fMRI data. The training code is available at https://github.com/sct-pipeline/fmri-segmentation/, and the model has been integrated into the Spinal Cord Toolbox as a command-line tool.
Open Problems in Machine Unlearning for AI Safety
Fazl Barez
Tingchen Fu
Ameya Prabhu
Stephen Casper
Amartya Sanyal
Adel Bibi
Aidan O'Gara
Robert Kirk
Benjamin Bucknall
Tim Fist
Luke Ong
Philip H. S. Torr
Kwok-Yan Lam
Robert F. Trager
Sören Mindermann
Jose Hernandez-Orallo
Mor Geva
Yarin Gal
As AI systems become more capable, widely deployed, and increasingly autonomous in critical areas such as cybersecurity, biological research, and healthcare, ensuring their safety and alignment with human values is paramount. Machine unlearning -- the ability to selectively forget or suppress specific types of knowledge -- has shown promise for privacy and data removal tasks, which has been the primary focus of existing research. More recently, its potential application to AI safety has gained attention. In this paper, we identify key limitations that prevent unlearning from serving as a comprehensive solution for AI safety, particularly in managing dual-use knowledge in sensitive domains like cybersecurity and chemical, biological, radiological, and nuclear (CBRN) safety. In these contexts, information can be both beneficial and harmful, and models may combine seemingly harmless information for harmful purposes -- unlearning this information could strongly affect beneficial uses. We provide an overview of inherent constraints and open problems, including the broader side effects of unlearning dangerous knowledge, as well as previously unexplored tensions between unlearning and existing safety mechanisms. Finally, we investigate challenges related to evaluation, robustness, and the preservation of safety features during unlearning. By mapping these limitations and open challenges, we aim to guide future research toward realistic applications of unlearning within a broader AI safety framework, acknowledging its limitations and highlighting areas where alternative approaches may be required.
Soup to go: mitigating forgetting during continual learning with model averaging
Anat Kleiman
Jonathan Frankle
Sham M. Kakade
Mansheej Paul
In continual learning, where task data arrives in a sequence, fine-tuning on later tasks will often lead to performance degradation on earlier tasks. This is especially pronounced when these tasks come from diverse domains. In this setting, how can we mitigate catastrophic forgetting of earlier tasks and retain what the model has learned with minimal computational expenses? Inspired by other merging methods, and L2-regression, we propose Sequential Fine-tuning with Averaging (SFA), a method that merges currently training models with earlier checkpoints during the course of training. SOTA approaches typically maintain a data buffer of past tasks or impose a penalty at each gradient step. In contrast, our method achieves comparable results without the need to store past data, or multiple copies of parameters for each gradient step. Furthermore, our method outperforms common merging techniques such as Task Arithmetic, TIES Merging, and WiSE-FT, as well as other penalty methods like L2 and Elastic Weight Consolidation. In turn, our method offers insight into the benefits of merging partially-trained models during training across both image and language domains.
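The merging step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: parameters are a flat list of floats, the optimizer is plain gradient descent, and the names `average_params`, `avg_every`, and `beta` are hypothetical, as is the choice of a simple linear interpolation on a fixed schedule.

```python
from typing import Callable, List

Params = List[float]


def average_params(current: Params, checkpoint: Params, beta: float) -> Params:
    """Linear interpolation: `beta` weights the earlier checkpoint, (1 - beta) the current weights."""
    return [beta * old + (1.0 - beta) * new for new, old in zip(current, checkpoint)]


def sequential_finetune_with_averaging(
    params: Params,
    grad_fn: Callable[[Params], Params],
    steps: int,
    lr: float = 0.1,
    avg_every: int = 10,
    beta: float = 0.5,
) -> Params:
    """Fine-tune on a new task, periodically merging back toward the earlier checkpoint."""
    checkpoint = list(params)  # weights from before the new task (assumed to be the merge target)
    for step in range(1, steps + 1):
        grads = grad_fn(params)
        params = [p - lr * g for p, g in zip(params, grads)]
        if step % avg_every == 0:
            # Only one extra copy of the parameters is kept; no past data is stored.
            params = average_params(params, checkpoint, beta)
    return params


# Toy new task whose loss 0.5 * (p - 10)^2 pulls the single weight toward 10.
merged = sequential_finetune_with_averaging(
    [0.0], lambda p: [x - 10.0 for x in p], steps=1, lr=1.0, avg_every=1, beta=0.5
)
```

In this toy run the gradient step moves the weight from 0 to 10, and the merge with the checkpoint at 0 pulls it back to the midpoint, which is the intended trade-off between the new task and what was learned earlier.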
GNN-based Decentralized Perception in Multirobot Systems for Predicting Worker Actions
Ali Imran
David St-Onge
In industrial environments, predicting human actions is essential for ensuring safe and effective collaboration between humans and robots. This paper introduces a perception framework that enables mobile robots to understand and share information about human actions in a decentralized way. The framework first allows each robot to build a spatial graph representing its surroundings, which it then shares with other robots. This shared spatial data is combined with temporal information to track human behavior over time. A swarm-inspired decision-making process is used to ensure all robots agree on a unified interpretation of the human's actions. Results show that adding more robots and incorporating longer time sequences improve prediction accuracy. Additionally, the consensus mechanism increases system resilience, making the multi-robot setup more reliable in dynamic industrial settings.