Training Compute-Optimal Vision Transformers for Brain Encoding
Sana Ahmadi
François Paugam
Tristan Glatard
The optimal training of a vision transformer for brain encoding depends on three factors: model size, data size, and computational resources. This study investigates these three pillars, focusing on the effects of data scaling, model scaling, and high-performance computing on brain encoding results. Using VideoGPT to extract efficient spatiotemporal features from videos and training a Ridge model to predict brain activity from these features, we conducted benchmark experiments with varying data sizes (10k, 100k, 1M, 6M) and different model configurations of GPT-2, including hidden layer dimensions, number of layers, and number of attention heads. We also evaluated the effects of training models with 32-bit vs 16-bit floating-point representations. Our results demonstrate that increasing the hidden layer dimension significantly improves brain encoding performance, as evidenced by higher Pearson correlation coefficients across all subjects. In contrast, the number of attention heads has no significant effect on the encoding results. Increasing the number of layers yields some improvement in brain encoding correlations, but the trend is less consistent than that observed for hidden layer dimensions. The data scaling results show that larger training datasets improve brain encoding performance, with the highest Pearson correlation coefficients observed for the largest dataset size (6M). These findings indicate that data scaling has a more pronounced effect than model scaling on brain encoding performance. Finally, we explored the impact of floating-point precision: training with 16-bit precision yielded the same brain encoding accuracy as 32-bit while reducing training time by a factor of 1.17, demonstrating its efficiency for high-performance computing tasks.
BlabberSeg: Real-Time Embedded Open-Vocabulary Aerial Segmentation
Haechan Mark Bong
Ricardo de Azambuja
Real-time aerial image segmentation plays an important role in the environmental perception of Uncrewed Aerial Vehicles (UAVs). We introduce BlabberSeg, an optimized Vision-Language Model built on CLIPSeg for on-board, real-time processing of aerial images by UAVs. BlabberSeg improves the efficiency of CLIPSeg by reusing prompt and model features, reducing computational overhead while achieving real-time open-vocabulary aerial segmentation. We validated BlabberSeg in a safe landing scenario using the Dynamic Open-Vocabulary Enhanced SafE-Landing with Intelligence (DOVESEI) framework, which uses visual servoing and open-vocabulary segmentation. BlabberSeg reduces computational costs significantly, with a speed increase of 927.41% (16.78 Hz) on an NVIDIA Jetson Orin AGX (64GB) compared with the original CLIPSeg (1.81 Hz), achieving real-time aerial segmentation with negligible loss in accuracy (2.1%, measured as the ratio of correctly segmented area relative to CLIPSeg). BlabberSeg's source code is open and available online.
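The feature-reuse idea behind BlabberSeg's speedup can be illustrated with a simple caching pattern: text-prompt embeddings are computed once and reused across frames, so each new frame only pays for the image branch. The function names below (`embed_text`, `segment`) are hypothetical stand-ins, not CLIPSeg's or BlabberSeg's actual API.

```python
from functools import lru_cache
import hashlib

@lru_cache(maxsize=128)
def embed_text(prompt: str) -> tuple:
    # Stand-in for an expensive text-encoder forward pass.
    h = hashlib.sha256(prompt.encode()).digest()
    return tuple(b / 255.0 for b in h[:8])

def segment(frame, prompts):
    # Prompt features come from the cache after the first frame;
    # only image features would need recomputing per frame.
    text_feats = [embed_text(p) for p in prompts]
    return [(p, f) for p, f in zip(prompts, text_feats)]

prompts = ("grass", "asphalt", "water")
for frame_id in range(3):
    out = segment(frame_id, prompts)

print(embed_text.cache_info())  # 3 misses (one per prompt), 6 hits
```

With a fixed open-vocabulary prompt set, as in the safe-landing scenario, the text encoder runs once per prompt rather than once per frame.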
Mechanistic Unlearning: Robust Knowledge Unlearning and Editing via Mechanistic Localization
Phillip Huang Guo
Aaquib Syed
Abhay Sheshadri
Aidan Ewart
The Non-Local Model Merging Problem: Permutation Symmetries and Variance Collapse
Ekansh Sharma
Daniel M. Roy
Model merging aims to efficiently combine the weights of multiple expert models, each trained on a specific task, into a single multi-task model with strong performance across all tasks. When applied to all but the last layer of weights, existing methods -- such as Task Arithmetic, TIES-merging, and TALL mask merging -- work well to combine expert models obtained by fine-tuning a common foundation model, operating within a "local" neighborhood of the foundation model. This work explores the more challenging scenario of "non-local" merging, which we find arises when an expert model changes significantly during pretraining or when the expert models do not even share a common foundation model. We observe that standard merging techniques often fail to generalize effectively in this non-local setting, even when accounting for permutation symmetries using standard techniques. We identify that this failure is, in part, due to "variance collapse", a phenomenon identified also in the setting of linear mode connectivity by Jordan et al. (2023). To address this, we propose a multi-task technique to re-scale and shift the output activations of the merged model for each task, aligning its output statistics with those of the corresponding task-specific expert models. Our experiments demonstrate that this correction significantly improves the performance of various model merging approaches in non-local settings, providing a strong baseline for future research on this problem.
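The proposed correction re-scales and shifts the merged model's output activations, per task, to match the expert's output statistics. A minimal sketch of that affine correction, with illustrative shapes and names (the paper's exact per-layer placement is not reproduced here):

```python
import numpy as np

def match_output_statistics(merged_acts, expert_acts, eps=1e-8):
    """Affine-correct merged-model activations so their per-feature mean and
    std match those of the task-specific expert (applied one task at a time)."""
    mu_m, sd_m = merged_acts.mean(0), merged_acts.std(0)
    mu_e, sd_e = expert_acts.mean(0), expert_acts.std(0)
    scale = sd_e / (sd_m + eps)   # undo the variance collapse
    shift = mu_e - scale * mu_m
    return merged_acts * scale + shift

rng = np.random.default_rng(0)
expert = 3.0 * rng.standard_normal((1000, 16)) + 2.0  # healthy spread
merged = 0.2 * rng.standard_normal((1000, 16))        # collapsed variance
corrected = match_output_statistics(merged, expert)

print(np.allclose(corrected.mean(0), expert.mean(0)))  # True
print(np.allclose(corrected.std(0), expert.std(0)))    # True
```

The correction leaves the merged weights untouched; only the output statistics are aligned with each expert, which is what makes it cheap to apply per task.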
WorldCuisines: A Massive-Scale Benchmark for Multilingual and Multicultural Visual Question Answering on Global Cuisines
Genta Indra Winata
Frederikus Hudi
Patrick Amadeus Irawan
David Anugraha
Rifki Afina Putri
Yutong Wang
Adam Nohejl
Ubaidillah Ariq Prathama
Nedjma OUSIDHOUM
Afifa Amriani
Anar Rzayev
Anirban Das
Ashmari Pramodya
Aulia Adila
Bryan Wilie
Candy Olivia Mawalim
Ching Lam Cheng
Daud Abolade
Emmanuele Chersoni
Enrico Santus
Fariz Ikhwantri
Garry Kuwanto
Hanyang Zhao
Haryo Akbarianto Wibowo
Holy Lovenia
Jan Christian Blaise Cruz
Jan Wira Gotama Putra
Junho Myung
Lucky Susanto
Maria Angelica Riera Machin
Marina Zhukova
Michael Anugraha
Muhammad Farid Adilazuarda
Natasha Santosa
Peerat Limkonchotiwat
Raj Dabre
Rio Alexander Audino
Samuel Cahyawijaya
Shi-Xiong Zhang
Stephanie Yulia Salim
Yi Zhou
Yinxuan Gui
En-Shiun Annie Lee
Shogo Okada
Ayu Purwarianti
Alham Fikri Aji
Taro Watanabe
Derry Tanti Wijaya
Alice Oh
Chong-Wah Ngo
Adversarial Bounding Boxes Generation (ABBG) Attack against Visual Object Trackers
Fatemeh Nourilenjan Nokabadi
Jean-Francois Lalonde
Adversarial perturbations aim to deceive neural networks into predicting inaccurate results. For visual object trackers, adversarial attacks have been developed to generate perturbations by manipulating the outputs. However, transformer trackers predict a specific bounding box instead of an object candidate list, which limits the applicability of many existing attack scenarios. To address this issue, we present a novel white-box approach to attack visual object trackers with transformer backbones using only one bounding box. From the tracker's predicted bounding box, we generate a list of adversarial bounding boxes and compute the adversarial loss over those bounding boxes. Experimental results demonstrate that our simple yet effective attack outperforms existing attacks against several robust transformer trackers, including TransT-M, ROMTrack, and MixFormer, on popular benchmark tracking datasets such as GOT-10k, UAV123, and VOT2022STS.
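The box-list idea can be sketched as follows: perturb the tracker's single predicted box into many candidates and score them, for example with IoU; an attack would then backpropagate a loss aggregated over this list. The jitter scheme and scales below are assumptions for illustration, not the paper's generation procedure.

```python
import numpy as np

def jitter_boxes(box, n=8, scale=0.1, seed=0):
    """box = (cx, cy, w, h); returns n perturbed copies."""
    rng = np.random.default_rng(seed)
    cx, cy, w, h = box
    noise = rng.uniform(-scale, scale, size=(n, 4))
    out = np.empty((n, 4))
    out[:, 0] = cx + noise[:, 0] * w        # shift center proportionally to size
    out[:, 1] = cy + noise[:, 1] * h
    out[:, 2] = w * (1 + noise[:, 2])       # scale width and height
    out[:, 3] = h * (1 + noise[:, 3])
    return out

def iou(a, b):
    """Intersection-over-union of two (cx, cy, w, h) boxes."""
    ax1, ay1, ax2, ay2 = a[0]-a[2]/2, a[1]-a[3]/2, a[0]+a[2]/2, a[1]+a[3]/2
    bx1, by1, bx2, by2 = b[0]-b[2]/2, b[1]-b[3]/2, b[0]+b[2]/2, b[1]+b[3]/2
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = a[2]*a[3] + b[2]*b[3] - inter
    return inter / union

pred = (50.0, 50.0, 20.0, 10.0)
candidates = jitter_boxes(pred)
scores = [iou(pred, c) for c in candidates]
print(len(candidates), all(0.0 < s <= 1.0 for s in scores))
```

The point of the candidate list is to recover the many-outputs structure that existing attacks rely on, even though the transformer tracker itself emits only one box.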
Learning to Forget using Hypernetworks
Jose Miguel Lara Rangel
Usman Anwar
Stefan Schoepf
Jack Foster
Machine unlearning is gaining increasing attention as a way to remove adversarial data poisoning attacks from already trained models and to comply with privacy and AI regulations. The objective is to unlearn the effect of undesired data from a trained model while maintaining performance on the remaining data. This paper introduces HyperForget, a novel machine unlearning framework that leverages hypernetworks (neural networks that generate parameters for other networks) to dynamically sample models that lack knowledge of targeted data while preserving essential capabilities. Leveraging diffusion models, we implement two Diffusion HyperForget Networks and use them to sample unlearned models in proof-of-concept experiments. The unlearned models obtained zero accuracy on the forget set while preserving good accuracy on the retain sets, highlighting the potential of HyperForget for dynamic targeted data removal and a promising direction for developing adaptive machine unlearning algorithms.
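The hypernetwork mechanism at the core of this framework can be sketched in miniature: a small network maps a conditioning vector (e.g. encoding which data to forget) to the full parameter vector of a target model. Everything below is illustrative; the paper's diffusion-based hypernetworks and conditioning scheme are far richer.

```python
import numpy as np

rng = np.random.default_rng(0)

cond_dim, in_dim, out_dim = 4, 8, 3
target_n_params = in_dim * out_dim + out_dim   # weights + biases

# The hypernetwork itself: a single linear map for simplicity.
H = rng.standard_normal((cond_dim, target_n_params)) * 0.1

def generate_target_params(cond):
    """Map a conditioning vector to the target model's parameters."""
    flat = cond @ H
    W = flat[: in_dim * out_dim].reshape(in_dim, out_dim)
    b = flat[in_dim * out_dim:]
    return W, b

def target_forward(x, cond):
    W, b = generate_target_params(cond)
    return x @ W + b

x = rng.standard_normal((5, in_dim))
y_retain = target_forward(x, np.array([1.0, 0, 0, 0]))  # "retain" condition
y_forget = target_forward(x, np.array([0, 1.0, 0, 0]))  # "forget" condition
print(y_retain.shape, np.allclose(y_retain, y_forget))
```

Because the target parameters are generated rather than stored, different conditioning vectors yield different models, which is what allows sampling unlearned variants on demand.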
Structure-function coupling and decoupling during movie-watching and resting-state: Novel insights bridging EEG and structural imaging
Venkatesh Subramani
Giulia Lioi
Nicolas Farrugia
The intricate structural and functional architecture of the brain enables a wide range of cognitive processes, ranging from perception and action to higher-order abstract thinking. Despite important progress, the relationship between the brain's structural and functional properties is not yet fully established. In particular, the way the brain's anatomy shapes its electrophysiological dynamics remains elusive. The electroencephalography (EEG) activity recorded during naturalistic tasks is thought to exhibit patterns of coupling with the underlying brain structure that vary as a function of behavior, yet these patterns have not been sufficiently quantified. We address this gap by jointly examining individual Diffusion-Weighted Imaging (DWI) scans and continuous EEG recorded during video-watching and resting state, using a Graph Signal Processing (GSP) framework. By decomposing the structural graph into eigenmodes and expressing the EEG activity as an extension of anatomy, GSP provides a way to quantify the structure-function coupling. We elucidate how structure shapes function during naturalistic tasks such as movie-watching and how this association is modulated by task, quantifying the coupling relationship in a region-, time-, and frequency-resolved manner. First, our findings indicate that EEG activity in the sensorimotor cortex is strongly coupled with brain structure, while activity in higher-order systems is less constrained by anatomy, i.e., shows more flexibility. In addition, we found that watching videos was associated with stronger structure-function coupling in the sensorimotor cortex, as compared to resting-state data. Second, time-resolved analysis revealed that the unimodal systems undergo minimal temporal fluctuation in structure-function association, while the transmodal systems display the highest temporal fluctuations, with the exception of the PCC, which shows low fluctuations.
Lastly, our frequency-resolved analysis revealed a consistent topography across different EEG rhythms, suggesting a similar relationship with the anatomical structure across frequency bands. This characterization of the link between structure and function using continuous EEG during naturalistic behavior underscores the role of anatomy in shaping ongoing cognitive processes. Taken together, by combining the temporal and spectral resolution of EEG with the methodological advantages of GSP, our work sheds new light on the anatomo-functional organization of the brain.
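The GSP machinery the abstract describes (structural graph eigenmodes, EEG expressed in the graph spectral domain, a coupling measure) can be sketched on toy data. The coupling index below, the fraction of signal energy in the low-frequency half of the eigenmode spectrum, follows the common structural-coupling idea, but the exact metric and preprocessing in the paper may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy structural connectome: symmetric weighted adjacency over 20 regions.
n = 20
A = rng.uniform(0, 1, (n, n))
A = (A + A.T) / 2
np.fill_diagonal(A, 0)

# Graph Laplacian and its eigenmodes (columns of U, ordered by graph frequency).
L = np.diag(A.sum(1)) - A
eigvals, U = np.linalg.eigh(L)

# Toy "EEG" signal built mostly from low-frequency (structure-aligned) modes.
s = U[:, :5] @ rng.standard_normal(5) + 0.1 * (U[:, 5:] @ rng.standard_normal(n - 5))

# Graph Fourier transform and a coupling index: energy in the low half of the
# spectrum relative to the total (values near 1 = strongly coupled to anatomy).
s_hat = U.T @ s
energy = s_hat ** 2
coupling = energy[: n // 2].sum() / energy.sum()
print(coupling)
```

Computing this index per region, per time window, and per frequency band is what yields the region-, time-, and frequency-resolved picture the study reports.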
TrackPGD: Efficient Adversarial Attack using Object Binary Masks against Robust Transformer Trackers
Fatemeh Nourilenjan Nokabadi
Yann Batiste Pequignot
Jean-Francois Lalonde
Adversarial perturbations can deceive neural networks by adding small, imperceptible noise to the input. Recent object trackers with transformer backbones have shown strong performance on tracking datasets, but their adversarial robustness has not been thoroughly evaluated. While transformer trackers are resilient to black-box attacks, existing white-box adversarial attacks are not universally applicable against these new transformer trackers due to differences in backbone architecture. In this work, we introduce TrackPGD, a novel white-box attack that utilizes predicted object binary masks to target robust transformer trackers. Built upon the powerful segmentation attack SegPGD, our proposed TrackPGD effectively influences the decisions of transformer-based trackers. Our method addresses two primary challenges in adapting a segmentation attack for trackers: limited class numbers and extreme pixel class imbalance. TrackPGD uses the same number of iterations as other attack methods for tracker networks and produces competitive adversarial examples that mislead transformer and non-transformer trackers such as MixFormerM, OSTrackSTS, TransT-SEG, and RTS on datasets including VOT2022STS, DAVIS2016, UAV123, and GOT-10k.
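Since TrackPGD builds on SegPGD, its inner loop is a projected-gradient update on the input frame. A minimal sketch of that core step, with a toy stand-in for the gradient (a real attack would backpropagate the mask-based loss, which is the paper's contribution and is not reproduced here):

```python
import numpy as np

def pgd_step(x_adv, x_clean, grad, alpha=2/255, eps=8/255):
    x_adv = x_adv + alpha * np.sign(grad)                 # signed gradient ascent
    x_adv = np.clip(x_adv, x_clean - eps, x_clean + eps)  # epsilon-ball projection
    return np.clip(x_adv, 0.0, 1.0)                       # valid pixel range

rng = np.random.default_rng(0)
x_clean = rng.uniform(0.2, 0.8, size=(4, 4))  # toy "frame"
x_adv = x_clean.copy()
for _ in range(10):
    grad = rng.standard_normal(x_clean.shape)  # stand-in for the mask-loss gradient
    x_adv = pgd_step(x_adv, x_clean, grad)

print(np.abs(x_adv - x_clean).max() <= 8/255 + 1e-12)  # perturbation stays bounded
```

The projection keeps the perturbation imperceptible regardless of how many iterations run, which is why attacks in this family are compared at a fixed iteration budget.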