Publications

Scenes partitioning and annotations of Super Mario Bros. levels
Yann Harel
Basile Pinsard
Estimating Individual Tree Height and Species from UAV Imagery
Accurate estimation of forest biomass, a major carbon sink, relies heavily on tree-level traits such as height and species. Unoccupied Aerial Vehicles (UAVs) capturing high-resolution imagery from a single RGB camera offer a cost-effective and scalable approach for mapping and measuring individual trees. We introduce BIRCH-Trees, the first benchmark for individual tree height and species estimation from tree-centered UAV images, spanning three datasets: temperate forests, tropical forests, and boreal plantations. We also present DINOvTree, a unified approach using a Vision Foundation Model (VFM) backbone with task-specific heads for simultaneous height and species prediction. Through extensive evaluations on BIRCH-Trees, we compare DINOvTree against commonly used vision methods, including VFMs, as well as biological allometric equations. We find that DINOvTree achieves top overall results with accurate height predictions and competitive classification accuracy while using only 54% to 58% of the parameters of the second-best approach.
Multitask-Informed Prior for In-Context Learning on Tabular Data: Application to Steel Property Prediction
Bahareh Nikpour
Jack Y. Wei
Sushant Sinha
Xiaoping Ma
Kashif Rehman
Stephen Yue
Accurate prediction of mechanical properties of steel during hot rolling processes, such as Thin Slab Direct Rolling (TSDR), remains challenging due to complex interactions among chemical compositions, processing parameters, and resultant microstructures. Traditional empirical and experimental methodologies, while effective, are often resource-intensive and lack adaptability to varied production conditions. Moreover, most existing approaches do not explicitly leverage the strong correlations among key mechanical properties, missing an opportunity to improve predictive accuracy through multitask learning. To address this, we present a multitask learning framework that injects multitask awareness into the prior of TabPFN--a transformer-based foundation model for in-context learning on tabular data--through novel fine-tuning strategies. TabPFN was originally designed for single-target regression or classification, so we augment its prior with two complementary approaches: (i) target averaging, which provides a unified scalar signal compatible with TabPFN's single-target architecture, and (ii) task-specific adapters, which introduce task-specific supervision during fine-tuning. These strategies jointly guide the model toward a multitask-informed prior that captures cross-property relationships among key mechanical metrics. Extensive experiments on an industrial TSDR dataset demonstrate that our multitask adaptations outperform classical machine learning methods and recent state-of-the-art tabular learning models across multiple evaluation metrics. Notably, our approach enhances both predictive accuracy and computational efficiency compared to task-specific fine-tuning, demonstrating that multitask-aware prior adaptation enables foundation models for tabular data to deliver scalable, rapid, and reliable deployment for automated industrial quality control and process optimization in TSDR.
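As a rough illustration of the target-averaging idea mentioned in this abstract, the sketch below collapses several property targets into one scalar per sample. It assumes each target column is z-scored before averaging so that properties on different scales contribute equally. That normalization, the function name `average_targets`, the column meanings, and the toy values are all assumptions made for illustration and are not taken from the paper.

```python
from statistics import mean, pstdev


def average_targets(targets):
    """Collapse multi-target rows into one scalar per row.

    targets: list of rows, each a list of property values
    (hypothetical columns, e.g. two strengths and an elongation).
    Assumption: every column is z-scored, then each row is averaged.
    """
    columns = list(zip(*targets))
    stats = []
    for col in columns:
        mu = mean(col)
        sigma = pstdev(col) or 1.0  # avoid dividing by zero on constant columns
        stats.append((mu, sigma))
    return [
        mean((value - mu) / sigma for value, (mu, sigma) in zip(row, stats))
        for row in targets
    ]


# Toy values invented for illustration only.
toy = [[400.0, 520.0, 25.0], [450.0, 560.0, 22.0], [500.0, 600.0, 19.0]]
scalars = average_targets(toy)
```

The resulting list holds one value per sample and could serve as the single regression target that a single-target model expects. How the paper actually normalizes or weights the targets is not stated in the abstract.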
CanViT: Toward Active-Vision Foundation Models
Active computer vision promises efficient, biologically plausible perception through sequential, localized glimpses, but lacks scalable general-purpose architectures and pretraining pipelines. As a result, Active-Vision Foundation Models (AVFMs) have remained unexplored. We introduce CanViT, the first task- and policy-agnostic AVFM. CanViT uses scene-relative RoPE to bind a retinotopic Vision Transformer backbone and a spatiotopic scene-wide latent workspace, the canvas. Efficient interaction with this high-capacity working memory is supported by Canvas Attention, a novel asymmetric cross-attention mechanism. We decouple thinking (backbone-level) and memory (canvas-level), eliminating canvas-side self-attention and fully-connected layers to achieve low-latency sequential inference and scalability to large scenes. We propose a label-free active vision pretraining scheme, policy-agnostic passive-to-active dense latent distillation: reconstructing scene-wide DINOv3 embeddings from sequences of low-resolution glimpses with randomized locations, zoom levels, and lengths. We pretrain CanViT-B from a random initialization on 13.2 million ImageNet-21k scenes -- an order of magnitude more than previous active models -- and 1 billion random glimpses, in 166 hours on a single H100. On ADE20K segmentation, a frozen CanViT-B achieves 38.5% mIoU in a single low-resolution glimpse, outperforming the best active model's 27.6% with 19.5x fewer inference FLOPs and no fine-tuning, as well as its FLOP- or input-matched DINOv3 teacher. Given additional glimpses, CanViT-B reaches 45.9% ADE20K mIoU. On ImageNet-1k classification, CanViT-B reaches 81.2% top-1 accuracy with frozen teacher probes. CanViT generalizes to longer rollouts, larger scenes, and new policies. Our work closes the wide gap between passive and active vision on semantic segmentation and demonstrates the potential of AVFMs as a new research axis.
MIOFlow 2.0: A unified framework for inferring cellular stochastic dynamics from single cell and spatial transcriptomics data
Xingzhi Sun
João Felipe Rocha
Brett Phelan
Dhananjay Bhaskar
Yanlei Zhang
D. S. Magruder
Ke Xu
Oluwadamilola Fasina
Mark Gerstein
Natalia Ivanova
Christine L. Chaffer
Understanding cellular trajectories via time-resolved single-cell transcriptomics is vital for studying development, regeneration, and disease. A key challenge is inferring continuous trajectories from discrete snapshots. Biological complexity stems from stochastic cell fate decisions, temporal proliferation changes, and spatial environmental influences. Current methods often use deterministic interpolations treating cells in isolation, failing to capture the probabilistic branching, population shifts, and niche-dependent signaling driving real biological processes. We introduce Manifold Interpolating Optimal-Transport Flow (MIOFlow) 2.0. This framework learns biologically informed cellular trajectories by integrating manifold learning, optimal transport, and neural differential equations. It models three core processes: (1) stochasticity and branching via Neural Stochastic Differential Equations; (2) non-conservative population changes using a learned growth-rate model initialized with unbalanced optimal transport; and (3) environmental influence through a joint latent space unifying gene expression with spatial features like local cell type composition and signaling. By operating in a PHATE-distance matching autoencoder latent space, MIOFlow 2.0 ensures trajectories respect the data's intrinsic geometry. Empirical comparisons show expressive trajectory learning via neural differential equations outperforms existing generative models, including simulation-free flow matching. Validated on synthetic datasets, embryoid body differentiation, and spatially resolved axolotl brain regeneration, MIOFlow 2.0 improves trajectory accuracy and reveals hidden drivers of cellular transitions, like specific signaling niches. MIOFlow 2.0 thus bridges single-cell and spatial transcriptomics to uncover tissue-scale trajectories.
Test-Time Adaptation via Cache Personalization for Facial Expression Recognition in Videos
Masoumeh Sharafi
Muhammad Zeeshan
Soufiane Belharbi
Alessandro L. Koerich
Eric Granger
Facial expression recognition (FER) in videos requires model personalization to capture the considerable variations across subjects. Vision-language models (VLMs) offer strong transfer to downstream tasks through image-text alignment, but their performance can still degrade under inter-subject distribution shifts. Personalizing models using test-time adaptation (TTA) methods can mitigate this challenge. However, most state-of-the-art TTA methods rely on unsupervised parameter optimization, introducing computational overhead that is impractical in many real-world applications. This paper introduces TTA through Cache Personalization (TTA-CaP), a cache-based TTA method that enables cost-effective (gradient-free) personalization of VLMs for video FER. Prior cache-based TTA methods rely solely on dynamic memories that store test samples, which can accumulate errors and drift due to noisy pseudo-labels. TTA-CaP leverages three coordinated caches: a personalized source cache that stores source-domain prototypes, a positive target cache that accumulates reliable subject-specific samples, and a negative target cache that stores low-confidence cases as negative samples to reduce the impact of noisy pseudo-labels. Cache updates and replacement are controlled by a tri-gate mechanism based on temporal stability, confidence, and consistency with the personalized cache. Finally, TTA-CaP refines predictions through fusion of embeddings, yielding refined representations that support temporally stable video-level predictions. Our experiments on three challenging video FER datasets, BioVid, StressID, and BAH, indicate that TTA-CaP can outperform state-of-the-art TTA methods under subject-specific and environmental shifts, while maintaining low computational and memory overhead for real-world deployment.
Antibody discovery technology: innovation and outlook from classic to leading edge
PengFei WANG
Gamer in the scanner: Event-related analysis of fMRI activity during retro videogame play guided by automated annotations of game content
Yann Harel
Basile Pinsard
Julie A. Boyle
Valentina Borghesani
Paul-Henri Mignot
André Cyr
In recent years, videogames have gathered interest in cognitive neuroscience for their potential to study cognition in dynamical and naturalistic contexts. Yet, the complexity of game environments often challenges traditional modeling approaches, and current annotation methods, typically manual or based on modified games, remain labor-intensive and limited in scope. Here, we introduce a flexible and scalable framework using the gym-retro Python library to emulate a classic action-platformer, Shinobi III: Return of the Ninja Master (Sega, 1993), and automatically annotate gameplay events directly from the game’s memory states. This setup enables the identification of both player actions (e.g., jumping, hitting) and feedback events (e.g., killing an enemy, being hit), without modifying the game. Four individuals played the videogame for a combined total of 32 hours (>7 hours each) while undergoing functional magnetic resonance imaging (fMRI). Resulting activation maps revealed distributed engagement of visual, motor, executive, and limbic systems, consistent with the cognitive demands of gameplay. Within-participant reproducibility of brain responses across sessions was robust across event types (r ≈ .25–.55), with some consistency observed even for rarer events like HealthLoss. Between-participant correlations were notably lower, reflecting participant-specific neural signatures. Multivoxel pattern analysis showed that brain responses to different in-game events were highly discriminable, with classification accuracy typically around or above 90%, though occasionally dropping to ~40% for less frequent events. These findings demonstrate that automated emulator-based annotations enable robust, interpretable, and scalable mapping of naturalistic cognitive processes using commercial videogames.
Ca2+ transient detection and segmentation with the Astronomically motivated algorithm for Background Estimation And Transient Segmentation (Astro-BEATS)
Bi Fan
Anthony Bilodeau
Theresa Wiesner
Renée Hložek
Fluorescence-based Ca…
Distinct SMA beta bursts support the development of anticipatory postural control in children
Viktoriya O. Manyukhina
Fanny Barlaam
Judith Vergne
Anaëlle Bain
Oussama Abdoun
Sébastien Daligault
Claude Delpuech
Sandrine Sonié
Mathilde Bonnefond
C. Schmitz
To compensate for self-generated movement-induced postural disturbances, the brain generates anticipatory postural adjustments (APA), ensuring smooth, coordinated actions. APA development continues into late adolescence, yet the specific pathways and mechanisms that remain immature in children are poorly understood. We studied APA mechanisms in 24 children (7-12 years old) using magnetoencephalography (MEG) while they performed the naturalistic bimanual load-lifting task (BLLT). In the BLLT, participants lift a load placed on one forearm with the contralateral hand while keeping the postural forearm horizontal, as if lifting a glass from a tray. To counteract forearm deflection caused by unloading, the brain generates APAs, which involve anticipatory inhibition of the postural Biceps brachii. We found that stronger anticipatory Biceps brachii inhibition was associated with reduced excitability, as indexed by high-gamma (90-130 Hz) suppression, and increased high-beta power (19-29 Hz) in the contralateral Supplementary Motor Area (SMA). Analysis of transient beta events revealed two functionally distinct burst types: (1) 19-24 Hz bursts: time-locked to immediate high-gamma suppression correlated with 26-28 Hz beta power; predicted stronger muscle inhibition and received directed input from middle frontal cortex and precentral gyrus; (2) 24-29 Hz bursts: linked to delayed (∼100 ms) high-gamma suppression correlated with 8 Hz alpha power; predicted earlier and prolonged muscle inhibition and better forearm stabilization, but did not show directional influence from other regions. Results on anticipatory inhibition-related beta bursts replicated mechanisms reported in adults, suggesting that the efferent pathways and transient inhibitory processes underlying APA may already be mature in children. In contrast, higher-frequency beta bursts revealed a child-specific, complementary APA mechanism that may compensate for imprecise anticipatory inhibition. These results reveal two oscillatory mechanisms supporting APA in children and indicate that beta bursts may reflect both immediate cortical inhibition linked to muscle control and indirect alpha-mediated inhibition likely compensating for forearm instability.
ICLAD: In-Context Learning for Unified Tabular Anomaly Detection Across Supervision Regimes
Jack Yi Wei
Anomaly detection on tabular data is commonly studied under three supervision regimes, including one-class settings that assume access to anomaly-free training samples, fully unsupervised settings with unlabeled and potentially contaminated training data, and semi-supervised settings with limited anomaly labels. Existing deep learning approaches typically train dataset-specific models under the assumption of a single supervision regime, which limits their ability to leverage shared structures across anomaly detection tasks and to adapt to different supervision levels. We propose ICLAD, an in-context learning foundation model for tabular anomaly detection that generalizes across both datasets and supervision regimes. ICLAD is trained via meta-learning on synthetic tabular anomaly detection tasks, and at inference time, the model assigns anomaly scores by conditioning on the training set without updating model weights. Comprehensive experiments on 57 tabular datasets from ADBench show that our method achieves state-of-the-art performance across three supervision regimes, establishing a unified framework for tabular anomaly detection.
Listen First, Then Answer: Timestamp-Grounded Speech Reasoning
Large audio-language models (LALMs) can generate reasoning chains for their predictions, but it remains unclear whether these reasoning chains remain grounded in the input audio. In this paper, we propose an RL-based strategy that grounds the reasoning outputs of LALMs with explicit timestamp annotations referring to relevant segments of the audio signal. Our analysis shows that timestamp grounding leads the model to attend more strongly to audio tokens during reasoning generation. Experiments on four speech-based benchmark datasets demonstrate that our approach improves performance compared to both zero-shot reasoning and fine-tuning without timestamp grounding. Additionally, grounding amplifies desirable reasoning behaviors, such as region exploration, audiology verification, and consistency, underscoring the importance of grounding mechanisms for faithful multimodal reasoning.