Publications

ClustRecNet: A Novel End-to-End Deep Learning Framework for Clustering Algorithm Recommendation
Mohammadreza Bakhtyari
Renato Cordeiro De Amorim
A Guide to Robust Generalization: The Impact of Architecture, Pre-training, and Optimization Strategy
Deep learning models operating in the image domain are vulnerable to small input perturbations. For years, robustness to such perturbations was pursued by training models from scratch (i.e., with random initializations) using specialized loss objectives. Recently, robust fine-tuning has emerged as a more efficient alternative: instead of training from scratch, pretrained models are adapted to maximize predictive performance and robustness. To conduct robust fine-tuning, practitioners design an optimization strategy that includes the model update protocol (e.g., full or partial) and the specialized loss objective. Additional design choices include the architecture type and size, and the pretrained representation. These design choices affect robust generalization, which is the model's ability to maintain performance when exposed to new and unseen perturbations at test time. Understanding how these design choices influence generalization remains an open question with significant practical implications. In response, we present an empirical study spanning 6 datasets, 40 pretrained architectures, 2 specialized losses, and 3 adaptation protocols, yielding 1,440 training configurations and 7,200 robustness measurements across five perturbation types. To our knowledge, this is the most diverse and comprehensive benchmark of robust fine-tuning to date. While attention-based architectures and robust pretrained representations are increasingly popular, we find that convolutional neural networks pretrained in a supervised manner on large datasets often perform best. Our analysis both confirms and challenges prior design assumptions, highlighting promising research directions and offering practical guidance.
High-order Component Attribution via Kolmogorov-Arnold Networks
Component attribution methods provide insight into how parts of deep learning models, such as convolutional filters and attention heads, influence model predictions. Despite their successes, existing attribution approaches typically assume component effects are additive and independent, neglecting complex interactions among components. Capturing these relations between components is crucial for a better mechanistic understanding of these models. In this work, we improve component attribution (COAR) by replacing the linear counterfactual estimator with a Kolmogorov–Arnold Network (KAN) surrogate fitted to example-wise perturbation–response data. Then, a symbolic approximation of the learned KAN lets us compute mixed partial derivatives that capture and make explicit the high-order component interactions that linear methods miss. These symbolic expressions facilitate future integration with formal verification methods, enabling richer counterfactual analyses of internal model behavior. Preliminary results on standard image classification models demonstrate that our approach improves the accuracy of predicted counterfactuals and enables extraction of higher-order component interactions compared to linear attribution methods.
Robust Fine-Tuning from Non-Robust Pretrained Models: Mitigating Suboptimal Transfer With Epsilon-Scheduling
Yann Batiste Pequignot
Frédéric Precioso
Fine-tuning pretrained models is the standard approach in current machine learning practice, but simultaneously achieving robustness to adversarial examples remains a challenge. Despite the abundance of non-robust pretrained models in open-source repositories, their use for Robust Fine-Tuning (RFT) remains understudied. This work aims to bridge this knowledge gap by systematically examining RFT from such models. Our experiments reveal that fine-tuning non-robust models with a robust objective, even under small perturbations, can lead to poor performance, a phenomenon that we dub suboptimal transfer. In fact, we find that fine-tuning using a robust objective impedes task alignment at the beginning of training and eventually prevents optimal transfer. To promote optimal transfer, we propose Epsilon-Scheduling, a simple heuristic scheduling over perturbation strength. Additionally, we introduce expected robustness, a metric that measures performance across a range of perturbations. Experiments on six pretrained models and five datasets show that Epsilon-Scheduling prevents suboptimal transfer and consistently improves the expected robustness.
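As an illustration only, the two ideas named in this abstract can be sketched in a few lines of Python. The abstract does not give the shape of the schedule or the exact definition of the metric, so the linear ramp in `linear_epsilon` and the uniform average in `expected_robustness` are assumptions, and both function names are hypothetical rather than taken from the paper.

```python
from typing import Callable, Sequence


def linear_epsilon(step: int, ramp_steps: int, eps_target: float) -> float:
    """Perturbation strength at a given training step.

    Assumption: epsilon grows linearly from 0 to eps_target over
    ramp_steps, then stays constant. The paper's schedule may differ.
    """
    if ramp_steps <= 0:
        return eps_target
    progress = min(1.0, max(0, step) / ramp_steps)
    return eps_target * progress


def expected_robustness(accuracy_at: Callable[[float], float],
                        eps_grid: Sequence[float]) -> float:
    """Mean accuracy over a grid of perturbation strengths.

    Assumption: every epsilon in the grid is weighted equally.
    accuracy_at is a hypothetical callback that evaluates the model
    under perturbations of the given strength.
    """
    values = [accuracy_at(eps) for eps in eps_grid]
    return sum(values) / len(values)


if __name__ == "__main__":
    # Early steps see weak perturbations; later steps see the full budget.
    for step in (0, 5, 10, 20):
        print(step, linear_epsilon(step, ramp_steps=10, eps_target=8 / 255))
```

The intent of such a ramp is that early training steps behave close to standard fine-tuning, which matches the abstract's observation that a robust objective impedes task alignment at the beginning of training.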
Scaling Synthetic Task Generation for Agents via Exploration
Ram Ramrakhya
Andrew Szot
Omar Attia
Yuhao Yang
Anh Nguyen
Zhe Gan
Harsh Agrawal
Alexander T Toshev
Post-training Multimodal Large Language Models (MLLMs) to build interactive agents holds promise across domains such as computer-use, web navigation, and robotics. A key challenge in scaling such post-training is the lack of high-quality downstream agentic task datasets with tasks that are diverse, feasible, and verifiable. Existing approaches for task generation rely heavily on human annotation or on prompting MLLMs with limited downstream environment information, which is either costly or poorly scalable, as it yields tasks with limited coverage. To remedy this, we present AutoPlay, a scalable pipeline for task generation that explicitly explores interactive environments to discover possible interactions and current state information to synthesize environment-grounded tasks. AutoPlay operates in two stages: (i) an exploration phase, where an MLLM explorer agent systematically uncovers novel environment states and functionalities, and (ii) a task generation phase, where a task generator leverages exploration trajectories and a set of task guideline prompts as context to synthesize diverse, executable, and verifiable tasks. We show AutoPlay generates 20k tasks across 20 Android applications and 10k tasks across 13 Ubuntu applications to train mobile-use and computer-use agents. AutoPlay-generated tasks enable large-scale task demonstration synthesis without human annotation by employing an MLLM task executor and verifier. This data enables training MLLM-based UI agents that improve success rates up to
Towards Understanding the Impact of Data Bugs on Deep Learning Models in Software Engineering
Mehil B. Shah
Mohammad Masudur Rahman
Deep learning (DL) techniques have achieved significant success in various software engineering tasks (e.g., code completion by Copilot). However, DL systems are prone to bugs from many sources, including training data. Existing literature suggests that bugs in training data are highly prevalent, but little research has focused on understanding their impacts on the models used in software engineering tasks. In this paper, we address this research gap through a comprehensive empirical investigation focused on three types of data prevalent in software engineering tasks: code-based, text-based, and metric-based. Using state-of-the-art baselines, we compare the models trained on clean datasets with those trained on datasets with quality issues and without proper preprocessing. By analysing the gradients, weights, and biases from neural networks under training, we identify the symptoms of data quality and preprocessing issues. Our analysis reveals that quality issues in code data cause biased learning and gradient instability, whereas problems in text data lead to overfitting and poor generalisation of models. On the other hand, quality issues in metric data result in exploding gradients and model overfitting, and inadequate preprocessing exacerbates these effects across all three data types. Finally, we demonstrate the validity and generalizability of our findings using six new datasets. Our research provides a better understanding of the impact and symptoms of data bugs in software engineering datasets. Practitioners and researchers can leverage these findings to develop better monitoring systems and data-cleaning methods to help detect and resolve data bugs in deep learning systems.
Visual serial processing deficits explain divergences in human and VLM reasoning
Nicholas Budny
Kia Ghods
Declan Campbell
Raja Marjieh
Amogh Joshi
Sreejan Kumar
Jonathan D. Cohen
Thomas L. Griffiths
Why do Vision Language Models (VLMs), despite success on standard benchmarks, often fail to match human performance on surprisingly simple visual reasoning tasks? While the underlying computational principles are still debated, we hypothesize that a crucial factor is a deficit in visually-grounded serial processing. To test this hypothesis, we compared human and VLM performance across tasks designed to vary serial processing demands in three distinct domains: geometric reasoning, perceptual enumeration, and mental rotation. Tasks within each domain varied serial processing load by manipulating factors such as geometric concept complexity, perceptual individuation load, and transformation difficulty. Across all domains, our results revealed a consistent pattern: decreased VLM accuracy was strongly correlated with increased human reaction time (used as a proxy for serial processing load). As tasks require more demanding serial processing (whether composing concepts, enumerating items, or performing mental transformations), the VLM-human performance gap widens reliably. These findings support our hypothesis, indicating that limitations in serial, visually grounded reasoning represent a fundamental bottleneck that distinguishes current VLMs from humans.
Intrinsic Neural Oscillations Predict Verbal Learning Performance and Encoding Strategy Use
Victor Oswald
Mathieu Landry
Sarah Lippé
Philippe Robaey
WebArena Verified: Reliable Evaluation for Web Agents
Amine El hattami
Christopher Pal
Autonomous web agents increasingly operate in multi-step browser workflows, yet widely used benchmarks can misestimate performance due to underspecified goals and brittle checkers, challenges characteristic of normal benchmark maturation rather than flaws in the paradigm. We present WebArena Verified, a reproducible re-evaluation of WebArena that preserves its containerized environments while strengthening measurement. We audit all 812 tasks, repair misaligned evaluations and clarify ambiguous instructions; replace substring matching with type- and normalization-aware comparators; verify backend state for state-changing tasks; and adopt a structured JSON schema with explicit status codes for deterministic scoring. We provide improved results reporting with template-level macro averages, 95% confidence intervals, and failure-mode breakdowns. We also introduce WebArena Verified Hard, a 137-task subset that retains difficult cases while reducing evaluation cost by 83%. On the baseline agent we evaluated, it reduces false negatives by approximately 11%. WebArena Verified remains drop-in compatible with minimal change to existing agents, supporting faithful and comparable progress. We release our code, data, and evaluation tools in our public repository.
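To make the contrast between substring matching and type- and normalization-aware comparison concrete, here is a minimal Python sketch. It is not the WebArena Verified implementation: the function names, the two supported kinds (`"number"` and `"text"`), and the normalization rules are all assumptions chosen for illustration.

```python
import re
from typing import Optional


def normalize_text(value: str) -> str:
    """Lowercase, trim, and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", value.strip().lower())


def parse_number(value: str) -> Optional[float]:
    """Parse strings such as '$1,299.50' into a float; None if not numeric."""
    cleaned = re.sub(r"[,$\s]", "", value)
    try:
        return float(cleaned)
    except ValueError:
        return None


def substring_match(expected: str, answer: str) -> bool:
    """Brittle baseline: passes whenever expected occurs anywhere in answer."""
    return expected in answer


def typed_match(expected: str, answer: str, kind: str) -> bool:
    """Compare two values after normalizing them according to their type."""
    if kind == "number":
        e, a = parse_number(expected), parse_number(answer)
        return e is not None and a is not None and abs(e - a) < 1e-9
    if kind == "text":
        return normalize_text(expected) == normalize_text(answer)
    raise ValueError(f"unsupported kind: {kind}")


if __name__ == "__main__":
    # Substring matching accepts a wrong answer and rejects a correct one.
    print(substring_match("12", "120"))                # wrongly accepted
    print(substring_match("1299.5", "$1,299.50"))      # wrongly rejected
    print(typed_match("12", "120", "number"))
    print(typed_match("1299.5", "$1,299.50", "number"))
```

The example shows both failure directions of substring checks: a false positive when the expected string is embedded in a different value, and a false negative when formatting differs. The subset size quoted in the abstract is also consistent with the stated cost reduction, since 137 of 812 tasks is roughly 17% of the benchmark.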
Beyond Embeddings: Interpretable Feature Extraction for Binary Code Similarity
Charles E. Gagnon
Steven H. H. Ding
Philippe Charland
Benjamin C. M. Fung
Planner Aware Path Learning in Diffusion Language Models Training
Fred Zhangzhi Peng
Zachary Bezemek
Shuibai Zhang
Anru R. Zhang
Michael M. Bronstein
Avishek Bose
Planning with Unified Multimodal Models
Zhilong Zhang
Yang Yu
With the powerful reasoning capabilities of large language models (LLMs) and vision-language models (VLMs), many recent works have explored using them for decision-making. However, most of these approaches rely solely on language-based reasoning, which limits their ability to reason and make informed decisions. Recently, a promising new direction has emerged with unified multimodal models (UMMs), which support both multimodal inputs and outputs. We believe such models have greater potential for decision-making by enabling reasoning through generated visual content. To this end, we propose Uni-Plan, a planning framework built on UMMs. Within this framework, a single model simultaneously serves as the policy, dynamics model, and value function. In addition, to avoid hallucinations in dynamics predictions, we present a novel approach, self-discriminated filtering, in which the generative model serves as a self-discriminator to filter out invalid dynamics predictions. Experiments on long-horizon planning tasks show that Uni-Plan substantially improves success rates compared to VLM-based methods, while also showing strong data scalability, requiring no expert demonstrations and achieving better performance under the same training-data size. This work lays a foundation for future research in reasoning and decision-making with UMMs.