Publications

Scaling Synthetic Task Generation for Agents via Exploration

Ram Ramrakhya

Andrew Szot

Omar Attia

Yuhao Yang

Anh Nguyen

Bogdan Mazoure

Zhe Gan

Harsh Agrawal

Alexander T Toshev

Post-Training Multimodal Large Language Models (MLLMs) to build interactive agents holds promise across domains such as computer-use, web navigation, and robotics. A key challenge in scaling such post-training is the lack of high-quality downstream agentic task datasets with tasks that are diverse, feasible, and verifiable. Existing approaches for task generation rely heavily on human annotation or on prompting MLLMs with limited downstream environment information, which is either costly or poorly scalable, as it yields tasks with limited coverage. To remedy this, we present AutoPlay, a scalable pipeline for task generation that explicitly explores interactive environments to discover possible interactions and current state information to synthesize environment-grounded tasks. AutoPlay operates in two stages: (i) an exploration phase, where an MLLM explorer agent systematically uncovers novel environment states and functionalities, and (ii) a task generation phase, where a task generator leverages exploration trajectories and a set of task guideline prompts as context to synthesize diverse, executable, and verifiable tasks. We show AutoPlay generates 20k tasks across 20 Android applications and 10k tasks across 13 Ubuntu applications to train mobile-use and computer-use agents. AutoPlay-generated tasks enable large-scale task demonstration synthesis without human annotation by employing an MLLM task executor and verifier. This data enables training MLLM-based UI agents that improve success rates up to

2025-09-29

ArXiv (preprint)

doi.org

arxiv.org

Task Mapping Strategies for Electric Power System Simulations on Heterogeneous Clusters

Julie Durette

Gunes Karabulut Kurt

Antoine Lesage-Landry

In this work, we propose improved task mapping strategies for real-time electric power system simulations on heterogeneous computing clusters, considering both heterogeneous communication links and processing capacities, with a focus on bottleneck objectives. We approach the problem through two complementary models: the bottleneck quadratic semi-assignment problem (BQSAP), which optimizes task configuration for a fixed number of computing nodes while minimizing communication and computation costs; and the variable-size bin packing problem with quadratic communication constraints (Q-VSBPP), which minimizes the required number of computing nodes, valuable for resource provisioning scenarios. We extend the PuLP library to approximately solve both problems, explicitly including communication costs and processing constraints, and formalize the nomenclature and definitions for bottleneck objectives in graph partitioning. This formalization fills a gap in the existing literature and provides a framework for the rigorous analysis and application of task mapping techniques to real-time electric power system simulation. Finally, we provide a quantitative study and benchmark the extended PuLP library against the SCOTCH partitioning library in the context of real-time electromagnetic transient (EMT) simulation task mapping.

2025-09-29

IEEE International Conference on Smart Grid Communications (published)

doi.org

The spatially-resolved effect of mergers on the stellar mass assembly of MaNGA galaxies

E. Angeloudi

Marc Huertas-Company

Jesús Falcón-Barroso

Laurence Perreault-Levasseur

Alexandre Adam

A. Boecker

Understanding the origin of stars within a galaxy - whether formed in-situ or accreted from other galaxies (ex-situ) - is key to constraining its evolution. Spatially resolving these components provides crucial insights into a galaxy's mass assembly history. We aim to predict the spatial distribution of ex-situ stellar mass fraction in MaNGA galaxies, and to identify distinct assembly histories based on the radial gradients of these predictions in the central regions. We employ a diffusion model trained on mock MaNGA analogs (MaNGIA), derived from the TNG50 cosmological simulation. The model learns to predict the posterior distribution of resolved ex-situ stellar mass fraction maps, conditioned on stellar mass density, velocity, and velocity dispersion gradient maps. After validating the model on an unseen test set from MaNGIA, we apply it to MaNGA galaxies to infer the spatially-resolved distribution of their ex-situ stellar mass fractions - i.e. the fraction of stellar mass in each spaxel originating from mergers. We identify four broad categories of ex-situ mass distributions: flat gradient, in-situ dominated; flat gradient, ex-situ dominated; positive gradient; and negative gradient. The vast majority of MaNGA galaxies fall in the first category - flat gradients with low ex-situ fractions - confirming that in-situ star formation is the main assembly driver for low- to intermediate-mass galaxies. At high stellar masses, the ex-situ maps are more diverse, highlighting the key role of mergers in building the most massive systems. Ex-situ mass distributions correlate with morphology, star-formation activity, stellar kinematics, and environment, indicating that accretion history is a primary factor shaping massive galaxies. Finally, by tracing their assembly histories in TNG50, we link each class to distinct merger scenarios, ranging from secular evolution to merger-dominated growth.

2025-09-29

ArXiv (preprint)

arxiv.org

Visual serial processing deficits explain divergences in human and VLM reasoning

Nicholas Budny

Kia Ghods

Declan Campbell

Raja Marjieh

Amogh Joshi

Sreejan Kumar

Jonathan D. Cohen

Taylor Webb

Thomas L. Griffiths

Why do Vision Language Models (VLMs), despite success on standard benchmarks, often fail to match human performance on surprisingly simple visual reasoning tasks? While the underlying computational principles are still debated, we hypothesize that a crucial factor is a deficit in visually-grounded serial processing. To test this hypothesis, we compared human and VLM performance across tasks designed to vary serial processing demands in three distinct domains: geometric reasoning, perceptual enumeration, and mental rotation. Tasks within each domain varied serial processing load by manipulating factors such as geometric concept complexity, perceptual individuation load, and transformation difficulty. Across all domains, our results revealed a consistent pattern: decreased VLM accuracy was strongly correlated with increased human reaction time (used as a proxy for serial processing load). As tasks require more demanding serial processing -- whether composing concepts, enumerating items, or performing mental transformations -- the VLM-human performance gap widens reliably. These findings support our hypothesis, indicating that limitations in serial, visually grounded reasoning represent a fundamental bottleneck that distinguishes current VLMs from humans.

2025-09-29

ArXiv (preprint)

arxiv.org

Intrinsic Neural Oscillations Predict Verbal Learning Performance and Encoding Strategy Use

Victor Oswald

Mathieu Landry

Hamza Abdelhedi

Sarah Lippé

Philippe Robaey

Karim Jerbi

2025-09-28

bioRxiv (preprint)

doi.org

WebArena Verified: Reliable Evaluation for Web Agents

Amine El hattami

Megh Thakkar

Nicolas Chapados

Chris Pal

Autonomous web agents increasingly operate in multi-step browser workflows, yet widely used benchmarks can misestimate performance due to underspecified goals and brittle checkers, challenges characteristic of normal benchmark maturation rather than flaws in the paradigm. We present WebArena Verified, a reproducible re-evaluation of WebArena that preserves its containerized environments while strengthening measurement. We audit all 812 tasks, repair misaligned evaluations and clarify ambiguous instructions; replace substring matching with type- and normalization-aware comparators; verify backend state for state-changing tasks; and adopt a structured JSON schema with explicit status codes for deterministic scoring. We provide improved results reporting with template-level macro averages, 95% confidence intervals, and failure-mode breakdowns. We also introduce WebArena Verified Hard, a 137-task subset that retains difficult cases while reducing evaluation cost by 83%. On the baseline agent we evaluated, it reduces false negatives by approximately 11%. WebArena Verified remains drop-in compatible with minimal change to existing agents, supporting faithful and comparable progress. We release our code, data, and evaluation tools in our public repository.

2025-09-28

NeurIPS.cc/2025/Workshop/SEA (poster)

openreview.net

Beyond Embeddings: Interpretable Feature Extraction for Binary Code Similarity

Charles E. Gagnon

Steven H. H. Ding

Philippe Charland

Benjamin Fung

2025-09-27

ArXiv (preprint)

doi.org

arxiv.org

Beyond Embeddings: Interpretable Feature Extraction for Binary Code Similarity

Charles E. Gagnon

Steven H. H. Ding

Philippe Charland

Benjamin Fung

Binary code similarity detection is a core task in reverse engineering. It supports malware analysis and vulnerability discovery by identifying semantically similar code in different contexts. Modern methods have progressed from manually engineered features to vector representations. Hand-crafted statistics (e.g., operation ratios) are interpretable, but shallow and fail to generalize. Embedding-based methods overcome this by learning robust cross-setting representations, but these representations are opaque vectors that prevent rapid verification. They also face a scalability-accuracy trade-off, since high-dimensional nearest-neighbor search requires approximations that reduce precision. Current approaches thus force a compromise between interpretability, generalizability, and scalability. We bridge these gaps using a language model-based agent to conduct structured reasoning analysis of assembly code and generate features such as input/output types, side effects, notable constants, and algorithmic intent. Unlike hand-crafted features, they are richer and adaptive. Unlike embeddings, they are human-readable, maintainable, and directly searchable with inverted or relational indexes. Without any matching training, our method achieves recall@1 of 42% and 62% on cross-architecture and cross-optimization tasks, respectively, comparable to embedding methods with training (39% and 34%). Combined with embeddings, it significantly outperforms the state-of-the-art, demonstrating that accuracy, scalability, and interpretability can coexist.

2025-09-27

ArXiv (preprint)

arxiv.org
