Mila is hosting its first quantum computing hackathon on November 21, a unique day to explore quantum and AI prototyping, collaborate on Quandela and IBM platforms, and learn, share, and network in a stimulating environment at the heart of Quebec’s AI and quantum ecosystem.
This new initiative aims to strengthen connections between Mila’s research community, its partners, and AI experts across Quebec and Canada through in-person meetings and events focused on AI adoption in industry.
We use cookies to analyze the browsing and usage of our website and to personalize your experience. You can disable these technologies at any time, but this may limit certain functionalities of the site. Read our Privacy Policy for more information.
Setting cookies
You can enable and disable the types of cookies you wish to accept. However certain choices you make could affect the services offered on our sites (e.g. suggestions, personalised ads, etc.).
Essential cookies
These cookies are necessary for the operation of the site and cannot be deactivated. (Still active)
Analytics cookies
Do you accept the use of cookies to measure the audience of our sites?
Multimedia Player
Do you accept the use of cookies to display and allow you to watch the video content hosted by our partners (YouTube, etc.)?
Publications
Scaling Synthetic Task Generation for Agents via Exploration
Post-Training Multimodal Large Language Models (MLLMs) to build interactive agents holds promise across domains such as computer-use, web na… (see more)vigation, and robotics. A key challenge in scaling such post-training is lack of high-quality downstream agentic task datasets with tasks that are diverse, feasible, and verifiable. Existing approaches for task generation rely heavily on human annotation or prompting MLLM with limited downstream environment information, which is either costly or poorly scalable as it yield tasks with limited coverage. To remedy this, we present AutoPlay, a scalable pipeline for task generation that explicitly explores interactive environments to discover possible interactions and current state information to synthesize environment-grounded tasks. AutoPlay operates in two stages: (i) an exploration phase, where an MLLM explorer agent systematically uncovers novel environment states and functionalities, and (ii) a task generation phase, where a task generator leverages exploration trajectories and a set of task guideline prompts as context to synthesize diverse, executable, and verifiable tasks. We show AutoPlay generates 20k tasks across 20 Android applications and 10k tasks across 13 applications Ubuntu applications to train mobile-use and computer-use agents. AutoPlay generated tasks enable large-scale task demonstration synthesis without human annotation by employing an MLLM task executor and verifier. This data enables training MLLM-based UI agents that improve success rates up to
Post-Training Multimodal Large Language Models (MLLMs) to build interactive agents holds promise across domains such as computer-use, web na… (see more)vigation, and robotics. A key challenge in scaling such post-training is lack of high-quality downstream agentic task datasets with tasks that are diverse, feasible, and verifiable. Existing approaches for task generation rely heavily on human annotation or prompting MLLM with limited downstream environment information, which is either costly or poorly scalable as it yield tasks with limited coverage. To remedy this, we present AutoPlay, a scalable pipeline for task generation that explicitly explores interactive environments to discover possible interactions and current state information to synthesize environment-grounded tasks. AutoPlay operates in two stages: (i) an exploration phase, where an MLLM explorer agent systematically uncovers novel environment states and functionalities, and (ii) a task generation phase, where a task generator leverages exploration trajectories and a set of task guideline prompts as context to synthesize diverse, executable, and verifiable tasks. We show AutoPlay generates 20k tasks across 20 Android applications and 10k tasks across 13 applications Ubuntu applications to train mobile-use and computer-use agents. AutoPlay generated tasks enable large-scale task demonstration synthesis without human annotation by employing an MLLM task executor and verifier. This data enables training MLLM-based UI agents that improve success rates up to
In this work, we propose improved task mapping strategies for real-time electric power system simulations on heterogeneous computing cluster… (see more)s, considering both heterogeneous communication links and processing capacities, with a focus on bottleneck objectives. We approach the problem through two complementary models: the bottleneck quadratic semi-assignment problem (BQSAP), which optimizes task configuration for a fixed number of computing nodes while minimizing communication and computation costs; and the variable-size bin packing problem with quadratic communication constraints (Q-VSBPP), which minimizes the required number of computing nodes, valuable for resource provisioning scenarios. We extend the PuLP library to solve approximately both problems, explicitly including communication costs and processing constraints, and formalize the nomenclature and definitions for bottleneck objectives in graph partitioning. This formalization fills a gap in the existing literature and provides a framework for the rigorous analysis and application of task mapping techniques to real-time electric power system simulation. Finally, we provide a quantitative study and benchmark the extended PuLP library with the SCOTCH partitioning library in the context of real-time electromagnetic transient (EMT) simulation task mapping.
2025-09-29
IEEE International Conference on Smart Grid Communications (published)
Understanding the origin of stars within a galaxy - whether formed in-situ or accreted from other galaxies (ex-situ) - is key to constrainin… (see more)g its evolution. Spatially resolving these components provides crucial insights into a galaxy's mass assembly history. We aim to predict the spatial distribution of ex-situ stellar mass fraction in MaNGA galaxies, and to identify distinct assembly histories based on the radial gradients of these predictions in the central regions. We employ a diffusion model trained on mock MaNGA analogs (MaNGIA), derived from the TNG50 cosmological simulation. The model learns to predict the posterior distribution of resolved ex-situ stellar mass fraction maps, conditioned on stellar mass density, velocity, and velocity dispersion gradient maps. After validating the model on an unseen test set from MaNGIA, we apply it to MaNGA galaxies to infer the spatially-resolved distribution of their ex-situ stellar mass fractions - i.e. the fraction of stellar mass in each spaxel originating from mergers. We identify four broad categories of ex-situ mass distributions: flat gradient, in-situ dominated; flat gradient, ex-situ dominated; positive gradient; and negative gradient. The vast majority of MaNGA galaxies fall in the first category - flat gradients with low ex-situ fractions - confirming that in-situ star formation is the main assembly driver for low- to intermediate-mass galaxies. At high stellar masses, the ex-situ maps are more diverse, highlighting the key role of mergers in building the most massive systems. Ex-situ mass distributions correlate with morphology, star-formation activity, stellar kinematics, and environment, indicating that accretion history is a primary factor shaping massive galaxies. Finally, by tracing their assembly histories in TNG50, we link each class to distinct merger scenarios, ranging from secular evolution to merger-dominated growth.
Why do Vision Language Models (VLMs), despite success on standard benchmarks, often fail to match human performance on surprisingly simple v… (see more)isual reasoning tasks? While the underlying computational principles are still debated, we hypothesize that a crucial factor is a deficit in visually-grounded serial processing. To test this hypothesis, we compared human and VLM performance across tasks designed to vary serial processing demands in three distinct domains: geometric reasoning, perceptual enumeration, and mental rotation. Tasks within each domain varied serial processing load by manipulating factors such as geometric concept complexity, perceptual individuation load, and transformation difficulty. Across all domains, our results revealed a consistent pattern: decreased VLM accuracy was strongly correlated with increased human reaction time (used as a proxy for serial processing load). As tasks require more demanding serial processing -- whether composing concepts, enumerating items, or performing mental transformations -- the VLM-human performance gap widens reliably. These findings support our hypothesis, indicating that limitations in serial, visually grounded reasoning represent a fundamental bottleneck that distinguishes current VLMs from humans.
Why do Vision Language Models (VLMs), despite success on standard benchmarks, often fail to match human performance on surprisingly simple v… (see more)isual reasoning tasks? While the underlying computational principles are still debated, we hypothesize that a crucial factor is a deficit in visually-grounded serial processing. To test this hypothesis, we compared human and VLM performance across tasks designed to vary serial processing demands in three distinct domains: geometric reasoning, perceptual enumeration, and mental rotation. Tasks within each domain varied serial processing load by manipulating factors such as geometric concept complexity, perceptual individuation load, and transformation difficulty. Across all domains, our results revealed a consistent pattern: decreased VLM accuracy was strongly correlated with increased human reaction time (used as a proxy for serial processing load). As tasks require more demanding serial processing -- whether composing concepts, enumerating items, or performing mental transformations -- the VLM-human performance gap widens reliably. These findings support our hypothesis, indicating that limitations in serial, visually grounded reasoning represent a fundamental bottleneck that distinguishes current VLMs from humans.
Why do Vision Language Models (VLMs), despite success on standard benchmarks, often fail to match human performance on surprisingly simple v… (see more)isual reasoning tasks? While the underlying computational principles are still debated, we hypothesize that a crucial factor is a deficit in visually-grounded serial processing. To test this hypothesis, we compared human and VLM performance across tasks designed to vary serial processing demands in three distinct domains: geometric reasoning, perceptual enumeration, and mental rotation. Tasks within each domain varied serial processing load by manipulating factors such as geometric concept complexity, perceptual individuation load, and transformation difficulty. Across all domains, our results revealed a consistent pattern: decreased VLM accuracy was strongly correlated with increased human reaction time (used as a proxy for serial processing load). As tasks require more demanding serial processing -- whether composing concepts, enumerating items, or performing mental transformations -- the VLM-human performance gap widens reliably. These findings support our hypothesis, indicating that limitations in serial, visually grounded reasoning represent a fundamental bottleneck that distinguishes current VLMs from humans.
Autonomous web agents increasingly operate in multi-step browser workflows, yet widely used benchmarks can misestimate performance due to un… (see more)derspecified goals and brittle checkers—challenges characteristic of normal benchmark maturation rather than flaws in the paradigm. We present WebArena Verified, a reproducible re-evaluation of WebArena that preserves its containerized environments while strengthening measurement. We audit all 812 tasks, repair misaligned evaluations and clarify ambiguous instructions; replace substring matching with type- and normalization-aware comparators; verify backend state for state-changing tasks; and adopt a structured JSON schema with explicit status codes for deterministic scoring. We provide improved results reporting with template-level macro averages, 95\% confidence intervals, and failure-mode breakdowns. We also introduce WebArena Verified Hard, a 137-task subset that retains difficult cases while reducing evaluation cost by 83\%. On the baseline agent we evaluated, it reduces false negatives by approximately 11\%. WebArena Verified remains drop-in compatible with minimal change to existing agents, supporting faithful and comparable progress. We release our code, data, and evaluation tools in our public repository.
Binary code similarity detection is a core task in reverse engineering. It supports malware analysis and vulnerability discovery by identify… (see more)ing semantically similar code in different contexts. Modern methods have progressed from manually engineered features to vector representations. Hand-crafted statistics (e.g., operation ratios) are interpretable, but shallow and fail to generalize. Embedding-based methods overcome this by learning robust cross-setting representations, but these representations are opaque vectors that prevent rapid verification. They also face a scalability-accuracy trade-off, since high-dimensional nearest-neighbor search requires approximations that reduce precision. Current approaches thus force a compromise between interpretability, generalizability, and scalability. We bridge these gaps using a language model-based agent to conduct structured reasoning analysis of assembly code and generate features such as input/output types, side effects, notable constants, and algorithmic intent. Unlike hand-crafted features, they are richer and adaptive. Unlike embeddings, they are human-readable, maintainable, and directly searchable with inverted or relational indexes. Without any matching training, our method respectively achieves 42% and 62% for recall@1 in cross-architecture and cross-optimization tasks, comparable to embedding methods with training (39% and 34%). Combined with embeddings, it significantly outperforms the state-of-the-art, demonstrating that accuracy, scalability, and interpretability can coexist.
Binary code similarity detection is a core task in reverse engineering. It supports malware analysis and vulnerability discovery by identify… (see more)ing semantically similar code in different contexts. Modern methods have progressed from manually engineered features to vector representations. Hand-crafted statistics (e.g., operation ratios) are interpretable, but shallow and fail to generalize. Embedding-based methods overcome this by learning robust cross-setting representations, but these representations are opaque vectors that prevent rapid verification. They also face a scalability-accuracy trade-off, since high-dimensional nearest-neighbor search requires approximations that reduce precision. Current approaches thus force a compromise between interpretability, generalizability, and scalability. We bridge these gaps using a language model-based agent to conduct structured reasoning analysis of assembly code and generate features such as input/output types, side effects, notable constants, and algorithmic intent. Unlike hand-crafted features, they are richer and adaptive. Unlike embeddings, they are human-readable, maintainable, and directly searchable with inverted or relational indexes. Without any matching training, our method respectively achieves 42% and 62% for recall@1 in cross-architecture and cross-optimization tasks, comparable to embedding methods with training (39% and 34%). Combined with embeddings, it significantly outperforms the state-of-the-art, demonstrating that accuracy, scalability, and interpretability can coexist.