Publications
SemEval-2025 Task 11: Bridging the Gap in Text-Based Emotion Detection
Understanding the impact of IoT security patterns on CPU usage and energy consumption: a dynamic approach for selecting patterns with deep reinforcement learning
Large Language Models (LLMs) are widely adopted for automated code generation with promising results. Although prior research has assessed LLM-generated code and identified various quality issues, such as redundancy, poor maintainability, and sub-optimal performance, a systematic understanding and categorization of these inefficiencies remain unexplored. Therefore, we empirically investigate inefficiencies in Python code generated by state-of-the-art LLMs, i.e., CodeLlama, DeepSeek-Coder, and CodeGemma. To do so, we manually analyze 492 generated Python code snippets in the HumanEval+ dataset. We then construct a taxonomy of inefficiencies in LLM-generated Python code that includes 5 categories (General Logic, Performance, Readability, Maintainability, and Errors) and 19 subcategories of inefficiencies. We validate the obtained taxonomy through an online survey with 58 LLM practitioners and researchers. The surveyed participants affirmed the completeness of the proposed taxonomy, and the relevance and the popularity of the identified code inefficiency patterns. Our qualitative findings indicate that inefficiencies are diverse and interconnected, affecting multiple aspects of code quality, with logic and performance-related inefficiencies being the most frequent and often co-occurring while impacting overall code quality. Our taxonomy provides a structured basis for evaluating the quality of LLM-generated code and guiding future research to improve code generation efficiency.
We introduce NNetscape Navigator (NNetnav), a method for training web agents entirely through synthetic demonstrations. These demonstrations are collected by first interacting with a browser to generate trajectory rollouts, which are then retroactively labeled into instructions using a language model. Most work on training browser agents has relied on expensive human supervision, and the limited previous work on such interaction-first synthetic data techniques has failed to provide effective search through the exponential space of exploration. In contrast, NNetnav exploits the hierarchical structure of language instructions to make this search more tractable: complex instructions are typically decomposable into simpler subtasks, allowing NNetnav to automatically prune interaction episodes when an intermediate trajectory cannot be annotated with a meaningful sub-task. We use NNetnav demonstrations from a language model for supervised fine-tuning of a smaller language model policy, and find improvements of 6 points on WebArena and over 20 points on MiniWoB++, two popular environments for web agents. Notably, on WebArena, we observe that language model policies can be further enhanced when fine-tuned with NNetnav demonstrations derived from the same language model. Finally, we collect and release a dataset of over 6k NNetnav demonstrations on WebArena, spanning a diverse and complex set of instructions.
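The pruning idea in the abstract above can be illustrated with a small sketch. It is not the authors' implementation: the callables `env_step`, `propose_action`, and `label_trajectory`, as well as the `check_every` parameter, are hypothetical stand-ins for the browser environment, the exploration policy, and the language-model labeler, and the abstract does not say how often intermediate trajectories are checked.

```python
def collect_demonstration(env_step, propose_action, label_trajectory,
                          start_state, max_steps, check_every):
    """Roll out one exploration episode and label it in hindsight.

    All callables are hypothetical placeholders:
      env_step(state, action) -> next state
      propose_action(state)   -> action chosen by the exploration policy
      label_trajectory(traj)  -> instruction string, or None when the
                                 trajectory has no meaningful (sub-)task
    Returns (instruction, trajectory), or None if the episode was pruned.
    """
    trajectory = []
    state = start_state
    for step in range(1, max_steps + 1):
        action = propose_action(state)
        state = env_step(state, action)
        trajectory.append((action, state))
        # Periodically try to annotate the partial trajectory; if no
        # meaningful sub-task fits, stop exploring this episode early.
        if step % check_every == 0 and label_trajectory(trajectory) is None:
            return None
    # Retroactively label the full rollout with an instruction.
    instruction = label_trajectory(trajectory)
    if instruction is None:
        return None
    return instruction, trajectory
```

In this toy form, an episode that stops being describable is abandoned at the next check rather than run to `max_steps`, which is one way to read the claim that pruning makes the search more tractable.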
Recent advances in integrating positional and structural encodings (PSEs) into graph neural networks (GNNs) have significantly enhanced their performance across various graph learning tasks. However, the general applicability of these encodings and their potential to serve as foundational representations for graphs remain uncertain. This paper investigates the fine-tuning efficiency, scalability with sample size, and generalization capability of learnable PSEs across diverse graph datasets. Specifically, we evaluate their potential as universal pre-trained models that can be easily adapted to new tasks with minimal fine-tuning and limited data. Furthermore, we assess the expressivity of the learned representations, particularly when used to augment downstream GNNs. We demonstrate through extensive benchmarking and empirical analysis that PSEs generally enhance downstream models. However, some datasets may require specific PSE-augmentations to achieve optimal performance. Nevertheless, our findings highlight their significant potential to become integral components of future graph foundation models. We provide new insights into the strengths and limitations of PSEs, contributing to the broader discourse on foundation models in graph learning.
Earthquake monitoring is a fundamental task to unravel the underlying physics of earthquakes and mitigate associated hazards for public safety. Distributed acoustic sensing, or DAS, which transforms pre-existing telecommunication cables into ultra-dense seismic networks, offers a cost-effective and scalable solution for next-generation earthquake monitoring. However, current approaches for earthquake monitoring like PhaseNet and PhaseNet-2 primarily rely on supervised learning, while manually labeled DAS data is quite limited and it is difficult to obtain more annotated datasets. In this paper, we present DASFormer, a novel self-supervised pretraining technique on DAS data with a coarse-to-fine framework that models spatial-temporal signal correlation. We treat earthquake monitoring as an anomaly detection task and demonstrate DASFormer can be directly utilized as a seismic phase detector. Experimental results demonstrate that DASFormer is effective in terms of several evaluation metrics and outperforms state-of-the-art time-series forecasting, anomaly detection, and foundation models on the unsupervised seismic detection task. We also demonstrate the potential of fine-tuning DASFormer to downstream tasks through case studies.
GradTune: Last-layer Fine-tuning for Group Robustness Without Group Annotation
Patrik Joslin Kenfack
Ulrich Matchi Aïvodji
S Ebrahimi Kahou
This work addresses the limitations of deep neural networks (DNNs) in generalizing beyond training data due to spurious correlations. Recent research has demonstrated that models trained with empirical risk minimization (ERM) learn both core and spurious features, often upweighting spurious ones in the final classification, which can frequently lead to poor performance on minority groups. Deep Feature Reweighting alleviates this issue by retraining the model's last classification layer using a group-balanced held-out validation set. However, relying on spurious feature labels during training or validation limits practical application, as spurious features are not always known or are costly to annotate. Our preliminary experiments reveal that ERM-trained models exhibit higher gradient norms on minority group samples in the held-out dataset. Leveraging these insights, we propose an alternative approach called GradTune, which fine-tunes the last classification layer using high-gradient norm samples. Our results on four well-established benchmarks demonstrate that the proposed method can achieve competitive performance compared to existing methods without requiring group labels during training or validation.
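To make the selection step concrete, here is a minimal sketch of ranking held-out samples by the gradient norm of a softmax cross-entropy loss with respect to a linear last layer. It rests on assumptions the abstract does not state: the last layer is linear, only the weight-matrix gradient is measured (the bias is ignored), and samples are chosen by a simple top-k rule. The function names are illustrative and are not taken from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def last_layer_grad_norm(weights, bias, features, label):
    """Frobenius norm of d(cross-entropy)/d(weights) for one sample.

    For a linear layer, the gradient is outer(probs - onehot(label), features),
    so its norm factorizes into the product of the two vector norms.
    """
    logits = [sum(w * x for w, x in zip(row, features)) + b
              for row, b in zip(weights, bias)]
    probs = softmax(logits)
    residual = [p - (1.0 if k == label else 0.0) for k, p in enumerate(probs)]
    residual_norm = math.sqrt(sum(r * r for r in residual))
    feature_norm = math.sqrt(sum(x * x for x in features))
    return residual_norm * feature_norm

def select_high_gradient_samples(weights, bias, dataset, top_k):
    """Return sorted indices of the top_k samples with the largest gradient norm.

    The top-k rule is an assumption made for this sketch.
    """
    scored = [(last_layer_grad_norm(weights, bias, x, y), i)
              for i, (x, y) in enumerate(dataset)]
    scored.sort(key=lambda t: t[0], reverse=True)
    return sorted(i for _, i in scored[:top_k])
```

The selected indices would then form the subset on which the last layer is fine-tuned. No group labels are used anywhere in the selection.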