Revisiting the Goldilocks Zone in Inhomogeneous Networks
Zacharie Garnier Cuchet
We investigate how architectural inhomogeneities—such as biases, layer normalization, and residual connections—affect the curvature of t… (voir plus)he loss landscape at initialization and its link to trainability. We focus on the Goldilocks zone, a region in parameter space with excess positive curvature, previously associated with improved optimization in homogeneous networks. To extend this analysis, we compare two scaling strategies: weight scaling and softmax temperature scaling. Our results show that in networks with biases or residual connections, both strategies identify a Goldilocks zone aligned with better training. In contrast, layer normalization leads to lower or negative curvature, yet stable optimization—revealing a disconnect between curvature and trainability. Softmax temperature scaling behaves more consistently across models, making it a more robust probe. Overall, the Goldilocks zone remains relevant in inhomogeneous networks, but its geometry and predictive power depend on architectural choices, particularly normalization.
Spaced Scheduling for Large Language Model Training
Amine El hattami
TGM: A Modular Framework for Machine Learning on Temporal Graphs
While deep learning on static graphs has been revolutionized by standardized libraries like PyTorch Geometric and DGL, machine learning on T… (voir plus)emporal Graphs (TG), networks that evolve over time, lacks comparable software infrastructure. Existing TG libraries are limited in scope, focusing on a single method category or specific algorithms. We introduce Temporal Graph Modelling (TGM), a comprehensive framework for machine learning on temporal graphs to address this gap. Through a modular architecture, TGM is the first library to support both discrete and continuous-time TG methods and implements a wide range of TG methods. The TGM framework combines an intuitive front-end API with an optimized backend storage, enabling reproducible research and efficient experimentation at scale. Key features include graph-level optimizations for offline training and built-in performance profiling capabilities. Through extensive benchmarking on five real-world networks, TGM is up to 6 times faster than the widely used DyGLib library on TGN and TGAT models and up to 8 times faster than the UTG framework for converting edges into coarse-grained snapshots.
Towards Fair In-Context Learning with Tabular Foundation Models
Patrik Joslin Kenfack
Tabular foundational models have shown promising in-context learning capabilities on structured data by using training examples as context w… (voir plus)ithout further parameter adjustments. This emerging approach positions itself as a competitive alternative to traditional gradient-boosted tree methods. However, while biases in conventional machine learning models are well documented, it remains unclear how these biases manifest in Tabular ICL. The paper investigates the fairness implications of Tabular ICL and explores three preprocessing strategies—correlation removal, group-balanced demonstration selection, and uncertainty-based demonstration selection—to address bias. Comprehensive experiments indicate that uncertainty-based demonstration selection consistently enhances group fairness in the predictions. The source code for reproducing the results of this work can be found at https://anonymous.4open.science/r/Fair-TabICL-DD84.
Towards Fair In-Context Learning with Tabular Foundation Models
Patrik Joslin Kenfack
Tabular foundational models have shown promising in-context learning capabilities on structured data by using training examples as context w… (voir plus)ithout further parameter adjustments. This emerging approach positions itself as a competitive alternative to traditional gradient-boosted tree methods. However, while biases in conventional machine learning models are well documented, it remains unclear how these biases manifest in Tabular ICL. The paper investigates the fairness implications of Tabular ICL and explores three preprocessing strategies—correlation removal, group-balanced demonstration selection, and uncertainty-based demonstration selection—to address bias. Comprehensive experiments indicate that uncertainty-based demonstration selection consistently enhances group fairness in the predictions. The source code for reproducing the results of this work can be found at https://anonymous.4open.science/r/Fair-TabICL-DD84.
Tracing the representation geometry of language models from pretraining to post-training
Melody Zixuan Li
Kumar Krishna Agrawal
Komal Kumar Teru
The geometry of representations in a neural network can significantly impact downstream generalization. It is unknown how representation geo… (voir plus)metry changes in large language models (LLMs) over pretraining and post-training. Here, we characterize the evolving geometry of LLM representations using spectral methods (effective rank and eigenspectrum decay). With the OLMo and Pythia model families we uncover a consistent non-monotonic sequence of three distinct geometric phases in pretraining. An initial \warmup phase sees rapid representational compression. This is followed by an "entropy-seeking" phase, characterized by expansion of the representation manifold's effective dimensionality, which correlates with an increase in memorization. Subsequently, a "compression seeking" phase imposes anisotropic consolidation, selectively preserving variance along dominant eigendirections while contracting others, correlating with improved downstream task performance. We link the emergence of these phases to the fundamental interplay of cross-entropy optimization, information bottleneck, and skewed data distribution. Additionally, we find that in post-training the representation geometry is further transformed: Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) correlate with another "entropy-seeking" dynamic to integrate specific instructional or preferential data, reducing out-of-distribution robustness. Conversely, Reinforcement Learning with Verifiable Rewards (RLVR) often exhibits a "compression seeking" dynamic, consolidating reward-aligned behaviors and reducing the entropy in its output distribution. This work establishes the utility of spectral measures of representation geometry for understanding the multiphase learning dynamics within LLMs.
Two-point deterministic equivalence for SGD in random feature models
Alexander Atanasov
Blake Bordelon
Jacob A Zavatone-Veth
Cengiz Pehlevan
Ultrasound and MRI-based evaluation of relationships between morphological and mechanical properties of the lower lumbar multifidus muscle in chronic low back pain.
Neda Naghdi
Sara Masi
Cléo Bertrand
Brent Rosenstein
Hassan Rivaz
Mathieu Roy
Maryse Fortin
DoomArena: A framework for Testing AI Agents Against Evolving Security Threats
Mihir Bansal
Chandra Kiran Reddy Evuru
Gabriel Huang
Abhay Puri
Avinandan Bose
Maryam Fazel
Jason Stanley
Alexandre Lacoste
Krishnamurthy Dj Dvijotham
We present DoomArena, a security evaluation framework for AI agents. DoomArena is designed on three principles: 1) It is a plug-in framework… (voir plus) and integrates easily into realistic agentic frameworks like BrowserGym (for web agents) and
How to Train Your LLM Web Agent: A Statistical Diagnosis
Hadi Nekoei
Nicolas Gontier
Miguel Muñoz-Mármol
Stefania Raimondo
Alexandre Piché
Alexandre Lacoste
Massimo Caccia
LLM-based web agents have recently made significant progress, but much of it has occurred in closed-source systems, widening the gap with op… (voir plus)en-source alternatives. Progress has been held back by two key challenges: first, a narrow focus on single-step tasks that overlooks the complexity of multi-step web interactions; and second, the high compute costs required to post-train LLM-based web agents. To address this, we present the first statistically grounded study on compute allocation for LLM web-agent post-training. Our approach uses a two-stage pipeline, training a Llama 3.1 8B student to imitate a Llama 3.3 70B teacher via supervised fine-tuning (SFT), followed by on-policy reinforcement learning. We find this process highly sensitive to hyperparameter choices, making exhaustive sweeps impractical. To spare others from expensive trial-and-error, we sample 1,370 configurations and use bootstrapping to estimate effective hyperparameters. Our results show that combining SFT with on-policy RL consistently outperforms either approach alone on both WorkArena and MiniWob++. Further, this strategy requires only 55% of the compute to match the peak performance of pure SFT on MiniWob++, effectively pushing the compute-performance Pareto frontier, and is the only strategy that can close the gap with closed-source models.
How to Train Your LLM Web Agent: A Statistical Diagnosis
Hadi Nekoei
Nicolas Gontier
Miguel Muñoz-Mármol
Stefania Raimondo
Alexandre Piché
Alexandre Lacoste
Massimo Caccia
Large language model (LLM) agents for web interfaces have advanced rapidly, yet open-source systems still lag behind proprietary agents. Bri… (voir plus)dging this gap is key to enabling customizable, efficient, and privacy-preserving agents. Two challenges hinder progress: the reproducibility issues in RL and LLM agent training, where results often depend on sensitive factors like seeds and decoding parameters, and the focus of prior work on single-step tasks, overlooking the complexities of web-based, multi-step decision-making. We address these gaps by providing a statistically driven study of training LLM agents for web tasks. Our two-stage pipeline combines imitation learning from a Llama 3.3 70B teacher with on-policy fine-tuning via Group Relative Policy Optimization (GRPO) on a Llama 3.1 8B student. Through 240 configuration sweeps and rigorous bootstrapping, we chart the first compute allocation curve for open-source LLM web agents. Our findings show that dedicating one-third of compute to teacher traces and the rest to RL improves MiniWoB++ success by 6 points and closes 60% of the gap to GPT-4o on WorkArena, while cutting GPU costs by 45%. We introduce a principled hyperparameter sensitivity analysis, offering actionable guidelines for robust and cost-effective agent training.
Silent Sabotage: Injecting Backdoors into AI Agents Through Fine-Tuning
Abhay Puri
Chandra Kiran Reddy Evuru
Joshua Kazdan
Avinandan Bose
Maryam Fazel
Sai Rajeswar
Jason Stanley
Krishnamurthy Dj Dvijotham
The rise of AI agents that can use tools, browse the web and interact with computers on behalf of a user, has sparked strong interest in imp… (voir plus)roving these capabilities by explicitly fine-tuning the LLMs/VLMs that power these agents. Several researchers have proposed collecting data by letting the agents interact with their environment (e.g., a computer operating system, the web or a collection of APIs exposed as tools), and improve agent performance by fine tuning on this data. In this work, we show that such data collection can be manipulated by adversaries to insert poisoned traces. By modifying just 5% of collected traces, adversaries can embed stealthy bad behaviors into agents—like leaking confidential user information whenever the tool or webpage exposes a trigger. Our results raise important security concerns in the development of AI agents, and underscore the importance of careful scrutiny of all data collection processes used to improve agentic AI.