A Meta-Learning Approach to Causal Inference
Dragos Cristian Manta
Predicting the effect of unseen interventions is at the heart of many scientific endeavours. While causal discovery is often used to answer these causal questions, it involves learning a full causal model, not tailored to the specific goal of predicting unseen interventions, and operates under stringent assumptions. We introduce a novel method based on meta-learning that predicts interventional effects without explicitly assuming a causal model. Our preliminary results on synthetic data show that it can provide good generalization to unseen interventions, and it even compares favorably to a causal discovery method. Our model-agnostic method opens up many avenues for future exploration, particularly for settings where causal discovery cannot be applied.
Meta-World+: An Improved, Standardized, RL Benchmark
Reginald McLean
Evangelos Chatzaroulas
Luc McCutcheon
Frank Röder
Tianhe Yu
Zhanpeng He
K.R. Zentner
Ryan Julian
J K Terry
Isaac Woungang
Nariman Farsad
Multi-task reinforcement learning challenges agents to master diverse skills simultaneously, and Meta-World emerged as the gold-standard benchmark for evaluating these algorithms. However, since the introduction of the Meta-World benchmark, there have been numerous undocumented changes that inhibit fair comparison of multi-task and meta reinforcement learning algorithms. This work strives to disambiguate these results from the literature, while also producing an open-source version of Meta-World that has full reproducibility of past results.
PyLO: Towards Accessible Learned Optimizers in PyTorch
Learned optimizers have been an active research topic over the past decade, with increasing progress toward practical, general-purpose optimizers that can serve as drop-in replacements for widely used methods like Adam. However, recent advances -- such as VeLO, which was meta-trained for 4000 TPU-months -- remain largely inaccessible to the broader community, in part due to their reliance on JAX and the absence of user-friendly packages for applying the optimizers after meta-training. To address this gap, we introduce PyLO, a PyTorch-based library that brings learned optimizers to the broader machine learning community through familiar, widely adopted workflows. Unlike prior work focused on synthetic or convex tasks, our emphasis is on applying learned optimization to real-world large-scale pre-training tasks. Our release includes a CUDA-accelerated version of the small_fc_lopt learned optimizer architecture from Metz et al. (2022a), delivering substantial speedups -- from 39.36 to 205.59 samples/sec throughput for training ViT B/16 with batch size 32. PyLO also allows us to easily combine learned optimizers with existing optimization tools such as learning rate schedules and weight decay, and we find that learned optimizers benefit substantially from doing so. Our code is available at https://github.com/Belilovsky-Lab/pylo
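To illustrate the drop-in pattern the abstract describes, here is a minimal sketch (not PyLO's documented API; torch.optim.AdamW stands in where a learned optimizer from the repo would be constructed) showing how a single optimizer swap composes with a learning rate schedule:

```python
import torch
import torch.nn as nn

# Toy model and data; in the paper's setting this would be a large-scale
# pre-training run such as ViT-B/16.
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))

# Placeholder optimizer: a learned optimizer from PyLO would be constructed
# here instead (see https://github.com/Belilovsky-Lab/pylo for the actual API).
opt = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=1_000)

for step in range(1_000):
    x, y = torch.randn(32, 128), torch.randint(0, 10, (32,))
    loss = nn.functional.cross_entropy(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()    # the learned update rule replaces only this call
    sched.step()  # schedules and weight decay compose with it, as in the paper
```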
Quantized Disentanglement: A Practical Approach
Vitória Barin-Pacela
Kartik Ahuja
Revisiting the Goldilocks Zone in Inhomogeneous Networks
Zacharie Garnier Cuchet
We investigate how architectural inhomogeneities—such as biases, layer normalization, and residual connections—affect the curvature of the loss landscape at initialization and its link to trainability. We focus on the Goldilocks zone, a region in parameter space with excess positive curvature, previously associated with improved optimization in homogeneous networks. To extend this analysis, we compare two scaling strategies: weight scaling and softmax temperature scaling. Our results show that in networks with biases or residual connections, both strategies identify a Goldilocks zone aligned with better training. In contrast, layer normalization leads to lower or negative curvature, yet stable optimization—revealing a disconnect between curvature and trainability. Softmax temperature scaling behaves more consistently across models, making it a more robust probe. Overall, the Goldilocks zone remains relevant in inhomogeneous networks, but its geometry and predictive power depend on architectural choices, particularly normalization.
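As a rough illustration of the curvature statistic involved (a minimal sketch under our own assumptions, not the paper's code: the Goldilocks-zone literature measures excess positive curvature via quantities such as trace(H)/||H||_F of the loss Hessian at initialization), it can be computed exactly for a toy MLP:

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Toy MLP (4 -> 8 -> 3) with biases, parameters packed into one flat vector so
# the full loss Hessian at initialization can be computed exactly.
d_in, d_h, d_out = 4, 8, 3
shapes = [(d_h, d_in), (d_h,), (d_out, d_h), (d_out,)]
sizes = [math.prod(s) for s in shapes]

def unpack(theta):
    params, i = [], 0
    for s, n in zip(shapes, sizes):
        params.append(theta[i:i + n].reshape(s))
        i += n
    return params

x = torch.randn(64, d_in)
y = torch.randint(0, d_out, (64,))

def loss_fn(theta):
    W1, b1, W2, b2 = unpack(theta)
    h = torch.relu(x @ W1.T + b1)
    logits = h @ W2.T + b2          # a softmax temperature would rescale these
    return F.cross_entropy(logits, y)

theta0 = 0.5 * torch.randn(sum(sizes))  # the weight-scaling knob

H = torch.autograd.functional.hessian(loss_fn, theta0)
eigs = torch.linalg.eigvalsh(H)
print("trace(H)/||H||_F:", (eigs.sum() / eigs.norm()).item())
print("fraction of positive eigenvalues:", (eigs > 0).float().mean().item())
```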
Spaced Scheduling for Large Language Model Training
Amine El hattami
TGM: A Modular Framework for Machine Learning on Temporal Graphs
While deep learning on static graphs has been revolutionized by standardized libraries like PyTorch Geometric and DGL, machine learning on Temporal Graphs (TG), networks that evolve over time, lacks comparable software infrastructure. Existing TG libraries are limited in scope, focusing on a single method category or specific algorithms. To address this gap, we introduce Temporal Graph Modelling (TGM), a comprehensive framework for machine learning on temporal graphs. Through a modular architecture, TGM is the first library to support both discrete- and continuous-time TG methods, and it implements a wide range of them. The TGM framework combines an intuitive front-end API with optimized backend storage, enabling reproducible research and efficient experimentation at scale. Key features include graph-level optimizations for offline training and built-in performance profiling capabilities. In extensive benchmarking on five real-world networks, TGM is up to 6 times faster than the widely used DyGLib library on TGN and TGAT models and up to 8 times faster than the UTG framework for converting edges into coarse-grained snapshots.
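For readers unfamiliar with the discrete-time view mentioned at the end, a minimal, hypothetical sketch of converting a timestamped edge stream into coarse-grained snapshots -- not TGM's or UTG's actual API -- looks like this:

```python
from collections import defaultdict

def edges_to_snapshots(edges, window):
    """Group a stream of timestamped edges (u, v, t) into coarse-grained
    snapshots of length `window` -- the discrete-time view of a temporal graph.
    Returns {snapshot_index: list of (u, v, t)}."""
    snapshots = defaultdict(list)
    for u, v, t in edges:
        snapshots[int(t // window)].append((u, v, t))
    return dict(snapshots)

# Toy stream: 6 edges over 30 time units, bucketed into 10-unit snapshots.
stream = [(0, 1, 1.0), (1, 2, 4.5), (0, 2, 11.0),
          (2, 3, 14.2), (3, 0, 21.0), (1, 3, 29.9)]
print(edges_to_snapshots(stream, window=10.0))
```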
Towards Fair In-Context Learning with Tabular Foundation Models
Patrik Joslin Kenfack
Tabular foundation models have shown promising in-context learning capabilities on structured data by using training examples as context without further parameter adjustments. This emerging approach positions itself as a competitive alternative to traditional gradient-boosted tree methods. However, while biases in conventional machine learning models are well documented, it remains unclear how these biases manifest in Tabular ICL. This paper investigates the fairness implications of Tabular ICL and explores three preprocessing strategies—correlation removal, group-balanced demonstration selection, and uncertainty-based demonstration selection—to address bias. Comprehensive experiments indicate that uncertainty-based demonstration selection consistently enhances group fairness in the predictions. The source code for reproducing the results of this work can be found at https://anonymous.4open.science/r/Fair-TabICL-DD84.
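As a rough illustration of one of the three strategies, here is a hypothetical sketch of group-balanced demonstration selection under our own naming, not the paper's implementation (see the linked repository for the actual code):

```python
import numpy as np

def group_balanced_demonstrations(X, y, group, n_demos, rng=None):
    """Pick an in-context demonstration set with (roughly) equal counts per
    protected group. X, y, group index the candidate training pool; `group`
    holds each row's protected-attribute value. Returns indices into the pool."""
    rng = np.random.default_rng(rng)
    groups = np.unique(group)
    per_group = n_demos // len(groups)
    chosen = []
    for g in groups:
        idx = np.flatnonzero(group == g)
        take = min(per_group, len(idx))
        chosen.extend(rng.choice(idx, size=take, replace=False))
    return np.array(chosen)

# Toy usage: 200 candidate rows, binary protected attribute, 32 demonstrations.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = rng.integers(0, 2, size=200)
group = rng.integers(0, 2, size=200)
demo_idx = group_balanced_demonstrations(X, y, group, n_demos=32, rng=0)
print(np.bincount(group[demo_idx]))  # roughly equal counts per group
```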
Tracing the representation geometry of language models from pretraining to post-training
Melody Zixuan Li
Kumar Krishna Agrawal
Komal Kumar Teru
The geometry of representations in a neural network can significantly impact downstream generalization. It is unknown how representation geometry changes in large language models (LLMs) over pretraining and post-training. Here, we characterize the evolving geometry of LLM representations using spectral methods (effective rank and eigenspectrum decay). With the OLMo and Pythia model families, we uncover a consistent non-monotonic sequence of three distinct geometric phases in pretraining. An initial "warmup" phase sees rapid representational compression. This is followed by an "entropy-seeking" phase, characterized by expansion of the representation manifold's effective dimensionality, which correlates with an increase in memorization. Subsequently, a "compression-seeking" phase imposes anisotropic consolidation, selectively preserving variance along dominant eigendirections while contracting others, correlating with improved downstream task performance. We link the emergence of these phases to the fundamental interplay of cross-entropy optimization, information bottleneck, and skewed data distribution. Additionally, we find that in post-training the representation geometry is further transformed: Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) correlate with another "entropy-seeking" dynamic to integrate specific instructional or preferential data, reducing out-of-distribution robustness. Conversely, Reinforcement Learning with Verifiable Rewards (RLVR) often exhibits a "compression-seeking" dynamic, consolidating reward-aligned behaviors and reducing the entropy in its output distribution. This work establishes the utility of spectral measures of representation geometry for understanding the multiphase learning dynamics within LLMs.
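To make the spectral measure concrete, here is a minimal sketch (our own toy code, not the paper's pipeline) of the effective rank of a batch of hidden states, computed as the exponential of the entropy of the normalized covariance eigenspectrum:

```python
import torch

def effective_rank(features):
    """Effective rank of a representation matrix (n_samples x d): exp of the
    Shannon entropy of the normalized covariance eigenspectrum."""
    features = features - features.mean(dim=0, keepdim=True)
    cov = features.T @ features / features.shape[0]
    eigs = torch.linalg.eigvalsh(cov).clamp(min=0)
    p = eigs / eigs.sum()
    p = p[p > 0]
    return torch.exp(-(p * p.log()).sum()).item()

# Toy usage with random hidden states standing in for LLM activations.
h = torch.randn(512, 768)                  # isotropic: effective rank near 512-768
print(effective_rank(h))
h_aniso = h * torch.logspace(0, -3, 768)   # skewed spectrum: much lower
print(effective_rank(h_aniso))
```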
Two-point deterministic equivalence for SGD in random feature models
Alexander Atanasov
Blake Bordelon
Jacob A Zavatone-Veth
Cengiz Pehlevan
Ultrasound and MRI-based evaluation of relationships between morphological and mechanical properties of the lower lumbar multifidus muscle in chronic low back pain.
Neda Naghdi
Sara Masi
Cléo Bertrand
Brent Rosenstein
Hassan Rivaz
Mathieu Roy
Maryse Fortin