Model Parallelism With Subnetwork Data Parallelism
Vaibhav Singh
Zafir Khalid
Edouard Oyallon
Distributed pre-training of large models at scale often imposes heavy memory demands on individual nodes and incurs significant intra-node c… (see more)ommunication costs. We propose a novel alternative approach that reduces the memory requirements by training small, structured subnetworks of the model on separate workers. Unlike pipelining, our method avoids inter-node activation communication and maintains bandwidth requirements that are comparable to or lower than standard data parallel communication schemes based on all-reduce. We evaluate two subnetwork construction strategies guided by the principle of ensuring uniform representation of each parameter across the distributed training setup. Our results show that the stochastic block dropping technique consistently outperforms the width-wise subnetwork construction previously explored in federated learning. We empirically attribute this superior performance to stronger gradient alignment in subnetworks that retain blocks having skip connections. Preliminary experiments highlight the promise of our approach, achieving a
MuLoCo: Muon is a practical inner optimizer for DiLoCo
Benjamin Thérien
Xiaolong Huang
Next-Token Prediction Should be Ambiguity-Sensitive : A Meta-Learing Perspective
Leo Gagnon
Eric Elmoznino
Sarthak Mittal
Tom Marty
Tejas Kasetty
Overcoming Long-Context Limitations of State-Space Models via Context-Dependent Sparse Attention
Zhihao Zhan
Jianan Zhao
Zhaocheng Zhu
Efficient long-context modeling remains a critical challenge for natural language processing (NLP), as the time complexity of the predominan… (see more)t Transformer architecture scales quadratically with the sequence length. While state-space models (SSMs) offer alternative sub-quadratic solutions, they struggle to capture long-range dependencies effectively. In this work, we focus on analyzing and improving the long-context modeling capabilities of SSMs. We show that the widely used synthetic task, associative recall, which requires a model to recall a value associated with a single key without context, insufficiently represents the complexities of real-world long-context modeling. To address this limitation, we extend the associative recall to a novel synthetic task, \emph{joint recall}, which requires a model to recall the value associated with a key given in a specified context. Theoretically, we prove that SSMs do not have the expressiveness to solve multi-query joint recall in sub-quadratic time complexity. To resolve this issue, we propose a solution based on integrating SSMs with Context-Dependent Sparse Attention (CDSA), which has the expressiveness to solve multi-query joint recall with sub-quadratic computation. To bridge the gap between theoretical analysis and real-world applications, we propose locality-sensitive Hashing Attention with sparse Key Selection (HAX), which instantiates the theoretical solution and is further tailored to natural language domains. Extensive experiments on both synthetic and real-world long-context benchmarks show that HAX consistently outperforms SSM baselines and SSMs integrated with context-independent sparse attention (CISA).
Self-Predictive Representations for Combinatorial Generalization in Behavioral Cloning
Daniel Lawson
Adriana Hugessen
Charlotte Cloutier
Behavioral cloning (BC) methods trained with supervised learning (SL) are an effective way to learn policies from human demonstrations in do… (see more)mains like robotics. Goal-conditioning these policies enables a single generalist policy to capture diverse behaviors contained within an offline dataset. While goal-conditioned behavior cloning (GCBC) methods can perform well on in-distribution training tasks, they do not necessarily generalize zero-shot to tasks that require conditioning on novel state-goal pairs, i.e. combinatorial generalization. In part, this limitation can be attributed to a lack of temporal consistency in the state representation learned by BC; if temporally related states are encoded to similar latent representations, then the out-of-distribution gap for novel state-goal pairs would be reduced. Hence, encouraging this temporal consistency in the representation space should facilitate combinatorial generalization. Successor representations, which encode the distribution of future states visited from the current state, nicely encapsulate this property. However, previous methods for learning successor representations have relied on contrastive samples, temporal-difference (TD) learning, or both. In this work, we propose a simple yet effective representation learning objective,
The 2025 PNPL Competition: Speech Detection and Phoneme Classification in the LibriBrain Dataset
Gilad Landau
Miran Ozdogan
Gereon Elvers
Francesco Mantegna
Pratik Somaiya
Dulhan Hansaja Jayalath
Luisa Kurth
Teyun Kwon
Brendan Shillingford
Greg Farquhar
Minqi Jiang
Hamza Abdelhedi
Yorguin Mantilla Ramos
Caglar Gulcehre
M. Woolrich
Natalie Voets
Oiwi Parker Jones
The advance of speech decoding from non-invasive brain data holds the potential for profound societal impact. Among its most promising appli… (see more)cations is the restoration of communication to paralysed individuals affected by speech deficits such as dysarthria, without the need for high-risk surgical interventions. The ultimate aim of the 2025 PNPL competition is to produce the conditions for an"ImageNet moment"or breakthrough in non-invasive neural decoding, by harnessing the collective power of the machine learning community. To facilitate this vision we present the largest within-subject MEG dataset recorded to date (LibriBrain) together with a user-friendly Python library (pnpl) for easy data access and integration with deep learning frameworks. For the competition we define two foundational tasks (i.e. Speech Detection and Phoneme Classification from brain data), complete with standardised data splits and evaluation metrics, illustrative benchmark models, online tutorial code, a community discussion board, and public leaderboard for submissions. To promote accessibility and participation the competition features a Standard track that emphasises algorithmic innovation, as well as an Extended track that is expected to reward larger-scale computing, accelerating progress toward a non-invasive brain-computer interface for speech.
The 2025 PNPL Competition: Speech Detection and Phoneme Classification in the LibriBrain Dataset
Gilad Landau
Miran Ozdogan
Gereon Elvers
Francesco Mantegna
Pratik Somaiya
Dulhan Hansaja Jayalath
Luisa Kurth
Teyun Kwon
Brendan Shillingford
Greg Farquhar
Minqi Jiang
Hamza Abdelhedi
Yorguin Mantilla Ramos
Caglar Gulcehre
M. Woolrich
Natalie Voets
Oiwi Parker Jones
The advance of speech decoding from non-invasive brain data holds the potential for profound societal impact. Among its most promising appli… (see more)cations is the restoration of communication to paralysed individuals affected by speech deficits such as dysarthria, without the need for high-risk surgical interventions. The ultimate aim of the 2025 PNPL competition is to produce the conditions for an"ImageNet moment"or breakthrough in non-invasive neural decoding, by harnessing the collective power of the machine learning community. To facilitate this vision we present the largest within-subject MEG dataset recorded to date (LibriBrain) together with a user-friendly Python library (pnpl) for easy data access and integration with deep learning frameworks. For the competition we define two foundational tasks (i.e. Speech Detection and Phoneme Classification from brain data), complete with standardised data splits and evaluation metrics, illustrative benchmark models, online tutorial code, a community discussion board, and public leaderboard for submissions. To promote accessibility and participation the competition features a Standard track that emphasises algorithmic innovation, as well as an Extended track that is expected to reward larger-scale computing, accelerating progress toward a non-invasive brain-computer interface for speech.
V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
Mahmoud Assran
Adrien Bardes
David Fan
Quentin Garrido
Russell Howes
Mojtaba Komeili
Matthew J. Muckley
Ammar Rizvi
Claire Roberts
Koustuv Sinha
Artem Zholus
Sergio Arnaud
Abha Gejji
Ada Martin
Francois Robert Hogan
Daniel Dugas
Piotr Bojanowski
Vasil Khalidov
Patrick Labatut
Francisco Massa … (see 13 more)
Marc Szafraniec
K. Krishnakumar
Yong Li
Xiaodong Ma
Franziska Meier
Yann LeCun
Nicolas Ballas
Fair at Meta
Mila - Québec
AI Institute
Polytechnique Montréal
A major challenge for modern AI is to learn to understand the world and learn to act largely by observation. This paper explores a self-supe… (see more)rvised approach that combines internet-scale video data with a small amount of interaction data (robot trajectories), to develop models capable of understanding, predicting, and planning in the physical world. We first pre-train an action-free joint-embedding-predictive architecture, V-JEPA 2, on a video and image dataset comprising over 1 million hours of internet video. V-JEPA 2 achieves strong performance on motion understanding (77.3 top-1 accuracy on Something-Something v2) and state-of-the-art performance on human action anticipation (39.7 recall-at-5 on Epic-Kitchens-100) surpassing previous task-specific models. Additionally, after aligning V-JEPA 2 with a large language model, we demonstrate state-of-the-art performance on multiple video question-answering tasks at the 8 billion parameter scale (e.g., 84.0 on PerceptionTest, 76.9 on TempCompass). Finally, we show how self-supervised learning can be applied to robotic planning tasks by post-training a latent action-conditioned world model, V-JEPA 2-AC, using less than 62 hours of unlabeled robot videos from the Droid dataset. We deploy V-JEPA 2-AC zero-shot on Franka arms in two different labs and enable picking and placing of objects using planning with image goals. Notably, this is achieved without collecting any data from the robots in these environments, and without any task-specific training or reward. This work demonstrates how self-supervised learning from web-scale data and a small amount of robot interaction data can yield a world model capable of planning in the physical world.
V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
Mahmoud Assran
Adrien Bardes
David Fan
Quentin Garrido
Russell Howes
Mojtaba Komeili
Matthew J. Muckley
Ammar Rizvi
Claire Roberts
Koustuv Sinha
Artem Zholus
Sergio Arnaud
Abha Gejji
Ada Martin
Francois Robert Hogan
Daniel Dugas
Piotr Bojanowski
Vasil Khalidov
Patrick Labatut
Francisco Massa … (see 13 more)
Marc Szafraniec
K. Krishnakumar
Yong Li
Xiaodong Ma
Franziska Meier
Yann LeCun
Nicolas Ballas
Fair at Meta
Mila - Québec
AI Institute
Polytechnique Montréal
A major challenge for modern AI is to learn to understand the world and learn to act largely by observation. This paper explores a self-supe… (see more)rvised approach that combines internet-scale video data with a small amount of interaction data (robot trajectories), to develop models capable of understanding, predicting, and planning in the physical world. We first pre-train an action-free joint-embedding-predictive architecture, V-JEPA 2, on a video and image dataset comprising over 1 million hours of internet video. V-JEPA 2 achieves strong performance on motion understanding (77.3 top-1 accuracy on Something-Something v2) and state-of-the-art performance on human action anticipation (39.7 recall-at-5 on Epic-Kitchens-100) surpassing previous task-specific models. Additionally, after aligning V-JEPA 2 with a large language model, we demonstrate state-of-the-art performance on multiple video question-answering tasks at the 8 billion parameter scale (e.g., 84.0 on PerceptionTest, 76.9 on TempCompass). Finally, we show how self-supervised learning can be applied to robotic planning tasks by post-training a latent action-conditioned world model, V-JEPA 2-AC, using less than 62 hours of unlabeled robot videos from the Droid dataset. We deploy V-JEPA 2-AC zero-shot on Franka arms in two different labs and enable picking and placing of objects using planning with image goals. Notably, this is achieved without collecting any data from the robots in these environments, and without any task-specific training or reward. This work demonstrates how self-supervised learning from web-scale data and a small amount of robot interaction data can yield a world model capable of planning in the physical world.
CulturalFrames: Assessing Cultural Expectation Alignment in Text-to-Image Models and Evaluation Metrics
Shravan Nayak
Mehar Bhatia
Xiaofeng Zhang
Verena Rieser
Lisa Anne Hendricks
Sjoerd van Steenkiste
Yash Goyal
Karolina Stanczak
The increasing ubiquity of text-to-image (T2I) models as tools for visual content generation raises concerns about their ability to accurate… (see more)ly represent diverse cultural contexts. In this work, we present the first study to systematically quantify the alignment of T2I models and evaluation metrics with respect to both explicit as well as implicit cultural expectations. To this end, we introduce CulturalFrames, a novel benchmark designed for rigorous human evaluation of cultural representation in visual generations. Spanning 10 countries and 5 socio-cultural domains, CulturalFrames comprises 983 prompts, 3637 corresponding images generated by 4 state-of-the-art T2I models, and over 10k detailed human annotations. We find that T2I models not only fail to meet the more challenging implicit expectations but also the less challenging explicit expectations. Across models and countries, cultural expectations are missed an average of 44% of the time. Among these failures, explicit expectations are missed at a surprisingly high average rate of 68%, while implicit expectation failures are also significant, averaging 49%. Furthermore, we demonstrate that existing T2I evaluation metrics correlate poorly with human judgments of cultural alignment, irrespective of their internal reasoning. Collectively, our findings expose critical gaps, providing actionable directions for developing more culturally informed T2I models and evaluation methodologies.
CulturalFrames: Assessing Cultural Expectation Alignment in Text-to-Image Models and Evaluation Metrics
Shravan Nayak
Mehar Bhatia
Xiaofeng Zhang
Verena Rieser
Lisa Anne Hendricks
Sjoerd van Steenkiste
Yash Goyal
Karolina Stanczak
The increasing ubiquity of text-to-image (T2I) models as tools for visual content generation raises concerns about their ability to accurate… (see more)ly represent diverse cultural contexts. In this work, we present the first study to systematically quantify the alignment of T2I models and evaluation metrics with respect to both explicit as well as implicit cultural expectations. To this end, we introduce CulturalFrames, a novel benchmark designed for rigorous human evaluation of cultural representation in visual generations. Spanning 10 countries and 5 socio-cultural domains, CulturalFrames comprises 983 prompts, 3637 corresponding images generated by 4 state-of-the-art T2I models, and over 10k detailed human annotations. We find that T2I models not only fail to meet the more challenging implicit expectations but also the less challenging explicit expectations. Across models and countries, cultural expectations are missed an average of 44% of the time. Among these failures, explicit expectations are missed at a surprisingly high average rate of 68%, while implicit expectation failures are also significant, averaging 49%. Furthermore, we demonstrate that existing T2I evaluation metrics correlate poorly with human judgments of cultural alignment, irrespective of their internal reasoning. Collectively, our findings expose critical gaps, providing actionable directions for developing more culturally informed T2I models and evaluation methodologies.
Did I Faithfully Say What I Thought? Bridging the Gap Between Neural Activity and Self-Explanations in Large Language Models
Milan Bhan
Jean-Noël Vittaut
Nicolas Chesneau
Marie-Jeanne Lesot
Large Language Models (LLM) have demonstrated the capability of generating free text self Natural Language Explanation (self-NLE) to justify… (see more) their answers. Despite their logical appearance, self-NLE do not necessarily reflect the LLM actual decision-making process, making such explanations unfaithful. While existing methods for measuring self-NLE faithfulness mostly rely on behavioral tests or computational block identification, none of them examines the neural activity underlying the model's reasoning. This work introduces a novel flexible framework for quantitatively measuring the faithfulness of LLM-generated self-NLE by directly comparing the latter with interpretations of the model's internal hidden states. The proposed framework is versatile and provides deep insights into self-NLE faithfulness by establishing a direct connection between self-NLE and model reasoning. This approach advances the understanding of self-NLE faithfulness and provides building blocks for generating more faithful self-NLE.