Credit-based self organizing maps: training deep topographic networks with minimal performance degradation
Amirozhan Dehghani
Xinyu Qian
Asa Farahani
In the primate neocortex, neurons with similar function are often found to be spatially close. Kohonen's self-organizing map (SOM) has been one of the most influential approaches for simulating brain-like topographical organization in artificial neural network models. However, integrating these maps into deep neural networks with a multitude of layers has been challenging, with self-organized deep neural networks suffering from substantially diminished capacity to perform visual recognition. We identified a key factor leading to the performance degradation in self-organized topographical neural network models: the discord between predominantly bottom-up learning updates in the self-organizing maps, and those derived from top-down, credit-based learning approaches. To address this, we propose an alternative self-organization algorithm, tailored to align with the top-down learning processes in deep neural networks. This model not only emulates critical aspects of cortical topography but also significantly narrows the performance gap between non-topographical and topographical models. This advancement underscores the substantial importance of top-down-assigned credits in shaping topographical organization. Our findings are a step toward reconciling topographical modeling with the functional efficacy of neural network models, paving the way for more brain-like neural architectures.
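As background for the bottom-up updates the abstract contrasts with credit-based learning, here is a minimal sketch of a classic Kohonen SOM update, not the credit-based variant proposed in the paper; the grid size, learning rate, and neighborhood width are illustrative assumptions.

```python
# Minimal Kohonen SOM sketch (classic bottom-up update, for background only;
# not the credit-based algorithm proposed in the paper above).
import numpy as np

def train_som(data, grid_h=10, grid_w=10, lr=0.5, sigma=2.0, epochs=20, seed=0):
    rng = np.random.default_rng(seed)
    n, d = data.shape
    weights = rng.normal(size=(grid_h, grid_w, d))   # one prototype per map unit
    ys, xs = np.mgrid[0:grid_h, 0:grid_w]            # unit coordinates on the map

    for epoch in range(epochs):
        decay = np.exp(-epoch / epochs)              # anneal learning rate and neighborhood
        for x in data[rng.permutation(n)]:
            # best-matching unit: the prototype closest to the input
            dists = np.linalg.norm(weights - x, axis=-1)
            bi, bj = np.unravel_index(np.argmin(dists), dists.shape)
            # Gaussian neighborhood centred on the winning unit
            grid_dist2 = (ys - bi) ** 2 + (xs - bj) ** 2
            h = np.exp(-grid_dist2 / (2 * (sigma * decay) ** 2))
            # bottom-up update: pull nearby prototypes toward the input
            weights += (lr * decay) * h[..., None] * (x - weights)
    return weights

# Example: organize 3-D points on a 10x10 map
som = train_som(np.random.default_rng(1).random((500, 3)))
```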
Don't flatten, tokenize! Unlocking the key to SoftMoE's efficacy in deep RL
Ghada Sokar
Johan Samir Obando Ceron
The use of deep neural networks in reinforcement learning (RL) often suffers from performance degradation as model size increases. While soft mixtures of experts (SoftMoEs) have recently shown promise in mitigating this issue for online RL, the reasons behind their effectiveness remain largely unknown. In this work, we provide an in-depth analysis identifying the key factors driving this performance gain. We discover the surprising result that tokenizing the encoder output, rather than the use of multiple experts, is what drives the efficacy of SoftMoEs. Indeed, we demonstrate that even with an appropriately scaled single expert, we are able to maintain the performance gains, largely thanks to tokenization.
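A small sketch contrasting the two ways of feeding a convolutional encoder's output downstream; shapes and tensor names are illustrative assumptions, not the paper's exact code.

```python
# "Flatten" vs. "tokenize" for a conv encoder's output (illustrative shapes).
import torch

feat = torch.randn(32, 64, 7, 7)          # (batch, channels, H, W) from a conv encoder

# Flatten: collapse everything into one long vector per observation.
flat = feat.reshape(32, -1)                # (32, 3136) -> a single token of dim 3136

# Tokenize: keep the spatial positions as separate tokens of dim = channels,
# the ingredient the paper identifies as driving SoftMoE's gains.
tokens = feat.flatten(2).transpose(1, 2)   # (32, 49, 64) -> 49 tokens of dim 64

print(flat.shape, tokens.shape)
```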
Dualformer: Controllable Fast and Slow Thinking by Learning with Randomized Reasoning Traces
DiJia Su
Sainbayar Sukhbaatar
Yuandong Tian
Qinqing Zheng
Efficient Diversity-Preserving Diffusion Alignment via Gradient-Informed GFlowNets
Zhen Liu
Tim Z. Xiao
Weiyang Liu
Dinghuai Zhang
While one commonly trains large diffusion models by collecting datasets on target downstream tasks, it is often desired to align and finetune pretrained diffusion models on some reward functions that are either designed by experts or learned from small-scale datasets. Existing methods for finetuning diffusion models typically suffer from lack of diversity in generated samples, lack of prior preservation, and/or slow convergence in finetuning. Inspired by recent successes in generative flow networks (GFlowNets), a class of probabilistic models that sample in proportion to an unnormalized reward density, we propose a novel GFlowNet method dubbed Nabla-GFlowNet (abbreviated as
Enabling Realtime Reinforcement Learning at Scale with Staggered Asynchronous Inference
Matthew D Riemer
Gopeshh Subbaraj
Realtime environments change even as agents perform action inference and learning, thus requiring high interaction frequencies to effectively minimize regret. However, recent advances in machine learning involve larger neural networks with longer inference times, raising questions about their applicability in realtime systems where reaction time is crucial. We present an analysis of lower bounds on regret in realtime reinforcement learning (RL) environments to show that minimizing long-term regret is generally impossible within the typical sequential interaction and learning paradigm, but often becomes possible when sufficient asynchronous compute is available. We propose novel algorithms for staggering asynchronous inference processes to ensure that actions are taken at consistent time intervals, and demonstrate that the use of models with high action inference times is only constrained by the environment's effective stochasticity over the inference horizon, and not by action frequency. Our analysis shows that the number of inference processes needed scales linearly with increasing inference times while enabling use of models that are multiple orders of magnitude larger than existing approaches when learning from a realtime simulation of Game Boy games such as Pokémon and Tetris.
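A toy simulation of the staggering idea: if a single forward pass takes `inference_delay` environment steps, launching that many staggered inference processes keeps one action landing every step. This is purely illustrative; the paper's algorithms and regret analysis are not reproduced here.

```python
# Staggered asynchronous inference, simulated in a single process for clarity.
from collections import deque

def staggered_rollout(total_steps=12, inference_delay=3):
    # each pending entry is (step_started, worker_id); results arrive after `inference_delay`
    pending = deque()
    for step in range(total_steps):
        worker = step % inference_delay                      # round-robin stagger
        pending.append((step, worker))
        # actions whose inference finished by this step are applied now
        while pending and pending[0][0] + inference_delay <= step:
            started, w = pending.popleft()
            print(f"step {step}: apply action from worker {w}, computed on obs at step {started}")

staggered_rollout()
```

After a warm-up of `inference_delay` steps, exactly one action is applied per environment step, even though each individual inference call spans several steps.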
Expressivity of Neural Networks with Random Weights and Learned Biases
Ezekiel Williams
Avery Hee-Woon Ryoo
Thomas Jiralerspong
Alexandre Payeur
Luca Mazzucato
Fair Resource Allocation in Weakly Coupled Markov Decision Processes
Xiaohui Tu
Yossiri Adulyasak
Nima Akbarzadeh
We consider fair resource allocation in sequential decision-making environments modeled as weakly coupled Markov decision processes, where resource constraints couple the action spaces of
Fast Convergence of Softmax Policy Mirror Ascent
Reza Asad
Reza Babanezhad Harikandeh
Issam Hadj Laradji
Sharan Vaswani
Natural policy gradient (NPG) is a common policy optimization algorithm and can be viewed as mirror ascent in the space of probabilities. Recently,~\citet{vaswani2021general} introduced a policy gradient method that corresponds to mirror ascent in the dual space of logits. We refine this algorithm by removing its need for a normalization across actions and analyze the resulting method (referred to as SPMA). For tabular MDPs, we prove that SPMA with a constant step-size matches the linear convergence of NPG and achieves a faster convergence than constant step-size (accelerated) softmax policy gradient. To handle large state-action spaces, we extend SPMA to use a log-linear policy parameterization. Unlike that for NPG, generalizing SPMA to the linear function approximation (FA) setting does not require compatible function approximation. Unlike MDPO, a practical generalization of NPG, SPMA with linear FA only requires solving convex softmax classification problems. We prove that SPMA achieves linear convergence to the neighbourhood of the optimal value function. We extend SPMA to handle non-linear FA and evaluate its empirical performance on the MuJoCo and Atari benchmarks. Our results demonstrate that SPMA consistently achieves similar or better performance compared to MDPO, PPO and TRPO.
Faster, More Efficient RLHF through Off-Policy Asynchronous Learning
Michael Noukhovitch
Shengyi Huang
Sophie Xhonneux
Arian Hosseini
To achieve state-of-the-art chatbots, large language models are finetuned with reinforcement learning (RL), frequently to optimize human feedback (RLHF). This process is computationally expensive and can take weeks. Offline approaches, like DPO, learn on a static dataset and are efficient but not performant. The dominant paradigm, online and on-policy---synchronously generating from the model, labelling with a reward model, and learning on feedback from the model's own outputs---is performant but not efficient. Following prior work in the general deep RL setting, we propose separating the actor and learner in RLHF. This enables asynchronous generation of new samples while learning on prior samples, thus leading to overall faster training and better scaling. But this requires a novel regime for RLHF, online but off-policy: learning on samples from a previous version of our model. We ask a fundamental question: how much off-policyness can we tolerate for asynchronous training to speed up learning but maintain performance? We find that a contrastive loss, Online DPO, is most robust to off-policy data and that robustness increases with the scale of the policy model. We show even further compute optimizations but demonstrate that they come at a performance cost, giving rise to a trade-off. Finally, we verify our design choices by training LLaMA 3.1 8B with RLHF as a helpful chatbot in half the time of a synchronous run while matching final performance.
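A schematic of the actor/learner split described above. Everything below is a toy stand-in (string "completions", random rewards, no real model); only the control flow mirrors the setup in the abstract: an actor keeps generating with a slightly stale policy copy while the learner updates on earlier, off-policy samples.

```python
# Actor/learner split for asynchronous RLHF, with toy stand-ins for the models.
import queue, random, threading

sample_queue = queue.Queue(maxsize=64)

def actor(num_prompts):
    for i in range(num_prompts):
        completions = [f"reply-{i}-{k}" for k in range(2)]  # stand-in for sampling from a stale policy
        rewards = [random.random() for _ in completions]    # stand-in for the reward model
        sample_queue.put((completions, rewards))

def learner(num_steps):
    for step in range(num_steps):
        completions, rewards = sample_queue.get()           # data from a previous policy version
        # stand-in for an Online DPO update on off-policy samples
        chosen = completions[rewards.index(max(rewards))]
        print(f"step {step}: update toward {chosen}")

threading.Thread(target=actor, args=(8,), daemon=True).start()
learner(8)
```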
Feasible Learning
Juan Ramirez
Ignacio Hounie
Juan Elenter
Jose Gallego-Posada
Meraj Hashemizadeh
Alejandro Ribeiro
We introduce Feasible Learning (FL), a sample-centric learning paradigm where models are trained by solving a feasibility problem that bounds the loss for each training sample. In contrast to the ubiquitous Empirical Risk Minimization (ERM) framework, which optimizes for average performance, FL demands satisfactory performance *on every individual data point*. Since any model that meets the prescribed performance threshold is a valid FL solution, the choice of optimization algorithm and its dynamics play a crucial role in shaping the properties of the resulting solutions. In particular, we study a primal-dual approach which dynamically re-weights the importance of each sample during training. To address the challenge of setting a meaningful threshold in practice, we introduce a relaxation of FL that incorporates slack variables of minimal norm. Our empirical analysis, spanning image classification, age regression, and preference optimization in large language models, demonstrates that models trained via FL can learn from data while displaying improved tail behavior compared to ERM, with only a marginal impact on average performance.
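A minimal primal-dual sketch in the spirit of the approach described above: keep one multiplier per sample, grow it while that sample's loss exceeds a target bound, and weight the primal update accordingly. The model, data, step sizes, and threshold are illustrative assumptions, not the paper's implementation.

```python
# Per-sample constrained training sketch: enforce loss_i <= eps via dual ascent.
import torch

torch.manual_seed(0)
X, y = torch.randn(128, 5), torch.randn(128, 1)
model = torch.nn.Linear(5, 1)
lam = torch.zeros(128)                                # one dual variable per sample
opt = torch.optim.SGD(model.parameters(), lr=0.05)
eps, dual_lr = 0.5, 0.1                               # per-sample loss bound and dual step size

for _ in range(200):
    losses = ((model(X) - y) ** 2).squeeze(1)         # per-sample losses
    opt.zero_grad()
    (lam.detach() * losses).mean().backward()         # primal step, re-weighted by the multipliers
    opt.step()
    with torch.no_grad():                             # dual ascent on the constraint loss_i <= eps
        lam = torch.clamp(lam + dual_lr * (losses - eps), min=0.0)

print(f"samples violating the bound: {(losses > eps).sum().item()}")
```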
Forgetting Transformer: Softmax Attention with a Forget Gate
Zhixuan Lin
Evgenii Nikishin
Xu He
An essential component of modern recurrent sequence models is the forget gate. While Transformers do not have an explicit recurrent form, we show that a forget gate can be naturally incorporated into Transformers by down-weighting the unnormalized attention scores in a data-dependent way. We name this attention mechanism Forgetting Attention and the resulting model the Forgetting Transformer (FoX). We show that FoX outperforms the Transformer on long-context language modeling, length extrapolation, and short-context downstream tasks, while performing on par with the Transformer on long-context downstream tasks. Moreover, it is compatible with the FlashAttention algorithm and does not require any positional embeddings. Several analyses, including the needle-in-the-haystack test, show that FoX also retains the Transformer's superior long-context capabilities over recurrent sequence models such as Mamba-2, HGRN2, and DeltaNet. We also introduce a "Pro" block design that incorporates some common architectural components in recurrent sequence models and find it significantly improves the performance of both FoX and the Transformer. Our code is available at [`https://github.com/zhixuan-lin/forgetting-transformer`](https://github.com/zhixuan-lin/forgetting-transformer).
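A single-head sketch of the data-dependent down-weighting described above: each position emits a forget gate in (0, 1), and the attention logit from query i to key j is biased by the sum of log-gates between j and i, so older keys are discounted before the softmax. Dimensions and the gate parameterization are illustrative assumptions; see the linked repository for the authors' implementation.

```python
# Forgetting-Attention-style sketch: bias causal attention logits with cumulative log forget gates.
import torch, torch.nn.functional as F

torch.manual_seed(0)
T, d = 6, 8
x = torch.randn(1, T, d)
Wq, Wk, Wv = (torch.randn(d, d) * d ** -0.5 for _ in range(3))
w_gate = torch.randn(d, 1) * d ** -0.5

q, k, v = x @ Wq, x @ Wk, x @ Wv
log_f = F.logsigmoid(x @ w_gate).squeeze(-1)          # per-position log forget gate, shape (1, T)

# D[i, j] = sum_{k=j+1..i} log f_k for j <= i (0 on the diagonal); masked above the diagonal
csum = torch.cumsum(log_f, dim=-1)
D = csum.unsqueeze(-1) - csum.unsqueeze(-2)           # (1, T, T): csum[i] - csum[j]
causal = torch.tril(torch.ones(T, T, dtype=torch.bool))
scores = (q @ k.transpose(-1, -2)) / d ** 0.5 + D
scores = scores.masked_fill(~causal, float("-inf"))

out = torch.softmax(scores, dim=-1) @ v               # (1, T, d)
print(out.shape)
```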