Publications

The Impact of Positional Encoding on Length Generalization in Transformers
Inkit Padhi
Karthikeyan Natesan Ramamurthy
Payel Das
Length generalization, the ability to generalize from small training context sizes to larger ones, is a critical challenge in the development of Transformer-based language models. Positional encoding (PE) has been identified as a major factor influencing length generalization, but the exact impact of different PE schemes on extrapolation in downstream tasks remains unclear. In this paper, we conduct a systematic empirical study comparing the length generalization performance of decoder-only Transformers with five different position encoding approaches including Absolute Position Embedding (APE), T5's Relative PE, ALiBi, and Rotary, in addition to Transformers without positional encoding (NoPE). Our evaluation encompasses a battery of reasoning and mathematical tasks. Our findings reveal that the most commonly used positional encoding methods, such as ALiBi, Rotary, and APE, are not well suited for length generalization in downstream tasks. More importantly, NoPE outperforms other explicit positional encoding methods while requiring no additional computation. We theoretically demonstrate that NoPE can represent both absolute and relative PEs, but when trained with SGD, it mostly resembles T5's relative PE attention patterns. Finally, we find that scratchpad is not always helpful to solve length generalization and its format highly impacts the model's performance. Overall, our work suggests that explicit position embeddings are not essential for decoder-only Transformers to generalize well to longer sequences.
Thinker: Learning to Plan and Act
Stephen Chung
David Krueger
We propose the Thinker algorithm, a novel approach that enables reinforcement learning agents to autonomously interact with and utilize a learned world model. The Thinker algorithm wraps the environment with a world model and introduces new actions designed for interacting with the world model. These model-interaction actions enable agents to perform planning by proposing alternative plans to the world model before selecting a final action to execute in the environment. This approach eliminates the need for handcrafted planning algorithms by enabling the agent to learn how to plan autonomously and allows for easy interpretation of the agent's plan with visualization. We demonstrate the algorithm's effectiveness through experimental results in the game of Sokoban and the Atari 2600 benchmark, where the Thinker algorithm achieves state-of-the-art performance and competitive results, respectively. Visualizations of agents trained with the Thinker algorithm demonstrate that they have learned to plan effectively with the world model to select better actions. Thinker is the first work showing that an RL agent can learn to plan with a learned world model in complex environments.
Towards Hybrid-grained Feature Interaction Selection for Deep Sparse Network
Fuyuan Lyu
Xing Tang
Dugang Liu
Weihong Luo
Liang Chen
Xiuqiang He
Xue Liu
A Unified, Scalable Framework for Neural Population Decoding
Mehdi Azabou
Vinam Arora
Venkataramana Ganesh
Santosh Nachimuthu
Michael J. Mendelson
Matthew G. Perich
Eva L. Dyer
Our ability to use deep learning approaches to decipher neural activity would likely benefit from greater scale, in terms of both model size and datasets. However, the integration of many neural recordings into one unified model is challenging, as each recording contains the activity of different neurons from different individual animals. In this paper, we introduce a training framework and architecture designed to model the population dynamics of neural activity across diverse, large-scale neural recordings. Our method first tokenizes individual spikes within the dataset to build an efficient representation of neural events that captures the fine temporal structure of neural activity. We then employ cross-attention and a PerceiverIO backbone to further construct a latent tokenization of neural population activities. Utilizing this architecture and training framework, we construct a large-scale multi-session model trained on large datasets from seven nonhuman primates, spanning over 158 different sessions of recording from over 27,373 neural units and over 100 hours of recordings. In a number of different tasks, we demonstrate that our pretrained model can be rapidly adapted to new, unseen sessions with unspecified neuron correspondence, enabling few-shot performance with minimal labels. This work presents a powerful new approach for building deep learning tools to analyze neural data and stakes out a clear path to training at scale.
When Do Transformers Shine in RL? Decoupling Memory from Credit Assignment
Reinforcement learning (RL) algorithms face two distinct challenges: learning effective representations of past and present observations, and determining how actions influence future returns. Both challenges involve modeling long-term dependencies. The Transformer architecture has been very successful at solving problems that involve long-term dependencies, including in the RL domain. However, the underlying reason for the strong performance of Transformer-based RL methods remains unclear: is it because they learn effective memory, or because they perform effective credit assignment? After introducing formal definitions of memory length and credit assignment length, we design simple configurable tasks to measure these distinct quantities. Our empirical results reveal that Transformers can enhance the memory capability of RL algorithms, scaling up to tasks that require memorizing observations
Conserving avian evolutionary history can effectively safeguard future benefits for people
Rikki Gumbs
Claudia L. Gray
Michael Hoffmann
Rafael Molina-Venegas
Nisha Owen
Phylogenetic diversity (PD), the evolutionary history of a set of species, is conceptually linked to the maintenance of yet-to-be-discovered benefits from biodiversity or “option value.” We used global phylogenetic and utilization data for birds to test the PD option value link, under the assumption that the performance of sets of PD-maximizing species at capturing known benefits is analogous to selecting the same species at a point in human history before these benefits were realized. PD performed better than random at capturing utilized bird species across 60% of tests, with performance linked to the phylogenetic dispersion and prevalence of each utilization category. Prioritizing threatened species for conservation by the PD they encapsulate performs comparably to prioritizing by their functional distinctiveness. However, species selected by each metric show low overlap, indicating that we should conserve both components of biodiversity to effectively conserve a variety of uses. Our findings provide empirical support for the link between evolutionary history and benefits for future generations.
In-Context Learning for Text Classification with Many Labels
M-TAG: A modular teaching-aid for Geant4
Liam Carroll
S. Enger
Estimating the population effectiveness of interventions against COVID-19 in France: a modelling study
Iris Ganser
David L Buckeridge
Jane M Heffernan
M. Prague
Rodolphe Thiébaut
Background: Non-pharmaceutical interventions (NPIs) and vaccines have been widely used to manage the COVID-19 pandemic. However, uncertainty persists regarding the effectiveness of these interventions due to data quality issues, methodological challenges, and differing contextual factors. Accurate estimation of their effects is crucial for future epidemic preparedness. Methods: To address this, we developed a population-based mechanistic model that includes the impact of NPIs and vaccines on SARS-CoV-2 transmission and hospitalization rates. Our statistical approach estimated all parameters in one step, accurately propagating uncertainty. We fitted the model to comprehensive epidemiological data in France from March 2020 to October 2021. With the same model, we simulated scenarios of vaccine rollout. Results: The first lockdown was the most effective, reducing transmission by 84% (95% confidence interval (CI) 83-85). Subsequent lockdowns had diminished effectiveness (reduction of 74% (69-77) and 11% (9-18), respectively). A 6 pm curfew was more effective than one at 8 pm (68% (66-69) vs. 48% (45-49) reduction), while school closures reduced transmission by 15% (12-18). In a scenario without vaccines before November 2021, we predicted 159,000 or 194% (95% prediction interval (PI) 74-424) more deaths and 1,488,000 or 340% (136-689) more hospitalizations. If a vaccine had been available after 100 days, over 71,000 deaths (16,507-204,249) and 384,000 (88,579-1,020,386) hospitalizations could have been averted. Conclusion: Our results highlight the substantial impact of NPIs, including lockdowns and curfews, in controlling the COVID-19 pandemic. We also demonstrate the value of the 100 days objective of the CEPI initiative for vaccine availability.
Addressing uncertainty when projecting marine species' distributions under climate change
Sarah C. Davies
Patrick L. Thompson
Catalina Gómez
Jessica Nephin
Anders Knudby
Ashley E. Park
Sarah K. Friesen
Emily M. Rubidge
Sean C. Anderson
Josephine C. Iacarella
Devin A. Lyons
Andrew MacDonald
Andrew McMillan
Eric J. Ward
Amber M. Holdsworth
Neil Swart
Jeff Price
Karen L. Hunter
Artificial Intelligence for Detection of Dementia Using Motion Data: A Scoping Review
Jory Katz
Howard Bergman
Roland Grad
Vladimir Khanassov
Genevieve Gore
Isabelle Vedel
Machelle Wilchesky
Negar Ghourchian
S. A. Rahimi
Background: Dementia is a neurodegenerative disease resulting in the loss of cognitive and psychological functions. Artificial intelligence (AI) may help in detection and screening of dementia; however, little is known in this area. Objectives: The objective of this study was to identify and evaluate AI interventions for detection of dementia using motion data. Method: The review followed the framework proposed by O’Malley’s and Joanna Briggs Institute methodological guidance for scoping reviews. We adhered to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews (PRISMA-ScR) checklist for reporting the results. An information specialist performed a comprehensive search from the date of inception until November 2020, in five bibliographic databases: MEDLINE, EMBASE, Web of Science Core Collection, CINAHL, and IEEE Xplore. We included studies aimed at the deployment and testing or implementation of AI interventions using motion data for the detection of dementia among a diverse population, encompassing varying age, sex, gender, economic backgrounds, and ethnicity, extending to their health care providers across multiple health care settings. Studies were excluded if they focused on Parkinson’s or Huntington’s disease. Two independent reviewers screened the abstracts, titles, and then read the full-texts. Disagreements were resolved by consensus, and if this was not possible, the opinion of a third reviewer was sought. The reference lists of included studies were also screened. Results: After removing duplicates, 2,632 articles were obtained. After title and abstract screening and full-text screening, 839 articles were considered for categorization. The authors categorized the papers into six categories, and data extraction and synthesis was performed on 20 included papers from the motion tracking data category.
The included studies assessed cognitive performance (n = 5, 25%); screened dementia and cognitive decline (n = 8, 40%); investigated visual behaviours (n = 4, 20%); and analyzed motor behaviors (n = 3, 15%). Conclusions: We presented evidence of AI systems being employed in the detection of dementia, showcasing the promising potential of motion tracking within this domain. Although some progress has been made in this field recently, there remain notable research gaps that require further exploration and investigation. Future endeavors need to compare AI interventions using motion data with traditional screening methods or other tech-enabled dementia detection mechanisms. Besides, future works should aim at understanding how gender and sex, and ethnic and cultural sensitivity can contribute to refining AI interventions, ensuring they are accessible, equitable, and beneficial across all society.
A Distributed Data-Parallel PyTorch Implementation of the Distributed Shampoo Optimizer for Training Neural Networks At-Scale
Hao-Jun Michael Shi
Tsung-Hsien Lee
Shintaro Iwasaki
Jose Gallego-Posada
Zhijing Li
Kaushik Rangadurai
Dheevatsa Mudigere
Michael G. Rabbat