Jonathan Pilault

JaxPruner: A concise library for sparsity research

Joo Hyung Lee

Wonpyo Park

Nicole Elyse Mitchell

Gintare Karolina Dziugaite

Johan Samir Obando Ceron

Han-Byul Kim

Namhoon Lee

Elias Frantar

Yun Long

Amir Yazdanbakhsh

Shivani Agrawal

Suvinay Subramanian

Xin Wang

Sheng-Chun Kao

Xingyao Zhang

Trevor Gale

Aart J.C. Bik

Woohyun Han

Milen Ferev

Zhonglin Han … (see 5 more)

Hong-Seok Kim

Yann Dauphin

Pablo Samuel Castro

Utku Evci

This paper introduces JaxPruner, an open-source JAX-based pruning and sparse training library for machine learning research. JaxPruner aims to accelerate research on sparse neural networks by providing concise implementations of popular pruning and sparse training algorithms with minimal memory and latency overhead. Algorithms implemented in JaxPruner use a common API and work seamlessly with the popular optimization library Optax, which, in turn, enables easy integration with existing JAX-based libraries. We demonstrate this ease of integration by providing examples in four different codebases (Scenic, t5x, Dopamine, and FedJAX) and provide baseline experiments on popular benchmarks.

2024-01-08

Conference on Parsimony and Learning (published)

Block-State Transformers

Mahan Fathi

Orhan Firat

Pierre-Luc Bacon

Ross Goroshin

2023-09-21

NeurIPS.cc/2023/Conference (poster)

Block-State Transformers

Mahan Fathi

Orhan Firat

2023-06-15

ArXiv (preprint)

Block-State Transformers

Mahan Fathi

Orhan Firat

State space models (SSMs) have shown impressive results on tasks that require modeling long-range dependencies and efficiently scale to long sequences owing to their subquadratic runtime complexity. Originally designed for continuous signals, SSMs have shown superior performance on a plethora of tasks in vision and audio; however, SSMs still lag behind Transformer performance in language modeling tasks. In this work, we propose a hybrid layer named Block-State Transformer (BST) that internally combines an SSM sublayer for long-range contextualization and a Block Transformer sublayer for short-term representation of sequences. We study three different, and completely parallelizable, variants that integrate SSMs and block-wise attention. We show that our model outperforms similar Transformer-based architectures on language modeling perplexity and generalizes to longer sequences. In addition, the Block-State Transformer demonstrates a more than tenfold increase in speed at the layer level compared to the Block-Recurrent Transformer when model parallelization is employed.

2023-06-15

ArXiv (preprint)

Block-State Transformers

Mahan Fathi

Orhan Firat

2023-01-01

NeurIPS (published)

Using Graph Algorithms to Pretrain Graph Completion Transformers

Mikhail Galkin

Bahare Fatemi

Perouz Taslakian

David Vasquez

Recent work on Graph Neural Networks has demonstrated that self-supervised pretraining can further enhance performance on downstream graph, link, and node classification tasks. However, the efficacy of pretraining tasks has not been fully investigated for downstream large knowledge graph completion tasks. Using a contextualized knowledge graph embedding approach, we investigate five different pretraining signals, constructed using several graph algorithms and no external data, as well as their combination. We leverage the versatility of our Transformer-based model to explore graph structure generation pretraining tasks (i.e., path and k-hop neighborhood generation), typically inapplicable to most graph embedding methods. We further propose a new path-finding algorithm guided by information gain and find that it is the best-performing pretraining task across three downstream knowledge graph completion datasets. While using our new path-finding algorithm as a pretraining signal provides 2-3% MRR improvements, we show that pretraining on all signals together gives the best knowledge graph completion results. In a multitask setting that combines all pretraining tasks, our method surpasses the latest and strong performing knowledge graph embedding methods on all metrics for FB15K-237, on MRR and Hit@1 for WN18RR, and on MRR and Hit@10 for JF17K (a knowledge hypergraph dataset).

2022-10-14

ArXiv (preprint)

Conditionally Adaptive Multi-Task Learning: Improving Transfer Learning in NLP Using Fewer Parameters & Less Data

Amine El hattami

Multi-Task Learning (MTL) networks have emerged as a promising method for transferring learned knowledge across different tasks. However, MTL must deal with challenges such as overfitting to low-resource tasks, catastrophic forgetting, and negative task transfer, or learning interference. Often, in Natural Language Processing (NLP), a separate model per task is needed to obtain the best performance. However, many fine-tuning approaches are both parameter inefficient, i.e., potentially involving one new model per task, and highly susceptible to losing knowledge acquired during pretraining. We propose a novel Transformer-based Hypernetwork Adapter consisting of a new conditional attention mechanism as well as a set of task-conditioned modules that facilitate weight sharing. Through this construction, we achieve more efficient parameter sharing and mitigate forgetting by keeping half of the weights of a pretrained model fixed. We also use a new multi-task data sampling strategy to mitigate the negative effects of data imbalance across tasks. Using this approach, we are able to surpass single-task fine-tuning methods while being parameter and data efficient (using around 66% of the data). Compared to other BERT Large methods on GLUE, our 8-task model surpasses other Adapter methods by 2.8%, and our 24-task model outperforms models that use MTL and single-task fine-tuning by 0.7-1.0%. We show that a larger variant of our single multi-task model approach performs competitively across 26 NLP tasks and yields state-of-the-art results on a number of test and development sets.

2021-01-01

ICLR (published)