Yikang Shen

GraphText: Graph Reasoning in Text Space

Jianan Zhao

Le Zhuo

Meng Qu

Kai Liu

Michael M. Bronstein

Zhaocheng Zhu

Jian Tang

2024-10-09

NeurIPS.cc/2024/Workshop/AFM (poster)

Scattered Mixture-of-Experts Implementation

Shawn Tan

Rameswar Panda

ScatterMoE is an implementation of Sparse Mixture-of-Experts (SMoE) on GPUs. ScatterMoE builds upon techniques in existing implementations and overcomes some of their current limitations to improve batched inference, training speed, and memory footprint. It achieves this by avoiding padding and excessive copies of the input. We also fuse expert linear transforms and reordering operations with ParallelLinear, a module that can be used to extend the concept of SMoEs. We benchmark our implementation against Megablocks and show that it enables higher throughput and a lower memory footprint. We also show how ParallelLinear enables extension of the Mixture-of-Experts concept by demonstrating an implementation of Mixture-of-Attention.

2024-07-09

colmweb.org/COLM/2024/Conference (accepted)
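The ScatterMoE abstract above mentions avoiding padding and extra input copies. As a rough illustration of that general idea only, the sketch below groups tokens by their assigned expert using a sorted index, runs each expert over its contiguous group, and writes results back to the original positions. It assumes top-1 routing and plain Python callables as experts; the function name `grouped_expert_apply` is hypothetical, and this is not the fused GPU implementation the paper describes.

```python
from typing import Callable, List, Sequence


def grouped_expert_apply(
    tokens: Sequence[float],
    expert_ids: Sequence[int],
    experts: Sequence[Callable[[float], float]],
) -> List[float]:
    """Apply experts[expert_ids[i]] to tokens[i] by grouping tokens per expert.

    Assumption: top-1 routing (one expert per token). Groups have variable
    size, so no group is padded to a fixed capacity.
    """
    # Token positions ordered by assigned expert (sorted() is stable).
    order = sorted(range(len(tokens)), key=lambda i: expert_ids[i])
    outputs: List[float] = [0.0] * len(tokens)

    start = 0
    while start < len(order):
        expert = expert_ids[order[start]]
        end = start
        # Extend the group to cover every token routed to this expert.
        while end < len(order) and expert_ids[order[end]] == expert:
            end += 1
        # Process the contiguous group, scattering each result back
        # to the token's original position.
        for pos in order[start:end]:
            outputs[pos] = experts[expert](tokens[pos])
        start = end
    return outputs


# Two toy "experts" standing in for expert linear transforms.
toy_experts = [lambda x: x + 1.0, lambda x: x * 10.0]
result = grouped_expert_apply([1.0, 2.0, 3.0, 4.0], [1, 0, 1, 0], toy_experts)
print(result)
```

The output keeps the original token order, so callers never see the intermediate sorted layout.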

Sparse Universal Transformer

Shawn Tan

Zhenfang Chen

Chuang Gan

The Universal Transformer (UT) is a variant of the Transformer that shares parameters across its layers and is Turing-complete under certain assumptions. Empirical evidence also shows that UTs have better compositional generalization than Vanilla Transformers (VTs) in formal language tasks. The parameter-sharing also affords it better parameter efficiency than VTs. Despite its many advantages, most state-of-the-art NLP systems use VTs as their backbone model instead of UTs. This is mainly because scaling up UT parameters is more compute and memory intensive than scaling up a VT. This paper proposes the Sparse Universal Transformer (SUT), which leverages Sparse Mixture of Experts (SMoE) to reduce UT's computational complexity while retaining its parameter efficiency and generalization ability. Experiments show that SUT combines the best of both worlds, achieving strong generalization results on formal language tasks (Logical inference and CFQ) and impressive parameter and computation efficiency on standard natural language benchmarks like WMT'14.

2023-10-06

EMNLP/2023/Conference (accepted)

Unsupervised Dependency Graph Network

Shawn Tan

Alessandro Sordoni

Peng Li

Jie Zhou

2022-04-30

Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (published)

StructFormer: Joint Unsupervised Induction of Dependency and Constituency Structure from Masked Language Modeling

Yi Tay

Che Zheng

Dara Bahri

Donald Metzler

There are two major classes of natural language grammar -- the dependency grammar that models one-to-one correspondences between words and the constituency grammar that models the assembly of one or several corresponding words. While previous unsupervised parsing methods mostly focus on inducing only one class of grammars, we introduce a novel model, StructFormer, that can simultaneously induce dependency and constituency structure. To achieve this, we propose a new parsing framework that can jointly generate a constituency tree and dependency graph. Then we integrate the induced dependency relations into the transformer, in a differentiable manner, through a novel dependency-constrained self-attention mechanism. Experimental results show that our model can achieve strong results on unsupervised constituency parsing, unsupervised dependency parsing, and masked language modeling at the same time.

2021-07-31

Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) (published)

Explicitly Modeling Syntax in Language Models with Incremental Parsing and a Dynamic Oracle

Syntax is fundamental to our thinking about language. Failing to capture the structure of input language could lead to generalization problems and over-parametrization. In the present work, we propose a new syntax-aware language model: Syntactic Ordered Memory (SOM). The model explicitly models the structure with an incremental parser and maintains the conditional probability setting of a standard language model (left-to-right). To train the incremental parser and avoid exposure bias, we also propose a novel dynamic oracle, so that SOM is more robust to wrong parsing decisions. Experiments show that SOM can achieve strong results in language modeling, incremental parsing and syntactic generalization tests, while using fewer parameters than other models.

2021-05-31

Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (published)

arxiv.org

Cooperative Semi-Supervised Transfer Learning of Machine Reading Comprehension

Oliver Bender

Franz Josef Och

Yoshua Bengio

Réjean Ducharme

P Vincent

Kevin Clark

Quoc Minh-Thang Luong

V. Le

Jacob Devlin

Ming-Wei Chang

Kenton Lee

Adam Fisch

Alon Talmor

Robin Jia

Minjoon Seo

Michael R. Glass

A. Gliozzo

Rishav Chakravarti

Ian J Goodfellow

Jean Pouget-Abadie

Mehdi Mirza

Serhii Havrylov

Ivan Titov. 2017

Emergence

Jun-Tao He

Jiatao Gu

Jiajun Shen

Marc’Aurelio

Matthew Henderson

I. Casanueva

Nikola Mrkšić

Pei-hao Su

Tsung-Hsien Wen

Ivan Vulić

Yi Tay

Che Zheng

Dara Bahri

Donald

Metzler Aaron

Courville

Structformer

Ashish Vaswani

Noam M. Shazeer

Niki Parmar

Thomas Wolf

Lysandre Debut

Julien Victor Sanh

Clement Chaumond

Anthony Delangue

Pier-339 Moi

Tim ric Cistac

Rémi Rault

Morgan Louf

Qizhe Xie

Eduard H. Hovy

Silei Xu

Sina Jandaghi Semnani

Giovanni Campagna

Pretrained language models have significantly improved the performance of downstream language understanding tasks, including extractive question answering, by providing high-quality contextualized word embeddings. However, training question answering models still requires large amounts of annotated data for specific domains. In this work, we propose a cooperative, self-play learning framework, REGEX, for automatically generating more non-trivial question-answer pairs to improve model performance. REGEX is built upon a masked answer extraction task with an interactive learning environment containing an answer entity REcognizer, a question Generator, and an answer EXtractor. Given a passage with a masked entity, the generator generates a question around the entity, and the extractor is trained to extract the masked entity with the generated question and raw texts. The framework allows the training of question generation and answering models on any text corpora without annotation. We further leverage a reinforcement learning technique to reward generating high-quality questions and to improve the answer extraction model's performance. Experiment results show that REGEX outperforms the state-of-the-art (SOTA) pretrained language models and transfer learning approaches on standard question-answering benchmarks, and yields the new SOTA performance under given model size and transfer learning settings.

2020-12-31

(published)

www.semanticscholar.org

Learning Task Decomposition with Ordered Memory Policy Network

Yuchen Lu

Siyuan Zhou

Joshua B. Tenenbaum

Chuang Gan

Many complex real-world tasks are composed of several levels of sub-tasks. Humans leverage these hierarchical structures to accelerate the learning process and achieve better generalization. In this work, we study the inductive bias and propose Ordered Memory Policy Network (OMPN) to discover subtask hierarchy by learning from demonstration. The discovered subtask hierarchy could be used to perform task decomposition, recovering the subtask boundaries in an unstructured demonstration. Experiments on Craft and Dial demonstrate that our model can achieve higher task decomposition performance under both unsupervised and weakly supervised settings, compared with strong baselines. OMPN can also be directly applied to partially observable environments and still achieve higher task decomposition performance. Our visualization further confirms that the subtask hierarchy can emerge in our model.

2020-12-31

ICLR (published)

Recursive Top-Down Production for Sentence Generation with Latent Trees

Timothy J. O'Donnell

We model the recursive production property of context-free grammars for natural and synthetic languages. To this end, we present a dynamic programming algorithm that marginalises over latent binary tree structures with

2020-10-31

Findings of the Association for Computational Linguistics: EMNLP 2020 (published)

arxiv.org

Explicitly Modeling Syntax in Language Model improves Generalization

Syntax is fundamental to our thinking about language. Although neural networks are very successful in many tasks, they do not explicitly model syntactic structure. Failing to capture the structure of inputs could lead to generalization problems and over-parametrization. In the present work, we propose a new syntax-aware language model: Syntactic Ordered Memory (SOM). The model explicitly models the structure with a one-step look-ahead parser and maintains the conditional probability setting of the standard language model. Experiments show that SOM can achieve strong results in language modeling and syntactic generalization tests, while using fewer parameters than other models.

2020-10-20

arXiv.org (preprint)

dblp.uni-trier.de

Exploiting Syntactic Structure for Better Language Modeling: A Syntactic Distance Approach

Wenyu Du

Zhouhan Lin

Timothy J. O’Donnell

Yoshua Bengio

Yue Zhang

It is commonly believed that knowledge of syntactic structure should improve language modeling. However, effectively and computationally efficiently incorporating syntactic structure into neural language models has been a challenging topic. In this paper, we make use of a multi-task objective, i.e., the models simultaneously predict words as well as ground truth parse trees in a form called "syntactic distances", where information between these two separate objectives shares the same intermediate representation. Experimental results on the Penn Treebank and Chinese Treebank datasets show that when ground truth parse trees are provided as additional training signals, the model is able to achieve lower perplexity and induce trees with better quality.

2020-06-30

Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (published)
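The abstract above refers to parse trees encoded as "syntactic distances" without defining them. As a hedged illustration, the sketch below assumes one common convention: for a binary tree, the distance between two adjacent words is the height of their lowest common ancestor. The nested-tuple tree encoding and the function name `tree_to_distances` are assumptions made for this example, not details taken from the paper.

```python
from typing import List, Tuple, Union

# A tree is either a word (leaf) or a pair of subtrees (binary node).
Tree = Union[str, Tuple["Tree", "Tree"]]


def tree_to_distances(tree: Tree) -> Tuple[List[int], int]:
    """Return (distances, height) for a binary tree.

    distances[i] is the height of the lowest common ancestor of word i
    and word i + 1 (assumed convention); a leaf has height 0.
    """
    if isinstance(tree, str):
        return [], 0
    left, right = tree
    left_dist, left_height = tree_to_distances(left)
    right_dist, right_height = tree_to_distances(right)
    height = max(left_height, right_height) + 1
    # The split between the two subtrees sits at this node's height.
    return left_dist + [height] + right_dist, height


distances, height = tree_to_distances((("the", "cat"), "sleeps"))
print(distances)  # [1, 2]
```

Under this convention, a sentence of n words yields n - 1 distances, and larger values mark higher-level constituent boundaries.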