Jackie Cheung

Ines Arous

Independent visiting researcher - McGill

PhD - McGill

Alumni collaborator - McGill

PhD - McGill

PhD - McGill

Principal supervisor:

Research Master's - McGill

Maxime Darrin

PhD - McGill

Co-supervisor:

PhD - McGill

Aylin Erman

PhD - McGill

Co-supervisor:

Dan Poenaru

Ori Ernst

Alumni collaborator - McGill

Research Master's - McGill

Jie Gao

Research collaborator - McGill University

Co-supervisor:

Nikki Lobczowski

Langlois Henri

Research Master's - Paris-Saclay University

Principal supervisor:

Pablo Piantanida

Fanny JOURDAN

Postdoctorate - École de technologie supérieure

Principal supervisor:

Pablo Piantanida

Jin Won Lee

Research collaborator - McGill

Zichao Li

PhD - McGill

Principal supervisor:

Siva Reddy

Caleb Moses

PhD - McGill

Sihan Qin

Undergraduate - McGill

Shalaleh Rismani

Postdoctorate - McGill

Co-supervisor:

PhD - McGill

Undergraduate - McGill

Cesare Spinoso-Di Piano

PhD - McGill

Michael Yu

Research collaborator - McGill University

Co-supervisor:

Nikki Lobczowski

Xiyuan Zou

Research Master's - McGill

Publications

The Topic Confusion Task: A Novel Scenario for Authorship Attribution

Malik H. Altakrori

Benjamin Fung

Authorship attribution is the problem of identifying the most plausible author of an anonymous text from a set of candidate authors. Researchers have investigated same-topic and cross-topic scenarios of authorship attribution, which differ according to whether unseen topics are used in the testing phase. However, neither scenario allows us to explain whether errors are caused by failure to capture authorship style, by the topic shift, or by other factors. Motivated by this, we propose the topic confusion task, where we switch the author-topic configuration between the training and testing sets. This setup allows us to probe errors in the attribution process. We investigate the accuracy and two error measures: one caused by the models’ confusion by the switch because the features capture the topics, and one caused by the features’ inability to capture the writing styles, leading to weaker models. By evaluating different features, we show that stylometric features with part-of-speech tags are less susceptible to topic variations and can increase the accuracy of the attribution process. We further show that combining them with word-level n-grams can outperform the state-of-the-art technique in the cross-topic scenario. Finally, we show that pretrained language models such as BERT and RoBERTa perform poorly on this task, and are outperformed by simple n-gram features.

2021-01-01

arXiv.org (preprint)

dblp.uni-trier.de

An Analysis of Dataset Overlap on Winograd-Style Tasks

Ali Emami

Adam Trischler

Kaheer Suleman

The Winograd Schema Challenge (WSC) and variants inspired by it have become important benchmarks for common-sense reasoning (CSR). Model performance on the WSC has quickly progressed from chance-level to near-human using neural language models trained on massive corpora. In this paper, we analyze the effects of varying degrees of overlap that occur between these corpora and the test instances in WSC-style tasks. We find that a large number of test instances overlap considerably with the pretraining corpora on which state-of-the-art models are trained, and that a significant drop in classification accuracy occurs when models are evaluated on instances with minimal overlap. Based on these results, we provide the WSC-Web dataset, consisting of over 60k pronoun disambiguation problems scraped from web data. It is the largest corpus to date and has a significantly lower proportion of overlaps with current pretraining corpora.

2020-12-01

Proceedings of the 28th International Conference on Computational Linguistics (published)

Learning Efficient Task-Specific Meta-Embeddings with Word Prisms

Jingyi He

Kc Tsiolis

Kian Kenyon-Dean

Word embeddings are trained to predict word cooccurrence statistics, which leads them to possess different lexical properties (syntactic, semantic, etc.) depending on the notion of context defined at training time. These properties manifest when querying the embedding space for the most similar vectors, and when used at the input layer of deep neural networks trained to solve downstream NLP problems. Meta-embeddings combine multiple sets of differently trained word embeddings, and have been shown to successfully improve intrinsic and extrinsic performance over equivalent models which use just one set of source embeddings. We introduce word prisms: a simple and efficient meta-embedding method that learns to combine source embeddings according to the task at hand. Word prisms learn orthogonal transformations to linearly combine the input source embeddings, which allows them to be very efficient at inference time. We evaluate word prisms in comparison to other meta-embedding methods on six extrinsic evaluations and observe that word prisms offer improvements in performance on all tasks.

2020-12-01

Proceedings of the 28th International Conference on Computational Linguistics (published)

Learning Lexical Subspaces in a Distributional Vector Space

Kushal Arora

Aishik Chakraborty

In this paper, we propose LexSub, a novel approach towards unifying lexical and distributional semantics. We inject knowledge about lexical-semantic relations into distributional word embeddings by defining subspaces of the distributional vector space in which a lexical relation should hold. Our framework can handle symmetric attract and repel relations (e.g., synonymy and antonymy, respectively), as well as asymmetric relations (e.g., hypernymy and meronymy). In a suite of intrinsic benchmarks, we show that our model outperforms previous approaches on relatedness tasks and on hypernymy classification and detection, while being competitive on word similarity tasks. It also outperforms previous systems on extrinsic classification tasks that benefit from exploiting lexical relational cues. We perform a series of analyses to understand the behaviors of our model. Code available at https://github.com/aishikchakraborty/LexSub.

2020-12-01

Transactions of the Association for Computational Linguistics (published)

On Posterior Collapse and Encoder Feature Dispersion in Sequence VAEs.

Teng Long

Yanshuai Cao

Variational autoencoders (VAEs) hold great potential for modelling text, as they could in theory separate high-level semantic and syntactic properties from local regularities of natural language. Practically, however, VAEs with autoregressive decoders often suffer from posterior collapse, a phenomenon where the model learns to ignore the latent variables, causing the sequence VAE to degenerate into a language model. In this paper, we argue that posterior collapse is in part caused by the lack of dispersion in encoder features. We provide empirical evidence to verify this hypothesis, and propose a straightforward fix using pooling. This simple technique effectively prevents posterior collapse, allowing the model to achieve significantly better data log-likelihood than standard sequence VAEs. Compared to existing work, our proposed method achieves comparable or superior performance while being more computationally efficient.

Deconstructing Word Embedding Algorithms

Kian Kenyon-Dean

Edward Daniel Newell

2020-11-01

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (published)

Factual Error Correction for Abstractive Summarization Models

Meng Cao

Yue Dong

Jiapeng Wu

Neural abstractive summarization systems have achieved promising progress, thanks to the availability of large-scale datasets and models pre-trained with self-supervised methods. However, ensuring the factual consistency of the generated summaries for abstractive summarization systems is a challenge. We propose a post-editing corrector module to address this issue by identifying and correcting factual errors in generated summaries. The neural corrector model is pre-trained on artificial examples that are created by applying a series of heuristic transformations on reference summaries. These transformations are inspired by an error analysis of state-of-the-art summarization model outputs. Experimental results show that our model is able to correct factual errors in summaries generated by other neural summarization models and outperforms previous models on factual consistency evaluation on the CNN/DailyMail dataset. We also find that transferring from artificial error correction to downstream settings is still very challenging.

2020-11-01

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (published)

Multi-Fact Correction in Abstractive Text Summarization

Yue Dong

Shuohang Wang

Zhe Gan

Yu Cheng

Jingjing Liu

2020-11-01

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (published)

TeMP: Temporal Message Passing for Temporal Knowledge Graph Completion

Jiapeng Wu

Meng Cao

William L. Hamilton

Inferring missing facts in temporal knowledge graphs (TKGs) is a fundamental and challenging task. Previous works have approached this problem by augmenting methods for static knowledge graphs to leverage time-dependent representations. However, these methods do not explicitly leverage multi-hop structural information and temporal facts from recent time steps to enhance their predictions. Additionally, prior work does not explicitly address the temporal sparsity and variability of entity distributions in TKGs. We propose the Temporal Message Passing (TeMP) framework to address these challenges by combining graph neural networks, temporal dynamics models, data imputation and frequency-based gating techniques. Experiments on standard TKG tasks show that our approach provides substantial gains compared to the previous state of the art, achieving a 10.7% average relative improvement in Hits@10 across three standard benchmarks. Our analysis also reveals important sources of variability both within and across TKG datasets, and we introduce several simple but strong baselines that outperform the prior state of the art in certain settings.

2020-11-01

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP) (published)
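
The Hits@10 figure quoted in the TeMP abstract above is a standard ranking metric: the fraction of test queries for which the correct entity appears among the top 10 candidates. The following is a minimal sketch of that metric and of a relative improvement calculation, not the authors' evaluation code. The function names and the example ranks are hypothetical and chosen only for illustration.

```python
def hits_at_k(ranks, k=10):
    """Fraction of queries whose correct entity is ranked in the top k.

    `ranks` holds the 1-based rank of the correct entity for each query.
    """
    if not ranks:
        raise ValueError("ranks must be non-empty")
    return sum(1 for r in ranks if r <= k) / len(ranks)


def relative_improvement(new_score, old_score):
    """Relative gain of new_score over old_score, e.g. 0.107 for a 10.7% gain."""
    if old_score == 0:
        raise ValueError("old_score must be non-zero")
    return (new_score - old_score) / old_score


# Hypothetical ranks of the correct entity for six test queries.
example_ranks = [1, 3, 12, 7, 50, 10]
example_hits = hits_at_k(example_ranks, k=10)  # four of six ranks are <= 10
```

An average relative improvement across several benchmarks would then be the mean of `relative_improvement` over the per-benchmark score pairs, under the assumption that each benchmark is weighted equally.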

TESA: A Task in Entity Semantic Aggregation for Abstractive Summarization

Clement Jumel

Annie Priyadarshini Louis

Human-written texts contain frequent generalizations and semantic aggregation of content. In a document, they may refer to a pair of named entities such as ‘London’ and ‘Paris’ with different expressions: “the major cities”, “the capital cities” and “two European cities”. Yet generation systems, especially abstractive summarization systems, have so far focused heavily on paraphrasing and simplifying the source content, to the exclusion of such semantic abstraction capabilities. In this paper, we present a new dataset and task aimed at the semantic aggregation of entities. TESA contains a dataset of 5.3K crowd-sourced entity aggregations of Person, Organization, and Location named entities. The aggregations are document-appropriate, meaning that they are produced by annotators to match the situational context of a given news article from the New York Times. We then build baseline models for generating aggregations given a tuple of entities and document context. We finetune on TESA an encoder-decoder language model and compare it with simpler classification methods based on linguistically informed features. Our quantitative and qualitative evaluations show reasonable performance in making a choice from a given list of expressions, but free-form expressions are understandably harder to generate and evaluate.

2020-11-01

Conference on Empirical Methods in Natural Language Processing (published)

HipoRank: Incorporating Hierarchical and Positional Information into Graph-based Unsupervised Long Document Extractive Summarization

Yue Dong

Andrei Mircea

We propose a novel graph-based ranking model for unsupervised extractive summarization of long documents. Graph-based ranking models typically represent documents as undirected fully-connected graphs, where a node is a sentence, an edge is weighted based on sentence-pair similarity, and sentence importance is measured via node centrality. Our method leverages positional and hierarchical information grounded in discourse structure to augment a document's graph representation with hierarchy and directionality. Experimental results on PubMed and arXiv datasets show that our approach outperforms strong unsupervised baselines by wide margins and performs comparably to some of the state-of-the-art supervised models that are trained on hundreds of thousands of examples. In addition, we find that our method provides comparable improvements with various distributional sentence representations, including BERT and RoBERTa models fine-tuned on sentence similarity.

2020-05-01

ArXiv (preprint)
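
The HipoRank abstract above describes the typical graph-based setup that the paper builds on: sentences as nodes of an undirected, fully connected graph, edges weighted by sentence-pair similarity, and importance measured by node centrality. The following is a minimal sketch of that generic baseline only. It does not implement the hierarchical or directional extensions that HipoRank proposes, and the bag-of-words cosine similarity, the degree centrality choice, and all function names are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def bag_of_words(sentence):
    """Lowercased token counts; a stand-in for a real sentence representation."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_sentences(sentences):
    """Return sentence indices sorted by degree centrality, highest first.

    Centrality of a node is the sum of its edge weights to every other node.
    """
    vectors = [bag_of_words(s) for s in sentences]
    n = len(vectors)
    centrality = [
        sum(cosine(vectors[i], vectors[j]) for j in range(n) if j != i)
        for i in range(n)
    ]
    return sorted(range(n), key=lambda i: centrality[i], reverse=True)


def summarize(sentences, k=2):
    """Pick the k most central sentences and restore document order."""
    chosen = sorted(rank_sentences(sentences)[:k])
    return [sentences[i] for i in chosen]
```

Swapping `bag_of_words` and `cosine` for embedding-based similarity would leave the ranking logic unchanged, which is one reason this family of methods works with different sentence representations.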

Investigating the Influence of Selected Linguistic Features on Authorship Attribution using German News Articles

Manuel Sage

Pietro Cruciata

Raed Abdo