Publications

ComplexDataLab at W-NUT 2020 Task 2: Detecting Informative COVID-19 Tweets by Attending over Linked Documents
Kellin Pelrine
Jacob Danovitch
Albert Orozco Camacho
Given the global scale of COVID-19 and the flood of social media content related to it, how can we find informative discussions? We present Gapformer, which effectively classifies content as informative or not. It reformulates the problem as graph classification, drawing on not only the tweet but also connected webpages and entities. We leverage a pre-trained language model as well as the connections between nodes to learn a pooled representation for each document network. We show it outperforms several competitive baselines and present ablation studies supporting the benefit of the linked information. Code is available on GitHub.
Deconstructing Word Embedding Algorithms
Kian Kenyon-Dean
Edward Daniel Newell
Factual Error Correction for Abstractive Summarization Models
Meng Cao
Yue Dong
Jiapeng Wu
Neural abstractive summarization systems have achieved promising progress, thanks to the availability of large-scale datasets and models pre-trained with self-supervised methods. However, ensuring the factual consistency of the generated summaries for abstractive summarization systems is a challenge. We propose a post-editing corrector module to address this issue by identifying and correcting factual errors in generated summaries. The neural corrector model is pre-trained on artificial examples that are created by applying a series of heuristic transformations on reference summaries. These transformations are inspired by an error analysis of state-of-the-art summarization model outputs. Experimental results show that our model is able to correct factual errors in summaries generated by other neural summarization models and outperforms previous models on factual consistency evaluation on the CNN/DailyMail dataset. We also find that transferring from artificial error correction to downstream settings is still very challenging.
MeDAL: Medical Abbreviation Disambiguation Dataset for Natural Language Understanding Pretraining
Zhi Wen
Xing Han Lu
Multi-Fact Correction in Abstractive Text Summarization
Yue Dong
Shuohang Wang
Zhe Gan
Yu Cheng
Jingjing Liu
Multi-XScience: A Large-scale Dataset for Extreme Multi-document Summarization of Scientific Articles
Yao Lu
Yue Dong
Multi-document summarization is a challenging task for which there exist few large-scale datasets. We propose Multi-XScience, a large-scale multi-document summarization dataset created from scientific articles. Multi-XScience introduces a challenging multi-document summarization task: writing the related-work section of a paper based on its abstract and the articles it references. Our work is inspired by extreme summarization, a dataset construction protocol that favours abstractive modeling approaches. Descriptive statistics and empirical results, obtained with several state-of-the-art models trained on the Multi-XScience dataset, reveal that Multi-XScience is well suited for abstractive models.
TeMP: Temporal Message Passing for Temporal Knowledge Graph Completion
Jiapeng Wu
Meng Cao
William Hamilton
Inferring missing facts in temporal knowledge graphs (TKGs) is a fundamental and challenging task. Previous works have approached this problem by augmenting methods for static knowledge graphs to leverage time-dependent representations. However, these methods do not explicitly leverage multi-hop structural information and temporal facts from recent time steps to enhance their predictions. Additionally, prior work does not explicitly address the temporal sparsity and variability of entity distributions in TKGs. We propose the Temporal Message Passing (TeMP) framework to address these challenges by combining graph neural networks, temporal dynamics models, data imputation and frequency-based gating techniques. Experiments on standard TKG tasks show that our approach provides substantial gains compared to the previous state of the art, achieving a 10.7% average relative improvement in Hits@10 across three standard benchmarks. Our analysis also reveals important sources of variability both within and across TKG datasets, and we introduce several simple but strong baselines that outperform the prior state of the art in certain settings.
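For readers unfamiliar with the metric quoted above, the following is a minimal sketch of how Hits@k and a relative improvement are commonly computed. The function names and the example ranks are hypothetical illustrations, not taken from the TeMP code or its results.

```python
def hits_at_k(ranks, k=10):
    """Fraction of test queries whose correct entity is ranked within the top k.

    `ranks` holds the 1-based rank of the correct answer for each query
    (hypothetical input format, assumed for illustration).
    """
    if not ranks:
        raise ValueError("ranks must not be empty")
    return sum(1 for rank in ranks if rank <= k) / len(ranks)


def relative_improvement(new_score, baseline_score):
    """Relative gain of new_score over baseline_score, as a fraction."""
    return (new_score - baseline_score) / baseline_score


# Made-up ranks for four queries; two of them fall within the top 10.
example_ranks = [1, 3, 12, 50]
example_hits = hits_at_k(example_ranks, k=10)
```

A relative improvement compares the gain to the baseline score, so it is not the same as a difference in percentage points.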
TESA: A Task in Entity Semantic Aggregation for Abstractive Summarization
Clément Jumel
Annie Priyadarshini Louis
Human-written texts contain frequent generalizations and semantic aggregation of content. In a document, they may refer to a pair of named entities such as ‘London’ and ‘Paris’ with different expressions: “the major cities”, “the capital cities” and “two European cities”. Yet generation systems, especially abstractive summarization systems, have so far focused heavily on paraphrasing and simplifying the source content, to the exclusion of such semantic abstraction capabilities. In this paper, we present a new dataset and task aimed at the semantic aggregation of entities. TESA contains a dataset of 5.3K crowd-sourced entity aggregations of Person, Organization, and Location named entities. The aggregations are document-appropriate, meaning that they are produced by annotators to match the situational context of a given news article from the New York Times. We then build baseline models for generating aggregations given a tuple of entities and document context. We finetune an encoder-decoder language model on TESA and compare it with simpler classification methods based on linguistically informed features. Our quantitative and qualitative evaluations show reasonable performance in making a choice from a given list of expressions, but free-form expressions are understandably harder to generate and evaluate.
Association between extreme precipitation, drinking water and acute gastrointestinal illness in the Great Lakes
R. Graydon
M. Mezzacapo
J. Boehme
S. Foldy
T. Edge
J. Brubacher
L. Chan
M. Dellinger
E. Faustman
J. Rose
T. Takaro
DoMoBOT: a bot for automated and interactive domain modelling
Rijul Saini
Gunter Mussbacher
Jörg Kienzle
Domain modelling transforms domain problem descriptions written in natural language (NL) into analyzable and concise domain models (class diagrams) during requirements analysis or the early stages of design in software development. Since the practice of domain modelling requires time in addition to modelling skills and experience, several approaches have been proposed to automate or semi-automate the construction of domain models from problem descriptions expressed in NL. Despite the existing work on domain model extraction, some significant challenges remain unaddressed: (i) the extracted domain models are not accurate enough to be used directly or with minor modifications in software development, (ii) existing approaches do not facilitate the tracing of the rationale behind the modelling decisions taken by the model extractor, and (iii) existing approaches do not provide interactive interfaces to update the extracted domain models. Therefore, in this paper, we introduce a domain modelling bot called DoMoBOT, explain its architecture, and implement it in the form of a web-based prototype tool. The bot automatically extracts a domain model from a problem description written in NL with an accuracy higher than existing approaches. Furthermore, the bot enables modellers to update a part of the extracted domain model and in response the bot re-configures the other parts of the domain model pro-actively. To improve the accuracy of extracted domain models, we combine the techniques of Natural Language Processing and Machine Learning. Finally, we evaluate the accuracy of the extracted domain models.
Importation of SARS-CoV-2 following the "semaine de relache" and Quebec's (Canada) COVID-19 burden - a mathematical modeling study
Arnaud Godin
Yiqing Xia
Sharmistha Mishra
Dirk Douwes-Schultz
Yannan Shen
Maxime Lavigne
Mélanie Drolet
Alexandra M. Schmidt
Marc Brisson
Mathieu Maheu-Giroux
Background: The Canadian epidemics of COVID-19 exhibit distinct early trajectories, with Quebec bearing a very high initial burden. The semaine de relache, or March break, took place two weeks earlier in Quebec as compared to the rest of Canada. This event may have played a role in the spread of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). We aimed to examine the role of case importation in the early transmission dynamics of SARS-CoV-2 in Quebec. Methods: Using detailed surveillance data, we developed and calibrated a deterministic SEIR-type compartmental model of SARS-CoV-2 transmission. We explored the impact of altering the number of imported cases on hospitalizations. Specifically, we investigated scenarios without case importation after March break, as well as scenarios where cases were imported with the same frequency/timing as neighboring Ontario. Results: A total of 1,544 and 1,150 returning travelers were laboratory-confirmed in Quebec and Ontario, respectively (with symptom onset before 2020-03-25). The cumulative number of hospitalizations could have been reduced by 55% (95% credible interval [95%CrI]: 51-59%) had no cases been imported after Quebec's March break. However, had Quebec experienced Ontario's number of imported cases, cumulative hospitalizations would have only been reduced by 12% (95%CrI: 8-16%). Interpretation: Our results suggest that case importation played an important role in the early spread of COVID-19 in Quebec. Yet, heavy importation of SARS-CoV-2 in early March could be insufficient to resolve interprovincial heterogeneities in cumulative hospitalizations. The importance of other factors (public health preparedness, responses, and capacity) should be investigated.
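To illustrate the general idea of a deterministic SEIR-type model with case importation, here is a minimal sketch using simple Euler integration. It is not the authors' calibrated model: the function name, all parameter values, and the choice to add imported cases directly to the infectious compartment are assumptions made for illustration, and it tracks infections rather than hospitalizations.

```python
def simulate_seir(beta, sigma, gamma, population, days, imported_per_day, dt=0.1):
    """Euler-integrate a basic SEIR model with imported infectious cases.

    beta:  transmission rate per day (assumed value, not calibrated)
    sigma: rate of leaving the exposed compartment per day (assumed)
    gamma: recovery rate per day (assumed)
    imported_per_day: function mapping time in days to imported cases per day
    Imported cases are added to I, so S+E+I+R grows by the total imported.
    """
    s, e, i, r = float(population), 0.0, 0.0, 0.0
    cumulative_infectious = 0.0
    steps = int(round(days / dt))
    for step in range(steps):
        t = step * dt
        new_exposed = beta * s * i / population * dt   # S -> E
        new_infectious = sigma * e * dt                # E -> I
        new_recovered = gamma * i * dt                 # I -> R
        imports = imported_per_day(t) * dt             # arrivals into I
        s -= new_exposed
        e += new_exposed - new_infectious
        i += new_infectious - new_recovered + imports
        r += new_recovered
        cumulative_infectious += new_infectious + imports
    return {"S": s, "E": e, "I": i, "R": r, "cumulative": cumulative_infectious}


# Hypothetical scenarios: importation during the first 14 days versus none.
with_imports = simulate_seir(0.5, 0.2, 0.1, 1_000_000, 60,
                             lambda t: 10.0 if t < 14 else 0.0)
no_imports = simulate_seir(0.5, 0.2, 0.1, 1_000_000, 60, lambda t: 0.0)
```

Comparing the `cumulative` values of two runs that differ only in `imported_per_day` mirrors, in a very simplified way, the counterfactual scenarios described in the abstract.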
The role of case importation in explaining differences in early SARS-CoV-2 transmission dynamics in Canada—A mathematical modeling study of surveillance data
Arnaud Godin
Yiqing Xia
Sharmistha Mishra
Dirk Douwes-Schultz
Yannan Shen
Maxime Lavigne
Mélanie Drolet
Alexandra M. Schmidt
Marc Brisson
Mathieu Maheu-Giroux