Publications

Global Surveillance of COVID-19 by mining news media using a multi-source dynamic embedded topic model
Pratheeksha Nair
Zhi Wen
Imane Chafi
Anya Okhmatovskaia
Guido Powell
Yannan Shen
On Posterior Collapse and Encoder Feature Dispersion in Sequence VAEs
Teng Long
Yanshuai Cao
Variational autoencoders (VAEs) hold great potential for modelling text, as they could in theory separate high-level semantic and syntactic properties from local regularities of natural language. In practice, however, VAEs with autoregressive decoders often suffer from posterior collapse, a phenomenon where the model learns to ignore the latent variables, causing the sequence VAE to degenerate into a language model. In this paper, we argue that posterior collapse is in part caused by the lack of dispersion in encoder features. We provide empirical evidence to verify this hypothesis, and propose a straightforward fix using pooling. This simple technique effectively prevents posterior collapse, allowing the model to achieve significantly better data log-likelihood than standard sequence VAEs. Compared to existing work, our proposed method achieves comparable or superior performance while being more computationally efficient.
Approximate Planning and Learning for Partially Observed Systems
Effectiveness of quarantine and testing to prevent COVID-19 transmission from arriving travelers
Russell Wa
Explainability and Interpretability: Keys to Deep Medicine
Arash Shaban-Nejad
Martin Michalowski
Bisimulation metrics and norms for real-weighted automata
Borja Balle
Pascale Gourdeau
ComplexDataLab at W-NUT 2020 Task 2: Detecting Informative COVID-19 Tweets by Attending over Linked Documents
Kellin Pelrine
Jacob Danovitch
Albert Orozco Camacho
Given the global scale of COVID-19 and the flood of social media content related to it, how can we find informative discussions? We present Gapformer, which effectively classifies content as informative or not. It reformulates the problem as graph classification, drawing on not only the tweet but also connected webpages and entities. We leverage a pre-trained language model as well as the connections between nodes to learn a pooled representation for each document network. We show it outperforms several competitive baselines and present ablation studies supporting the benefit of the linked information. Code is available on GitHub.
Deconstructing Word Embedding Algorithms
Kian Kenyon-Dean
Edward Daniel Newell
Factual Error Correction for Abstractive Summarization Models
Meng Cao
Yue Dong
Jiapeng Wu
Neural abstractive summarization systems have achieved promising progress, thanks to the availability of large-scale datasets and models pre-trained with self-supervised methods. However, ensuring the factual consistency of the generated summaries for abstractive summarization systems is a challenge. We propose a post-editing corrector module to address this issue by identifying and correcting factual errors in generated summaries. The neural corrector model is pre-trained on artificial examples that are created by applying a series of heuristic transformations on reference summaries. These transformations are inspired by an error analysis of state-of-the-art summarization model outputs. Experimental results show that our model is able to correct factual errors in summaries generated by other neural summarization models and outperforms previous models on factual consistency evaluation on the CNN/DailyMail dataset. We also find that transferring from artificial error correction to downstream settings is still very challenging.
MeDAL: Medical Abbreviation Disambiguation Dataset for Natural Language Understanding Pretraining
Zhi Wen
Xing Han Lu
Multi-Fact Correction in Abstractive Text Summarization
Yue Dong
Shuohang Wang
Zhe Gan
Yu Cheng
Jingjing Liu
Multi-XScience: A Large-scale Dataset for Extreme Multi-document Summarization of Scientific Articles
Yao Lu
Yue Dong
Multi-document summarization is a challenging task for which there exist few large-scale datasets. We propose Multi-XScience, a large-scale multi-document summarization dataset created from scientific articles. Multi-XScience introduces a challenging multi-document summarization task: writing the related-work section of a paper based on its abstract and the articles it references. Our work is inspired by extreme summarization, a dataset construction protocol that favours abstractive modeling approaches. Descriptive statistics and empirical results, using several state-of-the-art models trained on the Multi-XScience dataset, reveal that Multi-XScience is well suited for abstractive models.