Kellin Pelrine

SWEET: Weakly Supervised Person Name Extraction for Fighting Human Trafficking

Javin Liu

Hao Yu

2023-11-30

Findings of the Association for Computational Linguistics: EMNLP 2023 (published)

openreview.net

Towards Reliable Misinformation Mitigation: Generalization, Uncertainty, and GPT-4

Anne Imouza

Meilina Reksoprodjo

Camille Thibault

Caleb Gupta

Joel Christoph

Published online: 24 May 2023

2023-10-06

EMNLP/2023/Conference (accepted)

openreview.net

Open, Closed, or Small Language Models for Text Classification?

Hao Yu

Zachary Yang

Recent advancements in large language models have demonstrated remarkable capabilities across various NLP tasks. But many questions remain, … (see more)including whether open-source models match closed ones, why these models excel or struggle with certain tasks, and what types of practical procedures can improve performance. We address these questions in the context of classification by evaluating three classes of models using eight datasets across three distinct tasks: named entity recognition, political party prediction, and misinformation detection. While larger LLMs often lead to improved performance, open-source models can rival their closed-source counterparts by fine-tuning. Moreover, supervised smaller models, like RoBERTa, can achieve similar or even greater performance in many datasets compared to generative LLMs. On the other hand, closed models maintain an advantage in hard tasks that demand the most generalizability. This study underscores the importance of model selection based on task requirements

2023-08-18

ArXiv (preprint)

Party Prediction for Twitter

Sacha Lévy

Gabrielle Desrosiers-Brisebois

Aarash Feizi

Cécile Amadoro

Andre Blais

A large number of studies on social media compare the behaviour of users from different political parties. As a basic step, they employ a pr… (see more)edictive model for inferring their political affiliation. The accuracy of this model can change the conclusions of a downstream analysis significantly, yet the choice between different models seems to be made arbitrarily. In this paper, we provide a comprehensive survey and an empirical comparison of the current party prediction practices and propose several new approaches which are competitive with or outperform state-of-the-art methods, yet require less computational resources. Party prediction models rely on the content generated by the users (e.g., tweet texts), the relations they have (e.g., who they follow), or their activities and interactions (e.g., which tweets they like). We examine all of these and compare their signal strength for the party prediction task. This paper lets the practitioner select from a wide range of data types that all give strong performance. Finally, we conduct extensive experiments on different aspects of these methods, such as data collection speed and transfer capabilities, which can provide further insights for both applied and methodological research.

2022-12-31

arXiv (preprint)

Active Keyword Selection to Track Evolving Topics on Twitter

Sacha Lévy

Farimah Poursafaei

How can we study social interactions on evolving topics at a mass scale? Over the past decade, researchers from diverse fields such as econo… (see more)mics, political science, and public health have often done this by querying Twitter's public API endpoints with hand-picked topical keywords to search or stream discussions. However, despite the API's accessibility, it remains difficult to select and update keywords to collect high-quality data relevant to topics of interest. In this paper, we propose an active learning method for rapidly refining query keywords to increase both the yielded topic relevance and dataset size. We leverage a large open-source COVID-19 Twitter dataset to illustrate the applicability of our method in tracking Tweets around the key sub-topics of Vaccine, Mask, and Lockdown. Our experiments show that our method achieves an average topic-related keyword recall 2x higher than baselines. We open-source our code along with a web interface for keyword selection to make data collection from Twitter more systematic for researchers.

2022-10-31

2022 IEEE International Conference on Data Mining Workshops (ICDMW) (published)

Extracting Person Names from User Generated Text: Named-Entity Recognition for Combating Human Trafficking

2021-12-31

Findings (published)

Towards Better Evaluation for Dynamic Link Prediction

Andy Huang

Despite the prevalence of recent success in learning from static graphs, learning from time-evolving graphs remains an open challenge. In th… (see more)is work, we design new, more stringent evaluation procedures for link prediction specific to dynamic graphs, which reflect real-world considerations, to better compare the strengths and weaknesses of methods. First, we create two visualization techniques to understand the reoccurring patterns of edges over time and show that many edges reoccur at later time steps. Based on this observation, we propose a pure memorization-based baseline called EdgeBank. EdgeBank achieves surprisingly strong performance across multiple settings which highlights that the negative edges used in the current evaluation are easy. To sample more challenging negative edges, we introduce two novel negative sampling strategies that improve robustness and better match real-world applications. Lastly, we introduce six new dynamic graph datasets from a diverse set of domains missing from current benchmarks, providing new challenges and opportunities for future research. Our code repository is accessible at https://github.com/fpour/DGB.git.

2021-12-31

Advances in Neural Information Processing Systems 35 (NeurIPS 2022) (published)

openreview.net

Online Partisan Polarization of COVID-19

Sacha Lévy

Gabrielle Desrosiers-Brisebois

Andre Blais

In today’s age of (mis)information, many people utilize various social media platforms in an attempt to shape public opinion on sever… (see more)al important issues, including elections and the COVID-19 pandemic. These two topics have recently become intertwined given the importance of complying with public health measures related to COVID-19 and politicians’ management of the pandemic. Motivated by this, we study the partisan polarization of COVID-19 discussions on social media. We propose and utilize a novel measure of partisan polarization to analyze more than 380 million posts from Twitter and Parler around the 2020 US presidential election. We find strong correlation between peaks in polarization and polarizing events, such as the January 6^th Capitol Hill riot. We further classify each post into key COVID-19 issues of lockdown, masks, vaccines, as well as miscellaneous, to investigate both the volume and polarization on these topics and how they vary through time. Parler includes more negative discussions around lockdown and masks, as expected, but not much around vaccines. We also observe more balanced discussions on Twitter and a general disconnect between the discussions on Parler and Twitter.

2021-11-30

2021 International Conference on Data Mining Workshops (ICDMW) (unknown)

The Surprising Performance of Simple Baselines for Misinformation Detection

Jacob Danovitch

As social media becomes increasingly prominent in our day to day lives, it is increasingly important to detect informative content and preve… (see more)nt the spread of disinformation and unverified rumours. While many sophisticated and successful models have been proposed in the literature, they are often compared with older NLP baselines such as SVMs, CNNs, and LSTMs. In this paper, we examine the performance of a broad set of modern transformer-based language models and show that with basic fine-tuning, these models are competitive with and can even significantly outperform recently proposed state-of-the-art methods. We present our framework as a baseline for creating and evaluating new methods for misinformation detection. We further study a comprehensive set of benchmark datasets, and discuss potential data leakage and the need for careful design of the experiments and understanding of datasets to account for confounding variables. As an extreme case example, we show that classifying only based on the first three digits of tweet ids, which contain information on the date, gives state-of-the-art performance on a commonly used benchmark dataset for fake news detection --Twitter16. We provide a simple tool to detect this problem and suggest steps to mitigate it in future datasets.

2021-04-18

Proceedings of the Web Conference 2021 (published)

ComplexDataLab at WNUT-2020 Task 2: Detecting Informative COVID-19 Tweets by Attending over Linked Documents

Jacob Danovitch

Albert Orozco Camacho

Given the global scale of COVID-19 and the flood of social media content related to it, how can we find informative discussions? We present … (see more)Gapformer, which effectively classifies content as informative or not. It reformulates the problem as graph classification, drawing on not only the tweet but connected webpages and entities. We leverage a pre-trained language model as well as the connections between nodes to learn a pooled representation for each document network. We show it outperforms several competitive baselines and present ablation studies supporting the benefit of the linked information. Code is available on Github.

2020-10-31

WNUT (published)