TRAIL : IA responsable pour les professionnels et les leaders
Apprenez à intégrer des pratique d'IA responsable dans votre organisation avec le programme TRAIL. Inscrivez-vous à la séance d'information le 12 mars prochain pour en apprendre plus sur le programme.
Avantage IA : productivité dans la fonction publique
Apprenez à tirer parti de l’IA générative pour soutenir et améliorer votre productivité au travail. La prochaine cohorte se déroulera en ligne les 28 et 30 avril 2026.
Nous utilisons des témoins pour analyser le trafic et l’utilisation de notre site web, afin de personnaliser votre expérience. Vous pouvez désactiver ces technologies à tout moment, mais cela peut restreindre certaines fonctionnalités du site. Consultez notre Politique de protection de la vie privée pour en savoir plus.
Paramètre des cookies
Vous pouvez activer et désactiver les types de cookies que vous souhaitez accepter. Cependant certains choix que vous ferez pourraient affecter les services proposés sur nos sites (ex : suggestions, annonces personnalisées, etc.).
Cookies essentiels
Ces cookies sont nécessaires au fonctionnement du site et ne peuvent être désactivés. (Toujours actif)
Cookies analyse
Acceptez-vous l'utilisation de cookies pour mesurer l'audience de nos sites ?
Lecteur Multimédia
Acceptez-vous l'utilisation de cookies pour afficher et vous permettre de regarder les contenus vidéo hébergés par nos partenaires (YouTube, etc.) ?
Recent advancements in large language models have demonstrated remarkable capabilities across various NLP tasks. But many questions remain, … (voir plus)including whether open-source models match closed ones, why these models excel or struggle with certain tasks, and what types of practical procedures can improve performance. We address these questions in the context of classification by evaluating three classes of models using eight datasets across three distinct tasks: named entity recognition, political party prediction, and misinformation detection. While larger LLMs often lead to improved performance, open-source models can rival their closed-source counterparts by fine-tuning. Moreover, supervised smaller models, like RoBERTa, can achieve similar or even greater performance in many datasets compared to generative LLMs. On the other hand, closed models maintain an advantage in hard tasks that demand the most generalizability. This study underscores the importance of model selection based on task requirements
A large number of studies on social media compare the behaviour of users from different political parties. As a basic step, they employ a pr… (voir plus)edictive model for inferring their political affiliation. The accuracy of this model can change the conclusions of a downstream analysis significantly, yet the choice between different models seems to be made arbitrarily. In this paper, we provide a comprehensive survey and an empirical comparison of the current party prediction practices and propose several new approaches which are competitive with or outperform state-of-the-art methods, yet require less computational resources. Party prediction models rely on the content generated by the users (e.g., tweet texts), the relations they have (e.g., who they follow), or their activities and interactions (e.g., which tweets they like). We examine all of these and compare their signal strength for the party prediction task. This paper lets the practitioner select from a wide range of data types that all give strong performance. Finally, we conduct extensive experiments on different aspects of these methods, such as data collection speed and transfer capabilities, which can provide further insights for both applied and methodological research.
How can we study social interactions on evolving topics at a mass scale? Over the past decade, researchers from diverse fields such as econo… (voir plus)mics, political science, and public health have often done this by querying Twitter's public API endpoints with hand-picked topical keywords to search or stream discussions. However, despite the API's accessibility, it remains difficult to select and update keywords to collect high-quality data relevant to topics of interest. In this paper, we propose an active learning method for rapidly refining query keywords to increase both the yielded topic relevance and dataset size. We leverage a large open-source COVID-19 Twitter dataset to illustrate the applicability of our method in tracking Tweets around the key sub-topics of Vaccine, Mask, and Lockdown. Our experiments show that our method achieves an average topic-related keyword recall 2x higher than baselines. We open-source our code along with a web interface for keyword selection to make data collection from Twitter more systematic for researchers.
2022-10-31
2022 IEEE International Conference on Data Mining Workshops (ICDMW) (publié)
Despite the prevalence of recent success in learning from static graphs, learning from time-evolving graphs remains an open challenge. In th… (voir plus)is work, we design new, more stringent evaluation procedures for link prediction specific to dynamic graphs, which reflect real-world considerations, to better compare the strengths and weaknesses of methods. First, we create two visualization techniques to understand the reoccurring patterns of edges over time and show that many edges reoccur at later time steps. Based on this observation, we propose a pure memorization-based baseline called EdgeBank. EdgeBank achieves surprisingly strong performance across multiple settings which highlights that the negative edges used in the current evaluation are easy. To sample more challenging negative edges, we introduce two novel negative sampling strategies that improve robustness and better match real-world applications. Lastly, we introduce six new dynamic graph datasets from a diverse set of domains missing from current benchmarks, providing new challenges and opportunities for future research. Our code repository is accessible at https://github.com/fpour/DGB.git.
2021-12-31
Advances in Neural Information Processing Systems 35 (NeurIPS 2022) (publié)
In today’s age of (mis)information, many people utilize various social media platforms in an attempt to shape public opinion on sever… (voir plus)al important issues, including elections and the COVID-19 pandemic. These two topics have recently become intertwined given the importance of complying with public health measures related to COVID-19 and politicians’ management of the pandemic. Motivated by this, we study the partisan polarization of COVID-19 discussions on social media. We propose and utilize a novel measure of partisan polarization to analyze more than 380 million posts from Twitter and Parler around the 2020 US presidential election. We find strong correlation between peaks in polarization and polarizing events, such as the January 6th Capitol Hill riot. We further classify each post into key COVID-19 issues of lockdown, masks, vaccines, as well as miscellaneous, to investigate both the volume and polarization on these topics and how they vary through time. Parler includes more negative discussions around lockdown and masks, as expected, but not much around vaccines. We also observe more balanced discussions on Twitter and a general disconnect between the discussions on Parler and Twitter.
2021-11-30
2021 International Conference on Data Mining Workshops (ICDMW) (inconnu)
As social media becomes increasingly prominent in our day to day lives, it is increasingly important to detect informative content and preve… (voir plus)nt the spread of disinformation and unverified rumours. While many sophisticated and successful models have been proposed in the literature, they are often compared with older NLP baselines such as SVMs, CNNs, and LSTMs. In this paper, we examine the performance of a broad set of modern transformer-based language models and show that with basic fine-tuning, these models are competitive with and can even significantly outperform recently proposed state-of-the-art methods. We present our framework as a baseline for creating and evaluating new methods for misinformation detection. We further study a comprehensive set of benchmark datasets, and discuss potential data leakage and the need for careful design of the experiments and understanding of datasets to account for confounding variables. As an extreme case example, we show that classifying only based on the first three digits of tweet ids, which contain information on the date, gives state-of-the-art performance on a commonly used benchmark dataset for fake news detection --Twitter16. We provide a simple tool to detect this problem and suggest steps to mitigate it in future datasets.
Given the global scale of COVID-19 and the flood of social media content related to it, how can we find informative discussions? We present … (voir plus)Gapformer, which effectively classifies content as informative or not. It reformulates the problem as graph classification, drawing on not only the tweet but connected webpages and entities. We leverage a pre-trained language model as well as the connections between nodes to learn a pooled representation for each document network. We show it outperforms several competitive baselines and present ablation studies supporting the benefit of the linked information. Code is available on Github.