Jean-François Godbout

Associate Academic Member
Full Professor, Université de Montréal, Department of Political Science
Alumni
Research Topics
Disinformation
Generative Models
AI Safety

Biography

Jean-François Godbout is a professor in the Department of Political Science at the Université de Montréal and an Associate Academic Member of Mila - Quebec Artificial Intelligence Institute. His research focuses primarily on computational social science, AI safety, and the impact of generative AI on society. He is currently director of the undergraduate program in data analysis for the social sciences at the Université de Montréal and a researcher at IVADO.

Current Students

Postdoctorate - UdeM
PhD - UdeM
Master's Research - UdeM
Co-supervisor:
Master's Research - UdeM
Co-supervisor:

Publications

CrediBench: Building Web-Scale Network Datasets for Information Integrity
Online misinformation poses an escalating threat, amplified by the Internet's open nature and increasingly capable LLMs that generate persuasive yet deceptive content. Existing misinformation detection methods typically focus on either textual content or network structure in isolation, failing to leverage the rich, dynamic interplay between website content and hyperlink relationships that characterizes real-world misinformation ecosystems. We introduce CrediBench: a large-scale data processing pipeline for constructing temporal web graphs that jointly model textual content and hyperlink structure for misinformation detection. Unlike prior work, our approach captures the dynamic evolution of general misinformation domains, including changes in both content and inter-site references over time. Our processed one-month snapshot, extracted from the Common Crawl archive of December 2024, contains 45 million nodes and 1 billion edges, representing the largest web graph dataset made publicly available for misinformation research to date. Our experiments on this graph snapshot demonstrate the strength of both structural and webpage content signals for learning credibility scores, which measure source reliability. The pipeline and experimentation code are all available here, and the dataset is in this folder.
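As a rough illustration of the kind of structure the pipeline extracts, the sketch below builds a small domain-level hyperlink graph and propagates credibility scores from a few rated domains to their unrated neighbours. The input records and seed scores are hypothetical placeholders, not the released pipeline or dataset, and only the structural signal is shown (CrediBench also uses webpage text).

```python
# Minimal sketch of the CrediBench idea (not the released pipeline):
# build a domain-level hyperlink graph from crawl records, then
# propagate credibility labels from a small seed set of rated domains.
from collections import defaultdict
from urllib.parse import urlparse

records = [  # stand-in for parsed Common Crawl entries (hypothetical)
    {"url": "https://site-a.example/post", "links": ["https://site-b.example/x"]},
    {"url": "https://site-b.example/x", "links": ["https://site-c.example/y"]},
]
seed_scores = {"site-a.example": 0.9, "site-c.example": 0.2}  # rated domains

def domain(url: str) -> str:
    return urlparse(url).netloc

# Edge list between domains (nodes), aggregated over a time window.
edges = defaultdict(int)
for rec in records:
    src = domain(rec["url"])
    for link in rec["links"]:
        dst = domain(link)
        if src != dst:
            edges[(src, dst)] += 1

# One step of label propagation: an unrated domain inherits the mean
# score of its rated neighbours.
neighbours = defaultdict(list)
for (src, dst), _ in edges.items():
    neighbours[src].append(dst)
    neighbours[dst].append(src)

scores = dict(seed_scores)
for node, nbrs in neighbours.items():
    rated = [seed_scores[n] for n in nbrs if n in seed_scores]
    if node not in scores and rated:
        scores[node] = sum(rated) / len(rated)

print(scores)  # site-b.example gets the mean of its rated neighbours
```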
BluePrint: A Social Media User Dataset for LLM Persona Evaluation and Training
Large language models (LLMs) offer promising capabilities for simulating social media dynamics at scale, enabling studies that would be ethically or logistically challenging with human subjects. However, the field lacks standardized data resources for fine-tuning and evaluating LLMs as realistic social media agents. We address this gap by introducing SIMPACT, the SIMulation-oriented Persona and Action Capture Toolkit, a privacy-respecting framework for constructing behaviorally grounded social media datasets suitable for training agent models. We formulate next-action prediction as a task for training and evaluating LLM-based agents and introduce metrics at both the cluster and population levels to assess behavioral fidelity and stylistic realism. As a concrete implementation, we release BluePrint, a large-scale dataset built from public Bluesky data focused on political discourse. BluePrint clusters anonymized users into personas of aggregated behaviours, capturing authentic engagement patterns while safeguarding privacy through pseudonymization and removal of personally identifiable information. The dataset covers an action set of 12 social media interaction types (likes, replies, reposts, etc.), with each instance tied to the posting activity preceding it. This supports the development of agents whose behaviour is context-dependent not only in language but also in how they interact on social media. By standardizing data and evaluation protocols, SIMPACT provides a foundation for advancing rigorous, ethically responsible social media simulations. BluePrint serves as both an evaluation benchmark for political discourse modeling and a template for building domain-specific datasets to study challenges such as misinformation and polarization.
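To make the next-action prediction formulation concrete, here is a hedged sketch of what one training example and its prompt serialization might look like. The field names and the three-action subset are illustrative assumptions, not the actual BluePrint schema.

```python
# Hedged sketch of next-action prediction as a supervised task.
# Field names and the action subset are assumptions for illustration;
# consult the released BluePrint dataset for the real schema.
import json

ACTIONS = ["like", "reply", "repost"]  # the paper reports 12 types; 3 shown

example = {
    "persona": "cluster_17",            # anonymized persona (user cluster)
    "context": [                        # posting activity preceding the action
        {"author": "cluster_04", "text": "New poll numbers out today."},
        {"author": "cluster_17", "text": "Source for these figures?"},
    ],
    "target_action": {"type": "reply",  # the next action the agent must predict
                      "text": "The methodology section answers this."},
}

def to_prompt(ex: dict) -> str:
    """Serialize one example into a prompt for fine-tuning or evaluation."""
    history = "\n".join(f'{m["author"]}: {m["text"]}' for m in ex["context"])
    return (f"You are persona {ex['persona']}.\n"
            f"Conversation so far:\n{history}\n"
            f"Choose the next action from {ACTIONS} and write its text.")

print(to_prompt(example))
print("label:", json.dumps(example["target_action"]))
```

Framing the label as an action type plus optional text is one natural way to score both behavioral fidelity (did the agent pick the right action?) and stylistic realism (does the generated text match the persona?).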
OpenFake: An Open Dataset and Platform Toward Real-World Deepfake Detection
Deepfakes, synthetic media created using advanced AI techniques, pose a growing threat to information integrity, particularly in politically sensitive contexts. This challenge is amplified by the increasing realism of modern generative models, which our human perception study confirms are often indistinguishable from real images. Yet, existing deepfake detection benchmarks rely on outdated generators or narrowly scoped datasets (e.g., single-face imagery), limiting their utility for real-world detection. To address these gaps, we present OpenFake, a large politically grounded dataset specifically crafted for benchmarking against modern generative models with high realism, and designed to remain extensible through an innovative crowdsourced adversarial platform that continually integrates new hard examples. OpenFake comprises nearly four million total images: three million real images paired with descriptive captions and almost one million synthetic counterparts from state-of-the-art proprietary and open-source models. Detectors trained on OpenFake achieve near-perfect in-distribution performance, strong generalization to unseen generators, and high accuracy on a curated in-the-wild social media test set, significantly outperforming models trained on existing datasets. Overall, we demonstrate that with high-quality and continually updated benchmarks, automatic deepfake detection is both feasible and effective in real-world settings.
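For context, a detector of the kind benchmarked here is essentially a binary real-vs-synthetic image classifier. The sketch below fine-tunes a standard pretrained backbone under that framing; the dataset path and folder layout are assumptions, and this is not the paper's training recipe.

```python
# Minimal sketch of a real-vs-synthetic detector of the kind OpenFake
# is meant to benchmark. The "openfake/train" path with "real/" and
# "synthetic/" subfolders is an assumed layout, not the released format.
import torch
import torch.nn as nn
from torchvision import datasets, models, transforms

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

# ImageFolder expects one subdirectory per class.
train_set = datasets.ImageFolder("openfake/train", transform=transform)
loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

# A standard pretrained backbone with a two-class head.
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, 2)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

model.train()
for images, labels in loader:  # one epoch shown for brevity
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
```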
SandboxSocial: A Sandbox for Social Media Using Multimodal AI Agents
Gayatri Krishnakumar
Busra Tugce Gurbuz
Austin Welch
Hao Yu
Ethan Kosak-Hine
Tom Gibbs
Dan Zhao
The online information ecosystem enables influence campaigns of unprecedented scale and impact. We urgently need empirically grounded approaches to counter the growing threat of malicious campaigns, now amplified by generative AI. Yet developing defenses in real-world settings is impractical. Social system simulations with agents modelled using Large Language Models (LLMs) are a promising alternative approach and a growing area of research. However, existing simulators lack features needed to capture the complex information-sharing dynamics of platform-based social networks. To bridge this gap, we present SandboxSocial, a new simulator with several key innovations: (1) a virtual social media platform (modelled on Mastodon and mirrored in an actual Mastodon server) that provides a realistic setting in which agents interact; (2) an adapter that uses real-world user data to create more grounded agents and social media content; and (3) multimodal capabilities that enable our agents to interact using both text and images, just as humans do on social media. We make the simulator more useful to researchers by providing measurement and analysis tools that track simulation dynamics and compute evaluation metrics to compare experimental results.
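The core loop such a sandbox implies can be sketched as follows: agents observe a shared timeline and take turns posting. The Platform and Agent classes below are illustrative stand-ins rather than the SandboxSocial API, and the LLM call is stubbed out so the example runs on its own.

```python
# Hedged sketch of an agent-based social media simulation loop.
# These classes are toy stand-ins, not the SandboxSocial API; the real
# system mirrors activity to an actual Mastodon server.
import random

class Platform:
    """Toy in-memory stand-in for the virtual social media platform."""
    def __init__(self):
        self.timeline: list[tuple[str, str]] = []  # (author, text)

    def post(self, author: str, text: str) -> None:
        self.timeline.append((author, text))

    def feed(self, n: int = 5) -> list[tuple[str, str]]:
        return self.timeline[-n:]

class Agent:
    def __init__(self, name: str, persona: str):
        self.name, self.persona = name, persona

    def act(self, feed: list[tuple[str, str]]) -> str:
        # A real agent would prompt an LLM with its persona and the feed;
        # here we fabricate a reply so the loop runs standalone.
        if feed and random.random() < 0.5:
            return f"Replying to {feed[-1][0]}: interesting point."
        return f"({self.persona}) sharing a new post."

platform = Platform()
agents = [Agent("a1", "journalist"), Agent("a2", "skeptic")]
for step in range(4):  # a few simulation rounds
    for agent in agents:
        platform.post(agent.name, agent.act(platform.feed()))
for author, text in platform.timeline:
    print(author, "->", text)
```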
Veracity: An Open-Source AI Fact-Checking System
William Garneau
Manon Gruaz
Li Wei Wang
Sukanya Krishna
Luda Cohen
The proliferation of misinformation poses a significant threat to society, exacerbated by the capabilities of generative AI. This demo paper introduces Veracity, an open-source AI system designed to empower individuals to combat misinformation through transparent and accessible fact-checking. Veracity leverages the synergy between Large Language Models (LLMs) and web retrieval agents to analyze user-submitted claims and provide grounded veracity assessments with intuitive explanations. Key features include multilingual support, numerical scoring of claim veracity, and an interactive interface inspired by familiar messaging applications. This paper showcases Veracity's ability to not only detect misinformation but also explain its reasoning, fostering media literacy and promoting a more informed society.
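The claim-checking flow described above can be sketched as retrieve-then-assess. In the example below, search_web and ask_llm are hypothetical stubs standing in for a search API and an LLM client; they are not Veracity's actual interfaces.

```python
# Hedged sketch of an LLM + web-retrieval fact-checking flow: gather
# evidence for a claim, then ask an LLM for a scored, grounded verdict.
# `search_web` and `ask_llm` are hypothetical stubs, not Veracity's API.
from dataclasses import dataclass

@dataclass
class Verdict:
    score: float       # 0.0 (false) .. 1.0 (true)
    explanation: str

def search_web(claim: str, k: int = 3) -> list[str]:
    # Stub: a real retrieval agent would query a search engine here.
    return [f"[snippet {i} relevant to: {claim}]" for i in range(k)]

def ask_llm(prompt: str) -> str:
    # Stub: a real system would call an LLM; we return a canned answer.
    return "0.8|Multiple retrieved sources support the claim."

def check_claim(claim: str) -> Verdict:
    evidence = "\n".join(search_web(claim))
    prompt = (f"Claim: {claim}\nEvidence:\n{evidence}\n"
              "Reply as 'score|explanation' with score in [0,1].")
    score, explanation = ask_llm(prompt).split("|", 1)
    return Verdict(float(score), explanation)

print(check_claim("The 2024 Common Crawl archive is publicly available."))
```

Keeping the verdict numeric rather than binary is one way to surface uncertainty to users, which matches the numerical scoring feature the abstract describes.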