Publications

Uncovering Hidden Factions through Text-Network Representations: Unsupervised Public Opinion Mapping of Iran on Twitter in the 2022 Unrest

Sahar Omidi Shayegan

Jean-François Godbout

Reihaneh Rabbany

Ideological mapping on social media is typically framed as a supervised classification task that depends on stable party systems and abundant annotated data. These assumptions fail in contexts with weak political institutionalization, such as Iran. We recast ideology detection as a fully unsupervised mapping problem and introduce a text-network representation system, uncovering latent ideological factions on Persian Twitter during the 2022 Mahsa Amini protests. Using hundreds of millions of Persian tweets, we learn joint text–network embeddings by fine-tuning ParsBERT with a combined masked-language-modeling and contrastive objective and by passing the embeddings through a Graph Attention Network trained for link prediction on time-batched subgraphs. The pipeline integrates semantic and structural signals without observing labels. Density-based clustering reveals eight ideological blocs whose spatial relations mirror known political alliances. Alignment with 883 expert-labeled accounts yields 53% accuracy. This label-free framework scales to label-scarce contexts, offering new leverage for studying political debates online.
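The abstract does not say how the alignment between clusters and expert labels is scored. One common way to evaluate unsupervised clusters against a small labeled set is to map each cluster to its majority label and count the matches. The sketch below assumes that scheme; the function name `cluster_alignment_accuracy` and the toy data are hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict


def cluster_alignment_accuracy(cluster_ids, labels):
    """Map each cluster to its majority label, then score label agreement.

    This majority-vote mapping is an assumed evaluation scheme, not
    necessarily the one used by the authors.
    """
    if len(cluster_ids) != len(labels) or not labels:
        raise ValueError("inputs must be non-empty and of equal length")

    # Collect the expert labels observed inside each cluster.
    by_cluster = defaultdict(list)
    for cluster, label in zip(cluster_ids, labels):
        by_cluster[cluster].append(label)

    # Assign every cluster the label that occurs most often in it.
    majority = {
        cluster: Counter(members).most_common(1)[0][0]
        for cluster, members in by_cluster.items()
    }

    # An account counts as correct when its cluster's label matches its own.
    correct = sum(majority[c] == y for c, y in zip(cluster_ids, labels))
    return correct / len(labels)


# Toy data (made up): six labeled accounts spread over three clusters.
toy_clusters = [0, 0, 0, 1, 1, 2]
toy_labels = ["a", "a", "b", "b", "b", "c"]
toy_accuracy = cluster_alignment_accuracy(toy_clusters, toy_labels)
```

In the toy data, five of the six accounts agree with their cluster's majority label. Because the mapping is fitted on the same labeled accounts it is scored on, this kind of score is an upper-leaning estimate when clusters are small.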

2025-07-26

colmweb.org/COLM/2025/Workshop/NLPOR (published)

openreview.net

What Can Grokking Teach Us About Learning Under Nonstationarity?

Clare Lyle

Ghada Sokar

Razvan Pascanu

András György

In continual learning problems, it is often necessary to overwrite components of a neural network's learned representation in response to changes in the data stream; however, neural networks often exhibit primacy bias, whereby early training data hinders the network's ability to generalize on later tasks. While feature-learning dynamics of nonstationary learning problems are not well studied, the emergence of feature-learning dynamics is known to drive the phenomenon of grokking, wherein neural networks initially memorize their training data and only later exhibit perfect generalization. This work conjectures that the same feature-learning dynamics which facilitate generalization in grokking also underlie the ability to overwrite previously learned features, and that methods which accelerate grokking by facilitating feature-learning dynamics are promising candidates for addressing primacy bias in non-stationary learning problems. We then propose a straightforward method to induce feature-learning dynamics as needed throughout training by increasing the effective learning rate, i.e. the ratio between parameter and update norms. We show that this approach both facilitates feature-learning and improves generalization in a variety of settings, including grokking, warm-starting neural network training, and reinforcement learning tasks.
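To make the notion of effective learning rate concrete, here is a minimal sketch that computes it as the norm of a parameter update divided by the norm of the parameters. The direction of the ratio, the helper names, and the toy vectors are assumptions made for illustration; the abstract does not specify how the authors raise this quantity in practice.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def effective_learning_rate(params, update):
    """Update norm divided by parameter norm (assumed direction of the ratio)."""
    param_norm = l2_norm(params)
    if param_norm == 0.0:
        raise ValueError("parameter norm is zero; the ratio is undefined")
    return l2_norm(update) / param_norm


# Toy values (made up): parameters of norm 5, update of norm 0.05.
params = [3.0, 4.0]
update = [0.03, 0.04]
elr = effective_learning_rate(params, update)

# Halving the parameter norm while keeping the same update doubles the ratio,
# which is one hypothetical way the effective learning rate could be raised.
shrunk_params = [0.5 * p for p in params]
elr_shrunk = effective_learning_rate(shrunk_params, update)
```

The point of the sketch is that the same raw update moves a small-norm network proportionally further than a large-norm one, so the quantity can change during training even when the optimizer's step size is held fixed.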

2025-07-26

arXiv (preprint)

doi.org

arxiv.org

Comparative genomics of Pseudomonas paraeruginosa.

Maxime Déraspe

Lori L. Burrows

Romé Voulhoux

Daniela Centrón

Jacques Corbeil

Paul H Roy

The PA7-clade (or group 3) of Pseudomonas aeruginosa is now recognized as a distinct species, Pseudomonas paraeruginosa. We report here the genomic sequences of six new strains of P. paraeruginosa: Zw26 (the first complete genome of a cystic fibrosis isolate of P. paraeruginosa), draft genomes of four burn and wound strains from Argentina very closely related to PA7, and of Pa5196, the strain in which arabinosylation of type IV pili was documented. We compared the genomes of 82 strains of P. paraeruginosa and confirmed that the species is divided into two sub-clades. Core genomes are very similar, while most differences are found in "regions of genomic plasticity" (RGPs). Several genomic deletions were identified, and most are common to the CR1 sub-clade that includes Zw26 and Pa5196. All strains lack the type 3 secretion system (T3SS) and instead use an alternative virulence strategy involving an exolysin, a characteristic shared with group 5 P. aeruginosa. All strains tend to be multiresistant like PA7, with a significant proportion of carbapenem-resistant strains, either oprD mutants or carrying carbapenemase genes. Although P. paraeruginosa is still relatively rare, it has a worldwide distribution. Its multiresistance and its alternative virulence strategy need to be considered in future therapeutic development. Importance: Pseudomonas aeruginosa is an important opportunistic pathogen causing respiratory infections, notably in cystic fibrosis, and burn and wound infections. Our study reports six new genomes of Pseudomonas paraeruginosa, a new species recently reported as distinct from P. aeruginosa. The number of sequenced genomes of P. paraeruginosa is only about 1% that of P. aeruginosa. We compare the genomic content of nearly all strains of P. paraeruginosa in GenBank, highlighting the differences in core and accessory genomes, antimicrobial resistance genes, and virulence factors. This novel species is very similar in environmental spectrum to P. aeruginosa but is notably resistant to last-line antibiotics and uses an alternative virulence strategy based on exolysin, a strategy shared with some P. aeruginosa outliers.

2025-07-25

Journal of Bacteriology (published)

doi.org

A Generative Approach to LLM Harmfulness Mitigation with Red Flag Tokens

Most safety training methods for large language models (LLMs) are based on fine-tuning that forces models to shift from an unsafe answer to refusal when faced with harmful requests. Unfortunately, these drastic distribution shifts generally compromise model capabilities. To avoid that, we propose to expand the model's vocabulary with a special token we call *red flag token* (…

2025-07-25

colmweb.org/COLM/2025/Workshop/SoLaR (poster)

openreview.net

Neither Valid nor Reliable? Investigating the Use of LLMs as Judges

Khaoula Chehbouni

Mohammed Haddou

Jackie Cheung

Golnoosh Farnadi

Evaluating natural language generation (NLG) systems remains a core challenge of natural language processing (NLP), further complicated by the rise of large language models (LLMs) that aim to be general-purpose. Recently, LLMs as judges (LLJs) have emerged as a promising alternative to traditional metrics, but their validity remains underexplored. This position paper argues that the current enthusiasm around LLJs may be premature, as their adoption has outpaced rigorous scrutiny of their reliability and validity as evaluators. Drawing on measurement theory from the social sciences, we identify and critically assess four core assumptions underlying the use of LLJs: their ability to act as proxies for human judgment, their capabilities as evaluators, their scalability, and their cost-effectiveness. We examine how each of these assumptions may be challenged by the inherent limitations of LLMs, LLJs, or current practices in NLG evaluation.

2025-07-25

colmweb.org/COLM/2025/Workshop/SoLaR (poster)

openreview.net

ReCatcher: Towards LLMs Regression Testing for Code Generation

Altaf Allah Abbassi

Leuson Da Silva

Amin Nikanjam

Foutse Khomh

Large Language Models (LLMs) for code generation evolve rapidly through fine-tuning, merging, or new model releases. However, such updates can introduce regressions, not only in correctness but also in code quality and performance. To address this, we present ReCatcher, a regression testing framework for Python code generation. ReCatcher systematically compares two LLMs, typically a current model and a candidate update, across three dimensions: logical correctness, static code quality, and execution performance. We apply ReCatcher to assess regressions across three update scenarios (fine-tuning, merging, and model release) using CodeLlama, DeepSeek-Coder, and GPT-4o. Our evaluation shows that fine-tuning with cross-language datasets increases syntax errors by up to 12%. Merging with general-purpose models like Llama2 leads to regressions in correctness by up to 18%. GPT-4o introduces regressions of up to 50% in handling missing imports compared to GPT-3.5-turbo, while GPT-4o-mini suffers up to 80% performance degradation in execution time versus GPT-4o. Overall, logical correctness, performance, and error handling (e.g., syntax errors and missing imports) are the most regression-prone areas. Compared with baseline solutions, ReCatcher achieves better and consistent accuracy across logical and performance aspects. ReCatcher highlights the importance of systematic regression evaluation before adopting new models, while assisting researchers and practitioners in making more informed update decisions.
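The comparison of a current model against a candidate update can be pictured as a per-metric diff. The sketch below is only an illustration of that idea, not ReCatcher's actual interface: the function `find_regressions`, the metric names, and all numbers are hypothetical.

```python
def find_regressions(current, candidate, higher_is_better, tolerance=0.0):
    """Return the metrics on which the candidate is worse than the current model.

    `current` and `candidate` map metric names to scores. `higher_is_better`
    maps each metric name to True (e.g. pass rate) or False (e.g. error rate).
    The returned dict maps each regressed metric to candidate minus current.
    """
    regressions = {}
    for metric, baseline in current.items():
        delta = candidate[metric] - baseline
        # Convert the change into "how much worse", whatever the metric's direction.
        worsening = -delta if higher_is_better[metric] else delta
        if worsening > tolerance:
            regressions[metric] = delta
    return regressions


# Made-up scores covering the three dimensions named in the abstract:
# logical correctness, static code quality, and execution performance.
current_scores = {"pass_rate": 0.62, "syntax_error_rate": 0.05, "mean_exec_time_s": 1.2}
candidate_scores = {"pass_rate": 0.65, "syntax_error_rate": 0.09, "mean_exec_time_s": 1.1}
directions = {"pass_rate": True, "syntax_error_rate": False, "mean_exec_time_s": False}

regressions = find_regressions(current_scores, candidate_scores, directions)
```

With these toy scores, only the syntax error rate is flagged: the candidate improves on pass rate and execution time but produces more syntax errors. A non-zero `tolerance` would suppress small differences that may be measurement noise.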

2025-07-25

arXiv (preprint)

arxiv.org

TRUTH: Teaching LLMs to Rerank for Truth in Misinformation Detection

Hao Yu

Shenyang Huang

Zachary Yang

Maximilian Puelma Touzel

Kellin Pelrine

Jean-François Godbout

Reihaneh Rabbany

Misinformation detection presents a significant challenge due to its knowledge-intensive and reasoning-intensive nature. While Retrieval-Augmented Generation (RAG) systems offer a promising direction, the effectiveness of their retrieval and reranking components is crucial. This paper introduces TRUTH, a novel reranking approach designed for domain adaptation, specifically for misinformation detection, which employs a two-stage training methodology: Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO). We demonstrate that our 1B parameter TRUTH model achieves strong performance comparable to 7B models on established misinformation benchmarks such as FEVER and Canadian bilingual news datasets, improving retrieval quality and positively impacting downstream task accuracy. Our findings highlight the efficacy of combining SFT for broad knowledge acquisition and domain adaptation with DPO for nuanced reasoning alignment in developing efficient and effective rerankers for complex, knowledge-intensive tasks. Datasets and code will be available with the camera-ready version of the paper.

2025-07-25

colmweb.org/COLM/2025/Workshop/SoLaR (poster)

openreview.net
