Publications

Leveraging Structure Between Environments: Phylogenetic Regularization Incentivizes Disentangled Representations

Jason Hartford

Recently, learning invariant predictors across varying environments has been shown to improve the generalization of supervised learning methods. This line of investigation holds great potential for application to biological problem settings, where data is often naturally heterogeneous. Biological samples often originate from different distributions, or environments. However, in biological contexts, the standard "invariant prediction" setting may not completely fit: the optimal predictor may in fact vary across biological environments. There also exists strong domain knowledge about the relationships between environments, such as the evolutionary history of a set of species, or the differentiation process of cell types. Most work on generic invariant predictors has not assumed the existence of structured relationships between environments. However, this prior knowledge about environments themselves has already been shown to improve prediction through a particular form of regularization applied when learning a set of predictors. In this work, we empirically evaluate whether a regularization strategy that exploits environment-based prior information can be used to learn representations that better disentangle causal factors that generate observed data. We find evidence that these methods do in fact improve the disentanglement of latent embeddings. We also show a setting where these methods can leverage phylogenetic information to estimate the number of latent causal features.

2025-07-25

Transactions on Machine Learning Research (accepted)

doi.org

openreview.net

Uncovering Hidden Factions through Text-Network Representations: Unsupervised Public Opinion Mapping of Iran on Twitter in the 2022 Unrest

Sahar Omidi Shayegan

Jean-François Godbout

Reihaneh Rabbany

Ideological mapping on social media is typically framed as a supervised classification task that depends on stable party systems and abundant annotated data. These assumptions fail in contexts with weak political institutionalization, such as Iran. We recast ideology detection as a fully unsupervised mapping problem and introduce a text-network representation system, uncovering latent ideological factions on Persian Twitter during the 2022 Mahsa Amini protests. Using hundreds of millions of Persian tweets, we learn joint text–network embeddings by fine-tuning ParsBERT with a combined masked-language-modeling and contrastive objective and by passing the embeddings through a Graph Attention Network trained for link prediction on time-batched subgraphs. The pipeline integrates semantic and structural signals without observing labels. Density-based clustering reveals eight ideological blocs whose spatial relations mirror known political alliances. Alignment with 883 expert-labeled accounts yields 53% accuracy. This label-free framework scales to label-scarce contexts, offering new leverage for studying political debates online.

2025-07-25

colmweb.org/COLM/2025/Workshop/NLPOR (published)

openreview.net

Comparative genomics of Pseudomonas paraeruginosa

Maxime Déraspe

Lori L. Burrows

Romé Voulhoux

Daniela Centrón

J. Corbeil

Paul H Roy

ABSTRACT The PA7-clade (or group 3) of Pseudomonas aeruginosa is now recognized as a distinct species, Pseudomonas paraeruginosa. We report here the genomic sequences of six new strains of P. paraeruginosa: Zw26 (the first complete genome of a cystic fibrosis isolate of P. paraeruginosa), draft genomes of four burn and wound strains from Argentina very closely related to PA7, and of Pa5196, the strain in which arabinosylation of type IV pili was documented. We compared the genomes of 82 strains of P. paraeruginosa and confirmed that the species is divided into two sub-clades. Core genomes are very similar, while most differences are found in “regions of genomic plasticity” (RGPs). Several genomic deletions were identified, and most are common to the CR1 sub-clade that includes Zw26 and Pa5196. All strains lack the type 3 secretion system (T3SS) and instead use an alternative virulence strategy involving an exolysin, a characteristic shared with group 5 P. aeruginosa. All strains tend to be multiresistant like PA7, with a significant proportion of carbapenem-resistant strains, either oprD mutants or carrying carbapenemase genes. Although P. paraeruginosa is still relatively rare, it has a worldwide distribution. Its multiresistance and its alternative virulence strategy need to be considered in future therapeutic development. IMPORTANCE Pseudomonas aeruginosa is an important opportunistic pathogen causing respiratory infections, notably in cystic fibrosis, and burn and wound infections. Our study reports six new genomes of Pseudomonas paraeruginosa, a new species recently reported as distinct from P. aeruginosa. The number of sequenced genomes of P. paraeruginosa is only about 1% that of P. aeruginosa. We compare the genomic content of nearly all strains of P. paraeruginosa in GenBank, highlighting the differences in core and accessory genomes, antimicrobial resistance genes, and virulence factors. This novel species is very similar in environmental spectrum to P. aeruginosa but is notably resistant to last-line antibiotics and uses an alternative virulence strategy based on exolysin, this strategy being shared with some P. aeruginosa outliers.

2025-07-24

Journal of Bacteriology (published)

doi.org

TRUTH: Teaching LLMs to Rerank for Truth in Misinformation Detection

Hao Yu

Shenyang Huang

Zachary Yang

Maximilian Puelma Touzel

Kellin Pelrine

Jean-François Godbout

Reihaneh Rabbany

2025-07-24

colmweb.org/COLM/2025/Workshop/SoLaR (poster)

openreview.net

Can LLMs Reason Abstractly Over Math Word Problems Without CoT? Disentangling Abstract Formulation From Arithmetic Computation

Ziling Cheng

Meng Cao

Leila Pishdad

Yanshuai Cao

Jackie CK Cheung

Final-answer-based metrics are commonly used for evaluating large language models (LLMs) on math word problems, often taken as proxies for reasoning ability. However, such metrics conflate two distinct sub-skills: abstract formulation (capturing mathematical relationships using expressions) and arithmetic computation (executing the calculations). Through a disentangled evaluation on GSM8K and SVAMP, we find that the final-answer accuracy of Llama-3 and Qwen2.5 (1B-32B) without CoT is overwhelmingly bottlenecked by the arithmetic computation step and not by the abstract formulation step. Contrary to the common belief, we show that CoT primarily aids in computation, with limited impact on abstract formulation. Mechanistically, we show that these two skills are composed conjunctively even in a single forward pass without any reasoning steps via an abstract-then-compute mechanism: models first capture problem abstractions, then handle computation. Causal patching confirms these abstractions are present, transferable, composable, and precede computation. These behavioural and mechanistic findings highlight the need for disentangled evaluation to accurately assess LLM reasoning and to guide future improvements.

2025-07-23

colmweb.org/COLM/2025/Workshop/XLLM-Reason-Plan (published)

openreview.net

Enhancing Logical Reasoning in Large Language Models through Graph-based Synthetic Data

Jiaming Zhou

Abbas Ghaddar

Ge Zhang

Liheng Ma

Yaochen Hu

Soumyasundar Pal

Mark J. Coates

Jianye HAO

B. Wang

Yingxue Zhang

2025-07-23

colmweb.org/COLM/2025/Workshop/XLLM-Reason-Plan (published)

doi.org

openreview.net

Enhancing Changepoint Detection: Penalty Learning through Deep Learning Techniques

Tung L. Nguyen

Toby Dylan Hocking

2025-07-21

Statistics and Computing (published)

doi.org

arxiv.org

Pharmaco-nutraceutical improvement of the response to obeticholic acid with omega-3 polyunsaturated fatty acids

Audrey-Anne Lavoie

Ariane Thérien

Anisia Silva

Emanuel Paré

Anna Ciešlak

William Gagnon

Clémence Desjardins

Mélanie Verreault

Jocelyn Trottier

Marie-Claude Vohl

Jean-Philippe Drouin-Chartier

J. Corbeil

Alexandre Caron

Olivier Barbier

2025-07-21

Biochemical Journal (published)

doi.org

Tracing Optimization for Performance Modeling and Regression Detection

Kaveh Shahedi

Heng Li

Maxime Lamothe

Foutse Khomh

Software performance modeling plays a crucial role in developing and maintaining software systems. A performance model analytically describes the relationship between the performance of a system and its runtime activities. This process typically examines various aspects of a system's runtime behavior, such as the execution frequency of functions or methods, to forecast performance metrics like program execution time. By using performance models, developers can predict expected performance and thereby effectively identify and address unexpected performance regressions when actual performance deviates from the model's predictions. One common and precise method for capturing performance behavior is software tracing, which involves instrumenting the execution of a program, either at the kernel level (e.g., system calls) or application level (e.g., function calls). However, due to the nature of tracing, it can be highly resource-intensive, making it impractical for production environments where resources are limited. In this work, we propose statistical approaches to reduce tracing overhead by identifying and excluding performance-insensitive code regions, particularly application-level functions, from tracing while still building accurate performance models that can capture performance degradations. By selecting an optimal set of functions to be traced, we can construct optimized performance models that achieve an R² score of up to 99% and, sometimes, outperform full tracing models (models using non-optimized tracing data), while significantly reducing the tracing overhead by more than 80% in most cases. Our optimized performance models can also capture performance regressions in our studied programs effectively, demonstrating their usefulness in real-world scenarios. Our approach is fully automated, making it ready to be used in production environments with minimal human effort.

2025-07-21

ACM Transactions on Software Engineering and Methodology (published)

doi.org

arxiv.org

Corrigendum to "Child- and Proxy-reported Differences in Patient-reported Outcome and Experience Measures in Pediatric Surgery: Systematic Review and Meta-analysis" [Journal of Pediatric Surgery 60 (2025) 162172].

Zanib Nafees

Siena O'Neill

Alexandra Dimmer

Elena Guadagno

Julia Ferreira

Nancy Mayo

Dan Poenaru

2025-07-20

Journal of Pediatric Surgery (published)

doi.org

Corrigendum to "Virtual Reality for Pediatric Trauma Education - A Preliminary Face and Content Validation Study" [Journal of Pediatric Surgery 60 (2025) 161951].

F. Botelho

Said Ashkar

Shreenik Kundu

TJ Matthews

Elena Guadagno

Dan Poenaru

2025-07-20

Journal of Pediatric Surgery (published)

doi.org
