Publications

ProGRes: Prompted Generative Rescoring on ASR n-Best
Ada Defne Tur
Adel Moumen
Mirco Ravanelli
Combining supervised learning and local search for the multicommodity capacitated fixed-charge network design problem
Charly Robinson La Rocca
Jean-François Cordeau
The multicommodity capacitated fixed-charge network design problem has been extensively studied in the literature due to its wide range of applications. Although many sophisticated solution methods exist today, finding high-quality solutions to large-scale instances remains challenging. In this paper, we explore how a data-driven approach can help improve upon the state of the art. By leveraging machine learning models, we attempt to reveal patterns hidden in the data that might be difficult to capture with traditional optimization methods. For scalability, we propose a prediction method where the machine learning model is called at the level of each arc of the graph. We take advantage of off-the-shelf models trained via supervised learning to predict near-optimal solutions. Our experimental results include an algorithm design analysis that compares various strategies for integrating predictions within local search algorithms. We benchmark the ML-based approach against the state-of-the-art heuristic for this problem. The findings indicate that our method can outperform the leading heuristic on sets of instances sampled from a uniform distribution.
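The arc-level prediction idea above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the stub model, the feature name `lp_utilization`, and the 0.5 threshold are all assumptions.

```python
def predict_arc_open(arc_features):
    """Hypothetical per-arc model: returns the probability that an arc
    carries flow in a near-optimal design. A real system would call a
    trained classifier here; this stub simply passes through one
    illustrative feature (assumed LP-relaxation utilization)."""
    return min(1.0, max(0.0, arc_features["lp_utilization"]))

def warm_start_from_predictions(arcs, threshold=0.5):
    """Open every arc whose predicted probability clears the threshold;
    a local search algorithm would then repair and improve this design."""
    return {a: predict_arc_open(f) >= threshold for a, f in arcs.items()}

# Invented toy instance: three arcs with one feature each.
arcs = {
    (0, 1): {"lp_utilization": 0.9},
    (1, 2): {"lp_utilization": 0.1},
    (0, 2): {"lp_utilization": 0.6},
}
design = warm_start_from_predictions(arcs)
```

Calling the model once per arc, rather than once per instance, is what keeps this scheme scalable to large graphs.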
Decomposing the Brain in Autism: Linking Behavioral Domains to Neuroanatomical Variation and Genomic Underpinnings.
Hanna Seelemeyer
Caroline Gurr
Johanna Leyhausen
Lisa M. Berg
Charlotte M. Pretzsch
Tim Schäfer
Bassem Hermila
Christine M. Freitag
Eva Loth
Beth Oakley
Luke Mason
Jan K. Buitelaar
Christian Beckmann
Dorothea L. Floris
Tony Charman
Tobias Banaschewski
Emily Jones
Thomas Bourgeron
Jumana Ahmad
Sara Ambrosino
Bonnie Auyeung
Simon Baron-Cohen
Sarah Baumeister
Sven Bölte
Carsten Bours
Michael Brammer
Daniel Brandeis
Claudia Brogna
Yvette de Bruijn
Bhismadev Chakrabarti
Ineke Cornelissen
Daisy Crawley
Flavio Dell’Acqua
Sarah Durston
Christine Ecker
Jessica Faulkner
Vincent Frouin
Pilar Garcés
David Goyard
Lindsay Ham
Hannah Hayward
Joerg F. Hipp
Rosemary Holt
Mark Johnson
Emily J. H. Jones
Prantik Kundu
Meng-Chuan Lai
Xavier Liogier D’ardhuy
Michael V. Lombardo
David J. Lythgoe
René Mandl
Andre Marquand
Maarten Mennes
Andreas Meyer-Lindenberg
Carolin Moessnang
Nico Bast
Larry O’Dwyer
Marianne Oldehinkel
Bob Oranje
Gahan Pandina
Antonio Persico
Barbara Ruggeri
Declan G.M. Murphy
Amber N. V. Ruigrok
Jessica Sabet
Roberto Sacco
Antonia San José Cáceres
Emily Simonoff
Will Spooren
Julian Tillmann
Roberto Toro
Heike Tost
Jack Waldman
Steve C. R. Williams
Caroline Wooldridge
Marcel P. Zwiers
Declan Murphy
Effects of gene dosage on cognitive ability: A function-based association study across brain and non-brain processes
Thomas Renne
Cécile Poulain
Alma Dubuc
Kuldeep Kumar
Sayeh Kazem
Worrawat Engchuan
Omar Shanta
Elise Douard
Catherine Proulx
Martineau Jean-Louis
Zohra Saci
Josephine Mollon
Laura M. Schultz
Emma E.M. Knowles
Simon R. Cox
David Porteous
Gail Davies
Paul Redmond
Sarah E. Harris
Gunter Schumann
Aurélie Labbe
Zdenka Pausova
Tomáš Paus
Stephen W. Scherer
Jonathan Sebat
Laura Almasy
David C. Glahn
Sébastien Jacquemont
Copy-number variants (CNVs) that increase the risk for neurodevelopmental disorders also affect cognitive ability. However, such CNVs remain challenging to study due to their scarcity, limiting our understanding of gene-dosage-sensitive biological processes linked to cognitive ability. We performed a genome-wide association study (GWAS) in 258,292 individuals, which identified, for the first time, a duplication at 2q12.3 associated with higher cognitive performance. We developed a functional-burden analysis, which tested the association between cognition and CNVs disrupting 6,502 gene sets biologically defined across tissues, cell types, and ontologies. Among those, 864 gene sets were associated with cognition, and effect sizes of deletion and duplication were negatively correlated. The latter suggested that functions across all biological processes were sensitive to either deletions (e.g., subcortical regions, postsynaptic) or duplications (e.g., cerebral cortex, presynaptic). Associations between non-brain tissues and cognition were driven partly by constrained genes, which may shed light on medical comorbidities in neurodevelopmental disorders.
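The core counting step of a functional-burden analysis, scoring deletions and duplications separately against a biologically defined gene set, can be illustrated as follows. The gene names and CNV calls are invented, and this sketch omits the association testing itself.

```python
def gene_set_burden(cnv_calls, gene_set):
    """Count how many genes in a defined set are hit by deletions
    (copy number < 2) versus duplications (copy number > 2). The two
    burdens are kept separate, since the study found deletion and
    duplication effect sizes to be negatively correlated."""
    deletions = sum(1 for gene, cn in cnv_calls if gene in gene_set and cn < 2)
    duplications = sum(1 for gene, cn in cnv_calls if gene in gene_set and cn > 2)
    return deletions, duplications

# Invented example: one individual's CNV calls as (gene, copy number),
# where 2 is the normal diploid copy number.
individual = [("GENE_A", 1), ("GENE_B", 3), ("GENE_C", 2), ("GENE_D", 0)]
example_set = {"GENE_A", "GENE_C", "GENE_D"}
burden = gene_set_burden(individual, example_set)
```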
Patient Engagement in the Implementation of Electronic Patient-Reported Outcome Tools: The Experience of Two Early-Adopter Institutions in the Pan-Canadian Radiotherapy Patient-Reported Outcome Initiative
Amanda Caissie
J. Lane
B. Barber
S. Chisholm
J. Kildea
Predicting the Mathematics Literacy of Resilient Students from High‐performing Economies: A Machine Learning Approach
Yimei Zhang
Towards AI-designed genomes using a variational autoencoder
Natasha K. Dudek
Genomes encode elaborate networks of genes whose products must seamlessly interact to support living organisms. Humans’ capacity to understand these biological systems is limited by their sheer size and complexity. In this article, we develop a proof-of-concept framework for training a machine learning (ML) algorithm to model bacterial genome composition. To achieve this, we create simplified representations of genomes in the form of binary vectors that indicate the encoded genes, henceforth referred to as genome vectors. A denoising variational autoencoder was trained to accept corrupted genome vectors, in which most genes had been masked, and reconstruct the original. The resulting model, DeepGenomeVector, effectively captures complex dependencies in genomic networks, as evaluated by both qualitative and quantitative metrics. An in-depth functional analysis of a generated genome vector shows that its encoded pathways are interconnected, near complete, and ecologically cohesive. On the test set, where the model’s ability to reconstruct uncorrupted genome vectors was evaluated, Area Under the Receiver Operating Characteristic curve (AUROC) and F1 scores of 0.98 and 0.83, respectively, support the model’s strong performance. This article showcases the power of ML approaches for synthetic biology and highlights the possibility that artificial intelligence agents may one day be able to design genomes that animate carbon-based cells.
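The corruption step of the denoising setup described above can be sketched in a few lines: mask most of the genes present in a binary genome vector, and train the model to reconstruct the full vector. This is an illustrative reconstruction of the masking idea only; the keep fraction here is arbitrary.

```python
import random

def corrupt_genome_vector(genome_vec, keep_fraction=0.1, seed=0):
    """Corrupt a binary genome vector by masking most of its 1-bits
    (encoded genes), keeping only a small random subset. A denoising
    autoencoder would be trained to reconstruct the original vector
    from this corrupted input."""
    rng = random.Random(seed)
    on_bits = [i for i, bit in enumerate(genome_vec) if bit]
    n_keep = max(1, int(len(on_bits) * keep_fraction))
    kept = set(rng.sample(on_bits, n_keep))
    return [1 if i in kept else 0 for i in range(len(genome_vec))]

# Toy genome vector: 50 genes present, 50 absent.
genome = [1] * 50 + [0] * 50
corrupted = corrupt_genome_vector(genome)
```

Note that corruption only removes genes, never adds them, so every surviving 1-bit in the corrupted vector corresponds to a gene the genome actually encodes.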
Uhura: A Benchmark for Evaluating Scientific Question Answering and Truthfulness in Low-Resource African Languages
Edward Bayes
Israel Abebe Azime
Jesujoba Oluwadara Alabi
Jonas Kgomo
Tyna Eloundou
Elizabeth Proehl
Kai Chen
Imaan Khadir
Naome Etori
Shamsuddeen Hassan Muhammad
C. Mpanza
Igneciah Pocia Thete
Dietrich Klakow
Evaluations of Large Language Models (LLMs) on knowledge-intensive tasks and factual accuracy often focus on high-resource languages, primarily because datasets for low-resource languages (LRLs) are scarce. In this paper, we present Uhura -- a new benchmark that focuses on two tasks in six typologically-diverse African languages, created via human translation of existing English benchmarks. The first dataset, Uhura-ARC-Easy, is composed of multiple-choice science questions. The second, Uhura-TruthfulQA, is a safety benchmark testing the truthfulness of models on topics including health, law, finance, and politics. We highlight the challenges of creating benchmarks with highly technical content for LRLs and outline mitigation strategies. Our evaluation reveals a significant performance gap between proprietary models (such as GPT-4o, o1-preview, and the Claude models) and open-source models like Meta's LLaMA and Google's Gemma. Additionally, all models perform better in English than in African languages. These results indicate that LLMs struggle with answering scientific questions and are more prone to generating false claims in low-resource African languages. Our findings underscore the necessity for continuous improvement of multilingual LLM capabilities in LRL settings to ensure safe and reliable use in real-world contexts. We open-source the Uhura Benchmark and Uhura Platform to foster further research and development in NLP for LRLs.
Visual Modality Prompt for Adapting Vision-Language Object Detectors
Heitor Rapela Medeiros
Atif Belal
Srikanth Muralidharan
Eric Granger
The zero-shot performance of object detectors degrades when tested on different modalities, such as infrared and depth. While recent work has explored image translation techniques to adapt detectors to new modalities, these methods are limited to a single modality and apply only to traditional detectors. Recently, vision-language detectors such as YOLO-World and Grounding DINO have shown promising zero-shot capabilities; however, they have not yet been adapted for other visual modalities. Traditional fine-tuning approaches tend to compromise the zero-shot capabilities of the detectors. The visual prompt strategies commonly used for classification with vision-language models apply the same linear prompt translation to each image, making them less effective. To address these limitations, we propose ModPrompt, a visual prompt strategy to adapt vision-language detectors to new modalities without degrading zero-shot performance. In particular, an encoder-decoder visual prompt strategy is proposed, further enhanced by the integration of inference-friendly task residuals, facilitating more robust adaptation. Empirically, we benchmark our method for modality adaptation on two vision-language detectors, YOLO-World and Grounding DINO, and on challenging infrared (LLVIP, FLIR) and depth (NYUv2) data, achieving performance comparable to full fine-tuning while preserving the model's zero-shot capability. Our code is available at: https://github.com/heitorrapela/ModPrompt
Machine learning-enhanced immunopeptidomics applied to T-cell epitope discovery for COVID-19 vaccines
Kevin A. Kovalchik
David J. Hamelin
Peter Kubiniok
Benoîte Bourdin
Raphaël Poujol
Bastien Paré
Shawn M. Simpson
John Sidney
Éric Bonneil
Mathieu Courcelles
Sunil Kumar Saini
Mohammad Shahbazy
Saketh Kapoor
Vigneshwar Rajesh
Maya Weitzen
Jean-Christophe Grenier
Bayrem Gharsallaoui
Loïze Maréchal
Zhaoguan Wu
Christopher Savoie
Alessandro Sette
Pierre Thibault
Isabelle Sirois
Martin A. Smith
Hélène Decaluwe
Julie G. Hussin
Mathieu Lavallée-Adam
Etienne Caron
Next-generation T-cell-directed vaccines for COVID-19 focus on establishing lasting T-cell immunity against current and emerging SARS-CoV-2 variants. Precise identification of conserved T-cell epitopes is critical for designing effective vaccines. Here we introduce a comprehensive computational framework incorporating a machine learning algorithm, MHCvalidator, to enhance the sensitivity of mass spectrometry-based immunopeptidomics. MHCvalidator identifies unique T-cell epitopes presented by the B7 supertype, including an epitope from a +1-frameshift in a truncated Spike antigen, supported by ribosome profiling. Analysis of 100,512 COVID-19 patient proteomes shows Spike antigen truncation in 0.85% of cases, revealing frameshifted viral antigens at the population level. Our EpiTrack pipeline tracks global mutations of MHCvalidator-identified CD8+ T-cell epitopes from the BNT162b4 vaccine. While most vaccine epitopes remain globally conserved, an immunodominant A*01-associated epitope mutates in Delta and Omicron variants. This work highlights SARS-CoV-2 antigenic features and emphasizes the importance of continuous adaptation in T-cell vaccine development.
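At its simplest, tracking whether an identified epitope stays conserved in circulating variants reduces to checking that the exact peptide still occurs in each variant's protein sequence. The fragments below are invented for illustration, not real Spike sequences, and real pipelines work from aligned mutation catalogs rather than raw substring search.

```python
def epitope_conserved(epitope, variant_protein):
    """A T-cell epitope is counted as conserved in a variant if its
    exact peptide sequence still occurs in the variant's protein;
    any substitution inside the epitope breaks the match."""
    return epitope in variant_protein

# Invented protein fragments; the variant carries one substitution (T->R).
reference = "MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNS"
variant = "MFVFLVLLPLVSSQCVNLRTRTQLPPAYTNS"

epitope = "CVNLTTRTQ"
```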
Words Matter: Leveraging Individual Text Embeddings for Code Generation in CLIP Test-Time Adaptation
Shambhavi Mishra
Julio Silva-Rodríguez
Ismail Ben Ayed
Jose Dolz
Vision-language foundation models, such as CLIP, have shown unprecedented zero-shot performance across a wide range of tasks. Nevertheless, these models may be unreliable under distributional shifts, as their performance degrades significantly. In this work, we explore how to efficiently leverage class text information to mitigate the distribution drifts encountered by large pre-trained vision-language models (VLMs) during test-time inference. In particular, we propose to generate pseudo-labels for the test-time samples by exploiting generic class text embeddings as fixed centroids of a label assignment problem, which is efficiently solved with Optimal Transport. Furthermore, the proposed adaptation method (CLIP-OT) integrates a multiple-template knowledge distillation approach, which replicates multi-view contrastive learning strategies from unsupervised representation learning without incurring additional computational complexity. Extensive experiments on multiple popular test-time adaptation benchmarks of diverse complexity empirically show the superiority of CLIP-OT, achieving performance gains of up to 7% over recent state-of-the-art methods, while remaining computationally and memory efficient.
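The label-assignment step can be illustrated with a tiny balanced Sinkhorn routine in pure Python: image-to-text similarities are turned into a transport plan with uniform class marginals, and pseudo-labels are read off the rows. This is only a sketch of the general Optimal Transport idea, not the CLIP-OT implementation; the similarity values, temperature, and iteration count are arbitrary.

```python
import math

def sinkhorn_assignment(sim, n_iter=50, eps=0.05):
    """Balanced Sinkhorn iterations: scale a similarity matrix into a
    soft assignment whose rows sum to 1/n and whose columns sum to 1/k
    (uniform class marginal), then take per-row argmax as pseudo-labels."""
    n, k = len(sim), len(sim[0])
    P = [[math.exp(s / eps) for s in row] for row in sim]
    for _ in range(n_iter):
        for i in range(n):  # row normalization
            r = sum(P[i])
            P[i] = [p / (r * n) for p in P[i]]
        for j in range(k):  # column normalization
            c = sum(P[i][j] for i in range(n))
            for i in range(n):
                P[i][j] /= c * k
    return [max(range(k), key=lambda j, i=i: P[i][j]) for i in range(n)]

# Toy cosine similarities of 4 test images against 2 class text embeddings.
sim = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3], [0.1, 0.9]]
labels = sinkhorn_assignment(sim)
```

Because the class marginal is fixed, the assignment cannot collapse onto a single class, which is what distinguishes this from plain per-sample argmax pseudo-labeling.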
Comparative Analysis of Diffusion Generative Models in Computational Pathology
Denisha Thakkar
Vincent Quoc-Huy Trinh
Sonal Varma
S Ebrahimi Kahou
Hassan Rivaz
Mahdi S. Hosseini
Diffusion Generative Models (DGMs) have rapidly surfaced as emerging topics in the field of computer vision, garnering significant interest across a wide array of deep learning applications. Despite their high computational demand, these models are extensively utilized for their superior sample quality and robust mode coverage. While research on diffusion generative models is advancing, exploration within the domain of computational pathology and its large-scale datasets has been comparatively gradual. Bridging the gap between the high-quality generation capabilities of DGMs and the intricate nature of pathology data, this paper presents an in-depth comparative analysis of diffusion methods applied to a pathology dataset. Our analysis extends to datasets with varying Fields of View (FOV), revealing that DGMs are highly effective in producing high-quality synthetic data. An ablative study is also conducted, followed by a detailed discussion of the impact of various methods on the synthesized histopathology images. One striking observation from our experiments is how adjusting the image size during data generation can simulate varying fields of view. These findings underscore the potential of DGMs to enhance the quality and diversity of synthetic pathology data, especially when used with real data, ultimately increasing the accuracy of deep learning models in histopathology. Code is available at https://github.com/AtlasAnalyticsLab/Diffusion4Path