Publications

XTREME-UP: A User-Centric Scarce-Data Benchmark for Under-Represented Languages
Sebastian Ruder
Jonathan H. Clark
Alexander Gutkin
Mihir Kale
Min Ma
Massimo Nicosia
Shruti Rijhwani
Parker Riley
Jean Michel Amath Sarr
Xinyi Wang
John Frederick Wieting
Nitish Gupta
Anna Katanova
Christo Kirov
Dana L Dickinson
Brian Roark
Bidisha Samanta
Connie Tao
Vera Axelrod
Isaac Rayburn Caswell
Colin Cherry
Dan Garrette
Reeve Ingle
Melvin Johnson
Dmitry Panteleev
Partha Talukdar
Data scarcity is a crucial issue for the development of highly multilingual NLP systems. Yet for many under-represented languages (ULs) -- languages for which NLP research is particularly far behind in meeting user needs -- it is feasible to annotate small amounts of data. Motivated by this, we propose XTREME-UP, a benchmark defined by: its focus on the scarce-data scenario rather than zero-shot; its focus on user-centric tasks -- tasks with broad adoption by speakers of high-resource languages; and its focus on under-represented languages where this scarce-data scenario tends to be most realistic. XTREME-UP evaluates the capabilities of language models across 88 under-represented languages over 9 key user-centric technologies including ASR, OCR, MT, and information access tasks that are of general utility. We create new datasets for OCR, autocomplete, semantic parsing, and transliteration, and build on and refine existing datasets for other tasks. XTREME-UP provides methodology for evaluating many modeling scenarios including text-only, multi-modal (vision, audio, and text), supervised parameter tuning, and in-context learning. We evaluate commonly used models on the benchmark. We release all code and scripts to train and evaluate models.
Learning domain-invariant classifiers for infant cry sounds
Charles Onu
Hemanth K. Sheetha
Arsenii Gorin
Active learning meets fractal decision boundaries: a cautionary tale from the Sitnikov three-body problem
Nicolas Payot
Mario Pasquato
Alessandro A. Trani
Chaotic systems such as the gravitational N-body problem are ubiquitous in astronomy. Machine learning (ML) is increasingly deployed to predict the evolution of such systems, e.g. with the goal of speeding up simulations. Strategies such as active learning (AL) are a natural choice to optimize ML training. Here we showcase an AL failure when predicting the stability of the Sitnikov three-body problem, the simplest case of the N-body problem displaying chaotic behavior. We link this failure to the fractal nature of our classification problem's decision boundary. This is a potential pitfall in optimizing large sets of N-body simulations via AL in the context of star cluster physics, galactic dynamics, or cosmology.
Bayesian Imaging for Radio Interferometry with Score-Based Priors
Noé Dia
M. J. Yantovski-Barth
Alexandre Adam
Micah Bowles
Pablo Lemos
A. Scaife
U. Montréal
Ciela Institute
Flatiron Institute
Echoes in the Noise: Posterior Samples of Faint Galaxy Surface Brightness Profiles with Score-Based Likelihoods and Priors
Alexandre Adam
Connor Stone
Connor Bottrell
Ronan Legin
Examining the detailed structure of galaxy populations provides valuable insights into their formation and evolution mechanisms. Significant barriers to such analysis are the non-trivial noise properties of real astronomical images and the point spread function (PSF), which blurs structure. Here we present a framework which combines recent advances in score-based likelihood characterization and diffusion model priors to perform a Bayesian analysis of image deconvolution. The method, when applied to minimally processed Hubble Space Telescope (HST) data, recovers structures which have otherwise only become visible in next-generation James Webb Space Telescope (JWST) imaging.
Extrapolatable Transformer Pre-training for Ultra Long Time-Series Forecasting
Ziyang Song
Qincheng Lu
Hao Xu
He Zhu
Learning an Effective Evolution Equation for Particle-Mesh Simulations Across Cosmologies
Nicolas Payot
Pablo Lemos
Carolina Cuesta-Lazaro
C. Modi
Silent bugs in deep learning frameworks: an empirical study of Keras and TensorFlow
Florian Tambon
Amin Nikanjam
Le An
Giuliano Antoniol
Unraveling the Mysteries of Galaxy Clusters: Recurrent Inference Deconvolution of X-ray Spectra
C. Rhea
Julie Hlavacek-Larrondo
Ralph P. Kraft
Ákos Bogdán
Alexandre Adam
H3K27me3 spreading organizes canonical PRC1 chromatin architecture to regulate developmental programs
Brian Krug
Bo Hu
Haifen Chen
Adam Ptack
Xiao Chen
Kristjan H. Gretarsson
Shriya Deshmukh
Nisha Kabir
Augusto Faria Andrade
Elias Jabbour
Ashot S. Harutyunyan
John J. Y. Lee
Maud Hulswit
Damien Faury
Caterina Russo
Xinjing Xu
Michael Johnston
Audrey Baguette
Nathan A. Dahl
Alexander G. Weil
Benjamin Ellezam
Rola Dali
Khadija Wilson
Benjamin A. Garcia
Rajesh Kumar Soni
Marco Gallo
Michael D. Taylor
Claudia Kleinman
Jacek Majewski
Nada Jabado
Chao Lu
Harnessing TCR/CAR Antagonism to Enhance Immunotherapeutic Precision
Taisuke Kondo
François X. P. Bourassa
Sooraj R. Achar
Justyn DuSold
Pablo Cespedes
Madison Wahlsten
Audun Kvalvaag
Guillaume Gaud
Paul E. Love
Michael Dustin
Grégoire Altan-Bonnet
Naomi Taylor
Identification of Acute Myeloid Leukemia Cell Surface Therapeutic Targets Using Single Cell RNA Sequencing Supported By Surface Proteomics
Véronique Lisi
Banafsheh Khakipoor
Azer Farah
Marie-Eve Bordeleau
Éric Audemard
Arnaud Metois
Louis Theret
Jean-Francois Spinella
Jalila Chagraoui
Ossama Moujaber
Laure Mallinger
Isabel Boivin
Nadine Mayotte
Azadeh Hajmirza
Eric Bonneil
Francois Béliveau
Albert Feghali
Geneviève Boucher
Patrick Gendron
Frederic Barabe
Guillaume Richard-Carpentier
Josée Hébert
Philippe Roux
Guy Sauvageau
Vincent-Philippe Lavallee
Background: Acute myeloid leukemia (AML) comprises diverse genomic subgroups and remains hard to treat in most patients. Despite breakthroughs in the therapeutic arsenal in recent years, clinical usage of therapeutic antibodies or chimeric antigen receptor T (CAR-T) cells has been lagging in contrast to other hematological malignancies. In fact, CD33 represents the only antibody-based strategy approved for this disease to date, highlighting the need to identify new promising targets. AML cells span a wide range of aberrant myeloid differentiation programs, complicating the identification, by bulk genomics, of targets expressed in the most immature leukemic cells. Aims and Methods: To identify the expression landscape of surface proteins in immature leukemic cells, we performed single-cell RNA sequencing (scRNA-seq, 10x 3' Reagent Kits) of primary human AML cells from 20 specimens of the Leucegene cohort enriched in intermediate and adverse genetic backgrounds (KMT2A-rearranged (n=5), chromosome 5 and/or 7 deletions (abn5/7, n=5), complex karyotype (n=4), NPM1/DNMT3A/FLT3-ITD triple-mutant (n=3), and others (n=3)). A Random Forest classifier was developed to classify AML cells, in an unbiased manner, into distinct differentiation stages using normal bone marrow-derived scRNA-seq data from the Human Cell Atlas (HCA) consortium. Genes were scored based on their probability of coding for proteins expressed at the cell surface using the SPAT algorithm developed by our group (https://doi.org/10.1101/2023.07.07.547075), retaining those with high scores. To validate surface expression, we concomitantly analyzed the surface proteome (hereafter named surfaceome) of 100 primary human AML samples from the Leucegene cohort, including all 20 samples profiled by scRNA-seq. Results: After quality control, we profiled and characterized 103 690 high-quality cells (mean of 5185 cells/sample).
We trained a Random Forest classifier to annotate cells in a two-step process, first identifying plasma cells based on a restricted list of genes abundantly expressed in these cells and subsequently assigning the remaining cells to one of 33 cell types. We performed a five-fold cross-validation of the model and subsequently determined the accuracy of our classifier to be 92% on the test subset of the HCA data. Applied to our AML cell collection, a total of 35 053 cells (34%) were classified in an unbiased manner as Hematopoietic Stem Cell (HSC)-like, corresponding to the most phenotypically immature leukemic cells in each patient sample (ranging from 4 to 74%). Accordingly, HSC-like AML cells preferentially express genes associated with normal HSCs, such as CD34, FAM30A, and SPINK2, and globally lack expression of genes defining mature lineages, further validating our classifier. The proportion of HSC-like cells varied among AML subgroups, and was lowest in KMT2A-r AML (median 19%) and highest in abn5/7 samples (46%). Integration of our AML atlas using the Harmony algorithm preserved differentiation hierarchies across samples, with most cell types, including HSC-like cells, occupying a defined area in the low-dimensional embedding. To identify new surface antigens specifically expressed in immature leukemic cells, we compared the expression profile of genes with high SPAT scores (≥8) in AML HSC-like cells with that of normal HSCs (HCA), and identified 60 genes significantly overexpressed in AML immature cells. Of those, 39 genes were also detected at the protein level by the surfaceome analysis, supporting their predicted expression at the cell surface in AML samples. Of these 39 genes, 23 (59%) were detected in over 80% of the specimens analyzed by the surfaceome, and thus are nearly universally expressed in our AML cohort. To identify targets of therapies that could be repurposed, we next evaluated the relevance of our findings by querying the Thera-SAbDab database.
Most interestingly, 8 of the 39 AML-specific HSC markers are targeted by therapeutic antibodies that are FDA-approved or in clinical trials for the treatment of AML (n=4: IL3RA, FLT3, CD37, and TNFRSF10B) or other indications (n=4). Conclusion: Our genetically diverse AML single-cell atlas, supported by mass spectrometry, enables the identification of both subset-specific and pan-AML surface protein genes. These represent potential targets for antibody-based strategy development or therapy repurposing in AML.
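The counts quoted in the abstract above can be cross-checked with a few lines of arithmetic. The following is only an illustrative sketch: the variable names are hypothetical, and the numbers are copied from the abstract rather than taken from the underlying data.

```python
# Figures copied from the abstract; all variable names are illustrative.
subgroup_counts = {
    "KMT2A-rearranged": 5,
    "abn5/7": 5,
    "complex karyotype": 4,
    "NPM1/DNMT3A/FLT3-ITD triple-mutant": 3,
    "others": 3,
}
n_samples = 20
total_cells = 103_690
hsc_like_cells = 35_053
surfaceome_confirmed_genes = 39
widely_detected_genes = 23

# The genetic subgroups should add up to the 20 profiled specimens.
subgroup_total = sum(subgroup_counts.values())

# Ratios that the abstract reports in rounded form.
mean_cells_per_sample = total_cells / n_samples
hsc_like_fraction = hsc_like_cells / total_cells
widely_detected_fraction = widely_detected_genes / surfaceome_confirmed_genes

print(f"specimens across subgroups: {subgroup_total}")
print(f"mean cells per sample: {mean_cells_per_sample:.1f}")
print(f"HSC-like fraction: {hsc_like_fraction:.1%}")
print(f"genes detected in >80% of specimens: {widely_detected_fraction:.1%}")
```

Under these inputs, the computed values agree with the rounded figures in the abstract (about 5185 cells per sample, 34% HSC-like cells, and 59% of the 39 genes).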