Publications

Mass‐spectrometry analysis of the human pineal proteome during night and day and in autism
Hany Goubran‐Botros
Mariette Matondo
Cécile Pagan
Cyril Boulègue
Thibault Chaze
Julia Chamot‐Rooke
Erik Maronde
Thomas Bourgeron
The human pineal gland regulates day‐night dynamics of multiple physiological processes, especially through the secretion of melatonin. Using mass‐spectrometry‐based proteomics and dedicated analysis tools, we identify proteins in the human pineal gland, systematically analyze their variation throughout the day, and compare these changes in the pineal proteome between control specimens and donors diagnosed with autism. Results reveal diverse regulated clusters of proteins with, among others, catabolic carbohydrate process and cytoplasmic membrane‐bounded vesicle‐related proteins differing between day and night and/or control versus autism pineal glands. These data show novel and unexpected processes happening in the human pineal gland during the day/night rhythm as well as specific differences between autism donor pineal glands and those from controls.
Phylogenetic Manifold Regularization: A semi-supervised approach to predict transcription factor binding sites
Faizy Ahsan
François Laviolette
The computational prediction of transcription factor binding sites remains a challenging problem in bioinformatics, despite significant methodological developments from the field of machine learning. Such computational models are essential to help interpret the non-coding portion of human genomes, and to learn more about the regulatory mechanisms controlling gene expression. In parallel, massive genome sequencing efforts have produced assembled genomes for hundreds of vertebrate species, but this data is underused. We present PhyloReg, a new semi-supervised learning approach that can be used for a wide variety of sequence-to-function prediction problems, and that takes advantage of hundreds of millions of years of evolution to regularize predictors and improve accuracy. We demonstrate that PhyloReg can be used to better train a previously proposed deep learning model of transcription factor binding. Simulation studies further help delineate the benefits of the approach. Gains in prediction accuracy are obtained over a broad set of transcription factors and cell types.
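The idea of regularizing a predictor with evolutionary relatives can be sketched as follows. This is a minimal illustration, not the PhyloReg loss itself: the function name `phylo_penalty`, the edge-list format, the `exp(-branch_length)` weighting, and the toy predictor `gc_fraction` are all assumptions made for the example. The penalty grows when predictions differ between neighboring sequences in a species tree, and it needs only unlabeled orthologous sequences.

```python
import math

def phylo_penalty(predict, edges):
    """Weighted mean squared difference between predictions on tree neighbors.

    predict: callable mapping a sequence (str) to a score.
    edges: list of (seq_a, seq_b, branch_length) tuples (hypothetical format).
    """
    total, weight_sum = 0.0, 0.0
    for seq_a, seq_b, branch_length in edges:
        # Assumption: closer relatives should agree more, so the weight
        # decays with evolutionary distance.
        w = math.exp(-branch_length)
        total += w * (predict(seq_a) - predict(seq_b)) ** 2
        weight_sum += w
    return total / weight_sum if weight_sum else 0.0

def gc_fraction(seq):
    """Toy stand-in for a binding-site predictor: fraction of G/C bases."""
    return sum(base in "GC" for base in seq) / len(seq)

# Made-up orthologous sequences linked by tree edges.
edges = [("ACGT", "ACGA", 0.1), ("ACGA", "AAAA", 0.8)]
penalty = phylo_penalty(gc_fraction, edges)
```

In a semi-supervised setup of this kind, the penalty would be added to the usual supervised loss with a trade-off weight, so that labeled data fit the task while unlabeled relatives constrain the predictor.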
Assessing the Electronic Evidence System Needs of Canadian Public Health Professionals: A Cross-Sectional Study (Preprint)
Bandna Dhaliwal
Sarah E Neil-Sztramko
Nikita Boston-Fisher
Maureen Dobbins
BACKGROUND: True evidence-informed decision making in public health relies on incorporating evidence from a number of sources in addition to traditional scientific evidence. Lack of access to these types of data, as well as ease of use and interpretability of scientific evidence, contribute to limited uptake of evidence-informed decision making in practice. An electronic evidence system that includes multiple sources of evidence and potentially novel computational processing approaches or artificial intelligence holds promise as a solution to overcoming barriers to evidence-informed decision making in public health. OBJECTIVE: To understand the needs and preferences for an electronic evidence system among public health professionals in Canada. METHODS: An invitation to participate in an anonymous online survey was distributed via listservs of two Canadian public health organizations. Eligible participants were English- or French-speaking individuals currently working in public health. The survey contained both multiple choice and open-ended questions about needs and preferences relevant to an electronic evidence system. Quantitative responses were analyzed to explore differences by public health role. Inductive and deductive analysis methods were used to code and interpret the qualitative data. Ethics review was not required by the host institution. RESULTS: Respondents (n = 371) were heterogeneous, spanning organizations, positions, and areas of practice within public health. Nearly all (98.0%) respondents indicated that an electronic evidence system would support their work. Respondents had high preferences for local contextual data, research and intervention evidence, and information about human and financial resources. Qualitative analyses identified a number of concerns, needs, and suggestions for development of such a system. Concerns ranged from personal use of such a system to the ability of their organization to use such a system. Identified needs spanned the different sources of evidence including local context, research and intervention evidence, and resources and tools. Additional suggestions were identified to improve system usability. CONCLUSIONS: Canadian public health professionals have positive perceptions towards an electronic evidence system that would bring together evidence from the local context, scientific research, and resources. Elements were also identified to increase the usability of an electronic evidence system.
Graph Neural Networks Learn Twitter Bot Behaviour
Albert Manuel Orozco Camacho
Sacha Lévy
Social media trends are increasingly taking a significant role in the understanding of modern social dynamics. In this work, we take a look at how the Twitter landscape gets constantly shaped by automatically generated content. Twitter bot activity can be traced via network abstractions which, we hypothesize, can be learned through state-of-the-art graph neural network techniques. We employ a large bot database, continuously updated by Twitter, to learn how likely it is that a user is mentioned by a bot, and likewise for a hashtag. Thus, we model this likelihood as a link prediction task between the set of users and hashtags. Moreover, we contrast our results by performing similar experiments on a crawled data set of real users.
Extendable and invertible manifold learning with geometry regularized autoencoders
Andres F. Duque Correa
Sacha Morin
Kevin R. Moon
A fundamental task in data exploration is to extract simplified low dimensional representations that capture intrinsic geometry in data, especially for faithfully visualizing data in two or three dimensions. Common approaches to this task use kernel methods for manifold learning. However, these methods typically only provide an embedding of fixed input data and cannot extend to new data points. Autoencoders have also recently become popular for representation learning. But while they naturally compute feature extractors that are both extendable to new data and invertible (i.e., reconstructing original features from latent representation), they have limited capabilities to follow global intrinsic geometry compared to kernel-based manifold learning. We present a new method for integrating both approaches by incorporating a geometric regularization term in the bottleneck of the autoencoder. Our regularization, based on the diffusion potential distances from the recently-proposed PHATE visualization method, encourages the learned latent representation to follow intrinsic data geometry, similar to manifold learning algorithms, while still enabling faithful extension to new data and reconstruction of data in the original feature space from latent coordinates. We compare our approach with leading kernel methods and autoencoder models for manifold learning to provide qualitative and quantitative evidence of our advantages in preserving intrinsic structure, out of sample extension, and reconstruction. Our method is easily implemented for big-data applications, whereas other methods are limited in this regard.
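A regularized autoencoder objective of the kind described above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: it assumes the geometric term is a mean squared distance between the bottleneck coordinates and a reference embedding computed beforehand (for example by a manifold learning method), and the names `geometry_regularized_loss`, `target_embedding`, and `lam` are hypothetical.

```python
def mse(a, b):
    """Mean squared error between two equally shaped lists of vectors."""
    n = sum(len(row) for row in a)
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)) / n

def geometry_regularized_loss(x, x_reconstructed, latent, target_embedding, lam=1.0):
    """Reconstruction error plus a penalty tying the bottleneck to a reference embedding."""
    reconstruction = mse(x, x_reconstructed)
    # Assumption: the reference embedding is precomputed on the training data.
    geometric = mse(latent, target_embedding)
    return reconstruction + lam * geometric

# Made-up values for one sample with 3 features and a 2-D bottleneck.
loss = geometry_regularized_loss(
    x=[[1.0, 2.0, 3.0]],
    x_reconstructed=[[1.0, 2.0, 2.0]],
    latent=[[0.0, 1.0]],
    target_embedding=[[0.0, 0.0]],
    lam=0.5,
)
```

Because the penalty only acts during training, the trained encoder and decoder can still be applied to unseen points, which is what makes such a model extendable and invertible.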
A Study of Condition Numbers for First-Order Optimization
Charles Guille-Escuret
Baptiste Goujaud
Manuela Girotti
The study of first-order optimization algorithms (FOA) typically starts with assumptions on the objective functions, most commonly smoothness and strong convexity. These metrics are used to tune the hyperparameters of FOA. We introduce a class of perturbations quantified via a new norm, called *-norm. We show that adding a small perturbation to the objective function has an equivalently small impact on the behavior of any FOA, which suggests that it should have a minor impact on the tuning of the algorithm. However, we show that smoothness and strong convexity can be heavily impacted by arbitrarily small perturbations, leading to excessively conservative tunings and convergence issues. In view of these observations, we propose a notion of continuity of the metrics, which is essential for a robust tuning strategy. Since smoothness and strong convexity are not continuous, we propose a comprehensive study of existing alternative metrics which we prove to be continuous. We describe their mutual relations and provide their guaranteed convergence rates for the Gradient Descent algorithm accordingly tuned. Finally we discuss how our work impacts the theoretical understanding of FOA and their performances.
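To make the role of smoothness and strong convexity in tuning concrete, here is a small worked sketch on a two-dimensional quadratic. The symbols `L` (smoothness constant) and `mu` (strong convexity constant), the test function, and the step size 1/L are standard textbook choices assumed for illustration; they are not taken from the paper. On this function, gradient descent with step 1/L removes the steep coordinate in one step and shrinks the flat coordinate by a factor of 1 - mu/L per step.

```python
def gradient_descent_quadratic(L, mu, x0, steps):
    """Run gradient descent with step 1/L on f(x) = 0.5 * (L * x[0]**2 + mu * x[1]**2)."""
    step = 1.0 / L
    x = list(x0)
    for _ in range(steps):
        grad = [L * x[0], mu * x[1]]
        x = [xi - step * gi for xi, gi in zip(x, grad)]
    return x

# Condition number L / mu = 10: the flat direction contracts by 0.9 per step.
x_final = gradient_descent_quadratic(L=10.0, mu=1.0, x0=[1.0, 1.0], steps=3)
```

A larger estimate of `L` gives a smaller step and a slower contraction, which is why an inflated smoothness constant leads to conservative tuning.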
Team Optimal Control of Coupled Major-Minor Subsystems with Mean-Field Sharing
Jalal Arabneydi
Horizontal gene transfer and recombination analysis of SARS-CoV-2 genes helps discover its close relatives and shed light on its origin
Bogdan Mazoure
Pierre Legendre
The SARS-CoV-2 pandemic is one of the greatest global medical and social challenges that have emerged in recent history. Human coronavirus strains discovered during previous SARS outbreaks have been hypothesized to pass from bats to humans using intermediate hosts, e.g. civets for SARS-CoV and camels for MERS-CoV. The discovery of an intermediate host of SARS-CoV-2 and the identification of the specific mechanism of its emergence in humans are topics of primary evolutionary importance. In this study we investigate the evolutionary patterns of 11 main genes of SARS-CoV-2. Previous studies suggested that the genome of SARS-CoV-2 is highly similar to the horseshoe bat coronavirus RaTG13 for most of the genes and to some Malayan pangolin coronavirus (CoV) strains for the receptor binding (RB) domain of the spike protein. We provide a detailed list of statistically significant horizontal gene transfer and recombination events (both intergenic and intragenic) inferred for each of the 11 main genes of the SARS-CoV-2 genome. Our analysis reveals that two continuous regions of genes S and N of SARS-CoV-2 may result from intragenic recombination between RaTG13 and Guangdong (GD) Pangolin CoVs. Statistically significant gene transfer-recombination events between RaTG13 and GD Pangolin CoV have been identified in region [1215–1425] of gene S and region [534–727] of gene N. Moreover, some statistically significant recombination events between the ancestors of SARS-CoV-2, RaTG13, GD Pangolin CoV and bat CoV ZC45-ZXC21 coronaviruses have been identified in genes ORF1ab, S, ORF3a, ORF7a, ORF8 and N. Furthermore, topology-based clustering of gene trees inferred for 25 CoV organisms revealed a three-way evolution of coronavirus genes, with gene phylogenies of ORF1ab, S and N forming the first cluster, gene phylogenies of ORF3a, E, M, ORF6, ORF7a, ORF7b and ORF8 forming the second cluster, and phylogeny of gene ORF10 forming the third cluster.
The results of our horizontal gene transfer and recombination analysis suggest that SARS-CoV-2 could not only be a chimera virus resulting from recombination of the bat RaTG13 and Guangdong pangolin coronaviruses but also a close relative of the bat CoV ZC45 and ZXC21 strains. They also indicate that a GD pangolin may be an intermediate host of this dangerous virus.
An Analysis of Dataset Overlap on Winograd-Style Tasks
Ali Emami
Adam Trischler
Kaheer Suleman
The Winograd Schema Challenge (WSC) and variants inspired by it have become important benchmarks for common-sense reasoning (CSR). Model performance on the WSC has quickly progressed from chance-level to near-human using neural language models trained on massive corpora. In this paper, we analyze the effects of varying degrees of overlaps that occur between these corpora and the test instances in WSC-style tasks. We find that a large number of test instances overlap considerably with the pretraining corpora on which state-of-the-art models are trained, and that a significant drop in classification accuracy occurs when models are evaluated on instances with minimal overlap. Based on these results, we provide the WSC-Web dataset, consisting of over 60k pronoun disambiguation problems scraped from web data, being both the largest corpus to date, and having a significantly lower proportion of overlaps with current pretraining corpora.
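One simple way to quantify overlap between a test instance and a corpus is the fraction of the instance's word n-grams that also occur in the corpus. The sketch below assumes that definition for illustration only; the abstract does not specify the overlap measure used in the paper, and the function names, whitespace tokenization, and example sentences are hypothetical.

```python
def ngrams(tokens, n):
    """Set of contiguous n-token tuples in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_fraction(instance, corpus_sentences, n=3):
    """Fraction of the instance's n-grams that appear anywhere in the corpus."""
    instance_grams = ngrams(instance.lower().split(), n)
    if not instance_grams:
        return 0.0
    corpus_grams = set()
    for sentence in corpus_sentences:
        corpus_grams |= ngrams(sentence.lower().split(), n)
    return len(instance_grams & corpus_grams) / len(instance_grams)

# Made-up test instance and one-sentence corpus sharing their first three trigrams.
score = overlap_fraction(
    "The trophy does not fit in the suitcase",
    ["the trophy does not fit on the shelf"],
)
```

Instances could then be bucketed by such a score to compare model accuracy on high-overlap and low-overlap subsets.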
Autonomous navigation of stratospheric balloons using reinforcement learning
S. Candido
Jun Gong
Marlos C. Machado
Subhodeep Moitra
Sameera S. Ponda
Ziyun Wang
Learning Efficient Task-Specific Meta-Embeddings with Word Prisms
Jingyi He
Kc Tsiolis
Kian Kenyon-Dean
Word embeddings are trained to predict word cooccurrence statistics, which leads them to possess different lexical properties (syntactic, semantic, etc.) depending on the notion of context defined at training time. These properties manifest when querying the embedding space for the most similar vectors, and when used at the input layer of deep neural networks trained to solve downstream NLP problems. Meta-embeddings combine multiple sets of differently trained word embeddings, and have been shown to successfully improve intrinsic and extrinsic performance over equivalent models which use just one set of source embeddings. We introduce word prisms: a simple and efficient meta-embedding method that learns to combine source embeddings according to the task at hand. Word prisms learn orthogonal transformations to linearly combine the input source embeddings, which allows them to be very efficient at inference time. We evaluate word prisms in comparison to other meta-embedding methods on six extrinsic evaluations and observe that word prisms offer improvements in performance on all tasks.
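Combining source embeddings through orthogonal transformations and a linear sum can be illustrated with a toy example. This sketch makes several assumptions that are not in the abstract: every source embedding has the same dimension (2 here), the orthogonal maps are plain rotations, and the scalar weights and function names are hypothetical. In a real system the transformations and weights would be learned for the task.

```python
import math

def rotation(theta):
    """2x2 rotation matrix, a simple example of an orthogonal transformation."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def matvec(matrix, vector):
    """Matrix-vector product on nested lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def combine(sources, transforms, weights):
    """Weighted sum of orthogonally transformed source embeddings for one word."""
    combined = [0.0] * len(sources[0])
    for vector, transform, weight in zip(sources, transforms, weights):
        transformed = matvec(transform, vector)
        combined = [c + weight * t for c, t in zip(combined, transformed)]
    return combined

# Two made-up source vectors for the same word.
meta = combine(
    sources=[[1.0, 0.0], [0.0, 2.0]],
    transforms=[rotation(math.pi / 2), rotation(0.0)],
    weights=[0.5, 0.5],
)
```

Since the combination is linear, the resulting vector for each word can be computed once and stored, which is one plausible reading of why such a method is cheap at inference time.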
Learning Lexical Subspaces in a Distributional Vector Space
Kushal Arora
Aishik Chakraborty
In this paper, we propose LexSub, a novel approach towards unifying lexical and distributional semantics. We inject knowledge about lexical-semantic relations into distributional word embeddings by defining subspaces of the distributional vector space in which a lexical relation should hold. Our framework can handle symmetric attract and repel relations (e.g., synonymy and antonymy, respectively), as well as asymmetric relations (e.g., hypernymy and meronomy). In a suite of intrinsic benchmarks, we show that our model outperforms previous approaches on relatedness tasks and on hypernymy classification and detection, while being competitive on word similarity tasks. It also outperforms previous systems on extrinsic classification tasks that benefit from exploiting lexical relational cues. We perform a series of analyses to understand the behaviors of our model. Code available at https://github.com/aishikchakraborty/LexSub.