Publications

Phylogenetic Manifold Regularization: A semi-supervised approach to predict transcription factor binding sites

Francois Laviolette

The computational prediction of transcription factor binding sites remains a challenging problems in bioinformatics, despite significant m e… (voir plus)thodological d evelopments f rom t he field of machine learning. Such computational models are essential to help interpret the non-coding portion of human genomes, and to learn more about the regulatory mechanisms controlling gene expression. In parallel, massive genome sequencing efforts have produced assembled genomes for hundred of vertebrate species, but this data is underused. We present PhyloReg, a new semi-supervised learning approach that can be used for a wide variety of sequence-to-function prediction problems, and that takes advantage of hundreds of millions of years of evolution to regularize predictors and improve accuracy. We demonstrate that PhyloReg can be used to better train a previously proposed deep learning model of transcription factor binding. Simulation studies further help delineate the benefits o f t he a pproach. G ains in prediction accuracy are obtained over a broad set of transcription factors and cell types.

2020-12-15

2020 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) (publié)

doi.org

Assessing the Electronic Evidence System Needs of Canadian Public Health Professionals: A Cross-Sectional Study (Preprint)

Bandna Dhaliwal

Sarah E Neil-Sztramko

Nikita Boston-Fisher

David L Buckeridge

Maureen Dobbins

BACKGROUND True evidence-informed decision making in public health relies on incorporating evidence from a number of sources in addition to… (voir plus) traditional scientific evidence. Lack of access to these types of data, as well as ease of use and interpretability of scientific evidence contribute to limited uptake of evidence-informed decision making in practice. An electronic evidence system that includes multiple sources of evidence and potentially novel computational processing approaches or artificial intelligence holds promise as a solution to overcoming barriers to evidence-informed decision making in public health. OBJECTIVE To understand the needs and preferences for an electronic evidence system among public health professionals in Canada. METHODS An invitation to participate in an anonymous online survey was distributed via listservs of two Canadian public health organizations. Eligible participants were English or French speaking individuals currently working in public health. The survey contained both multiple choice and open-ended questions about needs and preferences relevant to an electronic evidence system. Quantitative responses were analyzed to explore differences by public health role. Inductive and deductive analysis methods were used to code and interpret the qualitative data. Ethics review was not required by the host institution. RESULTS Respondents (n = 371) were heterogeneous, spanning organizations, positions, and areas of practice within public health. Nearly all (98.0%) respondents indicated that an electronic evidence system would support their work. Respondents had high preferences for local contextual data, research and intervention evidence, and information about human and financial resources. Qualitative analyses identified a number of concerns, needs, and suggestions for development of such a system. Concerns ranged from personal use of such a system, to the ability of their organization to use such a system. Identified needs spanned the different sources of evidence including local context, research and intervention evidence, and resources and tools. Additional suggestions were identified to improve system usability. CONCLUSIONS Canadian public health professionals have positive perceptions towards an electronic evidence system that would bring together evidence from the local context, scientific research, and resources. Elements were also identified to increase the usability of an electronic evidence system.

2020-12-13

JMIR Public Health and Surveillance (publié)

doi.org

Graph Neural Networks Learn Twitter Bot Behaviour

Albert M. Orozco Camacho

Sacha Lévy

Reihaneh Rabbany

Social media trends are increasingly taking a significant role for the understanding of modern social dynamics. In this work, we take a look… (voir plus) at how the Twitter landscape gets constantly shaped by automatically generated content. Twitter bot activity can be traced via network abstractions which, we hypothesize, can be learned through state-of-the-art graph neural network techniques. We employ a large bot database, continuously updated by Twitter, to learn how likely is that a user is mentioned by a bot, as well as, for a hashtag. Thus, we model this likelihood as a link prediction task between the set of users and hashtags. Moreover, we contrast our results by performing similar experiments on a crawled data set of real users.

2020-12-11

LatinX in AI at Neural Information Processing Systems Conference 2020 (publié)

doi.org

Extendable and invertible manifold learning with geometry regularized autoencoders

Andrés F. Duque

Sacha Morin

Guy Wolf

Kevin Moon

A fundamental task in data exploration is to extract simplified low dimensional representations that capture intrinsic geometry in data, esp… (voir plus)ecially for faithfully visualizing data in two or three dimensions. Common approaches to this task use kernel methods for manifold learning. However, these methods typically only provide an embedding of fixed input data and cannot extend to new data points. Autoencoders have also recently become popular for representation learning. But while they naturally compute feature extractors that are both extendable to new data and invertible (i.e., reconstructing original features from latent representation), they have limited capabilities to follow global intrinsic geometry compared to kernel-based manifold learning. We present a new method for integrating both approaches by incorporating a geometric regularization term in the bottleneck of the autoencoder. Our regularization, based on the diffusion potential distances from the recently-proposed PHATE visualization method, encourages the learned latent representation to follow intrinsic data geometry, similar to manifold learning algorithms, while still enabling faithful extension to new data and reconstruction of data in the original feature space from latent coordinates. We compare our approach with leading kernel methods and autoencoder models for manifold learning to provide qualitative and quantitative evidence of our advantages in preserving intrinsic structure, out of sample extension, and reconstruction. Our method is easily implemented for big-data applications, whereas other methods are limited in this regard.

2020-12-09

2020 IEEE International Conference on Big Data (Big Data) (publié)

doi.org

arxiv.org

Machine Learning for Glacier Monitoring in the Hindu Kush Himalaya

Shimaa Baraka

Benjamin Akera

Bibek Aryal

Tenzing Chogyal Sherpa

Finu Shresta

Anthony Ortiz

Kris Sankaran

Juan Lavista Ferres

M. Matin

Yoshua Bengio

2020-12-08

ArXiv (prépublication)

arxiv.org

Team Optimal Control of Coupled Major-Minor Subsystems with Mean-Field Sharing

Jalal Arabneydi

Aditya Mahajan

2020-12-03

ArXiv (prépublication)

arxiv.org

Historical and cross-disciplinary trends in the biological and social sciences reveal an accelerating adoption of advanced analytics

Taylor Bolt

Jason S. Nomi

Danilo Bzdok

Lucina Q. Uddin

Methods for data analysis in the biomedical, life and social sciences are developing at a rapid pace. At the same time, there is increasing … (voir plus)concern that education in quantitative methods is failing to adequately prepare students for contemporary research. These trends have led to calls for educational reform to undergraduate and graduate quantitative research method curricula. We argue that such reform should be based on data-driven insights into within- and cross-disciplinary use of research methods. Our survey of peer-reviewed literature screened ∼3.5 million openly available research articles to monitor the cross-disciplinary usage of research methods in the past decade. We applied data-driven text-mining analyses to the methods and materials section of a large subset of this corpus to identify method trends shared across disciplines, as well as those unique to each discipline. As a whole, usage of T -test, analysis of variance, and other classical regression-based methods has declined in the published literature over the past 10 years. Machine-learning approaches, such as artificial neural networks, have seen a significant increase in the total share of scientific publications. We find unique groupings of research methods associated with each biomedical, life and social science discipline, such as the use of structural equation modeling in psychology, survival models in oncology, and manifold learning in ecology. We discuss the implications of these findings for education in statistics and research methods, as well as within- and cross-disciplinary collaboration.

2020-12-02

bioRxiv (prépublication)

doi.org

Horizontal gene transfer and recombination analysis of SARS-CoV-2 genes helps discover its close relatives and shed light on its origin

Vladimir Makarenkov

Bogdan Mazoure

Guillaume Rabusseau

Pierre Legendre

The SARS-CoV-2 pandemic is one of the greatest global medical and social challenges that have emerged in recent history. Human corona… (voir plus)virus strains discovered during previous SARS outbreaks have been hypothesized to pass from bats to humans using intermediate hosts, e.g. civets for SARS-CoV and camels for MERS-CoV. The discovery of an intermediate host of SARS-CoV-2 and the identification of specific mechanism of its emergence in humans are topics of primary evolutionary importance. In this study we investigate the evolutionary patterns of 11 main genes of SARS-CoV-2. Previous studies suggested that the genome of SARS-CoV-2 is highly similar to the horseshoe bat coronavirus RaTG13 for most of the genes and to some Malayan pangolin coronavirus (CoV) strains for the receptor binding (RB) domain of the spike protein. We provide a detailed list of statistically significant horizontal gene transfer and recombination events (both intergenic and intragenic) inferred for each of 11 main genes of the SARS-CoV-2 genome. Our analysis reveals that two continuous regions of genes S and N of SARS-CoV-2 may result from intragenic recombination between RaTG13 and Guangdong (GD) Pangolin CoVs. Statistically significant gene transfer-recombination events between RaTG13 and GD Pangolin CoV have been identified in region [1215–1425] of gene S and region [534–727] of gene N. Moreover, some statistically significant recombination events between the ancestors of SARS-CoV-2, RaTG13, GD Pangolin CoV and bat CoV ZC45-ZXC21 coronaviruses have been identified in genes ORF1ab, S, ORF3a, ORF7a, ORF8 and N. Furthermore, topology-based clustering of gene trees inferred for 25 CoV organisms revealed a three-way evolution of coronavirus genes, with gene phylogenies of ORF1ab, S and N forming the first cluster, gene phylogenies of ORF3a, E, M, ORF6, ORF7a, ORF7b and ORF8 forming the second cluster, and phylogeny of gene ORF10 forming the third cluster. The results of our horizontal gene transfer and recombination analysis suggest that SARS-CoV-2 could not only be a chimera virus resulting from recombination of the bat RaTG13 and Guangdong pangolin coronaviruses but also a close relative of the bat CoV ZC45 and ZXC21 strains. They also indicate that a GD pangolin may be an intermediate host of this dangerous virus.

2020-12-02

BMC Ecology and Evolution (publié)

doi.org

Intervention Design for Effective Sim2Real Transfer

Melissa Mozian

Amy Zhang

Joelle Pineau

David Meger

The goal of this work is to address the recent success of domain randomization and data augmentation for the sim2real setting. We explain th… (voir plus)is success through the lens of causal inference, positioning domain randomization and data augmentation as interventions on the environment which encourage invariance to irrelevant features. Such interventions include visual perturbations that have no effect on reward and dynamics. This encourages the learning algorithm to be robust to these types of variations and learn to attend to the true causal mechanisms for solving the task. This connection leads to two key findings: (1) perturbations to the environment do not have to be realistic, but merely show variation along dimensions that also vary in the real world, and (2) use of an explicit invariance-inducing objective improves generalization in sim2sim and sim2real transfer settings over just data augmentation or domain randomization alone. We demonstrate the capability of our method by performing zero-shot transfer of a robot arm reach task on a 7DoF Jaco arm learning from pixel observations.

2020-12-02

ArXiv (prépublication)

arxiv.org

Quantitative Equational Reasoning

Giorgio Bacci

Radu Mardare

Prakash Panangaden

Gordon Plotkin

Equational logic is central to reasoning about programs.What is the right equational setting for reasoning about probabilistic programs? It … (voir plus)has been understood that instead of equivalence relations one should work with (pseudo)metrics in a probabilistic setting. However, it is not clear how this relates to equational reasoning. In recent work the notion of a quantitative equational logic was introduced and developed. This retains many of the features of ordinary logic but fits naturally with metric reasoning. The present chapter is an elementry introduction to this topic. In this setting one can define analogues of algebras and free algebras. It turns out that the Kantorovich (Wasserstein) metric emerges as a free construction from a simple quantitative equational theory. We give a couple of examples of quantitative analogues of familiar effects from programming language theory. We do not assume any background in equational logic or advanced category theory.

2020-12-02

Foundations of Probabilistic Programming (publié)

doi.org

An Analysis of Dataset Overlap on Winograd-Style Tasks

Ali Emami

Adam Trischler

Kaheer Suleman

Jackie CK Cheung

The Winograd Schema Challenge (WSC) and variants inspired by it have become important benchmarks for common-sense reasoning (CSR). Model per… (voir plus)formance on the WSC has quickly progressed from chance-level to near-human using neural language models trained on massive corpora. In this paper, we analyze the effects of varying degrees of overlaps that occur between these corpora and the test instances in WSC-style tasks. We find that a large number of test instances overlap considerably with the pretraining corpora on which state-of-the-art models are trained, and that a significant drop in classification accuracy occurs when models are evaluated on instances with minimal overlap. Based on these results, we provide the WSC-Web dataset, consisting of over 60k pronoun disambiguation problems scraped from web data, being both the largest corpus to date, and having a significantly lower proportion of overlaps with current pretraining corpora.

2020-11-30

Proceedings of the 28th International Conference on Computational Linguistics (publié)

doi.org

arxiv.org

Autonomous navigation of stratospheric balloons using reinforcement learning

Bellemare Marc-Emmanuel

S. Candido

Pablo Samuel Castro

Jun Gong

Marlos C. Machado

Subhodeep Moitra

Sameera S. Ponda

Ziyun Wang

2020-11-30

Nature (publié)

doi.org

TRAIL : IA responsable pour les professionnels et les leaders

Fondateur en résidence Mila Ventures

Avantage IA : productivité dans la fonction publique

Publications

TRAIL : IA responsable pour les professionnels et les leaders

Fondateur en résidence Mila Ventures

Avantage IA : productivité dans la fonction publique

Mots-clés populaires:

Publications