Publications

FusionRetro: Molecule Representation Fusion via Reaction Graph for Retrosynthetic Planning
Songtao Liu
Zhengkai Tu
Minkai Xu
Zuobai Zhang
Peilin Zhao
Rex Ying
Lu Lin
Dinghao Wu
Retrosynthetic planning is a fundamental problem in drug discovery and organic chemistry, which aims to find a complete multi-step synthetic route from a set of starting materials to the target molecule, determining crucial process flow in chemical production. Existing approaches combine single-step retrosynthesis models and search algorithms to find synthetic routes. However, these approaches generally consider the two pieces in a decoupled manner, taking only the product as the input to predict the reactants per planning step and largely ignoring the important context information from other intermediates along the synthetic route. In this work, we perform a series of experiments to identify the limitations of this decoupled view and propose a novel retrosynthesis framework that also exploits context information for retrosynthetic planning. We view synthetic routes as reaction graphs, and propose to incorporate the context by three principled steps: encode molecules into embeddings, aggregate information over routes, and readout to predict reactants. The whole framework can be efficiently optimized in an end-to-end fashion. Comprehensive experiments show that by fusing in context information over routes, our model significantly improves the performance of retrosynthetic planning over baselines that are not context-aware, especially for long synthetic routes.
Gap Minimization for Knowledge Sharing and Transfer
Boyu Wang
Jorge A. Mendez
Changjian Shui
Fan Zhou
Di Wu
Gezheng Xu
Eric R. Eaton
Learning from multiple related tasks by knowledge sharing and transfer has become increasingly relevant over the last two decades. In order to successfully transfer information from one task to another, it is critical to understand the similarities and differences between the domains. In this paper, we introduce the notion of performance gap, an intuitive and novel measure of the distance between learning tasks. Unlike existing measures which are used as tools to bound the difference of expected risks between tasks (e.g.,
General Purpose AI Systems in the AI Act: Trying to Fit a Square Peg Into a Round Hole
Claire Boine
Generating QM1B with PySCF$_{\text{IPU}}$
Alexander Mathiasen
Hatem Helal
Kerstin Klaeser
Paul Balanca
Josef Dean
Carlo Luschi
Andrew William Fitzgibbon
Dominic Masters
Geodesic Sinkhorn for Fast and Accurate Optimal Transport on Manifolds
Guillaume Huguet
Alexander Tong
María Ramos Zapatero
Christopher J. Tape
Smita Krishnaswamy
Efficient computation of optimal transport distance between distributions is of growing importance in data science. Sinkhorn-based methods are currently the state-of-the-art for such computations, but require O(n^2) computations. In addition, Sinkhorn-based methods commonly use a Euclidean ground distance between datapoints. However, with the prevalence of manifold structured scientific data, it is often desirable to consider geodesic ground distance. Here, we tackle both issues by proposing Geodesic Sinkhorn, which is based on diffusing a heat kernel on a manifold graph. Notably, Geodesic Sinkhorn requires only O(n log n) computation, as we approximate the heat kernel with Chebyshev polynomials based on the sparse graph Laplacian. We apply our method to the computation of barycenters of several distributions of high dimensional single cell data from patient samples undergoing chemotherapy. In particular, we define the barycentric distance as the distance between two such barycenters. Using this definition, we identify an optimal transport distance and path associated with the effect of treatment on cellular data.
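The idea of using a graph heat kernel as the Sinkhorn kernel can be illustrated on a toy graph. The sketch below is not the authors' implementation: it builds the heat kernel densely through an eigendecomposition of the Laplacian (not the Chebyshev approximation the abstract describes, so it does not achieve the stated complexity), and the function names `heat_kernel` and `sinkhorn_plan`, the diffusion time `t`, and the iteration count are all assumptions made for illustration.

```python
import numpy as np

def heat_kernel(adjacency, t):
    """Dense heat kernel exp(-t L) for the graph Laplacian L = D - A.

    Assumption: a small, connected graph, so a full eigendecomposition
    is affordable and every kernel entry is strictly positive.
    """
    laplacian = np.diag(adjacency.sum(axis=1)) - adjacency
    eigvals, eigvecs = np.linalg.eigh(laplacian)
    # V diag(exp(-t * lambda)) V^T
    return (eigvecs * np.exp(-t * eigvals)) @ eigvecs.T

def sinkhorn_plan(kernel, mu, nu, n_iters=500):
    """Standard Sinkhorn scaling: diag(u) K diag(v) with marginals mu and nu."""
    u = np.ones_like(mu)
    v = np.ones_like(nu)
    for _ in range(n_iters):
        u = mu / (kernel @ v)
        v = nu / (kernel.T @ u)
    return u[:, None] * kernel * v[None, :]

# Toy example: a path graph with 4 nodes, mass moving from one end to the other.
adjacency = np.array([
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
])
mu = np.array([0.7, 0.1, 0.1, 0.1])
nu = np.array([0.1, 0.1, 0.1, 0.7])
plan = sinkhorn_plan(heat_kernel(adjacency, t=1.0), mu, nu)
```

Here `plan` is a coupling whose row sums approximate `mu` and whose column sums approximate `nu`. On large sparse graphs, the dense kernel would be replaced by a routine that applies the kernel to a vector without forming it.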
GFlowNets for AI-Driven Scientific Discovery
Moksh J. Jain
Tristan Deleu
Jason Hartford
Cheng-Hao Liu
Alex Hernandez-Garcia
Tackling the most pressing problems for humanity, such as the climate crisis and the threat of global pandemics, requires accelerating the pace of scientific discovery. While science has traditionally relied...
GFlowOut: Dropout with Generative Flow Networks
Dianbo Liu
Moksh J. Jain
Bonaventure F. P. Dossou
Qianli Shen
Salem Lahlou
Anirudh Goyal
Nikolay Malkin
Chris Emezue
Dinghuai Zhang
Nadhir Hassen
Xu Ji
Kenji Kawaguchi
GitHub Copilot AI pair programmer: Asset or Liability?
Arghavan Moradi Dakhel
Vahid Majdinasab
Amin Nikanjam
Michel C. Desmarais
Z. Jiang
Automatic program synthesis is a long-lasting dream in software engineering. Recently, a promising Deep Learning (DL) based solution, called Copilot, has been proposed by OpenAI and Microsoft as an industrial product. Although some studies evaluate the correctness of Copilot solutions and report its issues, more empirical evaluations are necessary to understand how developers can benefit from it effectively. In this paper, we study the capabilities of Copilot in two different programming tasks: (i) generating (and reproducing) correct and efficient solutions for fundamental algorithmic problems, and (ii) comparing Copilot's proposed solutions with those of human programmers on a set of programming tasks. For the former, we assess the performance and functionality of Copilot in solving selected fundamental problems in computer science, like sorting and implementing data structures. In the latter, a dataset of programming problems with human-provided solutions is used. The results show that Copilot is capable of providing solutions for almost all fundamental algorithmic problems; however, some solutions are buggy and non-reproducible. Moreover, Copilot has some difficulties in combining multiple methods to generate a solution. Comparing Copilot to humans, our results show that the ratio of correct solutions is higher for humans than for Copilot, while the buggy solutions generated by Copilot require less effort to be repaired.
GOKU-UI: Ubiquitous Inference through Attention and Multiple Shooting for Continuous-time Generative Models
Germán Abrevaya
Mahta Ramezanian-Panahi
Jean-Christophe Gagnon-Audet
Pablo Polosecki
Silvina Ponce Dawson
Guillermo Cecchi
Scientific Machine Learning (SciML) is a burgeoning field that synergistically combines domain-aware and interpretable models with agnostic machine learning techniques. In this work, we introduce GOKU-UI, an evolution of the SciML generative model GOKU-nets. The GOKU-UI broadens the original model’s spectrum to incorporate other classes of differential equations, such as Stochastic Differential Equations (SDEs), and integrates a distributed, i.e. ubiquitous, inference through attention mechanisms and a novel multiple shooting training strategy in the latent space. These enhancements have led to a significant increase in its performance in both reconstruction and forecast tasks, as demonstrated by our evaluation of simulated and empirical data. Specifically, GOKU-UI outperformed all baseline models on synthetic datasets even with a training set 32-fold smaller, underscoring its remarkable data efficiency. Furthermore, when applied to empirical human brain data, while incorporating stochastic Stuart-Landau
Gradient Masked Averaging for Federated Learning
Irene Tenison
Sai Aravind Sreeramadas
Vaikkunth Mugunthan
Edouard Oyallon
Federated learning (FL) is an emerging paradigm that permits a large number of clients with heterogeneous data to coordinate learning of a unified global model without the need to share data amongst each other. A major challenge in federated learning is the heterogeneity of data across clients, which can degrade the performance of standard FL algorithms. Standard FL algorithms involve averaging of model parameters or gradient updates to approximate the global model at the server. However, we argue that in heterogeneous settings, averaging can result in information loss and lead to poor generalization due to the bias induced by dominant client gradients. We hypothesize that to generalize better across non-i.i.d. datasets, the algorithms should focus on learning the invariant mechanism that is constant while ignoring spurious mechanisms that differ across clients. Inspired by recent works in Out-of-Distribution generalization, we propose a gradient masked averaging approach for FL as an alternative to the standard averaging of client updates. This aggregation technique for client updates can be adapted as a drop-in replacement in most existing federated algorithms. We perform extensive experiments on multiple FL algorithms with in-distribution, real-world, feature-skewed out-of-distribution, and quantity-imbalanced datasets and show that it provides consistent improvements, particularly in the case of heterogeneous clients.
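As a rough illustration of how a masked average can differ from a plain average, the sketch below down-weights coordinates on which client updates disagree in sign. The abstract does not specify the masking rule, so everything here is an assumption: the sign-agreement score, the threshold `tau`, the soft fallback for low-agreement coordinates, and the function name `masked_average` are hypothetical choices for illustration, not the paper's definition.

```python
import numpy as np

def masked_average(client_updates, tau=0.4):
    """Average client updates, damping coordinates with low sign agreement.

    Assumed rule (not taken from the abstract):
      agreement = |mean of sign(update)| per coordinate, a value in [0, 1]
      mask      = 1 where agreement >= tau, otherwise the agreement itself
    """
    updates = np.asarray(client_updates, dtype=float)  # shape: (clients, params)
    agreement = np.abs(np.sign(updates).mean(axis=0))
    mask = np.where(agreement >= tau, 1.0, agreement)
    return mask * updates.mean(axis=0)

# Three clients, two parameters: all agree on the first, one dissents on the second.
updates = [
    [1.0, 1.0],
    [1.0, -1.0],
    [1.0, 1.0],
]
aggregated = masked_average(updates)
```

In this example the first coordinate passes through unchanged, while the second is scaled by its agreement score. Because the function only replaces the averaging step, a sketch like this could slot into a server-side aggregation loop without changing the client code.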