Jian Tang

Chuanrui Wang

Master's Research - Université de Montréal

chuanrui.wang@mila.quebec

Farzaneh Heidari

PhD - Université de Montréal

Principal supervisor: Guillaume Rabusseau

farzaneh.heidari@mila.quebec

Huiyu Cai

PhD - Université de Montréal

huiyu.cai@mila.quebec

Jianan Zhao

PhD - Université de Montréal

jianan.zhao@mila.quebec

Jerry Lu

PhD - Université de Montréal

jiarui.lu@mila.quebec

Meng Qu

PhD - Université de Montréal

qumeng@mila.quebec

Michael Galkin

Collaborating researcher

mikhail.galkin@mila.quebec

Minghao Xu

Collaborating researcher

minghao.xu@mila.quebec

Research Intern - HEC Montréal

nicole.duane@mila.quebec

Sophie Xhonneux

PhD - Université de Montréal

Co-supervisor: Gauthier Gidel

sophie.xhonneux@mila.quebec

Xinyu Yuan

PhD - Université de Montréal

xinyu.yuan@mila.quebec

Yangtian Zhang

Collaborating researcher

yangtian.zhang@mila.quebec

Zewen Chi

Research Intern - Beijing Institute of Technology

zewen.chi@mila.quebec

Zhaocheng Zhu

PhD - Université de Montréal

Zhihao Zhan

PhD - Université de Montréal

zhihao.zhan@mila.quebec

Zuobai Zhang

PhD - Université de Montréal

zuobai.zhang@mila.quebec

Publications

Towards Foundational Models for Molecular Learning on Large-Scale Multi-Task Datasets

Dominique Beaini

Shenyang Huang

Joao Alex Cunha

Zhiyi Li

Gabriela Moisescu-Pareja

Oleksandr Dymov

Samuel Maddrell-Mander

Callum McLean

Frederik Wenkel

Luis Müller

Jama Hussein Mohamud

Ali Parviz

Michael Craig

Michał Koziarski

Jiarui Lu

Zhaocheng Zhu

Cristian Gabellini

Kerstin Klaser

Josef Dean

Cas Wognum

Maciej Sypetkowski

Guillaume Rabusseau

Reihaneh Rabbany

Christopher Morris

Ioannis Koutis

Mirco Ravanelli

Guy Wolf

Prudencio Tossou

Hadrien Mary

Therence Bois

Andrew William Fitzgibbon

Blazej Banaszewski

Chad Martin

Dominic Masters

Recently, pre-trained foundation models have enabled significant advancements in multiple fields. In molecular machine learning, however, where datasets are often hand-curated, and hence typically small, the lack of datasets with labeled features, and codebases to manage those datasets, has hindered the development of foundation models. In this work, we present seven novel datasets categorized by size into three distinct categories: ToyMix, LargeMix and UltraLarge. These datasets push the boundaries in both the scale and the diversity of supervised labels for molecular learning. They cover nearly 100 million molecules and over 3000 sparsely defined tasks, totaling more than 13 billion individual labels of both quantum and biological nature. In comparison, our datasets contain 300 times more data points than the widely used OGB-LSC PCQM4Mv2 dataset, and 13 times more than the quantum-only QM1B dataset. In addition, to support the development of foundational models based on our proposed datasets, we present the Graphium graph machine learning library which simplifies the process of building and training molecular machine learning models for multi-task and multi-level molecular datasets. Finally, we present a range of baseline results as a starting point of multi-task and multi-level training on these datasets. Empirically, we observe that performance on low-resource biological datasets shows improvement by also training on large amounts of quantum data. This indicates that there may be potential in multi-task and multi-level training of a foundation model and fine-tuning it to resource-constrained downstream tasks. The Graphium library is publicly available on GitHub and the dataset links are available in Part 1 and Part 2.

2024-01-16

ICLR.cc/2024/Conference (poster)

Multi-modal Molecule Structure-text Model for Text-based Retrieval and Editing

Shengchao Liu

Weili Nie

Chengpeng Wang

Jiarui Lu

Zhuoran Qiao

Ling Liu

Chaowei Xiao

Animashree Anandkumar

There is increasing adoption of artificial intelligence in drug discovery. However, existing studies use machine learning to mainly utilize the chemical structures of molecules but ignore the vast textual knowledge available in chemistry. Incorporating textual knowledge enables us to realize new drug design objectives, adapt to text-based instructions and predict complex biological activities. Here we present a multi-modal molecule structure-text model, MoleculeSTM, by jointly learning molecules' chemical structures and textual descriptions via a contrastive learning strategy. To train MoleculeSTM, we construct a large multi-modal dataset, namely, PubChemSTM, with over 280,000 chemical structure-text pairs. To demonstrate the effectiveness and utility of MoleculeSTM, we design two challenging zero-shot tasks based on text instructions, including structure-text retrieval and molecule editing. MoleculeSTM has two main properties: open vocabulary and compositionality via natural language. In experiments, MoleculeSTM obtains the state-of-the-art generalization ability to novel biochemical concepts across various benchmarks.

2023-12-18

Nature Machine Intelligence (published)

Pretrainable Geometric Graph Neural Network for Antibody Affinity Maturation

Huiyu Cai

Zuobai Zhang

Mingkai Wang

Bozitao Zhong

Yanling Wu

Tianlei Ying

In the realm of antibody therapeutics development, increasing the binding affinity of an antibody to its target antigen is a crucial task. This paper presents GearBind, a pretrainable deep neural network designed to be effective for in silico affinity maturation. Leveraging multi-level geometric message passing alongside contrastive pretraining on protein structural data, GearBind capably models the complex interplay of atom-level interactions within protein complexes, surpassing previous state-of-the-art approaches on SKEMPI v2 in terms of Pearson correlation, mean absolute error (MAE) and root mean square error (RMSE). In silico experiments elucidate that pretraining helps GearBind become sensitive to mutation-induced binding affinity changes and reflective of amino acid substitution tendency. Using an ensemble model based on pretrained GearBind, we successfully optimize the affinity of CR3022 to the spike (S) protein of the SARS-CoV-2 Omicron strain. Our strategy yields a high success rate with up to 17-fold affinity increase. GearBind proves to be an effective tool in narrowing the search space for in vitro antibody affinity maturation, underscoring the utility of geometric deep learning and adept pre-training in macromolecule interaction modeling.

2023-12-07

bioRxiv (preprint)

PDB-Struct: A Comprehensive Benchmark for Structure-based Protein Design

Chuanrui Wang

Bozitao Zhong

Zuobai Zhang

Narendra Chaudhary

Sanchit Misra

Structure-based protein design has attracted increasing interest, with numerous methods being introduced in recent years. However, a universally accepted method for evaluation has not been established, since the wet-lab validation can be overly time-consuming for the development of new algorithms, and the

2023-10-25

NeurIPS.cc/2023/Workshop/AI4D3 (poster)

Large Language Models can Learn Rules

Zhaocheng Zhu

Yuan Xue

Xinyun Chen

Denny Zhou

Dale Schuurmans

Hanjun Dai

2023-10-11

ArXiv (preprint)

GraphText: Graph Reasoning in Text Space

Jianan Zhao

Le Zhuo

Yikang Shen

Meng Qu

Kai Liu

Michael Bronstein

Zhaocheng Zhu

Large Language Models (LLMs) have gained the ability to assimilate human knowledge and facilitate natural language interactions with both humans and other LLMs. However, despite their impressive achievements, LLMs have not made significant advancements in the realm of graph machine learning. This limitation arises because graphs encapsulate distinct relational data, making it challenging to transform them into natural language that LLMs understand. In this paper, we bridge this gap with a novel framework, GraphText, that translates graphs into natural language. GraphText derives a graph-syntax tree for each graph that encapsulates both the node attributes and inter-node relationships. Traversal of the tree yields a graph text sequence, which is then processed by an LLM to treat graph tasks as text generation tasks. Notably, GraphText offers multiple advantages. It introduces training-free graph reasoning: even without training on graph data, GraphText with ChatGPT can achieve performance on par with, or even surpassing, that of supervised-trained graph neural networks through in-context learning (ICL). Furthermore, GraphText paves the way for interactive graph reasoning, allowing both humans and LLMs to communicate with the model seamlessly using natural language. These capabilities underscore the vast, yet-to-be-explored potential of LLMs in the domain of graph machine learning.

2023-10-02

ArXiv (preprint)

An Empirical Study of Retrieval-Enhanced Graph Neural Networks

Dingmin Wang

Shengchao Liu

Hanchen Wang

Bernardo Cuenca Grau

Linfeng Song

Le Song

Qi Liu

Graph Neural Networks (GNNs) are effective tools for graph representation learning. Most GNNs rely on a recursive neighborhood aggregation scheme, named message passing, so their theoretical expressive power is limited to the first-order Weisfeiler-Lehman test (1-WL). An effective approach to this challenge is to explicitly retrieve some annotated examples used to enhance GNN models. While retrieval-enhanced models have been proved to be effective in many language and vision domains, it remains an open question how effective retrieval-enhanced GNNs are when applied to graph datasets. Motivated by this, we want to explore how the retrieval idea can help augment the useful information learned in the graph neural networks, and we design a retrieval-enhanced scheme called GRAPHRETRIEVAL, which is agnostic to the choice of graph neural network models. In GRAPHRETRIEVAL, for each input graph, similar graphs together with their ground-truth labels are retrieved from an existing database. Thus they can act as a potential enhancement to complete various graph property predictive tasks. We conduct comprehensive experiments over 13 datasets, and we observe that GRAPHRETRIEVAL is able to reach substantial improvements over existing GNNs. Moreover, our empirical study also illustrates that retrieval enhancement is a promising remedy for alleviating the long-tailed label distribution problem.

2023-09-28

Frontiers in Artificial Intelligence and Applications (published)

Evaluating Self-Supervised Learning for Molecular Graph Embeddings

Hanchen Wang

Jean Kaddour

Shengchao Liu

Matt J. Kusner

Joan Lasenby

Qi Liu

Graph Self-Supervised Learning (GSSL) provides a robust pathway for acquiring embeddings without expert labelling, a capability that carries profound implications for molecular graphs due to the staggering number of potential molecules and the high cost of obtaining labels. However, GSSL methods are designed not for optimisation within a specific domain but rather for transferability across a variety of downstream tasks. This broad applicability complicates their evaluation. Addressing this challenge, we present "Molecular Graph Representation Evaluation" (MOLGRAPHEVAL), generating detailed profiles of molecular graph embeddings with interpretable and diversified attributes. MOLGRAPHEVAL offers a suite of probing tasks grouped into three categories: (i) generic graph, (ii) molecular substructure, and (iii) embedding space properties. By leveraging MOLGRAPHEVAL to benchmark existing GSSL methods against both current downstream datasets and our suite of tasks, we uncover significant inconsistencies between inferences drawn solely from existing datasets and those derived from more nuanced probing. These findings suggest that current evaluation methodologies fail to capture the entirety of the landscape.

Symmetry-Informed Geometric Representation for Molecules, Proteins, and Crystalline Materials

Shengchao Liu

Weitao Du

Yanjing Li

Zhuoxinran Li

Zhiling Zheng

Chenru Duan

Zhi-Ming Ma

Omar M. Yaghi

Animashree Anandkumar

Christian Borgs

Jennifer T Chayes

Hongyu Guo

Artificial intelligence for scientific discovery has recently generated significant interest within the machine learning and scientific communities, particularly in the domains of chemistry, biology, and material discovery. For these scientific problems, molecules serve as the fundamental building blocks, and machine learning has emerged as a highly effective and powerful tool for modeling their geometric structures. Nevertheless, due to the rapidly evolving process of the field and the knowledge gap between science (e.g., physics, chemistry, and biology) and machine learning communities, a benchmarking study on geometrical representation for such data has not been conducted. To address such an issue, in this paper, we first provide a unified view of the current symmetry-informed geometric methods, classifying them into three main categories: invariance, equivariance with spherical frame basis, and equivariance with vector frame basis. Then we propose a platform, coined Geom3D, which enables benchmarking the effectiveness of geometric strategies. Geom3D contains 16 advanced symmetry-informed geometric representation models and 14 geometric pretraining methods over 52 diverse tasks, including small molecules, proteins, and crystalline materials. We hope that Geom3D can, on the one hand, eliminate barriers for machine learning researchers interested in exploring scientific problems; and, on the other hand, provide valuable guidance for researchers in computational chemistry, structural biology, and materials science, aiding in the informed selection of representation techniques for specific applications. The source code is available on the GitHub repository (https://github.com/chao1224/Geom3D).

A*Net: A Scalable Path-based Reasoning Approach for Knowledge Graphs

Zhaocheng Zhu

Xinyu Yuan

Mikhail Galkin

Louis-Pascal Xhonneux

Sophie Xhonneux

Ming Zhang

Maxime Gazeau