Zuobai Zhang

Pre-training Protein Structure Encoder via Siamese Diffusion Trajectory Prediction

Zuobai Zhang

Minghao Xu

Aurelie Lozano

Vijil Chenthamarakshan

Payel Das

Due to the determining role of protein structures in diverse protein functions, pre-training representations of proteins on massive unlabeled protein structures has attracted growing research interest. Among recent efforts in this direction, mutual information (MI) maximization based methods have achieved superior performance on various downstream benchmark tasks. The core of these methods is to design correlated views that share common information about a protein. Previous view designs focus on capturing structural motif co-occurrence on the same protein structure, but they cannot capture detailed atom/residue interactions. To address this limitation, we propose the Siamese Diffusion Trajectory Prediction (SiamDiff) method. SiamDiff builds a view as the trajectory that gradually approaches the native protein structure from scratch, which facilitates the modeling of atom/residue interactions underlying the protein structural dynamics. Specifically, we employ the multimodal diffusion process as a faithful simulation of the structure-sequence co-diffusion trajectory, where rich patterns of protein structural changes are embedded. On this basis, we design a principled theoretical framework to maximize the MI between correlated multimodal diffusion trajectories. We study the effectiveness of SiamDiff on both residue-level and atom-level structures. On the EC and ATOM3D benchmarks, we extensively compare our method with previous protein structure pre-training approaches. The experimental results verify the consistently superior or competitive performance of SiamDiff on all benchmark tasks compared to existing baselines. The source code will be made public upon acceptance.

2023-02-01

ICLR.cc/2023/Conference (rejected)

openreview.net

Protein Representation Learning by Geometric Structure Pretraining

Zuobai Zhang

Minghao Xu

Arian Rokkum Jamasb

Vijil Chenthamarakshan

Aurelie Lozano

Payel Das

Jian Tang

Learning effective protein representations is critical in a variety of tasks in biology such as predicting protein function or structure. Existing approaches usually pretrain protein language models on a large number of unlabeled amino acid sequences and then finetune the models with some labeled data in downstream tasks. Despite the effectiveness of sequence-based approaches, the power of pretraining on known protein structures, which are available in smaller numbers only, has not been explored for protein property prediction, though protein structures are known to be determinants of protein function. In this paper, we propose to pretrain protein representations according to their 3D structures. We first present a simple yet effective encoder to learn the geometric features of a protein. We pretrain the protein graph encoder by leveraging multiview contrastive learning and different self-prediction tasks. Experimental results on both function prediction and fold classification tasks show that our proposed pretraining methods outperform or are on par with the state-of-the-art sequence-based methods, while using much less pretraining data. Our implementation is available at https://github.com/DeepGraphLearning/GearNet.

2023-02-01

ICLR.cc/2023/Conference (poster)

doi.org

openreview.net

FusionRetro: Molecule Representation Fusion via In-Context Learning for Retrosynthetic Planning

Songtao Liu

Zhengkai Tu

Minkai Xu

Zuobai Zhang

Lu Lin

Rex Ying

Zhitao Ying

Jian Tang

Peilin Zhao

Dinghao Wu

2023-01-01

ICML (published)

openreview.net

FusionRetro: Molecule Representation Fusion via Reaction Graph for Retrosynthetic Planning

Songtao Liu

Zhengkai Tu

Minkai Xu

Zuobai Zhang

Peilin Zhao

Jian Tang

Rex Ying

Lu Lin

Dinghao Wu

Retrosynthetic planning is a fundamental problem in drug discovery and organic chemistry, which aims to find a complete multi-step synthetic route from a set of starting materials to the target molecule, determining crucial process flow in chemical production. Existing approaches combine single-step retrosynthesis models and search algorithms to find synthetic routes. However, these approaches generally consider the two pieces in a decoupled manner, taking only the product as the input to predict the reactants per planning step and largely ignoring the important context information from other intermediates along the synthetic route. In this work, we perform a series of experiments to identify the limitations of this decoupled view and propose a novel retrosynthesis framework that also exploits context information for retrosynthetic planning. We view synthetic routes as reaction graphs, and propose to incorporate the context by three principled steps: encode molecules into embeddings, aggregate information over routes, and readout to predict reactants. The whole framework can be efficiently optimized in an end-to-end fashion. Comprehensive experiments show that by fusing in context information over routes, our model significantly improves the performance of retrosynthetic planning over baselines that are not context-aware, especially for long synthetic routes.

2023-01-01

(published)

www.semanticscholar.org

Physics-Inspired Protein Encoder Pre-Training via Siamese Sequence-Structure Diffusion Trajectory Prediction

Zuobai Zhang

Minghao Xu

Aurelie Lozano

Vijil Chenthamarakshan

Payel Das

Jian Tang

Pre-training methods on proteins are recently gaining interest, leveraging either protein sequences or structures, while modeling their joint energy landscape is largely unexplored. In this work, inspired by the success of denoising diffusion models, we propose the DiffPreT approach to pre-train a protein encoder by sequence-structure multimodal diffusion modeling. DiffPreT guides the encoder to recover the native protein sequences and structures from the perturbed ones along the multimodal diffusion trajectory, which acquires the joint distribution of sequences and structures. Considering the essential protein conformational variations, we enhance DiffPreT by a physics-inspired method called Siamese Diffusion Trajectory Prediction (SiamDiff) to capture the correlation between different conformers of a protein. SiamDiff attains this goal by maximizing the mutual information between representations of diffusion trajectories of structurally-correlated conformers. We study the effectiveness of DiffPreT and SiamDiff on both atom- and residue-level structure-based protein understanding tasks. Experimental results show that the performance of DiffPreT is consistently competitive on all tasks, and SiamDiff achieves new state-of-the-art performance, considering the mean ranks on all tasks. The source code will be released upon acceptance.

2023-01-01

arXiv.org (preprint)

doi.org

Metro: Memory-Enhanced Transformer for Retrosynthetic Planning via Reaction Tree

Songtao Liu

Rex Ying

Zhitao Ying

Zuobai Zhang

Peilin Zhao

Jian Tang

Lu Lin

Dinghao Wu

Retrosynthetic planning plays a critical role in drug discovery and organic chemistry. Starting from a target molecule as the root node, it aims to find a complete reaction tree subject to the constraint that all leaf nodes belong to a set of starting materials. The multi-step reactions are crucial because they determine the production flow chart in the organic chemical industry. However, existing datasets lack curation of tree-structured multi-step reactions, and fail to provide such reaction trees, limiting models’ understanding of organic molecule transformations. In this work, we first develop a benchmark curated for the retrosynthetic planning task, which consists of 124,869 reaction trees retrieved from the public USPTO-full dataset. On top of that, we propose Metro: Memory-Enhanced Transformer for RetrOsynthetic planning. Specifically, the dependency among molecules in the reaction tree is captured as context information for multi-step retrosynthesis predictions through transformers with a memory module. Extensive experiments show that Metro dramatically outperforms existing single-step retrosynthesis models by at least 10.7% in top-1 accuracy. The experiments demonstrate the superiority of exploiting context information in the retrosynthetic planning task. Moreover, the proposed model can be directly used for synthetic accessibility analysis, as it is trained on reaction trees with the shortest depths. Our work is the first step towards a brand new formulation for retrosynthetic planning in the aspects of data construction, model design, and evaluation. Code is available at https://github.com/SongtaoLiu0823/metro.

2022-01-01

arXiv.org (preprint)

doi.org

openreview.net
