Publications

Graph Inductive Biases in Transformers without Message Passing

Liheng Ma

Chen Lin

Derek Lim

Adriana Romero Soriano

Puneet K. Dokania

Mark Coates

Philip Torr

Ser-Nam Lim

Transformers for graph data are increasingly widely studied and successful in numerous learning tasks. Graph inductive biases are crucial fo… (see more)r Graph Transformers, and previous works incorporate them using message-passing modules and/or positional encodings. However, Graph Transformers that use message-passing inherit known issues of message-passing, and differ significantly from Transformers used in other domains, thus making transfer of research advances more difficult. On the other hand, Graph Transformers without message-passing often perform poorly on smaller datasets, where inductive biases are more crucial. To bridge this gap, we propose the Graph Inductive bias Transformer (GRIT) -- a new Graph Transformer that incorporates graph inductive biases without using message passing. GRIT is based on several architectural changes that are each theoretically and empirically justified, including: learned relative positional encodings initialized with random walk probabilities, a flexible attention mechanism that updates node and node-pair representations, and injection of degree information in each layer. We prove that GRIT is expressive -- it can express shortest path distances and various graph propagation matrices. GRIT achieves state-of-the-art empirical performance across a variety of graph datasets, thus showing the power that Graph Transformers without message-passing can deliver.

2023-01-01

ICML (published)

doi.org

openreview.net

Graph Inductive Biases in Transformers without Message Passing

Liheng Ma

Chen Lin

Derek Lim

Adriana Romero Soriano

Puneet K. Dokania

Mark Coates

Philip Torr

Ser-Nam Lim

Transformers for graph data are increasingly widely studied and successful in numerous learning tasks. Graph inductive biases are crucial fo… (see more)r Graph Transformers, and previous works incorporate them using message-passing modules and/or positional encodings. However, Graph Transformers that use message-passing inherit known issues of message-passing, and differ significantly from Transformers used in other domains, thus making transfer of research advances more difficult. On the other hand, Graph Transformers without message-passing often perform poorly on smaller datasets, where inductive biases are more crucial. To bridge this gap, we propose the Graph Inductive bias Transformer (GRIT) — a new Graph Transformer that incorporates graph inductive biases without using message passing. GRIT is based on several architectural changes that are each theoretically and empirically justified, including: learned relative positional encodings initialized with random walk probabilities, a flexible attention mechanism that updates node and node-pair representations, and injection of degree information in each layer. We prove that GRIT is expressive — it can express shortest path distances and various graph propagation matrices. GRIT achieves state-of-the-art empirical performance across a variety of graph datasets, thus showing the power that Graph Transformers without message-passing can deliver.

2023-01-01

ICML (published)

doi.org

openreview.net

Graphically Structured Diffusion Models

Christian Dietrich Weilbach

William Harvey

Frank N. Wood

We introduce a framework for automatically defining and learning deep generative models with problem-specific structure. We tackle problem d… (see more)omains that are more traditionally solved by algorithms such as sorting, constraint satisfaction for Sudoku, and matrix factorization. Concretely, we train diffusion models with an architecture tailored to the problem specification. This problem specification should contain a graphical model describing relationships between variables, and often benefits from explicit representation of subcomputations. Permutation invariances can also be exploited. Across a diverse set of experiments we improve the scaling relationship between problem dimension and our model's performance, in terms of both training time and final accuracy.

2023-01-01

ICML (published)

doi.org

openreview.net

GROOD: GRadient-aware Out-Of-Distribution detection in interpolated manifolds

Mostafa ElAraby

Sabyasachi Sahoo

Yann Batiste Pequignot

Paul Novello

Liam Paull

2023-01-01

arXiv.org (preprint)

doi.org

arxiv.org

A Group Symmetric Stochastic Differential Equation Model for Molecule Multi-modal Pretraining

Shengchao Liu

weitao Du

Zhi-Ming Ma

Hongyu Guo

Jian Tang

Molecule pretraining has quickly become the go-to schema to boost the performance of AI-based drug discovery. Naturally, molecules can be re… (see more)presented as 2D topological graphs or 3D geometric point clouds. Although most existing pertaining methods focus on merely the single modality, recent research has shown that maximizing the mutual information (MI) between such two modalities enhances the molecule representation ability. Meanwhile, existing molecule multi-modal pretraining approaches approximate MI based on the representation space encoded from the topology and geometry, thus resulting in the loss of critical structural information of molecules. To address this issue, we propose MoleculeSDE. MoleculeSDE leverages group symmetric (e.g., SE(3)-equivariant and reflection-antisymmetric) stochastic differential equation models to generate the 3D geometries from 2D topologies, and vice versa, directly in the input space. It not only obtains tighter MI bound but also enables prosperous downstream tasks than the previous work. By comparing with 17 pretraining baselines, we empirically verify that MoleculeSDE can learn an expressive representation with state-of-the-art performance on 26 out of 32 downstream tasks.

2023-01-01

ICML (published)

doi.org

openreview.net

Guessing Random Additive Noise Decoding

Syed Mohsin Abbas

Marwan Jalaleddine

Warren Gross

2023-01-01

(published)

doi.org

Guiding Language Model Math Reasoning with Planning Tokens

Xinyi Wang

Lucas Caccia

Oleksiy Ostapenko

Xingdi Yuan

William Yang Wang

Alessandro Sordoni

Large language models (LLMs) have recently attracted considerable interest for their ability to perform complex reasoning tasks, such as cha… (see more)in-of-thought reasoning. However, most of the existing approaches to enhance this ability rely heavily on data-driven methods, while neglecting the structural aspects of the model's reasoning capacity. We find that while LLMs can manage individual reasoning steps well, they struggle with maintaining consistency across an entire reasoning chain. To solve this, we introduce planning tokens at the start of each reasoning step, serving as a guide for the model, and add their embeddings to the model parameters. Our approach requires a negligible increase in trainable parameters (just 0.001%) and can be applied through either full fine-tuning or a more parameter-efficient scheme. We demonstrate our method's effectiveness by applying it to three different LLMs, showing notable accuracy improvements across three math word problem datasets w.r.t. standard fine-tuning baselines.

2023-01-01

arXiv.org (preprint)

doi.org

arxiv.org

GUILGET: GUI Layout GEneration with Transformer

Andrey Sobolevsky

Guillaume-Alexandre Bilodeau

Jinghui Cheng

Jin Guo

2023-01-01

Canadian AI (published)

doi.org

arxiv.org

Guillotine Regularization: Why removing layers is needed to improve generalization in Self-Supervised Learning

Florian Bordes

Randall Balestriero

Quentin Garrido

Adrien Bardes

Pascal Vincent

One unexpected technique that emerged in recent years consists in training a Deep Network (DN) with a Self-Supervised Learning (SSL) method,… (see more) and using this network on downstream tasks but with its last few projector layers entirely removed. This trick of throwing away the projector is actually critical for SSL methods to display competitive performances on ImageNet for which more than 30 percentage points can be gained that way. This is a little vexing, as one would hope that the network layer at which invariance is explicitly enforced by the SSL criterion during training (the last projector layer) should be the one to use for best generalization performance downstream. But it seems not to be, and this study sheds some light on why. This trick, which we name Guillotine Regularization (GR), is in fact a generically applicable method that has been used to improve generalization performance in transfer learning scenarios. In this work, we identify the underlying reasons behind its success and show that the optimal layer to use might change significantly depending on the training setup, the data or the downstream task. Lastly, we give some insights on how to reduce the need for a projector in SSL by aligning the pretext SSL task and the downstream task.

2023-01-01

Trans. Mach. Learn. Res. (published)

openreview.net

A Heat Diffusion Perspective on Geodesic Preserving Dimensionality Reduction

Guillaume Huguet

Alexander Tong

Edward De Brouwer

Yanlei Zhang

Guy Wolf

Ian Adelstein

Smita Krishnaswamy

openreview.net

Hierarchical Distributed Energy Management Framework for Multiple Greenhouses Considering Demand Response

Ehsan Rezaei

Hanane Dagdougui

Kianoosh Ojand

Greenhouses are a key component of modernised agriculture, aiming for producing high-quality crops and plants. Furthermore, a network of gre… (see more)enhouses has enormous potential as part of demand response programs. Saving energy during off-peak time, reducing power consumption and delaying the start time of subsystems during on-peak time are some strategies that can be used to limit power exchanged with the main grid. In this work, a hierarchical distributed alternating direction method of multipliers-based model predictive control framework is proposed that has two main objectives: 1) providing appropriate conditions for greenhouses' crops and plants to grow, and 2) limiting the total power exchanged with the main grid. At each time step in the framework, an aggregator coordinates the greenhouses to reach a consensus and limit the total electric power exchanged while managing shared resources, e.g., reservoir water. The proposed framework's performance is investigated through a case study.

2023-01-01

IEEE Transactions on Sustainable Energy (published)

doi.org

How can intelligent systems revolutionise healthcare?

Narges Armanfard

2023-01-01

Futurum Careers (published)

doi.org

Leveraging AI for a Sustainable Future

Mila AI Policy Fellowship

Hugo Larochelle appointed Scientific Director of Mila

Publications

Leveraging AI for a Sustainable Future

Mila AI Policy Fellowship

Hugo Larochelle appointed Scientific Director of Mila

Popular keywords:

Publications