Portrait de Petar Veličković n'est pas disponible

Petar Veličković

Alumni

Publications

Transformers meet Neural Algorithmic Reasoners
Wilfried Bounsi
Borja Ibarz
Andrew Joseph Dudzik
Jessica B. Hamrick
Larisa Markeeva
Alex Vitvitskyi
Transformers have revolutionized machine learning with their simple yet effective architecture. Pre-training Transformers on massive text da… (voir plus)tasets from the Internet has led to unmatched generalization for natural language understanding (NLU) tasks. However, such language models remain fragile when tasked with algorithmic forms of reasoning, where computations must be precise and robust. To address this limitation, we propose a novel approach that combines the Transformer's language understanding with the robustness of graph neural network (GNN)-based neural algorithmic reasoners (NARs). Such NARs proved effective as generic solvers for algorithmic tasks, when specified in graph form. To make their embeddings accessible to a Transformer, we propose a hybrid architecture with a two-phase training procedure, allowing the tokens in the language model to cross-attend to the node embeddings from the NAR. We evaluate our resulting TransNAR model on CLRS-Text, the text-based version of the CLRS-30 benchmark, and demonstrate significant gains over Transformer-only models for algorithmic reasoning, both in and out of distribution.
Transformers need glasses! Information over-squashing in language tasks
Federico Barbero
Andrea Banino
Steven Kapturowski
Dharshan Kumaran
João Guilherme Madeira Araújo
Alex Vitvitskyi
We study how information propagates in decoder-only Transformers, which are the architectural backbone of most existing frontier large langu… (voir plus)age models (LLMs). We rely on a theoretical signal propagation analysis -- specifically, we analyse the representations of the last token in the final layer of the Transformer, as this is the representation used for next-token prediction. Our analysis reveals a representational collapse phenomenon: we prove that certain distinct sequences of inputs to the Transformer can yield arbitrarily close representations in the final token. This effect is exacerbated by the low-precision floating-point formats frequently used in modern LLMs. As a result, the model is provably unable to respond to these sequences in different ways -- leading to errors in, e.g., tasks involving counting or copying. Further, we show that decoder-only Transformer language models can lose sensitivity to specific tokens in the input, which relates to the well-known phenomenon of over-squashing in graph neural networks. We provide empirical evidence supporting our claims on contemporary LLMs. Our theory also points to simple solutions towards ameliorating these issues.
Asynchronous Algorithmic Alignment with Cocycles
Andrew Joseph Dudzik
Tamara von Glehn
State-of-the-art neural algorithmic reasoners make use of message passing in graph neural networks (GNNs). But typical GNNs blur the distinc… (voir plus)tion between the definition and invocation of the message function, forcing a node to send messages to its neighbours at every layer, synchronously. When applying GNNs to learn to execute dynamic programming algorithms, however, on most steps only a handful of the nodes would have meaningful updates to send. One, hence, runs the risk of inefficiencies by sending too much irrelevant data across the graph. But more importantly, many intermediate GNN steps have to learn the identity functions, which is a non-trivial learning problem. In this work, we explicitly separate the concepts of node state update and message function invocation. With this separation, we obtain a mathematical formulation that allows us to reason about asynchronous computation in both algorithms and neural networks. Our analysis yields several practical implementations of synchronous scalable GNN layers that are provably invariant under various forms of asynchrony.
Latent Space Representations of Neural Algorithmic Reasoners
Vladimir V. Mirjani'c
Petar Velivckovi'c University of Cambridge
Google Deepmind
Neural Algorithmic Reasoning (NAR) is a research area focused on designing neural architectures that can reliably capture classical computat… (voir plus)ion, usually by learning to execute algorithms. A typical approach is to rely on Graph Neural Network (GNN) architectures, which encode inputs in high-dimensional latent spaces that are repeatedly transformed during the execution of the algorithm. In this work we perform a detailed analysis of the structure of the latent space induced by the GNN when executing algorithms. We identify two possible failure modes: (i) loss of resolution, making it hard to distinguish similar values; (ii) inability to deal with values outside the range observed during training. We propose to solve the first issue by relying on a softmax aggregator, and propose to decay the latent space in order to deal with out-of-range values. We show that these changes lead to improvements on the majority of algorithms in the standard CLRS-30 benchmark when using the state-of-the-art Triplet-GMPNN processor. Our code is available at https://github.com/mirjanic/nar-latent-spaces
softmax is not enough (for sharp out-of-distribution)
Christos Perivolaropoulos
Federico Barbero
A key property of reasoning systems is the ability to make sharp decisions on their input data. For contemporary AI systems, a key carrier o… (voir plus)f sharp behaviour is the softmax function, with its capability to perform differentiable query-key lookups. It is a common belief that the predictive power of networks leveraging softmax arises from "circuits" which sharply perform certain kinds of computations consistently across many diverse inputs. However, for these circuits to be robust, they would need to generalise well to arbitrary valid inputs. In this paper, we dispel this myth: even for tasks as simple as finding the maximum key, any learned circuitry must disperse as the number of items grows at test time. We attribute this to a fundamental limitation of the softmax function to robustly approximate sharp functions, prove this phenomenon theoretically, and propose adaptive temperature as an ad-hoc technique for improving the sharpness of softmax at inference time.
Finding Increasingly Large Extremal Graphs with AlphaZero and Tabu Search
Abbas Mehrabian
Hyunjik Kim
Nicolas Sonnerat
Matej Balog
Gheorghe Comanici
Tudor Berariu
Andrew Lee
Anian Ruoss
Anna Bulanova
Daniel Toyama
Sam Blackwell
Bernardino Romera Paredes
Laurent Orseau
Joonkyung Lee
Anurag Murty Naredla
Adam Zsolt Wagner
Scientific discovery in the age of artificial intelligence
Hanchen Wang
Tianfan Fu
Yuanqi Du
Wenhao Gao
Kexin Huang
Ziming Liu
Payal Chandak
Peter Van Katwyk
Andreea Deac
Animashree Anandkumar
K. Bergen
Carla P. Gomes
Shirley Ho
Pushmeet Kohli
Joan Lasenby
Jure Leskovec
Tie-Yan Liu
A. Manrai
Debora Susan Marks … (voir 10 de plus)
Bharath Ramsundar
Le Song
Jimeng Sun
Max Welling
Linfeng Zhang
Connor Wilson. Coley
Marinka Žitnik
Principal Neighbourhood Aggregation for Graph Nets
Gabriele Corso
Luca Cavalleri
Pietro Lio
Deep Graph Infomax
William Fedus
William L. Hamilton
Pietro Lio
We present Deep Graph Infomax (DGI), a general approach for learning node representations within graph-structured data in an unsupervised ma… (voir plus)nner. DGI relies on maximizing mutual information between patch representations and corresponding high-level summaries of graphs---both derived using established graph convolutional network architectures. The learnt patch representations summarize subgraphs centered around nodes of interest, and can thus be reused for downstream node-wise learning tasks. In contrast to most prior approaches to unsupervised learning with GCNs, DGI does not rely on random walk objectives, and is readily applicable to both transductive and inductive learning setups. We demonstrate competitive performance on a variety of node classification benchmarks, which at times even exceeds the performance of supervised learning.
Deep Graph Infomax
William Fedus
William L. Hamilton
Pietro Lio
We present Deep Graph Infomax (DGI), a general approach for learning node representations within graph-structured data in an unsupervised ma… (voir plus)nner. DGI relies on maximizing mutual information between patch representations and corresponding high-level summaries of graphs---both derived using established graph convolutional network architectures. The learnt patch representations summarize subgraphs centered around nodes of interest, and can thus be reused for downstream node-wise learning tasks. In contrast to most prior approaches to unsupervised learning with GCNs, DGI does not rely on random walk objectives, and is readily applicable to both transductive and inductive learning setups. We demonstrate competitive performance on a variety of node classification benchmarks, which at times even exceeds the performance of supervised learning.
Deep Graph Infomax
William Fedus
William L. Hamilton
Pietro Lio
Deep Graph Infomax
William Fedus
William L. Hamilton
Pietro Lio