Aaron Courville

Reza Bayat

PhD - Université de Montréal

Co-supervisor :

Pascal Vincent

Anirudh Buvanesh

PhD - Université de Montréal

Principal supervisor :

Laurent Charlin

anirudb1102@gmail.com

Razvan Ciuca

Master's Research - Université de Montréal

Alexandre Diz Ganito

Master's Research - Université de Montréal

Juan Duque

PhD - Université de Montréal

Sarvjeet Ghotra

PhD - Université de Montréal

Arian Hosseini

PhD - Université de Montréal

Uday Kapur

Professional Master's - Université de Montréal

Amr Khalifa

PhD - Université de Montréal

andrei.nicolicioiu@gmail.com

Samuel Lavoie

PhD - Université de Montréal

Zhixuan Lin

PhD - Université de Montréal

PhD - Université de Montréal

Principal supervisor :

PhD - Université de Montréal

PhD - Université de Montréal

Co-supervisor :

Rishabh Agarwal

Andrei Nicolicioiu

PhD - Université de Montréal

Evgenii Nikishin

PhD - Université de Montréal

Principal supervisor :

PhD - Université de Montréal

Co-supervisor :

Johan Samir Obando Ceron

PhD - Université de Montréal

Co-supervisor :

Master's Research - Université de Montréal

pichedereck@gmail.com

Esra'a Saleh

PhD - Université de Montréal

Principal supervisor :

Master's Research - Université de Montréal

Principal supervisor :

Anna (Cheng-Zhi) Huang

Shawn Tan

PhD - Université de Montréal

PhD - Université de Montréal

Principal supervisor :

(Rex) Devon Hjelm

Ankit Vani

PhD - Université de Montréal

Yusong Wu

PhD - Université de Montréal

Principal supervisor :

Anna (Cheng-Zhi) Huang

Xiaofeng Zhang

PhD - Université de Montréal

Dinghuai Zhang

PhD - Université de Montréal

Co-supervisor :

Yoshua Bengio

Hattie Zhou

PhD - Université de Montréal

Principal supervisor :

Hugo Larochelle

Publications

Sparse Universal Transformer

Shawn Tan

Yikang Shen

Zhenfang Chen

Chuang Gan

The Universal Transformer (UT) is a variant of the Transformer that shares parameters across its layers and is Turing-complete under certain assumptions. Empirical evidence also shows that UTs have better compositional generalization than Vanilla Transformers (VTs) in formal language tasks. The parameter sharing also gives UTs better parameter efficiency than VTs. Despite these advantages, most state-of-the-art NLP systems use VTs as their backbone model instead of UTs. This is mainly because scaling up UT parameters is more compute- and memory-intensive than scaling up a VT. This paper proposes the Sparse Universal Transformer (SUT), which leverages Sparse Mixture of Experts (SMoE) to reduce the UT's computational complexity while retaining its parameter efficiency and generalization ability. Experiments show that SUT combines the best of both worlds, achieving strong generalization results on formal language tasks (logical inference and CFQ) and impressive parameter and computation efficiency on standard natural language benchmarks like WMT'14.

2023-10-07

EMNLP/2023/Conference (accepted)

Diffusion Generative Flow Samplers: Improving learning signals through partial trajectory optimization

Dinghuai Zhang

Ricky T. Q. Chen

Cheng-Hao Liu

Yoshua Bengio

We tackle the problem of sampling from intractable high-dimensional density functions, a fundamental task that often appears in machine learning and statistics. We extend recent sampling-based approaches that leverage controlled stochastic processes to model approximate samples from these target densities. The main drawback of these approaches is that the training objective requires full trajectories to compute, resulting in sluggish credit assignment due to the use of entire trajectories and a learning signal present only at the terminal time. In this work, we present Diffusion Generative Flow Samplers (DGFS), a sampling-based framework where the learning process can be tractably broken down into short partial trajectory segments by parameterizing an additional "flow function". Our method takes inspiration from the theory developed for generative flow networks (GFlowNets), allowing us to make use of intermediate learning signals. Through various challenging experiments, we demonstrate that DGFS achieves more accurate estimates of the normalization constant than closely related prior methods.

2023-10-04

ArXiv (preprint)

arxiv.org

Double Gumbel Q-Learning

David Yu-Tung Hui

Pierre-Luc Bacon

Group Robust Classification Without Any Group Information

Christos Tsirigotis

Joao Monteiro

Pau Rodriguez

David Vazquez

Improving Compositional Generalization using Iterated Learning and Simplicial Embeddings

Yi Ren

Samuel Lavoie

Mikhail Galkin

Danica J. Sutherland

Language Model Alignment with Elastic Reset

Michael Noukhovitch

Samuel Lavoie

Florian Strub

Fine-tuning language models with reinforcement learning (RL), e.g. from human feedback (HF), is a prominent method for alignment. But optimizing against a reward model can improve on reward while degrading performance in other areas, a phenomenon known as reward hacking, alignment tax, or language drift. First, we argue that commonly used test metrics are insufficient and instead measure how different algorithms trade off between reward and drift. The standard method modifies the reward with a Kullback-Leibler (KL) penalty between the online and initial model. We propose Elastic Reset, a new algorithm that achieves higher reward with less drift without explicitly modifying the training objective. We periodically reset the online model to an exponential moving average (EMA) of itself, then reset the EMA model to the initial model. Through the use of an EMA, our model recovers quickly after resets and achieves higher reward with less drift in the same number of steps. We demonstrate that fine-tuning language models with Elastic Reset leads to state-of-the-art performance on a small-scale pivot-translation benchmark, outperforms all baselines in a medium-scale RLHF-like IMDB mock sentiment task, and leads to a more performant and more aligned technical QA chatbot with LLaMA-7B. Code available at github.com/mnoukhov/elastic-reset.
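The reset schedule described in this abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the repository linked above for that); it assumes parameters are a plain list of floats, and the names `ema_update`, `elastic_reset_training`, `train_step`, `reset_every`, and `decay` are hypothetical, chosen only for illustration.

```python
from typing import Callable, List

Params = List[float]  # assumption: model parameters flattened to a list of floats


def ema_update(ema: Params, online: Params, decay: float) -> Params:
    """Move the EMA parameters one step toward the online parameters."""
    return [decay * e + (1.0 - decay) * o for e, o in zip(ema, online)]


def elastic_reset_training(
    init: Params,
    train_step: Callable[[Params], Params],  # hypothetical: one RL update
    num_steps: int,
    reset_every: int,  # hypothetical reset interval, in training steps
    decay: float = 0.5,  # illustrative EMA decay, not a value from the paper
) -> Params:
    """Train with periodic resets, following the schedule in the abstract."""
    online = list(init)
    ema = list(init)
    for step in range(1, num_steps + 1):
        online = train_step(online)            # ordinary training update
        ema = ema_update(ema, online, decay)   # EMA tracks the online model
        if step % reset_every == 0:
            online = list(ema)   # reset the online model to its EMA
            ema = list(init)     # then reset the EMA to the initial model
    return online
```

In this sketch the training objective itself is untouched; drift is limited only through the resets, which matches the abstract's description of the method at a high level.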

Let the Flows Tell: Solving Graph Combinatorial Problems with GFlowNets

Dinghuai Zhang

Hanjun Dai

Nikolay Malkin

Yoshua Bengio

Ling Pan

Versatile Energy-Based Probabilistic Models for High Energy Physics

Taoli Cheng

Discovering the Electron Beam Induced Transition Rates for Silicon Dopants in Graphene with Deep Neural Networks in the STEM

Kevin M Roccapriore

Max Schwarzer

Joshua Greaves

Jesse Farebrother

Rishabh Agarwal

Colton Bishop

Maxim Ziatdinov

Igor Mordatch

Ekin Dogus Cubuk

Pablo Samuel Castro

Marc Gendron-Bellemare

Sergei V Kalinin

2023-07-22

Microscopy and Microanalysis (published)

Meta-Value Learning: a General Framework for Learning with Learning Awareness

Tim Cooijmans

Milad Aghajohari

2023-07-17

ArXiv (preprint)

Bigger, Better, Faster: Human-level Atari with human-level efficiency

Max Schwarzer

Johan Samir Obando Ceron

Marc Gendron-Bellemare

Rishabh Agarwal

Pablo Samuel Castro

We introduce a value-based RL agent, which we call BBF, that achieves super-human performance in the Atari 100K benchmark. BBF relies on scaling the neural networks used for value estimation, as well as a number of other design choices that enable this scaling in a sample-efficient manner. We conduct extensive analyses of these design choices and provide insights for future work. We end with a discussion about updating the goalposts for sample-efficient RL research on the ALE. We make our code and data publicly available at https://github.com/google-research/google-research/tree/master/bigger_better_faster.

2023-07-03

Proceedings of the 40th International Conference on Machine Learning (published)

Learning with Learning Awareness using Meta-Values

Tim Cooijmans

Milad Aghajohari

2023-06-19

ICML.cc/2023/Workshop/Frontiers4LCD (published)