Devon Hjelm

The Design Space of Tri-Modal Masked Diffusion Models

Louis Bethune

Victor Turrisi

Bruno Kacper Mlodozeniec

Pau Rodriguez Lopez

Lokesh Boominathan

Nikhil Bhendawade

Amitis Shidani

Joris Pelemans

Theo X. Olausson

Devon Hjelm

Paul Dixon

Joao Monteiro

Pierre Ablin

Vishnu Banna

Arno Blaas

Nick Henderson

Kari Noriy

Dan Busbridge

Josh Susskind

Marco Cuturi

Irina Belousova

Luca Zappella

Russ Webb

Jason Ramapuram

Discrete diffusion models have emerged as strong alternatives to autoregressive language models, with recent work initializing and fine-tuning a base unimodal model for bimodal generation. Diverging from previous approaches, we introduce the first tri-modal masked diffusion model pretrained from scratch on text, image-text, and audio-text data. We systematically analyze multimodal scaling laws, modality mixing ratios, noise schedules, and batch-size effects, and we provide optimized inference sampling defaults. Our batch-size analysis yields a novel stochastic differential equation (SDE)-based reparameterization that eliminates the need for tuning the optimal batch size as reported in recent work. This reparameterization decouples the physical batch size, often chosen based on compute constraints (GPU saturation, FLOP efficiency, wall-clock time), from the logical batch size, chosen to balance gradient variance during stochastic optimization. Finally, we pretrain a preliminary 3B-parameter tri-modal model on 6.4T tokens, demonstrating the capabilities of a unified design and achieving strong results in text generation, text-to-image tasks, and text-to-speech tasks. Our work represents the largest-scale systematic open study of multimodal discrete diffusion models conducted to date, providing insights into scaling behaviors across multiple modalities.

2025-12-31

arXiv (preprint)

doi.org

arxiv.org

Pretraining Reward-Free Representations for Data-Efficient Reinforcement Learning

Philip Bachman

2021-03-08

International Conference on Learning Representations (unknown)

openreview.net

Pretraining Representations for Data-Efficient Reinforcement Learning

Philip Bachman

Data efficiency is a key challenge for deep reinforcement learning. We address this problem by using unlabeled data to pretrain an encoder which is then finetuned on a small amount of task-specific data. To encourage learning representations which capture diverse aspects of the underlying MDP, we employ a combination of latent dynamics modelling and unsupervised goal-conditioned RL. When limited to 100k steps of interaction on Atari games (equivalent to two hours of human experience), our approach significantly surpasses prior work combining offline representation pretraining with task-specific finetuning, and compares favourably with other pretraining methods that require orders of magnitude more data. Our approach shows particular promise when combined with larger models as well as more diverse, task-aligned observational data -- approaching human-level performance and data-efficiency on Atari in our best setting. We provide code associated with this work at https://github.com/mila-iqia/SGI.

2020-12-31

Advances in Neural Information Processing Systems 34 (NeurIPS 2021) (published)

doi.org

openreview.net

Test Sample Accuracy Scales with Training Sample Density in Neural Networks

Xu Ji

Razvan Pascanu

Devon Hjelm

Andrea Vedaldi

Balaji Lakshminarayanan

Yoshua Bengio

Intuitively, one would expect accuracy of a trained neural network's prediction on test samples to correlate with how densely the samples are surrounded by seen training samples in representation space. We find that a bound on empirical training error smoothed across linear activation regions scales inversely with training sample density in representation space. Empirically, we verify this bound is a strong predictor of the inaccuracy of the network's prediction on test samples. For unseen test sets, including those with out-of-distribution samples, ranking test samples by their local region's error bound and discarding samples with the highest bounds raises prediction accuracy by up to 20% in absolute terms for image classification datasets, on average over thresholds.

2020-12-31

arXiv.org (preprint)

doi.org

openreview.net

Tell, Draw, and Repeat: Generating and Modifying Images Based on Continual Linguistic Instruction

Alaaeldin El-Nouby

Shikhar Sharma

Hannes Schulz

Devon Hjelm

Layla El Asri

Samira Ebrahimi Kahou

Yoshua Bengio

Graham W. Taylor

Conditional text-to-image generation is an active area of research, with many possible applications. Existing research has primarily focused on generating a single image from available conditioning information in one step. One practical extension beyond one-step generation is a system that generates an image iteratively, conditioned on ongoing linguistic input or feedback. This is significantly more challenging than one-step generation tasks, as such a system must understand the contents of its generated images with respect to the feedback history, the current feedback, as well as the interactions among concepts present in the feedback history. In this work, we present a recurrent image generation model which takes into account both the generated output up to the current step as well as all past instructions for generation. We show that our model is able to generate the background, add new objects, and apply simple transformations to existing objects. We believe our approach is an important step toward interactive generation. Code and data is available at: https://www.microsoft.com/en-us/research/project/generative-neural-visual-artist-geneva/ .

2019-11-01

2019 IEEE/CVF International Conference on Computer Vision (ICCV) (published)

doi.org

arxiv.org

Learning Generative Models with Locally Disentangled Latent Factors

Brady Neal