Samuel Lavoie

Simplicial Embeddings Improve Sample Efficiency in Actor-Critic Agents

Johan Samir Obando Ceron

Samuel Lavoie

Recent works have proposed accelerating the wall-clock training time of actor-critic methods via the use of large-scale environment parallelization; unfortunately, these can sometimes still require a large number of environment interactions to achieve a desired level of performance. Noting that well-structured representations can improve the generalization and sample efficiency of deep reinforcement learning (RL) agents, we propose the use of simplicial embeddings: lightweight representation layers that constrain embeddings to simplicial structures. This geometric inductive bias results in sparse and discrete features that stabilize critic bootstrapping and strengthen policy gradients. When applied to FastTD3, FastSAC, and PPO, simplicial embeddings consistently improve sample efficiency and final performance across a variety of continuous- and discrete-control environments, without any loss in runtime speed.

2025-10-15

ArXiv (preprint)

Compositional Discrete Latent Code for High Fidelity, Productive Diffusion Models

Samuel Lavoie

Michael Noukhovitch

We argue that diffusion models' success in modeling complex distributions is, for the most part, coming from their input conditioning. This paper investigates the representation used to condition diffusion models from the perspective that ideal representations should improve sample fidelity, be easy to generate, and be compositional to allow out-of-training sample generation. We introduce Discrete Latent Code (DLC), an image representation derived from Simplicial Embeddings trained with a self-supervised learning objective. DLCs are sequences of discrete tokens, as opposed to the standard continuous image embeddings. They are easy to generate and their compositionality enables sampling of novel images beyond the training distribution. Diffusion models trained with DLCs have improved generation fidelity, establishing a new state-of-the-art for unconditional image generation on ImageNet. Additionally, we show that composing DLCs allows the image generator to produce out-of-distribution samples that coherently combine the semantics of images in diverse ways. Finally, we showcase how DLCs can enable text-to-image generation by leveraging large-scale pretrained language models. We efficiently finetune a text diffusion language model to generate DLCs that produce novel samples outside of the image generator training distribution.

2025-07-16

ArXiv (preprint)

Compositional Discrete Latent Code for High Fidelity, Productive Diffusion Models

Samuel Lavoie

Michael Noukhovitch

We argue that diffusion models' success in modeling complex distributions is, for the most part, coming from their input conditioning. This paper investigates the representation used to condition diffusion models from the perspective that ideal representations should improve sample fidelity, be easy to generate, and be compositional to allow out-of-training sample generation. We introduce Discrete Latent Code (DLC), an image representation derived from Simplicial Embeddings trained with a self-supervised learning objective. DLCs are sequences of discrete tokens, as opposed to the standard continuous image embeddings. They are easy to generate and their compositionality enables sampling of novel images beyond the training distribution. Diffusion models trained with DLCs have improved generation fidelity, establishing a new state-of-the-art for unconditional image generation on ImageNet. Additionally, we show that composing DLCs allows the image generator to produce out-of-distribution samples that coherently combine the semantics of images in diverse ways. Finally, we showcase how DLCs can enable text-to-image generation by leveraging large-scale pretrained language models. We efficiently finetune a text diffusion language model to generate DLCs that produce novel samples outside of the image generator training distribution.

2025-07-01

arXiv (published)

Modeling Caption Diversity in Contrastive Vision-Language Pretraining

Samuel Lavoie

Polina Kirichenko

Mark Ibrahim

Mahmoud Assran

Andrew Gordon Wilson

Nicolas Ballas

There are a thousand ways to caption an image. Contrastive Language-Image Pretraining (CLIP), on the other hand, works by mapping an image and its caption to a single vector, limiting how well CLIP-like models can represent the diverse ways to describe an image. In this work, we introduce Llip, Latent Language Image Pretraining, which models the diversity of captions that could match an image. Llip's vision encoder outputs a set of visual features that are mixed into a final representation by conditioning on information derived from the text. We show that Llip outperforms non-contextualized baselines like CLIP and SigLIP on a variety of tasks even with large-scale encoders. Llip improves zero-shot classification by an average of 2.9% on zero-shot classification benchmarks with a ViT-G/14 encoder. Specifically, Llip attains a zero-shot top-1 accuracy of 83.5% on ImageNet, outperforming a similarly sized CLIP by 1.4%. We also demonstrate improvement on zero-shot retrieval on MS-COCO by 6.0%. We provide a comprehensive analysis of the components introduced by the method and demonstrate that Llip leads to richer visual representations.

2024-07-08

Proceedings of the 41st International Conference on Machine Learning (published)

openreview.net

An Introduction to Vision-Language Modeling

Richard Yuanzhe Pang

Anurag Ajay

Alexander C. Li

Adrien Bardes

Suzanne Petryk

Zhiqiu Lin

Anas Mahmoud

Bargav Jayaraman

Mark Ibrahim

Melissa Hall

Yunyang Xiong

Candace Ross

Srihari Jayakumar

Chuan Guo

Diane Bouchacourt

Haider Al-Tahan

Karthik Padthe

Vasu Sharma

Huijuan Xu

Xiaoqing Ellen Tan

Megan Richards

Samuel Lavoie

Pietro Astolfi

Jun Chen

Kushal Tirumala

Mazda Moayeri

Arjang Talattof

Kamalika Chaudhuri

Zechun Liu

Xilun Chen

Quentin Garrido

Karen Ullrich

Kate Saenko

Asli Celikyilmaz

Vikas Chandra

Following the recent popularity of Large Language Models (LLMs), several attempts have been made to extend them to the visual domain. From having a visual assistant that could guide us through unfamiliar environments to generative models that produce images using only a high-level text description, vision-language model (VLM) applications will significantly impact our relationship with technology. However, there are many challenges that need to be addressed to improve the reliability of those models. While language is discrete, vision evolves in a much higher-dimensional space in which concepts cannot always be easily discretized. To better understand the mechanics behind mapping vision to language, we present this introduction to VLMs, which we hope will help anyone who would like to enter the field. First, we introduce what VLMs are, how they work, and how to train them. Then, we present and discuss approaches to evaluate VLMs. Although this work primarily focuses on mapping images to language, we also discuss extending VLMs to videos.

2024-05-27

ArXiv (preprint)
