Reyhane Askari Hemmat

An Introduction to Vision-Language Modeling

Florian Bordes

Richard Yuanzhe Pang

Anurag Ajay

Alexander C. Li

Adrien Bardes

Suzanne Petryk

Oscar Mañas

Zhiqiu Lin

Anas Mahmoud

Bargav Jayaraman

Mark Ibrahim

Melissa Hall

Yunyang Xiong

Jonathan Lebensold

Candace Ross

Srihari Jayakumar

Chuan Guo

Diane Bouchacourt

Haider Al-Tahan

Karthik Padthe

Vasu Sharma

Huijuan Xu

Xiaoqing Ellen Tan

Megan Richards

Samuel Lavoie

Pietro Astolfi

Reyhane Askari Hemmat

Jun Chen

Kushal Tirumala

Rim Assouel

Mazda Moayeri

Arjang Talattof

Kamalika Chaudhuri

Zechun Liu

Xilun Chen

Quentin Garrido

Karen Ullrich

Aishwarya Agrawal

Kate Saenko

Asli Celikyilmaz

Vikas Chandra

Following the recent popularity of Large Language Models (LLMs), several attempts have been made to extend them to the visual domain. From having a visual assistant that could guide us through unfamiliar environments to generative models that produce images using only a high-level text description, the vision-language model (VLM) applications will significantly impact our relationship with technology. However, there are many challenges that need to be addressed to improve the reliability of those models. While language is discrete, vision evolves in a much higher dimensional space in which concepts cannot always be easily discretized. To better understand the mechanics behind mapping vision to language, we present this introduction to VLMs which we hope will help anyone who would like to enter the field. First, we introduce what VLMs are, how they work, and how to train them. Then, we present and discuss approaches to evaluate VLMs. Although this work primarily focuses on mapping images to language, we also discuss extending VLMs to videos.

2024-05-27

ArXiv (preprint)

doi.org

arxiv.org

Feedback-guided Data Synthesis for Imbalanced Classification

Reyhane Askari Hemmat

Mohammad Pezeshki

Florian Bordes

Michal Drozdzal

Adriana Romero Soriano

The current status quo in machine learning is to use static datasets of real images for training, which often come from long-tailed distributions. With the recent advances in generative models, researchers have started augmenting these static datasets with synthetic data, reporting moderate performance improvements on classification tasks. We hypothesize that these performance gains are limited by the lack of feedback from the classifier to the generative model, which would promote the usefulness of the generated samples to improve the classifier's performance. In this work, we introduce a framework for augmenting static datasets with useful synthetic samples, which leverages one-shot feedback from the classifier to drive the sampling of the generative model. In order for the framework to be effective, we find that the samples must be close to the support of the real data of the task at hand, and be sufficiently diverse. We validate three feedback criteria on a long-tailed dataset (ImageNet-LT) as well as a group-imbalanced dataset (NICO++). On ImageNet-LT, we achieve state-of-the-art results, with over 4 percent improvement on underrepresented classes while being twice as efficient in terms of the number of generated synthetic samples. NICO++ also enjoys marked boosts of over 5 percent in worst group accuracy. With these results, our framework paves the path towards effectively leveraging state-of-the-art text-to-image models as data sources that can be queried to improve downstream applications.

2023-10-30

NeurIPS.cc/2023/Workshop/SyntheticData4ML (poster)

doi.org

openreview.net

LEAD: Min-Max Optimization from a Physical Perspective

Reyhane Askari Hemmat

Amartya Mitra

Guillaume Lajoie

Ioannis Mitliagkas

Adversarial formulations have rekindled interest in two-player min-max games. A central obstacle in the optimization of such games is the rotational dynamics that hinder their convergence. In this paper, we show that game optimization shares dynamic properties with particle systems subject to multiple forces, and one can leverage tools from physics to improve optimization dynamics. Inspired by the physical framework, we propose LEAD, an optimizer for min-max games. Next, using Lyapunov stability theory from dynamical systems as well as spectral analysis, we study LEAD’s convergence properties in continuous and discrete time settings for a class of quadratic min-max games to demonstrate linear convergence to the Nash equilibrium. Finally, we empirically evaluate our method on synthetic setups and CIFAR-10 image generation to demonstrate improvements in GAN training.

2023-01-01

Trans. Mach. Learn. Res. (published)

openreview.net

Negative Momentum for Improved Game Dynamics

Gauthier Gidel

Reyhane Askari Hemmat

Mohammad Pezeshki

Gabriel Huang

Rémi LE PRIOL

Simon Lacoste-Julien

Ioannis Mitliagkas

Games generalize the single-objective optimization paradigm by introducing different objective functions for different players. Differentiable games often proceed by simultaneous or alternating gradient updates. In machine learning, games are gaining new importance through formulations like generative adversarial networks (GANs) and actor-critic systems. However, compared to single-objective optimization, game dynamics are more complex and less understood. In this paper, we analyze gradient-based methods with momentum on simple games. We prove that alternating updates are more stable than simultaneous updates. Next, we show both theoretically and empirically that alternating gradient updates with a negative momentum term achieve convergence not only on a difficult toy adversarial problem but also on the notoriously difficult-to-train saturating GANs.

2019-04-11

Proceedings of the Twenty-Second International Conference on Artificial Intelligence and Statistics (published)

proceedings.mlr.press

arxiv.org

Negative Momentum for Improved Game Dynamics

Gauthier Gidel

Reyhane Askari Hemmat

Mohammad Pezeshki

Gabriel Huang

Rémi LE PRIOL

Simon Lacoste-Julien

Ioannis Mitliagkas

Games generalize the single-objective optimization paradigm by introducing different objective functions for different players. Differentiable games often proceed by simultaneous or alternating gradient updates. In machine learning, games are gaining new importance through formulations like generative adversarial networks (GANs) and actor-critic systems. However, compared to single-objective optimization, game dynamics are more complex and less understood. In this paper, we analyze gradient-based methods with momentum on simple games. We prove that alternating updates are more stable than simultaneous updates. Next, we show both theoretically and empirically that alternating gradient updates with a negative momentum term achieve convergence not only on a difficult toy adversarial problem but also on the notoriously difficult-to-train saturating GANs.

2018-07-12

ArXiv (preprint)

arxiv.org
