
Rim Assouel

PhD - Université de Montréal
Supervisor
Research Topics
Computer Vision
Reasoning
Representation Learning

Publications

Object-centric Binding in Contrastive Language-Image Pretraining
Pietro Astolfi
Michal Drozdzal
Recent advances in vision-language models (VLMs) have been driven by contrastive models such as CLIP, which learn to associate visual information with their corresponding text descriptions. However, these models have limitations in understanding complex compositional scenes involving multiple objects and their spatial relationships. To address these challenges, we propose a novel approach that diverges from commonly used strategies, which rely on the design of hard-negative augmentations. Instead, our work focuses on integrating inductive biases into pre-trained CLIP-like models to improve their compositional understanding without using any additional hard negatives. To that end, we introduce a binding module that connects a scene graph, derived from a text description, with a slot-structured image representation, facilitating a structured similarity assessment between the two modalities. We also leverage relationships as text-conditioned visual constraints, thereby capturing the intricate interactions between objects and their contextual relationships more effectively. Our resulting model not only enhances the performance of CLIP-based models in multi-object compositional understanding but also paves the way towards more accurate and sample-efficient image-text matching of complex scenes.
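To make the idea of a structured similarity concrete, here is a minimal sketch of scoring an image-text pair by softly binding each scene-graph node (object phrase) to an image slot. All tensor shapes, the cosine/softmax scoring, and the temperature are illustrative assumptions, not the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def structured_similarity(node_emb, slot_emb, temperature=0.07):
    """Score an image-text pair by softly assigning each scene-graph node
    (object phrase) to an image slot and averaging the matched similarities.

    node_emb: (num_nodes, dim)  text-side object embeddings (assumed given)
    slot_emb: (num_slots, dim)  slot-structured image representation (assumed given)
    """
    node_emb = F.normalize(node_emb, dim=-1)
    slot_emb = F.normalize(slot_emb, dim=-1)
    sim = node_emb @ slot_emb.t()                # (num_nodes, num_slots) cosine similarities
    attn = F.softmax(sim / temperature, dim=-1)  # soft assignment of each node to slots
    return (attn * sim).sum(dim=-1).mean()       # scene-level image-text score

# Illustrative usage with random embeddings standing in for encoder outputs.
nodes = torch.randn(3, 256)   # e.g. embeddings for "red cube", "blue ball", "table"
slots = torch.randn(7, 256)   # slots extracted from the image
score = structured_similarity(nodes, slots)
```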
Action abstractions for amortized sampling
Lena Nehale Ezzine
Joseph D Viviano
Michał Koziarski
Moksh J. Jain
The BrowserGym Ecosystem for Web Agent Research
Alexandre Lacoste
Massimo Caccia
Lawrence Keunho Jang
Ori Yoran
Dehan Kong
Frank F. Xu
Graham Neubig
Russ Salakhutdinov
The BrowserGym ecosystem addresses the growing need for efficient evaluation and benchmarking of web agents, particularly those leveraging automation and Large Language Models (LLMs) for web interaction tasks. Many existing benchmarks suffer from fragmentation and inconsistent evaluation methodologies, making it challenging to achieve reliable comparisons and reproducible results. BrowserGym aims to solve this by providing a unified, gym-like environment with well-defined observation and action spaces, facilitating standardized evaluation across diverse benchmarks. Combined with AgentLab, a complementary framework that aids in agent creation, testing, and analysis, BrowserGym offers flexibility for integrating new benchmarks while ensuring consistent evaluation and comprehensive experiment management. This standardized approach seeks to reduce the time and complexity of developing web agents, supporting more reliable comparisons and in-depth analysis of agent behaviors, and could result in more adaptable, capable agents, ultimately accelerating innovation in LLM-driven automation. As supporting evidence, we conduct the first large-scale, multi-benchmark web agent experiment and compare the performance of 6 state-of-the-art LLMs across all benchmarks currently available in BrowserGym. Among other findings, our results highlight a large discrepancy between OpenAI's and Anthropic's latest models, with Claude-3.5-Sonnet leading on almost all benchmarks, except on vision-related tasks where GPT-4o is superior. Despite these advancements, our results emphasize that building robust and efficient web agents remains a significant challenge, due to the inherent complexity of real-world web environments and the limitations of current models.
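A hedged sketch of the gym-style interaction loop that BrowserGym standardizes is shown below. The task id ("browsergym/miniwob.click-test"), the registration import, and the "noop()" action string are assumptions for illustration; consult the BrowserGym documentation for the exact task names and action set available in your installation:

```python
import gymnasium as gym
import browsergym.miniwob  # importing registers the MiniWoB tasks with gymnasium (assumed)

def trivial_policy(obs):
    """Hypothetical placeholder: a real agent would parse the DOM/AXTree or
    screenshot in `obs` (often with an LLM) and emit an action string."""
    return "noop()"

env = gym.make("browsergym/miniwob.click-test")
obs, info = env.reset()
for _ in range(10):  # a real run would loop until the task terminates
    action = trivial_policy(obs)
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        break
env.close()
```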
OC-CLIP : Object-centric binding in Contrastive Language-Image Pretraining
Pietro Astolfi
Michal Drozdzal
Recent advancements in vision-language models (VLMs) have been driven by contrastive models like CLIP, which learn to associate visual information with their corresponding text descriptions. However, these models have limitations in understanding complex compositional scenes involving multiple objects and their spatial relationships. To address these challenges, we propose a novel approach that diverges from traditional data-centric methods of enhancing model performance with hard negative examples. Our work instead focuses on integrating sufficient inductive biases into pre-trained CLIP-like models to improve their compositional understanding without using additional data annotations. We introduce a binding module that connects a scene graph of the text with an induced graph-like representation of the image, facilitating a structured similarity assessment. We also leverage relationships as text-conditioned visual constraints, thereby capturing the intricate interactions between objects and their contextual relationships more effectively. Our resulting model (OC-CLIP) not only enhances the performance of CLIP in multi-object compositional understanding but also paves the way for more accurate and efficient image-text matching in complex scenes.
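Complementing the binding sketch above, here is a minimal illustration of the other ingredient mentioned in the abstract: treating a relationship as a text-conditioned constraint over the slots bound to its subject and object. The MLP architecture and the concatenation scheme are assumptions, not OC-CLIP's exact design:

```python
import torch
import torch.nn as nn

class RelationScorer(nn.Module):
    """Illustrative sketch: score whether a textual relationship (e.g. "left of")
    is satisfied by the image slots bound to its subject and object."""
    def __init__(self, dim=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, rel_emb, subj_slot, obj_slot):
        # Condition the pairwise constraint on the relationship's text embedding.
        x = torch.cat([rel_emb, subj_slot, obj_slot], dim=-1)
        return self.mlp(x).squeeze(-1)  # higher score = relationship better satisfied

scorer = RelationScorer()
rel = torch.randn(256)                           # embedding of "left of" (assumed given)
subj, obj = torch.randn(256), torch.randn(256)   # slots bound to the two objects
constraint_score = scorer(rel, subj, obj)
```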
An Introduction to Vision-Language Modeling
Richard Yuanzhe Pang
Anurag Ajay
Alexander C. Li
Adrien Bardes
Suzanne Petryk
Zhiqiu Lin
Bargav Jayaraman
Mark Ibrahim
Melissa Hall
Yunyang Xiong
Candace Ross
Srihari Jayakumar
Chuan Guo
Diane Bouchacourt
Haider Al-Tahan
Karthik Padthe
Vasu Sharma
Huijuan Xu
Xiaoqing Ellen Tan
Megan Richards
Samuel Lavoie
Pietro Astolfi
Jun Chen
Kushal Tirumala
Mazda Moayeri
Arjang Talattof
Kamalika Chaudhuri
Zechun Liu
Xilun Chen
Quentin Garrido
Karen Ullrich
Kate Saenko
Asli Celikyilmaz
Vikas Chandra
Following the recent popularity of Large Language Models (LLMs), several attempts have been made to extend them to the visual domain. From having a visual assistant that could guide us through unfamiliar environments to generative models that produce images using only a high-level text description, vision-language model (VLM) applications will significantly impact our relationship with technology. However, there are many challenges that need to be addressed to improve the reliability of those models. While language is discrete, vision evolves in a much higher dimensional space in which concepts cannot always be easily discretized. To better understand the mechanics behind mapping vision to language, we present this introduction to VLMs which we hope will help anyone who would like to enter the field. First, we introduce what VLMs are, how they work, and how to train them. Then, we present and discuss approaches to evaluate VLMs. Although this work primarily focuses on mapping images to language, we also discuss extending VLMs to videos.
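One common starting point this introduction covers is the contrastive image-text objective popularized by CLIP. A minimal sketch of that symmetric loss follows; the batch size, embedding dimension, and temperature are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched image-text pairs.
    img_emb, txt_emb: (batch, dim) outputs of the image and text encoders."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature   # (batch, batch) similarity matrix
    targets = torch.arange(len(img_emb))           # i-th image matches i-th caption
    loss_i2t = F.cross_entropy(logits, targets)    # image-to-text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text-to-image direction
    return (loss_i2t + loss_t2i) / 2

# Illustrative usage with random embeddings standing in for encoder outputs.
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```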