Publications

PairBench: Are Vision-Language Models Reliable at Comparing What They See?

Aarash Feizi

Sai Rajeswar

Adriana Romero Soriano

Reihaneh Rabbany

Valentina Zantedeschi

Spandana Gella

Joao Monteiro

Understanding how effectively large vision language models (VLMs) compare visual inputs is crucial across numerous applications, yet this fundamental capability remains insufficiently assessed. While VLMs are increasingly deployed for tasks requiring comparative judgment, including automated evaluation, re-ranking, and retrieval-augmented generation, no systematic framework exists to measure their performance in these scenarios. We present PairBench, a simple framework that evaluates VLMs as customizable similarity tools using widely available image datasets. Our approach introduces four key metrics for reliable comparison: alignment with human annotations, consistency across pair ordering, distribution smoothness, and controllability through prompting. Our analysis reveals that no model consistently excels across all metrics, with each demonstrating distinct strengths and weaknesses. Most concerning is the widespread inability of VLMs to maintain symmetric similarity scores. Interestingly, we demonstrate that performance on our benchmark strongly correlates with popular benchmarks used for more complex tasks, while providing additional metrics into controllability, smoothness and ordering. This makes PairBench a unique and comprehensive framework to evaluate the performance of VLMs for automatic evaluation depending on the task.

2025-02-21

ArXiv (preprint)

arxiv.org

PairBench: A Systematic Framework for Selecting Reliable Judge VLMs

Aarash Feizi

Sai Rajeswar

Adriana Romero Soriano

Reihaneh Rabbany

Spandana Gella

Valentina Zantedeschi

Joao Monteiro

As large vision language models (VLMs) are increasingly used as automated evaluators, understanding their ability to effectively compare data pairs as instructed in the prompt becomes essential. To address this, we present PairBench, a low-cost framework that systematically evaluates VLMs as customizable similarity tools across various modalities and scenarios. Through PairBench, we introduce four metrics that represent key desiderata of similarity scores: alignment with human annotations, consistency for data pairs irrespective of their order, smoothness of similarity distributions, and controllability through prompting. Our analysis demonstrates that no model, whether closed- or open-source, is superior on all metrics; the optimal choice depends on an auto evaluator's desired behavior (e.g., a smooth vs. a sharp judge), highlighting risks of widespread adoption of VLMs as evaluators without thorough assessment. For instance, the majority of VLMs struggle with maintaining symmetric similarity scores regardless of order. Additionally, our results show that the performance of VLMs on the metrics in PairBench closely correlates with popular benchmarks, showcasing its predictive power in ranking models.

2025-02-21

ArXiv (preprint)

arxiv.org

Harnessing artificial intelligence to fill global shortfalls in biodiversity knowledge

Laura J. Pollock

Justin Kitzes

Sara Beery

Kaitlyn M. Gaynor

Marta A. Jarzyna

Oisin Mac Aodha

Bernd Meyer

David Rolnick

Graham W. Taylor

Devis Tuia

Tanya Berger-Wolf

2025-02-20

Nature Reviews Biodiversity (published)

doi.org

OBELiX: A Curated Dataset of Crystal Structures and Experimentally Measured Ionic Conductivities for Lithium Solid-State Electrolytes

Félix Therrien

Jamal Abou Haibeh

Divya Sharma

Rhiannon Hendley

Alex Hernandez-Garcia

Sun Sun

Alain Tchagang

Jiang Su

Samuel Huberman

Yoshua Bengio

Hongyu Guo

Homin Shin

Solid-state electrolyte batteries are expected to replace liquid electrolyte lithium-ion batteries in the near future thanks to their higher theoretical energy density and improved safety. However, their adoption is currently hindered by their lower effective ionic conductivity, a quantity that governs charge and discharge rates. Identifying highly ion-conductive materials using conventional theoretical calculations and experimental validation is both time-consuming and resource-intensive. While machine learning holds the promise to expedite this process, relevant ionic conductivity and structural data is scarce. Here, we present OBELiX, a domain-expert-curated database of

2025-02-20

ArXiv (preprint)

doi.org

arxiv.org

AILuminate: Introducing v1.0 of the AI Risk and Reliability Benchmark from MLCommons

Shaona Ghosh

Heather Frase

Adina Williams

Sarah Luger

Paul Röttger

Fazl Barez

Sean McGregor

Kenneth Fricklas

Mala Kumar

Quentin Feuillade--Montixi

Kurt Bollacker

Felix Friedrich

Ryan Tsang

Bertie Vidgen

Alicia Parrish

Chris Knotz

Eleonora Presani

Jonathan Bennion

Marisa Ferrara Boston

Mike Kuniavsky

Wiebke Hutiri

James Ezick

Malek Ben Salem

Rajat Sahay

Sujata Goswami

Usman Gohar

Ben Huang

Supheakmungkol Sarin

Elie Alhajjar

Canyu Chen

Roman Eng

K. Manjusha

Virendra Mehta

Eileen Peters Long

Murali Krishna Emani

Natan Vidra

Benjamin Rukundo

Abolfazl Shahbazi

Kongtao Chen

Rajat Ghosh

Vithursan Thangarasa

Pierre Peigné

Abhinav Singh

Max Bartolo

Satyapriya Krishna

Mubashara Akhtar

Rafael Gold

Cody Coleman

Luis Oala

Vassil Tashev

Joseph Marvin Imperial

Amy Russ

Sasidhar Kunapuli

Nicolas Miailhe

Julien Delaunay

Bhaktipriya Radharapu

Rajat Shinde

Tuesday

Debojyoti Dutta

Declan Grabb

Ananya Gangavarapu

Saurav Sahay

Agasthya Gangavarapu

Patrick Schramowski

Stephen Singam

Tom David

Xudong Han

Priyanka Mary Mammen

Tarunima Prabhakar

Venelin Kovatchev

Ahmed M. Ahmed

Kelvin Manyeki

Sandeep Madireddy

Foutse Khomh

Fedor Zhdanov

Joachim Baumann

N. Vasan

Xianjun Yang

Carlos Mougan

Jibin Rajan Varghese

Hussain Chinoy

Seshakrishna Jitendar

Manil Maskey

Claire V. Hardgrove

Tianhao Li

Aakash Gupta

Emil Joswin

Yifan Mai

Shachi H. Kumar

Çigdem Patlak

Kevin Lu

Vincent Alessi

Sree Bhargavi Balija

Chenhe Gu

Robert Sullivan

James Gealy

Matt Lavrisa

James Goel

Peter Mattson

Percy Liang

Joaquin Vanschoren

2025-02-19

ArXiv (preprint)

arxiv.org

Dehumanizing Machines: Mitigating Anthropomorphic Behaviors in Text Generation Systems

Myra Cheng

Su Lin Blodgett

Alicia DeVrio

Lisa Egede

Alexandra Olteanu

As text generation systems' outputs are increasingly anthropomorphic -- perceived as human-like -- scholars have also raised increasing concerns about how such outputs can lead to harmful outcomes, such as users over-relying or developing emotional dependence on these systems. How to intervene on such system outputs to mitigate anthropomorphic behaviors and their attendant harmful outcomes, however, remains understudied. With this work, we aim to provide empirical and theoretical grounding for developing such interventions. To do so, we compile an inventory of interventions grounded both in prior literature and a crowdsourced study where participants edited system outputs to make them less human-like. Drawing on this inventory, we also develop a conceptual framework to help characterize the landscape of possible interventions, articulate distinctions between different types of interventions, and provide a theoretical basis for evaluating the effectiveness of different interventions.

2025-02-19

ArXiv (preprint)

arxiv.org

MMTEB: Massive Multilingual Text Embedding Benchmark

Kenneth Enevoldsen

Isaac Chung

Imene Kerboua

Márton Kardos

Ashwin Mathur

David Stap

Jay Gala

Wissam Siblini

Dominik Krzemiński

Genta Indra Winata

Saba Sturua

Saiteja Utpala

Mathieu Ciancone

Marion Schaeffer

Gabriel Sequeira

Diganta Misra

Shreeya Dhakal

Jonathan Rystrøm

Roman Solomatin

Ömer Veysel Çağatan

Akash Kundu

Martin Bernstorff

Shitao Xiao

Akshita Sukhlecha

Bhavish Pahwa

Rafał Poświata

Kranthi Kiran GV

Shawon Ashraf

Daniel Auras

Björn Plüster

Jan Philipp Harries

Loïc Magne

Isabelle Mohr

Mariya Hendriksen

Dawei Zhu

Hippolyte Gisserot-Boukhlef

Tom Aarsen

Jan Kostkan

Konrad Wojtasik

Taemin Lee

Marek Suppa

Crystina Zhang

Roberta Rocca

Mohammed Hamdy

Andrianos Michail

John Yang

Manuel Faysse

Aleksei Vatolin

Nandan Thakur

Manan Dey

Dipam Vasani

Pranjal A Chitale

Simone Tedeschi

Nguyen Tai

Artem Snegirev

Michael Günther

Mengzhou Xia

Weijia Shi

Xing Han Lu

Jordan Clive

Gayatri K

Maksimova Anna

Silvan Wehrli

Maria Tikhonova

Henil Shalin Panchal

Aleksandr Abramov

Malte Ostendorff

Zheng Liu

Simon Clematide

Lester James Validad Miranda

Alena Fenogenova

Guangyu Song

Ruqiya Bin Safi

Wen-Ding Li

Alessia Borghini

Federico Cassano

Hongjin Su

Jimmy Lin

Howard Yen

Lasse Hansen

Sara Hooker

Chenghao Xiao

Vaibhav Adlakha

Orion Weller

Siva Reddy

Niklas Muennighoff

Text embeddings are typically evaluated on a limited set of tasks, which are constrained by language, domain, and task diversity. To address these limitations and provide a more comprehensive evaluation, we introduce the Massive Multilingual Text Embedding Benchmark (MMTEB) - a large-scale, community-driven expansion of MTEB, covering over 500 quality-controlled evaluation tasks across 250+ languages. MMTEB includes a diverse set of challenging, novel tasks such as instruction following, long-document retrieval, and code retrieval, representing the largest multilingual collection of evaluation tasks for embedding models to date. Using this collection, we develop several highly multilingual benchmarks, which we use to evaluate a representative set of models. We find that while large language models (LLMs) with billions of parameters can achieve state-of-the-art performance on certain language subsets and task categories, the best-performing publicly available model is multilingual-e5-large-instruct with only 560 million parameters. To facilitate accessibility and reduce computational cost, we introduce a novel downsampling method based on inter-task correlation, ensuring a diverse selection while preserving relative model rankings. Furthermore, we optimize tasks such as retrieval by sampling hard negatives, creating smaller but effective splits. These optimizations allow us to introduce benchmarks that drastically reduce computational demands. For instance, our newly introduced zero-shot English benchmark maintains a ranking order similar to the full-scale version but at a fraction of the computational cost.

2025-02-19

ArXiv (preprint)

doi.org

arxiv.org
