Vaibhav Adlakha

BiXSE: Improving Dense Retrieval via Probabilistic Graded Relevance Distillation

Joao Monteiro

Neural sentence embedding models for dense retrieval typically rely on binary relevance labels, treating query-document pairs as either rele… (see more)vant or irrelevant. However, real-world relevance often exists on a continuum, and recent advances in large language models (LLMs) have made it feasible to scale the generation of fine-grained graded relevance labels. In this work, we propose \textbf{BiXSE}, a simple and effective pointwise training method that optimizes binary cross-entropy (BCE) over LLM-generated graded relevance scores. BiXSE interprets these scores as probabilistic targets, enabling granular supervision from a single labeled query-document pair per query. Unlike pairwise or listwise losses that require multiple annotated comparisons per query, BiXSE achieves strong performance with reduced annotation and compute costs by leveraging in-batch negatives. Extensive experiments across sentence embedding (MMTEB) and retrieval benchmarks (BEIR, TREC-DL) show that BiXSE consistently outperforms softmax-based contrastive learning (InfoNCE), and matches or exceeds strong pairwise ranking baselines when trained on LLM-supervised data. BiXSE offers a robust, scalable alternative for training dense retrieval models as graded relevance supervision becomes increasingly accessible.

2025-07-06

colmweb.org/COLM/2025/Conference (accepted)

openreview.net

DeepSeek-R1 Thoughtology: Let's think about LLM Reasoning

Sara Vera Marjanovi'c

Arkil Patel

Milad Aghajohari

Parishad BehnamGhader

Amirhossein Kazemnejad

Gaurav Kamath

Marius Mosbach

Karolina Stanczak

Large Reasoning Models like DeepSeek-R1 mark a fundamental shift in how LLMs approach complex problems. Instead of directly producing an ans… (see more)wer for a given input, DeepSeek-R1 creates detailed multi-step reasoning chains, seemingly"thinking"about a problem before providing an answer. This reasoning process is publicly available to the user, creating endless opportunities for studying the reasoning behaviour of the model and opening up the field of Thoughtology. Starting from a taxonomy of DeepSeek-R1's basic building blocks of reasoning, our analyses on DeepSeek-R1 investigate the impact and controllability of thought length, management of long or confusing contexts, cultural and safety concerns, and the status of DeepSeek-R1 vis-\`a-vis cognitive phenomena, such as human-like language processing and world modelling. Our findings paint a nuanced picture. Notably, we show DeepSeek-R1 has a 'sweet spot' of reasoning, where extra inference time can impair model performance. Furthermore, we find a tendency for DeepSeek-R1 to persistently ruminate on previously explored problem formulations, obstructing further exploration. We also note strong safety vulnerabilities of DeepSeek-R1 compared to its non-reasoning counterpart, which can also compromise safety-aligned LLMs.

2025-04-01

ArXiv (preprint)

MMTEB: Massive Multilingual Text Embedding Benchmark

Kenneth Enevoldsen

Isaac Chung

Imene Kerboua

Márton Kardos

Ashwin Mathur

David Stap

Jay Gala

Wissam Siblini

Dominik Krzemiński

Genta Indra Winata

Saba Sturua

Saiteja Utpala

Mathieu Ciancone

Marion Schaeffer

Gabriel Sequeira

Diganta Misra

Shreeya Dhakal

Jonathan Rystrøm

Roman Solomatin

Ömer Çağatan … (see 66 more)

Akash Kundu

Martin Bernstorff

Shitao Xiao

Akshita Sukhlecha

Bhavish Pahwa

Rafał Poświata

Kranthi Kiran GV

Shawon Ashraf

Daniel Auras

Björn Plüster

Jan Philipp Harries

Loïc Magne

Isabelle Mohr

Mariya Hendriksen

Dawei Zhu

Hippolyte Gisserot-Boukhlef

Tom Aarsen

Jan Kostkan

Konrad Wojtasik

Taemin Lee

Marek Šuppa

Crystina Zhang

Roberta Rocca

Mohammed Hamdy

Andrianos Michail

John Yang

Manuel Faysse

Aleksei Vatolin

Nandan Thakur

Manan Dey

Dipam Vasani

Pranjal Chitale

Simone Tedeschi

Nguyen Tai

Artem Snegirev

Michael Günther

Mengzhou Xia

Weijia Shi

Xing Han Lu

Jordan Clive

Gayatri Krishnakumar

Anna Maksimova

Silvan Wehrli

Maria Tikhonova

Henil Panchal

Aleksandr Abramov

Malte Ostendorff

Zheng Liu

Simon Clematide

Lester James Miranda

Alena Fenogenova

Guangyu Song

Ruqiya Bin Safi

Wen-Ding Li

Alessia Borghini

Federico Cassano

Hongjin Su

Jimmy Lin

Howard Yen

Lasse Hansen

Sara Hooker

Chenghao Xiao

Orion Weller

Niklas Muennighoff

Text embeddings are typically evaluated on a limited set of tasks, which are constrained by language, domain, and task diversity. To address… (see more) these limitations and provide a more comprehensive evaluation, we introduce the Massive Multilingual Text Embedding Benchmark (MMTEB) - a large-scale, community-driven expansion of MTEB, covering over 500 quality-controlled evaluation tasks across 250+ languages. MMTEB includes a diverse set of challenging, novel tasks such as instruction following, long-document retrieval, and code retrieval, representing the largest multilingual collection of evaluation tasks for embedding models to date. Using this collection, we develop several highly multilingual benchmarks, which we use to evaluate a representative set of models. We find that while large language models (LLMs) with billions of parameters can achieve state-of-the-art performance on certain language subsets and task categories, the best-performing publicly available model is multilingual-e5-large-instruct with only 560 million parameters. To facilitate accessibility and reduce computational cost, we introduce a novel downsampling method based on inter-task correlation, ensuring a diverse selection while preserving relative model rankings. Furthermore, we optimize tasks such as retrieval by sampling hard negatives, creating smaller but effective splits. These optimizations allow us to introduce benchmarks that drastically reduce computational demands. For instance, our newly introduced zero-shot English benchmark maintains a ranking order similar to the full-scale version but at a fraction of the computational cost.

2025-01-21

ICLR.cc/2025/Conference (poster)

openreview.net

LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders

Parishad BehnamGhader

Large decoder-only language models (LLMs) are the state-of-the-art models on most of today's NLP tasks and benchmarks. Yet, the community is… (see more) only slowly adopting these models for text embedding tasks, which require rich contextualized representations. In this work, we introduce LLM2Vec, a simple unsupervised approach that can transform any decoder-only LLM into a strong text encoder. LLM2Vec consists of three simple steps: 1) enabling bidirectional attention, 2) masked next token prediction, and 3) unsupervised contrastive learning. We demonstrate the effectiveness of LLM2Vec by applying it to 4 popular LLMs ranging from 1.3B to 8B parameters and evaluate the transformed models on English word- and sequence-level tasks. We outperform encoder-only models by a large margin on word-level tasks and reach a new unsupervised state-of-the-art performance on the Massive Text Embeddings Benchmark (MTEB). Moreover, when combining LLM2Vec with supervised contrastive learning, we achieve state-of-the-art performance on MTEB among models that train only on publicly available data (as of May 24, 2024). Our strong empirical results and extensive analysis demonstrate that LLMs can be effectively transformed into universal text encoders in a parameter-efficient manner without the need for expensive adaptation or synthetic GPT-4 generated data.

2024-07-09

colmweb.org/COLM/2024/Conference (accepted)

openreview.net

Evaluating Correctness and Faithfulness of Instruction-Following Models for Question Answering

Parishad BehnamGhader

Xing Han Lu

Nicholas Meade

Retriever-augmented instruction-following models are attractive alternatives to fine-tuned approaches for information-seeking tasks such as … (see more)question answering (QA). By simply prepending retrieved documents in its input along with an instruction, these models can be adapted to various information domains and tasks without additional fine-tuning. While the model responses tend to be natural and fluent, the additional verbosity makes traditional QA evaluation metrics such as exact match (EM) and F1 unreliable for accurately quantifying model performance. In this work, we investigate the performance of instruction-following models across three information-seeking QA tasks. We use both automatic and human evaluation to evaluate these models along two dimensions: 1) how well they satisfy the user's information need (correctness), and 2) whether they produce a response based on the provided knowledge (faithfulness). Guided by human evaluation and analysis, we highlight the shortcomings of traditional metrics for both correctness and faithfulness. We then propose simple token-overlap based and model-based metrics that reflect the true performance of these models. Our analysis reveals that instruction-following models are competitive, and sometimes even outperform fine-tuned models for correctness. However, these models struggle to stick to the provided knowledge and often hallucinate in their responses. We hope our work encourages a more holistic evaluation of instruction-following models for QA. Our code and data is available at https://github.com/McGill-NLP/instruct-qa

2023-12-31

Transactions of the Association for Computational Linguistics (published)

Evaluating the Faithfulness of Importance Measures in NLP by Recursively Masking Allegedly Important Tokens and Retraining

To explain NLP models a popular approach is to use importance measures, such as attention, which inform input tokens are important for makin… (see more)g a prediction. However, an open question is how well these explanations accurately reflect a model's logic, a property called faithfulness. To answer this question, we propose Recursive ROAR, a new faithfulness metric. This works by recursively masking allegedly important tokens and then retraining the model. The principle is that this should result in worse model performance compared to masking random tokens. The result is a performance curve given a masking-ratio. Furthermore, we propose a summarizing metric using relative area-between-curves (RACU), which allows for easy comparison across papers, models, and tasks. We evaluate 4 different importance measures on 8 different datasets, using both LSTM-attention models and RoBERTa models. We find that the faithfulness of importance measures is both model-dependent and task-dependent. This conclusion contradicts previous evaluations in both computer vision and faithfulness of attention literature.

2022-11-30

Findings of the Association for Computational Linguistics: EMNLP 2022 (published)

Image Retrieval from Contextual Descriptions

Vibhav Vineet

Edoardo Ponti

The ability to integrate context, including perceptual and temporal cues, plays a pivotal role in grounding the meaning of a linguistic utte… (see more)rance. In order to measure to what extent current vision-and-language models master this ability, we devise a new multimodal challenge, Image Retrieval from Contextual Descriptions (ImageCoDe). In particular, models are tasked with retrieving the correct image from a set of 10 minimally contrastive candidates based on a contextual description.As such, each description contains only the details that help distinguish between images.Because of this, descriptions tend to be complex in terms of syntax and discourse and require drawing pragmatic inferences. Images are sourced from both static pictures and video frames.We benchmark several state-of-the-art models, including both cross-encoders such as ViLBERT and bi-encoders such as CLIP, on ImageCoDe.Our results reveal that these models dramatically lag behind human performance: the best variant achieves an accuracy of 20.9 on video frames and 59.4 on static pictures, compared with 90.8 in humans.Furthermore, we experiment with new model variants that are better equipped to incorporate visual and temporal context into their representations, which achieve modest gains. Our hope is that ImageCoDE will foster progress in grounded language understanding by encouraging models to focus on fine-grained visual differences.

2022-04-30

Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (published)

TopiOCQA: Open-domain Conversational Question Answering with Topic Switching

Shehzaad Dhuliawala

Kaheer Suleman

Harm de Vries