How to Get Your LLM to Generate Challenging Problems for Evaluation
The pace of evolution of Large Language Models (LLMs) necessitates new approaches for rigorous and comprehensive evaluation. Traditional human annotation is increasingly impracticable due to the complexities and costs involved in generating high-quality, challenging problems, particularly for tasks such as long-context reasoning. Moreover, the rapid saturation of existing human-curated benchmarks by LLMs further underscores the need to develop scalable and automatically renewable evaluation methodologies. In this work, we introduce **CHASE**, a unified framework to synthetically generate challenging problems using LLMs without human involvement. For a given task, our approach builds a hard problem in a bottom-up manner from simpler components. Moreover, since we want to generate synthetic data for evaluation, our framework decomposes the generation process into independently verifiable sub-tasks, thereby ensuring a high level of quality and correctness. We implement CHASE to create evaluation benchmarks across three diverse domains: document-based question answering, repository-level code completion, and math reasoning. The performance of state-of-the-art LLMs on these synthetic benchmarks lies in the range of 40-60% accuracy, thereby demonstrating the effectiveness of our framework at generating hard problems. Our experiments further reveal that the Gemini models significantly outperform other LLMs at long-context reasoning, and that the performance of all LLMs drops by as much as 70% when we scale up the context size to 50k tokens.
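The bottom-up, verify-then-compose recipe described in the abstract can be made concrete with a short sketch. Everything below (the generic `llm(prompt) -> str` callable, the function names, and the prompts) is a hypothetical illustration of the decomposition idea, not the authors' pipeline.

```python
from typing import Callable, List

def generate_verified_component(llm: Callable[[str], str], seed: str) -> str:
    """Generate one simple sub-problem and verify it independently."""
    component = llm(f"Write a simple, self-contained sub-problem about: {seed}")
    # Verification is its own sub-task: a separate call judges the
    # component without seeing how it was produced.
    verdict = llm("Answer YES or NO: is this sub-problem well-posed and "
                  f"unambiguously solvable?\n{component}")
    if "YES" not in verdict.upper():
        raise ValueError(f"Component failed verification: {component!r}")
    return component

def compose_hard_problem(llm: Callable[[str], str], seeds: List[str]) -> str:
    """Bottom-up composition: verified simple parts -> one hard problem."""
    parts = [generate_verified_component(llm, s) for s in seeds]
    return llm("Combine the following sub-problems into a single challenging "
               "problem whose answer requires solving all of them:\n\n"
               + "\n\n".join(parts))
```

Because each component passes its own check before composition, the correctness of the final hard problem reduces to the correctness of the much simpler verified parts.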
OBELiX: A Curated Dataset of Crystal Structures and Experimentally Measured Ionic Conductivities for Lithium Solid-State Electrolytes
Félix Therrien
Rhiannon Hendley
Alex Hernández-García
Sun Sun
Alain Tchagang
Jiang Su
Samuel Huberman
Hongyu Guo
Homin Shin
Solid-state electrolyte batteries are expected to replace liquid electrolyte lithium-ion batteries in the near future thanks to their higher theoretical energy density and improved safety. However, their adoption is currently hindered by their lower effective ionic conductivity, a quantity that governs charge and discharge rates. Identifying highly ion-conductive materials using conventional theoretical calculations and experimental validation is both time-consuming and resource-intensive. While machine learning holds the promise to expedite this process, relevant ionic conductivity and structural data is scarce. Here, we present OBELiX, a domain-expert-curated database of …
AILuminate: Introducing v1.0 of the AI Risk and Reliability Benchmark from MLCommons
Shaona Ghosh
Heather Frase
Adina Williams
Sarah Luger
Paul Röttger
Fazl Barez
Sean McGregor
Kenneth Fricklas
Mala Kumar
Quentin Feuillade-Montixi
Kurt Bollacker
Felix Friedrich
Ryan Tsang
Bertie Vidgen
Alicia Parrish
Chris Knotz
Eleonora Presani
Jonathan Bennion
Marisa Ferrara Boston
Mike Kuniavsky
Wiebke Hutiri
James Ezick
Malek Ben Salem
Rajat Sahay
Sujata Goswami
Usman Gohar
Ben Huang
Supheakmungkol Sarin
Elie Alhajjar
Canyu Chen
Roman Eng
K. Manjusha
Virendra Mehta
Eileen Peters Long
Murali Krishna Emani
Natan Vidra
Benjamin Rukundo
Abolfazl Shahbazi
Kongtao Chen
Rajat Ghosh
Vithursan Thangarasa
Pierre Peigné
Abhinav Singh
Max Bartolo
Satyapriya Krishna
Mubashara Akhtar
Rafael Gold
Cody Coleman
Luis Oala
Vassil Tashev
Joseph Marvin Imperial
Amy Russ
Sasidhar Kunapuli
Nicolas Miailhe
Julien Delaunay
Bhaktipriya Radharapu
Rajat Shinde
Tuesday
Debojyoti Dutta
Declan Grabb
Ananya Gangavarapu
Saurav Sahay
Agasthya Gangavarapu
Patrick Schramowski
Stephen Singam
Tom David
Xudong Han
Priyanka Mary Mammen
Tarunima Prabhakar
Venelin Kovatchev
Ahmed M. Ahmed
Kelvin Manyeki
Sandeep Madireddy
Fedor Zhdanov
Joachim Baumann
N. Vasan
Xianjun Yang
Carlos Mougán
Jibin Rajan Varghese
Hussain Chinoy
Seshakrishna Jitendar
Manil Maskey
Claire V. Hardgrove
Tianhao Li
Aakash Gupta
Emil Joswin
Yifan Mai
Shachi H. Kumar
Çiğdem Patlak
Kevin Lu
Vincent Alessi
Sree Bhargavi Balija
Chenhe Gu
Robert Sullivan
James Gealy
Matt Lavrisa
James Goel
Peter Mattson
Percy Liang
Joaquin Vanschoren
Dehumanizing Machines: Mitigating Anthropomorphic Behaviors in Text Generation Systems
Myra Cheng
Su Lin Blodgett
Alicia DeVrio
Lisa Egede
As text generation systems' outputs are increasingly anthropomorphic -- perceived as human-like -- scholars have also increasingly raised concerns about how such outputs can lead to harmful outcomes, such as users over-relying on or developing emotional dependence on these systems. How to intervene on such system outputs to mitigate anthropomorphic behaviors and their attendant harmful outcomes, however, remains understudied. With this work, we aim to provide empirical and theoretical grounding for developing such interventions. To do so, we compile an inventory of interventions grounded both in prior literature and in a crowdsourced study where participants edited system outputs to make them less human-like. Drawing on this inventory, we also develop a conceptual framework to help characterize the landscape of possible interventions, articulate distinctions between different types of interventions, and provide a theoretical basis for evaluating the effectiveness of different interventions.
MMTEB: Massive Multilingual Text Embedding Benchmark
Kenneth Enevoldsen
Isaac Chung
Márton Kardos
Ashwin Mathur
David Stap
Jay Gala
Wissam Siblini
Dominik Krzemiński
Genta Indra Winata
Saba Sturua
Saiteja Utpala
Mathieu Ciancone
Marion Schaeffer
Gabriel Sequeira
Shreeya Dhakal
Jonathan Rystrøm
Roman Solomatin
Ömer Veysel Çağatan
Akash Kundu
Martin Bernstorff
Shitao Xiao
Akshita Sukhlecha
Bhavish Pahwa
Rafał Poświata
Kranthi Kiran GV
Shawon Ashraf
Daniel Auras
Björn Plüster
Jan Philipp Harries
Loïc Magne
Isabelle Mohr
Mariya Hendriksen
Dawei Zhu
Hippolyte Gisserot-Boukhlef
Tom Aarsen
Jan Kostkan
Konrad Wojtasik
Taemin Lee
Marek Suppa
Crystina Zhang
Roberta Rocca
Mohammed Hamdy
Andrianos Michail
John Yang
Manuel Faysse
Aleksei Vatolin
Nandan Thakur
Manan Dey
Dipam Vasani
Pranjal A Chitale
Simone Tedeschi
Nguyen Tai
Artem Snegirev
Michael Günther
Mengzhou Xia
Weijia Shi
Jordan Clive
Gayatri K
Maksimova Anna
Silvan Wehrli
Maria Tikhonova
Henil Shalin Panchal
Aleksandr Abramov
Malte Ostendorff
Zheng Liu
Simon Clematide
Lester James Validad Miranda
Alena Fenogenova
Guangyu Song
Ruqiya Bin Safi
Wen-Ding Li
Alessia Borghini
Federico Cassano
Hongjin Su
Jimmy Lin
Howard Yen
Lasse Hansen
Sara Hooker
Chenghao Xiao
Orion Weller
Niklas Muennighoff
Text embeddings are typically evaluated on a limited set of tasks, which are constrained by language, domain, and task diversity. To address these limitations and provide a more comprehensive evaluation, we introduce the Massive Multilingual Text Embedding Benchmark (MMTEB) - a large-scale, community-driven expansion of MTEB, covering over 500 quality-controlled evaluation tasks across 250+ languages. MMTEB includes a diverse set of challenging, novel tasks such as instruction following, long-document retrieval, and code retrieval, representing the largest multilingual collection of evaluation tasks for embedding models to date. Using this collection, we develop several highly multilingual benchmarks, which we use to evaluate a representative set of models. We find that while large language models (LLMs) with billions of parameters can achieve state-of-the-art performance on certain language subsets and task categories, the best-performing publicly available model is multilingual-e5-large-instruct with only 560 million parameters. To facilitate accessibility and reduce computational cost, we introduce a novel downsampling method based on inter-task correlation, ensuring a diverse selection while preserving relative model rankings. Furthermore, we optimize tasks such as retrieval by sampling hard negatives, creating smaller but effective splits. These optimizations allow us to introduce benchmarks that drastically reduce computational demands. For instance, our newly introduced zero-shot English benchmark maintains a ranking order similar to the full-scale version but at a fraction of the computational cost.
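As a rough illustration of the inter-task-correlation idea mentioned in the abstract, here is a hedged sketch of greedy task downsampling. The specific criterion (drop the task with the highest mean absolute correlation to the rest) is an assumption chosen for illustration, not necessarily the paper's algorithm.

```python
import numpy as np

def downsample_tasks(scores: np.ndarray, keep: int) -> list:
    """scores: (n_models, n_tasks) matrix of per-task benchmark results.
    Greedily drops the task whose scores are most redundant (highest mean
    absolute correlation) with the remaining tasks, keeping a diverse
    subset that roughly preserves model rankings."""
    remaining = list(range(scores.shape[1]))
    while len(remaining) > keep:
        # Correlation between tasks, computed across models.
        corr = np.abs(np.corrcoef(scores[:, remaining].T))
        np.fill_diagonal(corr, 0.0)
        # Index (within `remaining`) of the most redundant task.
        most_redundant = int(np.argmax(corr.mean(axis=1)))
        remaining.pop(most_redundant)
    return remaining
```

Given a (models × tasks) score matrix, the surviving tasks are the least mutually redundant ones, which is what allows a much smaller benchmark to preserve relative model rankings.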
Object-centric Binding in Contrastive Language-Image Pretraining
Pietro Astolfi
Michal Drozdzal
Recent advances in vision language models (VLMs) have been driven by contrastive models such as CLIP, which learn to associate visual information with corresponding text descriptions. However, these models have limitations in understanding complex compositional scenes involving multiple objects and their spatial relationships. To address these challenges, we propose a novel approach that diverges from commonly used strategies, which rely on the design of hard-negative augmentations. Instead, our work focuses on integrating inductive biases into pre-trained CLIP-like models to improve their compositional understanding without using any additional hard-negatives. To that end, we introduce a binding module that connects a scene graph, derived from a text description, with a slot-structured image representation, facilitating a structured similarity assessment between the two modalities. We also leverage relationships as text-conditioned visual constraints, thereby capturing the intricate interactions between objects and their contextual relationships more effectively. Our resulting model not only enhances the performance of CLIP-based models in multi-object compositional understanding but also paves the way towards more accurate and sample-efficient image-text matching of complex scenes.
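To make the "structured similarity assessment" concrete, here is a minimal hedged sketch of scoring slot-structured image features against scene-graph node embeddings. The shapes, the best-slot (max) binding, and the mean aggregation are illustrative assumptions, not the paper's exact module.

```python
import torch
import torch.nn.functional as F

def structured_similarity(slots: torch.Tensor, nodes: torch.Tensor) -> torch.Tensor:
    """slots: (num_slots, d) object-centric image slots.
    nodes: (num_nodes, d) embeddings of scene-graph nodes parsed from the caption.
    Each text node binds to its best-matching image slot; the image-text
    score is the mean of these per-node binding similarities."""
    slots = F.normalize(slots, dim=-1)    # unit norm, so dot product = cosine
    nodes = F.normalize(nodes, dim=-1)
    sim = nodes @ slots.T                 # (num_nodes, num_slots) cosine matrix
    return sim.max(dim=-1).values.mean()  # bind each node to its best slot
```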
Making the Write Connections: Linking Writing Support Tools with Writer's Needs
Zixin Zhao
Young-Ho Kim
Gerald Penn
Fanny Chevalier
This work sheds light on whether and how creative writers' needs are met by existing research and commercial writing support tools (WST). We conducted a need-finding study to gain insight into the writers' process during creative writing through a qualitative analysis of responses from an online questionnaire and Reddit discussions on r/Writing. Using a systematic analysis of 115 tools and 67 research papers, we map out the landscape of how digital tools facilitate the writing process. Our triangulation of data reveals that research predominantly focuses on the writing activity and overlooks pre-writing activities and the importance of visualization. We distill 10 key takeaways to inform future research on WST and point to opportunities surrounding underexplored areas. Our work offers a holistic and up-to-date account of how tools have transformed the writing process, guiding the design of future tools that address writers' evolving and unmet needs.