Diganta Misra

Massimo Caccia

The rapid evolution of software libraries poses a considerable hurdle for code generation, necessitating continuous adaptation to frequent v… (see more)ersion updates while preserving backward compatibility. While existing code evolution benchmarks provide valuable insights, they typically lack execution-based evaluation for generating code compliant with specific library versions. To address this, we introduce GitChameleon, a novel, meticulously curated dataset comprising 328 Python code completion problems, each conditioned on specific library versions and accompanied by executable unit tests. GitChameleon rigorously evaluates the capacity of contemporary large language models (LLMs), LLM-powered agents, code assistants, and RAG systems to perform version-conditioned code generation that demonstrates functional accuracy through execution. Our extensive evaluations indicate that state-of-the-art systems encounter significant challenges with this task; enterprise models achieving baseline success rates in the 48-51\% range, underscoring the intricacy of the problem. By offering an execution-based benchmark emphasizing the dynamic nature of code libraries, GitChameleon enables a clearer understanding of this challenge and helps guide the development of more adaptable and dependable AI code generation methods. We make the dataset and evaluation code publicly available at https://github.com/mrcabbage972/GitChameleonBenchmark.

2025-07-16

ArXiv (preprint)

GitChameleon: Evaluating AI Code Generation Against Python Library Version Incompatibilities

Victor May

Eilif Benjamin Muller

Massimo Caccia

The rapid evolution of software libraries poses a considerable hurdle for code generation, necessitating continuous adaptation to frequent v… (see more)ersion updates while preserving backward compatibility. While existing code evolution benchmarks provide valuable insights, they typically lack execution-based evaluation for generating code compliant with specific library versions. To address this, we introduce GitChameleon, a novel, meticulously curated dataset comprising 328 Python code completion problems, each conditioned on specific library versions and accompanied by executable unit tests. GitChameleon rigorously evaluates the capacity of contemporary large language models (LLMs), LLM-powered agents, code assistants, and RAG systems to perform version-conditioned code generation that demonstrates functional accuracy through execution. Our extensive evaluations indicate that state-of-the-art systems encounter significant challenges with this task; enterprise models achieving baseline success rates in the 48-51\% range, underscoring the intricacy of the problem. By offering an execution-based benchmark emphasizing the dynamic nature of code libraries, GitChameleon enables a clearer understanding of this challenge and helps guide the development of more adaptable and dependable AI code generation methods. We make the dataset and evaluation code publicly available at https://github.com/mrcabbage972/GitChameleonBenchmark.

2025-07-16

ArXiv (preprint)

GitChameleon: Evaluating AI Code Generation Against Python Library Version Incompatibilities

Victor May

Eilif Benjamin Muller

Massimo Caccia

The rapid evolution of software libraries poses a considerable hurdle for code generation, necessitating continuous adaptation to frequent v… (see more)ersion updates while preserving backward compatibility. While existing code evolution benchmarks provide valuable insights, they typically lack execution-based evaluation for generating code compliant with specific library versions. To address this, we introduce GitChameleon, a novel, meticulously curated dataset comprising 328 Python code completion problems, each conditioned on specific library versions and accompanied by executable unit tests. GitChameleon rigorously evaluates the capacity of contemporary large language models (LLMs), LLM-powered agents, code assistants, and RAG systems to perform version-conditioned code generation that demonstrates functional accuracy through execution. Our extensive evaluations indicate that state-of-the-art systems encounter significant challenges with this task; enterprise models achieving baseline success rates in the 48-51\% range, underscoring the intricacy of the problem. By offering an execution-based benchmark emphasizing the dynamic nature of code libraries, GitChameleon enables a clearer understanding of this challenge and helps guide the development of more adaptable and dependable AI code generation methods. We make the dataset and evaluation code publicly available at https://github.com/mrcabbage972/GitChameleonBenchmark.

2025-07-16

ArXiv (preprint)

GitChameleon 2.0: Evaluating AI Code Generation Against Python Library Version Incompatibilities

Victor May

Eilif Benjamin Muller

Massimo Caccia

The rapid evolution of software libraries poses a considerable hurdle for code generation, necessitating continuous adaptation to frequent v… (see more)ersion updates while preserving backward compatibility. While existing code evolution benchmarks provide valuable insights, they typically lack execution-based evaluation for generating code compliant with specific library versions. To address this, we introduce GitChameleon 2.0, a novel, meticulously curated dataset comprising 328 Python code completion problems, each conditioned on specific library versions and accompanied by executable unit tests. GitChameleon 2.0 rigorously evaluates the capacity of contemporary large language models (LLMs), LLM-powered agents, code assistants, and RAG systems to perform version-conditioned code generation that demonstrates functional accuracy through execution. Our extensive evaluations indicate that state-of-the-art systems encounter significant challenges with this task; enterprise models achieving baseline success rates in the 48-51% range, underscoring the intricacy of the problem. By offering an execution-based benchmark emphasizing the dynamic nature of code libraries, GitChameleon 2.0 enables a clearer understanding of this challenge and helps guide the development of more adaptable and dependable AI code generation methods. We make the dataset and evaluation code publicly available at https://github.com/mrcabbage972/GitChameleonBenchmark.

2025-07-01

arXiv (published)

MMTEB: Massive Multilingual Text Embedding Benchmark

Kenneth Enevoldsen

Isaac Chung

Imene Kerboua

Márton Kardos

Ashwin Mathur

David Stap

Jay Gala

Wissam Siblini

Dominik Krzemiński

Genta Indra Winata

Saba Sturua

Saiteja Utpala

Mathieu Ciancone

Marion Schaeffer

Gabriel Sequeira

Shreeya Dhakal

Jonathan Rystrøm

Roman Solomatin

Ömer Veysel Çağatan … (see 66 more)

Akash Kundu

Martin Bernstorff

Shitao Xiao

Akshita Sukhlecha

Bhavish Pahwa

Rafał Poświata

Kranthi Kiran GV

Shawon Ashraf

Daniel Auras

Björn Plüster

Jan Philipp Harries

Loïc Magne

Isabelle Mohr

Mariya Hendriksen

Dawei Zhu

Hippolyte Gisserot-Boukhlef

Tom Aarsen

Jan Kostkan

Konrad Wojtasik

Taemin Lee

Marek Suppa

Crystina Zhang

Roberta Rocca

Mohammed Hamdy

Andrianos Michail

John Yang

Manuel Faysse

Aleksei Vatolin

Nandan Thakur

Manan Dey

Dipam Vasani

Pranjal A Chitale

Simone Tedeschi

Nguyen Tai

Artem Snegirev

Michael Günther

Mengzhou Xia

Weijia Shi

Xing Han Lu

Jordan Clive

Gayatri K

Maksimova Anna

Silvan Wehrli

Maria Tikhonova

Henil Shalin Panchal

Aleksandr Abramov

Malte Ostendorff

Zheng Liu

Simon Clematide

Lester James Validad Miranda

Alena Fenogenova

Guangyu Song

Ruqiya Bin Safi

Wen-Ding Li

Alessia Borghini

Federico Cassano

Hongjin Su

Jimmy Lin

Howard Yen

Lasse Hansen

Sara Hooker

Chenghao Xiao

Vaibhav Adlakha

Orion Weller

Niklas Muennighoff

Text embeddings are typically evaluated on a limited set of tasks, which are constrained by language, domain, and task diversity. To address… (see more) these limitations and provide a more comprehensive evaluation, we introduce the Massive Multilingual Text Embedding Benchmark (MMTEB) - a large-scale, community-driven expansion of MTEB, covering over 500 quality-controlled evaluation tasks across 250+ languages. MMTEB includes a diverse set of challenging, novel tasks such as instruction following, long-document retrieval, and code retrieval, representing the largest multilingual collection of evaluation tasks for embedding models to date. Using this collection, we develop several highly multilingual benchmarks, which we use to evaluate a representative set of models. We find that while large language models (LLMs) with billions of parameters can achieve state-of-the-art performance on certain language subsets and task categories, the best-performing publicly available model is multilingual-e5-large-instruct with only 560 million parameters. To facilitate accessibility and reduce computational cost, we introduce a novel downsampling method based on inter-task correlation, ensuring a diverse selection while preserving relative model rankings. Furthermore, we optimize tasks such as retrieval by sampling hard negatives, creating smaller but effective splits. These optimizations allow us to introduce benchmarks that drastically reduce computational demands. For instance, our newly introduced zero-shot English benchmark maintains a ranking order similar to the full-scale version but at a fraction of the computational cost.

2025-02-19

ArXiv (preprint)

MMTEB: Massive Multilingual Text Embedding Benchmark

Kenneth Enevoldsen

Isaac Chung

Imene Kerboua

Márton Kardos

Ashwin Mathur

David Stap

Jay Gala

Wissam Siblini

Dominik Krzemiński

Genta Indra Winata

Saba Sturua

Saiteja Utpala

Mathieu Ciancone

Marion Schaeffer

Shreeya Dhakal

Jonathan Rystrøm

Roman Solomatin

Ömer Veysel Çağatan

Akash Kundu … (see 62 more)

Martin Bernstorff

Shitao Xiao

Akshita Sukhlecha

Bhavish Pahwa

Rafał Poświata

Kranthi Kiran GV

Shawon Ashraf

Daniel Auras

Björn Plüster

Jan Philipp Harries

Loïc Magne

Isabelle Mohr

Dawei Zhu

Hippolyte Gisserot-Boukhlef

Tom Aarsen

Jan Kostkan

Konrad Wojtasik

Taemin Lee

Marek Suppa

Crystina Zhang

Roberta Rocca

Mohammed Hamdy

Andrianos Michail

John Yang

Manuel Faysse

Aleksei Vatolin

Nandan Thakur

Manan Dey

Dipam Vasani

Pranjal A Chitale

Simone Tedeschi

Nguyen Tai

Artem Snegirev

Mariya Hendriksen

Michael Günther

Mengzhou Xia

Weijia Shi

Xing Han Lu

Jordan Clive

Gayatri K

Maksimova Anna

Silvan Wehrli

Maria Tikhonova

Henil Shalin Panchal

Aleksandr Abramov

Malte Ostendorff

Zheng Liu

Simon Clematide

Lester James Validad Miranda

Alena Fenogenova

Guangyu Song

Ruqiya Bin Safi

Wen-Ding Li

Alessia Borghini

Federico Cassano

Lasse Hansen

Sara Hooker

Chenghao Xiao

Vaibhav Adlakha

Orion Weller

Niklas Muennighoff

Text embeddings are typically evaluated on a narrow set of tasks, limited in terms of languages, domains, and task types. To circumvent this… (see more) limitation and to provide a more comprehensive evaluation, we introduce the Massive Multilingual Text Embedding Benchmark (MMTEB) -- a large-scale community-driven initiative expanding MTEB to over 500 quality-controlled evaluation tasks across 1,000+ languages. MMTEB includes a wide range of challenging novel tasks such as instruction following, long-document retrieval, and code retrieval, and represents the largest multilingual collection of evaluation tasks for embedding models to date. We use this collection to construct multiple highly multilingual benchmarks. We evaluate a representative set of models on these benchmarks. Our findings indicate that, while LLM-based models can achieve state-of-the-art performance on a subset of languages, the best-performing publicly available model across languages is the notably smaller, multilingual-e5-large-instruct. Massive benchmarks often impose high computational demands, limiting accessibility, particularly for low-resource communities. To address this, we downsample tasks based on inter-task correlation (i.e., selecting only a diverse set of tasks) while preserving relative rankings. We further optimize tasks such as retrieval by sampling hard negatives, creating smaller but effective splits. These optimizations allow us to introduce benchmarks at a significantly lower computational cost. For instance, we introduce a new zero-shot English benchmark that maintains a similar ordering at a fraction of the cost.

2025-01-22

ICLR.cc/2025/Conference (poster)

MMTEB: Massive Multilingual Text Embedding Benchmark

Kenneth Enevoldsen

Isaac Chung

Imene Kerboua

Márton Kardos

Ashwin Mathur

David Stap

Jay Gala

Wissam Siblini

Dominik Krzemiński

Genta Indra Winata

Saba Sturua

Saiteja Utpala

Mathieu Ciancone

Marion Schaeffer

Gabriel Sequeira

Shreeya Dhakal

Jonathan Rystrøm

Roman Solomatin

Ömer Veysel Çağatan … (see 66 more)

Akash Kundu

Martin Bernstorff

Shitao Xiao

Akshita Sukhlecha

Bhavish Pahwa

Rafał Poświata

Kranthi Kiran GV

Shawon Ashraf

Daniel Auras

Björn Plüster

Jan Philipp Harries

Loïc Magne

Isabelle Mohr

Mariya Hendriksen

Dawei Zhu

Hippolyte Gisserot-Boukhlef

Tom Aarsen

Jan Kostkan

Konrad Wojtasik

Taemin Lee

Marek Suppa

Crystina Zhang

Roberta Rocca

Mohammed Hamdy

Andrianos Michail

John Yang

Manuel Faysse

Aleksei Vatolin

Nandan Thakur

Manan Dey

Dipam Vasani

Pranjal A Chitale

Simone Tedeschi

Nguyen Tai

Artem Snegirev

Michael Günther

Mengzhou Xia

Weijia Shi

Xing Han Lu

Jordan Clive

Gayatri K

Maksimova Anna

Silvan Wehrli

Maria Tikhonova

Henil Shalin Panchal

Aleksandr Abramov

Malte Ostendorff

Zheng Liu

Simon Clematide

Lester James Validad Miranda

Alena Fenogenova

Guangyu Song

Ruqiya Bin Safi

Wen-Ding Li

Alessia Borghini

Federico Cassano

Hongjin Su

Jimmy Lin

Howard Yen

Lasse Hansen

Sara Hooker

Chenghao Xiao

Vaibhav Adlakha

Orion Weller

Niklas Muennighoff

Text embeddings are typically evaluated on a narrow set of tasks, limited in terms of languages, domains, and task types. To circumvent this… (see more) limitation and to provide a more comprehensive evaluation, we introduce the Massive Multilingual Text Embedding Benchmark (MMTEB) -- a large-scale community-driven initiative expanding MTEB to over 500 quality-controlled evaluation tasks across 1,000+ languages. MMTEB includes a wide range of challenging novel tasks such as instruction following, long-document retrieval, and code retrieval, and represents the largest multilingual collection of evaluation tasks for embedding models to date. We use this collection to construct multiple highly multilingual benchmarks. We evaluate a representative set of models on these benchmarks. Our findings indicate that, while LLM-based models can achieve state-of-the-art performance on a subset of languages, the best-performing publicly available model across languages is the notably smaller, multilingual-e5-large-instruct. Massive benchmarks often impose high computational demands, limiting accessibility, particularly for low-resource communities. To address this, we downsample tasks based on inter-task correlation (i.e., selecting only a diverse set of tasks) while preserving relative rankings. We further optimize tasks such as retrieval by sampling hard negatives, creating smaller but effective splits. These optimizations allow us to introduce benchmarks at a significantly lower computational cost. For instance, we introduce a new zero-shot English benchmark that maintains a similar ordering at a fraction of the cost.

2025-01-22

ICLR.cc/2025/Conference (poster)

GitChameleon: Unmasking the Version-Switching Capabilities of Code Generation Models

Nizar Islah

Justine Gehring

Eilif Benjamin Muller

Terry Yue Zhuo

Massimo Caccia

2024-11-01

arXiv (published)

Challenging Common Assumptions about Catastrophic Forgetting and Knowledge Accumulation

Timothee LESORT

Pau Rodriguez

Md Rifat Arefin

2023-11-20

Proceedings of The 2nd Conference on Lifelong Learning Agents (published)

proceedings.mlr.press

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

Aarohi Srivastava

Abhinav Rastogi

Abhishek Rao

Abu Awal Md Shoeb

Abubakar Abid

Adam Fisch

Adam R. Brown

Adam Santoro

Aditya Gupta

Adrià Garriga-Alonso

Agnieszka Kluska

Aitor Lewkowycz

Akshat Agarwal

Alethea Power

Alex Ray

Alex Warstadt

Alexander W. Kocurek

Ali Safaya

Ali Tazarv

Alice Xiang … (see 432 more)

Alicia Parrish

Allen Nie

Aman Hussain

Amanda Askell

Amanda Dsouza

Ambrose Slone

Ameet Rahane

Anantharaman S. Iyer

Anders Johan Andreassen

Andrea Madotto

Andrea Santilli

Andreas Stuhlmüller

Andrew M. Dai

Andrew La

Andrew Lampinen

Andy Zou

Angela Jiang

Angelica Chen

Anh Vuong

Animesh Gupta

Anna Gottardi

Antonio Norelli

Anu Venkatesh

Arash Gholamidavoodi

Arfa Tabassum

Arul Menezes

Arun Kirubarajan

Asher Mullokandov

Ashish Sabharwal

Austin Herrick

Avia Efrat

Aykut Erdem

Ayla Karakaş

B. Ryan Roberts

Bao Sheng Loe

Barret Zoph

Bartłomiej Bojanowski

Batuhan Özyurt

Behnam Hedayatnia

Behnam Neyshabur

Benjamin Inden

Benno Stein

Berk Ekmekci

Bill Yuchen Lin

Blake Howald

Bryan Orinion

Cameron Diao

Cameron Dour

Catherine Stinson

Cedrick Argueta

Cesar Ferri

Chandan Singh

Charles Rathkopf

Chenlin Meng

Chitta Baral

Chiyu Wu

Chris Callison-Burch

Christopher Waites

Christian Voigt

Christopher D Manning

Christopher Potts

Cindy Ramirez

Clara E. Rivera

Clemencia Siro

Colin Raffel

Courtney Ashcraft

Cristina Garbacea

Damien Sileo

Dan Garrette

Dan Hendrycks

Dan Kilman

Dan Roth

C. Daniel Freeman

Daniel Khashabi

Daniel Levy

Daniel Moseguí González

Danielle Perszyk

Danny Hernandez

Danqi Chen

Daphne Ippolito

Dar Gilboa

David Dohan

David Drakard

David Jurgens

Debajyoti Datta

Deep Ganguli

Denis Emelin

Denis Kleyko

Deniz Yuret

Derek Chen

Derek Tam

Dieuwke Hupkes

Dilyar Buzan

Dimitri Coelho Mollo

Diyi Yang

Dong-Ho Lee

Dylan Schrader

Ekaterina Shutova

Ekin Dogus Cubuk

Elad Segal

Eleanor Hagerman

Elizabeth Barnes

Elizabeth Donoway

Ellie Pavlick

Emanuele Rodolá

Emma Lam

Eric Chu

Eric Tang

Erkut Erdem

Ernie Chang

Ethan A Chi

Ethan Dyer

Ethan Jerzak

Ethan Kim

Eunice Engefu Manyasi

Evgenii Zheltonozhskii

Fanyue Xia

Fatemeh Siar

Fernando Martínez-Plumed

Francesca Happé

Francois Chollet

Frieda Rong

Gaurav Mishra

Genta Indra Winata

Gerard de Melo

Germán Kruszewski

Giambattista Parascandolo

Giorgio Mariani

Gloria Xinyue Wang

Gonzalo Jaimovitch-Lopez

Gregor Betz

Guy Gur-Ari

Hana Galijasevic

Hannah Kim

Hannah Rashkin

Hannaneh Hajishirzi

Harsh Mehta

Hayden Bogar

Henry Shevlin

Henry Francis Anthony Shevlin

Hinrich Schuetze

Hiromu Yakura

Hongming Zhang

Hugh Mee Wong

Ian Ng

Isaac Noble

Jaap Jumelet

Jack Geissinger

Jackson Kernion

Jacob Hilton

Jaehoon Lee

Jaime Fernández Fisac

James B Simon

James Koppel

James Zheng

James Zou

Jan Kocon

Jana Thompson

Janelle Wingfield

Jared Kaplan

Jarema Radom

Jascha Sohl-Dickstein

Jason Phang

Jason Wei

Jason Yosinski

Jekaterina Novikova

Jelle Bosscher

Jennifer Marsh

Jeremy Kim

Jeroen Taal

Jesse Engel

Jesujoba Oluwadara Alabi

Jiacheng Xu

Jiaming Song

Jillian Tang

Joan Waweru

John Burden

John Miller

John U. Balis

Jonathan Batchelder

Jonathan Berant

Jörg Frohberg

Jos Rozen

Jose Hernandez-Orallo

Joseph Boudeman

Joseph Guerr

Joseph Jones

Joshua B. Tenenbaum

Joshua S. Rule

Joyce Chua

Kamil Kanclerz

Karen Livescu

Karl Krauth

Karthik Gopalakrishnan

Katerina Ignatyeva

Katja Markert

Kaustubh Dhole

Kevin Gimpel

Kevin Omondi

Kory Wallace Mathewson

Kristen Chiafullo

Ksenia Shkaruta

Kumar Shridhar

Kyle McDonell

Kyle Richardson

Laria Reynolds

Leo Gao

Ling Zhang

Liam Dugan

Lianhui Qin

Lidia Contreras-Ochando

Louis-Philippe Morency

Luca Moschella

Lucas Lam

Lucy Noble

Ludwig Schmidt

Luheng He

Luis Oliveros-Colón

Luke Metz

Lütfi Kerem Senel

Maarten Bosma

Maarten Sap

Maartje Ter Hoeve

Maheen Farooqi

Manaal Faruqui

Mantas Mazeika

Marco Baturan

Marco Marelli

Marco Maru

Maria Jose Ramirez-Quintana

Marie Tolkiehn

Mario Giulianelli

Martha Lewis

Martin Potthast

Matthew L Leavitt

Matthias Hagen

Mátyás Schubert

Medina Orduna Baitemirova

Melody Arnaud

Melvin McElrath

Michael Andrew Yee

Michael Cohen

Michael Gu

Michael Ivanitskiy

Michael Starritt

Michael Strube

Michał Swędrowski

Michele Bevilacqua

Michihiro Yasunaga

Mihir Kale

Mike Cain

Mimee Xu

Mirac Suzgun

Mitch Walker

Mo Tiwari

Mohit Bansal

Moin Aminnaseri

Mor Geva

Mozhdeh Gheini

Mukund Varma T

Nanyun Peng

Nathan Andrew Chi

Nayeon Lee

Neta Gur-Ari Krakover

Nicholas Cameron

Nicholas Roberts

Nick Doiron

Nicole Martinez

Nikita Nangia

Niklas Deckers

Niklas Muennighoff

Nitish Shirish Keskar

Niveditha S. Iyer

Noah Constant

Noah Fiedel

Nuan Wen

Oliver Zhang

Omar Agha

Omar Elbaghdadi

Omer Levy

Owain Evans

Pablo Antonio Moreno Casares

Parth Doshi

Pascale Fung

Paul Pu Liang

Paul Vicol

Pegah Alipoormolabashi

Peiyuan Liao

Percy Liang

Peter W Chang

Peter Eckersley

Phu Mon Htut

Pinyu Hwang

Pi-Bei Hwang

Piotr Miłkowski

Piyush Patil

Pouya Pezeshkpour

Priti Oli

Qiaozhu Mei

Qing Lyu

Qinlang Chen

Rabin Banjade

Rachel Etta Rudolph

Raefer Gabriel

Rahel Habacker

Ramon Risco

Raphaël Millière

Rhythm Garg

Richard Barnes

Rif A. Saurous

Riku Arakawa

Robbe Raymaekers

Robert Frank

Rohan Sikand

Roman Novak

Roman Sitelew

Ronan Le Bras

Rosanne Liu

Rowan Jacobs

Rui Zhang

Russ Salakhutdinov

Ryan Andrew Chi

Seungjae Ryan Lee

Ryan Stovall

Ryan Teehan

Rylan Yang

Sahib Singh

Saif Mohammad

Sajant Anand

Sam Dillavou

Sam Shleifer

Sam Wiseman

Samuel Gruetter

Samuel R. Bowman

Samuel Stern Schoenholz

Sanghyun Han

Sanjeev Kwatra

Sarah A. Rous

Sarik Ghazarian

Sayan Ghosh

Sean Casey

Sebastian Bischoff

Sebastian Gehrmann

Sebastian Schuster

Sepideh Sadeghi

Shadi Hamdan

Sharon Zhou

Shashank Srivastava

Sherry Shi

Shikhar Singh

Shima Asaadi

Shixiang Shane Gu

Shubh Pachchigar

Shubham Toshniwal

Shyam Upadhyay

Shyamolima Shammie Debnath

Siamak Shakeri

Simon Thormeyer

Simone Melzi

Sneha Priscilla Makini

Soo-Hwan Lee

Spencer Torene

Sriharsha Hatwar

Stanislas Dehaene

Stefan Divic

Stefano Ermon

Stella Biderman

Stephanie Lin

Stephen Prasad

Steven Piantadosi

Stuart Shieber

Summer Misherghi

Svetlana Kiritchenko

Swaroop Mishra

Tal Linzen

Tal Schuster

Tao Li

Tao Yu

Tariq Ali

Tatsunori Hashimoto

Te-Lin Wu

Théo Desbordes

Theodore Rothschild

Thomas Phan

Tianle Wang

Tiberius Nkinyili

Timo Schick

Timofei Kornev

Titus Tunduny

Tobias Gerstenberg

Trenton Chang

Trishala Neeraj

Tushar Khot

Tyler Shultz

Uri Shaham

Vedant Misra

Vera Demberg

Victoria Nyamai

Vikas Raunak

Vinay Venkatesh Ramasesh

vinay uday prabhu

Vishakh Padmakumar

Vivek Srikumar

William Fedus

William Saunders

William Zhang

Wout Vossen

Xiang Ren

Xiaoyu Tong

Xinran Zhao

Xinyi Wu

Xudong Shen

Yadollah Yaghoobzadeh

Yair Lakretz

Yangqiu Song

Yasaman Bahri

Yejin Choi

Yichi Yang

Sophie Hao

Yiding Hao

Yifu Chen

Yonatan Belinkov

Yufang Hou

Yuntao Bai

Zachary Seid

Zhuoye Zhao

Zijian Wang

Zijie J. Wang

Zirui Wang

Ziyi Wu

2023-05-11

TMLR (accepted)

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

Aarohi Srivastava

Abhinav Rastogi

Abhishek Rao

Abu Awal Md Shoeb

Abubakar Abid

Adam Fisch

Adam R. Brown

Adam Santoro

Aditya Gupta

Adrià Garriga-Alonso

Agnieszka Kluska

Aitor Lewkowycz

Akshat Agarwal

Alethea Power

Alex Ray

Alex Warstadt

Alexander W. Kocurek

Ali Safaya

Ali Tazarv

Alice Xiang … (see 432 more)

Alicia Parrish

Allen Nie

Aman Hussain

Amanda Askell

Amanda Dsouza

Ambrose Slone

Ameet Rahane

Anantharaman S. Iyer

Anders Johan Andreassen

Andrea Madotto

Andrea Santilli

Andreas Stuhlmüller

Andrew M. Dai

Andrew La

Andrew Lampinen

Andy Zou

Angela Jiang

Angelica Chen

Anh Vuong

Animesh Gupta

Anna Gottardi

Antonio Norelli

Anu Venkatesh

Arash Gholamidavoodi

Arfa Tabassum

Arul Menezes

Arun Kirubarajan

Asher Mullokandov

Ashish Sabharwal

Austin Herrick

Avia Efrat

Aykut Erdem

Ayla Karakaş

B. Ryan Roberts

Bao Sheng Loe

Barret Zoph

Bartłomiej Bojanowski

Batuhan Özyurt

Behnam Hedayatnia

Behnam Neyshabur

Benjamin Inden

Benno Stein

Berk Ekmekci

Bill Yuchen Lin

Blake Howald

Bryan Orinion

Cameron Diao

Cameron Dour

Catherine Stinson

Cedrick Argueta

Cesar Ferri

Chandan Singh

Charles Rathkopf

Chenlin Meng

Chitta Baral

Chiyu Wu

Chris Callison-Burch

Christopher Waites

Christian Voigt

Christopher D Manning

Christopher Potts

Cindy Ramirez

Clara E. Rivera

Clemencia Siro

Colin Raffel

Courtney Ashcraft

Cristina Garbacea

Damien Sileo

Dan Garrette

Dan Hendrycks

Dan Kilman

Dan Roth

C. Daniel Freeman

Daniel Khashabi

Daniel Levy

Daniel Moseguí González

Danielle Perszyk

Danny Hernandez

Danqi Chen

Daphne Ippolito

Dar Gilboa

David Dohan

David Drakard

David Jurgens

Debajyoti Datta

Deep Ganguli

Denis Emelin

Denis Kleyko

Deniz Yuret

Derek Chen

Derek Tam

Dieuwke Hupkes

Dilyar Buzan

Dimitri Coelho Mollo

Diyi Yang

Dong-Ho Lee

Dylan Schrader

Ekaterina Shutova

Ekin Dogus Cubuk

Elad Segal

Eleanor Hagerman

Elizabeth Barnes

Elizabeth Donoway

Ellie Pavlick

Emanuele Rodolá

Emma Lam

Eric Chu

Eric Tang

Erkut Erdem

Ernie Chang

Ethan A Chi

Ethan Dyer

Ethan Jerzak

Ethan Kim

Eunice Engefu Manyasi

Evgenii Zheltonozhskii

Fanyue Xia

Fatemeh Siar

Fernando Martínez-Plumed

Francesca Happé

Francois Chollet

Frieda Rong

Gaurav Mishra

Genta Indra Winata

Gerard de Melo

Germán Kruszewski

Giambattista Parascandolo

Giorgio Mariani

Gloria Xinyue Wang

Gonzalo Jaimovitch-Lopez

Gregor Betz

Guy Gur-Ari

Hana Galijasevic

Hannah Kim

Hannah Rashkin

Hannaneh Hajishirzi

Harsh Mehta

Hayden Bogar

Henry Francis Anthony Shevlin

Hinrich Schuetze

Hiromu Yakura

Hongming Zhang

Hugh Mee Wong

Ian Ng

Isaac Noble

Jaap Jumelet

Jack Geissinger

Jackson Kernion

Jacob Hilton

Jaehoon Lee

Jaime Fernández Fisac

James B Simon

James Koppel

James Zheng

James Zou

Jan Kocon

Jana Thompson

Janelle Wingfield

Jared Kaplan

Jarema Radom

Jascha Sohl-Dickstein

Jason Phang

Jason Wei

Jason Yosinski

Jekaterina Novikova

Jelle Bosscher

Jennifer Marsh

Jeremy Kim

Jeroen Taal

Jesse Engel

Jesujoba Oluwadara Alabi

Jiacheng Xu

Jiaming Song

Jillian Tang

Joan Waweru

John Burden

John Miller

John U. Balis

Jonathan Batchelder

Jonathan Berant

Jörg Frohberg

Jos Rozen

Jose Hernandez-Orallo

Joseph Boudeman

Joseph Guerr

Joseph Jones

Joshua B. Tenenbaum

Joshua S. Rule

Joyce Chua

Joyce Hui Ping Chua

Kamil Kanclerz

Karen Livescu

Karl Krauth

Karthik Gopalakrishnan

Katerina Ignatyeva

Katja Markert

Kaustubh Dhole

Kevin Gimpel

Kevin Omondi

Kory Wallace Mathewson

Kristen Chiafullo

Ksenia Shkaruta

Kumar Shridhar

Kyle McDonell

Kyle Richardson

Laria Reynolds

Leo Gao

Li Zhang

Liam Dugan

Lianhui Qin

Lidia Contreras-Ochando

Louis-Philippe Morency

Luca Moschella

Lucas Lam

Lucy Noble

Ludwig Schmidt

Luheng He

Luis Oliveros-Colón

Luke Metz

Lütfi Kerem Senel

Maarten Bosma

Maarten Sap

Maartje Ter Hoeve

Maheen Farooqi

Manaal Faruqui

Mantas Mazeika

Marco Baturan

Marco Marelli

Marco Maru

Maria Jose Ramirez-Quintana

Marie Tolkiehn

Mario Giulianelli

Martha Lewis

Martin Potthast

Matthew L Leavitt

Matthias Hagen

Mátyás Schubert

Medina Orduna Baitemirova

Melody Arnaud

Melvin McElrath

Michael Andrew Yee

Michael Cohen

Michael Gu

Michael Ivanitskiy

Michael Starritt

Michael Strube

Michał Swędrowski

Michele Bevilacqua

Michihiro Yasunaga

Mihir Kale

Mike Cain

Mimee Xu

Mirac Suzgun

Mitch Walker

Mo Tiwari

Mohit Bansal

Moin Aminnaseri

Mor Geva

Mozhdeh Gheini

Mukund Varma T

Nanyun Peng

Nathan Andrew Chi

Nayeon Lee

Neta Gur-Ari Krakover

Nicholas Cameron

Nicholas Roberts

Nick Doiron

Nicole Martinez

Nikita Nangia

Niklas Deckers

Niklas Muennighoff

Nitish Shirish Keskar

Niveditha S. Iyer

Noah Constant

Noah Fiedel

Nuan Wen

Oliver Zhang

Omar Agha

Omar Elbaghdadi

Omer Levy

Owain Evans

Pablo Antonio Moreno Casares

Parth Doshi

Pascale Fung

Paul Pu Liang

Paul Vicol

Pegah Alipoormolabashi

Peiyuan Liao

Percy Liang

Peter W Chang

Peter Eckersley

Phu Mon Htut

Pinyu Hwang

Pi-Bei Hwang

Piotr Miłkowski

Piyush Patil

Pouya Pezeshkpour

Priti Oli

Qiaozhu Mei

Qing Lyu

Qinlang Chen

Rabin Banjade

Rachel Etta Rudolph

Raefer Gabriel

Rahel Habacker

Ramon Risco

Raphaël Millière

Rhythm Garg

Richard Barnes

Rif A. Saurous

Riku Arakawa

Robbe Raymaekers

Robert Frank

Rohan Sikand

Roman Novak

Roman Sitelew

Ronan Le Bras

Rosanne Liu

Rowan Jacobs

Rui Zhang

Russ Salakhutdinov

Ryan Andrew Chi

Seungjae Ryan Lee

Ryan Stovall

Ryan Teehan

Rylan Yang

Sahib Singh

Saif Mohammad

Sajant Anand

Sam Dillavou

Sam Shleifer

Sam Wiseman

Samuel Gruetter

Samuel R. Bowman

Samuel Stern Schoenholz

Sanghyun Han

Sanjeev Kwatra

Sarah A. Rous

Sarik Ghazarian

Sayan Ghosh

Sean Casey

Sebastian Bischoff

Sebastian Gehrmann

Sebastian Schuster

Sepideh Sadeghi

Shadi Hamdan

Sharon Zhou

Shashank Srivastava

Sherry Shi

Shikhar Singh

Shima Asaadi

Shixiang Shane Gu

Shubh Pachchigar

Shubham Toshniwal

Shyam Upadhyay

Shyamolima Shammie Debnath

Siamak Shakeri

Simon Thormeyer

Simone Melzi

Sneha Priscilla Makini

Soo-Hwan Lee

Spencer Torene

Sriharsha Hatwar

Stanislas Dehaene

Stefan Divic

Stefano Ermon

Stella Biderman

Stephanie Lin

Stephen Prasad

Steven Piantadosi

Stuart Shieber

Summer Misherghi

Svetlana Kiritchenko

Swaroop Mishra

Tal Linzen

Tal Schuster

Tao Li

Tao Yu

Tariq Ali

Tatsunori Hashimoto

Te-Lin Wu

Théo Desbordes

Theodore Rothschild

Thomas Phan

Tianle Wang

Tiberius Nkinyili

Timo Schick

Timofei Kornev

Titus Tunduny

Tobias Gerstenberg

Trenton Chang

Trishala Neeraj

Tushar Khot

Tyler Shultz

Uri Shaham

Vedant Misra

Vera Demberg

Victoria Nyamai

Vikas Raunak

Vinay Venkatesh Ramasesh

vinay uday prabhu

Vishakh Padmakumar

Vivek Srikumar

William Fedus

William Saunders

William Zhang

Wout Vossen

Xiang Ren

Xiaoyu Tong

Xinran Zhao

Xinyi Wu

Xudong Shen

Yadollah Yaghoobzadeh

Yair Lakretz

Yangqiu Song

Yasaman Bahri

Yejin Choi

Yichi Yang

Yiding Hao

Yifu Chen

Yonatan Belinkov

Yu Hou

Yufang Hou

Yuntao Bai

Zachary Seid

Zhuoye Zhao

Zijian Wang

Zijie J. Wang

Zirui Wang

Ziyi Wu

Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially … (see more)transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG- bench). BIG-bench currently consists of 204 tasks, contributed by 450 authors across 132 institutions. Task topics are diverse, drawing problems from linguistics, childhood develop- ment, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI's GPT models, Google- internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters. In addition, a team of human expert raters performed all tasks in order to provide a strong baseline. Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit "breakthrough" behavior at a critical scale often involve multiple steps or components, or brittle metrics; social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting.

2023-05-11

TMLR (accepted)

APP: Anytime Progressive Pruning

Bharat Runwal

Tianlong Chen

Zhangyang Wang

With the latest advances in deep learning, several methods have been investigated for optimal learning settings in scenarios where the data … (see more)stream is continuous over time. However, training sparse networks in such settings has often been overlooked. In this paper, we explore the problem of training a neural network with a target sparsity in a particular case of online learning: the anytime learning at macroscale paradigm (ALMA). We propose a novel way of progressive pruning, referred to as \textit{Anytime Progressive Pruning} (APP); the proposed approach significantly outperforms the baseline dense and Anytime OSP models across multiple architectures and datasets under short, moderate, and long-sequence training. Our method, for example, shows an improvement in accuracy of

2022-11-17

ACML.org/2022/Workshop/CLL (published)