
Manan Dey

Alumni

Publications

MMTEB: Massive Multilingual Text Embedding Benchmark
Kenneth Enevoldsen
Isaac Chung
Márton Kardos
Ashwin Mathur
David Stap
Wissam Siblini
Dominik Krzemiński
Genta Indra Winata
Saba Sturua
Saiteja Utpala
Mathieu Ciancone
Marion Schaeffer
Gabriel Sequeira
Shreeya Dhakal
Jonathan Rystrøm
Roman Solomatin
Ömer Veysel Çağatan
Akash Kundu
Martin Bernstorff
Shitao Xiao
Akshita Sukhlecha
Bhavish Pahwa
Rafał Poświata
Kranthi Kiran GV
Shawon Ashraf
Daniel Auras
Björn Plüster
Jan Philipp Harries
Loïc Magne
Isabelle Mohr
Mariya Hendriksen
Dawei Zhu
Hippolyte Gisserot-Boukhlef
Tom Aarsen
Jan Kostkan
Konrad Wojtasik
Taemin Lee
Marek Suppa
Crystina Zhang
Roberta Rocca
Mohammed Hamdy
Andrianos Michail
John Yang
Manuel Faysse
Aleksei Vatolin
Nandan Thakur
Dipam Vasani
Pranjal A Chitale
Simone Tedeschi
Nguyen Tai
Artem Snegirev
Michael Günther
Mengzhou Xia
Weijia Shi
Jordan Clive
Gayatri K
Maksimova Anna
Silvan Wehrli
Maria Tikhonova
Henil Shalin Panchal
Aleksandr Abramov
Malte Ostendorff
Zheng Liu
Simon Clematide
Lester James Validad Miranda
Alena Fenogenova
Guangyu Song
Ruqiya Bin Safi
Wen-Ding Li
Alessia Borghini
Federico Cassano
Hongjin Su
Jimmy Lin
Howard Yen
Lasse Hansen
Sara Hooker
Chenghao Xiao
Orion Weller
Niklas Muennighoff
Text embeddings are typically evaluated on a limited set of tasks, constrained by language, domain, and task diversity. To address these limitations and provide a more comprehensive evaluation, we introduce the Massive Multilingual Text Embedding Benchmark (MMTEB), a large-scale, community-driven expansion of MTEB covering over 500 quality-controlled evaluation tasks across 250+ languages. MMTEB includes a diverse set of challenging, novel tasks such as instruction following, long-document retrieval, and code retrieval, representing the largest multilingual collection of evaluation tasks for embedding models to date. Using this collection, we develop several highly multilingual benchmarks, which we use to evaluate a representative set of models. We find that while large language models (LLMs) with billions of parameters can achieve state-of-the-art performance on certain language subsets and task categories, the best-performing publicly available model is multilingual-e5-large-instruct with only 560 million parameters. To facilitate accessibility and reduce computational cost, we introduce a novel downsampling method based on inter-task correlation, ensuring a diverse selection while preserving relative model rankings. Furthermore, we optimize tasks such as retrieval by sampling hard negatives, creating smaller but effective splits. These optimizations allow us to introduce benchmarks that drastically reduce computational demands. For instance, our newly introduced zero-shot English benchmark maintains a ranking order similar to the full-scale version at a fraction of the computational cost.
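The inter-task correlation downsampling mentioned in the abstract can be illustrated with a minimal sketch: given a matrix of model scores per task, greedily drop the task that is most correlated with another remaining task, as long as the aggregate model ranking stays close to the full-benchmark ranking. The selection criterion, stopping rule, and threshold below are illustrative assumptions, not the exact MMTEB procedure.

```python
# Minimal sketch of correlation-based task downsampling (assumptions noted above).
import numpy as np
from scipy.stats import spearmanr

def downsample_tasks(scores, keep, min_rank_corr=0.95):
    """Greedily drop redundant tasks from a (n_models, n_tasks) score matrix
    while the model ranking on the reduced set stays close to the full ranking."""
    n_models, n_tasks = scores.shape
    full_ranking = scores.mean(axis=1)          # mean score per model on the full benchmark
    remaining = list(range(n_tasks))

    while len(remaining) > keep:
        sub = scores[:, remaining]
        rho, _ = spearmanr(sub)                 # (k, k) Spearman correlations between task columns
        np.fill_diagonal(rho, -np.inf)
        # Candidate for removal: the task most strongly correlated with another remaining task.
        candidate = remaining[int(np.argmax(rho.max(axis=1)))]
        trial = [t for t in remaining if t != candidate]
        trial_ranking = scores[:, trial].mean(axis=1)
        rank_corr, _ = spearmanr(full_ranking, trial_ranking)
        if rank_corr < min_rank_corr:
            break                               # dropping this task would reorder models too much
        remaining = trial
    return remaining

# Example with synthetic scores for 10 models on 40 hypothetical tasks.
rng = np.random.default_rng(0)
scores = rng.uniform(0.2, 0.9, size=(10, 40))
print(downsample_tasks(scores, keep=15))
```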
On the Analysis and Distillation of Emergent Outlier Properties in Pre-trained Language Models
Tianyang Zhao
Kunwar Yashraj Singh
Srikar Appalaraju
Peng Tang
Ying Nian Wu
Li Erran Li
A small subset of dimensions within language Transformers' representation spaces emerge as "outliers" during pretraining, encoding critical knowledge sparsely. We extend previous findings on emergent outliers to Encoder-Decoder Transformers and instruction-finetuned models, and tackle the problem of distilling a student Transformer from a larger teacher Transformer. Knowledge distillation reduces model size and cost by transferring knowledge from a larger teacher to a smaller student, necessitating a trade-off among representation dimensions. We show that emergent outlier dimensions contribute significantly more to zero-shot performance than non-outlier dimensions. Based on this, we propose the Emergent Outlier Focused Distillation (EOFD) method, which prioritizes critical outlier dimensions in distillation using a weighted MSE loss. We empirically demonstrate that EOFD outperforms state-of-the-art distillation methods and generalizes well across Encoder-only BERT, Decoder-only GPT-2, and Encoder-Decoder T5 architectures.
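A minimal sketch of the weighted-MSE idea behind EOFD, assuming outlier dimensions are flagged by unusually large mean absolute activations in the teacher and upweighted by a constant factor; both the detection rule and the weight value are placeholders, not the paper's exact formulation.

```python
# Hedged sketch of a weighted MSE distillation loss that upweights "outlier"
# hidden dimensions; detection rule and weight are illustrative assumptions.
import torch

def find_outlier_dims(teacher_hidden: torch.Tensor, z_thresh: float = 3.0) -> torch.Tensor:
    # teacher_hidden: (batch, seq_len, hidden_dim)
    # Flag dimensions whose mean |activation| sits far above the layer average.
    mag = teacher_hidden.abs().mean(dim=(0, 1))
    z = (mag - mag.mean()) / (mag.std() + 1e-8)
    return z > z_thresh

def eofd_style_loss(student_hidden: torch.Tensor,
                    teacher_hidden: torch.Tensor,
                    outlier_mask: torch.Tensor,
                    outlier_weight: float = 10.0) -> torch.Tensor:
    # Assumes student and teacher hidden sizes already match (e.g. via a
    # learned projection on the student side, not shown here).
    weights = torch.ones(teacher_hidden.shape[-1], device=teacher_hidden.device)
    weights[outlier_mask] = outlier_weight
    per_dim_sq_err = (student_hidden - teacher_hidden) ** 2
    return (per_dim_sq_err * weights).mean()

# Example with random tensors standing in for matched hidden states.
t = torch.randn(8, 128, 768)
s = torch.randn(8, 128, 768, requires_grad=True)
loss = eofd_style_loss(s, t, find_outlier_dims(t))
loss.backward()
```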
StarCoder 2 and The Stack v2: The Next Generation
Anton Lozhkov
Loubna Ben allal
Federico Cassano
Joel Lamy-Poirier
Nouamane Tazi
Ao Tang
Dmytro Pykhtar
Jiawei Liu
Yuxiang Wei
Tianyang Liu
Max Tian
Denis Kocetkov
Arthur Zucker
Younes Belkada
Zijian Wang
Qian Liu
Dmitry Abulkhanov
Indraneil Paul
Zhuang Li
Wen-Ding Li
Megan L. Risdal
Jia LI
Jian Zhu
Terry Yue Zhuo
Evgenii Zheltonozhskii
Nii Osae Osae Dade
Wenhao Yu
Lucas Krauss
Naman Jain
Yixuan Su
Xuanli He
Edoardo Abati
Yekun Chai
Niklas Muennighoff
Xiangru Tang
Muhtasham Oblokulov
Christopher Akiki
Marc Marone
Chenghao Mou
Mayank Mishra
Alex Gu
Binyuan Hui
Tri Dao
Armel Zebaze
Olivier Dehaene
Nicolas Patry
Canwen Xu
Julian McAuley
Han Hu
Torsten Scholak
Sebastien Paquet
Jennifer Robinson
Carolyn Jane Anderson
Md. Mostofa Ali Patwary
Nima Tajbakhsh
Yacine Jernite
Carlos Muñoz Ferrandis
Lingming Zhang
Sean Hughes
Thomas Wolf
Arjun Guha
Leandro Von Werra
The BigCode project, an open-scientific collaboration focused on the responsible development of Large Language Models for Code (Code LLMs), introduces StarCoder2. In partnership with Software Heritage (SWH), we build The Stack v2 on top of the digital commons of their source code archive. Alongside the SWH repositories spanning 619 programming languages, we carefully select other high-quality data sources, such as GitHub pull requests, Kaggle notebooks, and code documentation. This results in a training set that is 4x larger than the first StarCoder dataset. We train StarCoder2 models with 3B, 7B, and 15B parameters on 3.3 to 4.3 trillion tokens and thoroughly evaluate them on a comprehensive set of Code LLM benchmarks. We find that our small model, StarCoder2-3B, outperforms other Code LLMs of similar size on most benchmarks, and also outperforms StarCoderBase-15B. Our large model, StarCoder2-15B, significantly outperforms other models of comparable size. In addition, it matches or outperforms CodeLlama-34B, a model more than twice its size. Although DeepSeekCoder-33B is the best-performing model at code completion for high-resource languages, we find that StarCoder2-15B outperforms it on math and code reasoning benchmarks, as well as on several low-resource languages. We make the model weights available under an OpenRAIL license and ensure full transparency regarding the training data by releasing the SoftWare Heritage persistent IDentifiers (SWHIDs) of the source code data.
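A minimal usage sketch for one of the released checkpoints via Hugging Face transformers; the repository id below is assumed from the model family's naming and should be checked against the official BigCode release and its OpenRAIL license terms.

```python
# Sketch: greedy code completion with a StarCoder2 checkpoint (assumed repo id).
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder2-3b"   # assumed id for the 3B model; verify before use
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```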