
Manan Dey

Alumni

Publications

MMTEB: Massive Multilingual Text Embedding Benchmark
Kenneth Enevoldsen
Isaac Chung
Márton Kardos
Ashwin Mathur
David Stap
Wissam Siblini
Dominik Krzemiński
Genta Indra Winata
Saba Sturua
Saiteja Utpala
Mathieu Ciancone
Marion Schaeffer
Gabriel Sequeira
Shreeya Dhakal
Jonathan Rystrøm
Roman Solomatin
Ömer Veysel Çağatan
Akash Kundu
Martin Bernstorff
Shitao Xiao
Akshita Sukhlecha
Bhavish Pahwa
Rafał Poświata
Kranthi Kiran GV
Shawon Ashraf
Daniel Auras
Björn Plüster
Jan Philipp Harries
Loïc Magne
Isabelle Mohr
Mariya Hendriksen
Dawei Zhu
Hippolyte Gisserot-Boukhlef
Tom Aarsen
Jan Kostkan
Konrad Wojtasik
Taemin Lee
Marek Suppa
Crystina Zhang
Roberta Rocca
Mohammed Hamdy
Andrianos Michail
John Yang
Manuel Faysse
Aleksei Vatolin
Nandan Thakur
Dipam Vasani
Pranjal A Chitale
Simone Tedeschi
Nguyen Tai
Artem Snegirev
Michael Günther
Mengzhou Xia
Weijia Shi
Jordan Clive
Gayatri K
Maksimova Anna
Silvan Wehrli
Maria Tikhonova
Henil Shalin Panchal
Aleksandr Abramov
Malte Ostendorff
Zheng Liu
Simon Clematide
Lester James Validad Miranda
Alena Fenogenova
Guangyu Song
Ruqiya Bin Safi
Wen-Ding Li
Alessia Borghini
Federico Cassano
Hongjin Su
Jimmy Lin
Howard Yen
Lasse Hansen
Sara Hooker
Chenghao Xiao
Orion Weller
Niklas Muennighoff
Text embeddings are typically evaluated on a limited set of tasks, constrained by language, domain, and task diversity. To address these limitations and provide a more comprehensive evaluation, we introduce the Massive Multilingual Text Embedding Benchmark (MMTEB), a large-scale, community-driven expansion of MTEB covering over 500 quality-controlled evaluation tasks across 250+ languages. MMTEB includes a diverse set of challenging, novel tasks such as instruction following, long-document retrieval, and code retrieval, representing the largest multilingual collection of evaluation tasks for embedding models to date. Using this collection, we develop several highly multilingual benchmarks, which we use to evaluate a representative set of models. We find that while large language models (LLMs) with billions of parameters can achieve state-of-the-art performance on certain language subsets and task categories, the best-performing publicly available model is multilingual-e5-large-instruct with only 560 million parameters. To facilitate accessibility and reduce computational cost, we introduce a novel downsampling method based on inter-task correlation, ensuring a diverse selection while preserving relative model rankings. Furthermore, we optimize tasks such as retrieval by sampling hard negatives, creating smaller but effective splits. These optimizations allow us to introduce benchmarks that drastically reduce computational demands. For instance, our newly introduced zero-shot English benchmark maintains a ranking order similar to the full-scale version at a fraction of the computational cost.
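The inter-task correlation downsampling mentioned in the abstract can be illustrated with a minimal sketch: given a matrix of model scores per task, greedily drop the task that is most correlated with another remaining task, as long as the aggregate model ranking stays close to the full-benchmark ranking. The selection criterion, stopping rule, and threshold below are illustrative assumptions, not the exact MMTEB procedure.

```python
# Minimal sketch of correlation-based task downsampling (assumptions noted above).
import numpy as np
from scipy.stats import spearmanr

def downsample_tasks(scores, keep, min_rank_corr=0.95):
    """Greedily drop redundant tasks from a (n_models, n_tasks) score matrix
    while the model ranking on the reduced set stays close to the full ranking."""
    n_models, n_tasks = scores.shape
    full_ranking = scores.mean(axis=1)          # mean score per model on the full benchmark
    remaining = list(range(n_tasks))

    while len(remaining) > keep:
        sub = scores[:, remaining]
        rho, _ = spearmanr(sub)                 # (k, k) Spearman correlations between task columns
        np.fill_diagonal(rho, -np.inf)
        # Candidate for removal: the task most strongly correlated with another remaining task.
        candidate = remaining[int(np.argmax(rho.max(axis=1)))]
        trial = [t for t in remaining if t != candidate]
        trial_ranking = scores[:, trial].mean(axis=1)
        rank_corr, _ = spearmanr(full_ranking, trial_ranking)
        if rank_corr < min_rank_corr:
            break                               # dropping this task would reorder models too much
        remaining = trial
    return remaining

# Example with synthetic scores for 10 models on 40 hypothetical tasks.
rng = np.random.default_rng(0)
scores = rng.uniform(0.2, 0.9, size=(10, 40))
print(downsample_tasks(scores, keep=15))
```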
On the Analysis and Distillation of Emergent Outlier Properties in Pre-trained Language Models
Tianyang Zhao
Kunwar Yashraj Singh
Srikar Appalaraju
Peng Tang
Ying Nian Wu
Li Erran Li
A small subset of dimensions within language Transformers' representation spaces emerge as "outliers" during pretraining, encoding critical knowledge sparsely. We extend previous findings on emergent outliers to Encoder-Decoder Transformers and instruction-finetuned models, and tackle the problem of distilling a student Transformer from a larger teacher Transformer. Knowledge distillation reduces model size and cost by transferring knowledge from a larger teacher to a smaller student, necessitating a trade-off among representation dimensions. We show that emergent outlier dimensions contribute significantly more to zero-shot performance than non-outlier dimensions. Based on this, we propose the Emergent Outlier Focused Distillation (EOFD) method, which prioritizes critical outlier dimensions in distillation using a weighted MSE loss. We empirically demonstrate that EOFD outperforms state-of-the-art distillation methods and generalizes well across Encoder-only BERT, Decoder-only GPT-2, and Encoder-Decoder T5 architectures.
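A minimal sketch of the weighted-MSE idea behind EOFD, assuming outlier dimensions are flagged by unusually large mean absolute activations in the teacher and upweighted by a constant factor; both the detection rule and the weight value are placeholders, not the paper's exact formulation.

```python
# Hedged sketch of a weighted MSE distillation loss that upweights "outlier"
# hidden dimensions; detection rule and weight are illustrative assumptions.
import torch

def find_outlier_dims(teacher_hidden: torch.Tensor, z_thresh: float = 3.0) -> torch.Tensor:
    # teacher_hidden: (batch, seq_len, hidden_dim)
    # Flag dimensions whose mean |activation| sits far above the layer average.
    mag = teacher_hidden.abs().mean(dim=(0, 1))
    z = (mag - mag.mean()) / (mag.std() + 1e-8)
    return z > z_thresh

def eofd_style_loss(student_hidden: torch.Tensor,
                    teacher_hidden: torch.Tensor,
                    outlier_mask: torch.Tensor,
                    outlier_weight: float = 10.0) -> torch.Tensor:
    # Assumes student and teacher hidden sizes already match (e.g. via a
    # learned projection on the student side, not shown here).
    weights = torch.ones(teacher_hidden.shape[-1], device=teacher_hidden.device)
    weights[outlier_mask] = outlier_weight
    per_dim_sq_err = (student_hidden - teacher_hidden) ** 2
    return (per_dim_sq_err * weights).mean()

# Example with random tensors standing in for matched hidden states.
t = torch.randn(8, 128, 768)
s = torch.randn(8, 128, 768, requires_grad=True)
loss = eofd_style_loss(s, t, find_outlier_dims(t))
loss.backward()
```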
StarCoder 2 and The Stack v2: The Next Generation
Anton Lozhkov
Loubna Ben allal
Federico Cassano
Joel Lamy-Poirier
Nouamane Tazi
Ao Tang
Dmytro Pykhtar
Jiawei Liu
Yuxiang Wei
Tianyang Liu
Max Tian
Denis Kocetkov
Arthur Zucker
Younes Belkada
Zijian Wang
Qian Liu
Dmitry Abulkhanov
Indraneil Paul
Zhuang Li
Wen-Ding Li
Megan L. Risdal
Jia LI
Jian Zhu
Terry Yue Zhuo
Evgenii Zheltonozhskii
Nii Osae Osae Dade
Wenhao Yu
Lucas Krauss
Naman Jain
Yixuan Su
Xuanli He
Edoardo Abati
Yekun Chai
Niklas Muennighoff
Xiangru Tang
Muhtasham Oblokulov
Christopher Akiki
Marc Marone
Chenghao Mou
Mayank Mishra
Alex Gu
Binyuan Hui
Tri Dao
Armel Zebaze
Olivier Dehaene
Nicolas Patry
Canwen Xu
Julian McAuley
Han Hu
Torsten Scholak
Sebastien Paquet
Jennifer Robinson
Carolyn Jane Anderson
Md. Mostofa Ali Patwary
Nima Tajbakhsh
Yacine Jernite
Carlos Muñoz Ferrandis
Lingming Zhang
Sean Hughes
Thomas Wolf
Arjun Guha
Leandro Von Werra
The BigCode project, an open-scientific collaboration focused on the responsible development of Large Language Models for Code (Code LLMs), introduces StarCoder2. In partnership with Software Heritage (SWH), we build The Stack v2 on top of the digital commons of their source code archive. Alongside the SWH repositories spanning 619 programming languages, we carefully select other high-quality data sources, such as GitHub pull requests, Kaggle notebooks, and code documentation. This results in a training set that is 4x larger than the first StarCoder dataset. We train StarCoder2 models with 3B, 7B, and 15B parameters on 3.3 to 4.3 trillion tokens and thoroughly evaluate them on a comprehensive set of Code LLM benchmarks. We find that our small model, StarCoder2-3B, outperforms other Code LLMs of similar size on most benchmarks, and also outperforms StarCoderBase-15B. Our large model, StarCoder2-15B, significantly outperforms other models of comparable size. In addition, it matches or outperforms CodeLlama-34B, a model more than twice its size. Although DeepSeekCoder-33B is the best-performing model at code completion for high-resource languages, we find that StarCoder2-15B outperforms it on math and code reasoning benchmarks, as well as on several low-resource languages. We make the model weights available under an OpenRAIL license and ensure full transparency regarding the training data by releasing the SoftWare Heritage persistent IDentifiers (SWHIDs) of the source code data.
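A minimal usage sketch for one of the released checkpoints via Hugging Face transformers; the repository id below is assumed from the model family's naming and should be checked against the official BigCode release and its OpenRAIL license terms.

```python
# Sketch: greedy code completion with a StarCoder2 checkpoint (assumed repo id).
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "bigcode/starcoder2-3b"   # assumed id for the 3B model; verify before use
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```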