Nicolas Chapados

WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?

Alexandre Drouin

Maxime Gasse

Massimo Caccia

Issam Hadj Laradji

David Vazquez

Alexandre Lacoste

We study the use of large language model-based agents for interacting with software via web browsers. Unlike prior work, we focus on measuri… (see more)ng the agents' ability to perform tasks that span the typical daily work of knowledge workers utilizing enterprise software systems. To this end, we propose WorkArena, a remote-hosted benchmark of 33 tasks based on the widely-used ServiceNow platform. We also introduce BrowserGym, an environment for the design and evaluation of such agents, offering a rich set of actions as well as multimodal observations. Our empirical evaluation reveals that while current agents show promise on WorkArena, there remains a considerable gap towards achieving full task automation. Notably, our analysis uncovers a significant performance disparity between open and closed-source LLMs, highlighting a critical area for future exploration and development in the field.

2024-03-12

ArXiv (preprint)

WorkArena: How Capable are Web Agents at Solving Common Knowledge Work Tasks?

Alexandre Drouin

Maxime Gasse

Massimo Caccia

Issam Hadj Laradji

David Vazquez

Alexandre Lacoste

We study the use of large language model-based agents for interacting with software via web browsers. Unlike prior work, we focus on measuri… (see more)ng the agents' ability to perform tasks that span the typical daily work of knowledge workers utilizing enterprise software systems. To this end, we propose WorkArena, a remote-hosted benchmark of 29 tasks based on the widely-used ServiceNow platform. We also introduce BrowserGym, an environment for the design and evaluation of such agents, offering a rich set of actions as well as multimodal observations. Our empirical evaluation reveals that while current agents show promise on WorkArena, there remains a considerable gap towards achieving full task automation. Notably, our analysis uncovers a significant performance disparity between open and closed-source LLMs, highlighting a critical area for future exploration and development in the field.

2024-03-11

ICLR.cc/2024/Workshop/LLMAgents (poster)

openreview.net

StarCoder 2 and The Stack v2: The Next Generation

Anton Lozhkov

Loubna Ben allal

Federico Cassano

Joel Lamy-Poirier

Nouamane Tazi

Ao Tang

Dmytro Pykhtar

Jiawei Liu

Yuxiang Wei

Tianyang Liu

Max Tian

Denis Kocetkov

Arthur Zucker

Younes Belkada

Zijian Wang

Qian Liu

Dmitry Abulkhanov

Indraneil Paul

Zhuang Li … (see 46 more)

Wen-Ding Li

Megan L. Risdal

Jia LI

Jian Zhu

Terry Yue Zhuo

Evgenii Zheltonozhskii

Nii Osae Osae Dade

Wenhao Yu

Lucas Krauss

Naman Jain

Yixuan Su

Xuanli He

Edoardo Abati

Yekun Chai

Niklas Muennighoff

Xiangru Tang

Muhtasham Oblokulov

Christopher Akiki

Marc Marone

Chenghao Mou

Mayank Mishra

Alex Gu

Binyuan Hui

Tri Dao

Armel Zebaze

Olivier Dehaene

Nicolas Patry

Canwen Xu

Julian McAuley

Han Hu

Torsten Scholak

Sebastien Paquet

Jennifer Robinson

Carolyn Jane Anderson

Md. Mostofa Ali Patwary

Nima Tajbakhsh

Yacine Jernite

Carlos Muñoz Ferrandis

Lingming Zhang

Sean Hughes

Thomas Wolf

Arjun Guha

Leandro Von Werra

The BigCode project, an open-scientific collaboration focused on the responsible development of Large Language Models for Code (Code LLMs), … (see more)introduces StarCoder2. In partnership with Software Heritage (SWH), we build The Stack v2 on top of the digital commons of their source code archive. Alongside the SWH repositories spanning 619 programming languages, we carefully select other high-quality data sources, such as GitHub pull requests, Kaggle notebooks, and code documentation. This results in a training set that is 4x larger than the first StarCoder dataset. We train StarCoder2 models with 3B, 7B, and 15B parameters on 3.3 to 4.3 trillion tokens and thoroughly evaluate them on a comprehensive set of Code LLM benchmarks. We find that our small model, StarCoder2-3B, outperforms other Code LLMs of similar size on most benchmarks, and also outperforms StarCoderBase-15B. Our large model, StarCoder2- 15B, significantly outperforms other models of comparable size. In addition, it matches or outperforms CodeLlama-34B, a model more than twice its size. Although DeepSeekCoder- 33B is the best-performing model at code completion for high-resource languages, we find that StarCoder2-15B outperforms it on math and code reasoning benchmarks, as well as several low-resource languages. We make the model weights available under an OpenRAIL license and ensure full transparency regarding the training data by releasing the SoftWare Heritage persistent IDentifiers (SWHIDs) of the source code data.

2024-02-29

ArXiv (preprint)

StarCoder 2 and The Stack v2: The Next Generation

Anton Lozhkov

Loubna Ben allal

Federico Cassano

Joel Lamy-Poirier

Nouamane Tazi

Ao Tang

Dmytro Pykhtar

Jiawei Liu

Yuxiang Wei

Tianyang Liu

Max Tian

Denis Kocetkov

Arthur Zucker

Younes Belkada

Zijian Wang

Qian Liu

Dmitry Abulkhanov

Indraneil Paul

Zhuang Li … (see 46 more)

Wen-Ding Li

Megan L. Risdal

Jia LI

Jian Zhu

Terry Yue Zhuo

Evgenii Zheltonozhskii

Nii Osae Osae Dade

Wenhao Yu

Lucas Krauss

Naman Jain

Yixuan Su

Xuanli He

Edoardo Abati

Yekun Chai

Niklas Muennighoff

Xiangru Tang

Muhtasham Oblokulov

Christopher Akiki

Marc Marone

Chenghao Mou

Mayank Mishra

Alex Gu

Binyuan Hui

Tri Dao

Armel Zebaze

Olivier Dehaene

Nicolas Patry

Canwen Xu

Julian McAuley

Han Hu

Torsten Scholak

Sebastien Paquet

Jennifer Robinson

Carolyn Jane Anderson

Md. Mostofa Ali Patwary

Nima Tajbakhsh

Yacine Jernite

Carlos Muñoz Ferrandis

Lingming Zhang

Sean Hughes

Thomas Wolf

Arjun Guha

Leandro Von Werra

The BigCode project, an open-scientific collaboration focused on the responsible development of Large Language Models for Code (Code LLMs), … (see more)introduces StarCoder2. In partnership with Software Heritage (SWH), we build The Stack v2 on top of the digital commons of their source code archive. Alongside the SWH repositories spanning 619 programming languages, we carefully select other high-quality data sources, such as GitHub pull requests, Kaggle notebooks, and code documentation. This results in a training set that is 4x larger than the first StarCoder dataset. We train StarCoder2 models with 3B, 7B, and 15B parameters on 3.3 to 4.3 trillion tokens and thoroughly evaluate them on a comprehensive set of Code LLM benchmarks. We find that our small model, StarCoder2-3B, outperforms other Code LLMs of similar size on most benchmarks, and also outperforms StarCoderBase-15B. Our large model, StarCoder2- 15B, significantly outperforms other models of comparable size. In addition, it matches or outperforms CodeLlama-34B, a model more than twice its size. Although DeepSeekCoder- 33B is the best-performing model at code completion for high-resource languages, we find that StarCoder2-15B outperforms it on math and code reasoning benchmarks, as well as several low-resource languages. We make the model weights available under an OpenRAIL license and ensure full transparency regarding the training data by releasing the SoftWare Heritage persistent IDentifiers (SWHIDs) of the source code data.

2024-02-29

ArXiv (preprint)

StarCoder 2 and The Stack v2: The Next Generation

Anton Lozhkov

Loubna Ben allal

Federico Cassano

Joel Lamy-Poirier

Nouamane Tazi

Ao Tang

Dmytro Pykhtar

Jiawei Liu

Yuxiang Wei

Tianyang Liu

Max Tian

Denis Kocetkov

Arthur Zucker

Younes Belkada

Zijian Wang

Qian Liu

Dmitry Abulkhanov

Indraneil Paul

Zhuang Li … (see 46 more)

Wen-Ding Li

Megan L. Risdal

Jia LI

Jian Zhu

Terry Yue Zhuo

Evgenii Zheltonozhskii

Nii Osae Osae Dade

Wenhao Yu

Lucas Krauss

Naman Jain

Yixuan Su

Xuanli He

Edoardo Abati

Yekun Chai

Niklas Muennighoff

Xiangru Tang

Muhtasham Oblokulov

Christopher Akiki

Marc Marone

Chenghao Mou

Mayank Mishra

Alex Gu

Binyuan Hui

Tri Dao

Armel Zebaze

Olivier Dehaene

Nicolas Patry

Canwen Xu

Julian McAuley

Han Hu

Torsten Scholak

Sebastien Paquet

Jennifer Robinson

Carolyn Jane Anderson

Md. Mostofa Ali Patwary

Nima Tajbakhsh

Yacine Jernite

Carlos Muñoz Ferrandis

Lingming Zhang

Sean Hughes

Thomas Wolf

Arjun Guha

Leandro Von Werra

The BigCode project, an open-scientific collaboration focused on the responsible development of Large Language Models for Code (Code LLMs), … (see more)introduces StarCoder2. In partnership with Software Heritage (SWH), we build The Stack v2 on top of the digital commons of their source code archive. Alongside the SWH repositories spanning 619 programming languages, we carefully select other high-quality data sources, such as GitHub pull requests, Kaggle notebooks, and code documentation. This results in a training set that is 4x larger than the first StarCoder dataset. We train StarCoder2 models with 3B, 7B, and 15B parameters on 3.3 to 4.3 trillion tokens and thoroughly evaluate them on a comprehensive set of Code LLM benchmarks. We find that our small model, StarCoder2-3B, outperforms other Code LLMs of similar size on most benchmarks, and also outperforms StarCoderBase-15B. Our large model, StarCoder2- 15B, significantly outperforms other models of comparable size. In addition, it matches or outperforms CodeLlama-34B, a model more than twice its size. Although DeepSeekCoder- 33B is the best-performing model at code completion for high-resource languages, we find that StarCoder2-15B outperforms it on math and code reasoning benchmarks, as well as several low-resource languages. We make the model weights available under an OpenRAIL license and ensure full transparency regarding the training data by releasing the SoftWare Heritage persistent IDentifiers (SWHIDs) of the source code data.

2024-02-29

ArXiv (preprint)

StarCoder 2 and The Stack v2: The Next Generation

Anton Lozhkov

Loubna Ben allal

Federico Cassano

Joel Lamy-Poirier

Nouamane Tazi

Ao Tang

Dmytro Pykhtar

Jiawei Liu

Yuxiang Wei

Tianyang Liu

Max Tian

Denis Kocetkov

Arthur Zucker

Younes Belkada

Zijian Wang

Qian Liu

Dmitry Abulkhanov

Indraneil Paul

Zhuang Li … (see 46 more)

Wen-Ding Li

Megan L. Risdal

Jia LI

Jian Zhu

Terry Yue Zhuo

Evgenii Zheltonozhskii

Nii Osae Osae Dade

Wenhao Yu

Lucas Krauss

Naman Jain

Yixuan Su

Xuanli He

Edoardo Abati

Yekun Chai

Niklas Muennighoff

Xiangru Tang

Muhtasham Oblokulov

Christopher Akiki

Marc Marone

Chenghao Mou

Mayank Mishra

Alex Gu

Binyuan Hui

Tri Dao

Armel Zebaze

Olivier Dehaene

Nicolas Patry

Canwen Xu

Julian McAuley

Han Hu

Torsten Scholak

Sebastien Paquet

Jennifer Robinson

Carolyn Jane Anderson

Md. Mostofa Ali Patwary

Nima Tajbakhsh

Yacine Jernite

Carlos Muñoz Ferrandis

Lingming Zhang

Sean Hughes

Thomas Wolf

Arjun Guha

Leandro Von Werra

The BigCode project, an open-scientific collaboration focused on the responsible development of Large Language Models for Code (Code LLMs), … (see more)introduces StarCoder2. In partnership with Software Heritage (SWH), we build The Stack v2 on top of the digital commons of their source code archive. Alongside the SWH repositories spanning 619 programming languages, we carefully select other high-quality data sources, such as GitHub pull requests, Kaggle notebooks, and code documentation. This results in a training set that is 4x larger than the first StarCoder dataset. We train StarCoder2 models with 3B, 7B, and 15B parameters on 3.3 to 4.3 trillion tokens and thoroughly evaluate them on a comprehensive set of Code LLM benchmarks. We find that our small model, StarCoder2-3B, outperforms other Code LLMs of similar size on most benchmarks, and also outperforms StarCoderBase-15B. Our large model, StarCoder2- 15B, significantly outperforms other models of comparable size. In addition, it matches or outperforms CodeLlama-34B, a model more than twice its size. Although DeepSeekCoder- 33B is the best-performing model at code completion for high-resource languages, we find that StarCoder2-15B outperforms it on math and code reasoning benchmarks, as well as several low-resource languages. We make the model weights available under an OpenRAIL license and ensure full transparency regarding the training data by releasing the SoftWare Heritage persistent IDentifiers (SWHIDs) of the source code data.

2024-02-29

ArXiv (preprint)

StarCoder 2 and The Stack v2: The Next Generation

Anton Lozhkov

Loubna Ben allal

Federico Cassano

Joel Lamy-Poirier

Nouamane Tazi

Ao Tang

Dmytro Pykhtar

Jiawei Liu

Yuxiang Wei

Tianyang Liu

Max Tian

Denis Kocetkov

Arthur Zucker

Younes Belkada

Zijian Wang

Qian Liu

Dmitry Abulkhanov

Indraneil Paul

Zhuang Li … (see 46 more)

Wen-Ding Li

Megan L. Risdal

Jia LI

Jian Zhu

Terry Yue Zhuo

Evgenii Zheltonozhskii

Nii Osae Osae Dade

Wenhao Yu

Lucas Krauss

Naman Jain

Yixuan Su

Xuanli He

Edoardo Abati

Yekun Chai

Niklas Muennighoff

Xiangru Tang

Muhtasham Oblokulov

Christopher Akiki

Marc Marone

Chenghao Mou

Mayank Mishra

Alex Gu

Binyuan Hui

Tri Dao

Armel Zebaze

Olivier Dehaene

Nicolas Patry

Canwen Xu

Julian McAuley

Han Hu

Torsten Scholak

Sebastien Paquet

Jennifer Robinson

Carolyn Jane Anderson

Mostofa Ali Patwary

Nima Tajbakhsh

Yacine Jernite

Carlos Muñoz Ferrandis

Lingming Zhang

Sean Hughes

Thomas Wolf

Arjun Guha

Leandro Von Werra

The BigCode project, an open-scientific collaboration focused on the responsible development of Large Language Models for Code (Code LLMs), … (see more)introduces StarCoder2. In partnership with Software Heritage (SWH), we build The Stack v2 on top of the digital commons of their source code archive. Alongside the SWH repositories spanning 619 programming languages, we carefully select other high-quality data sources, such as GitHub pull requests, Kaggle notebooks, and code documentation. This results in a training set that is 4x larger than the first StarCoder dataset. We train StarCoder2 models with 3B, 7B, and 15B parameters on 3.3 to 4.3 trillion tokens and thoroughly evaluate them on a comprehensive set of Code LLM benchmarks. We find that our small model, StarCoder2-3B, outperforms other Code LLMs of similar size on most benchmarks, and also outperforms StarCoderBase-15B. Our large model, StarCoder2- 15B, significantly outperforms other models of comparable size. In addition, it matches or outperforms CodeLlama-34B, a model more than twice its size. Although DeepSeekCoder- 33B is the best-performing model at code completion for high-resource languages, we find that StarCoder2-15B outperforms it on math and code reasoning benchmarks, as well as several low-resource languages. We make the model weights available under an OpenRAIL license and ensure full transparency regarding the training data by releasing the SoftWare Heritage persistent IDentifiers (SWHIDs) of the source code data.

2024-02-29

ArXiv (preprint)

StarCoder 2 and The Stack v2: The Next Generation

Anton Lozhkov

Loubna Ben allal

Federico Cassano

Joel Lamy-Poirier

Nouamane Tazi

Ao Tang

Dmytro Pykhtar

Jiawei Liu

Yuxiang Wei

Tianyang Liu

Max Tian

Denis Kocetkov

Arthur Zucker

Younes Belkada

Zijian Wang

Qian Liu

Dmitry Abulkhanov

Indraneil Paul

Zhuang Li … (see 46 more)

Wen-Ding Li

Megan L. Risdal

Jia LI

Jian Zhu

Terry Yue Zhuo

Evgenii Zheltonozhskii

Nii Osae Osae Dade

Wenhao Yu

Lucas Krauss

Naman Jain

Yixuan Su

Xuanli He

Edoardo Abati

Yekun Chai

Niklas Muennighoff

Xiangru Tang

Muhtasham Oblokulov

Christopher Akiki

Marc Marone

Chenghao Mou

Mayank Mishra

Alex Gu

Binyuan Hui

Tri Dao

Armel Zebaze

Olivier Dehaene

Nicolas Patry

Canwen Xu

Julian McAuley

Han Hu

Torsten Scholak

Sebastien Paquet

Jennifer Robinson

Carolyn Jane Anderson

Md. Mostofa Ali Patwary

Nima Tajbakhsh

Yacine Jernite

Carlos Muñoz Ferrandis

Lingming Zhang

Sean Hughes

Thomas Wolf

Arjun Guha

Leandro Von Werra

The BigCode project, an open-scientific collaboration focused on the responsible development of Large Language Models for Code (Code LLMs), … (see more)introduces StarCoder2. In partnership with Software Heritage (SWH), we build The Stack v2 on top of the digital commons of their source code archive. Alongside the SWH repositories spanning 619 programming languages, we carefully select other high-quality data sources, such as GitHub pull requests, Kaggle notebooks, and code documentation. This results in a training set that is 4x larger than the first StarCoder dataset. We train StarCoder2 models with 3B, 7B, and 15B parameters on 3.3 to 4.3 trillion tokens and thoroughly evaluate them on a comprehensive set of Code LLM benchmarks. We find that our small model, StarCoder2-3B, outperforms other Code LLMs of similar size on most benchmarks, and also outperforms StarCoderBase-15B. Our large model, StarCoder2- 15B, significantly outperforms other models of comparable size. In addition, it matches or outperforms CodeLlama-34B, a model more than twice its size. Although DeepSeekCoder- 33B is the best-performing model at code completion for high-resource languages, we find that StarCoder2-15B outperforms it on math and code reasoning benchmarks, as well as several low-resource languages. We make the model weights available under an OpenRAIL license and ensure full transparency regarding the training data by releasing the SoftWare Heritage persistent IDentifiers (SWHIDs) of the source code data.

2024-02-29

ArXiv (preprint)

StarCoder 2 and The Stack v2: The Next Generation

Anton Lozhkov

Loubna Ben allal

Federico Cassano

Joel Lamy-Poirier

Nouamane Tazi

Ao Tang

Dmytro Pykhtar

Jiawei Liu

Yuxiang Wei

Tianyang Liu

Max Tian

Denis Kocetkov

Arthur Zucker

Younes Belkada

Zijian Wang

Qian Liu

Dmitry Abulkhanov

Indraneil Paul

Zhuang Li … (see 46 more)

Wen-Ding Li

Megan L. Risdal

Jia LI

Jian Zhu

Terry Yue Zhuo

Evgenii Zheltonozhskii

Nii Osae Osae Dade

Wenhao Yu

Lucas Krauss

Naman Jain

Yixuan Su

Xuanli He

Edoardo Abati

Yekun Chai

Niklas Muennighoff

Xiangru Tang

Muhtasham Oblokulov

Christopher Akiki

Marc Marone

Chenghao Mou

Mayank Mishra

Alex Gu

Binyuan Hui

Tri Dao

Armel Zebaze

Olivier Dehaene

Nicolas Patry

Canwen Xu

Julian McAuley

Han Hu

Torsten Scholak

Sebastien Paquet

Jennifer Robinson

Carolyn Jane Anderson

Md. Mostofa Ali Patwary

Nima Tajbakhsh

Yacine Jernite

Carlos Muñoz Ferrandis

Lingming Zhang

Sean Hughes

Thomas Wolf

Arjun Guha

Leandro Von Werra

The BigCode project, an open-scientific collaboration focused on the responsible development of Large Language Models for Code (Code LLMs), … (see more)introduces StarCoder2. In partnership with Software Heritage (SWH), we build The Stack v2 on top of the digital commons of their source code archive. Alongside the SWH repositories spanning 619 programming languages, we carefully select other high-quality data sources, such as GitHub pull requests, Kaggle notebooks, and code documentation. This results in a training set that is 4x larger than the first StarCoder dataset. We train StarCoder2 models with 3B, 7B, and 15B parameters on 3.3 to 4.3 trillion tokens and thoroughly evaluate them on a comprehensive set of Code LLM benchmarks. We find that our small model, StarCoder2-3B, outperforms other Code LLMs of similar size on most benchmarks, and also outperforms StarCoderBase-15B. Our large model, StarCoder2- 15B, significantly outperforms other models of comparable size. In addition, it matches or outperforms CodeLlama-34B, a model more than twice its size. Although DeepSeekCoder- 33B is the best-performing model at code completion for high-resource languages, we find that StarCoder2-15B outperforms it on math and code reasoning benchmarks, as well as several low-resource languages. We make the model weights available under an OpenRAIL license and ensure full transparency regarding the training data by releasing the SoftWare Heritage persistent IDentifiers (SWHIDs) of the source code data.

2024-02-29

ArXiv (preprint)

StarCoder 2 and The Stack v2: The Next Generation

Anton Lozhkov

Loubna Ben allal

Federico Cassano

Joel Lamy-Poirier

Nouamane Tazi

Ao Tang

Dmytro Pykhtar

Jiawei Liu

Yuxiang Wei

Tianyang Liu

Max Tian

Denis Kocetkov

Arthur Zucker

Younes Belkada

Zijian Wang

Qian Liu

Dmitry Abulkhanov

Indraneil Paul

Zhuang Li … (see 46 more)

Wen-Ding Li

Megan L. Risdal

Jia LI

Jian Zhu

Terry Yue Zhuo

Evgenii Zheltonozhskii

Nii Osae Osae Dade

Wenhao Yu

Lucas Krauss

Naman Jain

Yixuan Su

Xuanli He

Edoardo Abati

Yekun Chai

Niklas Muennighoff

Xiangru Tang

Muhtasham Oblokulov

Christopher Akiki

Marc Marone

Chenghao Mou

Mayank Mishra

Alex Gu

Binyuan Hui

Tri Dao

Armel Zebaze

Olivier Dehaene

Nicolas Patry

Canwen Xu

Julian McAuley

Han Hu

Torsten Scholak

Sebastien Paquet

Jennifer Robinson

Carolyn Jane Anderson

Md. Mostofa Ali Patwary

Nima Tajbakhsh

Yacine Jernite

Carlos Muñoz Ferrandis

Lingming Zhang

Sean Hughes

Thomas Wolf

Arjun Guha

Leandro Von Werra

The BigCode project, an open-scientific collaboration focused on the responsible development of Large Language Models for Code (Code LLMs), … (see more)introduces StarCoder2. In partnership with Software Heritage (SWH), we build The Stack v2 on top of the digital commons of their source code archive. Alongside the SWH repositories spanning 619 programming languages, we carefully select other high-quality data sources, such as GitHub pull requests, Kaggle notebooks, and code documentation. This results in a training set that is 4x larger than the first StarCoder dataset. We train StarCoder2 models with 3B, 7B, and 15B parameters on 3.3 to 4.3 trillion tokens and thoroughly evaluate them on a comprehensive set of Code LLM benchmarks. We find that our small model, StarCoder2-3B, outperforms other Code LLMs of similar size on most benchmarks, and also outperforms StarCoderBase-15B. Our large model, StarCoder2- 15B, significantly outperforms other models of comparable size. In addition, it matches or outperforms CodeLlama-34B, a model more than twice its size. Although DeepSeekCoder- 33B is the best-performing model at code completion for high-resource languages, we find that StarCoder2-15B outperforms it on math and code reasoning benchmarks, as well as several low-resource languages. We make the model weights available under an OpenRAIL license and ensure full transparency regarding the training data by releasing the SoftWare Heritage persistent IDentifiers (SWHIDs) of the source code data.

2024-02-29

ArXiv (preprint)

StarCoder 2 and The Stack v2: The Next Generation

Anton Lozhkov

Loubna Ben allal

Federico Cassano

Joel Lamy-Poirier

Nouamane Tazi

Ao Tang

Dmytro Pykhtar

Jiawei Liu

Yuxiang Wei

Tianyang Liu

Max Tian

Denis Kocetkov

Arthur Zucker

Younes Belkada

Zijian Wang

Qian Liu

Dmitry Abulkhanov

Indraneil Paul

Zhuang Li … (see 46 more)

Wen-Ding Li

Megan L. Risdal

Jia LI

Jian Zhu

Terry Yue Zhuo

Evgenii Zheltonozhskii

Nii Osae Osae Dade

Wenhao Yu

Lucas Krauss

Naman Jain

Yixuan Su

Xuanli He

Edoardo Abati

Yekun Chai

Niklas Muennighoff

Xiangru Tang

Muhtasham Oblokulov

Christopher Akiki

Marc Marone

Chenghao Mou

Mayank Mishra

Alex Gu

Binyuan Hui

Tri Dao

Armel Zebaze

Olivier Dehaene

Nicolas Patry

Canwen Xu

Julian McAuley

Han Hu

Torsten Scholak

Sebastien Paquet

Jennifer Robinson

Carolyn Jane Anderson

Md. Mostofa Ali Patwary

Nima Tajbakhsh

Yacine Jernite

Carlos Muñoz Ferrandis

Lingming Zhang

Sean Hughes

Thomas Wolf

Arjun Guha

Leandro Von Werra

The BigCode project, an open-scientific collaboration focused on the responsible development of Large Language Models for Code (Code LLMs), … (see more)introduces StarCoder2. In partnership with Software Heritage (SWH), we build The Stack v2 on top of the digital commons of their source code archive. Alongside the SWH repositories spanning 619 programming languages, we carefully select other high-quality data sources, such as GitHub pull requests, Kaggle notebooks, and code documentation. This results in a training set that is 4x larger than the first StarCoder dataset. We train StarCoder2 models with 3B, 7B, and 15B parameters on 3.3 to 4.3 trillion tokens and thoroughly evaluate them on a comprehensive set of Code LLM benchmarks. We find that our small model, StarCoder2-3B, outperforms other Code LLMs of similar size on most benchmarks, and also outperforms StarCoderBase-15B. Our large model, StarCoder2- 15B, significantly outperforms other models of comparable size. In addition, it matches or outperforms CodeLlama-34B, a model more than twice its size. Although DeepSeekCoder- 33B is the best-performing model at code completion for high-resource languages, we find that StarCoder2-15B outperforms it on math and code reasoning benchmarks, as well as several low-resource languages. We make the model weights available under an OpenRAIL license and ensure full transparency regarding the training data by releasing the SoftWare Heritage persistent IDentifiers (SWHIDs) of the source code data.

2024-02-29

ArXiv (preprint)