Harm de Vries

The BigCode project, an open-scientific collaboration focused on the responsible development of Large Language Models for Code (Code LLMs), introduces StarCoder2. In partnership with Software Heritage (SWH), we build The Stack v2 on top of the digital commons of their source code archive. Alongside the SWH repositories spanning 619 programming languages, we carefully select other high-quality data sources, such as GitHub pull requests, Kaggle notebooks, and code documentation. This results in a training set that is 4x larger than the first StarCoder dataset. We train StarCoder2 models with 3B, 7B, and 15B parameters on 3.3 to 4.3 trillion tokens and thoroughly evaluate them on a comprehensive set of Code LLM benchmarks. We find that our small model, StarCoder2-3B, outperforms other Code LLMs of similar size on most benchmarks, and also outperforms StarCoderBase-15B. Our large model, StarCoder2-15B, significantly outperforms other models of comparable size. In addition, it matches or outperforms CodeLlama-34B, a model more than twice its size. Although DeepSeekCoder-33B is the best-performing model at code completion for high-resource languages, we find that StarCoder2-15B outperforms it on math and code reasoning benchmarks, as well as several low-resource languages. We make the model weights available under an OpenRAIL license and ensure full transparency regarding the training data by releasing the SoftWare Heritage persistent IDentifiers (SWHIDs) of the source code data.

2024-02-28

ArXiv (preprint)

RepoFusion: Training Code Models to Understand Your Repository

Disha Shrivastava

Denis Kocetkov

Torsten Scholak

Despite the huge success of Large Language Models (LLMs) in coding assistants like GitHub Copilot, these models struggle to understand the context present in the repository (e.g., imports, parent classes, files with similar names, etc.), thereby producing inaccurate code completions. This effect is more pronounced when using these assistants for repositories that the model has not seen during training, such as proprietary software or work-in-progress code projects. Recent work has shown the promise of using context from the repository during inference. In this work, we extend this idea and propose RepoFusion, a framework to train models to incorporate relevant repository context. Experiments on single-line code completion show that our models trained with repository context significantly outperform much larger code models as CodeGen-16B-multi (

2023-06-18

ArXiv (preprint)

openreview.net

The Stack: 3 TB of permissively licensed source code

Denis Kocetkov

Raymond Li

Loubna Ben allal

Jia LI

Chenghao Mou

Carlos Muñoz Ferrandis

Yacine Jernite

Margaret Mitchell

Sean Hughes

Thomas Wolf

Leandro Von Werra

Large Language Models (LLMs) play an ever-increasing role in the field of Artificial Intelligence (AI)--not only for natural language processing but also for code understanding and generation. To stimulate open and responsible research on LLMs for code, we introduce The Stack, a 3.1 TB dataset consisting of permissively licensed source code in 30 programming languages. We describe how we collect the full dataset, construct a permissively licensed subset, present a data governance plan, discuss limitations, and show promising results on text2code benchmarks by training 350M-parameter decoders on different Python subsets. We find that (1) near-deduplicating the data significantly boosts performance across all experiments, and (2) it is possible to match previously reported HumanEval and MBPP performance using only permissively licensed data. We make the dataset available at https://hf.co/BigCode, provide a tool called "Am I in The Stack" (https://hf.co/spaces/bigcode/in-the-stack) for developers to search The Stack for copies of their code, and provide a process for code to be removed from the dataset by following the instructions at https://www.bigcode-project.org/docs/about/the-stack/.

2023-06-05

TMLR (accepted)

openreview.net

SantaCoder: don't reach for the stars!

Loubna Ben allal

Raymond Li

Denis Kocetkov

Chenghao Mou

Christopher Akiki

Carlos Muñoz Ferrandis

Niklas Muennighoff

Mayank Mishra

Alex Gu

Manan Dey

Logesh Kumar Umapathi

Carolyn Jane Anderson

Yangtian Zi

Joel Lamy Poirier

Hailey Schoelkopf

S. Troshin

Dmitry Abulkhanov

Manuel L. Romero

M. Lappert

Francesco De Toni

Bernardo García del Río

Qian Liu

Shamik Bose

Urvashi Bhattacharyya

Terry Yue Zhuo

Ian Yu

Paulo Villegas

Marco Zocca

Sourab Mangrulkar

D. Lansky

Huu Nguyen

Danish Contractor

Luisa Villa

Jia LI

Yacine Jernite

Sean Christopher Hughes

Daniel Fried

Arjun Guha

Leandro Von Werra

The BigCode project is an open-scientific collaboration working on the responsible development of large language models for code. This tech report describes the progress of the collaboration until December 2022, outlining the current state of the Personally Identifiable Information (PII) redaction pipeline, the experiments conducted to de-risk the model architecture, and the experiments investigating better preprocessing methods for the training data. We train 1.1B parameter models on the Java, JavaScript, and Python subsets of The Stack and evaluate them on the MultiPL-E text-to-code benchmark. We find that more aggressive filtering of near-duplicates can further boost performance and, surprisingly, that selecting files from repositories with 5+ GitHub stars deteriorates performance significantly. Our best model outperforms previous open-source multilingual code generation models (InCoder-6.7B and CodeGen-Multi-2.7B) in both left-to-right generation and infilling on the Java, JavaScript, and Python portions of MultiPL-E, despite being a substantially smaller model. All models are released under an OpenRAIL license at https://hf.co/bigcode.

2023-01-08

ArXiv (preprint)

StarCoder: may the source be with you!

Raymond Li

Loubna Ben allal

Yangtian Zi

Niklas Muennighoff

Denis Kocetkov

Chenghao Mou

Marc Marone

Christopher Akiki

Jia LI

Jenny Chim

Qian Liu

Evgenii Zheltonozhskii

Terry Yue Zhuo

Thomas Wang

Olivier Dehaene

Mishig Davaadorj

Joel Lamy-Poirier

Joao Monteiro

Oleh Shliazhko

Nicolas Gontier

Nicholas Meade

Armel Zebaze

Ming-Ho Yee

Logesh Kumar Umapathi

Jian Zhu

Ben Lipkin

Muhtasham Oblokulov

Zhiruo Wang

Rudra Murthy

Jason T Stillerman

Siva Sankalp Patel

Dmitry Abulkhanov

Marco Zocca

Manan Dey

Zhihan Zhang

N. Fahmy

Urvashi Bhattacharyya

Wenhao Yu

Swayam Singh

Sasha Luccioni

Paulo Villegas

M. Kunakov

Fedor Zhdanov

Manuel Romero

Tony Lee

Nadav Timor

Jennifer Ding

Claire S Schlesinger

Hailey Schoelkopf

Jan Ebert

Tri Dao

Mayank Mishra

Alex Gu

Jennifer Robinson

Carolyn Jane Anderson

Brendan Dolan-Gavitt

Danish Contractor

Daniel Fried

Yacine Jernite

Carlos Muñoz Ferrandis

Sean Hughes

Thomas Wolf

Arjun Guha

Leandro Von Werra

The BigCode community, an open-scientific collaboration working on the responsible development of Large Language Models for Code (Code LLMs), introduces StarCoder and StarCoderBase: 15.5B parameter models with 8K context length, infilling capabilities and fast large-batch inference enabled by multi-query attention. StarCoderBase is trained on 1 trillion tokens sourced from The Stack, a large collection of permissively licensed GitHub repositories with inspection tools and an opt-out process. We fine-tuned StarCoderBase on 35B Python tokens, resulting in the creation of StarCoder. We perform the most comprehensive evaluation of Code LLMs to date and show that StarCoderBase outperforms every open Code LLM that supports multiple programming languages and matches or outperforms the OpenAI code-cushman-001 model. Furthermore, StarCoder outperforms every model that is fine-tuned on Python and still retains its performance on other programming languages. We take several important steps towards a safe open-access model release, including an improved PII redaction pipeline and a novel attribution tracing tool, and make the StarCoder models publicly available under a more commercially viable version of the Open Responsible AI Model license.

2022-12-31

Trans. Mach. Learn. Res. (published)

openreview.net

The StatCan Dialogue Dataset: Retrieving Data Tables through Conversations with Genuine Intents

Xing Han Lu

We introduce the StatCan Dialogue Dataset consisting of 19,379 conversation turns between agents working at Statistics Canada and online users looking for published data tables. The conversations stem from genuine intents, are held in English or French, and lead to agents retrieving one of over 5000 complex data tables. Based on this dataset, we propose two tasks: (1) automatic retrieval of relevant tables based on an ongoing conversation, and (2) automatic generation of appropriate agent responses at each turn. We investigate the difficulty of each task by establishing strong baselines. Our experiments on a temporal data split reveal that all models struggle to generalize to future conversations, as we observe a significant drop in performance across both tasks when we move from the validation to the test set. In addition, we find that response generation models struggle to decide when to return a table. Considering that the tasks pose significant challenges to existing models, we encourage the community to develop models for our task, which can be directly used to help knowledge workers find relevant tables for live chat users.

2022-12-31

EACL (published)

The Power of Prompt Tuning for Low-Resource Semantic Parsing

Nathan Schucher

Prompt tuning has recently emerged as an effective method for adapting pre-trained language models to a number of language understanding and generation tasks. In this paper, we investigate prompt tuning for semantic parsing, the task of mapping natural language utterances onto formal meaning representations. On the low-resource splits of Overnight and TOPv2, we find that a prompt tuned T5-xl significantly outperforms its fine-tuned counterpart, as well as strong GPT-3 and BART baselines. We also conduct ablation studies across different model scales and target representations, finding that, with increasing model scale, prompt tuned T5 models improve at generating target representations that are far from the pre-training distribution.

2022-04-30

Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) (published)

TopiOCQA: Open-domain Conversational Question Answering with Topic Switching

Vaibhav Adlakha

Shehzaad Dhuliawala

Kaheer Suleman

2022-04-12

Transactions of the Association for Computational Linguistics (published)

Generative Compositional Augmentations for Scene Graph Prediction

Boris Knyazev

Cătălina Cangea

Graham W. Taylor

Aaron Courville

Eugene Belilovsky

Inferring objects and their relationships from an image in the form of a scene graph is useful in many applications at the intersection of vision and language. We consider a challenging problem of compositional generalization that emerges in this task due to a long tail data distribution. Current scene graph generation models are trained on a tiny fraction of the distribution corresponding to the most frequent compositions, e.g. . However, test images might contain zero- and few-shot compositions of objects and relationships, e.g. . Despite each of the object categories and the predicate (e.g. 'on') being frequent in the training data, the models often fail to properly understand such unseen or rare compositions. To improve generalization, it is natural to attempt increasing the diversity of the training distribution. However, in the graph domain this is non-trivial. To that end, we propose a method to synthesize rare yet plausible scene graphs by perturbing real ones. We then propose and empirically study a model based on conditional generative adversarial networks (GANs) that allows us to generate visual features of perturbed scene graphs and learn from them in a joint fashion. When evaluated on the Visual Genome dataset, our approach yields marginal, but consistent improvements in zero- and few-shot metrics. We analyze the limitations of our approach indicating promising directions for future research.

2021-09-30

2021 IEEE/CVF International Conference on Computer Vision (ICCV) (published)

Graph Density-Aware Losses for Novel Compositions in Scene Graph Generation

Boris Knyazev

Cătălina Cangea

Graham W. Taylor

Aaron Courville

Eugene Belilovsky

Scene graph generation (SGG) aims to predict graph-structured descriptions of input images, in the form of objects and relationships between them. This task is becoming increasingly useful for progress at the interface of vision and language. Here, it is important - yet challenging - to perform well on novel (zero-shot) or rare (few-shot) compositions of objects and relationships. In this paper, we identify two key issues that limit such generalization. Firstly, we show that the standard loss used in this task is unintentionally a function of scene graph density. This leads to the neglect of individual edges in large sparse graphs during training, even though these contain diverse few-shot examples that are important for generalization. Secondly, the frequency of relationships can create a strong bias in this task, such that a blind model predicting the most frequent relationship achieves good performance. Consequently, some state-of-the-art models exploit this bias to improve results. We show that such models can suffer the most in their ability to generalize to rare compositions, evaluating two different models on the Visual Genome dataset and its more recent, improved version, GQA. To address these issues, we introduce a density-normalized edge loss, which provides more than a two-fold improvement in certain generalization metrics. Compared to other works in this direction, our enhancements require only a few lines of code and no added computational cost. We also highlight the difficulty of accurately evaluating models using existing metrics, especially on zero/few shots, and introduce a novel weighted metric.

2019-12-31

Proceedings of the British Machine Vision Conference 2020 (published)

CLOSURE: Assessing Systematic Generalization of CLEVR Models

Timothy J. O'Donnell

Shikhar Murty

Philippe Beaudoin

Yoshua Bengio

Aaron Courville

The CLEVR dataset of natural-looking questions about 3D-rendered scenes has recently received much attention from the research community. A number of models have been proposed for this task, many of which achieved very high accuracies of around 97-99%. In this work, we study how systematic the generalization of such models is, that is, to what extent they are capable of handling novel combinations of known linguistic constructs. To this end, we test models' understanding of referring expressions based on matching object properties (such as "the object that is the same size as the red ball") in novel contexts. Our experiments on the thereby constructed CLOSURE benchmark show that state-of-the-art models often do not exhibit systematicity after being trained on CLEVR. Surprisingly, we find that an explicitly compositional Neural Module Network model also generalizes badly on CLOSURE, even when it has access to the ground-truth programs at test time. We improve the NMN's systematic generalization by developing a novel Vector-NMN module architecture with vector-valued inputs and outputs. Lastly, we investigate the extent to which few-shot transfer learning can help models that are pretrained on CLEVR to adapt to CLOSURE. Our few-shot learning experiments contrast the adaptation behavior of the models with intermediate discrete programs with that of the end-to-end continuous models.

2019-12-11

ArXiv (preprint)