Dzmitry Bahdanau

Torsten Scholak

Despite the huge success of Large Language Models (LLMs) in coding assistants like GitHub Copilot, these models struggle to understand the c… (voir plus)ontext present in the repository (e.g., imports, parent classes, files with similar names, etc.), thereby producing inaccurate code completions. This effect is more pronounced when using these assistants for repositories that the model has not seen during training, such as proprietary software or work-in-progress code projects. Recent work has shown the promise of using context from the repository during inference. In this work, we extend this idea and propose RepoFusion, a framework to train models to incorporate relevant repository context. Experiments on single-line code completion show that our models trained with repository context significantly outperform much larger code models as CodeGen-16B-multi (

2023-06-19

ArXiv (prépublication)

The Stack: 3 TB of permissively licensed source code

Denis Kocetkov

Raymond Li

Loubna Ben allal

Jia LI

Chenghao Mou

Carlos Muñoz Ferrandis

Yacine Jernite

Margaret Mitchell

Sean Hughes

Thomas Wolf

Leandro Von Werra

Harm de Vries

Large Language Models (LLMs) play an ever-increasing role in the field of Artificial Intelligence (AI)--not only for natural language proces… (voir plus)sing but also for code understanding and generation. To stimulate open and responsible research on LLMs for code, we introduce The Stack, a 3.1 TB dataset consisting of permissively licensed source code in 30 programming languages. We describe how we collect the full dataset, construct a permissively licensed subset, present a data governance plan, discuss limitations, and show promising results on text2code benchmarks by training 350M-parameter decoders on different Python subsets. We find that (1) near-deduplicating the data significantly boosts performance across all experiments, and (2) it is possible to match previously reported HumanEval and MBPP performance using only permissively licensed data. We make the dataset available at https://hf.co/BigCode, provide a tool called"Am I in The Stack"(https://hf.co/spaces/bigcode/in-the-stack) for developers to search The Stack for copies of their code, and provide a process for code to be removed from the dataset by following the instructions at https://www.bigcode-project.org/docs/about/the-stack/.

2023-06-06

TMLR (accepté)

SantaCoder: don't reach for the stars!

Loubna Ben allal

Raymond Li

Denis Kocetkov

Chenghao Mou

Christopher Akiki

Carlos Muñoz Ferrandis

Niklas Muennighoff

Mayank Mishra

Alex Gu

Manan Dey

Logesh Kumar Umapathi

Carolyn Jane Anderson

Yangtian Zi

Joel Lamy Poirier

Hailey Schoelkopf

S. Troshin

Dmitry Abulkhanov

Manuel L. Romero

M. Lappert

Francesco De Toni … (voir 21 de plus)

Bernardo Garc'ia del R'io

Qian Liu

Shamik Bose

Urvashi Bhattacharyya

Terry Yue Zhuo

Ian Yu

Paulo Villegas

Marco Zocca

Sourab Mangrulkar

D. Lansky

Huu Nguyen

Danish Contractor

Luisa Villa

Jia LI

Yacine Jernite

Sean Christopher Hughes

Daniel Fried

Arjun Guha

Harm de Vries

Leandro Von Werra

The BigCode project is an open-scientific collaboration working on the responsible development of large language models for code. This tech … (voir plus)report describes the progress of the collaboration until December 2022, outlining the current state of the Personally Identifiable Information (PII) redaction pipeline, the experiments conducted to de-risk the model architecture, and the experiments investigating better preprocessing methods for the training data. We train 1.1B parameter models on the Java, JavaScript, and Python subsets of The Stack and evaluate them on the MultiPL-E text-to-code benchmark. We find that more aggressive filtering of near-duplicates can further boost performance and, surprisingly, that selecting files from repositories with 5+ GitHub stars deteriorates performance significantly. Our best model outperforms previous open-source multilingual code generation models (InCoder-6.7B and CodeGen-Multi-2.7B) in both left-to-right generation and infilling on the Java, JavaScript, and Python portions of MultiPL-E, despite being a substantially smaller model. All models are released under an OpenRAIL license at https://hf.co/bigcode.

2023-01-09

ArXiv (prépublication)

MAGNIFICo: Evaluating the In-Context Learning Ability of Large Language Models to Generalize to Novel Interpretations

Arkil Patel

Satwik Bhattamishra

Siva Reddy

2023-01-01

EMNLP (publié)

PromptMix: A Class Boundary Augmentation Method for Large Language Model Distillation

Gaurav Sahu

Olga Vechtomova

Issam Hadj Laradji

Data augmentation is a widely used technique to address the problem of text classification when there is a limited amount of training data. … (voir plus)Recent work often tackles this problem using large language models (LLMs) like GPT3 that can generate new examples given already available ones. In this work, we propose a method to generate more helpful augmented data by utilizing the LLM's abilities to follow instructions and perform few-shot classifications. Our specific PromptMix method consists of two steps: 1) generate challenging text augmentations near class boundaries; however, generating borderline examples increases the risk of false positives in the dataset, so we 2) relabel the text augmentations using a prompting-based LLM classifier to enhance the correctness of labels in the generated data. We evaluate the proposed method in challenging 2-shot and zero-shot settings on four text classification datasets: Banking77, TREC6, Subjectivity (SUBJ), and Twitter Complaints. Our experiments show that generating and, crucially, relabeling borderline examples facilitates the transfer of knowledge of a massive LLM like GPT3.5-turbo into smaller and cheaper classifiers like DistilBERT

2023-01-01

EMNLP (publié)

On the Compositional Generalization Gap of In-Context Learning

Arian Hosseini

Ankit Vani

Alessandro Sordoni

Aaron Courville

Pretrained large generative language models have shown great performance on many tasks, but exhibit low compositional generalization abiliti… (voir plus)es. Scaling such models has been shown to improve their performance on various NLP tasks even just by conditioning them on a few examples to solve the task without any fine-tuning (also known as in-context learning). In this work, we look at the gap between the in-distribution (ID) and out-of-distribution (OOD) performance of such models in semantic parsing tasks with in-context learning. In the ID settings, the demonstrations are from the same split (\textit{test} or \textit{train}) that the model is being evaluated on, and in the OOD settings, they are from the other split. We look at how the relative generalization gap of in-context learning evolves as models are scaled up. We evaluate four model families, OPT, BLOOM, CodeGen and Codex on three semantic parsing datasets, CFQ, SCAN and GeoQuery with different number of exemplars, and observe a trend of decreasing relative generalization gap as models are scaled up.

2022-11-15

ArXiv (prépublication)

Compositional Generalization in Dependency Parsing

Emily D. Goodwin

Siva Reddy

Timothy O'Donnell

2022-05-01

Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (publié)

Compositional Generalization in Dependency Parsing

Emily D. Goodwin

Siva Reddy

Timothy O'Donnell

Compositionality— the ability to combine familiar units like words into novel phrases and sentences— has been the focus of intense inter… (voir plus)est in artificial intelligence in recent years. To test compositional generalization in semantic parsing, Keysers et al. (2020) introduced Compositional Freebase Queries (CFQ). This dataset maximizes the similarity between the test and train distributions over primitive units, like words, while maximizing the compound divergence: the dissimilarity between test and train distributions over larger structures, like phrases. Dependency parsing, however, lacks a compositional generalization benchmark. In this work, we introduce a gold-standard set of dependency parses for CFQ, and use this to analyze the behaviour of a state-of-the art dependency parser (Qi et al., 2020) on the CFQ dataset. We find that increasing compound divergence degrades dependency parsing performance, although not as dramatically as semantic parsing performance. Additionally, we find the performance of the dependency parser does not uniformly degrade relative to compound divergence, and the parser performs differently on different splits with the same compound divergence. We explore a number of hypotheses for what causes the non-uniform degradation in dependency parsing performance, and identify a number of syntactic structures that drive the dependency parser’s lower performance on the most challenging splits.

2021-10-13

ArXiv (preprint)

Combating False Negatives in Adversarial Imitation Learning

Konrad Żołna

Chitwan Saharia

Léonard Boussioux

David Y. T. Hui

Maxime Chevalier-Boisvert

Yoshua Bengio

In adversarial imitation learning, a discriminator is trained to differentiate agent episodes from expert demonstrations representing the de… (voir plus)sired behavior. However, as the trained policy learns to be more successful, the negative examples (the ones produced by the agent) become increasingly similar to expert ones. Despite the fact that the task is successfully accomplished in some of the agent's trajectories, the discriminator is trained to output low values for them. We hypothesize that this inconsistent training signal for the discriminator can impede its learning, and consequently leads to worse overall performance of the agent. We show experimental evidence for this hypothesis and that the ‘False Negatives’ (i.e. successful agent episodes) significantly hinder adversarial imitation learning, which is the first contribution of this paper. Then, we propose a method to alleviate the impact of false negatives and test it on the BabyAI environment. This method consistently improves sample efficiency over the baselines by at least an order of magnitude.

2021-07-18

2021 International Joint Conference on Neural Networks (IJCNN) (publié)

Understanding by Understanding Not: Modeling Negation in Language Models

Arian Hosseini

Negation is a core construction in natural language. Despite being very successful on many tasks, state-of-the-art pre-trained language mode… (voir plus)ls often handle negation incorrectly. To improve language models in this regard, we propose to augment the language modeling objective with an unlikelihood objective that is based on negated generic sentences from a raw text corpus. By training BERT with the resulting combined objective we reduce the mean top 1 error rate to 4% on the negated LAMA dataset. We also see some improvements on the negated NLI benchmarks.

2021-06-01

Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (publié)

BabyAI 1.1

David Y. T. Hui

Maxime Chevalier-Boisvert

Yoshua Bengio

BabyAI 1.1

David Y. T. Hui

Maxime Chevalier-Boisvert