Nicholas Meade

PhD - McGill University
Supervisor

Publications

StarCoder: may the source be with you!
Raymond Li
Loubna Ben Allal
Yangtian Zi
Niklas Muennighoff
Denis Kocetkov
Chenghao Mou
Marc Marone
Christopher Akiki
Jia Li
Jenny Chim
Qian Liu
Evgenii Zheltonozhskii
Terry Yue Zhuo
Thomas Wang
Olivier Dehaene
Mishig Davaadorj
Joel Lamy-Poirier
Joao Monteiro
Oleh Shliazhko
Ming-Ho Yee
Nicolas Gontier
Jian Zhu
Nicholas Meade
Armel Zebaze
Logesh Kumar Umapathi
Ben Lipkin
Muhtasham Oblokulov
Zhiruo Wang
Rudra Murthy
Jason T Stillerman
Siva Sankalp Patel
Dmitry Abulkhanov
Marco Zocca
Manan Dey
Zhihan Zhang
N. Fahmy
Urvashi Bhattacharyya
Wenhao Yu
Swayam Singh
Sasha Luccioni
Paulo Villegas
Jan Ebert
M. Kunakov
Fedor Zhdanov
Manuel Romero
Tony Lee
Nadav Timor
Jennifer Ding
Claire S Schlesinger
Hailey Schoelkopf
Jana Ebert
Tri Dao
Mayank Mishra
Alex Gu
Jennifer Robinson
Sean Hughes
Carolyn Jane Anderson
Brendan Dolan-Gavitt
Danish Contractor
Daniel Fried
Yacine Jernite
Carlos Muñoz Ferrandis
Sean M. Hughes
Thomas Wolf
Arjun Guha
Leandro Von Werra
Harm de Vries
The BigCode community, an open-scientific collaboration working on the responsible development of Large Language Models for Code (Code LLMs), introduces StarCoder and StarCoderBase: 15.5B parameter models with 8K context length, infilling capabilities and fast large-batch inference enabled by multi-query attention. StarCoderBase is trained on 1 trillion tokens sourced from The Stack, a large collection of permissively licensed GitHub repositories with inspection tools and an opt-out process. We fine-tuned StarCoderBase on 35B Python tokens, resulting in the creation of StarCoder. We perform the most comprehensive evaluation of Code LLMs to date and show that StarCoderBase outperforms every open Code LLM that supports multiple programming languages and matches or outperforms the OpenAI code-cushman-001 model. Furthermore, StarCoder outperforms every model that is fine-tuned on Python and still retains its performance on other programming languages. We take several important steps towards a safe open-access model release, including an improved PII redaction pipeline and a novel attribution tracing tool, and make the StarCoder models publicly available under a more commercially viable version of the Open Responsible AI Model license.
Using In-Context Learning to Improve Dialogue Safety
Nicholas Meade
Spandana Gella
Devamanyu Hazarika
Prakhar Gupta
Di Jin
Yang Liu
Dilek Hakkani-Tur
Evaluating Correctness and Faithfulness of Instruction-Following Models for Question Answering
Vaibhav Adlakha
Parishad BehnamGhader
Xing Han Lu
Nicholas Meade
Retriever-augmented instruction-following models are attractive alternatives to fine-tuned approaches for information-seeking tasks such as question answering (QA). By simply prepending retrieved documents to their input along with an instruction, these models can be adapted to various information domains and tasks without additional fine-tuning. While the model responses tend to be natural and fluent, the additional verbosity makes traditional QA evaluation metrics such as exact match (EM) and F1 unreliable for accurately quantifying model performance. In this work, we investigate the performance of instruction-following models across three information-seeking QA tasks. We use both automatic and human evaluation to evaluate these models along two dimensions: 1) how well they satisfy the user's information need (correctness), and 2) whether they produce a response based on the provided knowledge (faithfulness). Guided by human evaluation and analysis, we highlight the shortcomings of traditional metrics for both correctness and faithfulness. We then propose simple token-overlap based and model-based metrics that reflect the true performance of these models. Our analysis reveals that instruction-following models are competitive with, and sometimes even outperform, fine-tuned models for correctness. However, these models struggle to stick to the provided knowledge and often hallucinate in their responses. We hope our work encourages a more holistic evaluation of instruction-following models for QA. Our code and data are available at https://github.com/McGill-NLP/instruct-qa
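The abstract's point about verbosity can be illustrated with a small sketch. The code below is not taken from the paper or its repository: the function names, the normalization steps, and the choice of token recall as the token-overlap metric are all assumptions made for illustration. It shows how a correct but verbose answer scores zero on exact match and poorly on F1, while a recall-style overlap score still credits it.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, strip punctuation and English articles, split into tokens.

    This is a common QA normalization; treating it as the paper's exact
    procedure would be an assumption.
    """
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def overlap_count(pred_tokens, ref_tokens):
    """Number of tokens shared by prediction and reference (multiset overlap)."""
    return sum((Counter(pred_tokens) & Counter(ref_tokens)).values())


def exact_match(pred, ref):
    """1.0 only if the normalized prediction equals the normalized reference."""
    return float(normalize(pred) == normalize(ref))


def token_f1(pred, ref):
    """Harmonic mean of token precision and recall; penalizes extra words."""
    p, r = normalize(pred), normalize(ref)
    common = overlap_count(p, r)
    if common == 0:
        return 0.0
    precision = common / len(p)
    recall = common / len(r)
    return 2 * precision * recall / (precision + recall)


def token_recall(pred, ref):
    """Fraction of reference tokens found in the prediction (hypothetical
    stand-in for a token-overlap correctness metric)."""
    p, r = normalize(pred), normalize(ref)
    if not r:
        return 0.0
    return overlap_count(p, r) / len(r)


# Illustrative example (made up, not from the paper's data).
reference = "Paris"
prediction = "The capital of France is Paris."

em = exact_match(prediction, reference)
f1 = token_f1(prediction, reference)
rec = token_recall(prediction, reference)
print(f"EM={em:.2f} F1={f1:.2f} recall={rec:.2f}")
```

Here the prediction contains the full reference answer, so recall is 1.0, but the four extra tokens pull precision down to 0.2 and F1 to about 0.33, and exact match is 0.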
Evaluating the Faithfulness of Importance Measures in NLP by Recursively Masking Allegedly Important Tokens and Retraining
Andreas Madsen
Nicholas Meade
Vaibhav Adlakha
To explain NLP models, a popular approach is to use importance measures, such as attention, which indicate which input tokens are important for making a prediction. However, an open question is how well these explanations accurately reflect a model's logic, a property called faithfulness. To answer this question, we propose Recursive ROAR, a new faithfulness metric. This works by recursively masking allegedly important tokens and then retraining the model. The principle is that this should result in worse model performance compared to masking random tokens. The result is a performance curve as a function of the masking ratio. Furthermore, we propose a summarizing metric using relative area-between-curves (RACU), which allows for easy comparison across papers, models, and tasks. We evaluate 4 different importance measures on 8 different datasets, using both LSTM-attention models and RoBERTa models. We find that the faithfulness of importance measures is both model-dependent and task-dependent. This conclusion contradicts previous evaluations in both computer vision and faithfulness of attention literature.
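The idea of summarizing two performance curves by the area between them can be sketched as follows. The abstract does not define the normalization, so the one used here (dividing by the area between the random-masking curve and a flat line at its final value) is an assumption, as are the function names and the numbers, which are invented for illustration and are not results from the paper.

```python
def trapezoid_area(xs, ys):
    """Area under a piecewise-linear curve using the trapezoidal rule."""
    area = 0.0
    for i in range(1, len(xs)):
        area += (xs[i] - xs[i - 1]) * (ys[i] + ys[i - 1]) / 2
    return area


def relative_area_between_curves(ratios, random_perf, measure_perf):
    """Area between the random-masking curve and the importance-measure curve,
    divided by the area between the random-masking curve and its final value.

    The normalization is a hypothetical choice, not the paper's definition.
    A larger value means masking by the importance measure hurts performance
    more than random masking, which is the behavior a faithful measure
    should show.
    """
    floor = random_perf[-1]
    between = trapezoid_area(
        ratios, [r - m for r, m in zip(random_perf, measure_perf)]
    )
    available = trapezoid_area(ratios, [r - floor for r in random_perf])
    return between / available if available else 0.0


# Made-up curves: model performance after retraining at each masking ratio.
masking_ratios = [0.0, 0.5, 1.0]
random_masking = [0.9, 0.8, 0.5]
importance_masking = [0.9, 0.6, 0.5]

score = relative_area_between_curves(
    masking_ratios, random_masking, importance_masking
)
print(f"relative area between curves: {score:.2f}")
```

Each point on a curve stands for one retrained model, so in practice the expensive part is producing the curves, not computing the summary.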
An Empirical Survey of the Effectiveness of Debiasing Techniques for Pre-trained Language Models
Nicholas Meade
Elinor Poole-Dayan
Recent work has shown pre-trained language models capture social biases from the large amounts of text they are trained on. This has attracted attention to developing techniques that mitigate such biases. In this work, we perform an empirical survey of five recently proposed bias mitigation techniques: Counterfactual Data Augmentation (CDA), Dropout, Iterative Nullspace Projection, Self-Debias, and SentenceDebias. We quantify the effectiveness of each technique using three intrinsic bias benchmarks while also measuring the impact of these techniques on a model’s language modeling ability, as well as its performance on downstream NLU tasks. We experimentally find that: (1) Self-Debias is the strongest debiasing technique, obtaining improved scores on all bias benchmarks; (2) current debiasing techniques perform less consistently when mitigating non-gender biases; and (3) improvements on bias benchmarks such as StereoSet and CrowS-Pairs by using debiasing strategies are often accompanied by a decrease in language modeling ability, making it difficult to determine whether the bias mitigation was effective.