Sasha Luccioni

Alumni

Publications

Open Problems in Technical AI Governance
Anka Reuel
Benjamin Bucknall
Stephen Casper
Timothy Fist
Lisa Soder
Onni Aarne
Lewis Hammond
Lujain Ibrahim
Peter Wills
Markus Anderljung
Ben Garfinkel
Lennart Heim
Andrew Trask
Gabriel Mukobi
Rylan Schaeffer
Mauricio Baker
Sara Hooker
Irene Solaiman
Alexandra Luccioni
Nicolas Moës
Jeffrey Ladish
David Bau
Paul Bricman
Neel Guha
Jessica Newman
Tobin South
Alex Pentland
Sanmi Koyejo
Mykel Kochenderfer
Robert Trager
AI progress is creating a growing range of risks and opportunities, but it is often unclear how they should be navigated. In many cases, the barriers and uncertainties faced are at least partly technical. Technical AI governance, referring to technical analysis and tools for supporting the effective governance of AI, seeks to address such challenges. It can help to (a) identify areas where intervention is needed, (b) identify and assess the efficacy of potential governance actions, and (c) enhance governance options by designing mechanisms for enforcement, incentivization, or compliance. In this paper, we explain what technical AI governance is, why it is important, and present a taxonomy and incomplete catalog of its open problems. This paper is intended as a resource for technical researchers or research funders looking to contribute to AI governance.
The Responsible Foundation Model Development Cheatsheet: A Review of Tools & Resources
Shayne Longpre
Stella Biderman
Alon Albalak
Hailey Schoelkopf
Daniel McDuff
Sayash Kapoor
Kevin Klyman
Kyle Lo
Gabriel Ilharco
Nay San
Maribeth Rauh
Aviya Skowron
Bertie Vidgen
Laura Weidinger
Arvind Narayanan
Victor Sanh
Percy Liang
Rishi Bommasani
Yacine Jernite
Luca Soldaini
Foundation model development attracts a rapidly expanding body of contributors, scientists, and applications. To help shape responsible development practices, we introduce the Foundation Model Development Cheatsheet: a growing collection of 250+ tools and resources spanning text, vision, and speech modalities. We draw on a large body of prior work to survey resources (e.g. software, documentation, frameworks, guides, and practical tools) that support informed data selection, processing, and understanding, precise and limitation-aware artifact documentation, efficient model training, advance awareness of the environmental impact from training, careful model evaluation of capabilities, risks, and claims, as well as responsible model release, licensing and deployment practices. We hope this curated collection of resources helps guide more responsible development. The process of curating this list enabled us to review the AI development ecosystem, revealing what tools are critically missing, misused, or over-used in existing practices. We find that (i) tools for data sourcing, model evaluation, and monitoring are critically under-serving ethical and real-world needs, (ii) evaluations for model safety, capabilities, and environmental impact all lack reproducibility and transparency, (iii) text and particularly English-centric analyses continue to dominate over multilingual and multi-modal analyses, and (iv) evaluation of systems, rather than just models, is needed so that capabilities and impact are assessed in context.
StarCoder: may the source be with you!
Loubna Ben Allal
Yangtian Zi
Niklas Muennighoff
Denis Kocetkov
Chenghao Mou
Marc Marone
Christopher Akiki
Jia LI
Jenny Chim
Qian Liu
Evgenii Zheltonozhskii
Terry Yue Zhuo
Thomas Wang
Olivier Dehaene
Mishig Davaadorj
Joel Lamy-Poirier
Joao Monteiro
Oleh Shliazhko
Nicolas Gontier
Armel Zebaze
Ming-Ho Yee
Logesh Kumar Umapathi
Jian Zhu
Ben Lipkin
Muhtasham Oblokulov
Zhiruo Wang
Rudra Murthy
Jason T Stillerman
Siva Sankalp Patel
Dmitry Abulkhanov
Marco Zocca
Zhihan Zhang
N. Fahmy
Urvashi Bhattacharyya
Wenhao Yu
Swayam Singh
Paulo Villegas
M. Kunakov
Jan Ebert
Fedor Zhdanov
Manuel Romero
Tony Lee
Nadav Timor
Jennifer Ding
Claire S Schlesinger
Hailey Schoelkopf
Jana Ebert
Tri Dao
Mayank Mishra
Alex Gu
Jennifer Robinson
Sean Hughes
Carolyn Jane Anderson
Brendan Dolan-Gavitt
Danish Contractor
Daniel Fried
Yacine Jernite
Carlos Muñoz Ferrandis
Sean M. Hughes
Thomas Wolf
Arjun Guha
Leandro Von Werra
The BigCode community, an open-scientific collaboration working on the responsible development of Large Language Models for Code (Code LLMs), introduces StarCoder and StarCoderBase: 15.5B parameter models with 8K context length, infilling capabilities and fast large-batch inference enabled by multi-query attention. StarCoderBase is trained on 1 trillion tokens sourced from The Stack, a large collection of permissively licensed GitHub repositories with inspection tools and an opt-out process. We fine-tuned StarCoderBase on 35B Python tokens, resulting in the creation of StarCoder. We perform the most comprehensive evaluation of Code LLMs to date and show that StarCoderBase outperforms every open Code LLM that supports multiple programming languages and matches or outperforms the OpenAI code-cushman-001 model. Furthermore, StarCoder outperforms every model that is fine-tuned on Python and still retains its performance on other programming languages. We take several important steps towards a safe open-access model release, including an improved PII redaction pipeline and a novel attribution tracing tool, and make the StarCoder models publicly available under a more commercially viable version of the Open Responsible AI Model license.
The BigScience ROOTS Corpus: A 1.6TB Composite Multilingual Dataset
Hugo Laurençon
Lucile Saulnier
Thomas Wang
Christopher Akiki
Albert Villanova del Moral
Teven Le Scao
Leandro Von Werra
Chenghao Mou
Eduardo González Ponferrada
Huu Nguyen
Jörg Frohberg
Mario Šaško
Quentin Lhoest
Angelina McMillan-Major
Gérard Dupont
Stella Biderman
Anna Rogers
Loubna Ben Allal
Francesco De Toni
Giada Pistilli
Olivier Nguyen
Somaieh Nikpoor
Maraim Masoud
Pierre Colombo
Javier de la Rosa
Paulo Villegas
Tristan Thrush
Shayne Longpre
Sebastian Nagel
Leon Weber
Manuel Romero Muñoz
Jian Zhu
Daniel Van Strien
Zaid Alyafeai
Khalid Almubarak
Vu Minh Chien
Itziar Gonzalez-Dios
Aitor Soroa
Kyle Lo
Pedro Ortiz Suarez
Aaron Gokaslan
Shamik Bose
Long Phan
Hieu Tran
Ian Yu
Suhas Pai
Jenny Chim
Violette Lepercq
Suzana Ilic
Margaret Mitchell
Yacine Jernite
As language models grow ever larger, the need for large-scale high-quality text datasets has never been more pressing, especially in multilingual settings. The BigScience workshop, a 1-year international and multidisciplinary initiative, was formed with the goal of researching and training large language models as a values-driven undertaking, putting issues of ethics, harm, and governance in the foreground. This paper documents the data creation and curation efforts undertaken by BigScience to assemble the Responsible Open-science Open-collaboration Text Sources (ROOTS) corpus, a 1.6TB dataset spanning 59 languages that was used to train the 176-billion-parameter BigScience Large Open-science Open-access Multilingual (BLOOM) language model. We further release a large initial subset of the corpus and analyses thereof, and hope to empower large-scale monolingual and multilingual modeling projects with both the data and the processing tools, as well as stimulate research around this large multilingual corpus.