ChainForge: A Visual Toolkit for Prompt Engineering and LLM Hypothesis Testing
Chelse Swoopes
Priyan Vaithilingam
Martin Wattenberg
Elena L. Glassman
Evaluating outputs of large language models (LLMs) is challenging, requiring one to generate, and make sense of, many responses. Yet tools that go beyond basic prompting tend to require knowledge of programming APIs, focus on narrow domains, or are closed-source. We present ChainForge, an open-source visual toolkit for prompt engineering and on-demand hypothesis testing of text-generation LLMs. ChainForge provides a graphical interface for comparing responses across models and prompt variations. Our system was designed to support three tasks: model selection, prompt template design, and hypothesis testing (e.g., auditing). We released ChainForge early in its development and iterated on its design with academics and online users. Through in-lab and interview studies, we find that a range of people could use ChainForge to investigate hypotheses that matter to them, including in real-world settings. We identify three modes of prompt engineering and LLM hypothesis testing: opportunistic exploration, limited evaluation, and iterative refinement.
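The core workflow the abstract describes, comparing responses across models and prompt-template variations, can be sketched in plain Python. This is an illustrative toy, not ChainForge's actual API: `query_model`, the model names, and the template variables are all hypothetical stand-ins.

```python
from itertools import product

def fill_template(template, **variables):
    """Instantiate one prompt variation from a template."""
    return template.format(**variables)

def query_model(model, prompt):
    """Hypothetical stand-in for a real LLM call."""
    return f"[{model}] response to: {prompt}"

template = "Summarize the following in a {style} style: {text}"
models = ["model-a", "model-b"]
styles = ["formal", "casual"]
texts = ["LLM evaluation is hard."]

# Build the full comparison grid: one cell per (model, prompt variation).
grid = {}
for model, style, text in product(models, styles, texts):
    prompt = fill_template(template, style=style, text=text)
    grid[(model, style)] = query_model(model, prompt)
```

Each cell of `grid` holds one response, ready for side-by-side inspection, which is the kind of cross-product comparison the toolkit's graphical interface automates.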
Designing and Evaluating Dialogue LLMs for Co-Creative Improvised Theatre
Boyd Branch
Piotr Mirowski
Sophia Ppali
Alexandra Covaci
Social robotics researchers are increasingly interested in multi-party conversational agents. With a growing demand for real-world evaluations, our study presents Large Language Models (LLMs) deployed in a month-long live show at the Edinburgh Festival Fringe. This case study investigates human improvisers co-creating with conversational agents in a professional theatre setting. We explore the technical capabilities and constraints of on-the-spot multi-party dialogue, providing comprehensive insights from both audience and performer experiences with AI on stage. Our human-in-the-loop methodology underlines the challenges these LLMs face in generating context-relevant responses, stressing the crucial role of the user interface. Audience feedback indicates a growing interest in AI-driven live entertainment and direct human-AI interaction, alongside a diverse range of expectations about AI's conversational competence and its utility as a creativity-support tool. Human performers expressed strong enthusiasm but varied satisfaction, and evolving public opinion reflects mixed feelings about AI's role in the arts.
DirectGPT: A Direct Manipulation Interface to Interact with Large Language Models
Sylvain Malacria
Géry Casiez
Daniel Vogel
How different mental models of AI-based writing assistants impact writers’ interactions with them
Shalaleh Rismani
Su Lin Blodgett
Q. Vera Liao
Calibration-free parallel transmission of the cervical, thoracic, and lumbar spinal cord at 7T.
Christoph S. Aigner
Manuel F. Sánchez Alarcon
Alexandre D'Astous
Eva Alonso‐Ortiz
Sebastian Schmitter
PURPOSE: To address the limitations of spinal cord imaging at ultra-high field (UHF) caused by time-consuming parallel transmit (pTx) adjustments. This study introduces calibration-free, offline-computed universal shim modes that can be applied seamlessly across different pTx RF coils and spinal cord target regions, substantially enhancing spinal cord imaging efficiency at UHF. METHODS: A library of channel-wise relative B1+
Exploring the digital divide: results of a survey informing mobile application development
Maira Corinne Claudio
Zachary Rehany
Katerina Stachtari
Elena Guadagno
Esli Osmanlliu
Introduction: Mobile health apps risk widening health disparities if they overlook digital inclusion. The digital divide, encompassing access, familiarity, and readiness, poses a significant barrier to medical interventions, and the existing literature offers little exploration of its contributing factors. Hence, data are needed to understand the challenges in developing inclusive health apps. Methods: We created a survey to gauge internet and smartphone access, smartphone familiarity, and readiness for using mobile health apps among caregivers of pediatric patients in tertiary care. Open-ended questions solicited feedback and suggestions on mobile health applications; responses were categorized by similarity and compared. Developed with patient partners, the survey underwent cognitive testing and piloting for accuracy. Results: Data from 209 respondents showed that 23% were affected by the digital divide, mainly due to unfamiliarity with digital skills. Among 49 short text responses about health app concerns, 31 mentioned security and confidentiality, and 7 mentioned the impersonal nature of such apps. Desired features included messaging healthcare providers, scheduling, task reminders, and simplicity. Conclusions: This study underscores a digital divide among caregivers of pediatric patients, with nearly a quarter affected, primarily due to a lack of digital comfort. Respondents emphasized user-friendliness and online security for health apps. Future apps should prioritize digital inclusion by addressing these barriers and carefully considering patient and family concerns.
Repeat it without me: Crowdsourcing the T1 mapping common ground via the ISMRM reproducibility challenge.
Mathieu Boudreau
Agah Karakuzu
Ecem Bozkurt
Madeline Carr
Marco Castellaro
Luis Concha
Mariya Doneva
Seraina A. Dual
Alex Ensworth
Alexandru Foias
Véronique Fortier
Refaat E. Gabr
Guillaume Gilbert
Carri K. Glide‐Hurst
Matthew Grech‐Sollars
Siyuan Hu
Oscar Jalnefjord
Jorge Jovicich
Kübra Keskin
Peter Koken
Anastasia Kolokotronis
Simran Kukran
Nam G. Lee
Ives R. Levesque
Bochao Li
Dan Ma
Burkhard Mädler
Nyasha G. Maforo
Jamie Near
Erick Pasaye
Alonso Ramirez‐Manzanares
Ben Statton
Christian Stehning
Stefano Tambalo
Ye Tian
Chenyang Wang
Kilian Weiss
Niloufar Zakariaei
Shuo Zhang
Ziwei Zhao
Nikola Stikov
PURPOSE: T1 mapping is a widely used quantitative MRI technique, but its tissue-specific values remain inconsistent across protocols, sites, and vendors. The ISMRM Reproducible Research and Quantitative MR study groups jointly launched a challenge to assess the reproducibility of a well-established inversion-recovery T1 mapping technique, using acquisition details from a seminal T1 mapping paper, on a standardized phantom and in human brains. METHODS: The challenge used the acquisition protocol from Barral et al. (2010). Researchers collected T1 mapping data on the ISMRM/NIST phantom and/or in human brains. Data submission, pipeline development, and analysis were conducted using open-source platforms. Intersubmission and intrasubmission comparisons were performed. RESULTS: Eighteen submissions (39 phantom and 56 human datasets), acquired on scanners from three MRI vendors, were collected, all at 3 T except one at 0.35 T. The mean coefficient of variation was 6.1% for intersubmission phantom measurements and 2.9% for intrasubmission measurements. In humans, the intersubmission/intrasubmission coefficients of variation were 5.9%/3.2% in the genu and 16%/6.9% in the cortex. An interactive dashboard for data visualization was also developed: https://rrsg2020.dashboards.neurolibre.org. CONCLUSION: The T1 intersubmission variability was twice as high as the intrasubmission variability in both phantoms and human brains, indicating that the acquisition details in the original paper were insufficient to reproduce a quantitative MRI protocol. This study reports the inherent uncertainty in T1 measures across independent research groups, bringing us one step closer to a practical clinical baseline for T1 variations in vivo.
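The abstract's headline numbers are coefficients of variation (CV), the standard deviation divided by the mean, expressed as a percentage. A minimal sketch of that metric, with made-up T1 values standing in for the real submissions (not the study's data or pipeline):

```python
import statistics

def coefficient_of_variation(values):
    """CV = sample standard deviation / mean, as a percentage."""
    return 100.0 * statistics.stdev(values) / statistics.mean(values)

# Hypothetical T1 estimates (ms) for one phantom sphere, one value per
# submitting site: spread across sites drives the intersubmission CV.
intersubmission = [830.0, 870.0, 905.0, 860.0, 845.0]
# Repeated measurements within a single submission are typically tighter.
intrasubmission = [858.0, 864.0, 861.0, 866.0, 859.0]

inter_cv = coefficient_of_variation(intersubmission)
intra_cv = coefficient_of_variation(intrasubmission)
```

The study's central observation is exactly this pattern: `inter_cv` (across independent groups) comes out larger than `intra_cv` (repeats within one group).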
Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems
David Dalrymple
Joar Max Viktor Skalse
Stuart Russell
Max Tegmark
Sanjit A. Seshia
Steve Omohundro
Christian Szegedy
Ben Goldhaber
Nora Ammann
Alessandro Abate
Joe Halpern
Clark Barrett
Ding Zhao
Zhi-Xuan Tan
Jeannette Wing
Joshua B. Tenenbaum
Ensuring that AI systems reliably and robustly avoid harmful or dangerous behaviours is a crucial challenge, especially for AI systems with a high degree of autonomy and general intelligence, or systems used in safety-critical contexts. In this paper, we introduce and define a family of approaches to AI safety, which we refer to as guaranteed safe (GS) AI. The core feature of these approaches is that they aim to produce AI systems equipped with high-assurance quantitative safety guarantees. This is achieved by the interplay of three core components: a world model (which provides a mathematical description of how the AI system affects the outside world), a safety specification (a mathematical description of what effects are acceptable), and a verifier (which provides an auditable proof certificate that the AI satisfies the safety specification relative to the world model). We outline a number of approaches for creating each of these three core components, describe the main technical challenges, and suggest potential solutions to them. We also argue for the necessity of this approach to AI safety, and for the inadequacy of the main alternative approaches.
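The three-component structure the abstract names can be made concrete with a deliberately tiny example. Everything below (the grid of states, the unsafe set, the exhaustive simulation) is an illustrative toy of the world model / safety specification / verifier triad, not a construction proposed in the paper.

```python
# World model: a transition system, state -> action -> next state.
WORLD = {
    "s0": {"a": "s1", "b": "s2"},
    "s1": {"a": "s1"},
    "s2": {"a": "s2"},
}

# Safety specification: states the system must never enter.
UNSAFE = {"s2"}

def verify(policy, start="s0", steps=5):
    """Verifier: simulate the policy against the world model and report
    whether it satisfies the spec (never reaches an unsafe state)."""
    state = start
    for _ in range(steps):
        action = policy(state)
        state = WORLD[state].get(action, state)
        if state in UNSAFE:
            return False
    return True
```

In this toy, `verify(lambda s: "a")` certifies the always-`a` policy safe, while `verify(lambda s: "b")` rejects a policy that steps into `s2`; the paper's verifiers target formal proof certificates rather than finite simulation, but the division of labour among the three components is the same.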