Ian Arawjo

2025-04-12

ArXiv (preprint)

Semantic Commit: Helping Users Update Intent Specifications for AI Memory at Scale

Priyan Vaithilingam

Munyeong Kim

Frida-Cecilia Acosta-Parenteau

Daniel Lee

Amine Mhedhbi

Elena L. Glassman

2025-04-01

arXiv (published)

Dynamic Abstractions: Building the Next Generation of Cognitive Tools and Interfaces

Sangho Suh

Hai Dang

Ryan Yen

Josh M. Pollock

Rubaiat Habib Kazi

Hariharan Subramonyam

Jingyi Li

Nazmus Saquib

Arvind Satyanarayan

2024-10-13

The 37th Annual ACM Symposium on User Interface Software and Technology (published)

ChainBuddy: An AI Agent System for Generating LLM Pipelines

Jingyue Zhang

As large language models (LLMs) advance, their potential applications have grown significantly. However, it remains difficult to evaluate LL… (see more)M behavior on user-specific tasks and craft effective pipelines to do so. Many users struggle with where to start, often referred to as the"blank page"problem. ChainBuddy, an AI assistant for generating evaluative LLM pipelines built into the ChainForge platform, aims to tackle this issue. ChainBuddy offers a straightforward and user-friendly way to plan and evaluate LLM behavior, making the process less daunting and more accessible across a wide range of possible tasks and use cases. We report a within-subjects user study comparing ChainBuddy to the baseline interface. We find that when using AI assistance, participants reported a less demanding workload and felt more confident setting up evaluation pipelines of LLM behavior. We derive insights for the future of interfaces that assist users in the open-ended evaluation of AI.

2024-09-20

ArXiv (preprint)

ChainBuddy: An AI-assisted Agent System for Generating LLM Pipelines

Jingyue Zhang

2024-09-20

ArXiv (preprint)

ChainBuddy: An AI-assisted Agent System for Generating LLM Pipelines

Jingyue Zhang

2024-09-20

ArXiv (preprint)

Imagining a Future of Designing with AI: Dynamic Grounding, Constructive Negotiation, and Sustainable Motivation

Priyan Vaithilingam

Elena L. Glassman

2024-07-01

Designing Interactive Systems Conference (published)

An AI-Resilient Text Rendering Technique for Reading and Skimming Documents

Ziwei Gu

Kenneth Li

Jonathan K. Kummerfeld

Elena L. Glassman

2024-05-11

Proceedings of the CHI Conference on Human Factors in Computing Systems (published)

ChainForge: A Visual Toolkit for Prompt Engineering and LLM Hypothesis Testing

Chelse Swoopes

Priyan Vaithilingam

Martin Wattenberg

Elena L. Glassman

Evaluating outputs of large language models (LLMs) is challenging, requiring making -- and making sense of -- many responses. Yet tools that… (see more) go beyond basic prompting tend to require knowledge of programming APIs, focus on narrow domains, or are closed-source. We present ChainForge, an open-source visual toolkit for prompt engineering and on-demand hypothesis testing of text generation LLMs. ChainForge provides a graphical interface for comparison of responses across models and prompt variations. Our system was designed to support three tasks: model selection, prompt template design, and hypothesis testing (e.g., auditing). We released ChainForge early in its development and iterated on its design with academics and online users. Through in-lab and interview studies, we find that a range of people could use ChainForge to investigate hypotheses that matter to them, including in real-world settings. We identify three modes of prompt engineering and LLM hypothesis testing: opportunistic exploration, limited evaluation, and iterative refinement.

2024-05-11

Proceedings of the CHI Conference on Human Factors in Computing Systems (published)

Schrödinger's Update: User Perceptions of Uncertainties in Proprietary Large Language Model Updates

Zilin Ma

Yiyang Mei

Krzysztof Z. Gajos

2024-05-02

CHI Extended Abstracts (published)

Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences

Shreya Shankar

J.D. Zamfirescu-Pereira

Bjorn Hartmann

Aditya G Parameswaran

Due to the cumbersome nature of human evaluation and limitations of code-based evaluation, Large Language Models (LLMs) are increasingly bei… (see more)ng used to assist humans in evaluating LLM outputs. Yet LLM-generated evaluators simply inherit all the problems of the LLMs they evaluate, requiring further human validation. We present a mixed-initiative approach to ``validate the validators'' -- aligning LLM-generated evaluation functions (be it prompts or code) with human requirements. Our interface, EvalGen, provides automated assistance to users in generating evaluation criteria and implementing assertions. While generating candidate implementations (Python functions, LLM grader prompts), EvalGen asks humans to grade a subset of LLM outputs; this feedback is used to select implementations that better align with user grades. A qualitative study finds overall support for EvalGen but underscores the subjectivity and iterative process of alignment. In particular, we identify a phenomenon we dub \emph{criteria drift}: users need criteria to grade outputs, but grading outputs helps users define criteria. What is more, some criteria appears \emph{dependent} on the specific LLM outputs observed (rather than independent criteria that can be defined \emph{a priori}), raising serious questions for approaches that assume the independence of evaluation from observation of model outputs. We present our interface and implementation details, a comparison of our algorithm with a baseline approach, and implications for the design of future LLM evaluation assistants.

2024-04-18

ArXiv (preprint)

Who Validates the Validators? Aligning LLM-Assisted Evaluation of LLM Outputs with Human Preferences

Shreya Shankar

J.D. Zamfirescu-Pereira

Bjorn Hartmann

Aditya G Parameswaran

Due to the cumbersome nature of human evaluation and limitations of code-based evaluation, Large Language Models (LLMs) are increasingly bei… (see more)ng used to assist humans in evaluating LLM outputs. Yet LLM-generated evaluators simply inherit all the problems of the LLMs they evaluate, requiring further human validation. We present a mixed-initiative approach to ``validate the validators'' -- aligning LLM-generated evaluation functions (be it prompts or code) with human requirements. Our interface, EvalGen, provides automated assistance to users in generating evaluation criteria and implementing assertions. While generating candidate implementations (Python functions, LLM grader prompts), EvalGen asks humans to grade a subset of LLM outputs; this feedback is used to select implementations that better align with user grades. A qualitative study finds overall support for EvalGen but underscores the subjectivity and iterative process of alignment. In particular, we identify a phenomenon we dub \emph{criteria drift}: users need criteria to grade outputs, but grading outputs helps users define criteria. What is more, some criteria appears \emph{dependent} on the specific LLM outputs observed (rather than independent criteria that can be defined \emph{a priori}), raising serious questions for approaches that assume the independence of evaluation from observation of model outputs. We present our interface and implementation details, a comparison of our algorithm with a baseline approach, and implications for the design of future LLM evaluation assistants.

2024-04-18

ArXiv (preprint)