Portrait of Chengzhi Mao

Chengzhi Mao

Core Academic Member
Assistant Professor, McGill University, Department of Electrical and Computer Engineering

Biography

Chengzhi Mao is an Assistant Professor in the Department of Electrical and Computer Engineering at McGill University and a Core Academic Member of Mila – Quebec Artificial Intelligence Institute. He completed his PhD in the Department of Computer Science at Columbia University, advised by Professors Carl Vondrick and Junfeng Yang. Before that, he earned a bachelor's degree in electrical engineering from Tsinghua University. His research focuses on machine learning and computer vision. By training machines to incorporate context through interaction and reasoning, he aims to build robust models capable of generalization. His other interests include language, robotics, causal inference, and human-centered artificial intelligence.

Publications

SelfIE: Self-Interpretation of Large Language Model Embeddings
Haozhe Chen
Carl Vondrick
How do large language models (LLMs) obtain their answers? The ability to explain and control an LLM's reasoning process is key for reliability, transparency, and future model development. We propose SelfIE (Self-Interpretation of Embeddings), a framework that enables LLMs to interpret their own embeddings in natural language by leveraging their ability to respond to inquiries about a given passage. Capable of interpreting open-world concepts in the hidden embeddings, SelfIE reveals LLM internal reasoning in cases such as making ethical decisions, internalizing prompt injection, and recalling harmful knowledge. SelfIE's text descriptions of hidden embeddings also open up new avenues for controlling LLM reasoning. We propose Supervised Control, which allows editing open-ended concepts while only requiring gradient computation of individual layers. We also extend RLHF to hidden embeddings and propose Reinforcement Control, which erases harmful knowledge from an LLM without supervision targets.
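As a rough sketch of the self-interpretation idea (not the SelfIE implementation), the snippet below patches a hidden state captured from one forward pass into a placeholder position of an interpretation prompt and lets the model describe it in words. The gpt2 checkpoint, prompt wording, layer index, and placeholder token are all illustrative assumptions.

```python
# A rough sketch of self-interpretation (not the SelfIE implementation):
# patch a captured hidden state into the embedding of a placeholder token
# inside an interpretation prompt, then let the model describe it in words.
# Model, prompt, layer index, and placeholder token are illustrative choices.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small stand-in model for illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

@torch.no_grad()
def interpret_embedding(hidden_vec: torch.Tensor) -> str:
    """Ask the model to describe a single hidden-state vector in words."""
    prompt = 'The passage " X " means:'  # " X" marks the injection slot
    ids = tok(prompt, return_tensors="pt").input_ids
    slot = (ids[0] == tok.encode(" X")[0]).nonzero()[0].item()

    # Replace the placeholder's input embedding with the captured hidden state.
    embeds = model.get_input_embeddings()(ids)
    embeds[0, slot] = hidden_vec.to(embeds.dtype)

    out = model.generate(inputs_embeds=embeds, max_new_tokens=20,
                         pad_token_id=tok.eos_token_id)
    return tok.decode(out[0], skip_special_tokens=True)

# Capture a hidden state from an ordinary forward pass, then interpret it.
with torch.no_grad():
    src = tok("Paris is the capital of France", return_tensors="pt")
    hidden = model(**src, output_hidden_states=True).hidden_states[6][0, -1]
print(interpret_embedding(hidden))
```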
Towards Causal Deep Learning for Vulnerability Detection
Md Mahbubur Rahman
Ira Ceka
Saikat Chakraborty
Baishakhi Ray
Wei Le
Deep learning vulnerability detection has shown promising results in recent years. However, an important challenge that still blocks it from being very useful in practice is that the model is not robust under perturbation and cannot generalize well to out-of-distribution (OOD) data, e.g., when a trained model is applied to unseen projects in the real world. We hypothesize that this is because the model has learned non-robust features, e.g., variable names, that have spurious correlations with labels. When the perturbed and OOD datasets no longer share the same spurious features, the model's predictions fail. To address this challenge, in this paper we introduce causality into deep learning vulnerability detection. Our approach, CausalVul, consists of two phases. First, we design novel perturbations to discover spurious features that the model may use to make predictions. Second, we apply causal learning algorithms, specifically do-calculus, on top of existing deep learning models to systematically remove the use of spurious features and thus promote causality-based prediction. Our results show that CausalVul consistently improved model accuracy, robustness, and OOD performance for all the state-of-the-art models and datasets we experimented with. To the best of our knowledge, this is the first work to introduce do-calculus-based causal learning to software engineering models and show that it is indeed useful for improving model accuracy, robustness, and generalization. Our replication package is located at https://figshare.com/s/0ffda320dcb96c249ef2.
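As a minimal illustration of the kind of semantics-preserving perturbation described above (not CausalVul's actual machinery), one can consistently rename variables in a code sample and check whether a detector's prediction flips. The regex-based renaming and the `predict` callable below are assumed placeholders.

```python
# A minimal sketch of a semantics-preserving perturbation for probing spurious
# features (not CausalVul's actual machinery): consistently rename variables in
# a code sample and check whether a vulnerability detector changes its label.
# The regex-based renaming and the `predict` callable are assumed placeholders.
import re
from typing import Callable, Dict

def rename_variables(code: str, mapping: Dict[str, str]) -> str:
    """Apply a consistent identifier renaming (whole-word matches only)."""
    for old, new in mapping.items():
        code = re.sub(rf"\b{re.escape(old)}\b", new, code)
    return code

def prediction_is_robust(code: str, mapping: Dict[str, str],
                         predict: Callable[[str], int]) -> bool:
    # A detector that relies on causal features rather than variable names
    # should keep the same label under this renaming.
    return predict(code) == predict(rename_variables(code, mapping))
```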
INViTE: INterpret and Control Vision-Language Models with Text Explanations
Haozhe Chen
Junfeng Yang
Carl Vondrick
Large-scale pre-trained vision foundation models, such as CLIP, have become de facto backbones for various vision tasks. However, due to their black-box nature, understanding the underlying rules behind these models' predictions and controlling model behaviors have remained open challenges. We present INViTE: a framework for INterpreting Vision Transformer's latent tokens with Text Explanations. Given a latent token, INViTE retains its semantic information up to the final layer using the transformer's local operations and retrieves the closest text for explanation. INViTE enables understanding of the model's visual reasoning procedure without any additional model training or data collection. Based on the obtained interpretations, INViTE allows for model editing that controls the model's reasoning behaviors and improves robustness against biases and spurious correlations. Our code is available at https://github.com/tonychenxyz/vit-interpret.
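The sketch below illustrates only the generic idea of reading out an intermediate vision-transformer token against text embeddings, not the INViTE procedure itself; the checkpoint, layer, token index, random input, and candidate captions are arbitrary placeholders.

```python
# A rough sketch (not the INViTE method): read out one intermediate ViT token
# of CLIP by mapping it through the final layer norm and projection, then
# retrieve the closest caption in the text embedding space. Checkpoint, layer,
# token index, and captions are arbitrary placeholders.
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

captions = ["a photo of a dog", "a photo of a car", "a photo of a tree"]
pixel_values = torch.randn(1, 3, 224, 224)  # stand-in for a real image

with torch.no_grad():
    vision_out = model.vision_model(pixel_values=pixel_values,
                                    output_hidden_states=True)
    token = vision_out.hidden_states[8][0, 10]  # one latent token, layer 8
    # Treat the latent token as if it were the final [CLS] token.
    feat = model.visual_projection(model.vision_model.post_layernorm(token))

    text_inputs = processor(text=captions, return_tensors="pt", padding=True)
    text_feats = model.get_text_features(**text_inputs)
    sims = F.cosine_similarity(feat.unsqueeze(0), text_feats, dim=-1)

print("closest text:", captions[sims.argmax().item()])
```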
Raidar: geneRative AI Detection viA Rewriting
Carl Vondrick
Hao Wang
Junfeng Yang
We find that large language models (LLMs) are more likely to modify human-written text than AI-generated text when tasked with rewriting. This tendency arises because LLMs often perceive AI-generated text as high quality, leading to fewer modifications. We introduce a method to detect AI-generated content by prompting LLMs to rewrite text and calculating the editing distance of the output. We dub our geneRative AI Detection viA Rewriting method Raidar. Raidar significantly improves the F1 detection scores of existing AI-content detection models -- both academic and commercial -- across various domains, including news, creative writing, student essays, code, Yelp reviews, and arXiv papers, with gains of up to 29 points. Operating solely on word symbols without high-dimensional features, our method is compatible with black-box LLMs and is inherently robust to new content. Our results illustrate the unique imprint of machine-generated text through the lens of the machines themselves.
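A minimal sketch of the rewrite-then-measure idea, assuming a generic black-box rewrite call: the prompt wording, the difflib similarity ratio (standing in for a formal edit distance), and the decision threshold are illustrative choices, not the paper's recipe.

```python
# A minimal sketch of rewrite-based AI-text detection in the spirit of Raidar.
# The prompt wording, the similarity measure (difflib ratio instead of a formal
# edit distance), and the decision threshold are illustrative assumptions,
# not the paper's exact recipe.
from difflib import SequenceMatcher
from typing import Callable

REWRITE_PROMPT = "Rewrite the following text and keep its meaning:\n\n"  # assumed prompt

def rewrite_similarity(text: str, rewrite: Callable[[str], str]) -> float:
    """Return a similarity score between the input and its LLM rewrite.

    `rewrite` is any black-box LLM call mapping a prompt to a completion;
    higher similarity means the model changed little, which the rewrite-based
    detection idea associates with machine-generated input.
    """
    rewritten = rewrite(REWRITE_PROMPT + text)
    return SequenceMatcher(None, text, rewritten).ratio()

def looks_ai_generated(text: str, rewrite: Callable[[str], str],
                       threshold: float = 0.85) -> bool:
    # Threshold chosen arbitrarily for illustration; in practice it would be
    # calibrated on labeled human-written and AI-generated text.
    return rewrite_similarity(text, rewrite) >= threshold
```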
Interpreting and Controlling Vision Foundation Models via Text Explanations
Haozhe Chen
Junfeng Yang
Carl Vondrick
Robust Perception through Equivariance
Lingyu Zhang
Abhishek Vaibhav Joshi
Junfeng Yang
Hao Wang
Carl Vondrick
Doubly Right Object Recognition: A Why Prompt for Visual Rationales
Revant Teotia
Amrutha Sundar
Sachit Menon
Junfeng Yang
Xin Wang
Carl Vondrick
Many visual recognition models are evaluated only on their classification accuracy, a metric for which they obtain strong performance. In this paper, we investigate whether computer vision models can also provide correct rationales for their predictions. We propose a “doubly right” object recognition benchmark, where the metric requires the model to simultaneously produce both the right labels and the right rationales. We find that state-of-the-art visual models, such as CLIP, often provide incorrect rationales for their categorical predictions. However, by transferring the rationales from language models into visual representations through a tailored dataset, we show that we can learn a “why prompt,” which adapts large visual representations to produce correct rationales. Visualizations and empirical experiments show that our prompts significantly improve performance on doubly right object recognition, in addition to zero-shot transfer to unseen tasks and datasets.
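For concreteness, here is a minimal sketch of a doubly-right style metric, where a prediction counts only if both the label and the rationale match; the data structures are assumed for illustration.

```python
# A minimal sketch of a "doubly right" style metric: a prediction counts only
# when both the predicted label and the predicted rationale are correct.
# The data structures are assumed for illustration.
from dataclasses import dataclass
from typing import Sequence

@dataclass
class LabeledExample:
    label: str
    rationale: str

def doubly_right_accuracy(preds: Sequence[LabeledExample],
                          gold: Sequence[LabeledExample]) -> float:
    hits = sum(p.label == g.label and p.rationale == g.rationale
               for p, g in zip(preds, gold))
    return hits / len(gold)

# Example: a right label with a wrong rationale does not count.
preds = [LabeledExample("zebra", "it has stripes"),
         LabeledExample("zebra", "it has a trunk")]
gold = [LabeledExample("zebra", "it has stripes"),
        LabeledExample("zebra", "it has stripes")]
print(doubly_right_accuracy(preds, gold))  # 0.5
```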
Robustifying Language Models with Test-Time Adaptation
Noah Thomas McDermott
Junfeng Yang
Large-scale language models have achieved state-of-the-art performance on a number of language tasks. However, they fail on adversarial language examples, which are sentences optimized to fool the language models while carrying similar semantic meanings for humans. While prior work focuses on making the language model robust at training time, retraining for robustness is often unrealistic for large-scale foundation models. Instead, we propose to make language models robust at test time. By dynamically adapting the input sentence with predictions from masked words, we show that we can reverse many language adversarial attacks. Since our approach does not require any training, it works for novel tasks at test time and can adapt to novel adversarial corruptions. Visualizations and empirical results on two popular sentence classification datasets demonstrate that our method can repair adversarial language attacks over 65% of the time.
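A minimal sketch of the masked-word repair idea, assuming a BERT fill-mask model, whole-word masking, and a simple keep-or-replace rule; none of these specific choices are claimed to be the paper's.

```python
# A minimal sketch of masked-word repair at test time, assuming a BERT
# fill-mask model, whole-word masking, and a simple keep-or-replace rule.
# None of these specific choices are claimed to be the paper's.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

def repair_sentence(sentence: str) -> str:
    """Re-predict each word in turn; replace it only if the masked language
    model no longer finds the original word plausible at that position."""
    words = sentence.split()
    repaired = list(words)
    for i in range(len(words)):
        masked = " ".join(words[:i] + [fill_mask.tokenizer.mask_token] + words[i + 1:])
        candidates = [p["token_str"] for p in fill_mask(masked, top_k=5)]
        if words[i].lower() not in {c.lower() for c in candidates}:
            repaired[i] = candidates[0]
    return " ".join(repaired)

# The repaired sentence is then fed to the downstream classifier in place of
# the (possibly adversarially perturbed) original input.
```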
Understanding Zero-shot Adversarial Robustness for Large-Scale Models
Scott Geng
Junfeng Yang
Xin Wang
Carl Vondrick
Pretrained large-scale vision-language models like CLIP have exhibited strong generalization over unseen tasks. Yet imperceptible adversarial perturbations can significantly reduce CLIP's performance on new tasks. In this work, we identify and explore the problem of adapting large-scale models for zero-shot adversarial robustness. We first identify two key factors during model adaptation -- training losses and adaptation methods -- that affect the model's zero-shot adversarial robustness. We then propose a text-guided contrastive adversarial training loss, which aligns the text embeddings and the adversarial visual features with contrastive learning on a small set of training data. We apply this training loss to two adaptation methods: model finetuning and visual prompt tuning. We find that visual prompt tuning is more effective in the absence of text guidance, while finetuning wins when text guidance is available. Overall, our approach significantly improves the zero-shot adversarial robustness over CLIP, seeing an average improvement of 31 points over ImageNet and 15 zero-shot datasets. We hope this work can shed light on understanding the zero-shot adversarial robustness of large-scale models.
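A rough sketch of what text-guided contrastive adversarial training can look like: an inner PGD step that breaks image-text alignment, followed by an outer step that restores it. The encoders, InfoNCE form, step sizes, and epsilon budget below are generic placeholders, not the paper's settings.

```python
# A rough sketch of text-guided contrastive adversarial training (generic
# placeholders, not the paper's settings): an inner PGD loop perturbs the
# images to break image-text alignment, and an outer step would then minimize
# the same contrastive loss on the adversarial images.
import torch
import torch.nn.functional as F

def contrastive_loss(img_feat, txt_feat, tau=0.07):
    # Symmetric InfoNCE over a batch of matched image/text pairs.
    img_feat = F.normalize(img_feat, dim=-1)
    txt_feat = F.normalize(txt_feat, dim=-1)
    logits = img_feat @ txt_feat.t() / tau
    labels = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2

def adversarial_images(image_encoder, txt_feat, images, eps=4/255, alpha=1/255, steps=3):
    """Inner PGD loop: maximize the contrastive loss w.r.t. the pixels."""
    delta = torch.zeros_like(images, requires_grad=True)
    for _ in range(steps):
        loss = contrastive_loss(image_encoder(images + delta), txt_feat)
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()
            delta.clamp_(-eps, eps)
            delta.grad.zero_()
    return (images + delta).detach()

# Outer step (per batch): adv = adversarial_images(...); zero the model grads,
# then minimize contrastive_loss(image_encoder(adv), txt_feat) with respect to
# the parameters being tuned (e.g., a visual prompt or the encoder itself).
```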
Test-time Defense against Adversarial Attacks: Detection and Reconstruction of Adversarial Examples via Masked Autoencoder
Yun-Yun Tsai
Ju-Chin Chao
Albert Wen
Zhaoyuan Yang
Tapan Shah
Junfeng Yang
Existing defense methods against adversarial attacks can be categorized into training-time and test-time defenses. Training-time defense, i.e., adversarial training, requires a significant amount of extra training time and often cannot generalize to unseen attacks. On the other hand, test-time defense via test-time weight adaptation requires access to perform gradient descent on (part of) the model weights, which could be infeasible for models with frozen weights. To address these challenges, we propose DRAM, a novel defense method to Detect and Reconstruct multiple types of Adversarial attacks via a Masked autoencoder (MAE). We demonstrate how to use MAE losses to build a KS-test that detects adversarial attacks. Moreover, the MAE losses can be used to repair adversarial samples from unseen attack types. In this sense, DRAM neither requires model weight updates at test time nor augments the training set with more adversarial samples. Evaluating DRAM on the large-scale ImageNet data, we achieve the best detection rate, 82% on average across eight types of adversarial attacks, compared with other detection baselines. For reconstruction, DRAM improves the robust accuracy by 6% ∼ 41% for Standard ResNet50 and 3% ∼ 8% for Robust ResNet50, compared with other self-supervision tasks such as rotation prediction and contrastive learning.
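A minimal sketch of the detection side, assuming a precomputed set of clean-data MAE losses and a generic `mae_loss` callable; the two-sample KS test and significance level below are illustrative, not the paper's exact configuration.

```python
# A minimal sketch of the detection side: compare the distribution of MAE
# reconstruction losses on incoming test images against losses recorded on
# known-clean data with a two-sample Kolmogorov-Smirnov test. The `mae_loss`
# callable and the significance level are illustrative assumptions, not the
# paper's exact configuration.
from typing import Callable, Sequence
import numpy as np
from scipy.stats import ks_2samp

def batch_looks_adversarial(
    test_images: Sequence,                # incoming test batch
    clean_losses: np.ndarray,             # MAE losses precomputed on clean data
    mae_loss: Callable[[object], float],  # MAE reconstruction loss of one image
    alpha: float = 0.05,                  # assumed significance level
) -> bool:
    """Flag the batch if its loss distribution differs significantly
    from the clean reference distribution."""
    test_losses = np.array([mae_loss(x) for x in test_images])
    result = ks_2samp(test_losses, clean_losses)
    return result.pvalue < alpha
```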