Rishika Bhagwatkar

Indirect Prompt Injections: Are Firewalls All You Need, or Stronger Benchmarks?

Kevin Kasa

Graham W. Taylor

Krishnamurthy Dj Dvijotham

Alexandre Lacoste

AI agents are vulnerable to indirect prompt injection attacks, where malicious instructions embedded in external content or tool outputs cau… (voir plus)se unintended or harmful behavior. Inspired by the well-established concept of firewalls, we show that a simple, modular and model-agnostic defense operating at the agent--tool interface achieves perfect security (0% or the lowest possible attack success rate) with high utility (task success rate) across four public benchmarks: AgentDojo, Agent Security Bench, InjecAgent and tau-Bench, while achieving a state-of-the-art security-utility tradeoff compared to prior results. Specifically, we employ a defense based on two firewalls: a Tool-Input Firewall (Minimizer) and a Tool-Output Firewall (Sanitizer). Unlike prior complex approaches, this firewall defense makes minimal assumptions on the agent and can be deployed out-of-the-box, while maintaining strong performance without compromising utility. However, our analysis also reveals critical limitations in these existing benchmarks, including flawed success metrics, implementation bugs, and most importantly, weak attacks, hindering significant progress in the field. To foster more meaningful progress, we present targeted fixes to these issues for AgentDojo and Agent Security Bench while proposing best-practices for more robust benchmark design. Further, we demonstrate that although these firewalls push the state-of-the-art on existing benchmarks, it is still possible to bypass them in practice, underscoring the need to incorporate stronger attacks in security benchmarks. Overall, our work shows that existing agentic security benchmarks are easily saturated by a simple approach and highlights the need for stronger agentic security benchmarks with carefully chosen evaluation metrics and strong adaptive attacks.

2025-10-06

ArXiv (prépublication)

A Guide to Robust Generalization: The Impact of Architecture, Pre-training, and Optimization Strategy

Maxime Heuillet

Jonas Ngnawe

Yann Batiste Pequignot

Ola Ahmad

Deep learning models operating in the image domain are vulnerable to small input perturbations. For years, robustness to such perturbations … (voir plus)was pursued by training models from scratch (i.e., with random initializations) using specialized loss objectives. Recently, robust fine-tuning has emerged as a more efficient alternative: instead of training from scratch, pretrained models are adapted to maximize predictive performance and robustness. To conduct robust fine-tuning, practitioners design an optimization strategy that includes the model update protocol (e.g., full or partial) and the specialized loss objective. Additional design choices include the architecture type and size, and the pretrained representation. These design choices affect robust generalization, which is the model's ability to maintain performance when exposed to new and unseen perturbations at test time. Understanding how these design choices influence generalization remains an open question with significant practical implications. In response, we present an empirical study spanning 6 datasets, 40 pretrained architectures, 2 specialized losses, and 3 adaptation protocols, yielding 1,440 training configurations and 7,200 robustness measurements across five perturbation types. To our knowledge, this is the most diverse and comprehensive benchmark of robust fine-tuning to date. While attention-based architectures and robust pretrained representations are increasingly popular, we find that convolutional neural networks pretrained in a supervised manner on large datasets often perform best. Our analysis both confirms and challenges prior design assumptions, highlighting promising research directions and offering practical guidance.

2025-09-29

NeurIPS.cc/2025/Workshop/Reliable_ML (publié)

A Guide to Robust Generalization: The Impact of Architecture, Pre-training, and Optimization Strategy

Maxime Heuillet

Jonas Ngnawe

Yann Batiste Pequignot

Ola Ahmad

Deep learning models operating in the image domain are vulnerable to small input perturbations. For years, robustness to such perturbations … (voir plus)was pursued by training models from scratch (i.e., with random initializations) using specialized loss ob- jectives. Recently, robust fine-tuning has emerged as a more efficient alternative: instead of training from scratch, pretrained models are adapted to maximize pre- dictive performance and robustness. To conduct robust fine-tuning, practitioners design an optimization strategy that includes the model update protocol (e.g., full or partial) and the specialized loss objective. Additional design choices include the architecture type and size, and the pretrained representation. These design choices affect robust generalization, which is the model’s ability to maintain performance when exposed to new and unseen perturbations at test time. Understanding how these design choices influence generalization remains an open question with signif- icant practical implications. In response, we present an empirical study spanning 6 datasets, 40 pretrained architectures, 2 specialized losses, and 3 adaptation proto- cols — yielding 1, 440 training configurations and 7, 200 robustness measurements across five perturbation types. To our knowledge, this is the most diverse and comprehensive benchmark of robust fine-tuning to date. While attention-based architectures and robust pretrained representations are increasingly popular, we find that convolutional neural networks pretrained in a supervised manner on large datasets often perform best. Our analysis both confirms and challenges prior design assumptions, highlighting promising research directions and offering practical guidance.

2025-09-29

NeurIPS.cc/2025/Workshop/Reliable_ML (publié)

A Guide to Robust Generalization: The Impact of Architecture, Pre-training, and Optimization Strategy

Maxime Heuillet

Jonas Ngnawe

Yann Batiste Pequignot

Ola Ahmad

Deep learning models operating in the image domain are vulnerable to small input perturbations. For years, robustness to such perturbations … (voir plus)was pursued by training models from scratch (i.e., with random initializations) using specialized loss objectives. Recently, robust fine-tuning has emerged as a more efficient alternative: instead of training from scratch, pretrained models are adapted to maximize predictive performance and robustness. To conduct robust fine-tuning, practitioners design an optimization strategy that includes the model update protocol (e.g., full or partial) and the specialized loss objective. Additional design choices include the architecture type and size, and the pretrained representation. These design choices affect robust generalization, which is the model's ability to maintain performance when exposed to new and unseen perturbations at test time. Understanding how these design choices influence generalization remains an open question with significant practical implications. In response, we present an empirical study spanning 6 datasets, 40 pretrained architectures, 2 specialized losses, and 3 adaptation protocols, yielding 1,440 training configurations and 7,200 robustness measurements across five perturbation types. To our knowledge, this is the most diverse and comprehensive benchmark of robust fine-tuning to date. While attention-based architectures and robust pretrained representations are increasingly popular, we find that convolutional neural networks pretrained in a supervised manner on large datasets often perform best. Our analysis both confirms and challenges prior design assumptions, highlighting promising research directions and offering practical guidance.

2025-08-12

ArXiv (prépublication)

Towards Adversarially Robust Vision-Language Models: Insights from Design Choices and Prompt Formatting Techniques

Daniel Z Kaplan

Vision-Language Models (VLMs) have witnessed a surge in both research and real-world applications. However, as they becoming increasingly pr… (voir plus)evalent, ensuring their robustness against adversarial attacks is paramount. This work systematically investigates the impact of model design choices on the adversarial robustness of VLMs against image-based attacks. Additionally, we introduce novel, cost-effective approaches to enhance robustness through prompt formatting. By rephrasing questions and suggesting potential adversarial perturbations, we demonstrate substantial improvements in model robustness against strong image-based attacks such as Auto-PGD. Our findings provide important guidelines for developing more robust VLMs, particularly for deployment in safety-critical environments.

2024-06-28

ICML.cc/2024/Workshop/NextGenAISafety (poster)

Improving Adversarial Robustness in Vision-Language Models with Architecture and Prompt Design.

2024-01-01

EMNLP (Findings) (publié)

Lag-Llama: Towards Foundation Models for Time Series Forecasting

Kashif Rasul

Andrew Robert Williams

Marin Biloš

Hena Ghonia

Nadhir Hassen

Anderson Schneider

Sahil Garg

Yuriy Nevmyvaka

Aiming to build foundation models for time-series forecasting and study their scaling behavior, we present here our work-in-progress on Lag-… (voir plus)Llama, a general-purpose univariate probabilistic time-series forecasting model trained on a large collection of time-series data. The model shows good zero-shot prediction capabilities on unseen "out-of-distribution" time-series datasets, outperforming supervised baselines. We use smoothly broken power-laws to fit and predict model scaling behavior. The open source code is made available at https://github.com/kashif/pytorch-transformer-ts.

2023-11-01

NeurIPS.cc/2023/Workshop/R0-FoMo (poster)

Lag-Llama: Towards Foundation Models for Probabilistic Time Series Forecasting

Kashif Rasul

Andrew Robert Williams

Marin Bilovs

Hena Ghonia

Nadhir Hassen

Anderson Schneider

Sahil Garg

Yuriy Nevmyvaka

2023-10-12

ArXiv (prépublication)

Lag-Llama: Towards Foundation Models for Probabilistic Time Series Forecasting

Kashif Rasul

Andrew Robert Williams

Marin Bilovs

Hena Ghonia

N. Hassen

Anderson Schneider

Sahil Garg

Yuriy Nevmyvaka

Over the past years, foundation models have caused a paradigm shift in machine learning due to their unprecedented capabilities for zero-sho… (voir plus)t and few-shot generalization. However, despite the success of foundation models in modalities such as natural language processing and computer vision, the development of foundation models for time series forecasting has lagged behind. We present Lag-Llama, a general-purpose foundation model for univariate probabilistic time series forecasting based on a decoder-only transformer architecture that uses lags as covariates. Lag-Llama is pretrained on a large corpus of diverse time series data from several domains, and demonstrates strong zero-shot generalization capabilities compared to a wide range of forecasting models on downstream datasets across domains. Moreover, when fine-tuned on relatively small fractions of such previously unseen datasets, Lag-Llama achieves state-of-the-art performance, outperforming prior deep learning approaches, emerging as the best general-purpose model on average. Lag-Llama serves as a strong contender to the current state-of-art in time series forecasting and paves the way for future advancements in foundation models tailored to time series data.

2023-10-12

ArXiv (prépublication)

Lag-Llama: Towards Foundation Models for Time Series Forecasting

Kashif Rasul

Andrew Robert Williams

Marin Biloš

Hena Ghonia

N. Hassen

Anderson Schneider

Sahil Garg

Yuriy Nevmyvaka

Aiming to build foundation models for time-series forecasting and study their scaling behavior, we present here our work-in-progress on Lag-… (voir plus)Llama , a general-purpose univariate probabilistic time-series forecasting model trained on a large collection of time-series data. The model shows good zero-shot prediction capabilities on unseen “out-of-distribution” time-series datasets, outperforming supervised baselines. We use smoothly broken power-laws [7] to fit and predict model scaling behavior. The open source code is made available at https://github

2023-01-01

arXiv.org (prépublication)