George Adamopoulos

Jonathan Siu Chi Lim

Quentin Gregory Anthony

Edwin Fennell

The proliferation of Vision-Language Models (VLMs) in the past several years calls for rigorous and comprehensive evaluation methods and ben… (voir plus)chmarks. This work analyzes existing VLM evaluation techniques, including automated metrics, AI-based assessments, and human evaluations across diverse tasks. We first introduce Robin - a novel suite of VLMs that we built by combining Large Language Models (LLMs) and Vision Encoders (VEs) at multiple scales, and use Robin to identify shortcomings of current evaluation approaches across scales. Next, to overcome the identified limitations, we introduce CHIRP - a new long form response benchmark we developed for more robust and complete VLM evaluation. We provide open access to the Robin training code, model suite, and CHIRP benchmark to promote reproducibility and advance VLM research.

2025-01-16

ArXiv (prépublication)

CHIRP: A Fine-Grained Benchmark for Open-Ended Response Evaluation in Vision-Language Models

Alexis Roger

Prateek Humane

Daniel Z Kaplan

Kshitij Gupta

Qirui Sun

Jonathan Siu Chi Lim

Quentin Gregory Anthony

Edwin Fennell

The proliferation of Vision-Language Models (VLMs) in the past several years calls for rigorous and comprehensive evaluation methods and ben… (voir plus)chmarks. This work analyzes existing VLM evaluation techniques, including automated metrics, AI-based assessments, and human evaluations across diverse tasks. We first introduce Robin - a novel suite of VLMs that we built by combining Large Language Models (LLMs) and Vision Encoders (VEs) at multiple scales, and use Robin to identify shortcomings of current evaluation approaches across scales. Next, to overcome the identified limitations, we introduce CHIRP - a new long form response benchmark we developed for more robust and complete VLM evaluation. We provide open access to the Robin training code, model suite, and CHIRP benchmark to promote reproducibility and advance VLM research.

2025-01-16

ArXiv (prépublication)

Robin: a Suite of Multi-Scale Vision-Language Models and the CHIRP Evaluation Benchmark

Alexis Roger

Prateek Humane

Daniel Z Kaplan

Kshitij Gupta

Qirui Sun

Jonathan Siu Chi Lim

Quentin Gregory Anthony

Edwin Fennell

The proliferation of Vision-Language Models (VLMs) in the past several years calls for rigorous and comprehensive evaluation methods and ben… (voir plus)chmarks. This work analyzes existing VLM evaluation techniques, including automated metrics, AI-based assessments, and human evaluations across diverse tasks. We first introduce Robin - a novel suite of VLMs that we built by combining Large Language Models (LLMs) and Vision Encoders (VEs) at multiple scales, and use Robin to identify shortcomings of current evaluation approaches across scales. Next, to overcome the identified limitations, we introduce CHIRP - a new long form response benchmark we developed for more robust and complete VLM evaluation. We provide open access to the Robin training code, model suite, and CHIRP benchmark to promote reproducibility and advance VLM research.

2025-01-16

ArXiv (prépublication)

Interpretability in Action: Exploratory Analysis of VPT, a Minecraft Agent

Karolis Jucys

Mehrab Hamidi

Stephanie Milani

Mohammad Reza Samsami

Özgür Şimşek

Understanding the mechanisms behind decisions taken by large foundation models in sequential tasks is critical to ensuring that such systems… (voir plus) operate transparently and safely. However, interpretability methods have not yet been applied extensively to large-scale agents based on reinforcement learning. In this work, we perform exploratory analysis on the Video PreTraining (VPT) Minecraft playing agent, one of the largest open-source vision-based agents. We try to illuminate its reasoning mechanisms by applying various interpretability techniques. First, we analyze the attention mechanism while the agent solves its training task --- crafting a diamond pickaxe. The agent seems to pay attention to the 4 last frames and several key-frames further back. This provides clues as to how it maintains coherence in the task that takes 3-10 minutes, despite the agent's short memory span of only six seconds. Second, we perform various interventions, which help us uncover a worrying case of goal misgeneralization: VPT mistakenly identifies a villager wearing brown clothes as a tree trunk and punches it to death, when positioned stationary under green tree leaves. We demonstrate similar misbehavior in a related agent (STEVE-1), which motivates the use of VPT as a model organism for large-scale vision-based agent interpretability.

2024-06-24

ICML.cc/2024/Workshop/MI (poster)

openreview.net

Lag-Llama: Towards Foundation Models for Time Series Forecasting

Kashif Rasul

Andrew Robert Williams

Marin Biloš

Hena Ghonia

Nadhir Hassen

Anderson Schneider

Sahil Garg

Yuriy Nevmyvaka

Aiming to build foundation models for time-series forecasting and study their scaling behavior, we present here our work-in-progress on Lag-… (voir plus)Llama, a general-purpose univariate probabilistic time-series forecasting model trained on a large collection of time-series data. The model shows good zero-shot prediction capabilities on unseen "out-of-distribution" time-series datasets, outperforming supervised baselines. We use smoothly broken power-laws to fit and predict model scaling behavior. The open source code is made available at https://github.com/kashif/pytorch-transformer-ts.

2023-11-01

NeurIPS.cc/2023/Workshop/R0-FoMo (poster)

openreview.net

Lag-Llama: Towards Foundation Models for Probabilistic Time Series Forecasting

Kashif Rasul

Andrew Robert Williams

Marin Bilovs

Hena Ghonia

Nadhir Hassen

Anderson Schneider

Sahil Garg

Yuriy Nevmyvaka

2023-10-12

ArXiv (prépublication)

Lag-Llama: Towards Foundation Models for Probabilistic Time Series Forecasting

Kashif Rasul

Andrew Robert Williams

Marin Bilovs

Hena Ghonia

N. Hassen

Anderson Schneider

Sahil Garg

Yuriy Nevmyvaka

Over the past years, foundation models have caused a paradigm shift in machine learning due to their unprecedented capabilities for zero-sho… (voir plus)t and few-shot generalization. However, despite the success of foundation models in modalities such as natural language processing and computer vision, the development of foundation models for time series forecasting has lagged behind. We present Lag-Llama, a general-purpose foundation model for univariate probabilistic time series forecasting based on a decoder-only transformer architecture that uses lags as covariates. Lag-Llama is pretrained on a large corpus of diverse time series data from several domains, and demonstrates strong zero-shot generalization capabilities compared to a wide range of forecasting models on downstream datasets across domains. Moreover, when fine-tuned on relatively small fractions of such previously unseen datasets, Lag-Llama achieves state-of-the-art performance, outperforming prior deep learning approaches, emerging as the best general-purpose model on average. Lag-Llama serves as a strong contender to the current state-of-art in time series forecasting and paves the way for future advancements in foundation models tailored to time series data.

2023-10-12

ArXiv (prépublication)

Lag-Llama: Towards Foundation Models for Time Series Forecasting

Kashif Rasul

Andrew Robert Williams

Marin Biloš

Hena Ghonia

N. Hassen

Anderson Schneider

Sahil Garg

Yuriy Nevmyvaka

Aiming to build foundation models for time-series forecasting and study their scaling behavior, we present here our work-in-progress on Lag-… (voir plus)Llama , a general-purpose univariate probabilistic time-series forecasting model trained on a large collection of time-series data. The model shows good zero-shot prediction capabilities on unseen “out-of-distribution” time-series datasets, outperforming supervised baselines. We use smoothly broken power-laws [7] to fit and predict model scaling behavior. The open source code is made available at https://github

2023-01-01

arXiv.org (prépublication)