Publications

INViTE: INterpret and Control Vision-Language Models with Text Explanations

Haozhe Chen

Junfeng Yang

Carl Vondrick

Chengzhi Mao

Columbia University

M. University

Large-scale pre-trained vision foundation models, such as CLIP, have become de facto backbones for various vision tasks. However, due to their black-box nature, understanding the underlying rules behind these models’ predictions and controlling model behaviors have remained open challenges. We present INViTE: a framework for INterpreting Vision Transformer’s latent tokens with Text Explanations. Given a latent token, INViTE retains its semantic information to the final layer using transformer’s local operations and retrieves the closest text for explanation. INViTE enables understanding of model visual reasoning procedure without needing additional model training or data collection. Based on the obtained interpretations, INViTE allows for model editing that controls model reasoning behaviors and improves model robustness against biases and spurious correlations. Our code is available at https://github.com/tonychenxyz/vit-interpret.

2023-12-31

International Conference on Learning Representations (published)

openreview.net

ÌròyìnSpeech: A multi-purpose Yorùbá Speech Corpus

Tolúlope' Ògúnremí

Kọ́lá Túbọ̀sún

Aremu Anuoluwapo

Iroro Orife

David Ifeoluwa Adelani

2023-12-31

LREC/COLING (published)

doi.org

arxiv.org

iWISDM: Assessing instruction following in multimodal models at scale

Xiaoxuan Lei

Lucas Gomez

Hao Yuan Bai

Pouya Bashivan

The ability to perform complex tasks from detailed instructions is a key to the remarkable achievements of our species. As humans, we are not only capable of performing a wide variety of tasks but also very complex ones that may entail hundreds or thousands of steps to complete. Large language models and their more recent multimodal counterparts that integrate textual and visual inputs have achieved unprecedented success in performing complex tasks. Yet, most existing benchmarks are largely confined to single-modality inputs (either text or vision), thus narrowing the scope of multimodal integration assessments, particularly for instruction-following in multimodal contexts. To bridge this gap, we introduce the instructed-Virtual VISual Decision Making (iWISDM) environment engineered to generate a limitless array of vision-language tasks of varying complexity. Using iWISDM, we compiled three distinct benchmarks of instruction following visual tasks across varying complexity levels and evaluated several newly developed multimodal models on these benchmarks. Our findings establish iWISDM as a robust benchmark for assessing the instructional adherence of both existing and emergent multimodal models and highlight a large gap in these models’ ability to precisely follow instructions.

2023-12-31

CoLLAs (published)

doi.org

proceedings.mlr.press

Joint Multimodal Transformer for Dimensional Emotional Recognition in the Wild

Paul Waligora

Muhammad Osama Zeeshan

Muhammad Haseeb Aslam

Soufiane Belharbi

Alessandro Lameiras Koerich

Marco Pedersoli

Simon Bacon

Eric Granger

Audiovisual emotion recognition (ER) in videos has immense potential over unimodal performance. It effectively leverages the inter- and intra-modal dependencies between visual and auditory modalities. This work proposes a novel audio-visual emotion recognition system utilizing a joint multimodal transformer architecture with key-based cross-attention. This framework aims to exploit the complementary nature of audio and visual cues (facial expressions and vocal patterns) in videos, leading to superior performance compared to solely relying on a single modality. The proposed model leverages separate backbones for capturing intra-modal temporal dependencies within each modality (audio and visual). Subsequently, a joint multimodal transformer architecture integrates the individual modality embeddings, enabling the model to effectively capture inter-modal (between audio and visual) and intra-modal (within each modality) relationships. Extensive evaluations on the challenging Affwild2 dataset demonstrate that the proposed model significantly outperforms baseline and state-of-the-art methods in ER tasks.

2023-12-31

arXiv.org (preprint)

doi.org

KD-LoRA: A Hybrid Approach to Efficient Fine-Tuning with LoRA and Knowledge Distillation

Rambod Azimi

Rishav

Marek Teichmann

S Ebrahimi Kahou

2023-12-31

ENLSP (published)

doi.org

proceedings.mlr.press

Do Large Language Models Know How Much They Know?

Gabriele Prato

Jerry Huang

Prasanna Parthasarathi

Shagun Sodhani

A. Chandar

2023-12-31

EMNLP (published)

doi.org

arxiv.org

Layerwise Proximal Replay: A Proximal Point Method for Online Continual Learning

Jinsoo Yoo

Yunpeng Liu

Frank N. Wood

Geoff Pleiss

2023-12-31

ICML (published)

doi.org

proceedings.mlr.press

Learnable Filters for Geometric Scattering Modules

Alexander Tong

Frederik Wenkel

Dhananjay Bhaskar

Kincaid MacDonald

Jackson Grady

Michael Perlmutter

Smita Krishnaswamy

Guy Wolf

2023-12-31

IEEE Transactions on Signal Processing (published)

doi.org

arxiv.org

Learning conditional policies for crystal design using offline reinforcement learning

Prashant Govindarajan

Santiago Miret

Jarrid Rector-Brooks

Mariano Phielipp

Janarthanan Rajendran

Sarath Chandar

Conservative Q-learning for band-gap conditioned crystal design with DFT evaluations – the model is trained on trajectories constructed from crystals in the Materials Project. Results indicate promising performance for lower band gap targets.

2023-12-31

Digital Discovery (published)

doi.org

openreview.net

Learning Lagrangian Multipliers for the Travelling Salesman Problem

Augustin Parjadis

Quentin Cappart

Bistra Dilkina

Aaron M. Ferber

Louis-Martin Rousseau

2023-12-31

CP (published)

doi.org

arxiv.org

Learning Precedences for Scheduling Problems with Graph Neural Networks

Hélène Verhaeghe

Quentin Cappart

Gilles Pesant

Claude-Guy Quimper

2023-12-31

CP (published)

doi.org

Learning to repeatedly solve routing problems

Mouad Morabit

Guy Desaulniers

Andrea Lodi

In recent years, there has been great interest in machine-learning-based heuristics for solving NP-hard combinatorial optimization problems. The developed methods have shown potential on many optimization problems. In this paper, we present a learned heuristic for the reoptimization of a problem after a minor change in its data. We focus on the case of the capacitated vehicle routing problem with static clients (i.e., same client locations) and changed demands. Given the edges of an original solution, the goal is to predict and fix the ones that have a high chance of remaining in an optimal solution after a change of client demands. This partial prediction of the solution reduces the complexity of the problem and speeds up its resolution, while yielding a good quality solution. The proposed approach resulted in solutions with an optimality gap ranging from 0% to 1.7% on different benchmark instances within a reasonable computing time.

2023-12-31

Networks (published)

doi.org

arxiv.org
