Publications

NNetNav: Unsupervised Learning of Browser Agents Through Environment Interaction in the Wild

Shikhar Murty

Hao Zhu

Dzmitry Bahdanau

Christopher D Manning

We introduce NNetNav, a method for unsupervised interaction with websites that generates synthetic demonstrations for training browser agent… (see more)s. Given any website, NNetNav produces these demonstrations by retroactively labeling action sequences from an exploration policy. Most work on training browser agents has relied on expensive human supervision, and the limited prior work on such interaction-based techniques has failed to provide effective search through the exponentially large space of exploration. In contrast, NNetNav exploits the hierarchical structure of language instructions to make this search more tractable: Complex instructions are typically decomposable into simpler sub-tasks, allowing NNetNav to automatically prune interaction episodes when an intermediate trajectory cannot be annotated with a meaningful sub-task. \texttt{LLama-3.1-8b} finetuned on 10k NNetNav self-generated demonstrations obtains over 16\% success rate on WebArena, and 35\% on WebVoyager, an improvement of 15pts and 31pts respectively over zero-shot \texttt{LLama-3.1-8b}, outperforming zero-shot GPT-4 and reaching the state-of-the-art among unsupervised methods, for both benchmarks.

2025-03-07

ICLR.cc/2025/Workshop/SSI-FM (poster)

Towards Graph Foundation Models: A Study on the Generalization of Positional and Structural Encodings

Billy Joe Franks

Moshe Eliasof

Semih Cantürk

Guy Wolf

Carola-Bibiane Schönlieb

Sophie Fellenz

Marius Kloft

Recent advances in integrating positional and structural encodings (PSEs) into graph neural networks (GNNs) have significantly enhanced thei… (see more)r performance across various graph learning tasks. However, the general applicability of these encodings and their potential to serve as foundational representations for graphs remain uncertain. This paper investigates the fine-tuning efficiency, scalability with sample size, and generalization capability of learnable PSEs across diverse graph datasets. Specifically, we evaluate their potential as universal pre-trained models that can be easily adapted to new tasks with minimal fine-tuning and limited data. Furthermore, we assess the expressivity of the learned representations, particularly, when used to augment downstream GNNs. We demonstrate through extensive benchmarking and empirical analysis that PSEs generally enhance downstream models. However, some datasets may require specific PSE-augmentations to achieve optimal performance. Nevertheless, our findings highlight their significant potential to become integral components of future graph foundation models. We provide new insights into the strengths and limitations of PSEs, contributing to the broader discourse on foundation models in graph learning.

2025-03-07

TMLR (accepted)

Tractable Representations for Convergent Approximation of Distributional HJB Equations

Julie Alhosh

Harley Wiltzer

David Meger

2025-03-07

ArXiv (preprint)

Attention-based Class-Conditioned Alignment for Multi-Source Domain Adaptation of Object Detectors

Atif Belal

Akhil Meethal

Francisco Perdigon Romero

Marco Pedersoli

Eric Granger

Domain adaptation methods for object detection (OD) strive to mitigate the impact of distribution shifts by promoting feature alignment acro… (see more)ss source and target domains. Multi-source domain adaptation (MSDA) allows leveraging multiple annotated source datasets and unlabeled target data to improve the accuracy and robustness of the detection model. Most state-of-the-art MSDA methods for OD perform feature alignment in a class-agnostic manner. This is challenging since the objects have unique modality information due to variations in object appearance across domains. A recent prototype-based approach proposed a class-wise alignment, yet it suffers from error accumulation caused by noisy pseudo-labels that can negatively affect adaptation with imbalanced data. To overcome these limitations, we propose an attention-based class-conditioned alignment method for MSDA, designed to align instances of each object category across domains. In particular, an attention module combined with an adversarial domain classifier allows learning domain-invariant and class-specific instance representations. Experimental results on multiple benchmarking MSDA datasets indicate that our method outperforms state-of-the-art methods and exhibits robustness to class imbalance, achieved through a conceptually simple class-conditioning strategy. Our code is available at: https://github.com/imatif17/ACIA.

2025-03-06

2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) (published)

Continual Pre-training of MoEs: How robust is your router?

Benjamin Therien

Charles-Etienne Joseph

Zain Sarwar

Ashwinee Panda

Anirban Das

Shi-Xiong Zhang

Stephen Rawls

Sambit Sahu

Eugene Belilovsky

Irina Rish

2025-03-06

ArXiv (preprint)

Cross-Task Affinity Learning for Multitask Dense Scene Predictions

Dimitrios Sinodinos

Narges Armanfard

2025-03-06

2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) (published)

Cross-validation for training and testing co-occurrence network inference algorithms

Daniel Agyapong

Jeffrey Ryan Propster

Jane Marks

Toby Dylan Hocking

2025-03-06

BMC Bioinformatics (published)

GradTune: Last-layer Fine-tuning for Group Robustness Without Group Annotation

Patrik Joslin Kenfack

Ulrich Aivodji

Samira Ebrahimi Kahou

This work addresses the limitations of deep neural networks (DNNs) in generalizing beyond training data due to spurious correlations. Recent… (see more) research has demonstrated that models trained with empirical risk minimization learn both core and spurious features, often upweighting spurious ones in the final classification, which can frequently lead to poor performance on minority groups. Deep Feature Reweighting alleviates this issue by retraining the model's last classification layer using a group-balanced held-out validation set. However, relying on spurious feature labels during training or validation limits practical application, as spurious features are not always known or costly to annotate. Our preliminary experiments reveal that ERM-trained models exhibit higher gradient norms on minority group samples in the hold-out dataset. Leveraging these insights, we propose an alternative approach called GradTune, which fine-tunes the last classification layer using high-gradient norm samples. Our results on four well-established benchmarks demonstrate that the proposed method can achieve competitive performance compared to existing methods without requiring group labels during training or validation.

2025-03-06

ICLR.cc/2025/Workshop/SCSL (published)

GradTune: Last-layer Fine-tuning for Group Robustness Without Group Annotation

Patrik Joslin Kenfack

Ulrich Aivodji

Samira Ebrahimi Kahou

This work addresses the limitations of deep neural networks (DNNs) in generalizing beyond training data due to spurious correlations. Recent… (see more) research has demonstrated that models trained with empirical risk minimization learn both core and spurious features, often upweighting spurious ones in the final classification, which can frequently lead to poor performance on minority groups. Deep Feature Reweighting alleviates this issue by retraining the model's last classification layer using a group-balanced held-out validation set. However, relying on spurious feature labels during training or validation limits practical application, as spurious features are not always known or costly to annotate. Our preliminary experiments reveal that ERM-trained models exhibit higher gradient norms on minority group samples in the hold-out dataset. Leveraging these insights, we propose an alternative approach called GradTune, which fine-tunes the last classification layer using high-gradient norm samples. Our results on four well-established benchmarks demonstrate that the proposed method can achieve competitive performance compared to existing methods without requiring group labels during training or validation.

2025-03-06

ICLR.cc/2025/Workshop/SCSL (published)

Graph-Jigsaw Conditioned Diffusion Model for Skeleton-based Video Anomaly Detection

Ali Karami

Thi Kieu Khanh Ho

Narges Armanfard

2025-03-06

2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) (published)

A Joint Space-Time Encoder for Geographic Time-Series Data

David Mickisch

Konstantin Klemmer

Mélisande Teng

David Rolnick

Many real-world processes are characterized by complex spatio-temporal dependencies, from climate dynamics to disease spread. Here, we intro… (see more)duce a new neural network architecture to model such dynamics at scale: the \emph{Space-Time Encoder}. Building on recent advances in \emph{location encoders}, models that take as inputs geographic coordinates, we develop a method that takes in geographic and temporal information simultaneously and learns smooth, continuous functions in both space and time. The inputs are first transformed using positional encoding functions and then fed into neural networks that allow the learning of complex functions. We implement a prototype of the \emph{Space-Time Encoder}, discuss the design choices of the novel temporal encoding, and demonstrate its utility in climate model emulation. We discuss the potential of the method across use cases, as well as promising avenues for further methodological innovation.

2025-03-06

ICLR.cc/2025/Workshop/MLMP (poster)