TypyBench: Evaluating LLM Type Inference for Untyped Python Repositories
Yuhe Jiang
Xun Deng
Jiacheng Yang
Honghua Dong
Gennady Pekhimenko
Fan Long
Type inference for dynamic languages like Python is a persistent challenge in software engineering. While large language models (LLMs) have shown promise in code understanding, their type inference capabilities remain underexplored. We introduce `TypyBench`, a benchmark designed to evaluate LLMs' type inference across entire Python repositories. `TypyBench` features two novel metrics: `TypeSim`, which captures nuanced semantic relationships between predicted and ground truth types, and `TypeCheck`, which assesses type consistency across codebases. Our evaluation of various LLMs on a curated dataset of 50 high-quality Python repositories reveals that, although LLMs achieve decent `TypeSim` scores, they struggle with complex nested types and exhibit significant type consistency errors. These findings suggest that future research should shift focus from improving type similarity to addressing repository-level consistency. `TypyBench` provides a foundation for this new direction, offering insights into model performance across different type complexities and usage contexts.
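The exact definition of `TypeSim` is not given in the abstract, but the idea of scoring partial agreement between nested type expressions can be sketched as follows. The parser, the 0.5/0.5 weighting, and the scoring rule below are illustrative assumptions, not the benchmark's actual metric:

```python
# Hypothetical sketch of a TypeSim-style similarity for nested Python type
# strings: exact matches score 1.0, and a matching outer constructor earns
# partial credit even when inner type arguments differ. This is an
# illustrative stand-in, not TypyBench's actual metric.

def parse(t: str):
    """Parse 'List[Dict[str, int]]' into ('List', [('Dict', [...])])."""
    t = t.strip()
    if "[" not in t:
        return (t, [])
    head, rest = t.split("[", 1)
    inner = rest.rsplit("]", 1)[0]
    args, depth, cur = [], 0, ""
    for ch in inner:
        if ch == "," and depth == 0:   # split only at top-level commas
            args.append(parse(cur))
            cur = ""
        else:
            depth += (ch == "[") - (ch == "]")
            cur += ch
    args.append(parse(cur))
    return (head.strip(), args)

def _sim(a, b):
    (ha, arga), (hb, argb) = a, b
    head = 1.0 if ha == hb else 0.0
    if not arga and not argb:          # both are plain types like 'int'
        return head
    n = max(len(arga), len(argb))
    child = sum(_sim(x, y) for x, y in zip(arga, argb)) / n
    return 0.5 * head + 0.5 * child    # half credit for the constructor

def type_sim(pred: str, truth: str) -> float:
    return _sim(parse(pred), parse(truth))
```

Under this toy scoring, `type_sim("List[int]", "List[str]")` gives 0.5: full credit for the matching `List` constructor, none for the mismatched element type.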
Understanding (Un)Reliability of Steering Vectors in Language Models
Joschka Braun
Carsten Eickhoff
Seyed Ali Bahrainian
Dmitrii Krasheninnikov
Steering vectors are a lightweight method to control language model behavior by adding a learned bias to the activations at inference time. Although steering demonstrates promising performance, recent work shows that it can be unreliable or even counterproductive in some cases. This paper studies the influence of prompt types and the geometry of activation differences on steering reliability. First, we find that all seven prompt types used in our experiments produce a net positive steering effect, but exhibit high variance across samples, and often give an effect opposite to the desired one. No prompt type clearly outperforms the others, and yet the steering vectors resulting from the different prompt types often differ directionally (as measured by cosine similarity). Second, we show that higher cosine similarity between training set activation differences predicts more effective steering. Finally, we observe that datasets where positive and negative activations are better separated are more steerable. Our results suggest that vector steering is unreliable when the target behavior is not represented by a coherent direction.
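The mechanism described above, adding a learned bias to activations, is commonly instantiated as a difference-of-means vector. A minimal sketch, in which the hidden dimension, toy activations, and scaling factor `alpha` are illustrative assumptions rather than the paper's setup:

```python
import numpy as np

# Minimal sketch of activation steering: the steering vector is the mean
# difference between activations on "positive" and "negative" prompts, and
# is added (scaled by alpha) to hidden states at inference time. The data
# here is synthetic; real steering operates on transformer activations.

rng = np.random.default_rng(0)
d = 16                                     # hidden dimension (toy value)
pos = rng.normal(1.0, 0.1, size=(32, d))   # activations on positive prompts
neg = rng.normal(-1.0, 0.1, size=(32, d))  # activations on negative prompts

steer = pos.mean(axis=0) - neg.mean(axis=0)  # difference-of-means vector

def apply_steering(h, alpha=1.0):
    """Add the learned bias to a hidden state h at inference time."""
    return h + alpha * steer

def cosine(u, v):
    """Cosine similarity, used in the paper to compare steering vectors
    obtained from different prompt types."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```

Steering a "negative" activation with this vector moves it toward the mean of the positive cluster, which is the intended behavioral shift.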
Unlearning Geo-Cultural Stereotypes in Multilingual LLMs
Alireza Dehghanpour Farashah
Aditi Khandelwal
As multilingual generative models become more widely used, most safety and fairness evaluation techniques still focus on English-language resources, while overlooking important cross-cultural factors. This limitation raises concerns about fairness and safety, particularly regarding geoculturally situated stereotypes that hinder the models’ global inclusivity. In this work, we present preliminary findings on the impact of stereotype unlearning across languages, specifically in English, French, and Hindi. Using an adapted version of the SeeGULL dataset, we analyze how unlearning stereotypes in one language influences other languages within multilingual large language models. Our study evaluates two model families, Llama-3.1-8B and Aya-Expanse-8B, to assess whether unlearning in one linguistic context transfers across languages, potentially mitigating or exacerbating biases in multilingual settings.
WebMMU: A Benchmark for Multimodal Multilingual Website Understanding and Code Generation
Rabiul Awal
Mahsa Massoud
Zichao Li
Aarash Feizi
Suyuchen Wang
David Vazquez
Juan A. Rodriguez
Perouz Taslakian
Spandana Gella
Sai Rajeswar
Understanding diverse web data and automating web development present an exciting challenge for agentic AI. While existing benchmarks address isolated web-based tasks—such as website-based Visual Question Answering (VQA) and UI-to-code generation—they lack a unified evaluation suite for assessing web agents that interact with and reason about web environments. We introduce WebMMU, a large-scale benchmark for evaluating AI-driven web agents across multilingual website VQA, HTML/CSS/JavaScript code editing, and sketch-to-code generation. WebMMU provides a comprehensive evaluation suite with real-world website data, multi-step reasoning tasks, and functional UI understanding. Benchmarking state-of-the-art multimodal models on WebMMU reveals significant limitations in web-based reasoning, layout understanding, and structured code generation, particularly in preserving UI hierarchy, handling multilingual content, and producing robust, functional code. While most existing models are optimized for English-only settings, WebMMU highlights the challenges of cross-lingual adaptation in real-world web development. These findings expose critical gaps in current models’ ability to understand website structures, execute user instructions, and generate high-quality web code, underscoring the need for more advanced multimodal reasoning in AI-driven web understanding and development.
Automated diagnosis of usual interstitial pneumonia on chest CT via the mean curvature of isophotes
Peter Savadjiev
Morteza Rezanejad
Sahir Bhatnagar
David Camirand
Claude Kauffmann
Ronald J Dandurand
Patrick Bourgouin
Carl Chartrand-Lefebvre
Alexandre Semionov
AI Automatons: AI Systems Intended to Imitate Humans
Solon Barocas
Su Lin Blodgett
Lisa Egede
Alicia DeVrio
Myra Cheng
There is a growing proliferation of AI systems designed to mimic people's behavior, work, abilities, likenesses, or humanness -- systems we dub AI automatons. Individuals, groups, or generic humans are being simulated to produce creative work in their styles, to respond to surveys in their places, to probe how they would use a new system before deployment, to provide users with assistance and companionship, and to anticipate their possible future behavior and interactions with others, just to name a few applications. The research, design, deployment, and availability of such AI systems have, however, also prompted growing concerns about a wide range of possible legal, ethical, and other social impacts. To both 1) facilitate productive discussions about whether, when, and how to design and deploy such systems, and 2) chart the current landscape of existing and prospective AI automatons, we need to tease apart determinant design axes and considerations that can aid our understanding of whether and how various design choices along these axes could mitigate -- or instead exacerbate -- potential adverse impacts that the development and use of AI automatons could give rise to. In this paper, through a synthesis of related literature and extensive examples of existing AI systems intended to mimic humans, we develop a conceptual framework to help foreground key axes of design variations and provide analytical scaffolding to foster greater recognition of the design choices available to developers, as well as the possible ethical implications these choices might have.
Beyond Cosine Decay: On the effectiveness of Infinite Learning Rate Schedule for Continual Pre-training
Vaibhav Singh
Paul Janson
Paria Mehrbod
Adam Ibrahim
Benjamin Thérien
The ever-growing availability of unlabeled data presents both opportunities and challenges for training artificial intelligence systems. While self-supervised learning (SSL) has emerged as a powerful paradigm for extracting meaningful representations from vast amounts of unlabeled data, existing methods still struggle to adapt to the non-stationary, non-IID nature of real-world data streams without forgetting previously learned knowledge. Recent works have adopted a repeated cosine annealing schedule for large-scale continual pre-training; however, these schedules (1) inherently cause forgetting during the re-warming phase and (2) have not been systematically compared to existing continual SSL methods. In this work, we systematically compare the widely used cosine schedule with the recently proposed infinite learning rate schedule and empirically find the latter to be a more effective alternative. Our extensive empirical evaluation across diverse image and language datasets demonstrates that the infinite learning rate schedule consistently enhances continual pre-training performance compared to a repeated cosine decay without being restricted to a fixed iteration budget. For instance, in a small-scale MAE pre-training setup, it outperforms several strong baselines from the literature. We then scale up our experiments to larger MAE pre-training and autoregressive language model pre-training. Our results show that the infinite learning rate schedule remains effective at scale, surpassing repeated cosine decay for both MAE pre-training and zero-shot LM benchmarks.
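The contrast between the two schedule families can be sketched in a few lines. The shapes below (repeated warmup-plus-cosine per task, versus a single warmup into a constant plateau) follow the general form of these schedules; the specific hyperparameters are toy assumptions, not the paper's settings:

```python
import math

# Toy comparison of a repeated cosine-annealing schedule with an
# "infinite" (warm up once, then hold constant) schedule. Hyperparameters
# are illustrative; the paper's exact schedule shapes may differ.

def repeated_cosine(step, period=1000, warmup=100, lr_max=1e-3, lr_min=1e-5):
    """Each new task re-warms the LR from zero, then cosine-decays to lr_min.
    The re-warming phase at each period boundary is what induces forgetting."""
    t = step % period
    if t < warmup:
        return lr_max * t / warmup
    frac = (t - warmup) / (period - warmup)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * frac))

def infinite_schedule(step, warmup=100, lr_const=3e-4):
    """Warm up once, then keep a constant LR for any number of tasks,
    with no fixed iteration budget baked into the schedule."""
    if step < warmup:
        return lr_const * step / warmup
    return lr_const
```

The key structural difference: `repeated_cosine` drops back to zero at every period boundary (step 1000, 2000, ...) and must re-warm, while `infinite_schedule` never resets, so continual pre-training can proceed indefinitely at a stable rate.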
Considerations and recommendations from the ISMRM diffusion study group for preclinical diffusion MRI: Part 2-Ex vivo imaging: Added value and acquisition.
Kurt G Schilling
Francesco Grussu
Andrada Ianus
Brian Hansen
Amy F. D. Howard
Rachel L. C. Barrett
Fatima Nasrallah
Manisha Aggarwal
Stijn Michielse
Warda Syeda
Nian Wang
Andrew F. Bagdasarian
Jelle Veraart
Alard Roebroeck
Cornelius Eichner
Farshid Sepehrband
Jan Zimmermann
Lucas Soustelle
Christien Bowman
Benjamin C. Tendler
Andreea Hertanu
Ben Jeurissen
Marleen Verhoye
Lucio Frydman
Yohan van de Looij
David Hike
Jeff F. Dunn
Karla Miller
Bennett Landman
Noam Shemesh
Arthur Anderson
Emilie McKinnon
Shawna Farquharson
Mathieu D. Santin
Flavio Dell’Acqua
Carlo Pierpaoli
Samuel C. Grant
Ivana Drobnjak
Andre Obenaus
Alexander Leemans
Kevin D. Harkins
Maxime Descoteaux
Duan Xu
Hao Huang
Gene S. Kim
Dan Wu
Denis Le Bihan
Stephen J. Blackband
Matthew D. Budde
Luisa Ciobanu
Els Fieremans
Ruiliang Bai
Trygve B. Leergaard
Jiangyang Zhang
Tim B. Dyrby
G. Allan Johnson
Ileana O. Jelescu
The value of preclinical diffusion MRI (dMRI) is substantial. While dMRI enables in vivo non-invasive characterization of tissue, ex vivo dMRI is increasingly being used to probe tissue microstructure and brain connectivity. Ex vivo dMRI has several experimental advantages including higher SNR and spatial resolution compared to in vivo studies, and enabling more advanced diffusion contrasts for improved microstructure and connectivity characterization. Another major advantage of ex vivo dMRI is the direct comparison with histological data, as a crucial methodological validation. However, there are a number of considerations that must be made when performing ex vivo experiments. The steps from tissue preparation, image acquisition and processing, and interpretation of results are complex, with many decisions that not only differ dramatically from in vivo imaging of small animals, but ultimately affect what questions can be answered using the data. This work represents "Part 2" of a three-part series of recommendations and considerations for preclinical dMRI. We describe best practices for dMRI of ex vivo tissue, with a focus on the value that ex vivo imaging adds to the field of dMRI and considerations in ex vivo image acquisition. We first give general considerations and foundational knowledge that must be considered when designing experiments. We briefly describe differences in specimens and models and discuss why some may be more or less appropriate for different studies. We then give guidelines for ex vivo protocols, including tissue fixation, sample preparation, and MR scanning. In each section, we attempt to provide guidelines and recommendations, but also highlight areas for which no guidelines exist (and why), and where future work should lie. An overarching goal herein is to enhance the rigor and reproducibility of ex vivo dMRI acquisitions and analyses, and thereby advance biomedical knowledge.
EarthView: A Large Scale Remote Sensing Dataset for Self-Supervision
Diego Velazquez
Pau Rodriguez
Sergio Alonso
Josep M. Gonfaus
Jordi Gonzalez
Gerardo Richarte
Javier Marin
Alexandre Lacoste
This paper presents EarthView, a comprehensive dataset specifically designed for self-supervision on remote sensing data, intended to enhance deep learning applications on Earth monitoring tasks. The dataset spans 15 terapixels of global remote-sensing data, combining imagery from a diverse range of sources, including NEON, Sentinel, and a novel release of 1m spatial resolution data from Satellogic. Our dataset provides a wide spectrum of image data with varying resolutions, harnessed from different sensors and organized coherently into an accessible HuggingFace dataset in parquet format. This data spans five years, from 2017 to 2022. Accompanying the dataset, we introduce EarthMAE, a tailored Masked Autoencoder, developed to tackle the distinct challenges of remote sensing data. Trained in a self-supervised fashion, EarthMAE effectively processes different data modalities such as hyperspectral, multispectral, topographical data, segmentation maps, and temporal structure. This model helps us show that pre-training on Satellogic data improves performance on downstream tasks. While there is still a gap to fill in MAE for heterogeneous data, we regard this innovative combination of an expansive, diverse dataset and a versatile model adapted for self-supervised learning as a stride forward in deep learning for Earth monitoring.
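The random patch masking at the heart of any MAE-style model can be sketched in a few lines. The patch count, token dimension, and 75% mask ratio below are common illustrative choices, not EarthMAE's actual configuration:

```python
import numpy as np

# Sketch of MAE-style random patch masking: a high fraction of patches is
# hidden and the encoder sees only the visible subset; the decoder is then
# trained to reconstruct the masked patches. Values here are illustrative.

rng = np.random.default_rng(0)

def random_masking(patches, mask_ratio=0.75):
    """Return (visible_patches, mask); mask[i] is True where patch i is hidden."""
    n = patches.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    keep = np.sort(rng.permutation(n)[:n_keep])  # random visible subset
    mask = np.ones(n, dtype=bool)
    mask[keep] = False                           # False = visible to encoder
    return patches[keep], mask

patches = rng.normal(size=(196, 768))  # e.g. a 14x14 grid of 768-dim tokens
visible, mask = random_masking(patches)
```

Because only the visible quarter of the tokens enters the encoder, pre-training cost scales with the unmasked fraction, which is what makes high mask ratios attractive for terapixel-scale data.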
Feynman-Kac Correctors in Diffusion: Annealing, Guidance, and Product of Experts
Marta Skreta
Tara Akhound-Sadegh
Viktor Ohanesian
Roberto Bondesan
Alan Aspuru-Guzik
Arnaud Doucet
Rob Brekelmans
Alexander Tong
While score-based generative models are the model of choice across diverse domains, there are limited tools available for controlling inference-time behavior in a principled manner, e.g. for composing multiple pretrained models. Existing classifier-free guidance methods use a simple heuristic to mix conditional and unconditional scores to approximately sample from conditional distributions. However, such methods do not approximate the intermediate distributions, necessitating additional 'corrector' steps. In this work, we provide an efficient and principled method for sampling from a sequence of annealed, geometric-averaged, or product distributions derived from pretrained score-based models. We derive a weighted simulation scheme which we call Feynman-Kac Correctors (FKCs) based on the celebrated Feynman-Kac formula by carefully accounting for terms in the appropriate partial differential equations (PDEs). To simulate these PDEs, we propose Sequential Monte Carlo (SMC) resampling algorithms that leverage inference-time scaling to improve sampling quality. We empirically demonstrate the utility of our methods by proposing amortized sampling via inference-time temperature annealing, improving multi-objective molecule generation using pretrained models, and improving classifier-free guidance for text-to-image generation. Our code is available at https://github.com/martaskrt/fkc-diffusion.
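The SMC resampling step that such weighted simulation schemes rely on can be illustrated in isolation. The Gaussian potential below is a toy stand-in for the PDE-derived Feynman-Kac weights; the particle counts and distributions are illustrative assumptions, not the paper's setup:

```python
import numpy as np

# Minimal SMC resampling step: each particle carries a (log) weight from
# the potential accumulated along its trajectory; resampling duplicates
# high-weight particles and drops low-weight ones, concentrating the
# ensemble on the target distribution. Toy 1-D example for illustration.

rng = np.random.default_rng(0)

def resample(particles, log_w):
    """Multinomial resampling proportional to the normalized weights."""
    w = np.exp(log_w - log_w.max())   # subtract max for numerical stability
    w /= w.sum()
    idx = rng.choice(len(particles), size=len(particles), p=w)
    return particles[idx]

# Particles from a broad proposal N(0, 3^2); the potential (log of a
# Gaussian centered at 2.0) reweights them toward the target.
x = rng.normal(0.0, 3.0, size=5000)
log_w = -0.5 * (x - 2.0) ** 2
x_new = resample(x, log_w)
```

After resampling, the particle cloud shifts from the proposal mean (near 0) toward the product of proposal and potential (mean 1.8 in this toy case), which is the same weight-then-resample mechanism FKC applies along a diffusion trajectory.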
Hardware Synthesizable Exceptions using Continuations
Paul Teng