We use cookies to analyze the browsing and usage of our website and to personalize your experience. You can disable these technologies at any time, but this may limit certain functionalities of the site. Read our Privacy Policy for more information.
Setting cookies
You can enable and disable the types of cookies you wish to accept. However certain choices you make could affect the services offered on our sites (e.g. suggestions, personalised ads, etc.).
Essential cookies
These cookies are necessary for the operation of the site and cannot be deactivated. (Still active)
Analytics cookies
Do you accept the use of cookies to measure the audience of our sites?
Multimedia Player
Do you accept the use of cookies to display and allow you to watch the video content hosted by our partners (YouTube, etc.)?
Publications
WASH: Train your Ensemble with Communication-Efficient Weight Shuffling, then Average
The performance of deep neural networks is enhanced by ensemble methods, which average the output of several models. However, this comes at … (see more)an increased cost at inference. Weight averaging methods aim at balancing the generalization of ensembling and the inference speed of a single model by averaging the parameters of an ensemble of models. Yet, naive averaging results in poor performance as models converge to different loss basins, and aligning the models to improve the performance of the average is challenging. Alternatively, inspired by distributed training, methods like DART and PAPA have been proposed to train several models in parallel such that they will end up in the same basin, resulting in good averaging accuracy. However, these methods either compromise ensembling accuracy or demand significant communication between models during training. In this paper, we introduce WASH, a novel distributed method for training model ensembles for weight averaging that achieves state-of-the-art image classification accuracy. WASH maintains models within the same basin by randomly shuffling a small percentage of weights during training, resulting in diverse models and lower communication costs compared to standard parameter averaging methods.
Current LLM training positions mathematical reasoning as a core capability. With publicly available sources fully tapped, there is unmet dem… (see more)and for diverse and challenging math questions. Relying solely on human experts is both time-consuming and costly, while LLM-generated questions often lack the requisite diversity and difficulty. We present a design framework that combines the strengths of LLMs with a human-in-the-loop approach to generate a diverse array of challenging math questions. We leverage LLM metacognition skills [Didolkar et al., 2024] of a strong LLM to extract core"skills"from existing math datasets. These skills serve as the basis for generating novel and difficult questions by prompting the LLM with random pairs of core skills. The use of two different skills within each question makes finding such questions an"out of distribution"task for both LLMs and humans. Our pipeline employs LLMs to iteratively generate and refine questions and solutions through multiturn prompting. Human annotators then verify and further refine the questions, with their efficiency enhanced via further LLM interactions. Applying this pipeline on skills extracted from the MATH dataset [Hendrycks et al., 2021] resulted in MATH
Sparse autoencoders (SAEs) have been central to the effort of finding interpretable and disentangled directions of representation spaces in … (see more)neural networks, in both image and text domains. While the efficacy and pitfalls of this method in the vision domain are well-studied, there is a lack of corresponding results, both qualitative and quantitative, for the text domain. We define and train language models on a set of formal grammars, and train SAEs on the latent representations of these models under a wide variety of hyperparameter settings. We identify several interpretable latents in the SAEs, and formulate a scaling law defining the relationship between the reconstruction loss of SAEs and their hidden size. We show empirically that the presence of latents correlating to certain features of the input does not imply a causal function in the computation and that the performance of SAEs is highly sensitive to inductive biases.
Scientific research often seeks to understand the causal structure underlying high-level variables in a system. For example, climate scienti… (see more)sts study how phenomena, such as El Ni\~no, affect other climate processes at remote locations across the globe. However, scientists typically collect low-level measurements, such as geographically distributed temperature readings. From these, one needs to learn both a mapping to causally-relevant latent variables, such as a high-level representation of the El Ni\~no phenomenon and other processes, as well as the causal model over them. The challenge is that this task, called causal representation learning, is highly underdetermined from observational data alone, requiring other constraints during learning to resolve the indeterminacies. In this work, we consider a temporal model with a sparsity assumption, namely single-parent decoding: each observed low-level variable is only affected by a single latent variable. Such an assumption is reasonable in many scientific applications that require finding groups of low-level variables, such as extracting regions from geographically gridded measurement data in climate research or capturing brain regions from neural activity data. We demonstrate the identifiability of the resulting model and propose a differentiable method, Causal Discovery with Single-parent Decoding (CDSD), that simultaneously learns the underlying latents and a causal graph over them. We assess the validity of our theoretical results using simulated data and showcase the practical validity of our method in an application to real-world data from the climate science field.
A key objective of interpretability research on large language models (LLMs) is to develop methods for robustly steering models toward desir… (see more)ed behaviors. To this end, two distinct approaches to interpretability -- ``bottom-up"and ``top-down"-- have been presented, but there has been little quantitative comparison between them. We present a case study comparing the effectiveness of representative vector steering methods from each branch: function vectors (FV; arXiv:2310.15213), as a bottom-up method, and in-context vectors (ICV; arXiv:2311.06668) as a top-down method. While both aim to capture compact representations of broad in-context learning tasks, we find they are effective only on specific types of tasks: ICVs outperform FVs in behavioral shifting, whereas FVs excel in tasks requiring more precision. We discuss the implications for future evaluations of steering methods and for further research into top-down and bottom-up steering given these findings.
Large Language Models have been extensively studied for their vulnerabilities, particularly in the context of adversarial attacks. However, … (see more)the emergence of Vision Language Models introduces new modalities of risk that have not yet been thoroughly explored, especially when processing multiple images simultaneously. In this paper, we introduce two black-box jailbreak methods that leverage multi-image inputs to uncover vulnerabilities in these models. We present a new safety evaluation dataset for multimodal LLMs called MultiBench, which is composed of these jailbreak methods. These methods can easily be applied and evaluated using our toolkit. We test these methods against six safety aligned frontier models from Google, OpenAI, and Anthropic, revealing significant safety vulnerabilities. Our findings suggest that even the most powerful language models remain vulnerable against compositional adversarial attacks, specifically those composed of multiple images.
Advances in Large Language Models (LLMs) have spurred a wave of LLM library learning systems for mathematical reasoning. These systems aim … (see more)to learn a reusable library of *tools*, such as formal Isabelle lemmas or Python programs that are tailored to a family of tasks. Many of these systems are inspired by the human structuring of knowledge into reusable and extendable concepts, but do current methods actually learn reusable libraries of tools?
We study two library learning systems for mathematics which both reported increased accuracy: LEGO-Prover and TroVE. We find that function reuse is extremely infrequent on miniF2F and MATH. Our followup ablation experiments suggest that, rather than reuse, self-correction and self-consistency are the primary drivers of the observed performance gains.
This study investigates the application of human psychometric assessments to large language models (LLMs) to examine their consistency and m… (see more)alleability in exhibiting personality traits. We administered the Big Five Inventory (BFI) and the Eysenck Personality Questionnaire-Revised (EPQ-R) to various LLMs across different model sizes and persona prompts. Our results reveal substantial variability in responses due to question order shuffling, challenging the notion of a stable LLM "personality." Larger models demonstrated more consistent responses, while persona prompts significantly influenced trait scores. Notably, the assistant persona led to more predictable scaling, with larger models exhibiting more socially desirable and less variable traits. In contrast, non-conventional personas displayed unpredictable behaviors, sometimes extending personality trait scores beyond the typical human range. These findings have important implications for understanding LLM behavior under different conditions and reflect on the consequences of scaling.
This study investigates the application of human psychometric assessments to large language models (LLMs) to examine their consistency and m… (see more)alleability in exhibiting personality traits. We administered the Big Five Inventory (BFI) and the Eysenck Personality Questionnaire-Revised (EPQ-R) to various LLMs across different model sizes and persona prompts. Our results reveal substantial variability in responses due to question order shuffling, challenging the notion of a stable LLM "personality." Larger models demonstrated more consistent responses, while persona prompts significantly influenced trait scores. Notably, the assistant persona led to more predictable scaling, with larger models exhibiting more socially desirable and less variable traits. In contrast, non-conventional personas displayed unpredictable behaviors, sometimes extending personality trait scores beyond the typical human range. These findings have important implications for understanding LLM behavior under different conditions and reflect on the consequences of scaling.
Research and industry are rapidly advancing the innovation and adoption of foundation model-based systems, yet the tools for managing these … (see more)models have not kept pace. Understanding the provenance and lineage of models is critical for researchers, industry, regulators, and public trust. While model cards and system cards were designed to provide transparency, they fall short in key areas: tracing model genealogy, enabling machine readability, offering reliable centralized management systems, and fostering consistent creation incentives. This challenge mirrors issues in software supply chain security, but AI/ML remains at an earlier stage of maturity. Addressing these gaps requires industry-standard tooling that can be adopted by foundation model publishers, open-source model innovators, and major distribution platforms. We propose a machine-readable model specification format to simplify the creation of model records, thereby reducing error-prone human effort, notably when a new model inherits most of its design from a foundation model. Our solution explicitly traces relationships between upstream and downstream models, enhancing transparency and traceability across the model lifecycle. To facilitate the adoption, we introduce the unified model record (UMR) repository , a semantically versioned system that automates the publication of model records to multiple formats (PDF, HTML, LaTeX) and provides a hosted web interface (https://modelrecord.com/). This proof of concept aims to set a new standard for managing foundation models, bridging the gap between innovation and responsible model management.