
Oleksiy Ostapenko

Collaborating Alumni - Université de Montréal
Research Topics
Deep Learning
Natural Language Processing
Online Learning

Publications

Exploring Sparse Adapters for Scalable Merging of Parameter Efficient Experts
Merging parameter-efficient task experts has recently gained growing attention as a way to build modular architectures that can be rapidly adapted on the fly for specific downstream tasks, without requiring additional fine-tuning. Typically, LoRA (Low-Rank Adaptation) serves as the foundational building block of such parameter-efficient modular architectures, leveraging low-rank weight structures to reduce the number of trainable parameters. In this paper, we study the properties of sparse adapters, which train only a subset of weights in the base neural network, as potential building blocks of modular architectures. First, we propose a simple method for training highly effective sparse adapters, which is conceptually simpler than existing methods in the literature and surprisingly outperforms both LoRA and full fine-tuning in our setting. Next, we investigate the merging properties of these sparse adapters by merging adapters for up to 20 natural language processing tasks, thus scaling beyond what is usually studied in the literature. Our findings demonstrate that sparse adapters yield superior in-distribution performance post-merging compared to LoRA or full model merging. Achieving strong held-out performance remains a challenge for all methods considered.
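The core idea lends itself to a short sketch. The snippet below is a minimal, illustrative take on a sparse adapter: a frozen base linear layer receives a trainable delta restricted to a fixed random mask, and several task-specific deltas are merged by summation. The mask-selection rule, the merging rule, and all layer sizes are assumptions for illustration, not the paper's exact recipe.

```python
# A minimal sketch of a sparse adapter: only a fixed random subset of the base
# weights receives a trainable delta. Mask selection and merging are
# illustrative assumptions, not the paper's exact method.
import torch
import torch.nn as nn


class SparseAdapterLinear(nn.Module):
    def __init__(self, base: nn.Linear, keep_ratio: float = 0.05):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # the backbone stays frozen
        # Fixed binary mask choosing which weights the adapter may modify.
        mask = (torch.rand_like(base.weight) < keep_ratio).float()
        self.register_buffer("mask", mask)
        self.delta = nn.Parameter(torch.zeros_like(base.weight))

    def sparse_delta(self) -> torch.Tensor:
        return self.delta * self.mask

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return nn.functional.linear(x, self.base.weight + self.sparse_delta(), self.base.bias)


def merge_sparse_adapters(base: nn.Linear, adapters: list) -> nn.Linear:
    """Merge several task-specific sparse deltas into one set of weights."""
    merged = nn.Linear(base.in_features, base.out_features, bias=base.bias is not None)
    with torch.no_grad():
        merged.weight.copy_(base.weight + sum(a.sparse_delta() for a in adapters))
        if base.bias is not None:
            merged.bias.copy_(base.bias)
    return merged


if __name__ == "__main__":
    base = nn.Linear(16, 16)
    adapters = [SparseAdapterLinear(base) for _ in range(3)]
    merged = merge_sparse_adapters(base, adapters)
    print(merged(torch.randn(2, 16)).shape)  # torch.Size([2, 16])
```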
Guiding Language Model Reasoning with Planning Tokens
Xinyi Wang
Lucas Caccia
Xingdi Yuan
William Yang Wang
Large language models (LLMs) have recently attracted considerable interest for their ability to perform complex reasoning tasks, such as chain-of-thought (CoT) reasoning. However, most of the existing approaches to enhance this ability rely heavily on data-driven methods, while neglecting the structural aspects of the model's reasoning capacity. To encourage a more structural generation of CoT steps, we propose a hierarchical generation scheme: we let the LM generate a planning token at the start of each reasoning step, intuitively serving as a high-level plan of the current step, and add the embeddings of these planning tokens to the model parameters. Our approach requires a negligible increase in trainable parameters (0.001%) and can be applied through either full fine-tuning or a more parameter-efficient scheme. We demonstrate our method's effectiveness by applying it to three different LLMs, showing notable accuracy improvements across three math word problem datasets and one multihop QA dataset with respect to standard fine-tuning baselines.
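A rough sketch of the mechanism, under assumed tokenizer and model choices (GPT-2 here) and hypothetical plan-token names and step delimiter: a handful of special tokens are added to the vocabulary (a tiny parameter increase via new embeddings), and one is prepended to every reasoning step in the training chain.

```python
# A minimal sketch of the planning-token idea: a few special tokens are added
# to the vocabulary and one is prepended to every chain-of-thought step. Token
# names, the step delimiter, and the plan-prediction rule are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

PLAN_TOKENS = ["<plan_add>", "<plan_sub>", "<plan_mul>"]  # hypothetical plan types

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# New embeddings for the planning tokens: a ~0.001% increase in parameters.
tokenizer.add_special_tokens({"additional_special_tokens": PLAN_TOKENS})
model.resize_token_embeddings(len(tokenizer))


def insert_planning_tokens(cot: str, plan_for_step) -> str:
    """Prepend an (externally predicted) planning token to each reasoning step."""
    steps = [s.strip() for s in cot.split("\n") if s.strip()]
    return "\n".join(f"{plan_for_step(s)} {s}" for s in steps)


example = "3 + 4 = 7\n7 * 2 = 14"
print(insert_planning_tokens(example, lambda s: "<plan_mul>" if "*" in s else "<plan_add>"))
```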
Towards Modular LLMs by Building and Reusing a Library of LoRAs
Edoardo Ponti
Matheus Pereira
Lucas Caccia
The growing number of parameter-efficient adaptations of a base large language model (LLM) calls for studying whether we can reuse such trained adapters to improve performance for new tasks. We study how to best build a library of adapters given multi-task data and devise techniques for both zero-shot and supervised task generalization through routing in such a library. We benchmark existing approaches to build this library and introduce model-based clustering, MBC, a method that groups tasks based on the similarity of their adapter parameters, indirectly optimizing for transfer across the multi-task dataset. To reuse the library, we present a novel zero-shot routing mechanism, Arrow, which enables dynamic selection of the most relevant adapters for new inputs without the need for retraining. We experiment with several LLMs, such as Phi-2 and Mistral, on a wide array of held-out tasks, verifying that MBC-based adapters and Arrow routing lead to superior generalization to new tasks. We make steps towards creating modular, adaptable LLMs that can match or outperform traditional joint training.
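The sketch below illustrates the two ingredients in simplified form: model-based clustering groups tasks by the similarity of their flattened adapter parameters, and an Arrow-style router scores adapters for a new input by the alignment between its hidden state and a per-adapter prototype direction (taken here as the top singular vector of the adapter delta). The prototype construction, the tiny k-means loop, and all shapes are simplifications for illustration, not the paper's exact algorithm.

```python
# Illustrative sketch: (1) cluster tasks by adapter-parameter similarity (MBC),
# (2) route inputs to adapters whose prototype direction best matches the
# hidden state (Arrow-style). Both are simplified assumptions.
import torch


def cluster_tasks_by_adapter_similarity(adapter_params: dict, k: int):
    """Group tasks with similar adapter weights using k-means on flattened params."""
    names = list(adapter_params)
    X = torch.stack([adapter_params[n].flatten() for n in names])
    X = torch.nn.functional.normalize(X, dim=-1)
    # Tiny k-means: enough to convey the idea, not a tuned implementation.
    centers = X[torch.randperm(len(X))[:k]].clone()
    for _ in range(10):
        assign = (X @ centers.T).argmax(dim=-1)
        for c in range(k):
            if (assign == c).any():
                centers[c] = X[assign == c].mean(dim=0)
    return {n: int(a) for n, a in zip(names, assign)}


def arrow_route(hidden: torch.Tensor, lora_deltas: list, top_k: int = 2):
    """Score each adapter by |<hidden, top singular direction of its delta>|."""
    protos = torch.stack([torch.linalg.svd(d, full_matrices=False).Vh[0] for d in lora_deltas])
    scores = (hidden @ protos.T).abs()
    return scores.topk(top_k, dim=-1).indices


if __name__ == "__main__":
    params = {f"task{i}": torch.randn(32) for i in range(6)}
    print(cluster_tasks_by_adapter_similarity(params, k=2))
    deltas = [torch.randn(8, 16) for _ in range(4)]
    print(arrow_route(torch.randn(3, 16), deltas))  # top-2 adapter indices per input
```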
Challenging Common Assumptions about Catastrophic Forgetting and Knowledge Accumulation
Timothée Lesort
Pau Rodriguez
Md Rifat Arefin
A Case Study of Instruction Tuning with Mixture of Parameter-Efficient Experts
We study the applicability of mixture of parameter-efficient experts (MoPEs) for instruction-tuning large decoder-only language models. Recent literature indicates that MoPEs might enhance performance in specific multi-task instruction-following datasets. In this paper, we extend such previous results and study the applicability of MoPEs in settings previously overlooked: a) with open-domain instruction-following datasets; b) with recent decoder-only models; and c) with downstream out-of-distribution test sets. We build on top of LLaMA1-13B/-7B and LLaMA2-13B. We study different variants of learned routing, namely per-example routing ([PE]) and a more expensive per-token ([PT]) routing. Overall, we are unable to substantiate strong performance gains observed in related studies in our setting. We observe occasional enhancements of LLaMA2 fine-tuned on the Open Platypus dataset in 0-shot SNI evaluation and TruthfulQA evaluation after fine-tuning on a subset of Flan. We shed some light on the inner workings of MoPEs by comparing different routing strategies. We find that [PE] routing tends to collapse at downstream evaluation time, reducing the importance of the router. We plan to publicly release our code.
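The difference between the two routing granularities can be made concrete with a small schematic layer: per-example routing ([PE]) pools the sequence before choosing experts, while per-token routing ([PT]) chooses experts at every position. The expert modules below are plain linear layers standing in for parameter-efficient experts; the pooling choice and all shapes are assumptions for illustration.

```python
# Schematic comparison of per-example ([PE]) vs per-token ([PT]) routing over a
# set of placeholder experts. Sizes and pooling are illustrative assumptions.
import torch
import torch.nn as nn


class MoPELayer(nn.Module):
    def __init__(self, d_model: int, n_experts: int, per_token: bool = False):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_experts))
        self.per_token = per_token

    def forward(self, h: torch.Tensor) -> torch.Tensor:  # h: (batch, seq, d_model)
        if self.per_token:
            gates = self.router(h).softmax(dim=-1)               # (B, T, E): one choice per token
        else:
            gates = self.router(h.mean(dim=1)).softmax(dim=-1)   # (B, E): one choice per example
            gates = gates[:, None, :].expand(-1, h.size(1), -1)  # broadcast over tokens
        expert_out = torch.stack([e(h) for e in self.experts], dim=-1)  # (B, T, D, E)
        return (expert_out * gates[:, :, None, :]).sum(dim=-1)


h = torch.randn(2, 5, 16)
print(MoPELayer(16, n_experts=4, per_token=True)(h).shape)   # torch.Size([2, 5, 16])
print(MoPELayer(16, n_experts=4, per_token=False)(h).shape)  # torch.Size([2, 5, 16])
```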
From IID to the Independent Mechanisms assumption in continual learning
Pau Rodriguez
Alexandre Lacoste
Current machine learning algorithms are successful in learning clearly defined tasks from large i.i.d. data. Continual learning (CL) requires learning without i.i.d.-ness and developing algorithms capable of knowledge retention and transfer; the latter can be boosted through systematic generalization. Dropping the i.i.d. assumption requires replacing it with another hypothesis. While there are several candidates, here we advocate that the independent mechanisms assumption (IM) (Schölkopf et al., 2012) is a useful hypothesis for representing knowledge in a form that makes it easy to adapt to new tasks in CL. Specifically, we review several types of distribution shifts that are common in CL and point out in which way a system that represents knowledge in the form of causal modules may outperform monolithic counterparts in CL. Intuitively, the efficacy of the IM solution emerges since (i) causal modules learn mechanisms invariant across domains; (ii) if causal mechanisms must be updated, modularity can enable efficient and sparse updates.
Guiding Language Model Math Reasoning with Planning Tokens
Xinyi Wang
Lucas Caccia
Xingdi Yuan
William Yang Wang
Large language models (LLMs) have recently attracted considerable interest for their ability to perform complex reasoning tasks, such as chain-of-thought reasoning. However, most of the existing approaches to enhance this ability rely heavily on data-driven methods, while neglecting the structural aspects of the model's reasoning capacity. We find that while LLMs can manage individual reasoning steps well, they struggle with maintaining consistency across an entire reasoning chain. To solve this, we introduce planning tokens at the start of each reasoning step, serving as a guide for the model, and add their embeddings to the model parameters. Our approach requires a negligible increase in trainable parameters (just 0.001%) and can be applied through either full fine-tuning or a more parameter-efficient scheme. We demonstrate our method's effectiveness by applying it to three different LLMs, showing notable accuracy improvements across three math word problem datasets w.r.t. standard fine-tuning baselines.
Continual Learning with Foundation Models: An Empirical Study of Latent Replay
Timothée Lesort
Pau Rodriguez
Md Rifat Arefin
Arthur Douillard
Rapid development of large-scale pre-training has resulted in foundation models that can act as effective feature extractors on a variety of downstream tasks and domains. Motivated by this, we study the efficacy of pre-trained vision models as a foundation for downstream continual learning (CL) scenarios. Our goal is twofold. First, we want to understand the compute-accuracy trade-off between CL in the raw-data space and in the latent space of pre-trained encoders. Second, we investigate how the characteristics of the encoder, the pre-training algorithm and data, as well as of the resulting latent space affect CL performance. For this, we compare the efficacy of various pre-trained models in large-scale benchmarking scenarios with a vanilla replay setting applied in the latent and in the raw-data space. Notably, this study shows how transfer, forgetting, task similarity and learning are dependent on the input data characteristics and not necessarily on the CL algorithms. First, we show that under some circumstances reasonable CL performance can readily be achieved with a non-parametric classifier at negligible compute. We then show how models pre-trained on broader data result in better performance for various replay sizes. We explain this with representational similarity and transfer properties of these representations. Finally, we show the effectiveness of self-supervised pre-training for downstream domains that are out-of-distribution as compared to the pre-training domain. We point out and validate several research directions that can further increase the efficacy of latent CL including representation ensembling. The diverse set of datasets used in this study can serve as a compute-efficient playground for further CL research. We will publish the code.
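A minimal sketch of latent replay with a frozen pre-trained encoder, assuming a torchvision ResNet-18 backbone and a naive unbounded buffer: inputs are encoded once, the resulting latents are stored, and only a lightweight classifier is trained on a mix of current-task and replayed latents.

```python
# Latent replay sketch: frozen pre-trained encoder, small trainable classifier,
# and a naive buffer of (latent, label) pairs. Encoder choice and buffer policy
# are illustrative assumptions, not the paper's exact setup.
import random
import torch
import torch.nn as nn
import torchvision

encoder = torchvision.models.resnet18(weights="IMAGENET1K_V1")
encoder.fc = nn.Identity()                    # expose 512-d features
encoder.eval().requires_grad_(False)          # the backbone stays frozen

classifier = nn.Linear(512, 10)
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)
buffer = []                                   # replay buffer of (latent, label)


def train_step(images: torch.Tensor, labels: torch.Tensor, replay_size: int = 32):
    with torch.no_grad():
        latents = encoder(images)             # compute latents once, in latent space
    xs, ys = [latents], [labels]
    if buffer:                                # mix in replayed latents
        old = random.sample(buffer, min(replay_size, len(buffer)))
        xs.append(torch.stack([z for z, _ in old]))
        ys.append(torch.tensor([y for _, y in old]))
    loss = nn.functional.cross_entropy(classifier(torch.cat(xs)), torch.cat(ys))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    buffer.extend(zip(latents, labels.tolist()))  # store current latents for later replay
    return loss.item()


print(train_step(torch.randn(8, 3, 224, 224), torch.randint(0, 10, (8,))))
```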
Attention for Compositional Modularity
Pau Rodriguez
Alexandre Lacoste
Modularity and compositionality are promising inductive biases for addressing longstanding problems in machine learning such as better systematic generalization, as well as better transfer and lower forgetting in the context of continual learning. Here we study how attention-based module selection can help achieve compositional modularity, i.e. decomposition of tasks into meaningful sub-tasks which are tackled by independent architectural entities that we call modules. These sub-tasks must be reusable and the system should be able to learn them without additional supervision. We design a simple experimental setup in which the model is trained to solve mathematical equations with multiple math operations applied sequentially. We study different attention-based module selection strategies, inspired by the principles introduced in the recent literature. We evaluate the method's ability to learn modules that can recover the underlying sub-tasks (operations) used for data generation, as well as the ability to generalize compositionally. We find that meaningful module selection (i.e. routing) is the key to compositional generalization. Further, without access to privileged information about which part of the input should be used for module selection, the routing component performs poorly for samples that are compositionally out of the training distribution. We find that the main reason for this lies in the routing component, since many of the tested methods perform well OOD if we report the performance of the best-performing path at test time. Additionally, we study the role of the number of primitives, the number of training points and bottlenecks for modular specialization.
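The module-selection mechanism can be sketched as a small attention-based router: a query derived from the input attends over learned per-module keys, and module outputs are mixed by the resulting attention weights. Module and router sizes, and the soft (rather than hard top-1) mixing, are illustrative assumptions rather than the paper's exact architecture.

```python
# Attention-based module selection sketch: an input-derived query attends over
# per-module keys; module outputs are combined with the attention weights.
# Sizes and soft mixing are illustrative assumptions.
import torch
import torch.nn as nn


class AttentionRouter(nn.Module):
    def __init__(self, d_in: int, d_out: int, n_modules: int):
        super().__init__()
        self.modules_ = nn.ModuleList(nn.Linear(d_in, d_out) for _ in range(n_modules))
        self.query = nn.Linear(d_in, d_out)                       # routing query from the input
        self.keys = nn.Parameter(torch.randn(n_modules, d_out))   # one learned key per module

    def forward(self, x: torch.Tensor):
        scores = self.query(x) @ self.keys.T / self.keys.size(-1) ** 0.5
        weights = scores.softmax(dim=-1)                          # soft module selection
        outputs = torch.stack([m(x) for m in self.modules_], dim=1)  # (B, n_modules, d_out)
        return (weights.unsqueeze(-1) * outputs).sum(dim=1), weights


router = AttentionRouter(d_in=8, d_out=16, n_modules=4)
y, w = router(torch.randn(2, 8))
print(y.shape, w.argmax(dim=-1))  # mixed output and the dominant module per example
```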