Portrait of Zhan Su is unavailable

Zhan Su

Research Intern - University of Copenhagen
Supervisor

Publications

Towards Modular LLMs by Building and Reusing a Library of LoRAs
Oleksiy Ostapenko
Zhan Su
Edoardo Ponti
Matheus Pereira
Lucas Caccia
The growing number of parameter-efficient adaptations of a base large language model (LLM) calls for studying whether we can reuse such trai… (see more)ned adapters to improve performance for new tasks. We study how to best build a library of adapters given multi-task data and devise techniques for both zero-shot and supervised task generalization through routing in such library. We benchmark existing approaches to build this library and introduce model-based clustering, MBC, a method that groups tasks based on the similarity of their adapter parameters, indirectly optimizing for transfer across the multi-task dataset. To re-use the library, we present a novel zero-shot routing mechanism, Arrow, which enables dynamic selection of the most relevant adapters for new inputs without the need for retraining. We experiment with several LLMs, such as Phi-2 and Mistral, on a wide array of held-out tasks, verifying that MBC-based adapters and Arrow routing lead to superior generalization to new tasks. We make steps towards creating modular, adaptable LLMs that can match or outperform traditional joint training.
Towards Modular LLMs by Building and Reusing a Library of LoRAs
Oleksiy Ostapenko
Zhan Su
Edoardo Ponti
Matheus Pereira
Lucas Caccia
A Case Study of Instruction Tuning with Mixture of Parameter-Efficient Experts
Oleksiy Ostapenko
Lucas Caccia
Zhan Su
We study the applicability of mixture of parameter-efficient experts (MoPEs) for instruction-tuning large decoder-only language models. Rece… (see more)nt literature indicates that MoPEs might enhance performance in specific multi-task instruction-following datasets. In this paper, we extend such previous results and study applicability of MoPEs in settings previously overlooked: a) with open-domain instruction-following datasets; b) with recent decoder-only models and c) with downstream out-of-distribution test sets. We build on top of LLaMA1-13B/-7B and LLaMA2-13B. We study different variants of learned routing, namely per-example routing ([PE]), and a more expensive per-token ([PT]) routing. Overall, we are unable to substantiate strong performance gains observed in related studies in our setting. We observe occasional enhancements of LLAMA2 fine-tuned on Open Platypus dataset in 0-shot SNI evaluation and TruthfulQA evaluation after fine-tuning on a subset of Flan. We shed some light on the inner workings of MoPEs by comparing different routing strategies. We find that [PE] routing tends to collapse at downstream evaluation time reducing the importance of router's application. We plan to publicly release our code.
Multi-Head Adapter Routing for Cross-Task Generalization
Lucas Caccia
Edoardo Ponti
Zhan Su
Matheus Pereira
Parameter-efficient fine-tuning (PEFT) for cross-task generalization consists in pre-training adapters on a multi-task training set before f… (see more)ew-shot adaptation to test tasks. Polytropon [Ponti et al., 2023] (
Multi-Head Adapter Routing for Cross-Task Generalization
Lucas Caccia
Edoardo Ponti
Zhan Su
Matheus Pereira
Parameter-efficient fine-tuning (PEFT) for cross-task generalization consists in pre-training adapters on a multi-task training set before f… (see more)ew-shot adaptation to test tasks. Polytropon [Ponti et al., 2023] (