Alessandro Sordoni

Membre industriel principal

Professeur associé, Université de Montréal, Département d'informatique et de recherche opérationnelle

Chercheur scientifique, Microsoft Research Montréal

Sujets de recherche

Grands modèles de langage (LLM)

Raisonnement

Traitement du langage naturel

Biographie

Je suis chercheur principal à Microsoft Research Montréal. J'ai obtenu un doctorat de l'Université de Montréal sous la direction de Jian-Yun Nie, en étudiant comment représenter efficacement les documents et les requêtes pour la recherche d'information. Présentement, je m’intéresse à l'étude de l'efficacité de l'apprentissage et de la généralisation systématique dans les grands modèles actuels d'apprentissage profond. Mes intérêts s'étendent à l'apprentissage non supervisé et à l'apprentissage à petite échelle, en particulier dans le domaine du langage naturel.

Étudiants actuels

Zhan Su

Collaborateur·rice alumni - University of Copenhagen

Publications

A Case Study of Instruction Tuning with Mixture of Parameter-Efficient Experts

Lucas Caccia

We study the applicability of mixture of parameter-efficient experts (MoPEs) for instruction-tuning large decoder-only language models. Rece… (voir plus)nt literature indicates that MoPEs might enhance performance in specific multi-task instruction-following datasets. In this paper, we extend such previous results and study applicability of MoPEs in settings previously overlooked: a) with open-domain instruction-following datasets; b) with recent decoder-only models and c) with downstream out-of-distribution test sets. We build on top of LLaMA1-13B/-7B and LLaMA2-13B. We study different variants of learned routing, namely per-example routing ([PE]), and a more expensive per-token ([PT]) routing. Overall, we are unable to substantiate strong performance gains observed in related studies in our setting. We observe occasional enhancements of LLAMA2 fine-tuned on Open Platypus dataset in 0-shot SNI evaluation and TruthfulQA evaluation after fine-tuning on a subset of Flan. We shed some light on the inner workings of MoPEs by comparing different routing strategies. We find that [PE] routing tends to collapse at downstream evaluation time reducing the importance of router's application. We plan to publicly release our code.

2023-10-28

NeurIPS.cc/2023/Workshop/Instruction (publié)

openreview.net

Joint Prompt Optimization of Stacked LLMs using Variational Inference

Alessandro Sordoni

Eric Yuan

Xingdi Yuan

Marc-Alexandre Côté

Matheus Pereira

Adam Trischler

Ziang Xiao

Arian Hosseini

Friederike Niedtner

Nicolas Le Roux

Large language models (LLMs) can be seen as atomic units of computation mapping sequences to a distribution over sequences. Thus, they can b… (voir plus)e seen as stochastic language layers in a language network, where the learnable parameters are the natural language prompts at each layer. By stacking two such layers and feeding the output of one layer to the next, we obtain a Deep Language Network (DLN). We first show how to effectively perform prompt optimization for a 1-Layer language network (DLN-1). Then, we present an extension that applies to 2-layer DLNs (DLN-2), where two prompts must be learned. The key idea is to consider the output of the first layer as a latent variable, which requires inference, and prompts to be learned as the parameters of the generative distribution. We first test the effectiveness of DLN-1 in multiple reasoning and natural language understanding tasks. Then, we show that DLN-2 can reach higher performance than a single layer, showing promise that we might reach comparable performance to GPT-4, even when each LLM in the network is smaller and less powerful.

openreview.net

Multi-Head Adapter Routing for Cross-Task Generalization

Lucas Caccia

Edoardo Ponti

Zhan Su

Matheus Pereira

Nicolas Le Roux

Alessandro Sordoni

Parameter-efficient fine-tuning (PEFT) for cross-task generalization consists in pre-training adapters on a multi-task training set before f… (voir plus)ew-shot adaptation to test tasks. Polytropon [Ponti et al., 2023] (

openreview.net

Combining Parameter-efficient Modules for Task-level Generalisation

Edoardo Ponti

Alessandro Sordoni

Yoshua Bengio

Siva Reddy

2023-05-01

Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics (publié)

doi.org

Expressiveness and Learnability: A Unifying View for Evaluating Self-Supervised Learning

Yuchen Lu

Romain Laroche

2023-01-01

Trans. Mach. Learn. Res. (publié)

doi.org

openreview.net

Guiding Language Model Math Reasoning with Planning Tokens

Xinyi Wang

Lucas Caccia

Oleksiy Ostapenko

Xingdi Yuan

William Yang Wang

Alessandro Sordoni

Large language models (LLMs) have recently attracted considerable interest for their ability to perform complex reasoning tasks, such as cha… (voir plus)in-of-thought reasoning. However, most of the existing approaches to enhance this ability rely heavily on data-driven methods, while neglecting the structural aspects of the model's reasoning capacity. We find that while LLMs can manage individual reasoning steps well, they struggle with maintaining consistency across an entire reasoning chain. To solve this, we introduce planning tokens at the start of each reasoning step, serving as a guide for the model, and add their embeddings to the model parameters. Our approach requires a negligible increase in trainable parameters (just 0.001%) and can be applied through either full fine-tuning or a more parameter-efficient scheme. We demonstrate our method's effectiveness by applying it to three different LLMs, showing notable accuracy improvements across three math word problem datasets w.r.t. standard fine-tuning baselines.

2023-01-01

arXiv.org (prépublication)

doi.org

arxiv.org

Using Representation Expressiveness and Learnability to Evaluate Self-Supervised Learning Methods

Yuchen Lu

Romain Laroche

2023-01-01

Trans. Mach. Learn. Res. (publié)

openreview.net

On the Compositional Generalization Gap of In-Context Learning

Pretrained large generative language models have shown great performance on many tasks, but exhibit low compositional generalization abiliti… (voir plus)es. Scaling such models has been shown to improve their performance on various NLP tasks even just by conditioning them on a few examples to solve the task without any fine-tuning (also known as in-context learning). In this work, we look at the gap between the in-distribution (ID) and out-of-distribution (OOD) performance of such models in semantic parsing tasks with in-context learning. In the ID settings, the demonstrations are from the same split (\textit{test} or \textit{train}) that the model is being evaluated on, and in the OOD settings, they are from the other split. We look at how the relative generalization gap of in-context learning evolves as models are scaled up. We evaluate four model families, OPT, BLOOM, CodeGen and Codex on three semantic parsing datasets, CFQ, SCAN and GeoQuery with different number of exemplars, and observe a trend of decreasing relative generalization gap as models are scaled up.

2022-11-15

ArXiv (prépublication)

doi.org

arxiv.org

Multi-Head Adapter Routing for Cross-Task Generalization

Lucas Caccia

Edoardo Ponti

Zhan Su

Matheus Pereira

Nicolas Le Roux

Alessandro Sordoni

2022-11-07

ArXiv (prépublication)

arxiv.org

Does Pre-training Induce Systematic Inference? How Masked Language Models Acquire Commonsense Knowledge

Ian Porada

Alessandro Sordoni

Jackie Cheung

2022-07-01

Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (publié)

doi.org

arxiv.org

Using Representation Expressiveness and Learnability to Evaluate Self-Supervised Learning Methods

Yuchen Lu

Romain Laroche

2022-06-02

ArXiv (prépublication)

arxiv.org