Portrait de Timothy O'Donnell

Timothy O'Donnell

Membre académique principal
Chaire en IA Canada-CIFAR
Professeur adjoint, McGill University, Département de linguistique
Sujets de recherche
Modèles probabilistes
Théorie de l'information
Traitement du langage naturel

Biographie

Je suis professeur adjoint au Département de linguistique de l’Université McGill. Dans mes recherches, j'élabore des modèles mathématiques de généralisation, d'apprentissage et de traitement du langage. Mes travaux s'appuient sur les méthodes expérimentales en psychologie, les techniques de modélisation formelle utilisées dans le traitement du langage naturel, les outils théoriques de la linguistique, et les problématiques de ces trois domaines.

Étudiants actuels

Doctorat - McGill
Doctorat - McGill
Doctorat - McGill
Doctorat - McGill
Superviseur⋅e principal⋅e :
Collaborateur·rice alumni - McGill
Doctorat - McGill

Publications

From Language Models over Tokens to Language Models over Characters
Tim Vieira
Benjamin LeBrun
Mario Giulianelli
Juan Luis Gastaldi
Brian DuSell
John Terilla
Ryan Cotterell
Modern language models are internally—and mathematically—distributions over *token* strings rather than *character* strings, posing nume… (voir plus)rous challenges for programmers building user applications on top of them. For example, if a prompt is specified as a character string, it must be tokenized before passing it to the token-level language model. Thus, the tokenizer and consequent processing are very sensitive to the specification of the prompt (e.g., whether the prompt ends with a space or not). This paper presents algorithms for converting token-level language models to character-level ones. We present both exact and approximate algorithms. In the empirical portion of the paper, we benchmark the practical runtime and approximation quality. Across four publicly available language models, we find that—even with a small computation budget—our method is able to accurately approximate the character-level distribution at reasonably fast speeds, and that a significant improvement in the language model's compression rate (bits/byte) is achieved.
Language Models over Canonical Byte-Pair Encodings
Tim Vieira
Tianyu Liu
Clemente Pasti
Yahya Emara
Brian DuSell
Benjamin LeBrun
Mario Giulianelli
Juan Luis Gastaldi
Ryan Cotterell
Syntactic and Semantic Control of Large Language Models via Sequential Monte Carlo
João Loula
Benjamin LeBrun
Lei Du
Ben Lipkin
Clemente Pasti
Gabriel Grand
Tianyu Liu
Yahya Emara
Marjorie Freedman
Jason Eisner
Ryan Cotterell
Vikash Mansinghka
Alexander K. Lew
Tim Vieira
A wide range of LM applications require generating text that conforms to syntactic or semantic constraints. Imposing such constraints can be… (voir plus) naturally framed as probabilistic conditioning, but exact generation from the resulting distribution—which can differ substantially from the LM’s base distribution—is generally intractable. In this work, we develop an architecture for controlled LM generation based on sequential Monte Carlo (SMC). This SMC framework allows us to flexibly incorporate domain- and problem-specific constraints at inference time, and efficiently reallocate computational resources in light of new information during the course of generation. By comparing to a number of alternatives and ablations on four challenging domains—Python code generation for data science, text-to-SQL, goal inference, and molecule synthesis—we demonstrate that, with little overhead, our approach allows small open-source language models to outperform models over 8× larger, as well as closed-source, fine-tuned ones. In support of the probabilistic perspective, we show that these performance improvements are driven by better approximation to the posterior distribution. [Our system](https://github.com/probcomp/genparse) builds on the framework of Lew et al. (2023) and integrates with its language model probabilistic programming language, giving users a simple, programmable way to apply SMC to a broad variety of controlled generation problems.
Syntactic and Semantic Control of Large Language Models via Sequential Monte Carlo
João Loula
Benjamin LeBrun
Li Du
Ben Lipkin
Clemente Pasti
Gabriel Grand
Tianyu Liu
Yahya Emara
Marjorie Freedman
Jason Eisner
Ryan Cotterell
Vikash Mansinghka
Alexander K. Lew
Tim Vieira
A wide range of LM applications require generating text that conforms to syntactic or semantic constraints. Imposing such constraints can be… (voir plus) naturally framed as probabilistic conditioning, but exact generation from the resulting distribution—which can differ substantially from the LM’s base distribution—is generally intractable. In this work, we develop an architecture for controlled LM generation based on sequential Monte Carlo (SMC). This SMC framework allows us to flexibly incorporate domain- and problem-specific constraints at inference time, and efficiently reallocate computational resources in light of new information during the course of generation. By comparing to a number of alternatives and ablations on four challenging domains—Python code generation for data science, text-to-SQL, goal inference, and molecule synthesis—we demonstrate that, with little overhead, our approach allows small open-source language models to outperform models over 8× larger, as well as closed-source, fine-tuned ones. In support of the probabilistic perspective, we show that these performance improvements are driven by better approximation to the posterior distribution. [Our system](https://github.com/probcomp/gen-parse) builds on the framework of Lew et al. (2023) and integrates with its language model probabilistic programming language, giving users a simple, programmable way to apply SMC to a broad variety of controlled generation problems.
Learning Generative Population Models From Multiple Clinical Datasets Via Probabilistic Programming
João Loula
Katherine M. Collins
Ulrich Schaechtle
Joshua B. Tenenbaum
Adrian Weller
Feras Saad
Vikash Mansinghka
Accurate, efficient generative models of clinical populations could accelerate clinical research and improve patient outcomes. For example, … (voir plus)such models could infer probable treatment outcomes for different subpopulations, generate high-fidelity synthetic data that can be shared across organizational boundaries, and discover new relationships among clinical variables. Using Bayesian structure learning, we show that it is possible to learn probabilistic program models of clinical populations by combining data from multiple, sparsely overlapping clinical datasets. Through experiments with multiple clinical trials and real-world evidence from census health surveys, we show that our model generates higher quality synthetic data than neural network baselines, supports more accurate inferences across datasets than traditional statistical methods, and can be queried more efficiently than both, opening up new avenues for accessible and efficient AI assistance in clinical research.
Learning Generative Population Models From Multiple Clinical Datasets Via Probabilistic Programming
João Loula
Katherine M. Collins
Ulrich Schaechtle
Joshua B. Tenenbaum
Adrian Weller
Feras Saad
Vikash Mansinghka
Accurate, efficient generative models of clinical populations could accelerate clinical research and improve patient outcomes. For example, … (voir plus)such models could infer probable treatment outcomes for different subpopulations, generate high-fidelity synthetic data that can be shared across organizational boundaries, and discover new relationships among clinical variables. Using Bayesian structure learning, we show that it is possible to learn probabilistic program models of clinical populations by combining data from multiple, sparsely overlapping clinical datasets. Through experiments with multiple clinical trials and real-world evidence from census health surveys, we show that our model generates higher quality synthetic data than neural network baselines, supports more accurate inferences across datasets than traditional statistical methods, and can be queried more efficiently than both, opening up new avenues for accessible and efficient AI assistance in clinical research.
Systematic Generalization by Finetuning? Analyzing Pretrained Language Models Using Constituency Tests
Aishik Chakraborty
Constituents are groups of words that behave as a syntactic unit. Many linguistic phenomena (e.g., question formation, diathesis alternation… (voir plus)s) require the manipulation and rearrangement of constituents in a sentence. In this paper, we investigate how different finetuning setups affect the ability of pretrained sequence-to-sequence language models such as BART and T5 to replicate constituency tests — transformations that involve manipulating constituents in a sentence. We design multiple evaluation settings by varying the combinations of constituency tests and sentence types that a model is exposed to during finetuning. We show that models can replicate a linguistic transformation on a specific type of sentence that they saw during finetuning, but performance degrades substantially in other settings, showing a lack of systematic generalization. These results suggest that models often learn to manipulate sentences at a surface level unrelated to the constituent-level syntactic structure, for example by copying the first word of a sentence. These results may partially explain the brittleness of pretrained language models in downstream tasks.
The Plausibility of Sampling as an Algorithmic Theory of Sentence Processing
Jacob Louis Hoover
Morgan Sonderegger
Steven T. Piantadosi
Abstract Words that are more surprising given context take longer to process. However, no incremental parsing algorithm has been shown to di… (voir plus)rectly predict this phenomenon. In this work, we focus on a class of algorithms whose runtime does naturally scale in surprisal—those that involve repeatedly sampling from the prior. Our first contribution is to show that simple examples of such algorithms predict runtime to increase superlinearly with surprisal, and also predict variance in runtime to increase. These two predictions stand in contrast with literature on surprisal theory (Hale, 2001; Levy, 2008a) which assumes that the expected processing cost increases linearly with surprisal, and makes no prediction about variance. In the second part of this paper, we conduct an empirical study of the relationship between surprisal and reading time, using a collection of modern language models to estimate surprisal. We find that with better language models, reading time increases superlinearly in surprisal, and also that variance increases. These results are consistent with the predictions of sampling-based algorithms.
The Stable Entropy Hypothesis and Entropy-Aware Decoding: An Analysis and Algorithm for Robust Natural Language Generation
Kushal Arora
Jason Aaron Edward Weston
Jackie C.K.Cheung
State-of-the-art language generation models can degenerate when applied to open-ended generation problems such as text completion, story gen… (voir plus)eration, or dialog modeling. This degeneration usually shows up in the form of incoherence, lack of vocabulary diversity, and self-repetition or copying from the context. In this paper, we postulate that ``human-like'' generations usually lie in a narrow and nearly flat entropy band, and violation of these entropy bounds correlates with degenerate behavior. Our experiments show that this stable narrow entropy zone exists across models, tasks, and domains and confirm the hypothesis that violations of this zone correlate with degeneration. We then use this insight to propose an entropy-aware decoding algorithm that respects these entropy bounds resulting in less degenerate, more contextual, and"human-like"language generation in open-ended text generation settings.
Simplicity and learning to distinguish arguments from modifiers
Leon Bergen
E. Gibson
Characterizing Idioms: Conventionality and Contingency
Michaela Socolof
Michael Wagner
Idioms are unlike most phrases in two important ways. First, words in an idiom have non-canonical meanings. Second, the non-canonical meanin… (voir plus)gs of words in an idiom are contingent on the presence of other words in the idiom. Linguistic theories differ on whether these properties depend on one another, as well as whether special theoretical machinery is needed to accommodate idioms. We define two measures that correspond to the properties above, and we show that idioms fall at the expected intersection of the two dimensions, but that the dimensions themselves are not correlated. Our results suggest that introducing special machinery to handle idioms may not be warranted.
Compositional Generalization in Dependency Parsing