Timothy O'Donnell

Mario Giulianelli

Juan Luis Gastaldi

Brian DuSell

John Terilla

Ryan Cotterell

Modern language models are internally—and mathematically—distributions over *token* strings rather than *character* strings, posing nume… (voir plus)rous challenges for programmers building user applications on top of them. For example, if a prompt is specified as a character string, it must be tokenized before passing it to the token-level language model. Thus, the tokenizer and consequent processing are very sensitive to the specification of the prompt (e.g., whether the prompt ends with a space or not). This paper presents algorithms for converting token-level language models to character-level ones. We present both exact and approximate algorithms. In the empirical portion of the paper, we benchmark the practical runtime and approximation quality. Across four publicly available language models, we find that—even with a small computation budget—our method is able to accurately approximate the character-level distribution at reasonably fast speeds, and that a significant improvement in the language model's compression rate (bits/byte) is achieved.

2025-05-01

ICML.cc/2025/Conference (poster)

Language Models over Canonical Byte-Pair Encodings

Tim Vieira

Tianyu Liu

Clemente Pasti

Yahya Emara

Brian DuSell

Mario Giulianelli

Juan Luis Gastaldi

Ryan Cotterell

2025-05-01

ICML.cc/2025/Conference (poster)

Syntactic and Semantic Control of Large Language Models via Sequential Monte Carlo

João Loula

Lei Du

Ben Lipkin

Clemente Pasti

Gabriel Grand

Tianyu Liu

Yahya Emara

Marjorie Freedman

Jason Eisner

Ryan Cotterell

Vikash Mansinghka

Alexander K. Lew

Tim Vieira

A wide range of LM applications require generating text that conforms to syntactic or semantic constraints. Imposing such constraints can be… (voir plus) naturally framed as probabilistic conditioning, but exact generation from the resulting distribution—which can differ substantially from the LM’s base distribution—is generally intractable. In this work, we develop an architecture for controlled LM generation based on sequential Monte Carlo (SMC). This SMC framework allows us to flexibly incorporate domain- and problem-specific constraints at inference time, and efficiently reallocate computational resources in light of new information during the course of generation. By comparing to a number of alternatives and ablations on four challenging domains—Python code generation for data science, text-to-SQL, goal inference, and molecule synthesis—we demonstrate that, with little overhead, our approach allows small open-source language models to outperform models over 8× larger, as well as closed-source, fine-tuned ones. In support of the probabilistic perspective, we show that these performance improvements are driven by better approximation to the posterior distribution. [Our system](https://github.com/probcomp/genparse) builds on the framework of Lew et al. (2023) and integrates with its language model probabilistic programming language, giving users a simple, programmable way to apply SMC to a broad variety of controlled generation problems.

2025-01-22

ICLR.cc/2025/Conference (présentation orale)

Syntactic and Semantic Control of Large Language Models via Sequential Monte Carlo

João Loula

Li Du

Ben Lipkin

Clemente Pasti

Gabriel Grand

Tianyu Liu

Yahya Emara

Marjorie Freedman

Jason Eisner

Ryan Cotterell

Vikash Mansinghka

Alexander K. Lew

Tim Vieira

A wide range of LM applications require generating text that conforms to syntactic or semantic constraints. Imposing such constraints can be… (voir plus) naturally framed as probabilistic conditioning, but exact generation from the resulting distribution—which can differ substantially from the LM’s base distribution—is generally intractable. In this work, we develop an architecture for controlled LM generation based on sequential Monte Carlo (SMC). This SMC framework allows us to flexibly incorporate domain- and problem-specific constraints at inference time, and efficiently reallocate computational resources in light of new information during the course of generation. By comparing to a number of alternatives and ablations on four challenging domains—Python code generation for data science, text-to-SQL, goal inference, and molecule synthesis—we demonstrate that, with little overhead, our approach allows small open-source language models to outperform models over 8× larger, as well as closed-source, fine-tuned ones. In support of the probabilistic perspective, we show that these performance improvements are driven by better approximation to the posterior distribution. [Our system](https://github.com/probcomp/gen-parse) builds on the framework of Lew et al. (2023) and integrates with its language model probabilistic programming language, giving users a simple, programmable way to apply SMC to a broad variety of controlled generation problems.

2025-01-22

ICLR.cc/2025/Conference (présentation orale)

Learning Generative Population Models From Multiple Clinical Datasets Via Probabilistic Programming

João Loula

Katherine M. Collins

Ulrich Schaechtle

Joshua B. Tenenbaum

Adrian Weller

Feras Saad

Vikash Mansinghka

Accurate, efficient generative models of clinical populations could accelerate clinical research and improve patient outcomes. For example, … (voir plus)such models could infer probable treatment outcomes for different subpopulations, generate high-fidelity synthetic data that can be shared across organizational boundaries, and discover new relationships among clinical variables. Using Bayesian structure learning, we show that it is possible to learn probabilistic program models of clinical populations by combining data from multiple, sparsely overlapping clinical datasets. Through experiments with multiple clinical trials and real-world evidence from census health surveys, we show that our model generates higher quality synthetic data than neural network baselines, supports more accurate inferences across datasets than traditional statistical methods, and can be queried more efficiently than both, opening up new avenues for accessible and efficient AI assistance in clinical research.

2024-06-17

ICML.cc/2024/Workshop/AccMLBio (poster)

Learning Generative Population Models From Multiple Clinical Datasets Via Probabilistic Programming

João Loula

Katherine M. Collins

Ulrich Schaechtle

Joshua B. Tenenbaum

Adrian Weller

Feras Saad

Vikash Mansinghka

Accurate, efficient generative models of clinical populations could accelerate clinical research and improve patient outcomes. For example, … (voir plus)such models could infer probable treatment outcomes for different subpopulations, generate high-fidelity synthetic data that can be shared across organizational boundaries, and discover new relationships among clinical variables. Using Bayesian structure learning, we show that it is possible to learn probabilistic program models of clinical populations by combining data from multiple, sparsely overlapping clinical datasets. Through experiments with multiple clinical trials and real-world evidence from census health surveys, we show that our model generates higher quality synthetic data than neural network baselines, supports more accurate inferences across datasets than traditional statistical methods, and can be queried more efficiently than both, opening up new avenues for accessible and efficient AI assistance in clinical research.

2024-06-17

ICML.cc/2024/Workshop/AccMLBio (poster)

Systematic Generalization by Finetuning? Analyzing Pretrained Language Models Using Constituency Tests

Aishik Chakraborty

Jackie Cheung

Constituents are groups of words that behave as a syntactic unit. Many linguistic phenomena (e.g., question formation, diathesis alternation… (voir plus)s) require the manipulation and rearrangement of constituents in a sentence. In this paper, we investigate how different finetuning setups affect the ability of pretrained sequence-to-sequence language models such as BART and T5 to replicate constituency tests — transformations that involve manipulating constituents in a sentence. We design multiple evaluation settings by varying the combinations of constituency tests and sentence types that a model is exposed to during finetuning. We show that models can replicate a linguistic transformation on a specific type of sentence that they saw during finetuning, but performance degrades substantially in other settings, showing a lack of systematic generalization. These results suggest that models often learn to manipulate sentences at a surface level unrelated to the constituent-level syntactic structure, for example by copying the first word of a sentence. These results may partially explain the brittleness of pretrained language models in downstream tasks.

2023-12-01

Proceedings of the 6th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP (publié)

The Plausibility of Sampling as an Algorithmic Theory of Sentence Processing

Jacob Louis Hoover

Morgan Sonderegger

Steven T. Piantadosi

Abstract Words that are more surprising given context take longer to process. However, no incremental parsing algorithm has been shown to di… (voir plus)rectly predict this phenomenon. In this work, we focus on a class of algorithms whose runtime does naturally scale in surprisal—those that involve repeatedly sampling from the prior. Our first contribution is to show that simple examples of such algorithms predict runtime to increase superlinearly with surprisal, and also predict variance in runtime to increase. These two predictions stand in contrast with literature on surprisal theory (Hale, 2001; Levy, 2008a) which assumes that the expected processing cost increases linearly with surprisal, and makes no prediction about variance. In the second part of this paper, we conduct an empirical study of the relationship between surprisal and reading time, using a collection of modern language models to estimate surprisal. We find that with better language models, reading time increases superlinearly in surprisal, and also that variance increases. These results are consistent with the predictions of sampling-based algorithms.

2023-06-01

Open Mind (publié)

The Stable Entropy Hypothesis and Entropy-Aware Decoding: An Analysis and Algorithm for Robust Natural Language Generation

Kushal Arora

Doina Precup

Jason Aaron Edward Weston

Jackie C.K.Cheung

State-of-the-art language generation models can degenerate when applied to open-ended generation problems such as text completion, story gen… (voir plus)eration, or dialog modeling. This degeneration usually shows up in the form of incoherence, lack of vocabulary diversity, and self-repetition or copying from the context. In this paper, we postulate that ``human-like'' generations usually lie in a narrow and nearly flat entropy band, and violation of these entropy bounds correlates with degenerate behavior. Our experiments show that this stable narrow entropy zone exists across models, tasks, and domains and confirm the hypothesis that violations of this zone correlate with degeneration. We then use this insight to propose an entropy-aware decoding algorithm that respects these entropy bounds resulting in less degenerate, more contextual, and"human-like"language generation in open-ended text generation settings.

2023-02-14

ArXiv (prépublication)

arxiv.org

Simplicity and learning to distinguish arguments from modifiers

Leon Bergen

E. Gibson

2022-12-28

Journal of Language Modelling (publié)

Characterizing Idioms: Conventionality and Contingency

Michaela Socolof

Jackie Cheung

Michael Wagner

Idioms are unlike most phrases in two important ways. First, words in an idiom have non-canonical meanings. Second, the non-canonical meanin… (voir plus)gs of words in an idiom are contingent on the presence of other words in the idiom. Linguistic theories differ on whether these properties depend on one another, as well as whether special theoretical machinery is needed to accommodate idioms. We define two measures that correspond to the properties above, and we show that idioms fall at the expected intersection of the two dimensions, but that the dimensions themselves are not correlated. Our results suggest that introducing special machinery to handle idioms may not be warranted.

2022-05-01

Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (publié)

arxiv.org

Compositional Generalization in Dependency Parsing

Emily D. Goodwin

Siva Reddy

Dzmitry Bahdanau

2022-05-01

Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (publié)