
Benjamin LeBrun

Alumni

Publications

From Language Models over Tokens to Language Models over Characters
Tim Vieira
Mario Giulianelli
Juan Luis Gastaldi
Brian DuSell
John Anthony Terilla
Timothy J. O’Donnell
Ryan Cotterell
Modern language models are internally—and mathematically—distributions over token strings rather than character strings, posing numerous challenges for programmers building user applications on top of them. For example, if a prompt is specified as a character string, it must be tokenized before passing it to the token-level language model. Thus, the tokenizer and consequent processing are very sensitive to the specification of the prompt (e.g., whether the prompt ends with a space or not). This paper presents algorithms for converting token-level language models to character-level ones. We present both exact and approximate algorithms. In the empirical portion of the paper, we benchmark the practical runtime and approximation quality. Across four publicly available language models, we find that—even with a small computation budget—our method is able to accurately approximate the character-level distribution at reasonably fast speeds, and that a significant improvement in the language model’s compression rate (bits/byte) is achieved.
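As a rough illustration of the conversion problem the abstract describes (not the paper's algorithm), the sketch below marginalizes a toy token-level distribution over complete token strings into a character-level prefix probability. The vocabulary, the uniform distribution, and all function names are invented for illustration only.

```python
# Toy illustration (not the paper's algorithm): converting a token-level
# distribution into a character-level prefix probability by marginalizing
# over all token strings whose decoded text starts with the given prefix.
# The vocabulary and probabilities below are made up for illustration.

import itertools

VOCAB = ["a", "b", "ab", "ba"]          # hypothetical token vocabulary

def decode(tokens):
    """Concatenate tokens back into a character string."""
    return "".join(tokens)

def token_string_prob(tokens, max_len=3):
    """Uniform toy distribution over all token strings of length <= max_len."""
    num_strings = sum(len(VOCAB) ** n for n in range(1, max_len + 1))
    return 1.0 / num_strings if 1 <= len(tokens) <= max_len else 0.0

def char_prefix_prob(prefix, max_len=3):
    """P(character string starts with `prefix`) under the toy token-level model."""
    total = 0.0
    for n in range(1, max_len + 1):
        for tokens in itertools.product(VOCAB, repeat=n):
            if decode(tokens).startswith(prefix):
                total += token_string_prob(tokens, max_len)
    return total

# Two different token strings, ("a", "b") and ("ab",), decode to the same
# characters "ab"; the character-level probability sums over both.
print(char_prefix_prob("ab"))
```

The point of the sketch is only that many distinct token strings map to the same character string, so a character-level probability is a sum over tokenizations; the paper's exact and approximate algorithms make this tractable for real language models.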
Language Models over Canonical Byte-Pair Encodings
Tim Vieira
Tianyu Liu
Clemente Pasti
Yahya Emara
Brian DuSell
Mario Giulianelli
Juan Luis Gastaldi
Timothy J. O’Donnell
Ryan Cotterell
Modern language models represent probability distributions over character strings as distributions over (shorter) token strings derived via a deterministic tokenizer, such as byte-pair encoding. While this approach is highly effective at scaling up language models to large corpora, its current incarnations have a concerning property: the model assigns nonzero probability mass to an exponential number of noncanonical token encodings of each character string—these are token strings that decode to valid character strings but are impossible under the deterministic tokenizer (i.e., they will never be seen in any training corpus, no matter how large). This misallocation is both erroneous, as noncanonical strings never appear in training data, and wasteful, diverting probability mass away from plausible outputs. These are avoidable mistakes! In this work, we propose methods to enforce canonicality in token-level language models, ensuring that only canonical token strings are assigned positive probability. We present two approaches: (1) canonicality by conditioning, leveraging test-time inference strategies without additional training, and (2) canonicality by construction, a model parameterization that guarantees canonical outputs but requires training. We demonstrate that fixing canonicality mistakes improves the likelihood of held-out data for several models and corpora.
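A hedged sketch of the "canonicality by conditioning" idea (not the paper's implementation): keep only token strings that survive an encode/decode round trip under the deterministic tokenizer and renormalize. The greedy longest-match encoder below is a hypothetical stand-in for a real BPE tokenizer.

```python
# Hypothetical sketch of "canonicality by conditioning": zero out token strings
# that the deterministic tokenizer would never produce, then renormalize.
# The greedy longest-match encoder stands in for a real BPE tokenizer.

VOCAB = ["a", "b", "ab", "aa"]

def encode(text):
    """Deterministic (greedy longest-match) tokenizer -- the canonical encoding."""
    tokens, i = [], 0
    while i < len(text):
        match = max((t for t in VOCAB if text.startswith(t, i)), key=len)
        tokens.append(match)
        i += len(match)
    return tokens

def decode(tokens):
    return "".join(tokens)

def is_canonical(tokens):
    """A token string is canonical iff re-encoding its decoding reproduces it."""
    return encode(decode(tokens)) == list(tokens)

def condition_on_canonical(dist):
    """Restrict a distribution over token strings to canonical ones and renormalize."""
    kept = {t: p for t, p in dist.items() if is_canonical(t)}
    z = sum(kept.values())
    return {t: p / z for t, p in kept.items()}

# Both token strings decode to "aab", but only the greedy encoding is canonical.
dist = {("aa", "b"): 0.5, ("a", "ab"): 0.5}
print(condition_on_canonical(dist))   # all mass moves to ("aa", "b")
```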
Fast Controlled Generation from Language Models with Adaptive Weighted Rejection Sampling
Ben Lipkin
Jacob Hoover Vigly
João Loula
David R. MacIver
Li Du
Jason Eisner
Ryan Cotterell
Vikash Mansinghka
Alexander K. Lew
Tim Vieira
The dominant approach to generating from language models subject to some constraint is locally constrained decoding (LCD), incrementally sampling tokens at each time step such that the constraint is never violated. Typically, this is achieved through token masking: looping over the vocabulary and excluding non-conforming tokens. There are two important problems with this approach. (i) Evaluating the constraint on every token can be prohibitively expensive—LM vocabularies often exceed 100,000 tokens. (ii) LCD can distort the global distribution over strings, sampling tokens based only on local information, even if they lead down dead-end paths. This work introduces a new algorithm that addresses both these problems. First, to avoid evaluating a constraint on the full vocabulary at each step of generation, we propose an adaptive rejection sampling algorithm that typically requires orders of magnitude fewer constraint evaluations. Second, we show how this algorithm can be extended to produce low-variance, unbiased estimates of importance weights at a very small additional cost—estimates that can be soundly used within previously proposed sequential Monte Carlo algorithms to correct for the myopic behavior of local constraint enforcement. Through extensive empirical evaluation in text-to-SQL, molecular synthesis, goal inference, pattern matching, and JSON domains, we show that our approach is superior to state-of-the-art baselines, supporting a broader class of constraints and improving both runtime and performance. Additional theoretical and empirical analyses show that our method's runtime efficiency is driven by its dynamic use of computation, scaling with the divergence between the unconstrained and constrained LM, and as a consequence, runtime improvements are greater for better models.
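A minimal sketch of the adaptive-rejection idea the abstract describes (not the authors' implementation, and omitting the importance-weight estimates): sample a token from the unconstrained distribution, test the constraint only on that token, and, if it fails, remove it and resample from the renormalized remainder. All names below are hypothetical.

```python
# Hedged sketch of adaptive rejection sampling for constrained decoding
# (illustrative only; omits the paper's importance-weight machinery).
# Instead of masking the whole vocabulary up front, we lazily evaluate the
# constraint only on tokens we actually sample, rejecting and renormalizing.

import numpy as np

def adaptive_rejection_step(probs, constraint_ok, rng=np.random.default_rng()):
    """
    probs:         unconstrained next-token probabilities, shape (V,)
    constraint_ok: callable token_id -> bool (potentially expensive to evaluate)
    Returns a token id satisfying the constraint and the number of
    constraint evaluations performed.
    """
    probs = probs.astype(float).copy()
    evaluations = 0
    while probs.sum() > 0:
        p = probs / probs.sum()                  # renormalize over remaining tokens
        token = rng.choice(len(p), p=p)
        evaluations += 1
        if constraint_ok(token):
            return token, evaluations
        probs[token] = 0.0                       # reject and never try it again
    raise RuntimeError("no token satisfies the constraint")

# Toy usage: a 10-token vocabulary where only even token ids are allowed.
probs = np.full(10, 0.1)
token, n_evals = adaptive_rejection_step(probs, lambda t: t % 2 == 0)
print(token, n_evals)   # typically far fewer evaluations than the vocabulary size
```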
Syntactic and Semantic Control of Large Language Models via Sequential Monte Carlo
João Loula
Li Du
Ben Lipkin
Clemente Pasti
Gabriel Grand
Tianyu Liu
Yahya Emara
Marjorie Freedman
Jason Eisner
Ryan Cotterell
Vikash Mansinghka
Alexander K. Lew
Tim Vieira
A wide range of LM applications require generating text that conforms to syntactic or semantic constraints. Imposing such constraints can be naturally framed as probabilistic conditioning, but exact generation from the resulting distribution—which can differ substantially from the LM’s base distribution—is generally intractable. In this work, we develop an architecture for controlled LM generation based on sequential Monte Carlo (SMC). This SMC framework allows us to flexibly incorporate domain- and problem-specific constraints at inference time, and efficiently reallocate computational resources in light of new information during the course of generation. By comparing to a number of alternatives and ablations on four challenging domains—Python code generation for data science, text-to-SQL, goal inference, and molecule synthesis—we demonstrate that, with little overhead, our approach allows small open-source language models to outperform models over 8× larger, as well as closed-source, fine-tuned ones. In support of the probabilistic perspective, we show that these performance improvements are driven by better approximation to the posterior distribution. [Our system](https://github.com/probcomp/genparse) builds on the framework of Lew et al. (2023) and integrates with its language model probabilistic programming language, giving users a simple, programmable way to apply SMC to a broad variety of controlled generation problems.
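A generic, hedged sketch of the sequential Monte Carlo loop the abstract refers to (not the interface of the system linked above): particles are extended with proposed tokens, reweighted by how much the target (LM plus constraints) favors each extension relative to the proposal, and resampled when the effective sample size collapses. `propose_token`, `target_log_increment`, and `is_finished` are hypothetical stand-ins.

```python
# Generic sequential Monte Carlo sketch for constrained generation
# (illustrative; not the API of the system linked above).
# `propose_token`, `target_log_increment`, and `is_finished` are hypothetical
# stand-ins for an LM proposal, an unnormalized target score, and a stop test.

import numpy as np

def smc_generate(propose_token, target_log_increment, is_finished,
                 num_particles=16, max_steps=50, rng=np.random.default_rng()):
    particles = [[] for _ in range(num_particles)]        # partial token strings
    log_w = np.zeros(num_particles)                       # log importance weights

    for _ in range(max_steps):
        for i, particle in enumerate(particles):
            if is_finished(particle):
                continue
            token, log_q = propose_token(particle)        # proposal sample + its log-prob
            particle.append(token)
            # incremental importance weight: target (LM x constraint) vs. proposal
            log_w[i] += target_log_increment(particle) - log_q

        # resample when the effective sample size gets too small
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        ess = 1.0 / np.sum(w ** 2)
        if ess < num_particles / 2:
            idx = rng.choice(num_particles, size=num_particles, p=w)
            particles = [list(particles[j]) for j in idx]
            log_w = np.zeros(num_particles)

        if all(is_finished(p) for p in particles):
            break

    return particles, log_w
```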