Kartik Ahuja

Beyond Multi-Token Prediction: Pretraining LLMs with Future Summaries

Sachin Goyal

Badr Youbi Idrissi

David Lopez-Paz

Next-token prediction (NTP) has driven the success of large language models (LLMs), but it struggles with long-horizon reasoning, planning, and creative writing, with these limitations largely attributed to teacher-forced training. Multi-token prediction (MTP) partially mitigates these issues by predicting several future tokens at once, but it mostly captures short-range dependencies and offers limited improvement. We propose future summary prediction (FSP), which trains an auxiliary head to predict a compact representation of the long-term future, preserving information relevant for long-form generations. We explore two variants of FSP: handcrafted summaries, for example, a bag of words summary of the future of the sequence, and learned summaries, which use embeddings produced by a reverse language model trained from right to left. Large-scale pretraining experiments (3B and 8B-parameter models) demonstrate that FSP provides improvements over both NTP and MTP across math, reasoning, and coding benchmarks.

2025-12-31

International Conference on Learning Representations (Accept (Poster))

doi.org

openreview.net

Operationalizing Quantized Disentanglement

Vitória Barin-Pacela

Kartik Ahuja

Simon Lacoste-Julien

Pascal Vincent

2025-11-24

ArXiv (preprint)

doi.org

arxiv.org

Compositional Risk Minimization

Charles Arnal

Compositional generalization is a crucial step towards developing data-efficient intelligent machines that generalize in human-like ways. In this work, we tackle a challenging form of distribution shift, termed compositional shift, where some attribute combinations are completely absent at training but present in the test distribution. This shift tests the model's ability to generalize compositionally to novel attribute combinations in discriminative tasks. We model the data with flexible additive energy distributions, where each energy term represents an attribute, and derive a simple alternative to empirical risk minimization termed compositional risk minimization (CRM). We first train an additive energy classifier to predict the multiple attributes and then adjust this classifier to tackle compositional shifts. We provide an extensive theoretical analysis of CRM, where we show that our proposal extrapolates to special affine hulls of seen attribute combinations. Empirical evaluations on benchmark datasets confirm the improved robustness of CRM compared to other methods from the literature designed to tackle various forms of subpopulation shifts.

2025-10-05

Proceedings of the 42nd International Conference on Machine Learning (published)

doi.org

proceedings.mlr.press

Quantized Disentanglement: A Practical Approach

Vitória Barin-Pacela

Kartik Ahuja

Simon Lacoste-Julien

Pascal Vincent

2025-06-08

ICML.cc/2025/Workshop/SIM (poster)

openreview.net

Reusable Slotwise Mechanisms

Trang Nguyen

Amin Mansouri

Kanika Madan

Khuong Nguyen

Nguyen Duy Khuong

Kartik Ahuja

Dianbo Liu

Yoshua Bengio

Agents with the ability to comprehend and reason about the dynamics of objects would be expected to exhibit improved robustness and generalization in novel scenarios. However, achieving this capability necessitates not only an effective scene representation but also an understanding of the mechanisms governing interactions among object subsets. Recent studies have made significant progress in representing scenes using object slots. In this work, we introduce Reusable Slotwise Mechanisms, or RSM, a framework that models object dynamics by leveraging communication among slots along with a modular architecture capable of dynamically selecting reusable mechanisms for predicting the future states of each object slot. Crucially, RSM leverages the Central Contextual Information (CCI), enabling selected mechanisms to access the remaining slots through a bottleneck, effectively allowing for modeling of higher order and complex interactions that might require a sparse subset of objects. Experimental results demonstrate the superior performance of RSM compared to state-of-the-art methods across various future prediction and related downstream tasks, including Visual Question Answering and action planning. Furthermore, we showcase RSM's Out-of-Distribution generalization ability to handle scenes in intricate scenarios.

2023-09-20

NeurIPS.cc/2023/Conference (poster)

doi.org

openreview.net

WOODS: Benchmarks for Out-of-Distribution Generalization in Time Series

Jean-Christophe Gagnon-Audet

Kartik Ahuja

Mohammad-Javad Darvishi-Bayazi

Pooneh Mousavi

Guillaume Dumas

Irina Rish

2023-09-01

Transactions on Machine Learning Research (accepted)

doi.org

openreview.net

On the Identifiability of Quantized Factors

Vitória Barin-Pacela

Kartik Ahuja

Simon Lacoste-Julien

Pascal Vincent

Disentanglement aims to recover meaningful latent ground-truth factors solely from the observed distribution, and is formalized through the theory of identifiability. The identifiability of independent latent factors is proven to be impossible in the unsupervised i.i.d. setting under a general nonlinear map from factors to observations. In this work, however, we demonstrate that it is possible to recover quantized latent factors under a generic nonlinear diffeomorphism. We only assume that the latent factors have independent discontinuities in their density, without requiring the factors to be statistically independent. We introduce this novel form of identifiability, termed quantized factor identifiability, and provide a comprehensive proof of the recovery of the quantized factors.

2023-06-18

ICML.cc/2023/Workshop/SPIGM (poster)

doi.org

proceedings.mlr.press

Empirical Study on Optimizer Selection for Out-of-Distribution Generalization

Hiroki Naganuma

Kartik Ahuja

Ioannis Mitliagkas

Shiro Takagi

Tetsuya Motokawa

Rio Yokota

Kohta Ishikawa

Ikuro Sato

Modern deep learning systems do not generalize well when the test data distribution is slightly different from the training data distribution. While much promising work has been accomplished to address this fragility, a systematic study of the role of optimizers and their out-of-distribution generalization performance has not been undertaken. In this study, we examine the performance of popular first-order optimizers for different classes of distributional shift under empirical risk minimization and invariant risk minimization. We address this question for image and text classification using DomainBed, WILDS, and Backgrounds Challenge as testbeds for studying different types of shifts, namely correlation and diversity shift. We search over a wide range of hyperparameters and examine classification accuracy (in-distribution and out-of-distribution) for over 20,000 models. We arrive at the following findings, which we expect to be helpful for practitioners: i) adaptive optimizers (e.g., Adam) perform worse than non-adaptive optimizers (e.g., SGD, momentum SGD) on out-of-distribution performance. In particular, even though there is no significant difference in in-distribution performance, we show a measurable difference in out-of-distribution performance. ii) in-distribution performance and out-of-distribution performance exhibit three types of behavior depending on the dataset: linear returns, increasing returns, and diminishing returns. For example, in the training of natural language data using Adam, fine-tuning in-distribution performance does not significantly contribute to the out-of-distribution generalization performance.

2023-06-15

TMLR (accepted)

doi.org

openreview.net

Interventional Causal Representation Learning

Kartik Ahuja

Yixin Wang

Divyat Mahajan

Yoshua Bengio

Causal representation learning seeks to extract high-level latent factors from low-level sensory data. Most existing methods rely on observational data and structural assumptions (e.g., conditional independence) to identify the latent factors. However, interventional data is prevalent across applications. Can interventional data facilitate causal representation learning? We explore this question in this paper. The key observation is that interventional data often carries geometric signatures of the latent factors' support (i.e. what values each latent can possibly take). For example, when the latent factors are causally connected, interventions can break the dependency between the intervened latents' support and their ancestors'. Leveraging this fact, we prove that the latent causal factors can be identified up to permutation and scaling given data from perfect

2023-04-23

ICML.cc/2023/Conference (poster)

doi.org

proceedings.mlr.press

Object-centric causal representation learning

Amin Mansouri

Jason Hartford

Kartik Ahuja

Yoshua Bengio

2022-11-06

NeurIPS.cc/2022/Workshop/NeurReps (poster)

openreview.net

FL Games: A federated learning framework for distribution shifts

Niladri Chatterjee

Federated learning aims to train predictive models for data that is distributed across clients, under the orchestration of a server. However, participating clients typically each hold data from a different distribution, whereby predictive models with strong in-distribution generalization can fail catastrophically on unseen domains. In this work, we argue that in order to generalize better across non-i.i.d. clients, it is imperative to only learn correlations that are stable and invariant across domains. We propose FL Games, a game-theoretic framework for federated learning that learns causal features that are invariant across clients. While training to achieve the Nash equilibrium, the traditional best response strategy suffers from high-frequency oscillations. We demonstrate that FL Games effectively resolves this challenge and exhibits smooth performance curves. Further, FL Games scales well in the number of clients, requires significantly fewer communication rounds, and is agnostic to device heterogeneity. Through empirical evaluation, we demonstrate that FL Games achieves high out-of-distribution performance on various benchmarks.

2022-10-19

NeurIPS.cc/2022/Workshop/Federated_Learning (oral)

doi.org

openreview.net

Towards efficient representation identification in supervised learning

Kartik Ahuja

Divyat Mahajan

Vasilis Syrgkanis

Ioannis Mitliagkas

Humans have a remarkable ability to disentangle complex sensory inputs (e.g., image, text) into simple factors of variation (e.g., shape, color) without much supervision. This ability has inspired many works that attempt to solve the following question: how do we invert the data generation process to extract those factors with minimal or no supervision? Several works in the literature on non-linear independent component analysis have established a negative result: without some knowledge of the data generation process or appropriate inductive biases, it is impossible to perform this inversion. In recent years, a lot of progress has been made on disentanglement under structural assumptions, e.g., when we have access to auxiliary information that makes the factors of variation conditionally independent. However, existing work requires a lot of auxiliary information, e.g., in supervised classification, it prescribes that the number of label classes should be at least equal to the total dimension of all factors of variation. In this work, we depart from these assumptions and ask: a) How can we get disentanglement when the auxiliary information does not provide conditional independence over the factors of variation? b) Can we reduce the amount of auxiliary information required for disentanglement? For a class of models where auxiliary information does not ensure conditional independence, we show theoretically and experimentally that disentanglement (to a large extent) is possible even when the auxiliary information dimension is much less than the dimension of the true latent representation.

2022-06-27

Proceedings of the First Conference on Causal Learning and Reasoning (published)

doi.org

proceedings.mlr.press
