Publications

Improving Multilingual Math Reasoning for African Languages
Odunayo Ogundepo
Akintunde Oladipo
Kelechi Ogueji
Esther Adenuga
Jimmy Lin
Researchers working on low-resource languages face persistent challenges due to limited data availability and restricted access to computational resources. Although most large language models (LLMs) are predominantly trained in high-resource languages, adapting them to low-resource contexts, particularly African languages, requires specialized techniques. Several strategies have emerged for adapting models to low-resource languages in today's LLM landscape, defined by multi-stage pre-training and post-training paradigms. However, the most effective approaches remain uncertain. This work systematically investigates which adaptation strategies yield the best performance when extending existing LLMs to African languages. We conduct extensive experiments and ablation studies to evaluate different combinations of data types (translated versus synthetically generated), training stages (pre-training versus post-training), and other model adaptation configurations. Our experiments focus on mathematical reasoning tasks, using the Llama 3.1 model family as our base model.
Improving the Scaling Laws of Synthetic Data with Deliberate Practice
Reyhane Askari-Hemmat
Elvis Dohmatob
Pietro Astolfi
Melissa Hall
Jakob Verbeek
Adriana Romero-Soriano
Inspired by the principle of deliberate practice in human learning, we propose Deliberate Practice for Synthetic Data Generation (DP), a novel framework that improves sample efficiency through dynamic synthetic data generation. Prior work has shown that scaling synthetic data is inherently challenging, as naively adding new data leads to diminishing returns. To address this, pruning has been identified as a key mechanism for improving scaling, enabling models to focus on the most informative synthetic samples. Rather than generating a large dataset and pruning it afterward, DP efficiently approximates the direct generation of informative samples. We theoretically show how training on challenging, informative examples improves scaling laws and empirically validate that DP achieves better scaling performance with significantly fewer training samples and iterations. On ImageNet-100, DP generates 3.4x fewer samples and requires six times fewer iterations, while on ImageNet-1k, it generates 8x fewer samples with a 30 percent reduction in iterations, all while achieving superior performance compared to prior work.
Latent Veracity Inference for Identifying Errors in Stepwise Reasoning
Jean-Pierre R. Falet
Oliver E. Richardson
Moksh J. Jain
Sungsoo Ahn
Chain-of-Thought (CoT) reasoning has advanced the capabilities and transparency of language models (LMs); however, reasoning chains can contain inaccurate statements that reduce performance and trustworthiness. To address this, we propose to augment each reasoning step in a CoT with a latent veracity (or correctness) variable. To efficiently explore this expanded space, we introduce Veracity Search (VS), a discrete search algorithm over veracity assignments. It performs otherwise intractable inference in the posterior distribution over latent veracity values by leveraging the LM's joint likelihood over veracity and the final answer as a proxy reward. This efficient inference-time verification method facilitates supervised fine-tuning of an Amortized Veracity Inference (AVI) machine by providing pseudo-labels for veracity. AVI generalizes VS, enabling accurate zero-shot veracity inference in novel contexts. Empirical results demonstrate that VS reliably identifies errors in logical (ProntoQA), mathematical (GSM8K), and commonsense (CommonsenseQA) reasoning benchmarks, with AVI achieving comparable zero-shot accuracy. Finally, we demonstrate the utility of latent veracity inference for providing feedback during self-correction and self-improvement.
Learning Penalty for Optimal Partitioning via Automatic Feature Extraction
Tung L. Nguyen
Changepoint detection identifies significant shifts in data sequences, making it important in areas like finance, genetics, and healthcare. Optimal Partitioning algorithms efficiently detect these changes, using a penalty parameter to limit the number of changepoints. Determining the appropriate value for this penalty can be challenging. Traditionally, this process involved manually extracting statistical features, such as sequence length or variance, to predict the penalty. This study proposes a novel approach that uses recurrent neural networks to learn this penalty directly from raw sequences by automatically extracting features. Experiments conducted on 20 benchmark genomic datasets show that this novel method surpasses traditional methods in partitioning accuracy in most cases.
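To make the role of the penalty concrete, the sketch below implements the textbook quadratic-time Optimal Partitioning recursion with a squared-error segment cost. It is an illustration only, not the authors' code: the function name `optimal_partitioning`, the choice of cost, and the toy sequence are assumptions, and the paper's contribution (predicting the penalty with a recurrent network) is not shown.

```python
def optimal_partitioning(y, penalty):
    """Return changepoint indices minimizing squared error + penalty per changepoint.

    Hypothetical helper for illustration; O(n^2) dynamic program.
    """
    n = len(y)
    # Prefix sums let us compute the squared error of any segment in O(1).
    s1, s2 = [0.0], [0.0]
    for v in y:
        s1.append(s1[-1] + v)
        s2.append(s2[-1] + v * v)

    def cost(a, b):
        # Squared error of y[a:b] around its own mean.
        seg_sum = s1[b] - s1[a]
        return (s2[b] - s2[a]) - seg_sum * seg_sum / (b - a)

    best = [0.0] * (n + 1)   # best[t]: optimal penalized cost of y[:t]
    best[0] = -penalty       # so the first segment carries no penalty
    last = [0] * (n + 1)     # last[t]: start index of the final segment in y[:t]
    for t in range(1, n + 1):
        best[t], last[t] = min(
            (best[s] + cost(s, t) + penalty, s) for s in range(t)
        )

    # Backtrack through the stored segment starts to recover changepoints.
    changepoints = []
    t = n
    while t > 0:
        s = last[t]
        if s > 0:
            changepoints.append(s)
        t = s
    return sorted(changepoints)


signal = [0.0] * 5 + [10.0] * 5          # toy sequence with one mean shift
few = optimal_partitioning(signal, penalty=1.0)
none = optimal_partitioning(signal, penalty=1000.0)
```

With the small penalty the single shift at index 5 is recovered, while the large penalty suppresses every changepoint, which is why choosing the penalty well matters.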
Locate 3D: Real-World Object Localization via Self-Supervised Learning in 3D
Sergio Arnaud
Paul McVay
Ada Martin
Arjun Majumdar
Krishna Murthy
Phillip Thomas
Ruslan Partsey
Daniel Dugas
Abha Gejji
Alexander Sax
Vincent-Pierre Berges
Mikael Henaff
Ayush Jain
Ang Cao
Ishita Prasad
Mrinal Kalakrishnan
Michael G. Rabbat
Mahmoud Assran
Oleksandr Maksymets
Aravind Rajeswaran
Franziska Meier
Mechanistic Unlearning: Robust Knowledge Unlearning and Editing via Mechanistic Localization
Phillip Huang Guo
Aaquib Syed
Abhay Sheshadri
Aidan Ewart
Mitigating Plasticity Loss in Continual Reinforcement Learning by Reducing Churn
Hongyao Tang
Johan Obando-Ceron
Plasticity, or the ability of an agent to adapt to new tasks, environments, or distributions, is crucial for continual learning. In this paper, we study the loss of plasticity in deep continual RL through the lens of churn: network output variability for out-of-batch data induced by mini-batch training. We demonstrate that (1) the loss of plasticity is accompanied by the exacerbation of churn due to the gradual rank decrease of the Neural Tangent Kernel (NTK) matrix; (2) reducing churn helps prevent rank collapse and adjusts the step size of regular RL gradients adaptively. Moreover, we introduce Continual Churn Approximated Reduction (C-CHAIN) and demonstrate it improves learning performance and outperforms baselines in a diverse range of continual learning environments on OpenAI Gym Control, ProcGen, DeepMind Control Suite, and MinAtar benchmarks.
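As a rough illustration of the churn notion defined above (how much outputs on out-of-batch data move after a mini-batch update), the sketch below measures it for a tiny linear model. The model, the squared-error loss, and the helper names `predict`, `sgd_step`, and `churn` are assumptions made for the example; this is not C-CHAIN and does not reproduce the paper's setup.

```python
def predict(w, x):
    # Linear model output for one input vector.
    return sum(wi * xi for wi, xi in zip(w, x))


def sgd_step(w, batch, lr):
    # One gradient step on the mean squared error over the mini-batch.
    grad = [0.0] * len(w)
    for x, y in batch:
        err = predict(w, x) - y
        for i, xi in enumerate(x):
            grad[i] += 2.0 * err * xi / len(batch)
    return [wi - lr * gi for wi, gi in zip(w, grad)]


def churn(w_old, w_new, reference_inputs):
    # Mean absolute output change on inputs that were NOT in the mini-batch.
    total = sum(
        abs(predict(w_new, x) - predict(w_old, x)) for x in reference_inputs
    )
    return total / len(reference_inputs)


w_before = [0.0, 0.0]
mini_batch = [([1.0, 0.0], 1.0)]             # single (input, target) pair
w_after = sgd_step(w_before, mini_batch, lr=0.5)

out_of_batch = [[1.0, 1.0], [0.0, 1.0]]      # held-out reference inputs
measured_churn = churn(w_before, w_after, out_of_batch)
```

Here the update only changes the first weight, so the reference input that shares that feature moves while the other does not; the average of the two changes is the measured churn.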
Multi-Modal Language Models as Text-to-Image Model Evaluators
Jiahui Chen
Candace Ross
Melissa Hall
Adriana Romero
Network Sparsity Unlocks the Scaling Potential of Deep Reinforcement Learning
Guozheng Ma
Zilin Wang
Li Shen
Dacheng Tao
Effectively scaling up deep reinforcement learning models has proven notoriously difficult due to network pathologies during training, motivating various targeted interventions such as periodic reset and architectural advances such as layer normalization. Instead of pursuing more complex modifications, we show that introducing static network sparsity alone can unlock further scaling potential beyond their dense counterparts with state-of-the-art architectures. This is achieved through simple one-shot random pruning, where a predetermined percentage of network weights are randomly removed once before training. Our analysis reveals that, in contrast to naively scaling up dense DRL networks, such sparse networks achieve both higher parameter efficiency for network expressivity and stronger resistance to optimization challenges like plasticity loss and gradient interference. We further extend our evaluation to visual and streaming RL scenarios, demonstrating the consistent benefits of network sparsity.
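The one-shot random pruning described above can be sketched in a few lines: a binary mask is drawn once before training and then held fixed, so pruned weights stay at zero throughout. The flat weight list, the function names, and the plain SGD update are assumptions for illustration and are not taken from the paper's implementation.

```python
import random


def one_shot_random_mask(num_weights, sparsity, seed=0):
    # Draw the mask once, before training; `sparsity` is the fraction removed.
    rng = random.Random(seed)
    num_pruned = round(sparsity * num_weights)
    pruned = set(rng.sample(range(num_weights), num_pruned))
    return [0.0 if i in pruned else 1.0 for i in range(num_weights)]


def apply_mask(weights, mask):
    # Zero out the pruned weights.
    return [w * m for w, m in zip(weights, mask)]


def masked_sgd_step(weights, grads, mask, lr=0.1):
    # The same static mask is reapplied after every update,
    # so pruned weights never become nonzero again.
    return [m * (w - lr * g) for w, g, m in zip(weights, grads, mask)]


mask = one_shot_random_mask(num_weights=1000, sparsity=0.8)
weights = apply_mask([1.0] * 1000, mask)
weights = masked_sgd_step(weights, [0.5] * 1000, mask)
num_active = sum(1 for w in weights if w != 0.0)
```

Because the mask is static, the number of active weights is the same before and after the update; only the surviving 20 percent of entries are trained.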
Outsourced Diffusion Sampling: Efficient Posterior Inference in Latent Spaces of Generative Models
Any well-behaved generative model over a variable …
Proceedings of 1st Workshop on Advancing Artificial Intelligence through Theory of Mind
Mouad Abrini
Omri Abend
Dina M. Acklin
Henny Admoni
Gregor Aichinger
Nitay Alon
Zahra Ashktorab
Ashish Atreja
Moises Auron
Alexander Aufreiter
Raghav Awasthi
Soumya Banerjee
Joseph Barnby
Rhea Basappa
Severin Bergsmann
Djallel Bouneffouf
Patrick Callaghan
Marc Cavazza
Thierry Chaminade
Sonia Chernova
Mohamed Chetouan
Moumita Choudhury
Axel Cleeremans
J. Cywinski
Fabio Cuzzolin
Hokin Deng
N'yoma Diamond
C. D. Pasquasio
Max J. van Duijn
Mahapatra Dwarikanath
Qingying Gao
Ashok Goel
Rebecca R. Goldstein
Matthew C. Gombolay
Gabriel Enrique Gonzalez
Amar Halilovic
Tobias Halmdienst
Mahimul Islam
Julian Jara-Ettinger
Natalie Kastel
Renana Keydar
Ashish K. Khanna
Mahdi Khoramshahi
Jihyun Kim
Mihyeon Kim
Youngbin Kim
Senka Krivic
Nikita Krasnytskyi
Arun Kumar
Junehyoung Kwon
EunJu Lee
Shane Lee
Peter R. Lewis
Xue Li
Yijiang Li
Michal Lewandowski
Nathan Lloyd
Matthew B. Luebbers
Dezhi Luo
Haiyun Lyu
Dwarikanath Mahapatra
Kamal Maheshwari
Mallika Mainali
P. Mathur
Patrick Mederitsch
Shuwa Miura
Manuel Preston de Miranda
Reuth Mirsky
Shreya Mishra
Nina M. Moorman
Katelyn Morrison
John Muchovej
Bernhard Nessler
Felix Nessler
Hieu Minh Jord Nguyen
Abby Ortego
F. Papay
Antoine Pasquali
Hamed Rahimi
C. Raghu
Amanda L. Royka
Stefan Sarkadi
Jaelle Scheuerman
Simon Schmid
Paul Schrater
Anik Sen
Ke Shi
Reid G. Simmons
Nishant Singh
Mason O. Smith
Ramira van der Meulen
Anthia Solaki
Haoran Sun
Viktor Szolga
Matthew E. Taylor
Travis Taylor
Sanne van Waveren
R. Verbrugge
Eitan Wagner
Justin D. Weisz
Ximing Wen
William Yeoh
Wenlong Zhang
Michelle Zhao
Shlomo Zilberstein
Replication of a GWAS signal near HLA-DQA2 with AML using a disease-only cohort and external population-based controls
Rose Laflamme
Véronique Lisi
Josée Hébert
Guy Sauvageau
Vincent-Philippe Lavallee
Guillaume Lettre