Diffusion Tree Sampling: Scalable inference‑time alignment of diffusion models
Vineet Jain
Kusha Sareen
Mohammad Pedramfar
Adapting a pretrained diffusion model to new objectives at inference time remains an open problem in generative modeling. Existing steering methods suffer from inaccurate value estimation, especially at high noise levels, which biases guidance. Moreover, information from past runs is not reused to improve sample quality, leading to inefficient use of compute. Inspired by the success of Monte Carlo Tree Search, we address these limitations by casting inference-time alignment as a search problem that reuses past computations. We introduce a tree-based approach that _samples_ from the reward-aligned target density by propagating terminal rewards back through the diffusion chain and iteratively refining value estimates with each additional generation. Our proposed method, Diffusion Tree Sampling (DTS), produces asymptotically exact samples from the target distribution in the limit of infinite rollouts, and its greedy variant Diffusion Tree Search (DTS*) performs a robust search for high reward samples. On MNIST and CIFAR-10 class-conditional generation, DTS matches the FID of the best-performing baseline with up to
Emergent brain-like representations in a goal-directed neural network model of visual search
Motahareh Pourrahimi
Geometry-Aware Preference Learning for 3D Texture Generation
AmirHossein Zamani
Tianhao Xie
Amir Aghdam
Tiberiu Popa
Recent advances in 3D generative models have achieved impressive results, but 3D content generated by these models may not align with subjective human preferences or task-specific criteria. Moreover, a core challenge in the 3D texture generation domain remains: most existing approaches rely on repeated calls to 2D text-to-image generative models, which lack an inherent understanding of the 3D structure of the input 3D mesh object. To address this, we propose an end-to-end differentiable preference learning framework that back-propagates human preferences, represented by differentiable reward functions, through the entire 3D generative pipeline, making the process inherently geometry-aware. We demonstrate the effectiveness of our framework using four proposed novel geometry-aware reward functions, offering a more controllable and interpretable pathway for high-quality 3D content creation from natural language.
mSTEB: Massively Multilingual Evaluation of LLMs on Speech and Text Tasks
Luel Hagos Beyene
Vivek Verma
Min Ma
Jesujoba Oluwadara Alabi
Fabian David Schmidt
Joyce Nakatumba-Nabende
Large language models (LLMs) have demonstrated impressive performance on a wide range of tasks, including in multimodal settings such as speech. However, their evaluation is often limited to English and a few high-resource languages. For low-resource languages, there is no standardized evaluation benchmark. In this paper, we address this gap by introducing mSTEB, a new benchmark to evaluate the performance of LLMs on a wide range of tasks covering language identification, text classification, question answering, and translation tasks on both speech and text modalities. We evaluated the performance of leading LLMs such as Gemini 2.0 Flash and GPT-4o (Audio) and state-of-the-art open models such as Qwen 2 Audio and Gemma 3 27B. Our evaluation shows a wide gap in performance between high-resource and low-resource languages, especially for languages spoken in Africa and the Americas/Oceania. Our findings show that more investment is needed to address their under-representation in LLM coverage.
Preservice Teachers’ Computational Thinking Profiles
Tanya Chichekian
Annie Savard
Robust Reward Modeling via Causal Rubrics
Pragya Srivastava
Harman Singh
Rahul Madhavan
Gandharv Patil
Sravanti Addepalli
Arun Suggala
Rengarajan Aravamudhan
Soumya Sharma
Anirban Laha
Aravindan Raghuveer
Karthikeyan Shanmugam
Reward models (RMs) for LLM alignment often exhibit reward hacking, mistaking spurious correlates (e.g., length, format) for causal quality drivers (e.g., factuality, relevance), leading to brittle RMs. We introduce CROME (Causally Robust Reward Modeling), a causally-grounded framework using targeted augmentations to mitigate this. CROME employs: (1) Causal Augmentations, pairs isolating specific causal attribute changes, to enforce sensitivity, and (2) Neutral Augmentations, tie-labeled pairs varying spurious attributes while preserving causal content, to enforce invariance. Crucially, augmentations target LLM-identified causal rubrics, requiring no prior knowledge of spurious factors. CROME significantly outperforms baselines on RewardBench (Avg +5.4%, Safety +13.2%, Reasoning +7.2%) and demonstrates enhanced robustness via improved Best-of-N performance across RewardBench, WildGuardTest, and GSM8k.
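As a rough illustration of how a reward model could be trained on both kinds of pairs, the sketch below uses a Bradley-Terry style pairwise loss with a tie case. This is an assumption made for illustration: the abstract does not state the loss CROME uses, and the function names and label strings here are hypothetical. Preference pairs push the preferred response's reward above the other (sensitivity), while tie-labeled pairs are penalized for any reward gap (invariance).

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def pair_loss(reward_a: float, reward_b: float, label: str) -> float:
    """Pairwise loss for one augmented pair (hypothetical formulation).

    label == "a_preferred": response A should score higher than B.
    label == "tie": the pair differs only in spurious attributes, so the
    target preference probability is 0.5 and the loss is minimized when
    both rewards are equal.
    """
    margin = reward_a - reward_b
    if label == "a_preferred":
        return -log_sigmoid(margin)
    if label == "tie":
        return -0.5 * (log_sigmoid(margin) + log_sigmoid(-margin))
    raise ValueError(f"unknown label: {label}")


# A tie pair with a reward gap costs more than one with equal rewards.
print(pair_loss(1.0, 0.0, "tie") > pair_loss(0.0, 0.0, "tie"))
```

Under this formulation, a reward model that scores a longer but otherwise equivalent response higher pays a penalty on the tie-labeled pair, which is one way to express the invariance described above.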
A Systematic Literature Review of Large Language Model Applications in the Algebra Domain
Test Time Adaptation Using Adaptive Quantile Recalibration
Paria Mehrbod
Pedro Vianna
Geraldin Nanfack
What Matters when Modeling Human Behavior using Imitation Learning?
Aneri Muni
Esther Derman
Vincent Taboga
As AI systems become increasingly embedded in human decision-making processes, aligning their behavior with human values is critical to ensuring safe and trustworthy deployment. Imitation Learning (IL), a central approach to AI alignment, trains a learner to directly mimic desirable human behaviors from expert demonstrations. However, standard IL methods assume that (1) experts act to optimize expected returns and (2) expert policies are Markovian. Both assumptions are inconsistent with empirical findings from behavioral economics, according to which humans (1) are risk-sensitive and (2) make decisions based on past experience. In this work, we examine the implications of risk sensitivity for IL and show that standard approaches do not capture all optimal policies under risk-sensitive decision criteria. By characterizing these expert policies, we identify key limitations of existing IL algorithms in replicating expert performance in risk-sensitive settings. Our findings underscore the need for new IL frameworks that account for both risk-aware preferences and temporal dependencies to faithfully align AI behavior with human experts.
Adversarial Attack Classification and Robustness Testing for Large Language Models for Code
Yang Liu
Armstrong Foundjem
Heng Li
Large Language Models (LLMs) have become vital tools in software development tasks such as code generation, completion, and analysis. As their integration into workflows deepens, ensuring robustness against vulnerabilities, especially those triggered by diverse or adversarial inputs, becomes increasingly important. Such vulnerabilities may lead to incorrect or insecure code generation when models encounter perturbed task descriptions, code, or comments. Prior research often overlooks the role of natural language in guiding code tasks. This study investigates how adversarial perturbations in natural language inputs, including prompts, comments, and descriptions, affect LLMs for Code (LLM4Code). It examines the effects of perturbations at the character, word, and sentence levels to identify the most impactful vulnerabilities. We analyzed multiple projects (e.g., ReCode, OpenAttack) and datasets (e.g., HumanEval, MBPP), establishing a taxonomy of adversarial attacks. The first dimension classifies the input type (code, prompts, or comments), while the second dimension focuses on granularity: character-, word-, or sentence-level changes. We adopted a mixed-methods approach, combining quantitative performance metrics with qualitative vulnerability analysis. LLM4Code models show varying robustness across perturbation types. Sentence-level attacks were least effective, suggesting models are resilient to broader contextual changes. In contrast, word-level perturbations posed serious challenges, exposing semantic vulnerabilities. Character-level effects varied, showing model sensitivity to subtle syntactic deviations. Our study offers a structured framework for testing LLM4Code robustness and emphasizes the critical role of natural language in adversarial evaluation. Improving model resilience to semantic-level disruptions is essential for secure and reliable code-generation systems.
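To make the three granularity levels concrete, here is a minimal sketch of one perturbation per level applied to a natural-language prompt. The specific operations (adjacent-character swap, dictionary word substitution, appended distractor sentence) and all function names are illustrative assumptions; they are not taken from the study, ReCode, or OpenAttack.

```python
def swap_adjacent_chars(text: str, index: int) -> str:
    """Character level: swap the characters at index and index + 1."""
    if not 0 <= index < len(text) - 1:
        raise ValueError("index must leave room for an adjacent character")
    chars = list(text)
    chars[index], chars[index + 1] = chars[index + 1], chars[index]
    return "".join(chars)


def substitute_words(text: str, substitutions: dict[str, str]) -> str:
    """Word level: replace whitespace-separated tokens found in the mapping."""
    return " ".join(substitutions.get(token, token) for token in text.split())


def append_distractor(text: str, distractor: str) -> str:
    """Sentence level: append an extra sentence after the original prompt."""
    return f"{text} {distractor}"


prompt = "return the sorted list"
print(swap_adjacent_chars(prompt, 11))                  # character-level variant
print(substitute_words(prompt, {"sorted": "ordered"}))  # word-level variant
print(append_distractor(prompt, "Ignore blank lines.")) # sentence-level variant
```

Each perturbed prompt could then be passed to a code model in place of the original, with the resulting task metric compared against the unperturbed baseline.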
Improving Context Fidelity via Native Retrieval-Augmented Reasoning
Suyuchen Wang
Jinlin Wang
Xinyu Wang
Shiqi Li
Xiangru Tang
Sirui Hong
Xiao-Wen Chang
Chenglin Wu
Large language models (LLMs) often struggle with context fidelity, producing inconsistent answers when responding to questions based on provided information. Existing approaches either rely on expensive supervised fine-tuning to generate evidence post-answer or train models to perform web searches without necessarily improving utilization of the given context. We propose CARE, a novel native retrieval-augmented reasoning framework that teaches LLMs to explicitly integrate in-context evidence within their reasoning process with the model's own retrieval capabilities. Our method requires minimal labeled evidence data while significantly enhancing both retrieval accuracy and answer generation performance through strategically retrieved in-context tokens in the reasoning chain. Extensive experiments on multiple real-world and counterfactual QA benchmarks demonstrate that our approach substantially outperforms supervised fine-tuning, traditional retrieval-augmented generation methods, and external retrieval solutions. This work represents a fundamental advancement in making LLMs more accurate, reliable, and efficient for knowledge-intensive tasks.