Siva Reddy

Biographie

Siva Reddy est professeur adjoint en informatique et linguistique à l’Université McGill. Ses travaux portent sur les algorithmes qui permettent aux ordinateurs de comprendre et de traiter les langues humaines. Il a fait ses études postdoctorales avec le Stanford NLP Group. Son expertise inclut la construction de symboliques linguistiques et induites et de modèles d’apprentissage profond pour le langage.

Étudiants actuels

Vaibhav Adlakha

Doctorat - McGill

Parishad BehnamGhader

Maîtrise recherche - McGill

Doctorat - McGill

Collaborateur·rice de recherche - McGill

Verna Dankers

Postdoctorat - University of Edinburgh

Jiaqi Deng

Collaborateur·rice de recherche

Charbel El Feghali

Stagiaire de recherche - McGill

Desmond Elliott

Visiteur de recherche indépendant

Jay Gala

Maîtrise recherche - McGill

Co-superviseur⋅e :

Collaborateur·rice de recherche

Hanseok Hanseok Oh

Collaborateur·rice alumni

Doctorat - McGill

Co-superviseur⋅e :

Timothy O'Donnell

Imene Kerboua

Collaborateur·rice de recherche - INSA Lyon, France

Doctorat - McGill

Superviseur⋅e principal⋅e :

Golnoosh Farnadi

Austin Kraft

Doctorat - McGill

Co-superviseur⋅e :

Doctorat - McGill

Zichao Li

Doctorat - McGill

Co-superviseur⋅e :

Jackie Cheung

Fengyuan Liu

Maîtrise recherche - McGill

Co-superviseur⋅e :

Dzmitry Bahdanau

Xing Han Lu

Doctorat - McGill

Maîtrise recherche - McGill

Doctorat - McGill

Postdoctorat - McGill

Marzia Nouri

Maîtrise recherche - McGill

Arkil Patel

Doctorat - McGill

Superviseur⋅e principal⋅e :

Collaborateur·rice de recherche - N/A

Ben Saine

Stagiaire de recherche - McGill

Dongchan Shin

Collaborateur·rice alumni

Karolina Ewa Stańczak

Collaborateur·rice alumni - McGill

Ivan Titov

Collaborateur·rice de recherche

Ada Tur

Stagiaire de recherche - McGill

Doctorat - McGill

Collaborateur·rice alumni - McGill

Donghao Zeng

Stagiaire de recherche - McGill

Comment expliquer l’IA et s’assurer que cette explication est vraie? Les modèles mesurables de fidélité vous indiquent comment y parvenir

Billets de blogue

1 octobre 2024

par

Andrea Madsen

Siva Reddy

Sarath Chandar

Lire l'article

Publications

The Markovian Thinker

Milad Aghajohari

Kamran Chitsaz

Reinforcement learning (RL) has recently become a strong recipe for training reasoning LLMs that produce long chains of thought (LongCoT). Y… (voir plus)et the standard RL"thinking environment", where the state is the prompt plus all prior reasoning tokens, makes the state unbounded and forces attention-based policies to pay quadratic compute as thoughts lengthen. We revisit the environment itself. We propose Markovian Thinking, a paradigm in which the policy advances reasoning while conditioning on a constant-size state, decoupling thinking length from context size. As an immediate consequence this yields linear compute with constant memory. We instantiate this idea with Delethink, an RL environment that structures reasoning into fixed-size chunks. Within each chunk, the model thinks as usual; at the boundary, the environment resets the context and reinitializes the prompt with a short carryover. Through RL, the policy learns to write a textual state near the end of each chunk sufficient for seamless continuation of reasoning after reset. Trained in this environment, an R1-Distill 1.5B model reasons in 8K-token chunks yet thinks up to 24K tokens, matching or surpassing LongCoT-RL trained with a 24K budget. With test-time scaling, Delethink continues to improve where LongCoT plateaus. The effect of linear compute is substantial: we empirically estimate at 96K average thinking length LongCoT-RL costs 27 H100-months vs. 7 for Delethink. Analysis at RL initialization shows off-the-shelf reasoning models (1.5B-120B) often sample Markovian traces zero-shot across diverse benchmarks, providing positive samples that make RL effective at scale. Our results show that redesigning the thinking environment is a powerful lever: it enables very long reasoning without quadratic overhead and opens a path toward efficient, scalable reasoning LLMs.

2025-10-08

ArXiv (prépublication)

The Markovian Thinker

Milad Aghajohari

Kamran Chitsaz

2025-10-08

ArXiv (prépublication)

SafeArena: Evaluating the Safety of Autonomous Web Agents

Ada Defne Tur

Esin DURMUS

Karolina Stanczak

2025-10-06

Proceedings of the 42nd International Conference on Machine Learning (publié)

SafeArena: Evaluating the Safety of Autonomous Web Agents

Ada Defne Tur

Esin DURMUS

Karolina Stanczak

LLM-based agents are becoming increasingly proficient at solving web-based tasks. With this capability comes a greater risk of misuse for ma… (voir plus)licious purposes, such as posting misinformation in an online forum or selling illicit substances on a website. To evaluate these risks, we propose SafeArena, a benchmark focused on the deliberate misuse of web agents. SafeArena comprises 250 safe and 250 harmful tasks across four websites. We classify the harmful tasks into five harm categories—misinformation, illegal activity, harassment, cybercrime, and social bias, designed to assess realistic misuses of web agents. We evaluate leading LLM-based web agents, including GPT-4o, Claude-3.5 Sonnet, Qwen-2-VL 72B, and Llama-3.2 90B, on our benchmark. To systematically assess their susceptibility to harmful tasks, we introduce the Agent Risk Assessment framework that categorizes agent behavior across four risk levels. We find agents are surprisingly compliant with malicious requests, with GPT-4o and Qwen-2 completing 34.7% and 27.3% of harmful requests, respectively. Our findings highlight the urgent need for safety alignment procedures for web agents.

2025-10-06

Proceedings of the 42nd International Conference on Machine Learning (publié)

proceedings.mlr.press

VinePPO: Refining Credit Assignment in RL Training of LLMs

Large language models (LLMs) are increasingly applied to complex reasoning tasks that require executing several complex steps before receivi… (voir plus)ng any reward. Properly assigning credit to these steps is essential for enhancing model performance. Proximal Policy Optimization (PPO), a common reinforcement learning (RL) algorithm used for LLM finetuning, employs value networks to tackle credit assignment. However, recent approaches achieve strong results without it, raising questions about the efficacy of value networks in practice. In this work, we systematically evaluate the efficacy of value networks and reveal their significant shortcomings in reasoning-heavy LLM tasks, showing that they often produce poor estimate of expected return and barely outperform a random baseline when comparing alternative steps. This motivates our key question: Can improved credit assignment enhance RL training for LLMs? To address this, we propose VinePPO, a straightforward approach that leverages the flexibility of language environments to compute unbiased Monte Carlo-based estimates. Our method consistently outperforms PPO and other baselines across MATH and GSM8K datasets in less wall-clock time (up to 3.0x). Crucially, it achieves higher test accuracy for a given training accuracy, capturing more generalization signal per sample. These results emphasize the importance of accurate credit assignment in RL training of LLM.

2025-10-06

Proceedings of the 42nd International Conference on Machine Learning (publié)

proceedings.mlr.press

VinePPO: Refining Credit Assignment in RL Training of LLMs

2025-10-06

Proceedings of the 42nd International Conference on Machine Learning (publié)

proceedings.mlr.press

How to Get Your LLM to Generate Challenging Problems for Evaluation

Arkil Patel

Siva Reddy

Dzmitry Bahdanau

The pace of evolution of Large Language Models (LLMs) necessitates new approaches for rigorous and comprehensive evaluation. Traditional hum… (voir plus)an annotation is increasingly impracticable due to the complexities and costs involved in generating high-quality, challenging problems, particularly for tasks such as long-context reasoning. Moreover, the rapid saturation of existing human-curated benchmarks by LLMs further necessitates the need to develop scalable and automatically renewable evaluation methodologies. In this work, we introduce **CHASE**, a unified framework to synthetically generate challenging problems using LLMs without human involvement. For a given task, our approach builds a hard problem in a bottom-up manner from simpler components. Moreover since we want to generate synthetic data for evaluation, our framework decomposes the generation process into independently verifiable sub-tasks, thereby ensuring a high level of quality and correctness. We implement CHASE to create evaluation benchmarks across three diverse domains: document-based question answering, repository-level code completion, and math reasoning. The performance of state-of-the-art LLMs on these synthetic benchmarks lies in the range of 40-60\% accuracy, thereby demonstrating the effectiveness of our framework at generating hard problems. Our experiments further reveal that the Gemini models significantly outperform other LLMs at long-context reasoning, and that the performance of all LLMs drastically drops by as much as 70\% when we scale up the context size to 50k tokens.

2025-09-23

NeurIPS.cc/2025/Workshop/LLM_Evaluation (poster)

How to Get Your LLM to Generate Challenging Problems for Evaluation

Arkil Patel

Siva Reddy

Dzmitry Bahdanau

2025-09-23

NeurIPS.cc/2025/Workshop/LLM_Evaluation (poster)

Towards Democratizing LLMs: Investigating Multilingual Mixture-of-Experts Models

2025-09-22

NeurIPS.cc/2025/Workshop/WiML (publié)

WebMMU: A Benchmark for Multimodal Multilingual Website Understanding and Code Generation

Juan A. Rodriguez

Sai Rajeswar

ServiceNow

WebMMU Benchmark

We present WebMMU, a multilingual benchmark that evaluates three core web tasks: (1) website visual question answering, (2) code editing inv… (voir plus)olving HTML/CSS/JavaScript, and (3) mockup-to-code generation. Unlike prior benchmarks that treat these tasks separately, WebMMU unifies them using expert-annotated, real-world web data to assess models'abilities in complex multi-step reasoning, precise element grounding, and functional UI comprehension and coding. Our evaluation shows that while multimodal large language models (MLLMs) perform well on basic information extraction, they struggle with reasoning and grounding, editing code to preserve functionality, and generating design-to-code that maintains hierarchy and supports multilingual content. These findings reveal key limitations in current MLLMs and underscore the need for improved multimodal and cross-lingual reasoning to build future web agents capable of automating diverse web development tasks.

2025-08-22

ArXiv (prépublication)

WebMMU: A Benchmark for Multimodal Multilingual Website Understanding and Code Generation

Juan A. Rodriguez

Sai Rajeswar

ServiceNow

WebMMU Benchmark

2025-08-22

ArXiv (prépublication)

The Promise of RL for Autoregressive Image Editing

Saba Ahmadi

Rabiul Awal

Ankur Sikarwar

Ge Ya Luo

Juan A. Rodriguez

Sai Rajeswar

We explore three strategies to enhance performance on a wide range of image editing tasks: supervised fine-tuning (SFT), reinforcement learn… (voir plus)ing (RL), and Chain-of-Thought (CoT) reasoning. In order to study all these components in one consistent framework, we adopt an autoregressive multimodal model that processes textual and visual tokens in a unified manner. We find RL combined with a large multi-modal LLM verifier to be the most effective of these strategies. As a result, we release EARL: Editing with Autoregression and RL, a strong RL-based image editing model that performs competitively on a diverse range of edits compared to strong baselines, despite using much less training data. Thus, EARL pushes the frontier of autoregressive multimodal models on image editing. We release our code, training data, and trained models at https://github.com/mair-lab/EARL.

2025-08-01

ArXiv (prépublication)