Siva Reddy

Biography

Siva Reddy is an assistant professor at the School of Computer Science and in the Department of Linguistics at McGill University. He completed a postdoc with the Stanford NLP Group in September 2019.

Reddy’s research goal is to enable machines with natural language understanding abilities in order to facilitate applications like question answering and conversational systems. His expertise includes building symbolic (linguistic and induced) and deep learning models for language.

Current Students

Vaibhav Adlakha

PhD - McGill University

Parishad BehnamGhader

Master's Research - McGill University

PhD - McGill University

Matteo Boglioni

Collaborating researcher - McGill University

Joyce Chai

Independent visiting researcher

Co-supervisor :

Yoshua Bengio

Verna Dankers

Postdoctorate - University of Edinburgh

Jiaqi Deng

Collaborating researcher

Charbel El Feghali

Research Intern - McGill University

Desmond Elliott

Independent visiting researcher

Co-supervisor :

Yoshua Bengio

Jay Gala

Master's Research - McGill University

Co-supervisor :

Collaborating researcher

Collaborating Alumni

PhD - McGill University

Co-supervisor :

Timothy O'Donnell

Imene Kerboua

Collaborating researcher - INSA Lyon, France

PhD - McGill University

Principal supervisor :

Golnoosh Farnadi

Austin Kraft

PhD - McGill University

Co-supervisor :

Timothy O'Donnell

Aravind Krishnan

Collaborating Alumni - UNIVERSITÄT DES SAARLANDES

Benno Krojer

PhD - McGill University

Zichao Li

PhD - McGill University

Co-supervisor :

Jackie Cheung

Fengyuan Liu

Master's Research - McGill University

Co-supervisor :

Dzmitry Bahdanau

Xing Han Lu

PhD - McGill University

Master's Research - McGill University

Nicholas Meade

PhD - McGill University

Postdoctorate - McGill University

Abhik Roychoudhury Roychoudhury

Arkil Patel

PhD - McGill University

Principal supervisor :

Collaborating researcher - N/A

Independent visiting researcher

Co-supervisor :

Collaborating Alumni

Karolina Ewa Stańczak

Collaborating Alumni - McGill University

Ivan Titov

Collaborating researcher

Co-supervisor :

Yoshua Bengio

How Do We Explain AI and Ensure the Explanation Is True? Faithfulness Measurable Models Tell You How

Ada Tur

Research Intern - McGill University

PhD - McGill University

Collaborating Alumni - McGill University

Blog Posts

October 1, 2024

Andrea Madsen

Siva Reddy

Sarath Chandar

Read the article

Publications

The Markovian Thinker

Milad Aghajohari

Kamran Chitsaz

Amirhossein Kazemnejad

Reinforcement learning (RL) has recently become a strong recipe for training reasoning LLMs that produce long chains of thought (LongCoT). Y… (see more)et the standard RL"thinking environment", where the state is the prompt plus all prior reasoning tokens, makes the state unbounded and forces attention-based policies to pay quadratic compute as thoughts lengthen. We revisit the environment itself. We propose Markovian Thinking, a paradigm in which the policy advances reasoning while conditioning on a constant-size state, decoupling thinking length from context size. As an immediate consequence this yields linear compute with constant memory. We instantiate this idea with Delethink, an RL environment that structures reasoning into fixed-size chunks. Within each chunk, the model thinks as usual; at the boundary, the environment resets the context and reinitializes the prompt with a short carryover. Through RL, the policy learns to write a textual state near the end of each chunk sufficient for seamless continuation of reasoning after reset. Trained in this environment, an R1-Distill 1.5B model reasons in 8K-token chunks yet thinks up to 24K tokens, matching or surpassing LongCoT-RL trained with a 24K budget. With test-time scaling, Delethink continues to improve where LongCoT plateaus. The effect of linear compute is substantial: we empirically estimate at 96K average thinking length LongCoT-RL costs 27 H100-months vs. 7 for Delethink. Analysis at RL initialization shows off-the-shelf reasoning models (1.5B-120B) often sample Markovian traces zero-shot across diverse benchmarks, providing positive samples that make RL effective at scale. Our results show that redesigning the thinking environment is a powerful lever: it enables very long reasoning without quadratic overhead and opens a path toward efficient, scalable reasoning LLMs.

2025-10-08

ArXiv (preprint)

The Markovian Thinker

Milad Aghajohari

Kamran Chitsaz

Amirhossein Kazemnejad

2025-10-08

ArXiv (preprint)

SafeArena: Evaluating the Safety of Autonomous Web Agents

Ada Defne Tur

Esin DURMUS

Karolina Stanczak

LLM-based agents are becoming increasingly proficient at solving web-based tasks. With this capability comes a greater risk of misuse for ma… (see more)licious purposes, such as posting misinformation in an online forum or selling illicit substances on a website. To evaluate these risks, we propose SafeArena, a benchmark focused on the deliberate misuse of web agents. SafeArena comprises 250 safe and 250 harmful tasks across four websites. We classify the harmful tasks into five harm categories—misinformation, illegal activity, harassment, cybercrime, and social bias, designed to assess realistic misuses of web agents. We evaluate leading LLM-based web agents, including GPT-4o, Claude-3.5 Sonnet, Qwen-2-VL 72B, and Llama-3.2 90B, on our benchmark. To systematically assess their susceptibility to harmful tasks, we introduce the Agent Risk Assessment framework that categorizes agent behavior across four risk levels. We find agents are surprisingly compliant with malicious requests, with GPT-4o and Qwen-2 completing 34.7% and 27.3% of harmful requests, respectively. Our findings highlight the urgent need for safety alignment procedures for web agents.

2025-10-06

Proceedings of the 42nd International Conference on Machine Learning (published)

proceedings.mlr.press

SafeArena: Evaluating the Safety of Autonomous Web Agents

Ada Defne Tur

Esin DURMUS

Karolina Stanczak

2025-10-06

Proceedings of the 42nd International Conference on Machine Learning (published)

VinePPO: Refining Credit Assignment in RL Training of LLMs

Amirhossein Kazemnejad

Milad Aghajohari

Large language models (LLMs) are increasingly applied to complex reasoning tasks that require executing several complex steps before receivi… (see more)ng any reward. Properly assigning credit to these steps is essential for enhancing model performance. Proximal Policy Optimization (PPO), a common reinforcement learning (RL) algorithm used for LLM finetuning, employs value networks to tackle credit assignment. However, recent approaches achieve strong results without it, raising questions about the efficacy of value networks in practice. In this work, we systematically evaluate the efficacy of value networks and reveal their significant shortcomings in reasoning-heavy LLM tasks, showing that they often produce poor estimate of expected return and barely outperform a random baseline when comparing alternative steps. This motivates our key question: Can improved credit assignment enhance RL training for LLMs? To address this, we propose VinePPO, a straightforward approach that leverages the flexibility of language environments to compute unbiased Monte Carlo-based estimates. Our method consistently outperforms PPO and other baselines across MATH and GSM8K datasets in less wall-clock time (up to 3.0x). Crucially, it achieves higher test accuracy for a given training accuracy, capturing more generalization signal per sample. These results emphasize the importance of accurate credit assignment in RL training of LLM.

2025-10-06

Proceedings of the 42nd International Conference on Machine Learning (published)

proceedings.mlr.press

VinePPO: Refining Credit Assignment in RL Training of LLMs

Amirhossein Kazemnejad

Milad Aghajohari

2025-10-06

Proceedings of the 42nd International Conference on Machine Learning (published)

proceedings.mlr.press

How to Get Your LLM to Generate Challenging Problems for Evaluation

Arkil Patel

Siva Reddy

Dzmitry Bahdanau

The pace of evolution of Large Language Models (LLMs) necessitates new approaches for rigorous and comprehensive evaluation. Traditional hum… (see more)an annotation is increasingly impracticable due to the complexities and costs involved in generating high-quality, challenging problems, particularly for tasks such as long-context reasoning. Moreover, the rapid saturation of existing human-curated benchmarks by LLMs further necessitates the need to develop scalable and automatically renewable evaluation methodologies. In this work, we introduce **CHASE**, a unified framework to synthetically generate challenging problems using LLMs without human involvement. For a given task, our approach builds a hard problem in a bottom-up manner from simpler components. Moreover since we want to generate synthetic data for evaluation, our framework decomposes the generation process into independently verifiable sub-tasks, thereby ensuring a high level of quality and correctness. We implement CHASE to create evaluation benchmarks across three diverse domains: document-based question answering, repository-level code completion, and math reasoning. The performance of state-of-the-art LLMs on these synthetic benchmarks lies in the range of 40-60\% accuracy, thereby demonstrating the effectiveness of our framework at generating hard problems. Our experiments further reveal that the Gemini models significantly outperform other LLMs at long-context reasoning, and that the performance of all LLMs drastically drops by as much as 70\% when we scale up the context size to 50k tokens.

2025-09-23

NeurIPS.cc/2025/Workshop/LLM_Evaluation (poster)

How to Get Your LLM to Generate Challenging Problems for Evaluation

Arkil Patel

Siva Reddy

Dzmitry Bahdanau

2025-09-23

NeurIPS.cc/2025/Workshop/LLM_Evaluation (poster)

Towards Democratizing LLMs: Investigating Multilingual Mixture-of-Experts Models

2025-09-22

NeurIPS.cc/2025/Workshop/WiML (published)

WebMMU: A Benchmark for Multimodal Multilingual Website Understanding and Code Generation

Mahsa Massoud

Juan A. Rodriguez

Sai Rajeswar

ServiceNow

WebMMU Benchmark

We present WebMMU, a multilingual benchmark that evaluates three core web tasks: (1) website visual question answering, (2) code editing inv… (see more)olving HTML/CSS/JavaScript, and (3) mockup-to-code generation. Unlike prior benchmarks that treat these tasks separately, WebMMU unifies them using expert-annotated, real-world web data to assess models'abilities in complex multi-step reasoning, precise element grounding, and functional UI comprehension and coding. Our evaluation shows that while multimodal large language models (MLLMs) perform well on basic information extraction, they struggle with reasoning and grounding, editing code to preserve functionality, and generating design-to-code that maintains hierarchy and supports multilingual content. These findings reveal key limitations in current MLLMs and underscore the need for improved multimodal and cross-lingual reasoning to build future web agents capable of automating diverse web development tasks.

2025-08-22

ArXiv (preprint)

WebMMU: A Benchmark for Multimodal Multilingual Website Understanding and Code Generation

Mahsa Massoud

Juan A. Rodriguez

Sai Rajeswar

ServiceNow

WebMMU Benchmark

2025-08-22

ArXiv (preprint)

The Promise of RL for Autoregressive Image Editing

Saba Ahmadi

Rabiul Awal

Ankur Sikarwar

Amirhossein Kazemnejad

Ge Ya Luo

Juan A. Rodriguez

Sai Rajeswar

We explore three strategies to enhance performance on a wide range of image editing tasks: supervised fine-tuning (SFT), reinforcement learn… (see more)ing (RL), and Chain-of-Thought (CoT) reasoning. In order to study all these components in one consistent framework, we adopt an autoregressive multimodal model that processes textual and visual tokens in a unified manner. We find RL combined with a large multi-modal LLM verifier to be the most effective of these strategies. As a result, we release EARL: Editing with Autoregression and RL, a strong RL-based image editing model that performs competitively on a diverse range of edits compared to strong baselines, despite using much less training data. Thus, EARL pushes the frontier of autoregressive multimodal models on image editing. We release our code, training data, and trained models at https://github.com/mair-lab/EARL.

2025-08-01

arXiv (published)