Siva Reddy

Biographie

Siva Reddy est professeur adjoint en informatique et linguistique à l’Université McGill. Ses travaux portent sur les algorithmes qui permettent aux ordinateurs de comprendre et de traiter les langues humaines. Il a fait ses études postdoctorales avec le Stanford NLP Group. Son expertise inclut la construction de symboliques linguistiques et induites et de modèles d’apprentissage profond pour le langage.

Étudiants actuels

Vaibhav Adlakha

Doctorat - McGill

Maîtrise recherche - McGill

Doctorat - McGill

Stagiaire de recherche - McGill

Joyce Chai

Visiteur de recherche indépendant

Co-superviseur⋅e :

Verna Dankers

Collaborateur·rice de recherche - University of Edinburgh

Stagiaire de recherche - McGill

Desmond Elliott

Visiteur de recherche indépendant

Co-superviseur⋅e :

Jay Gala

Maîtrise recherche - McGill

Co-superviseur⋅e :

Collaborateur·rice de recherche

Gaurav Kamath

Doctorat - McGill

Co-superviseur⋅e :

Timothy O'Donnell

Imene Kerboua

Collaborateur·rice de recherche - INSA Lyon, France

Doctorat - McGill

Superviseur⋅e principal⋅e :

Golnoosh Farnadi

Austin Kraft

Doctorat - McGill

Co-superviseur⋅e :

Timothy O'Donnell

Aravind Krishnan

Collaborateur·rice alumni - UNIVERSITÄT DES SAARLANDES

Benno Krojer

Doctorat - McGill

Zichao Li

Doctorat - McGill

Co-superviseur⋅e :

Jackie Cheung

Fengyuan Liu

Maîtrise recherche - McGill

Co-superviseur⋅e :

Dzmitry Bahdanau

Xing Han Lu

Doctorat - McGill

Maîtrise recherche - McGill

Doctorat - McGill

Postdoctorat - McGill

Hanseok Oh

Collaborateur·rice alumni

Abhik Roychoudhury Roychoudhury

Arkil Patel

Doctorat - McGill

Superviseur⋅e principal⋅e :

Visiteur de recherche indépendant

Co-superviseur⋅e :

Dongchan Shin

Collaborateur·rice alumni

Karolina Ewa Stańczak

Collaborateur·rice alumni - McGill

Ivan Titov

Visiteur de recherche indépendant

Co-superviseur⋅e :

Comment expliquer l’IA et s’assurer que cette explication est vraie? Les modèles mesurables de fidélité vous indiquent comment y parvenir

Ada Tur

Stagiaire de recherche - McGill

Doctorat - McGill

Collaborateur·rice alumni - McGill

Billets de blogue

1 octobre 2024

par

Andrea Madsen

Siva Reddy

Sarath Chandar

Lire l'article

Publications

The Promise of RL for Autoregressive Image Editing

Saba Ahmadi

Rabiul Awal

Ankur Sikarwar

Ge Ya Luo

Juan A. Rodriguez

Sai Rajeswar

We explore three strategies to enhance performance on a wide range of image editing tasks: supervised fine-tuning (SFT), reinforcement learn… (voir plus)ing (RL), and Chain-of-Thought (CoT) reasoning. In order to study all these components in one consistent framework, we adopt an autoregressive multimodal model that processes textual and visual tokens in a unified manner. We find RL combined with a large multi-modal LLM verifier to be the most effective of these strategies. As a result, we release EARL: Editing with Autoregression and RL, a strong RL-based image editing model that performs competitively on a diverse range of edits compared to strong baselines, despite using much less training data. Thus, EARL pushes the frontier of autoregressive multimodal models on image editing. We release our code, training data, and trained models at https://github.com/mair-lab/EARL.

2025-08-01

ArXiv (prépublication)

The Promise of RL for Autoregressive Image Editing

Saba Ahmadi

Rabiul Awal

Ankur Sikarwar

Ge Ya Luo

Juan A. Rodriguez

Sai Rajeswar

2025-08-01

ArXiv (prépublication)

The Promise of RL for Autoregressive Image Editing

Saba Ahmadi

Rabiul Awal

Ankur Sikarwar

Ge Ya Luo

Juan A. Rodriguez

Sai Rajeswar

2025-08-01

arXiv (publié)

WebMMU: A Benchmark for Multimodal Multilingual Website Understanding and Code Generation

David Vazquez

Juan A. Rodriguez

Sai Rajeswar

We present WebMMU, a multilingual benchmark that evaluates three core web tasks: (1) website visual question answering, (2) code editing inv… (voir plus)olving HTML/CSS/JavaScript, and (3) mockup-to-code generation. Unlike prior benchmarks that treat these tasks separately, WebMMU unifies them using expert-annotated, real-world web data to assess models' abilities in complex multi-step reasoning, precise element grounding, and functional UI comprehension and coding. Our evaluation shows that while multimodal large language models (MLLMs) perform well on basic information extraction, they struggle with reasoning and grounding, editing code to preserve functionality, and generating design-to-code that maintains hierarchy and supports multilingual content. These findings reveal key limitations in current MLLMs and underscore the need for improved multimodal and cross-lingual reasoning to build future web agents capable of automating diverse web development tasks.

2025-08-01

arXiv (publié)

AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories

Xing Han Lu

Karolina Stanczak

Peter Shaw

Chris Pal

Web agents enable users to perform tasks on web browsers through natural language interaction. Evaluating web agents trajectories is an impo… (voir plus)rtant problem, since it helps us determine whether the agent successfully completed the tasks. Rule-based methods are widely used for this purpose, but they are challenging to extend to new tasks and may not always recognize successful trajectories. We may achieve higher accuracy through human evaluation, but the process would be substantially slower and more expensive. Automatic evaluations with LLMs may avoid the challenges of designing new rules and manually annotating trajectories, enabling faster and cost-effective evaluation. However, it is unclear how effective they are at evaluating web agents. To this end, we propose AgentRewardBench, the first benchmark to assess the effectiveness of LLM judges for evaluating web agents. AgentRewardBench contains 1302 trajectories across 5 benchmarks and 4 LLMs. Each trajectory in AgentRewardBench is reviewed by an expert, who answers questions pertaining to the success, side effects, and repetitiveness of the agent. Using our benchmark, we evaluate 12 LLM judges and find that no single LLM excels across all benchmarks. We also find that the rule-based evaluation used by common benchmarks tends to underreport the success rate of web agents, highlighting a key weakness of rule-based evaluation and the need to develop more flexible automatic evaluations. We release the benchmark at: https://agent-reward-bench.github.io

2025-07-07

colmweb.org/COLM/2025/Conference (accepté)

openreview.net

Not All Data Are Unlearned Equally

Aravind Krishnan

Marius Mosbach

Machine unlearning is concerned with the task of removing knowledge learned from particular data points from a trained model. In the context… (voir plus) of large language models (LLMs), unlearning has recently received increased attention, particularly for removing knowledge about named entities from models for privacy purposes. While various approaches have been proposed to address the unlearning problem, most existing approaches treat all data points to be unlearned equally, i.e., unlearning that Montreal is a city in Canada is treated exactly the same as unlearning the phone number of the first author of this paper. In this work, we show that this all data is equal assumption does not hold for LLM unlearning. We study how the success of unlearning depends on the frequency of the knowledge we want to unlearn in the pre-training data of a model and find that frequency strongly affects unlearning, i.e., more frequent knowledge is harder to unlearn. Additionally, we uncover a misalignment between probability and generation-based evaluations of unlearning and show that this problem worsens as models become larger. Overall, our experiments highlight the need for better evaluation practices and novel methods for LLM unlearning that take the training data of models into account.

2025-07-07

colmweb.org/COLM/2025/Conference (accepté)

openreview.net

REARANK: Reasoning Re-ranking Agent via Reinforcement Learning

Le Zhang

Bo Wang

Xipeng Qiu

Aishwarya Agrawal

We present REARANK, a large language model (LLM)-based listwise reasoning reranking agent. REARANK explicitly reasons before reranking, sign… (voir plus)ificantly improving both performance and interpretability. Leveraging reinforcement learning and data augmentation, REARANK achieves substantial improvements over baseline models across popular information retrieval benchmarks, notably requiring only 179 annotated samples. Built on top of Qwen2.5-7B, our REARANK-7B demonstrates performance comparable to GPT-4 on both in-domain and out-of-domain benchmarks and even surpasses GPT-4 on reasoning-intensive BRIGHT benchmarks. These results underscore the effectiveness of our approach and highlight how reinforcement learning can enhance LLM reasoning capabilities in reranking.

2025-05-01

arXiv (publié)

DeepSeek-R1 Thoughtology: Let's think about LLM Reasoning

Sara Vera Marjanovi'c

Arkil Patel

Vaibhav Adlakha

Milad Aghajohari

Gaurav Kamath

Marius Mosbach

Karolina Stanczak

2025-04-02

ArXiv (prépublication)

DeepSeek-R1 Thoughtology: Let's think about LLM Reasoning

Sara Vera Marjanovi'c

Arkil Patel

Vaibhav Adlakha

Milad Aghajohari

Gaurav Kamath

Marius Mosbach

Karolina Stanczak

Large Reasoning Models like DeepSeek-R1 mark a fundamental shift in how LLMs approach complex problems. Instead of directly producing an ans… (voir plus)wer for a given input, DeepSeek-R1 creates detailed multi-step reasoning chains, seemingly"thinking"about a problem before providing an answer. This reasoning process is publicly available to the user, creating endless opportunities for studying the reasoning behaviour of the model and opening up the field of Thoughtology. Starting from a taxonomy of DeepSeek-R1's basic building blocks of reasoning, our analyses on DeepSeek-R1 investigate the impact and controllability of thought length, management of long or confusing contexts, cultural and safety concerns, and the status of DeepSeek-R1 vis-\`a-vis cognitive phenomena, such as human-like language processing and world modelling. Our findings paint a nuanced picture. Notably, we show DeepSeek-R1 has a 'sweet spot' of reasoning, where extra inference time can impair model performance. Furthermore, we find a tendency for DeepSeek-R1 to persistently ruminate on previously explored problem formulations, obstructing further exploration. We also note strong safety vulnerabilities of DeepSeek-R1 compared to its non-reasoning counterpart, which can also compromise safety-aligned LLMs.

2025-04-02

ArXiv (prépublication)

Exploiting Instruction-Following Retrievers for Malicious Information Retrieval

Nicholas Meade

2025-03-11

ArXiv (prépublication)

SafeArena: Evaluating the Safety of Autonomous Web Agents

Ada Defne Tur

Esin DURMUS

Karolina Sta'nczak

2025-03-06

ArXiv (prépublication)

SafeArena: Evaluating the Safety of Autonomous Web Agents

Ada Defne Tur

Esin DURMUS

Karolina Stanczak

LLM-based agents are becoming increasingly proficient at solving web-based tasks. With this capability comes a greater risk of misuse for ma… (voir plus)licious purposes, such as posting misinformation in an online forum or selling illicit substances on a website. To evaluate these risks, we propose SafeArena, the first benchmark to focus on the deliberate misuse of web agents. SafeArena comprises 250 safe and 250 harmful tasks across four websites. We classify the harmful tasks into five harm categories -- misinformation, illegal activity, harassment, cybercrime, and social bias, designed to assess realistic misuses of web agents. We evaluate leading LLM-based web agents, including GPT-4o, Claude-3.5 Sonnet, Qwen-2-VL 72B, and Llama-3.2 90B, on our benchmark. To systematically assess their susceptibility to harmful tasks, we introduce the Agent Risk Assessment framework that categorizes agent behavior across four risk levels. We find agents are surprisingly compliant with malicious requests, with GPT-4o and Qwen-2 completing 34.7% and 27.3% of harmful requests, respectively. Our findings highlight the urgent need for safety alignment procedures for web agents. Our benchmark is available here: https://safearena.github.io

2025-03-06

ArXiv (prépublication)