Siva Reddy

Biography

Siva Reddy is an Assistant Professor in Computer Science and Linguistics at McGill University. His research focuses on algorithms that enable computers to understand and process human languages. He completed his postdoctoral work with the Stanford NLP Group. His expertise includes building symbolic (linguistic and induced) and deep learning models of language.

Current Students

Vaibhav Adlakha

PhD - McGill

Parishad BehnamGhader

Research Master's - McGill

PhD - McGill

Research Collaborator

Gaurav Kamath

PhD - McGill

Aditi Khandelwal

PhD - McGill

Principal Supervisor:

PhD - McGill

Co-supervisor:

Timothy O'Donnell

Aravind Krishnan

Alumni Collaborator - Universität des Saarlandes

PhD - McGill

Zichao Li

PhD - McGill

Co-supervisor:

Jackie Cheung

Xing Han Lu

PhD - McGill

Research Intern - McGill

PhD - McGill

Postdoc - McGill

Oh Oh

Research Collaborator

Arkil Patel

PhD - McGill

Principal Supervisor:

Research Collaborator

Karolina Ewa Stańczak

Alumni Collaborator - McGill

Ada Tur

Research Intern - McGill

Alumni Collaborator - McGill

Blog Posts

How can you explain AI and make sure the explanation is true? Faithfulness-measurable models show you how

October 1, 2024

by

Andreas Madsen

Siva Reddy

Sarath Chandar

Publications

REARANK: Reasoning Re-ranking Agent via Reinforcement Learning

Le Zhang

Bo Wang

Xipeng Qiu

Aishwarya Agrawal

We present REARANK, a large language model (LLM)-based listwise reasoning reranking agent. REARANK explicitly reasons before reranking, significantly improving both performance and interpretability. Leveraging reinforcement learning and data augmentation, REARANK achieves substantial improvements over baseline models across popular information retrieval benchmarks, notably requiring only 179 annotated samples. Built on top of Qwen2.5-7B, our REARANK-7B demonstrates performance comparable to GPT-4 on both in-domain and out-of-domain benchmarks and even surpasses GPT-4 on reasoning-intensive BRIGHT benchmarks. These results underscore the effectiveness of our approach and highlight how reinforcement learning can enhance LLM reasoning capabilities in reranking.

2025-05-26

arXiv (preprint)

DeepSeek-R1 Thoughtology: Let's think about LLM Reasoning

Sara Vera Marjanović

Arkil Patel

Vaibhav Adlakha

Milad Aghajohari

Parishad BehnamGhader

Mehar Bhatia

Aditi Khandelwal

Austin Kraft

Benno Krojer

Xing Han Lu

Nicholas Meade

Dongchan Shin

Amirhossein Kazemnejad

Gaurav Kamath

Marius Mosbach

Karolina Stańczak

Large Reasoning Models like DeepSeek-R1 mark a fundamental shift in how LLMs approach complex problems. Instead of directly producing an answer for a given input, DeepSeek-R1 creates detailed multi-step reasoning chains, seemingly "thinking" about a problem before providing an answer. This reasoning process is publicly available to the user, creating endless opportunities for studying the reasoning behaviour of the model and opening up the field of Thoughtology. Starting from a taxonomy of DeepSeek-R1's basic building blocks of reasoning, our analyses on DeepSeek-R1 investigate the impact and controllability of thought length, management of long or confusing contexts, cultural and safety concerns, and the status of DeepSeek-R1 vis-à-vis cognitive phenomena, such as human-like language processing and world modelling. Our findings paint a nuanced picture. Notably, we show DeepSeek-R1 has a 'sweet spot' of reasoning, where extra inference time can impair model performance. Furthermore, we find a tendency for DeepSeek-R1 to persistently ruminate on previously explored problem formulations, obstructing further exploration. We also note strong safety vulnerabilities of DeepSeek-R1 compared to its non-reasoning counterpart, which can also compromise safety-aligned LLMs.

2025-04-02

arXiv (preprint)

Exploiting Instruction-Following Retrievers for Malicious Information Retrieval

Parishad BehnamGhader

Nicholas Meade

2025-03-11

arXiv (preprint)

The BrowserGym Ecosystem for Web Agent Research

Thibault Le Sellier de Chezelles

Maxime Gasse

Alexandre Lacoste

Alexandre Drouin

Massimo Caccia

Léo Boisvert

Megh Thakkar

Tom Marty

Rim Assouel

Sahar Omidi Shayegan

Lawrence Keunho Jang

Xing Han Lu

Ori Yoran

Dehan Kong

Frank F. Xu

Quentin Cappart

Graham Neubig

Russ Salakhutdinov

Nicolas Chapados

The BrowserGym ecosystem addresses the growing need for efficient evaluation and benchmarking of web agents, particularly those leveraging automation and Large Language Models (LLMs) for web interaction tasks. Many existing benchmarks suffer from fragmentation and inconsistent evaluation methodologies, making it challenging to achieve reliable comparisons and reproducible results. BrowserGym aims to solve this by providing a unified, gym-like environment with well-defined observation and action spaces, facilitating standardized evaluation across diverse benchmarks. Combined with AgentLab, a complementary framework that aids in agent creation, testing, and analysis, BrowserGym offers flexibility for integrating new benchmarks while ensuring consistent evaluation and comprehensive experiment management. This standardized approach seeks to reduce the time and complexity of developing web agents, support more reliable comparisons, and facilitate in-depth analysis of agent behaviors; it could ultimately yield more adaptable, capable agents and accelerate innovation in LLM-driven automation. As supporting evidence, we conduct the first large-scale, multi-benchmark web agent experiment and compare the performance of 6 state-of-the-art LLMs across all benchmarks currently available in BrowserGym. Among other findings, our results highlight a large discrepancy between OpenAI's and Anthropic's latest models, with Claude-3.5-Sonnet leading the way on almost all benchmarks, except on vision-related tasks where GPT-4o is superior. Despite these advancements, our results emphasize that building robust and efficient web agents remains a significant challenge, due to the inherent complexity of real-world web environments and the limitations of current models.

2025-03-08

TMLR (accepted)

openreview.net

SafeArena: Evaluating the Safety of Autonomous Web Agents

Ada Defne Tur

Nicholas Meade

Xing Han Lu

Alejandra Zambrano

Arkil Patel

Esin Durmus

Spandana Gella

Karolina Stańczak

2025-03-06

arXiv (preprint)

Societal Alignment Frameworks Can Improve LLM Alignment

Karolina Stańczak

Nicholas Meade

Mehar Bhatia

Hattie Zhou

Konstantin Böttinger

Jeremy Barnes

Jason Stanley

Jessica Montgomery

Richard Zemel

Nicolas Papernot

Nicolas Chapados

Denis Therien

Timothy P. Lillicrap

Ana Marasović

Sylvie Delacroix

Gillian K. Hadfield

Recent progress in large language models (LLMs) has focused on producing responses that meet human expectations and align with shared values, a process coined alignment. However, aligning LLMs remains challenging due to the inherent disconnect between the complexity of human values and the narrow nature of the technological approaches designed to address them. Current alignment methods often lead to misspecified objectives, reflecting the broader issue of incomplete contracts: the impracticality of specifying a contract between a model developer and the model that accounts for every scenario in LLM alignment. In this paper, we argue that improving LLM alignment requires incorporating insights from societal alignment frameworks, including social, economic, and contractual alignment, and discuss potential solutions drawn from these domains. Given the role of uncertainty within societal alignment frameworks, we then investigate how it manifests in LLM alignment. We end our discussion by offering an alternative view on LLM alignment, framing the underspecified nature of its objectives as an opportunity rather than striving to perfect their specification. Beyond technical improvements in LLM alignment, we discuss the need for participatory alignment interface designs.

2025-03-05

ICLR.cc/2025/Workshop/Bi-Align (poster)

openreview.net

WebMMU: A Benchmark for Multimodal Multilingual Website Understanding and Code Generation

Rabiul Awal

Mahsa Massoud

Zichao Li

Aarash Feizi

Suyuchen Wang

Chris Pal

Aishwarya Agrawal

David Vazquez

Juan A. Rodriguez

Perouz Taslakian

Spandana Gella

Sai Rajeswar

Understanding diverse web data and automating web development present an exciting challenge for agentic AI. While existing benchmarks address isolated web-based tasks, such as website-based Visual Question Answering (VQA) and UI-to-code generation, they lack a unified evaluation suite for assessing web agents that interact with and reason about web environments. We introduce WebMMU, a large-scale benchmark for evaluating AI-driven web agents across multilingual website VQA, HTML/CSS/JavaScript code editing, and sketch-to-code generation. WebMMU provides a comprehensive evaluation suite with real-world website data, multi-step reasoning tasks, and functional UI understanding. Benchmarking state-of-the-art multimodal models on WebMMU reveals significant limitations in web-based reasoning, layout understanding, and structured code generation, particularly in preserving UI hierarchy, handling multilingual content, and producing robust, functional code. While most existing models are optimized for English-only settings, WebMMU highlights the challenges of cross-lingual adaptation in real-world web development. These findings expose critical gaps in current models’ ability to understand website structures, execute user instructions, and generate high-quality web code, underscoring the need for more advanced multimodal reasoning in AI-driven web understanding and development.

2025-03-05

ICLR.cc/2025/Workshop/DL4C (published)

openreview.net

Exploiting Instruction-Following Retrievers for Malicious Information Retrieval

Parishad BehnamGhader

Nicholas Meade

Instruction-following retrievers have been widely adopted alongside LLMs in real-world applications, but little work has investigated the safety risks surrounding their increasing search capabilities. We empirically study the ability of retrievers to satisfy malicious queries, both when used directly and when used in a retrieval augmented generation-based setup. Concretely, we investigate six leading retrievers, including NV-Embed and LLM2Vec, and find that given malicious requests, most retrievers can (for >50% of queries) select relevant harmful passages. For example, LLM2Vec correctly selects passages for 61.35% of our malicious queries. We further uncover an emerging risk with instruction-following retrievers, where highly relevant harmful information can be surfaced by exploiting their instruction-following capabilities. Finally, we show that even safety-aligned LLMs, such as Llama3, can satisfy malicious requests when provided with harmful retrieved passages in-context. In summary, our findings underscore the malicious misuse risks associated with increasing retriever capability.

2025-03-01

arXiv (published)

Large language models deconstruct the clinical intuition behind diagnosing autism

Jack Stanley

Emmett Rabot

Eugene Belilovsky

L. Mottron

Danilo Bzdok

2025-03-01

Cell (published)

SafeArena: Evaluating the Safety of Autonomous Web Agents

Ada Defne Tur

Nicholas Meade

Xing Han Lu

Alejandra Zambrano

Arkil Patel

Esin Durmus

Spandana Gella

Karolina Stańczak

2025-03-01

arXiv (published)

Societal Alignment Frameworks Can Improve LLM Alignment

Karolina Stańczak

Nicholas Meade

Mehar Bhatia

Hattie Zhou

Konstantin Böttinger

Jeremy Barnes

Jason Stanley

Jessica Montgomery

Richard Zemel

Nicolas Papernot

Nicolas Chapados

Denis Therien

Timothy P. Lillicrap

Ana Marasović

Sylvie Delacroix

Gillian K. Hadfield

2025-02-27

arXiv (preprint)