Siva Reddy

Core Academic Member
Canada CIFAR AI Chair
Assistant Professor, McGill University, School of Computer Science and Department of Linguistics
Research Topics
Deep Learning
Natural Language Processing
Reasoning
Representation Learning

Biography

Siva Reddy is an assistant professor at the School of Computer Science and in the Department of Linguistics at McGill University. He completed a postdoc with the Stanford NLP Group in September 2019.

Reddy’s research goal is to enable machines with natural language understanding abilities in order to facilitate applications like question answering and conversational systems. His expertise includes building symbolic (linguistic and induced) and deep learning models for language.

Current Students

PhD - McGill University
Master's Research - McGill University
PhD - McGill University
Collaborating researcher - McGill University
Postdoctorate - McGill University
Research Intern - McGill University
Independent visiting researcher
Co-supervisor :
Master's Research - McGill University
Co-supervisor :
Collaborating researcher
PhD - McGill University
Co-supervisor :
PhD - McGill University
Principal supervisor :
PhD - McGill University
Co-supervisor :
PhD - McGill University
PhD - McGill University
Co-supervisor :
Master's Research - McGill University
Co-supervisor :
PhD - McGill University
Master's Research - McGill University
PhD - McGill University
Postdoctorate - McGill University
Master's Research - McGill University
PhD - McGill University
Principal supervisor :
Collaborating researcher - N/A
Research Intern - McGill University
Collaborating Alumni
Collaborating Alumni - McGill University
Collaborating researcher
Co-supervisor :
Research Intern - McGill University
Collaborating Alumni - McGill University
Research Intern - McGill University

Publications

Semantic change in adults is not primarily a generational phenomenon
Morgan Sonderegger
Dallas Card
A central question in the study of language change is whether or not such change is generational. If a language changes over time generation-by-generation, the process looks as follows: New generations of speakers introduce innovations, while older speakers conserve their usage patterns, and the language changes as new generations replace older ones. At the opposite extreme, language change could be a zeitgeist phenomenon, in which changes are universally adopted by speakers simultaneously, regardless of age or generational cohort. This paper asks this question in the context of word meaning change. We analyze meaning change in over 100 words across more than 7.9 million U.S. congressional speeches, to observe whether, when a word sense rises or falls in prominence, adult speakers from different generations uniformly adopt it, or those from older generations conserve their prior usage. Using language model-based word sense induction methods, we identify different senses of each word, and then model the prevalence of each of these word senses as a function of time and speaker age. We find that most words show a small but statistically significant effect of speaker age; across almost 140 years of Congress, older speakers typically take longer than younger speakers to follow changes in word usage, but nevertheless do so within a few years. Our findings indicate that despite minor age-based differences, word meaning change among mature speakers is likely not a generational process, but rather a zeitgeist process, in which older adult speakers can readily adopt new word usage patterns.
AgentRewardBench: Evaluating Automatic Evaluations of Web Agent Trajectories
Web agents enable users to perform tasks on web browsers through natural language interaction. Evaluating web agent trajectories is an important problem, since it helps us determine whether the agent successfully completed the tasks. Rule-based methods are widely used for this purpose, but they are challenging to extend to new tasks and may not always recognize successful trajectories. We may achieve higher accuracy through human evaluation, but the process would be substantially slower and more expensive. Automatic evaluations with LLMs may avoid the challenges of designing new rules and manually annotating trajectories, enabling faster and cost-effective evaluation. However, it is unclear how effective they are at evaluating web agents. To this end, we propose AgentRewardBench, the first benchmark to assess the effectiveness of LLM judges for evaluating web agents. AgentRewardBench contains 1302 trajectories across 5 benchmarks and 4 LLMs. Each trajectory in AgentRewardBench is reviewed by an expert, who answers questions pertaining to the success, side effects, and repetitiveness of the agent. Using our benchmark, we evaluate 12 LLM judges and find that no single LLM excels across all benchmarks. We also find that the rule-based evaluation used by common benchmarks tends to underreport the success rate of web agents, highlighting a key weakness of rule-based evaluation and the need to develop more flexible automatic evaluations. We release the benchmark at: https://agent-reward-bench.github.io
Not All Data Are Unlearned Equally
Machine unlearning is concerned with the task of removing knowledge learned from particular data points from a trained model. In the context of large language models (LLMs), unlearning has recently received increased attention, particularly for removing knowledge about named entities from models for privacy purposes. While various approaches have been proposed to address the unlearning problem, most existing approaches treat all data points to be unlearned equally, i.e., unlearning that Montreal is a city in Canada is treated exactly the same as unlearning the phone number of the first author of this paper. In this work, we show that this "all data is equal" assumption does not hold for LLM unlearning. We study how the success of unlearning depends on the frequency of the knowledge we want to unlearn in the pre-training data of a model and find that frequency strongly affects unlearning, i.e., more frequent knowledge is harder to unlearn. Additionally, we uncover a misalignment between probability and generation-based evaluations of unlearning and show that this problem worsens as models become larger. Overall, our experiments highlight the need for better evaluation practices and novel methods for LLM unlearning that take the training data of models into account.
DeepSeek-R1 Thoughtology: Let's think about LLM Reasoning
Large Reasoning Models like DeepSeek-R1 mark a fundamental shift in how LLMs approach complex problems. Instead of directly producing an answer for a given input, DeepSeek-R1 creates detailed multi-step reasoning chains, seemingly "thinking" about a problem before providing an answer. This reasoning process is publicly available to the user, creating endless opportunities for studying the reasoning behaviour of the model and opening up the field of Thoughtology. Starting from a taxonomy of DeepSeek-R1's basic building blocks of reasoning, our analyses on DeepSeek-R1 investigate the impact and controllability of thought length, management of long or confusing contexts, cultural and safety concerns, and the status of DeepSeek-R1 vis-à-vis cognitive phenomena, such as human-like language processing and world modelling. Our findings paint a nuanced picture. Notably, we show DeepSeek-R1 has a 'sweet spot' of reasoning, where extra inference time can impair model performance. Furthermore, we find a tendency for DeepSeek-R1 to persistently ruminate on previously explored problem formulations, obstructing further exploration. We also note strong safety vulnerabilities of DeepSeek-R1 compared to its non-reasoning counterpart, which can also compromise safety-aligned LLMs.
Societal Alignment Frameworks Can Improve LLM Alignment
Karolina Stanczak
Konstantin Böttinger
Jeremy Barnes
Jason Stanley
Jessica Montgomery
Richard Zemel
Nicolas Papernot
Denis Therien
Timothy P Lillicrap
Ana Marasovic
Sylvie Delacroix
Gillian K. Hadfield
Recent progress in large language models (LLMs) has focused on producing responses that meet human expectations and align with shared values - a process coined alignment. However, aligning LLMs remains challenging due to the inherent disconnect between the complexity of human values and the narrow nature of the technological approaches designed to address them. Current alignment methods often lead to misspecified objectives, reflecting the broader issue of incomplete contracts: the impracticality of specifying a contract between a model developer and the model that accounts for every scenario in LLM alignment. In this paper, we argue that improving LLM alignment requires incorporating insights from societal alignment frameworks, including social, economic, and contractual alignment, and discuss potential solutions drawn from these domains. Given the role of uncertainty within societal alignment frameworks, we then investigate how it manifests in LLM alignment. We end our discussion by offering an alternative view on LLM alignment, framing the underspecified nature of its objectives as an opportunity rather than seeking to perfect their specification. Beyond technical improvements in LLM alignment, we discuss the need for participatory alignment interface designs.
Large language models deconstruct the clinical intuition behind diagnosing autism
Emmett Rabot
Laurent Mottron
BigDocs: An Open Dataset for Training Multi-modal Models on Document and Code Tasks
Xiangru Jian
Akshay Kalkunte
Amirhossein Abaskohi
Pierre-Andre Noel
Sanket Biswas
Sara Shanian
Noah Bolger
Kurt MacDonald
Simon Fauvel
Sathwik Tejaswi
Srinivas Sunkara
Joao Monteiro
Krishnamurthy Dj Dvijotham
Torsten Scholak
Sepideh Kharagani
Sean Hughes
M. Özsu
Christopher Pal
Sai Rajeswar
Multimodal AI has the potential to significantly enhance document-understanding tasks, such as processing receipts, understanding workflows, extracting data from documents, and summarizing reports. Code generation tasks that require long-structured outputs can also be enhanced by multimodality. Despite this, their use in commercial applications is often limited due to limited access to training data and restrictive licensing, which hinders open access. To address these limitations, we introduce BigDocs-7.5M, a high-quality, open-access dataset comprising 7.5 million multimodal documents across 30 tasks. We use an efficient data curation process to ensure our data is high-quality and license-permissive. Our process emphasizes accountability, responsibility, and transparency through filtering rules, traceable metadata, and careful content analysis. Additionally, we introduce BigDocs-Bench, a benchmark suite with 10 novel tasks where we create datasets that reflect real-world use cases involving reasoning over Graphical User Interfaces (GUI) and code generation from images. Our experiments show that training with BigDocs-Bench improves average performance up to 25.8% over closed-source GPT-4o in document reasoning and structured output tasks such as Screenshot2HTML or Image2Latex generation. Finally, human evaluations showed a preference for outputs from models trained on BigDocs over GPT-4o. This suggests that BigDocs can help both academics and the open-source community utilize and improve AI tools to enhance multimodal capabilities and document reasoning. The project is hosted at https://bigdocs.github.io.
MMTEB: Massive Multilingual Text Embedding Benchmark
Kenneth Enevoldsen
Isaac Chung
Márton Kardos
Ashwin Mathur
David Stap
Wissam Siblini
Dominik Krzemiński
Genta Indra Winata
Saba Sturua
Saiteja Utpala
Mathieu Ciancone
Marion Schaeffer
Gabriel Sequeira
Shreeya Dhakal
Jonathan Rystrøm
Roman Solomatin
Ömer Çağatan
Akash Kundu
Martin Bernstorff
Shitao Xiao
Akshita Sukhlecha
Bhavish Pahwa
Rafał Poświata
Kranthi Kiran GV
Shawon Ashraf
Daniel Auras
Björn Plüster
Jan Philipp Harries
Loïc Magne
Isabelle Mohr
Mariya Hendriksen
Dawei Zhu
Hippolyte Gisserot-Boukhlef
Tom Aarsen
Jan Kostkan
Konrad Wojtasik
Taemin Lee
Marek Šuppa
Crystina Zhang
Roberta Rocca
Mohammed Hamdy
Andrianos Michail
John Yang
Manuel Faysse
Aleksei Vatolin
Nandan Thakur
Dipam Vasani
Pranjal Chitale
Simone Tedeschi
Nguyen Tai
Artem Snegirev
Michael Günther
Mengzhou Xia
Weijia Shi
Jordan Clive
Gayatri Krishnakumar
Anna Maksimova
Silvan Wehrli
Maria Tikhonova
Henil Panchal
Aleksandr Abramov
Malte Ostendorff
Zheng Liu
Simon Clematide
Lester James Miranda
Alena Fenogenova
Guangyu Song
Ruqiya Bin Safi
Wen-Ding Li
Alessia Borghini
Federico Cassano
Hongjin Su
Jimmy Lin
Howard Yen
Lasse Hansen
Sara Hooker
Chenghao Xiao
Orion Weller
Niklas Muennighoff
Text embeddings are typically evaluated on a limited set of tasks, which are constrained by language, domain, and task diversity. To address these limitations and provide a more comprehensive evaluation, we introduce the Massive Multilingual Text Embedding Benchmark (MMTEB) - a large-scale, community-driven expansion of MTEB, covering over 500 quality-controlled evaluation tasks across 250+ languages. MMTEB includes a diverse set of challenging, novel tasks such as instruction following, long-document retrieval, and code retrieval, representing the largest multilingual collection of evaluation tasks for embedding models to date. Using this collection, we develop several highly multilingual benchmarks, which we use to evaluate a representative set of models. We find that while large language models (LLMs) with billions of parameters can achieve state-of-the-art performance on certain language subsets and task categories, the best-performing publicly available model is multilingual-e5-large-instruct with only 560 million parameters. To facilitate accessibility and reduce computational cost, we introduce a novel downsampling method based on inter-task correlation, ensuring a diverse selection while preserving relative model rankings. Furthermore, we optimize tasks such as retrieval by sampling hard negatives, creating smaller but effective splits. These optimizations allow us to introduce benchmarks that drastically reduce computational demands. For instance, our newly introduced zero-shot English benchmark maintains a ranking order similar to the full-scale version but at a fraction of the computational cost.
Exploiting Instruction-Following Retrievers for Malicious Information Retrieval
Instruction-following retrievers have been widely adopted alongside LLMs in real-world applications, but little work has investigated the safety risks surrounding their increasing search capabilities. We empirically study the ability of retrievers to satisfy malicious queries, both when used directly and when used in a retrieval augmented generation-based setup. Concretely, we investigate six leading retrievers, including NV-Embed and LLM2Vec, and find that given malicious requests, most retrievers can (for >50% of queries) select relevant harmful passages. For example, LLM2Vec correctly selects passages for 61.35% of our malicious queries. We further uncover an emerging risk with instruction-following retrievers, where highly relevant harmful information can be surfaced by exploiting their instruction-following capabilities. Finally, we show that even safety-aligned LLMs, such as Llama3, can satisfy malicious requests when provided with harmful retrieved passages in-context. In summary, our findings underscore the malicious misuse risks associated with increasing retriever capability.
REARANK: Reasoning Re-ranking Agent via Reinforcement Learning
We present REARANK, a large language model (LLM)-based listwise reasoning reranking agent. REARANK explicitly reasons before reranking, significantly improving both performance and interpretability. Leveraging reinforcement learning and data augmentation, REARANK achieves substantial improvements over baseline models across popular information retrieval benchmarks, notably requiring only 179 annotated samples. Built on top of Qwen2.5-7B, our REARANK-7B demonstrates performance comparable to GPT-4 on both in-domain and out-of-domain benchmarks and even surpasses GPT-4 on reasoning-intensive BRIGHT benchmarks. These results underscore the effectiveness of our approach and highlight how reinforcement learning can enhance LLM reasoning capabilities in reranking.
ReTreever: Tree-Based Coarse-to-Fine Representations for Retrieval
Tianyi Chen
Valentina Zantedeschi
Document retrieval is a core component of question-answering systems, as it enables conditioning answer generation on new and large-scale corpora. While effective, the standard practice of encoding documents into high-dimensional embeddings for similarity search entails large memory and compute footprints, and also makes it hard to inspect the inner workings of the system. In this paper, we propose a tree-based method for organizing and representing reference documents at various granular levels, which offers the flexibility to balance cost and utility, and eases the inspection of the corpus content and retrieval operations. Our method, called ReTreever, jointly learns a routing function per internal node of a binary tree such that query and reference documents are assigned to similar tree branches, hence directly optimizing for retrieval performance. Our evaluations show that ReTreever generally preserves full representation accuracy. Its hierarchical structure further provides strong coarse representations and enhances transparency by indirectly learning meaningful semantic groupings. Among hierarchical retrieval methods, ReTreever achieves the best retrieval accuracy at the lowest latency, proving that this family of techniques can be viable in practical applications.
The BrowserGym Ecosystem for Web Agent Research
Maxime Gasse
Alexandre Lacoste
Massimo Caccia
Lawrence Keunho Jang
Ori Yoran
Dehan Kong
Frank F. Xu
Graham Neubig
Ruslan Salakhutdinov
The BrowserGym ecosystem addresses the growing need for efficient evaluation and benchmarking of web agents, particularly those leveraging automation and Large Language Models (LLMs). Many existing benchmarks suffer from fragmentation and inconsistent evaluation methodologies, making it challenging to achieve reliable comparisons and reproducible results. In an earlier work, Drouin et al. (2024) introduced BrowserGym, which aims to solve this by providing a unified, gym-like environment with well-defined observation and action spaces, facilitating standardized evaluation across diverse benchmarks. We propose an extended BrowserGym-based ecosystem for web agent research, which unifies existing benchmarks from the literature and includes AgentLab, a complementary framework that aids in agent creation, testing, and analysis. Our proposed ecosystem offers flexibility for integrating new benchmarks while ensuring consistent evaluation and comprehensive experiment management. As supporting evidence, we conduct the first large-scale, multi-benchmark web agent experiment and compare the performance of 6 state-of-the-art LLMs across 6 popular web agent benchmarks made available in BrowserGym. Among other findings, our results highlight a large discrepancy between OpenAI's and Anthropic's latest models, with Claude-3.5-Sonnet leading the way on almost all benchmarks, except on vision-related tasks where GPT-4o is superior. Despite these advancements, our results emphasize that building robust and efficient web agents remains a significant challenge, due to the inherent complexity of real-world web environments and the limitations of current models.