Siva Reddy

Core Academic Member
Canada CIFAR AI Chair
Assistant Professor, McGill University, School of Computer Science and Department of Linguistics
Research Topics
Deep Learning
Natural Language Processing
Reasoning
Representation Learning

Biography

Siva Reddy is an assistant professor at the School of Computer Science and in the Department of Linguistics at McGill University. He completed a postdoc with the Stanford NLP Group in September 2019.

Reddy’s research goal is to give machines natural language understanding abilities that support applications such as question answering and conversational systems. His expertise includes building symbolic (linguistic and induced) and deep learning models for language.

Current Students

PhD - McGill University
Master's Research - McGill University
PhD - McGill University
Collaborating researcher
PhD - McGill University
Master's Research - McGill University
PhD - McGill University
Principal supervisor :
PhD - McGill University
Research Intern - UNIVERSITÄT DES SAARLANDES
PhD - McGill University
PhD - McGill University
Co-supervisor :
PhD - Polytechnique Montréal
Principal supervisor :
PhD - McGill University
Postdoctorate - McGill University
PhD - McGill University
Principal supervisor :
Research Intern - McGill University
Postdoctorate - McGill University
Research Intern - McGill University
Collaborating researcher - Cambridge University
Research Intern - McGill University

Publications

On the Origin of Hallucinations in Conversational Models: Is it the Datasets or the Models?
Nouha Dziri
Sivan Milton
Mo Yu
Osmar R Zaiane
Knowledge-grounded conversational models are known to suffer from producing factually invalid statements, a phenomenon commonly called hallucination. In this work, we investigate the underlying causes of this phenomenon: is hallucination due to the training data, or to the models? We conduct a comprehensive human study on both existing knowledge-grounded conversational benchmarks and several state-of-the-art models. Our study reveals that the standard benchmarks consist of > 60% hallucinated responses, leading to models that not only hallucinate but even amplify hallucinations. Our findings raise important questions on the quality of existing datasets and models trained using them. We make our annotations publicly available for future research.
TopiOCQA: Open-domain Conversational Question Answering with Topic Switching
Vaibhav Adlakha
Shehzaad Dhuliawala
Kaheer Suleman
Harm de Vries
Using Interactive Feedback to Improve the Accuracy and Explainability of Question Answering Systems Post-Deployment
Zichao Li
Prakhar Sharma
Xing Han Lu
Most research on question answering focuses on the pre-deployment stage; i.e., building an accurate model for deployment. In this paper, we ask the question: Can we improve QA systems further post-deployment based on user interactions? We focus on two kinds of improvements: 1) improving the QA system’s performance itself, and 2) providing the model with the ability to explain the correctness or incorrectness of an answer. We collect a retrieval-based QA dataset, FeedbackQA, which contains interactive feedback from users. We collect this dataset by deploying a base QA system to crowdworkers who then engage with the system and provide feedback on the quality of its answers. The feedback contains both structured ratings and unstructured natural language explanations. We train a neural model with this feedback data that can generate explanations and re-score answer candidates. We show that feedback data not only improves the accuracy of the deployed QA system but also other stronger non-deployed systems. The generated explanations also help users make informed decisions about the correctness of answers.
Image Retrieval from Contextual Descriptions
Benno Krojer
Vaibhav Adlakha
Vibhav Vineet
Yash Goyal
Edoardo Ponti
The ability to integrate context, including perceptual and temporal cues, plays a pivotal role in grounding the meaning of a linguistic utterance. In order to measure to what extent current vision-and-language models master this ability, we devise a new multimodal challenge, Image Retrieval from Contextual Descriptions (ImageCoDe). In particular, models are tasked with retrieving the correct image from a set of 10 minimally contrastive candidates based on a contextual description. As such, each description contains only the details that help distinguish between images. Because of this, descriptions tend to be complex in terms of syntax and discourse and require drawing pragmatic inferences. Images are sourced from both static pictures and video frames. We benchmark several state-of-the-art models, including both cross-encoders such as ViLBERT and bi-encoders such as CLIP, on ImageCoDe. Our results reveal that these models dramatically lag behind human performance: the best variant achieves an accuracy of 20.9 on video frames and 59.4 on static pictures, compared with 90.8 in humans. Furthermore, we experiment with new model variants that are better equipped to incorporate visual and temporal context into their representations, which achieve modest gains. Our hope is that ImageCoDe will foster progress in grounded language understanding by encouraging models to focus on fine-grained visual differences.
Combining Modular Skills in Multitask Learning
IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages
Emanuele Bugliarello
Fangyu Liu
Jonas Pfeiffer
Desmond Elliott
Edoardo Ponti
Ivan Vulic
Reliable evaluation benchmarks designed for replicability and comprehensiveness have driven progress in machine learning. Due to the lack of a multilingual benchmark, however, vision-and-language research has mostly focused on English language tasks. To fill this gap, we introduce the Image-Grounded Language Understanding Evaluation benchmark. IGLUE brings together - by both aggregating pre-existing datasets and creating new ones - visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages. Our benchmark enables the evaluation of multilingual multimodal models for transfer learning, not only in a zero-shot setting, but also in newly defined few-shot learning setups. Based on the evaluation of the available state-of-the-art models, we find that translate-test transfer is superior to zero-shot transfer and that few-shot learning is hard to harness for many tasks. Moreover, downstream performance is partially explained by the amount of available unlabelled textual data for pretraining, and only weakly by the typological distance of target-source languages. We hope to encourage future research efforts in this area by releasing the benchmark to the community.
The Curious Case of Absolute Position Embeddings
Koustuv Sinha
Amirhossein Kazemnejad
Dieuwke Hupkes
Adina Williams
End-to-End Training of Multi-Document Reader and Retriever for Open-Domain Question Answering
Devendra Singh Sachan
William Hamilton
Chris Dyer
Dani Yogatama
We present an end-to-end differentiable training method for retrieval-augmented open-domain question answering systems that combine information from multiple retrieved documents when generating answers. We model retrieval decisions as latent variables over sets of relevant documents. Since marginalizing over sets of retrieved documents is computationally hard, we approximate this using an expectation-maximization algorithm. We iteratively estimate the value of our latent variable (the set of relevant documents for a given question) and then use this estimate to update the retriever and reader parameters. We hypothesize that such end-to-end training allows training signals to flow to the reader and then to the retriever better than staged-wise training. This results in a retriever that is able to select more relevant documents for a question and a reader that is trained on more accurate documents to generate an answer. Experiments on three benchmark datasets demonstrate that our proposed method outperforms all existing approaches of comparable size by 2-3% absolute exact match points, achieving new state-of-the-art results. Our results also demonstrate the feasibility of learning to retrieve to improve answer generation without explicit supervision of retrieval decisions.
Back-Training excels Self-Training at Unsupervised Domain Adaptation of Question Generation and Passage Retrieval
Devang Kulshreshtha
Robert Belfer
Iulian V. Serban
In this work, we introduce back-training, an alternative to self-training for unsupervised domain adaptation (UDA). While self-training generates synthetic training data where natural inputs are aligned with noisy outputs, back-training results in natural outputs aligned with noisy inputs. This significantly reduces the gap between target domain and synthetic data distribution, and reduces model overfitting to source domain. We run UDA experiments on question generation and passage retrieval from the Natural Questions domain to machine learning and biomedical domains. We find that back-training vastly outperforms self-training by a mean improvement of 7.8 BLEU-4 points on generation, and 17.6% top-20 retrieval accuracy across both domains. We further propose consistency filters to remove low-quality synthetic data before training. We also release a new domain-adaptation dataset - MLQuestions containing 35K unaligned questions, 50K unaligned passages, and 3K aligned question-passage pairs.
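The contrast between self-training and back-training in the abstract above can be illustrated with a small sketch for question generation (passage in, question out). This is only an illustration under stated assumptions, not the authors' implementation: `toy_question_generator` and `make_toy_retriever` are hypothetical stand-ins for source-domain models, and the passages and questions are made-up examples.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (model input, training target)


def self_training_pairs(inputs: List[str], forward: Callable[[str], str]) -> List[Pair]:
    """Self-training: natural inputs aligned with noisy (model-generated) outputs."""
    return [(x, forward(x)) for x in inputs]


def back_training_pairs(outputs: List[str], backward: Callable[[str], str]) -> List[Pair]:
    """Back-training: noisy (model-produced) inputs aligned with natural outputs."""
    return [(backward(y), y) for y in outputs]


def toy_question_generator(passage: str) -> str:
    """Hypothetical source-domain question generator (passage -> question)."""
    return "What is " + passage.split()[0].lower() + "?"


def make_toy_retriever(passages: List[str]) -> Callable[[str], str]:
    """Hypothetical source-domain retriever (question -> passage) using word overlap."""
    def retrieve(question: str) -> str:
        q_words = set(question.lower().strip("?").split())
        return max(passages, key=lambda p: len(q_words & set(p.lower().split())))
    return retrieve


# Unaligned target-domain data (illustrative only).
passages = [
    "Dropout randomly zeroes activations during training.",
    "Gradient descent updates parameters along the negative gradient.",
]
questions = [
    "How does dropout regularize a network?",
    "Why does gradient descent need a learning rate?",
]

# Training data for the question generator under each scheme.
self_pairs = self_training_pairs(passages, toy_question_generator)
back_pairs = back_training_pairs(questions, make_toy_retriever(passages))
```

In `self_pairs` the training targets are synthetic questions, whereas in `back_pairs` every training target is a real target-domain question and only the input passage may be mismatched.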
Visually Grounded Reasoning across Languages and Cultures
Fangyu Liu
Emanuele Bugliarello
Edoardo Ponti
Nigel Collier
Desmond Elliott
The design of widespread vision-and-language datasets and pre-trained encoders directly adopts, or draws inspiration from, the concepts and images of ImageNet. While one can hardly overestimate how much this benchmark contributed to progress in computer vision, it is mostly derived from lexical databases and image queries in English, resulting in source material with a North American or Western European bias. Therefore, we devise a new protocol to construct an ImageNet-style hierarchy representative of more languages and cultures. In particular, we let the selection of both concepts and images be entirely driven by native speakers, rather than scraping them automatically. Specifically, we focus on a typologically diverse set of languages, namely, Indonesian, Mandarin Chinese, Swahili, Tamil, and Turkish. On top of the concepts and images obtained through this new protocol, we create a multilingual dataset for Multicultural Reasoning over Vision and Language (MaRVL) by eliciting statements from native speaker annotators about pairs of images. The task consists of discriminating whether each grounded statement is true or false. We establish a series of baselines using state-of-the-art models and find that their cross-lingual transfer performance lags dramatically behind supervised performance in English. These results invite us to reassess the robustness and accuracy of current state-of-the-art models beyond a narrow domain, but also open up new exciting challenges for the development of truly multilingual and multicultural systems.
An Empirical Survey of the Effectiveness of Debiasing Techniques for Pre-trained Language Models
Nicholas Meade
Elinor Poole-Dayan
Recent work has shown pre-trained language models capture social biases from the large amounts of text they are trained on. This has attracted attention to developing techniques that mitigate such biases. In this work, we perform an empirical survey of five recently proposed bias mitigation techniques: Counterfactual Data Augmentation (CDA), Dropout, Iterative Nullspace Projection, Self-Debias, and SentenceDebias. We quantify the effectiveness of each technique using three intrinsic bias benchmarks while also measuring the impact of these techniques on a model’s language modeling ability, as well as its performance on downstream NLU tasks. We experimentally find that: (1) Self-Debias is the strongest debiasing technique, obtaining improved scores on all bias benchmarks; (2) current debiasing techniques perform less consistently when mitigating non-gender biases; and (3) improvements on bias benchmarks such as StereoSet and CrowS-Pairs by using debiasing strategies are often accompanied by a decrease in language modeling ability, making it difficult to determine whether the bias mitigation was effective.
The Power of Prompt Tuning for Low-Resource Semantic Parsing
Nathan Schucher
Harm de Vries
Prompt tuning has recently emerged as an effective method for adapting pre-trained language models to a number of language understanding and generation tasks. In this paper, we investigate prompt tuning for semantic parsing, the task of mapping natural language utterances onto formal meaning representations. On the low-resource splits of Overnight and TOPv2, we find that a prompt tuned T5-xl significantly outperforms its fine-tuned counterpart, as well as strong GPT-3 and BART baselines. We also conduct ablation studies across different model scales and target representations, finding that, with increasing model scale, prompt tuned T5 models improve at generating target representations that are far from the pre-training distribution.