Siva Reddy

Biography

Siva Reddy is an Assistant Professor of Computer Science and Linguistics at McGill University. His research focuses on algorithms that enable computers to understand and process human languages. He completed his postdoctoral work with the Stanford NLP Group. His expertise includes building symbolic (linguistic and induced) and deep learning models for language.

Current Students

Vaibhav Adlakha

PhD - McGill

Parishad BehnamGhader

Research Master's - McGill

PhD - McGill

Research Collaborator - McGill

Jerry Chen

Research Intern - McGill

Postdoctorate - McGill

Independent Visiting Researcher

Co-supervisor:

Yoshua Bengio

Jay Gala

Research Master's - McGill

Co-supervisor:

Research Collaborator

Gaurav Kamath

PhD - McGill

Co-supervisor:

Timothy O'Donnell

Aditi Khandelwal

PhD - McGill

Principal supervisor:

Golnoosh Farnadi

Austin Kraft

PhD - McGill

Co-supervisor:

PhD - McGill

Zichao Li

PhD - McGill

Co-supervisor:

Jackie Cheung

Fengyuan Liu

Research Master's - McGill

Co-supervisor:

Dzmitry Bahdanau

Xing Han Lu

PhD - McGill

Research Master's - McGill

PhD - McGill

Postdoctorate - McGill

Marzia Nouri

Research Master's - McGill

Arkil Patel

PhD - McGill

Principal supervisor:

Research Collaborator - N/A

Ben Saine

Research Collaborator - McGill

Dongchan Shin

Alumni Collaborator

Ada Tur

Research Intern - McGill

PhD - McGill

Research Collaborator - McGill

How can we explain AI and make sure the explanation is true? Faithfulness measurable models show how

Blog posts

October 1, 2024

by

Andreas Madsen

Siva Reddy

Sarath Chandar

Publications

Syntactic Substitutability as Unsupervised Dependency Syntax

Jasper Jian

Syntax is a latent hierarchical structure which underpins the robust and compositional nature of human language. In this work, we explore the hypothesis that syntactic dependencies can be represented in language model attention distributions and propose a new method to induce these structures theory-agnostically. Instead of modeling syntactic relations as defined by annotation schemata, we model a more general property implicit in the definition of dependency relations, syntactic substitutability. This property captures the fact that words at either end of a dependency can be substituted with words from the same category. Substitutions can be used to generate a set of syntactically invariant sentences whose representations are then used for parsing. We show that increasing the number of substitutions used improves parsing accuracy on natural data. On long-distance subject-verb agreement constructions, our method achieves 79.5% recall compared to 8.9% using a previous method. Our method also provides improvements when transferred to a different parsing setup, demonstrating that it generalizes.

2022-12-31

EMNLP (published)

openreview.net

The StatCan Dialogue Dataset: Retrieving Data Tables through Conversations with Genuine Intents

Xing Han Lu

Harm de Vries

We introduce the StatCan Dialogue Dataset consisting of 19,379 conversation turns between agents working at Statistics Canada and online users looking for published data tables. The conversations stem from genuine intents, are held in English or French, and lead to agents retrieving one of over 5000 complex data tables. Based on this dataset, we propose two tasks: (1) automatic retrieval of relevant tables based on an ongoing conversation, and (2) automatic generation of appropriate agent responses at each turn. We investigate the difficulty of each task by establishing strong baselines. Our experiments on a temporal data split reveal that all models struggle to generalize to future conversations, as we observe a significant drop in performance across both tasks when we move from the validation to the test set. In addition, we find that response generation models struggle to decide when to return a table. Considering that the tasks pose significant challenges to existing models, we encourage the community to develop models for our task, which can be directly used to help knowledge workers find relevant tables for live chat users.

2022-12-31

EACL (published)

FaithDial: A Faithful Benchmark for Information-Seeking Dialogue

Nouha Dziri

Ehsan Kamalloo

Sivan Milton

Osmar Zaiane

Mo Yu

Edoardo M. Ponti

The goal of information-seeking dialogue is to respond to seeker queries with natural language utterances that are grounded on knowledge sources. However, dialogue systems often produce unsupported utterances, a phenomenon known as hallucination. To mitigate this behavior, we adopt a data-centric solution and create FaithDial, a new benchmark for hallucination-free dialogues, by editing hallucinated responses in the Wizard of Wikipedia (WoW) benchmark. We observe that FaithDial is more faithful than WoW while also maintaining engaging conversations. We show that FaithDial can serve as training signal for: i) a hallucination critic, which discriminates whether an utterance is faithful or not, and boosts the performance by 12.8 F1 score on the BEGIN benchmark compared to existing datasets for dialogue coherence; ii) high-quality dialogue generation. We benchmark a series of state-of-the-art models and propose an auxiliary contrastive objective that achieves the highest level of faithfulness and abstractiveness based on several automated metrics. Further, we find that the benefits of FaithDial generalize to zero-shot transfer on other datasets, such as CMU-Dog and TopicalChat. Finally, human evaluation reveals that responses generated by models trained on FaithDial are perceived as more interpretable, cooperative, and engaging.

2022-12-22

Transactions of the Association for Computational Linguistics (published)

Post-hoc Interpretability for Neural NLP: A Survey

Andreas Madsen

A. Chandar

2022-12-22

ACM Computing Surveys (published)

Evaluating the Faithfulness of Importance Measures in NLP by Recursively Masking Allegedly Important Tokens and Retraining

To explain NLP models, a popular approach is to use importance measures, such as attention, which inform which input tokens are important for making a prediction. However, an open question is how well these explanations accurately reflect a model's logic, a property called faithfulness. To answer this question, we propose Recursive ROAR, a new faithfulness metric. This works by recursively masking allegedly important tokens and then retraining the model. The principle is that this should result in worse model performance compared to masking random tokens. The result is a performance curve given a masking-ratio. Furthermore, we propose a summarizing metric using relative area-between-curves (RACU), which allows for easy comparison across papers, models, and tasks. We evaluate 4 different importance measures on 8 different datasets, using both LSTM-attention models and RoBERTa models. We find that the faithfulness of importance measures is both model-dependent and task-dependent. This conclusion contradicts previous evaluations in both computer vision and faithfulness of attention literature.

2022-11-30

Findings of the Association for Computational Linguistics: EMNLP 2022 (published)

Does Entity Abstraction Help Generative Transformers Reason?

Nicolas Gontier

Christopher Pal

We study the utility of incorporating entity type abstractions into pre-trained Transformers and test these methods on four NLP tasks requiring different forms of logical reasoning: (1) compositional language understanding with text-based relational reasoning (CLUTRR), (2) abductive reasoning (ProofWriter), (3) multi-hop question answering (HotpotQA), and (4) conversational question answering (CoQA). We propose and empirically explore three ways to add such abstraction: (i) as additional input embeddings, (ii) as a separate sequence to encode, and (iii) as an auxiliary prediction task for the model. Overall, our analysis demonstrates that models with abstract entity knowledge perform better than those without it. The best abstraction-aware models achieved an overall accuracy of 88.8% and 91.8% compared to the baseline model achieving 62.9% and 89.8% on CLUTRR and ProofWriter respectively. However, for HotpotQA and CoQA, we find that F1 scores improve by only 0.5% on average. Our results suggest that the benefit of explicit abstraction is significant in formally defined logical reasoning settings requiring many reasoning hops, but point to the notion that it is less beneficial for NLP tasks having less formal logical structure.

2022-11-19

TMLR (accepted)

openreview.net

IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages

Emanuele Bugliarello

Fangyu Liu

Jonas Pfeiffer

Reliable evaluation benchmarks designed for replicability and comprehensiveness have driven progress in machine learning. Due to the lack of a multilingual benchmark, however, vision-and-language research has mostly focused on English language tasks. To fill this gap, we introduce the Image-Grounded Language Understanding Evaluation benchmark. IGLUE brings together - by both aggregating pre-existing datasets and creating new ones - visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages. Our benchmark enables the evaluation of multilingual multimodal models for transfer learning, not only in a zero-shot setting, but also in newly defined few-shot learning setups. Based on the evaluation of the available state-of-the-art models, we find that translate-test transfer is superior to zero-shot transfer and that few-shot learning is hard to harness for many tasks. Moreover, downstream performance is partially explained by the amount of available unlabelled textual data for pretraining, and only weakly by the typological distance of target-source languages. We hope to encourage future research efforts in this area by releasing the benchmark to the community.

2022-07-18

ICML (Accept for Short Presentation)

proceedings.mlr.press

Few-shot Question Generation for Personalized Feedback in Intelligent Tutoring Systems

Devang Kulshreshtha

Muhammad Shayan

Robert Belfer

Iulian V. Serban

Ekaterina Kochmar

2022-06-07

arXiv (preprint)

Compositional Generalization in Dependency Parsing

Emily Goodwin

Timothy J. O'Donnell

Dzmitry Bahdanau

Compositionality -- the ability to combine familiar units like words into novel phrases and sentences -- has been the focus of intense interest in artificial intelligence in recent years. To test compositional generalization in semantic parsing, Keysers et al. (2020) introduced Compositional Freebase Queries (CFQ). This dataset maximizes the similarity between the test and train distributions over primitive units, like words, while maximizing the compound divergence: the dissimilarity between test and train distributions over larger structures, like phrases. Dependency parsing, however, lacks a compositional generalization benchmark. In this work, we introduce a gold-standard set of dependency parses for CFQ, and use this to analyze the behavior of a state-of-the-art dependency parser (Qi et al., 2020) on the CFQ dataset. We find that increasing compound divergence degrades dependency parsing performance, although not as dramatically as semantic parsing performance. Additionally, we find the performance of the dependency parser does not uniformly degrade relative to compound divergence, and the parser performs differently on different splits with the same compound divergence. We explore a number of hypotheses for what causes the non-uniform degradation in dependency parsing performance, and identify a number of syntactic structures that drive the dependency parser's lower performance on the most challenging splits.

2022-04-30

Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (published)

An Empirical Survey of the Effectiveness of Debiasing Techniques for Pre-trained Language Models

Nicholas Meade

Elinor Poole-Dayan

Recent work has shown pre-trained language models capture social biases from the large amounts of text they are trained on. This has attracted attention to developing techniques that mitigate such biases. In this work, we perform an empirical survey of five recently proposed bias mitigation techniques: Counterfactual Data Augmentation (CDA), Dropout, Iterative Nullspace Projection, Self-Debias, and SentenceDebias. We quantify the effectiveness of each technique using three intrinsic bias benchmarks while also measuring the impact of these techniques on a model’s language modeling ability, as well as its performance on downstream NLU tasks. We experimentally find that: (1) Self-Debias is the strongest debiasing technique, obtaining improved scores on all bias benchmarks; (2) current debiasing techniques perform less consistently when mitigating non-gender biases; and (3) improvements on bias benchmarks such as StereoSet and CrowS-Pairs by using debiasing strategies are often accompanied by a decrease in language modeling ability, making it difficult to determine whether the bias mitigation was effective.

2022-04-30

Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (published)

Image Retrieval from Contextual Descriptions

Vibhav Vineet

Edoardo Ponti

The ability to integrate context, including perceptual and temporal cues, plays a pivotal role in grounding the meaning of a linguistic utterance. In order to measure to what extent current vision-and-language models master this ability, we devise a new multimodal challenge, Image Retrieval from Contextual Descriptions (ImageCoDe). In particular, models are tasked with retrieving the correct image from a set of 10 minimally contrastive candidates based on a contextual description. As such, each description contains only the details that help distinguish between images. Because of this, descriptions tend to be complex in terms of syntax and discourse and require drawing pragmatic inferences. Images are sourced from both static pictures and video frames. We benchmark several state-of-the-art models, including both cross-encoders such as ViLBERT and bi-encoders such as CLIP, on ImageCoDe. Our results reveal that these models dramatically lag behind human performance: the best variant achieves an accuracy of 20.9 on video frames and 59.4 on static pictures, compared with 90.8 in humans. Furthermore, we experiment with new model variants that are better equipped to incorporate visual and temporal context into their representations, which achieve modest gains. Our hope is that ImageCoDe will foster progress in grounded language understanding by encouraging models to focus on fine-grained visual differences.

2022-04-30

Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (published)

The Power of Prompt Tuning for Low-Resource Semantic Parsing

Nathan Schucher

Harm de Vries

Prompt tuning has recently emerged as an effective method for adapting pre-trained language models to a number of language understanding and generation tasks. In this paper, we investigate prompt tuning for semantic parsing, the task of mapping natural language utterances onto formal meaning representations. On the low-resource splits of Overnight and TOPv2, we find that a prompt tuned T5-xl significantly outperforms its fine-tuned counterpart, as well as strong GPT-3 and BART baselines. We also conduct ablation studies across different model scales and target representations, finding that, with increasing model scale, prompt tuned T5 models improve at generating target representations that are far from the pre-training distribution.

2022-04-30

Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) (published)