Siva Reddy

aristides.milios@mila.quebec

Arkil Patel

PhD - McGill University

Principal supervisor:

arkil.patel@mila.quebec

PhD - McGill University

benno.krojer@mila.quebec

gaurav.kamath@mila.quebec

Gaurav Kamath

PhD - McGill University

Karolina Ewa Stańczak

Postdoctorate - McGill University

karolina.stanczak@mila.quebec

PhD - McGill University

laurestine.bradford@mila.quebec

Marius Mosbach

Postdoctorate - McGill University

marius.mosbach@mila.quebec

nicholas.meade@mila.quebec

Nicholas Meade

PhD - McGill University

Github

parishad.behnamghader@mila.quebec

Parishad BehnamGhader

Research Master's - McGill University

Research collaborator

spandana.gella@mila.quebec

vaibhav.adlakha@mila.quebec

Vaibhav Adlakha

PhD - McGill University

Xing Han Lu

PhD - McGill University

PhD - None

zdenek.kasner@mila.quebec

Zichao Li

PhD - McGill University

Co-supervisor:

Jackie Cheung

zichao.li@mila.quebec

Publications

MAGNIFICo: Evaluating the In-Context Learning Ability of Large Language Models to Generalize to Novel Interpretations

Arkil Patel

Satwik Bhattamishra

2023-01-01

EMNLP (published)

openreview.net

Syntactic Substitutability as Unsupervised Dependency Syntax

Jasper Jian

2023-01-01

EMNLP (published)

openreview.net

The StatCan Dialogue Dataset: Retrieving Data Tables through Conversations with Genuine Intents

Xing Han Lu

Harm de Vries

2023-01-01

EACL (published)

FaithDial: A Faithful Benchmark for Information-Seeking Dialogue

Nouha Dziri

Ehsan Kamalloo

Sivan Milton

Osmar Zaiane

Mo Yu

Edoardo Ponti

The goal of information-seeking dialogue is to respond to seeker queries with natural language utterances that are grounded on knowledge sources. However, dialogue systems often produce unsupported utterances, a phenomenon known as hallucination. To mitigate this behavior, we adopt a data-centric solution and create FaithDial, a new benchmark for hallucination-free dialogues, by editing hallucinated responses in the Wizard of Wikipedia (WoW) benchmark. We observe that FaithDial is more faithful than WoW while also maintaining engaging conversations. We show that FaithDial can serve as training signal for: i) a hallucination critic, which discriminates whether an utterance is faithful or not, and boosts the performance by 12.8 F1 score on the BEGIN benchmark compared to existing datasets for dialogue coherence; ii) high-quality dialogue generation. We benchmark a series of state-of-the-art models and propose an auxiliary contrastive objective that achieves the highest level of faithfulness and abstractiveness based on several automated metrics. Further, we find that the benefits of FaithDial generalize to zero-shot transfer on other datasets, such as CMU-Dog and TopicalChat. Finally, human evaluation reveals that responses generated by models trained on FaithDial are perceived as more interpretable, cooperative, and engaging.

2022-12-23

Transactions of the Association for Computational Linguistics (published)

Post-hoc Interpretability for Neural NLP: A Survey

Andreas Madsen

Sarath Chandar Anbil Parthipan

2022-12-23

ACM Computing Surveys (published)

Evaluating the Faithfulness of Importance Measures in NLP by Recursively Masking Allegedly Important Tokens and Retraining

Andreas Madsen

Nicholas Meade

Vaibhav Adlakha

To explain NLP models, a popular approach is to use importance measures, such as attention, which indicate which input tokens are important for making a prediction. However, an open question is how accurately these explanations reflect a model's logic, a property called faithfulness. To answer this question, we propose Recursive ROAR, a new faithfulness metric. This works by recursively masking allegedly important tokens and then retraining the model. The principle is that this should result in worse model performance compared to masking random tokens. The result is a performance curve as a function of the masking ratio. Furthermore, we propose a summarizing metric using relative area-between-curves (RACU), which allows for easy comparison across papers, models, and tasks. We evaluate 4 different importance measures on 8 different datasets, using both LSTM-attention models and RoBERTa models. We find that the faithfulness of importance measures is both model-dependent and task-dependent. This conclusion contradicts previous evaluations in both computer vision and faithfulness of attention literature.

2022-12-01

Findings of the Association for Computational Linguistics: EMNLP 2022 (published)

Does Entity Abstraction Help Generative Transformers Reason?

Nicolas Gontier

Chris Pal

We study the utility of incorporating entity type abstractions into pre-trained Transformers and test these methods on four NLP tasks requiring different forms of logical reasoning: (1) compositional language understanding with text-based relational reasoning (CLUTRR), (2) abductive reasoning (ProofWriter), (3) multi-hop question answering (HotpotQA), and (4) conversational question answering (CoQA). We propose and empirically explore three ways to add such abstraction: (i) as additional input embeddings, (ii) as a separate sequence to encode, and (iii) as an auxiliary prediction task for the model. Overall, our analysis demonstrates that models with abstract entity knowledge perform better than those without it. The best abstraction-aware models achieved an overall accuracy of 88.8% and 91.8%, compared to the baseline model achieving 62.9% and 89.8%, on CLUTRR and ProofWriter respectively. However, for HotpotQA and CoQA, we find that F1 scores improve by only 0.5% on average. Our results suggest that the benefit of explicit abstraction is significant in formally defined logical reasoning settings requiring many reasoning hops, but point to the notion that it is less beneficial for NLP tasks having less formal logical structure.

2022-11-20

TMLR (accepted)

openreview.net

Few-shot Question Generation for Personalized Feedback in Intelligent Tutoring Systems

Devang Kulshreshtha

Muhammad Shayan

Robert Belfer

Iulian V. Serban

Ekaterina Kochmar

2022-06-08

ArXiv (preprint)

Compositional Generalization in Dependency Parsing

Emily D. Goodwin

Timothy O'Donnell

2022-05-01

Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (published)

An Empirical Survey of the Effectiveness of Debiasing Techniques for Pre-trained Language Models

Nicholas Meade

Elinor Poole-Dayan

Recent work has shown pre-trained language models capture social biases from the large amounts of text they are trained on. This has attracted attention to developing techniques that mitigate such biases. In this work, we perform an empirical survey of five recently proposed bias mitigation techniques: Counterfactual Data Augmentation (CDA), Dropout, Iterative Nullspace Projection, Self-Debias, and SentenceDebias. We quantify the effectiveness of each technique using three intrinsic bias benchmarks while also measuring the impact of these techniques on a model’s language modeling ability, as well as its performance on downstream NLU tasks. We experimentally find that: (1) Self-Debias is the strongest debiasing technique, obtaining improved scores on all bias benchmarks; (2) current debiasing techniques perform less consistently when mitigating non-gender biases; and (3) improvements on bias benchmarks such as StereoSet and CrowS-Pairs by using debiasing strategies are often accompanied by a decrease in language modeling ability, making it difficult to determine whether the bias mitigation was effective.

2022-05-01

Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (published)

Image Retrieval from Contextual Descriptions

Benno Krojer

Vaibhav Adlakha

Vibhav Vineet

Yash Goyal

Edoardo Ponti

The ability to integrate context, including perceptual and temporal cues, plays a pivotal role in grounding the meaning of a linguistic utterance. In order to measure to what extent current vision-and-language models master this ability, we devise a new multimodal challenge, Image Retrieval from Contextual Descriptions (ImageCoDe). In particular, models are tasked with retrieving the correct image from a set of 10 minimally contrastive candidates based on a contextual description. As such, each description contains only the details that help distinguish between images. Because of this, descriptions tend to be complex in terms of syntax and discourse and require drawing pragmatic inferences. Images are sourced from both static pictures and video frames. We benchmark several state-of-the-art models, including both cross-encoders such as ViLBERT and bi-encoders such as CLIP, on ImageCoDe. Our results reveal that these models dramatically lag behind human performance: the best variant achieves an accuracy of 20.9 on video frames and 59.4 on static pictures, compared with 90.8 in humans. Furthermore, we experiment with new model variants that are better equipped to incorporate visual and temporal context into their representations, which achieve modest gains. Our hope is that ImageCoDe will foster progress in grounded language understanding by encouraging models to focus on fine-grained visual differences.

2022-05-01

Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (published)

The Power of Prompt Tuning for Low-Resource Semantic Parsing

Nathan Schucher

Harm de Vries

Prompt tuning has recently emerged as an effective method for adapting pre-trained language models to a number of language understanding and generation tasks. In this paper, we investigate prompt tuning for semantic parsing, the task of mapping natural language utterances onto formal meaning representations. On the low-resource splits of Overnight and TOPv2, we find that a prompt-tuned T5-xl significantly outperforms its fine-tuned counterpart, as well as strong GPT-3 and BART baselines. We also conduct ablation studies across different model scales and target representations, finding that, with increasing model scale, prompt-tuned T5 models improve at generating target representations that are far from the pre-training distribution.

2022-05-01

Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) (published)