Sivan Milton

Alumni

Publications

FaithDial: A Faithful Benchmark for Information-Seeking Dialogue

Nouha Dziri

Ehsan Kamalloo

Sivan Milton

Osmar Zaiane

Mo Yu

Edoardo M. Ponti

Siva Reddy

The goal of information-seeking dialogue is to respond to seeker queries with natural language utterances that are grounded on knowledge sources. However, dialogue systems often produce unsupported utterances, a phenomenon known as hallucination. To mitigate this behavior, we adopt a data-centric solution and create FaithDial, a new benchmark for hallucination-free dialogues, by editing hallucinated responses in the Wizard of Wikipedia (WoW) benchmark. We observe that FaithDial is more faithful than WoW while also maintaining engaging conversations. We show that FaithDial can serve as training signal for: i) a hallucination critic, which discriminates whether an utterance is faithful or not, and boosts the performance by 12.8 F1 score on the BEGIN benchmark compared to existing datasets for dialogue coherence; ii) high-quality dialogue generation. We benchmark a series of state-of-the-art models and propose an auxiliary contrastive objective that achieves the highest level of faithfulness and abstractiveness based on several automated metrics. Further, we find that the benefits of FaithDial generalize to zero-shot transfer on other datasets, such as CMU-Dog and TopicalChat. Finally, human evaluation reveals that responses generated by models trained on FaithDial are perceived as more interpretable, cooperative, and engaging.

2022-12-22

Transactions of the Association for Computational Linguistics (published)

doi.org

arxiv.org

On the Origin of Hallucinations in Conversational Models: Is it the Datasets or the Models?

Nouha Dziri

Sivan Milton

Mo Yu

Osmar Zaiane

Siva Reddy

Knowledge-grounded conversational models are known to suffer from producing factually invalid statements, a phenomenon commonly called hallucination. In this work, we investigate the underlying causes of this phenomenon: is hallucination due to the training data, or to the models? We conduct a comprehensive human study on both existing knowledge-grounded conversational benchmarks and several state-of-the-art models. Our study reveals that the standard benchmarks consist of >60% hallucinated responses, leading to models that not only hallucinate but even amplify hallucinations. Our findings raise important questions on the quality of existing datasets and models trained using them. We make our annotations publicly available for future research.

2021-12-31

arXiv (preprint)

doi.org

arxiv.org
