Publications

Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping
Lucas Lehnert
Sainbayar Sukhbaatar
DiJia Su
Paul McVay
Qinqing Zheng
Yuandong Tian
While Transformers have enabled tremendous progress in various application settings, such architectures still lag behind traditional symbolic planners for solving complex decision making tasks. In this work, we demonstrate how to train Transformers to solve complex planning tasks. This is accomplished by training an encoder-decoder Transformer model to predict the _search dynamics_ of the
Chain of Targeted Verification Questions to Improve the Reliability of Code Generated by LLMs
Sylvain Kouemo Ngassom
Arghavan Moradi Dakhel
Florian Tambon
Decompose and Compare Consistency: Measuring VLMs' Answer Reliability via Task-Decomposition Consistency Comparison
Qian Yang
Weixiang Yan
Despite tremendous advancements, current state-of-the-art Vision-Language Models (VLMs) are still far from perfect. They tend to hallucinate and may generate biased responses. In such circumstances, having a way to assess the reliability of a given response generated by a VLM is quite useful. Existing methods, such as estimating uncertainty using answer likelihoods or prompt-based confidence generation, often suffer from overconfidence. Other methods use self-consistency comparison but are affected by confirmation biases. To alleviate these, we propose Decompose and Compare Consistency (DeCC) for reliability measurement. By comparing the consistency between the direct answer generated using the VLM's internal reasoning process and the indirect answers obtained by decomposing the question into sub-questions and reasoning over the sub-answers produced by the VLM, DeCC measures the reliability of the VLM's direct answer. Experiments across six vision-language tasks with three VLMs show that DeCC's reliability estimation achieves better correlation with task accuracy than existing methods.
Guiding Language Model Reasoning with Planning Tokens
Xinyi Wang
Lucas Caccia
Oleksiy Ostapenko
Xingdi Yuan
William Yang Wang
Large language models (LLMs) have recently attracted considerable interest for their ability to perform complex reasoning tasks, such as chain-of-thought (CoT) reasoning. However, most of the existing approaches to enhance this ability rely heavily on data-driven methods, while neglecting the structural aspects of the model's reasoning capacity. To encourage a more structural generation of CoT steps, we propose a hierarchical generation scheme: we let the LM generate a planning token at the start of each reasoning step, intuitively serving as a high-level plan of the current step, and add their embeddings to the model parameters. Our approach requires a negligible increase in trainable parameters (0.001%) and can be applied through either full fine-tuning or a more parameter-efficient scheme. We demonstrate our method's effectiveness by applying it to three different LLMs, showing notable accuracy improvements across three math word problem datasets and one multihop QA dataset with respect to standard fine-tuning baselines.
LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders
Parishad BehnamGhader
Vaibhav Adlakha
Marius Mosbach
Redesigning Information Markets in the Era of Language Models
Martin Weiss
Nasim Rahaman
Manuel Wüthrich
Li Erran Li
Bernhard Schölkopf
Scattered Mixture-of-Experts Implementation
Shawn Tan
Yikang Shen
Rameswar Panda
ScatterMoE is an implementation of Sparse Mixture-of-Experts (SMoE) on GPUs. ScatterMoE builds upon techniques in existing implementations and overcomes some of their current limitations to improve batched inference, training speed, and memory footprint. It achieves this by avoiding padding and excessive copies of the input. We also fuse expert linear transforms and reordering operations with ParallelLinear, a module that can be used to extend the concept of SMoEs. We benchmark our implementation against Megablocks and show that it enables higher throughput and a lower memory footprint. We also show how ParallelLinear enables extension of the Mixture-of-Experts concept by demonstrating an implementation of Mixture-of-Attention.
Seeking Interpretability and Explainability in Binary Activated Neural Networks
Benjamin Leblanc
Should We Attend More or Less? Modulating Attention for Fairness
Abdelrahman Zayed
Goncalo Mordido
Samira Shabanian
A Survey on Deep Learning for Theorem Proving
Zhaoyu Li
Jialiang Sun
Logan Murphy
Qidong Su
Zenan Li
Xian Zhang
Kaiyu Yang
The black box of the relationship between breast cancer patients and accompanying patients: the accompanied patients’ point of view
Marie-Pascale Pomey
Monica Iliescu Nelea
Cécile Vialaron
Louise Normandin
Marie‐Andrée Côté
Mado Desforges
Pénélope Pomey‐Carpentier
Nesrine Adjtoutah
Israël Fortin
Isabelle Ganache
Zeev Rosberger
Danielle Charpentier
Lynda Bélanger
Michel Dorval
Djahanchah Philip Ghadiri
Mélanie Lavoie-Tremblay
Antoine Boivin
Jean-François Pelletier
Nicolas Fernandez
Alain M. Danino
Michèle de Guise
Trust No Bot: Discovering Personal Disclosures in Human-LLM Conversations in the Wild
Niloofar Mireshghallah
Maria Antoniak
Yash More
Yejin Choi
Measuring personal disclosures made in human-chatbot interactions can provide a better understanding of users' AI literacy and facilitate privacy research for large language models (LLMs). We run an extensive, fine-grained analysis on the personal disclosures made by real users to commercial GPT models, investigating the leakage of personally identifiable and sensitive information. To understand the contexts in which users disclose to chatbots, we develop a taxonomy of tasks and sensitive topics, based on qualitative and quantitative analysis of naturally occurring conversations. We discuss these potential privacy harms and observe that: (1) personally identifiable information (PII) appears in unexpected contexts such as in translation or code editing (48% and 16% of the time, respectively) and (2) PII detection alone is insufficient to capture the sensitive topics that are common in human-chatbot interactions, such as detailed sexual preferences or specific drug use habits. We believe that these high disclosure rates are of significant importance for researchers and data curators, and we call for the design of appropriate nudging mechanisms to help users moderate their interactions.