Publications

Embedding Cultural Diversity in Prototype-based Recommender Systems
Armin Moradi
Nicola Neophytou
Florian Carichon
Popularity bias in recommender systems can increase cultural overrepresentation by favoring norms from dominant cultures and marginalizing underrepresented groups. This issue is critical for platforms offering cultural products, as they influence consumption patterns and human perceptions. In this work, we address popularity bias by identifying demographic biases within prototype-based matrix factorization methods. Using the country of origin as a proxy for cultural identity, we link this demographic attribute to popularity bias by refining the embedding space learning process. First, we propose filtering out irrelevant prototypes to improve representativity. Second, we introduce a regularization technique to enforce a uniform distribution of prototypes within the embedding space. Across four datasets, our results demonstrate a 27% reduction in the average rank of long-tail items and a 2% reduction in the average rank of items from underrepresented countries. Additionally, our model achieves a 2% improvement in HitRatio@10 compared to the state of the art, highlighting that fairness is enhanced without compromising recommendation quality. Moreover, the distribution of prototypes leads to more inclusive explanations by better aligning items with diverse prototypes.
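The abstract does not spell out the regularizer; a minimal sketch of one way to enforce a uniform spread of prototypes, using a hyperspherical uniformity loss in the style of Wang and Isola (2020), might look like the following. The function name, temperature `t`, and the weighting of the term into the training loss are assumptions, not the authors' exact formulation:

```python
import torch

def prototype_uniformity_loss(prototypes: torch.Tensor, t: float = 2.0) -> torch.Tensor:
    """Encourage K prototype embeddings (rows of a (K, d) matrix) to spread
    uniformly on the unit hypersphere; the loss is minimized by a uniform layout."""
    p = torch.nn.functional.normalize(prototypes, dim=1)  # project to unit sphere
    sq_dists = torch.cdist(p, p).pow(2)                   # (K, K) pairwise squared distances
    off_diag = ~torch.eye(p.shape[0], dtype=torch.bool)   # drop self-distances
    return torch.log(torch.exp(-t * sq_dists[off_diag]).mean())

# Hypothetical usage inside a matrix-factorization training step:
prototypes = torch.randn(32, 64, requires_grad=True)
reg = prototype_uniformity_loss(prototypes)
# total_loss = recommendation_loss + lambda_reg * reg
```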
Enabling Realtime Reinforcement Learning at Scale with Staggered Asynchronous Inference
Matthew Riemer
Gopeshh Raaj Subbaraj
Realtime environments change even as agents perform action inference and learning, thus requiring high interaction frequencies to effectively minimize regret. However, recent advances in machine learning involve larger neural networks with longer inference times, raising questions about their applicability in realtime systems where reaction time is crucial. We present an analysis of lower bounds on regret in realtime reinforcement learning (RL) environments to show that minimizing long-term regret is generally impossible within the typical sequential interaction and learning paradigm, but often becomes possible when sufficient asynchronous compute is available. We propose novel algorithms for staggering asynchronous inference processes to ensure that actions are taken at consistent time intervals, and demonstrate that the use of models with high action inference times is constrained only by the environment's effective stochasticity over the inference horizon, not by action frequency. Our analysis shows that the number of inference processes needed scales linearly with inference time, enabling the use of models multiple orders of magnitude larger than existing approaches when learning from a realtime simulation of Game Boy games such as Pokémon and Tetris.
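To make the staggering idea concrete: if one forward pass takes T seconds, N asynchronous inference processes started T/N apart emit one action every T/N seconds. The toy sketch below demonstrates the spacing; the thread counts, timings, and `slow_policy` stand-in are illustrative, not the paper's algorithms:

```python
import threading
import time

INFERENCE_TIME = 0.4  # assumed seconds per forward pass
NUM_WORKERS = 4       # actions then arrive every 0.1 s
start = time.monotonic()

def slow_policy(obs):
    time.sleep(INFERENCE_TIME)  # stand-in for a large model's inference
    return obs

def worker(i):
    time.sleep(i * INFERENCE_TIME / NUM_WORKERS)  # staggered start
    for _ in range(3):
        slow_policy(None)
        print(f"worker {i} acted at t={time.monotonic() - start:.2f}s")

threads = [threading.Thread(target=worker, args=(i,)) for i in range(NUM_WORKERS)]
for th in threads: th.start()
for th in threads: th.join()
# Actions land roughly every 0.1 s even though each pass takes 0.4 s.
```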
What makes a good metric? Evaluating automatic metrics for text-to-image consistency
Candace Ross
Melissa Hall
Adina Williams
Language models are increasingly being incorporated as components in larger AI systems for various purposes, from prompt optimization to automatic evaluation. In this work, we analyze the construct validity of four recent, commonly used methods for measuring text-to-image consistency (CLIPScore, TIFA, VPEval, and DSG), which rely on language models and/or VQA models as components. We define construct validity for text-image consistency metrics as a set of desiderata such metrics should satisfy, and find that no tested metric satisfies all of them. We find that the metrics lack sufficient sensitivity to both language and visual properties. Next, we find that TIFA, VPEval, and DSG contribute novel information above and beyond CLIPScore, but also that they correlate highly with each other. We also ablate different aspects of the text-image consistency metrics and find that not all model components are strictly necessary, which is itself a symptom of insufficient sensitivity to visual information. Finally, we show that all three VQA-based metrics likely rely on familiar text shortcuts (such as yes-bias in QA) that call into question their aptitude as quantitative evaluations of model performance.
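For reference, CLIPScore, the simplest of the four metrics, rescales the cosine similarity between CLIP's image and text embeddings. A minimal sketch using Hugging Face's CLIP (the checkpoint and the 2.5 rescaling follow Hessel et al., 2021; this is not the authors' evaluation harness) is:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, caption: str) -> float:
    inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # image_embeds and text_embeds are L2-normalized, so their dot product
    # is the cosine similarity.
    cos = (out.image_embeds * out.text_embeds).sum().item()
    return 2.5 * max(cos, 0.0)  # CLIPScore's rescaling
```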
Hint Marginalization for Improved Reasoning in Large Language Models
Soumyasundar Pal
Didier Chételat
Yingxue Zhang
Large Language Models (LLMs) have exhibited an impressive capability to perform reasoning tasks, especially when encouraged to generate a sequence of intermediate steps. Reasoning performance can be improved by suitably combining multiple LLM responses, generated either in parallel in a single query or via sequential interactions with LLMs throughout the reasoning process. Existing combination strategies, such as self-consistency and progressive-hint prompting, make inefficient use of the LLM responses. We present Hint Marginalization, a novel and principled algorithmic framework to enhance the reasoning capabilities of LLMs. Our approach can be viewed as an iterative sampling strategy for forming a Monte Carlo approximation of an underlying distribution of answers, with the goal of identifying the mode, i.e., the most likely answer. Empirical evaluation on several benchmark datasets for arithmetic reasoning demonstrates the superiority of the proposed approach.
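A minimal sketch of the loop as I read the abstract: sample answers over several rounds, feed the current best answer back as a hint, and return the empirical mode. The `query_llm` stub and the equal weighting of rounds are assumptions; the paper's weighting scheme is presumably more principled:

```python
import random
from collections import Counter
from typing import Optional

def query_llm(question: str, hint: Optional[str] = None) -> str:
    # Stand-in for a real LLM call returning a final answer string;
    # a real implementation would include the hint in the prompt.
    return random.choice(["42", "42", "41"])

def hint_marginalization(question: str, rounds: int = 3, samples: int = 8) -> str:
    counts = Counter()
    hint = None
    for _ in range(rounds):
        for _ in range(samples):
            counts[query_llm(question, hint)] += 1
        hint = counts.most_common(1)[0][0]  # current mode becomes the next hint
    return counts.most_common(1)[0][0]      # Monte Carlo estimate of the mode

print(hint_marginalization("What is 6 * 7?"))
```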
Leveraging Data Characteristics for Bug Localization in Deep Learning Programs
Ruchira Manke
Mohammad Wardat
Hridesh Rajan
Deep Learning (DL) is a class of machine learning algorithms used in a wide variety of applications. Like any software system, DL programs can have bugs. Most bugs that arise from an improper model structure, known as structural bugs, lead to inadequate performance during training, making it challenging for developers to identify the root cause and address them. To support bug detection and localization in DL programs, in this paper, we propose Theia, which detects and localizes structural bugs in DL programs. Unlike previous work, Theia considers the characteristics of the training dataset to automatically detect bugs in DL programs developed using two deep learning libraries, Keras and PyTorch. Since training DL models is a time-consuming process, Theia detects these bugs at the beginning of the training process and alerts the developer with informative messages containing the bug's location and actionable fixes, helping them improve the structure of the model. We evaluated Theia on a benchmark of 40 real-world buggy DL programs obtained from Stack Overflow. Our results show that Theia successfully localizes 57/75 structural bugs in the 40 buggy programs, whereas NeuraLint, a state-of-the-art approach capable of localizing structural bugs before training, localizes only 17/75.
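Theia's actual rule set is described in the paper; the sketch below only illustrates the kind of pre-training check it performs, comparing model structure against dataset characteristics. The specific rules, messages, and Keras-only scope here are my own simplifications:

```python
import numpy as np
from tensorflow import keras

def check_structure(model: keras.Model, x: np.ndarray, y: np.ndarray) -> list:
    """Flag structural mismatches between a built Keras model and its data."""
    issues = []
    n_classes = len(np.unique(y))
    last = model.layers[-1]
    units = getattr(last, "units", None)
    if units is not None and n_classes > 2 and units != n_classes:
        issues.append(f"Last layer has {units} units but data has "
                      f"{n_classes} classes; set units={n_classes}.")
    if tuple(model.input_shape[1:]) != x.shape[1:]:
        issues.append(f"Model expects input shape {model.input_shape[1:]} "
                      f"but data samples have shape {x.shape[1:]}.")
    act = getattr(getattr(last, "activation", None), "__name__", "")
    if n_classes > 2 and act == "sigmoid":
        issues.append("Multi-class labels with a sigmoid output; use softmax.")
    return issues  # reported before training starts, as Theia does
```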
Continuously Learning Bug Locations
Paulina Stevia Nouwou Mindom
Léuson M. P. Da Silva
Amin Nikanjam
Automatically locating buggy changesets associated with bug reports is crucial in the software development process. Deep Learning (DL)-based techniques show promising results by leveraging structural information from the code and learning links between changesets and bug reports. However, since the source code associated with changesets evolves, the performance of such models tends to degrade over time due to concept drift. To address this challenge, in this paper, we evaluate the potential of using Continual Learning (CL) techniques in a multiple sub-task setting for bug localization (each sub-task operating on either stationary or non-stationary data), comparing them against a bug localization technique that leverages the BERT model, a deep reinforcement learning-based technique that leverages the A2C algorithm, and a DL-based function-level interaction model for semantic bug localization. Additionally, we enhance the CL techniques by using logistic regression to identify and integrate the most significant bug-inducing factors. Our empirical evaluation across seven widely used software projects shows that CL techniques outperform DL-based techniques by up to 61% in Mean Reciprocal Rank (MRR), 44% in Mean Average Precision (MAP), 83% in top@1, 56% in top@5, and 66% in top@10 in the non-stationary setting. Further, we show that the CL techniques we studied are effective at localizing changesets relevant to a bug report, mitigate catastrophic forgetting across the studied tasks, and require up to 5x less computational effort during training. Our findings demonstrate the potential of adopting CL for bug localization in non-stationary settings, and we hope they help improve bug localization activities in software engineering using CL techniques.
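The logistic-regression step mentioned above can be sketched as follows: fit a classifier on changeset features and rank candidate bug-inducing factors by coefficient magnitude. The feature names and synthetic data are invented for illustration; the paper's factors and their integration into the CL techniques are more elaborate:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

FEATURES = ["lines_added", "lines_deleted", "files_touched",
            "author_experience", "hour_of_day"]  # hypothetical factors

rng = np.random.default_rng(0)
X = rng.normal(size=(500, len(FEATURES)))  # stand-in changeset features
y = (X[:, 0] + 0.5 * X[:, 2] + rng.normal(size=500) > 0.8).astype(int)

clf = LogisticRegression().fit(X, y)
ranked = sorted(zip(FEATURES, clf.coef_[0]), key=lambda t: abs(t[1]), reverse=True)
for name, coef in ranked:
    print(f"{name:20s} {coef:+.3f}")
# Top-ranked factors would then be fed into the CL techniques.
```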