Publications

A flaw in using pre-trained pLLMs in protein-protein interaction inference models
With the growing pervasiveness of pre-trained protein large language models (pLLMs), pLLM-based methods are increasingly being put forward for the protein-protein interaction (PPI) inference task. Here, we identify and confirm that existing pre-trained pLLMs are a source of data leakage for the downstream PPI task. We characterize the extent of the data leakage problem by training and comparing small and efficient pLLMs on a dataset that controls for data leakage (“strict”) with one that does not (“non-strict”). While data leakage from pre-trained pLLMs causes measurable inflation of testing scores, we find that this does not necessarily extend to other, non-paired biological tasks such as protein keyword annotation. Further, we find no connection between the context lengths of pLLMs and the performance of pLLM-based PPI inference methods on proteins whose sequence lengths exceed those context lengths. Furthermore, we show that both pLLM-based and non-pLLM-based models fail to generalize in tasks such as predicting human-SARS-CoV-2 PPIs or the effect of point mutations on binding affinities. This study demonstrates the importance of extending existing protocols for the evaluation of pLLM-based models applied to paired biological datasets and identifies areas of weakness of current pLLMs.
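As an illustration of the kind of leakage-controlled (“strict”) evaluation the abstract contrasts with a “non-strict” one, the sketch below builds a pair-level split in which no protein from the test pairs appears in any training pair. This is a generic sketch under assumed inputs, not the paper's actual dataset construction; the function and variable names are hypothetical.

```python
import random


def strict_ppi_split(pairs, test_fraction=0.2, seed=0):
    """Split PPI pairs so that train and test share no proteins.

    pairs: list of (protein_a, protein_b) identifiers.
    Illustrative only: the paper's "strict" dataset may be built differently.
    """
    rng = random.Random(seed)
    proteins = sorted({p for pair in pairs for p in pair})
    rng.shuffle(proteins)
    n_test = int(len(proteins) * test_fraction)
    test_proteins = set(proteins[:n_test])

    # Keep only pairs whose proteins fall entirely on one side of the split;
    # pairs that straddle the boundary are discarded to avoid leakage.
    train = [p for p in pairs if p[0] not in test_proteins and p[1] not in test_proteins]
    test = [p for p in pairs if p[0] in test_proteins and p[1] in test_proteins]
    return train, test


if __name__ == "__main__":
    toy_pairs = [("P1", "P2"), ("P2", "P3"), ("P4", "P5"), ("P5", "P6")]
    train, test = strict_ppi_split(toy_pairs, test_fraction=0.5)
    print("train:", train)
    print("test:", test)
```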
Multilingual Hallucination Gaps
Cléa Chataigner
Performative Prediction on Games and Mechanism Design
Mehrnaz Mofakhami
Fernando P. Santos
Planning and Learning in Risk-Aware Restless Multi-Arm Bandits
Yossiri Adulyasak
Privacy-Preserving Group Fairness in Cross-Device Federated Learning
Sikha Pentyala
Nicola Neophytou
Anderson Nascimento
Martine De Cock
Group fairness ensures that the outcome of machine learning (ML) based decision making systems is not biased towards a certain group of people defined by a sensitive attribute such as gender or ethnicity. Achieving group fairness in Federated Learning (FL) is challenging because mitigating bias inherently requires using the sensitive attribute values of all clients, while FL is aimed precisely at protecting privacy by not giving access to the clients’ data. As we show in this paper, this conflict between fairness and privacy in FL can be resolved by combining FL with Secure Multiparty Computation (MPC) and Differential Privacy (DP). To this end, we propose a privacy-preserving approach to calculate group fairness notions in the cross-device FL setting. We then propose two bias mitigation techniques, one pre-processing and one post-processing, for cross-device FL under formal privacy guarantees, without requiring the clients to disclose their sensitive attribute values. Empirical evaluations on real-world datasets demonstrate the effectiveness of our solution for training fair and accurate ML models in federated cross-device setups with privacy guarantees for the users.
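To make the fairness notion concrete, the sketch below computes a statistical parity gap from per-client counts, with Laplace noise added to the aggregated statistics for differential privacy. It is only an illustration of the statistics involved: in the paper's setting the counts would be aggregated under MPC so that no individual client's values are revealed, which is not shown here, and all names are hypothetical.

```python
import numpy as np


def dp_statistical_parity_gap(client_counts, epsilon=1.0, seed=0):
    """Estimate |P(y_hat=1 | A=0) - P(y_hat=1 | A=1)| from noisy aggregates.

    client_counts: list of dicts with per-group counts from each client, e.g.
        {"pos_a0": 3, "n_a0": 10, "pos_a1": 5, "n_a1": 12}
    Illustration only: a real protocol would aggregate securely (MPC) rather
    than in the clear as done here.
    """
    rng = np.random.default_rng(seed)
    totals = {k: 0.0 for k in ("pos_a0", "n_a0", "pos_a1", "n_a1")}
    for counts in client_counts:
        for k in totals:
            totals[k] += counts[k]

    # Each count changes by at most 1 when one record changes (sensitivity 1),
    # so Laplace noise with scale 1/epsilon protects each count; the overall
    # privacy budget composes across the four counts.
    noisy = {k: v + rng.laplace(scale=1.0 / epsilon) for k, v in totals.items()}

    rate_a0 = noisy["pos_a0"] / max(noisy["n_a0"], 1.0)
    rate_a1 = noisy["pos_a1"] / max(noisy["n_a1"], 1.0)
    return abs(rate_a0 - rate_a1)


if __name__ == "__main__":
    clients = [
        {"pos_a0": 3, "n_a0": 10, "pos_a1": 5, "n_a1": 12},
        {"pos_a0": 7, "n_a0": 20, "pos_a1": 4, "n_a1": 15},
    ]
    print("noisy statistical parity gap:", dp_statistical_parity_gap(clients))
```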
Q-learning for Quantile MDPs: A Decomposition, Performance, and Convergence Analysis
Jia Lin Hau
Mohammad Ghavamzadeh
Marek Petrik
Representation Learning via Non-Contrastive Mutual Information
Zhaohan Daniel Guo
Bernardo Avila Pires
Dale Schuurmans
Bo Dai
On the Identifiability of Causal Abstractions
Sékou-Oumar Kaba
Causal representation learning (CRL) enhances machine learning models' robustness and generalizability by learning structural causal models associated with data-generating processes. We focus on a family of CRL methods that uses contrastive data pairs in the observable space, generated before and after a random, unknown intervention, to identify the latent causal model. Brehmer et al. (2022) showed that this is indeed possible, provided that all latent variables can be intervened on individually. However, this is a highly restrictive assumption in many systems. In this work, we instead assume interventions on arbitrary subsets of latent variables, which is more realistic. We introduce a theoretical framework that calculates the degree to which we can identify a causal model, given a set of possible interventions, up to an abstraction that describes the system at a higher level of granularity.
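For intuition about identifiability "up to an abstraction": latent variables that no available intervention ever separates can at best be recovered as a single, coarser variable. The sketch below groups latents by their membership pattern across a hypothetical set of intervention targets; it is a toy illustration of this intuition, not the framework proposed in the paper.

```python
from collections import defaultdict


def coarsest_recoverable_blocks(num_latents, intervention_targets):
    """Group latent indices that no intervention target ever distinguishes.

    Two latents land in the same block iff every target contains either both
    of them or neither; such latents can only be identified jointly, i.e. as
    one abstract variable. Toy illustration only.
    """
    blocks = defaultdict(list)
    for i in range(num_latents):
        signature = tuple(i in target for target in intervention_targets)
        blocks[signature].append(i)
    return list(blocks.values())


if __name__ == "__main__":
    # Latents 0 and 1 are only ever intervened on together, so they can only
    # be recovered as one abstract variable; 2 and 3 are separable.
    targets = [{0, 1}, {2}, {2, 3}]
    print(coarsest_recoverable_blocks(4, targets))  # [[0, 1], [2], [3]]
```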
LLMs are Greedy Agents: Effects of RL Fine-tuning on Decision-Making Abilities
Thomas Schmied
Jorg Bornschein
Jordi Grau-Moya
Markus Wulfmeier
Neural Kinematic Bases for Fluids
Yibo Liu
Paul Kry
Kenny Erleben
Sune Darkner
Teseo Schneider
Refining sequence-to-expression modelling with chromatin accessibility
Gregory Fonseca