Publications

High-Order Pooling for Graph Neural Networks with Tensor Decomposition
IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages
Emanuele Bugliarello
Fangyu Liu
Jonas Pfeiffer
Desmond Elliott
Edoardo Ponti
Ivan Vulic
Reliable evaluation benchmarks designed for replicability and comprehensiveness have driven progress in machine learning. Due to the lack of a multilingual benchmark, however, vision-and-language research has mostly focused on English language tasks. To fill this gap, we introduce the Image-Grounded Language Understanding Evaluation (IGLUE) benchmark. By both aggregating pre-existing datasets and creating new ones, IGLUE brings together visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages. Our benchmark enables the evaluation of multilingual multimodal models for transfer learning, not only in a zero-shot setting, but also in newly defined few-shot learning setups. Based on the evaluation of the available state-of-the-art models, we find that translate-test transfer is superior to zero-shot transfer and that few-shot learning is hard to harness for many tasks. Moreover, downstream performance is partially explained by the amount of available unlabelled textual data for pretraining, and only weakly by the typological distance of target-source languages. We hope to encourage future research efforts in this area by releasing the benchmark to the community.
Implicit Regularization or Implicit Conditioning? Exact Risk Trajectories of SGD in High Dimensions
Elliot Paquette
Ben Adlam
Jeffrey Pennington
Stochastic gradient descent (SGD) is a pillar of modern machine learning, serving as the go-to optimization algorithm for a diverse array of problems. While the empirical success of SGD is often attributed to its computational efficiency and favorable generalization behavior, neither effect is well understood and disentangling them remains an open problem. Even in the simple setting of convex quadratic problems, worst-case analyses give an asymptotic convergence rate for SGD that is no better than full-batch gradient descent (GD), and the purported implicit regularization effects of SGD lack a precise explanation. In this work, we study the dynamics of multi-pass SGD on high-dimensional convex quadratics and establish an asymptotic equivalence to a stochastic differential equation, which we call homogenized stochastic gradient descent (HSGD), whose solutions we characterize explicitly in terms of a Volterra integral equation. These results yield precise formulas for the learning and risk trajectories, which reveal a mechanism of implicit conditioning that explains the efficiency of SGD relative to GD. We also prove that the noise from SGD negatively impacts generalization performance, ruling out the possibility of any type of implicit regularization in this context. Finally, we show how to adapt the HSGD formalism to include streaming SGD, which allows us to produce an exact prediction for the excess risk of multi-pass SGD relative to that of streaming SGD (bootstrap risk).
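As a companion illustration (not the paper's HSGD construction or Volterra-equation analysis), the sketch below contrasts multi-pass mini-batch SGD with full-batch GD on a high-dimensional least-squares problem, matching the number of passes over the data; the problem size, data model, and step sizes are illustrative assumptions.

```python
# Minimal sketch: multi-pass mini-batch SGD vs. full-batch GD on a
# high-dimensional convex quadratic (least squares). Illustrative only;
# this is not the homogenized SGD (HSGD) construction or the Volterra
# analysis from the paper, and all sizes/step sizes are arbitrary choices.
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 500                           # samples, dimension
A = rng.standard_normal((n, d))
x_star = rng.standard_normal(d)
b = A @ x_star + 0.5 * rng.standard_normal(n)

def risk(x):
    """Empirical quadratic risk 0.5 * mean((a_i^T x - b_i)^2)."""
    return 0.5 * np.mean((A @ x - b) ** 2)

# Full-batch gradient descent, step size ~ 1 / L with L the top Hessian eigenvalue.
L = np.linalg.norm(A, 2) ** 2 / n
x_gd = np.zeros(d)
for _ in range(50):                        # 50 passes over the data
    x_gd -= (1.0 / L) * (A.T @ (A @ x_gd - b)) / n

# Multi-pass mini-batch SGD with the same number of passes over the data.
x_sgd = np.zeros(d)
batch, lr = 20, 0.02
for _ in range(50):
    for idx in np.array_split(rng.permutation(n), n // batch):
        g = A[idx].T @ (A[idx] @ x_sgd - b[idx]) / len(idx)
        x_sgd -= lr * g

print(f"GD  risk after 50 passes: {risk(x_gd):.4f}")
print(f"SGD risk after 50 passes: {risk(x_sgd):.4f}")
```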
Improved DC-Free Run-Length Limited 4B6B Codes for Concatenated Schemes
Elie Ngomseu Mambou
Thibaud Tonnellier
In this letter, we introduce a class of improved DC-free 4B6B codes in terms of error correction capabilities for a serially concatenated architecture. There are billions of different codebooks that can be derived from the 16 codewords contained in the traditional 4B6B code as per the IEEE 802.15.7 standard for visible light communication (VLC). These codebooks can be classified based on distance properties, which determine their error correction performance. The traditional 4B6B code is suitable for hard-decision decoding; however, when a soft decoder is used, as in a serially concatenated architecture, that code becomes obsolete. Simulations show that the proposed 4B6B code, concatenated with forward error correction (FEC) codes, has better performance compared to state-of-the-art schemes such as the original 4B6B code, the enhanced Miller code, the Manchester code, the 5B10B code and the (0,4) 2/3 RLL code.
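To make the design space concrete, the sketch below enumerates the 20 DC-balanced 6-bit words (three ones, three zeros) from which 4B6B codebooks are drawn and inspects the pairwise Hamming-distance spectrum of one arbitrary 16-word selection; the selection shown is illustrative only and is neither the IEEE 802.15.7 table nor the improved code proposed in the letter.

```python
# Minimal sketch of the 4B6B design space: every valid codeword is a 6-bit word
# with Hamming weight 3 (DC-free), giving C(6,3) = 20 candidates; a 4B6B codebook
# picks 16 of them and assigns one to each 4-bit input. The selection below is
# purely illustrative, not the IEEE 802.15.7 table nor the code from the letter.
from itertools import combinations

def weight3_words():
    """All 6-bit words with exactly three ones, as integers."""
    return [sum(1 << i for i in ones) for ones in combinations(range(6), 3)]

def hamming(a, b):
    return bin(a ^ b).count("1")

candidates = weight3_words()
print(f"{len(candidates)} DC-balanced 6-bit codewords available")

# Take an arbitrary 16-word subset as a toy codebook and inspect its distances,
# the kind of property used to classify codebooks by error-correction behaviour.
codebook = candidates[:16]
dists = [hamming(a, b) for a, b in combinations(codebook, 2)]
print(f"minimum pairwise Hamming distance: {min(dists)}")
print(f"distance spectrum: { {d: dists.count(d) for d in sorted(set(dists))} }")
```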
Investigating the Performance of Transformer-Based NLI Models on Presuppositional Inferences
Jad Kabbara
Presuppositions are assumptions that are taken for granted by an utterance, and identifying them is key to a pragmatic interpretation of language. In this paper, we investigate the capabilities of transformer models to perform NLI on cases involving presupposition. First, we present simple heuristics to create alternative “contrastive” test cases based on the ImpPres dataset and investigate the model performance on those test cases. Second, to better understand how the model is making its predictions, we analyze samples from sub-datasets of ImpPres and examine model performance on them. Overall, our findings suggest that NLI-trained transformer models seem to be exploiting specific structural and lexical cues as opposed to performing some kind of pragmatic reasoning.
KNIFE: Kernelized-Neural Differential Entropy Estimation
Georg Pichler
Pierre Colombo
Malik Boudiaf
Gunther Koliander
Mutual Information (MI) has been widely used as a loss regularizer for training neural networks. This has been particularly effective when learning disentangled or compressed representations of high-dimensional data. However, differential entropy (DE), another fundamental measure of information, has not found widespread use in neural network training. Although DE offers a potentially wider range of applications than MI, off-the-shelf DE estimators are either non-differentiable, computationally intractable, or fail to adapt to changes in the underlying distribution. These drawbacks prevent them from being used as regularizers in neural network training. To address shortcomings in previously proposed estimators for DE, here we introduce KNIFE, a fully parameterized, differentiable kernel-based estimator of DE. The flexibility of our approach also allows us to construct KNIFE-based estimators for conditional (on either discrete or continuous variables) DE, as well as MI. We empirically validate our method on high-dimensional synthetic data and further apply it to guide the training of neural networks for real-world tasks. Our experiments on a large variety of tasks, including visual domain adaptation, textual fair classification, and textual fine-tuning, demonstrate the effectiveness of KNIFE-based estimation. Code can be found at https://github.com/g-pichler/knife.
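The sketch below illustrates the general idea of a fully parameterized, differentiable kernel estimator of differential entropy: a Gaussian-mixture density whose centers, bandwidths, and weights are fit by maximum likelihood, after which DE is estimated as the average negative log-density. It is a simplified stand-in written for this summary, not the reference implementation linked above, and the sizes and training settings are assumptions.

```python
# Minimal sketch of a differentiable kernel-based differential-entropy estimator:
# fit a Gaussian-mixture density q_theta by maximum likelihood, then estimate
# DE(X) as -E[log q_theta(X)]. Simplified illustration, not the KNIFE reference
# implementation linked above; all sizes and settings are assumptions.
import math
import torch

class KernelEntropyEstimator(torch.nn.Module):
    def __init__(self, dim, n_kernels=64):
        super().__init__()
        self.centers = torch.nn.Parameter(torch.randn(n_kernels, dim))
        self.log_bw = torch.nn.Parameter(torch.zeros(n_kernels, dim))   # log bandwidths
        self.logits = torch.nn.Parameter(torch.zeros(n_kernels))        # mixture weights

    def log_density(self, x):                       # x: (batch, dim)
        diff = x.unsqueeze(1) - self.centers        # (batch, K, dim)
        var = torch.exp(2 * self.log_bw)            # (K, dim)
        log_comp = -0.5 * ((diff ** 2) / var + 2 * self.log_bw
                           + math.log(2 * math.pi)).sum(-1)             # (batch, K)
        log_w = torch.log_softmax(self.logits, dim=0)
        return torch.logsumexp(log_w + log_comp, dim=1)

    def entropy(self, x):
        return -self.log_density(x).mean()          # differentiable DE estimate

# Toy check against a known Gaussian: H = 0.5 * d * log(2*pi*e*sigma^2).
torch.manual_seed(0)
d, sigma = 4, 2.0
data = sigma * torch.randn(20000, d)
est = KernelEntropyEstimator(d)
opt = torch.optim.Adam(est.parameters(), lr=1e-2)
for _ in range(500):
    batch = data[torch.randint(len(data), (512,))]
    loss = est.entropy(batch)                       # NLL == entropy estimate
    opt.zero_grad(); loss.backward(); opt.step()

true_h = 0.5 * d * math.log(2 * math.pi * math.e * sigma ** 2)
print(f"estimated DE: {est.entropy(data).item():.3f}   true DE: {true_h:.3f}")
```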
Lazy vs hasty: linearization in deep networks impacts learning schedule based on example difficulty
Thomas George
Aristide Baratin
Among attempts at giving a theoretical account of the success of deep neural networks, a recent line of work has identified a so-called `lazy' training regime in which the network can be well approximated by its linearization around initialization. Here we investigate the comparative effect of the lazy (linear) and feature learning (non-linear) regimes on subgroups of examples based on their difficulty. Specifically, we show that easier examples are given more weight in feature learning mode, resulting in faster training compared to more difficult ones. In other words, the non-linear dynamics tends to sequentialize the learning of examples of increasing difficulty. We illustrate this phenomenon across different ways to quantify example difficulty, including c-score, label noise, and in the presence of easy-to-learn spurious correlations. Our results reveal a new understanding of how deep networks prioritize resources across example difficulty.
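As a rough illustration of the two regimes (a toy setup assumed here, not the paper's experiments), the sketch below approximates lazy training with the standard output-scaling trick, where multiplying the centered network output by a large factor alpha keeps the weights near initialization, and compares error on synthetically defined "easy" (large-margin) and "hard" (small-margin) examples.

```python
# Minimal sketch contrasting "lazy" (approximately linearized) and standard
# "feature learning" training via the output-scaling trick: the centered output
# alpha * (f(x; w) - f(x; w0)) with large alpha keeps w close to initialization.
# The data model, "easy"/"hard" split, and all hyperparameters are illustrative
# assumptions; this is not the paper's experimental setup.
import copy
import torch

torch.manual_seed(0)
n, d = 512, 20
X = torch.randn(n, d)
w_star = torch.randn(d)
margin = X @ w_star
y = (margin > 0).float()
easy = margin.abs() > 1.0                   # large-margin examples, assumed "easy"
hard = ~easy

def make_net():
    return torch.nn.Sequential(torch.nn.Linear(d, 256), torch.nn.ReLU(),
                               torch.nn.Linear(256, 1))

def train(alpha, steps=200, lr=0.5):
    net = make_net()
    net0 = copy.deepcopy(net)               # frozen copy used for centering
    for p in net0.parameters():
        p.requires_grad_(False)
    opt = torch.optim.SGD(net.parameters(), lr=lr / alpha ** 2)
    for _ in range(steps):
        out = alpha * (net(X) - net0(X)).squeeze(-1)
        loss = torch.nn.functional.binary_cross_entropy_with_logits(out, y)
        opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():
        pred = (alpha * (net(X) - net0(X)).squeeze(-1) > 0).float()
    return (pred[easy] != y[easy]).float().mean(), (pred[hard] != y[hard]).float().mean()

for alpha in (1.0, 100.0):                  # feature-learning vs. (near-)lazy regime
    e_err, h_err = train(alpha)
    print(f"alpha={alpha:6.1f}   easy-group error={e_err:.3f}   hard-group error={h_err:.3f}")
```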
On Learning Fairness and Accuracy on Multiple Subgroups
Changjian Shui
Gezheng Xu
Qi CHEN
Jiaqi Li
Charles Ling
Boyu Wang
We propose an analysis in fair learning that preserves the utility of the data while reducing prediction disparities under the criterion of group sufficiency. We focus on the scenario where the data contains multiple or even many subgroups, each with a limited number of samples. As a result, we present a principled method for learning a fair predictor for all subgroups by formulating it as a bilevel objective. Specifically, the subgroup-specific predictors are learned in the lower level using a small amount of data and the fair predictor. In the upper level, the fair predictor is updated to be close to all subgroup-specific predictors. We further prove that such a bilevel objective can effectively control the group sufficiency and generalization error. We evaluate the proposed framework on real-world datasets. Empirical evidence suggests consistently improved fair predictions, as well as accuracy comparable to the baselines.
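The sketch below gives a simplified reading of the bilevel idea on a toy logistic-regression problem: each subgroup predictor is fit on its own small sample while being pulled toward a shared fair predictor, and the fair predictor is then moved toward the subgroup predictors. The penalty form, optimizer, and data are assumptions made for illustration, not the authors' exact objective or guarantees.

```python
# Minimal sketch of a bilevel-style scheme for many small subgroups:
#   lower level: each subgroup predictor minimizes its own loss plus a proximity
#                penalty toward the shared ("fair") predictor;
#   upper level: the shared predictor moves toward the subgroup predictors.
# Toy data, penalty form, and step sizes are illustrative assumptions only.
import torch

torch.manual_seed(0)
d, n_groups, n_per_group = 10, 8, 30
groups = []
for g in range(n_groups):
    w_g = torch.randn(d) + 0.3 * g           # mildly heterogeneous subgroups
    X = torch.randn(n_per_group, d)
    y = (X @ w_g > 0).float()
    groups.append((X, y))

fair_w = torch.zeros(d, requires_grad=True)
sub_ws = [torch.zeros(d, requires_grad=True) for _ in range(n_groups)]
lam, inner_steps, lr = 1.0, 20, 0.1
opt_fair = torch.optim.SGD([fair_w], lr=lr)

for outer in range(100):
    # Lower level: fit each subgroup predictor with a proximity term to fair_w.
    for (X, y), w in zip(groups, sub_ws):
        opt = torch.optim.SGD([w], lr=lr)
        for _ in range(inner_steps):
            loss = torch.nn.functional.binary_cross_entropy_with_logits(X @ w, y)
            loss = loss + 0.5 * lam * (w - fair_w.detach()).pow(2).sum()
            opt.zero_grad(); loss.backward(); opt.step()
    # Upper level: pull the fair predictor toward all subgroup predictors.
    upper = sum((fair_w - w.detach()).pow(2).sum() for w in sub_ws) / n_groups
    opt_fair.zero_grad(); upper.backward(); opt_fair.step()

with torch.no_grad():
    accs = [((X @ fair_w > 0).float() == y).float().mean().item() for X, y in groups]
print("per-subgroup accuracy of the shared predictor:", [round(a, 2) for a in accs])
```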
Learning Inter-Modal Correspondence and Phenotypes From Multi-Modal Electronic Health Records
Kejing Yin
William K. Cheung
Jonathan Poon
Non-negative tensor factorization has been shown to be a practical solution for automatically discovering phenotypes from electronic health records (EHR) with minimal human supervision. Such methods generally require an input tensor describing the inter-modal interactions to be pre-established; however, the correspondence between different modalities (e.g., between medications and diagnoses) can often be missing in practice. Although heuristic methods can be applied to estimate it, they inevitably introduce errors and lead to sub-optimal phenotype quality. This is particularly important for patients with complex health conditions (e.g., in critical care), as multiple diagnoses and medications are simultaneously present in the records. To alleviate this problem and discover phenotypes from EHR with unobserved inter-modal correspondence, we propose the collective hidden interaction tensor factorization (cHITF) to infer the correspondence between multiple modalities jointly with the phenotype discovery. We assume that the observed matrix for each modality is a marginalization of the unobserved inter-modal correspondence, which is reconstructed by maximizing the likelihood of the observed matrices. Extensive experiments conducted on the real-world MIMIC-III dataset demonstrate that cHITF effectively infers clinically meaningful inter-modal correspondence, discovers phenotypes that are more clinically relevant and diverse, and achieves better predictive performance compared with a number of state-of-the-art computational phenotyping models.
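To make the marginalization assumption concrete, the sketch below fits a non-negative CP model of a hidden patient-by-diagnosis-by-medication interaction tensor so that its two marginals match observed patient-diagnosis and patient-medication matrices. It uses a simple least-squares objective with a softplus non-negativity parameterization rather than the likelihood-based cHITF algorithm, and all sizes are made up for illustration.

```python
# Minimal sketch of the marginalization idea behind cHITF-style phenotyping:
# model a hidden non-negative patient x diagnosis x medication interaction tensor
# with a rank-R CP factorization and fit its factors so that the tensor's two
# marginals match the observed patient-diagnosis and patient-medication matrices.
# This is a simplified least-squares analogue, not the likelihood-based cHITF
# algorithm; sizes and training settings are illustrative assumptions.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
P, D, M, R = 100, 40, 30, 5                    # patients, diagnoses, meds, rank

# Synthetic "observed" matrices generated from a hidden non-negative CP tensor.
U0, V0, W0 = torch.rand(P, R), torch.rand(D, R), torch.rand(M, R)
obs_pd = U0 @ (V0 * W0.sum(0)).T               # marginal over medications
obs_pm = U0 @ (W0 * V0.sum(0)).T               # marginal over diagnoses

# Learnable factors, kept non-negative through a softplus parameterization.
params = [torch.randn(P, R, requires_grad=True),
          torch.randn(D, R, requires_grad=True),
          torch.randn(M, R, requires_grad=True)]
opt = torch.optim.Adam(params, lr=0.05)

for step in range(2000):
    U, V, W = (F.softplus(p) for p in params)
    pred_pd = U @ (V * W.sum(0)).T             # implied patient-diagnosis marginal
    pred_pm = U @ (W * V.sum(0)).T             # implied patient-medication marginal
    loss = F.mse_loss(pred_pd, obs_pd) + F.mse_loss(pred_pm, obs_pm)
    opt.zero_grad(); loss.backward(); opt.step()

print(f"final reconstruction loss of both marginals: {loss.item():.4f}")
# Columns of V (diagnoses) and W (medications) can then be read as phenotype
# candidates that share a common patient factor U.
```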
A Learning Metaheuristic Algorithm for a Scheduling Application
Nazgol Niroumandrad
Nadia Lahrichi
Andrea Lodi
Learning Representations for New Sound Classes With Continual Self-Supervised Learning
Zhepei Wang
Xilin Jiang
Junkai Wu
Efthymios Tzinis
Paris Smaragdis
In this article, we work on a sound recognition system that continually incorporates new sound classes. Our main goal is to develop a framework where the model can be updated without relying on labeled data. For this purpose, we propose adopting representation learning, where an encoder is trained using unlabeled data. This learning framework enables the study and implementation of a practically relevant use case where only a small amount of labeled data is available in a continual learning context. We also make the empirical observation that a similarity-based representation learning method within this framework is robust to forgetting, even if no explicit mechanism against forgetting is employed. We show that this approach obtains performance similar to several distillation-based continual learning methods when they are employed on self-supervised representation learning methods.
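The sketch below shows one common similarity-based self-supervised objective (a SimSiam-style negative cosine similarity between two augmented views with a stop-gradient) applied to unlabeled feature vectors arriving task by task. It is a generic illustration of this family of methods, not the authors' system; the encoder, the "augmentation" (additive noise), and the task stream are all assumptions.

```python
# Minimal sketch of similarity-based self-supervised representation learning
# updated continually on unlabeled data arriving task by task (SimSiam-style
# negative cosine similarity with a stop-gradient). Generic illustration only,
# not the authors' system: encoder, "augmentation", and data stream are assumed.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
feat_dim, emb_dim = 128, 64
encoder = torch.nn.Sequential(torch.nn.Linear(feat_dim, 256), torch.nn.ReLU(),
                              torch.nn.Linear(256, emb_dim))
predictor = torch.nn.Sequential(torch.nn.Linear(emb_dim, 64), torch.nn.ReLU(),
                                torch.nn.Linear(64, emb_dim))
opt = torch.optim.Adam(list(encoder.parameters()) + list(predictor.parameters()), lr=1e-3)

def augment(x):
    return x + 0.1 * torch.randn_like(x)        # stand-in for audio augmentations

def similarity_loss(x):
    z1, z2 = encoder(augment(x)), encoder(augment(x))
    p1, p2 = predictor(z1), predictor(z2)
    # Negative cosine similarity, with stop-gradient on the target branch.
    return -0.5 * (F.cosine_similarity(p1, z2.detach(), dim=-1).mean()
                   + F.cosine_similarity(p2, z1.detach(), dim=-1).mean())

# Continual stream: each "task" brings unlabeled features of new sound classes.
for task in range(5):
    unlabeled = torch.randn(1024, feat_dim) + task   # placeholder for new-class features
    for _ in range(50):
        batch = unlabeled[torch.randint(len(unlabeled), (128,))]
        loss = similarity_loss(batch)
        opt.zero_grad(); loss.backward(); opt.step()
    print(f"task {task}: self-supervised loss {loss.item():.3f}")
# A small labeled subset could then train a lightweight classifier on top of the
# frozen encoder to recognize the accumulated sound classes.
```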