Dynamics of SGD with Stochastic Polyak Stepsizes: Truly Adaptive Variants and Convergence to Exact Solution
Antonio Orvieto
Nicolas Loizou
Recently, Loizou et al. (2021) proposed and analyzed stochastic gradient descent (SGD) with stochastic Polyak stepsize (SPS). The proposed SPS comes with strong convergence guarantees and competitive performance; however, it has two main drawbacks when used in non-over-parameterized regimes: (i) it requires a priori knowledge of the optimal mini-batch losses, which are not available when the interpolation condition is not satisfied (e.g., regularized objectives), and (ii) it guarantees convergence only to a neighborhood of the solution. In this work, we study the dynamics and the convergence properties of SGD equipped with new variants of the stochastic Polyak stepsize and provide solutions to both drawbacks of the original SPS. We first show that a simple modification of the original SPS that uses lower bounds instead of the optimal function values can directly solve issue (i). On the other hand, solving issue (ii) turns out to be more challenging and leads us to valuable insights into the method's behavior. We show that if interpolation is not satisfied, the correlation between SPS and stochastic gradients introduces a bias, which effectively distorts the expectation of the gradient signal near minimizers, leading to non-convergence, even if the stepsize is scaled down during training. To fix this issue, we propose DecSPS, a novel modification of SPS, which guarantees convergence to the exact minimizer without a priori knowledge of the problem parameters. For strongly-convex optimization problems, DecSPS is the first stochastic adaptive optimization method that converges to the exact solution without restrictive assumptions like bounded iterates/gradients.
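To make the lower-bound idea concrete, here is a minimal sketch of a capped Polyak-type stepsize in which the optimal mini-batch loss is replaced by a known lower bound (zero, which is valid for any non-negative loss). The function names, the scaling constant `c`, the cap `gamma_max`, the toy least-squares problem, and the cyclic (non-random) sampling are all assumptions made for illustration; this is not DecSPS and not the authors' implementation.

```python
def sps_stepsize(loss_i, lower_bound_i, grad_sq_norm, c=0.5, gamma_max=1.0):
    """Capped Polyak-type stepsize that uses a lower bound on the mini-batch loss.

    All parameter names are illustrative assumptions.
    """
    if grad_sq_norm == 0.0:
        return 0.0  # zero gradient: the update is zero whatever the stepsize
    return min((loss_i - lower_bound_i) / (c * grad_sq_norm), gamma_max)


def sgd_sps(xs, ys, w=0.0, steps=50, c=0.5, gamma_max=1.0):
    """SGD on f_i(w) = 0.5 * (w * x_i - y_i)**2, visiting samples cyclically."""
    for k in range(steps):
        x, y = xs[k % len(xs)], ys[k % len(ys)]
        residual = w * x - y
        loss = 0.5 * residual ** 2
        grad = residual * x
        # Lower bound 0.0 is valid because the squared loss is non-negative.
        gamma = sps_stepsize(loss, 0.0, grad ** 2, c, gamma_max)
        w -= gamma * grad
    return w


if __name__ == "__main__":
    # Interpolation holds here (y = 2x), so the lower bound 0 is attained.
    print(sgd_sps([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

On this toy data every sample is fit exactly by `w = 2`, so the lower bound coincides with the optimal loss. The non-convergence discussed in the abstract concerns the case where no single `w` fits all samples.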
Enhanced Biomedical Knowledge Discovery From Unstructured Text Using Contextual Embeddings
Iz Beltagy
Kyle Lo
Arman Cohan
Réjean Ducharme
Rishi Bommasani
Kelly Davis
Claire Cardie
Billy Chiu
Sampo Pyysalo
Ivan Vulić
Extracting knowledge from large, unstructured text corpora presents a challenge. Recently, authors have utilized unsupervised, static word embeddings to uncover "latent knowledge" contained within domain-specific scientific corpora. Here semantic-similarity measures between representations of concepts, objects or entities were used to predict relationships, which were later verified using physical methods. Static language models have recently been surpassed at most downstream tasks by massively pre-trained, contextual language models like BERT. Some have postulated that contextualized embeddings potentially yield word representations superior to static ones for knowledge-discovery purposes. In an effort to address this question, two biomedically-trained BERT models (BioBERT, SciBERT) were used to encode n = 500, 1000 or 5000 sentences containing words of interest extracted from a biomedical corpus (Coronavirus Open Research Dataset). The n representations for the words of interest were subsequently extracted and then aggregated to yield static-equivalent word representations. These words belonged to the vocabularies of intrinsic benchmarking tools for the biomedical domain (Bio-SimVerb and Bio-SimLex), which assess quality of word representations using semantic-similarity and relatedness measures. Using intrinsic benchmarking tasks, feasibility of using contextualized word representations for knowledge discovery tasks can be assessed: word representations that better encode described reality are expected to perform better (i.e., closer to domain experts). As postulated, BERT embeddings outperform static counterparts
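The aggregation step described above can be sketched as follows. The abstract does not say which aggregation function was used, so the element-wise mean here is an assumption, as are the function names and the toy two-dimensional vectors; real BERT vectors would come from a model, which this sketch does not load.

```python
import math


def aggregate(vectors):
    """Element-wise mean of the n contextual vectors collected for one word.

    Mean pooling is an assumed choice; the abstract only says "aggregated".
    """
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity between two static-equivalent word vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    # Toy stand-ins for contextual vectors of two words across sentences.
    word_a = aggregate([[1.0, 0.0], [3.0, 2.0]])
    word_b = aggregate([[2.0, 1.0], [2.0, 1.0]])
    print(word_a, word_b, cosine(word_a, word_b))
```

Similarity scores computed this way could then be compared against the human ratings in a benchmark such as Bio-SimLex, typically through a rank correlation.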
Extended Abstract Track
Amin Mansouri
Jason Hartford
Kartik Ahuja
Christian Shewmake
Simone Azeglio
Arianna Di Bernardo
Nina Miolane
There has been significant recent progress in causal representation learning that has shown a variety of settings in which we can disentangle latent variables with identifiability guarantees (up to some reasonable equivalence class). Common to all of these approaches is the assumption that (1) the latent variables are d-dimensional vectors, and (2) the observations are the output of some injective observation function of these latent variables. These assumptions appear benign: they amount to assuming that any changes in the latent space are reflected in the observation space, and that we can use standard encoders to infer the latent variables. However, we show that when the observations are of multiple objects, the observation function is no longer injective, and disentanglement fails in practice. We can address this failure by combining recent developments in object-centric learning and causal representation learning. By modifying the Slot Attention architecture (Locatello et al., 2020b), we develop an object-centric architecture that leverages weak supervision from sparse perturbations to disentangle each object’s properties. We argue that this approach is more data-efficient in the sense that it requires significantly fewer perturbations than a comparable approach that encodes to a Euclidean space, and we show that this approach successfully disentangles the properties of a set of objects in a series of simple image-based disentanglement experiments.
Extended Abstract Track
Amin Mansouri
Jason Hartford
Sophia Sanborn
Christian Shewmake
Simone Azeglio
Arianna Di Bernardo
Nina Miolane
Extracting Person Names from User Generated Text: Named-Entity Recognition for Combating Human Trafficking
Yifei Li
Pratheeksha Nair
Kellin Pelrine
Online escort advertisement websites are widely used for advertising victims of human trafficking. Domain experts agree that advertising multiple people in the same ad is a strong indicator of trafficking. Thus, extracting person names from the text of these ads can provide valuable clues for further analysis. However, Named-Entity Recognition (NER) on escort ads is challenging because the text can be noisy and colloquial, and it often lacks proper grammar and punctuation. Most existing state-of-the-art NER models fail to demonstrate satisfactory performance in this task. In this paper, we propose NEAT (Name Extraction Against Trafficking) for extracting person names. It effectively combines classic rule-based and dictionary extractors with a contextualized language model to capture ambiguous names (e.g., penny, hazel) and adapts to adversarial changes in the text by expanding its dictionary. NEAT shows a 19% improvement on average in the F1 classification score for name extraction compared to the previous state of the art on two domain-specific datasets.
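As a rough illustration of how a rule-based extractor and a dictionary extractor can be combined, consider the toy sketch below. It is not NEAT: the dictionary contents, the trigger phrases, and the function name are hypothetical, and the contextualized language model that NEAT uses to resolve ambiguous names is omitted entirely.

```python
import re

# Hypothetical name dictionary; a real one would be far larger.
NAME_DICT = {"penny", "hazel", "maria"}

# Hypothetical trigger phrases that tend to introduce a name.
RULE = re.compile(r"\b(?:my name is|i am|i'm|call me)\s+([a-z]+)", re.IGNORECASE)


def extract_names(text):
    """Union of rule-based matches and dictionary hits, lowercased and sorted."""
    found = set()
    # Rule-based extractor: the word following a trigger phrase.
    for match in RULE.finditer(text):
        found.add(match.group(1).lower())
    # Dictionary extractor: any token that appears in the name list.
    for token in re.findall(r"[a-z]+", text.lower()):
        if token in NAME_DICT:
            found.add(token)
    return sorted(found)


if __name__ == "__main__":
    print(extract_names("hi im new, call me Roxy and my friend hazel"))
```

A dictionary hit alone cannot tell whether "penny" is a name or a coin, which is the kind of ambiguity the abstract assigns to the language-model component.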
Feeding What You Need by Understanding What You Learned
Xiaoqiang Wang
Fangli Xu
Bo Long
Siliang Tang
Lingfei Wu
Few-Shot Pidgin Text Adaptation via Contrastive Fine-Tuning
Ernie Chang
Jesujoba Oluwadara Alabi
Vera Demberg
The surging demand for multilingual dialogue systems often requires a costly labeling process for each language addition. For low-resource languages, human annotators are continuously tasked with the adaptation of resource-rich language utterances for each new domain. However, this prohibitive and impractical process can often be a bottleneck for low-resource languages that still lack proper translation systems or parallel corpora. In particular, it is difficult to obtain task-specific low-resource language annotations for the English-derived creoles (e.g., Nigerian and Cameroonian Pidgin). To address this issue, we utilize a pretrained language model, BART, which has shown great potential in language generation and understanding. We propose to fine-tune the BART model to generate utterances in Pidgin by leveraging the proximity of the source and target languages and by utilizing positive and negative examples in contrastive training objectives. We collected and released the first parallel Pidgin-English conversation corpus in two dialogue domains and showed that this simple and effective technique suffices to yield impressive results for English-to-Pidgin generation, which involves two closely related languages.
Findings of the WMT’22 Shared Task on Large-Scale Machine Translation Evaluation for African Languages
Md Mahfuz Ibn Alam
Antonios Anastasopoulos
Akshita Bhagia
Marta R. Costa-jussà
Jesse Dodge
Fahim Faisal
Christian Federmann
Natalia N. Fedorova
Francisco S. Guzmán
Sergey Koshelev
Jean Maillard
Vukosi Marivate
Jonathan Mbuya
Alexandre Mourachko
Safiyyah Saleem
Holger Schwenk
Guillaume Wenzek
We present the results of the WMT’22 Shared Task on Large-Scale Machine Translation Evaluation for African Languages. The shared task included both a data and a systems track, along with additional innovations, such as a focus on African languages and extensive human evaluation of submitted systems. We received 14 system submissions from 8 teams, as well as 6 data track contributions. We report substantial progress in the quality of translation for African languages since the last iteration of this shared task: there is an increase of about 7.5 BLEU points across 72 language pairs, and the average BLEU scores went from 15.09 to 22.60.