Publications

Enhanced Biomedical Knowledge Discovery From Unstructured Text Using Contextual Embeddings

Iz Beltagy

Kyle Lo

Arman Cohan

Yoshua Bengio

Réjean Ducharme

Pascal Vincent

Rishi Bommasani

Kelly Davis

Claire Cardie

Billy Chiu

Sampo Pyysalo

Ivan Vulić

Extracting knowledge from large, unstructured text corpora presents a challenge. Recently, authors have utilized unsupervised, static word embeddings to uncover "latent knowledge" contained within domain-specific scientific corpora. Here semantic-similarity measures between representations of concepts, objects or entities were used to predict relationships, which were later verified using physical methods. Static language models have recently been surpassed at most downstream tasks by massively pre-trained, contextual language models like BERT. Some have postulated that contextualized embeddings potentially yield word representations superior to static ones for knowledge-discovery purposes. In an effort to address this question, two biomedically-trained BERT models (BioBERT, SciBERT) were used to encode n = 500, 1000 or 5000 sentences containing words of interest extracted from a biomedical corpus (Coronavirus Open Research Dataset). The n representations for the words of interest were subsequently extracted and then aggregated to yield static-equivalent word representations. These words belonged to the vocabularies of intrinsic benchmarking tools for the biomedical domain (Bio-SimVerb and Bio-SimLex), which assess quality of word representations using semantic-similarity and relatedness measures. Using intrinsic benchmarking tasks, feasibility of using contextualized word representations for knowledge discovery tasks can be assessed: word representations that better encode described reality are expected to perform better (i.e. closer to domain experts). As postulated, BERT embeddings outperform static counterparts
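The aggregation step described in this abstract can be illustrated with a minimal sketch. The abstract does not say which aggregation function was used, so simple averaging is an assumption here, and the three-dimensional toy vectors and word choices are hypothetical stand-ins for real BERT outputs.

```python
import math

def aggregate(vectors):
    """Average n contextual vectors into one static-equivalent vector.

    Mean pooling is an assumption; the abstract only says "aggregated".
    """
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def cosine(u, v):
    """Cosine similarity, the usual score compared against human ratings."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical contextual vectors: one per sentence in which the word occurs.
contextual = {
    "virus": [[1.0, 0.0, 1.0], [0.8, 0.2, 1.0]],
    "pathogen": [[0.9, 0.1, 0.9], [1.0, 0.0, 0.8]],
    "protocol": [[0.0, 1.0, 0.1], [0.1, 0.9, 0.0]],
}

# One static-equivalent vector per word of interest.
static = {word: aggregate(vecs) for word, vecs in contextual.items()}

related = cosine(static["virus"], static["pathogen"])
unrelated = cosine(static["virus"], static["protocol"])
print(related > unrelated)
```

In an intrinsic benchmark of the kind the abstract names, such similarity scores would be correlated with expert ratings for each word pair; that evaluation step is not shown here.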

Extended Abstract Track

Amin Mansouri

Jason Hartford

Kartik Ahuja

Yoshua Bengio

Christian Shewmake

Simone Azeglio

Arianna Di Bernardo

Nina Miolane

Extended Abstract Track

Amin Mansouri

Jason Hartford

Yoshua Bengio

Sophia Sanborn

Christian Shewmake

Simone Azeglio

Arianna Di Bernardo

Nina Miolane

There has been significant recent progress in causal representation learning that has shown a variety of settings in which we can disentangle latent variables with identifiability guarantees (up to some reasonable equivalence class). Common to all of these approaches is the assumption that (1) the latent variables are d-dimensional vectors, and (2) the observations are the output of some injective observation function of these latent variables. While these assumptions appear benign (they amount to assuming that any changes in the latent space are reflected in the observation space, and that we can use standard encoders to infer the latent variables), we show that when the observations are of multiple objects, the observation function is no longer injective, and disentanglement fails in practice. We can address this failure by combining recent developments in object-centric learning and causal representation learning. By modifying the Slot Attention architecture (Locatello et al., 2020b), we develop an object-centric architecture that leverages weak supervision from sparse perturbations to disentangle each object’s properties. We argue that this approach is more data-efficient in the sense that it requires significantly fewer perturbations than a comparable approach that encodes to a Euclidean space, and we show that this approach successfully disentangles the properties of a set of objects in a series of simple image-based disentanglement experiments.
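The loss of injectivity with multiple objects can be seen in a toy example. The sketch below is an illustration under stated assumptions, not the paper's setup: the `render` function and the grid positions are hypothetical, and each object is reduced to a (row, column) position.

```python
from itertools import permutations

def render(latents, size=4):
    """Toy observation function: mark each object's (row, col) cell on a grid."""
    image = [[0] * size for _ in range(size)]
    for row, col in latents:
        image[row][col] = 1
    return tuple(tuple(r) for r in image)

# Three objects, each described by a 2-dimensional latent (its position).
objects = [(0, 1), (2, 3), (3, 0)]

# Every ordering of the objects is a distinct latent vector in the
# concatenated Euclidean latent space.
orderings = set(permutations(objects))

# All of those orderings render to the same image, so the observation
# function is not injective on the concatenated latent vector.
images = {render(z) for z in orderings}

print(len(orderings), len(images))
```

Because the image cannot reveal which object occupies which block of the latent vector, an encoder to a fixed-order vector has no consistent target, which is one way to read the motivation for a set-structured, object-centric representation.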

2022-01-01

(published)

www.semanticscholar.org

Extracting Person Names from User Generated Text: Named-Entity Recognition for Combating Human Trafficking

Yifei Li

Pratheeksha Nair

Kellin Pelrine

Reihaneh Rabbany

Online escort advertisement websites are widely used for advertising victims of human trafficking. Domain experts agree that advertising multiple people in the same ad is a strong indicator of trafficking. Thus, extracting person names from the text of these ads can provide valuable clues for further analysis. However, Named-Entity Recognition (NER) on escort ads is challenging because the text can be noisy and colloquial, and it often lacks proper grammar and punctuation. Most existing state-of-the-art NER models fail to demonstrate satisfactory performance in this task. In this paper, we propose NEAT (Name Extraction Against Trafficking) for extracting person names. It effectively combines classic rule-based and dictionary extractors with a contextualized language model to capture ambiguous names (e.g., penny, hazel) and adapts to adversarial changes in the text by expanding its dictionary. NEAT shows a 19% improvement on average in the F1 classification score for name extraction compared to the previous state of the art on two domain-specific datasets.
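The combination of dictionary and rule-based extractors with dictionary expansion can be sketched roughly as follows. This is not the NEAT implementation: the seed dictionary, the regular expression, and all function names are hypothetical, and the contextualized language model that the abstract uses to resolve ambiguous names is left out.

```python
import re

# Hypothetical seed dictionary of known names.
NAME_DICTIONARY = {"penny", "amber"}

# Hypothetical rule: a word that follows a self-introduction phrase.
INTRO_PATTERN = re.compile(
    r"\b(?:i'?m|my name is|call me)\s+([a-z]+)", re.IGNORECASE
)

def dictionary_extract(text, dictionary):
    """Return every token that appears in the name dictionary."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return {token for token in tokens if token in dictionary}

def rule_extract(text):
    """Return words captured by the introduction rule."""
    return {match.lower() for match in INTRO_PATTERN.findall(text)}

def extract_names(text, dictionary):
    """Union both extractors, then expand the dictionary with what was found."""
    found = dictionary_extract(text, dictionary) | rule_extract(text)
    dictionary |= found
    return found

dictionary = set(NAME_DICTIONARY)
first = extract_names("hi im hazel and my friend penny is here", dictionary)
second = extract_names("hazel is back in town", dictionary)
print(sorted(first), sorted(second))
```

The second call finds "hazel" only because the first call added it to the dictionary. A rule this loose would also capture non-names (for example the word after "I'm" in "I'm tired"), which is the kind of ambiguity the abstract assigns to the language model.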

2022-01-01

Findings (published)

doi.org

Feeding What You Need by Understanding What You Learned

Xiaoqiang Wang

Bang Liu

Fangli Xu

Bo Long

Siliang Tang

Lingfei Wu

2022-01-01

ACL (1) (published)

doi.org

arxiv.org

Few-Shot Pidgin Text Adaptation via Contrastive Fine-Tuning

Ernie Chang

Jesujoba Oluwadara Alabi

David Ifeoluwa Adelani

Vera Demberg

The surging demand for multilingual dialogue systems often requires a costly labeling process for each language addition. For low-resource languages, human annotators are continuously tasked with the adaptation of resource-rich language utterances for each new domain. However, this prohibitive and impractical process can often be a bottleneck for low-resource languages that still lack proper translation systems or parallel corpora. In particular, it is difficult to obtain task-specific low-resource language annotations for the English-derived creoles (e.g. Nigerian and Cameroonian Pidgin). To address this issue, we utilize a pretrained language model, BART, which has shown great potential in language generation and understanding: we propose to fine-tune the BART model to generate utterances in Pidgin by leveraging the proximity of the source and target languages and by utilizing positive and negative examples in contrastive training objectives. We collected and released the first parallel Pidgin-English conversation corpus in two dialogue domains and showed that this simple and effective technique suffices to yield impressive results for English-to-Pidgin generation, which involves two closely related languages.
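As a rough illustration of training with positive and negative examples, the sketch below uses a generic margin-based contrastive loss. The abstract does not give the exact objective, so this particular loss, the margin value, and the example scores are all assumptions rather than the authors' formulation.

```python
def contrastive_margin_loss(pos_score, neg_score, margin=1.0):
    """Generic margin loss (an assumed stand-in for the paper's objective).

    pos_score and neg_score are model scores, e.g. sequence log-likelihoods,
    for a correct target utterance and for a corrupted one. The loss is zero
    once the positive example outscores the negative one by at least `margin`.
    """
    return max(0.0, margin - (pos_score - neg_score))

# Hypothetical log-likelihoods assigned by a fine-tuned model.
well_separated = contrastive_margin_loss(pos_score=-2.0, neg_score=-5.0)
wrongly_ranked = contrastive_margin_loss(pos_score=-4.0, neg_score=-3.5)
print(well_separated, wrongly_ranked)
```

In a fine-tuning loop, a term like this would typically be added to the usual generation loss, so that the model is penalized when it scores a negative example close to or above the positive one.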

2022-01-01

COLING (published)

dblp.uni-trier.de

Findings of the WMT’22 Shared Task on Large-Scale Machine Translation Evaluation for African Languages

David Ifeoluwa Adelani

Md Mahfuz Ibn Alam

Antonios Anastasopoulos

Akshita Bhagia

Marta R. Costa-jussà

Jesse Dodge

Fahim Faisal

Christian Federmann

Natalia N. Fedorova

Francisco S. Guzmán

Sergey Koshelev

Jean Maillard

Vukosi Marivate

Jonathan Mbuya

Alexandre Mourachko

Safiyyah Saleem

Holger Schwenk

Guillaume Wenzek

We present the results of the WMT’22 Shared Task on Large-Scale Machine Translation Evaluation for African Languages. The shared task included both a data and a systems track, along with additional innovations, such as a focus on African languages and extensive human evaluation of submitted systems. We received 14 system submissions from 8 teams, as well as 6 data track contributions. We report large progress in the quality of translation for African languages since the last iteration of this shared task: there is an increase of about 7.5 BLEU points across 72 language pairs, and the average BLEU scores went from 15.09 to 22.60.

2022-01-01

Conference on Machine Translation (published)

dblp.uni-trier.de

Flexible Diffusion Modeling of Long Videos

William Harvey

Saeid Naderiparizi

Vaden Masrani

Christian Dietrich Weilbach

Frank N. Wood

We present a framework for video modeling based on denoising diffusion probabilistic models that produces long-duration video completions in a variety of realistic environments. We introduce a generative model that can at test-time sample any arbitrary subset of video frames conditioned on any other subset and present an architecture adapted for this purpose. Doing so allows us to efficiently compare and optimize a variety of schedules for the order in which frames in a long video are sampled and use selective sparse and long-range conditioning on previously sampled frames. We demonstrate improved video modeling over prior work on a number of datasets and sample temporally coherent videos over 25 minutes in length. We additionally release a new video modeling dataset and semantically meaningful metrics based on videos generated in the CARLA autonomous driving simulator.
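A sampling schedule of the kind this abstract compares can be written down as a list of (observed frames, frames to sample) pairs. The sketch below builds only the simplest such schedule, a blockwise autoregressive one; the function name and parameters are hypothetical, and the diffusion model that would consume the schedule is not shown.

```python
def autoregressive_schedule(num_frames, sample_per_step, condition_on):
    """Build a simple blockwise schedule (an illustrative assumption).

    Each step samples the next `sample_per_step` frame indices, conditioned
    on the last `condition_on` frames that were already sampled.
    """
    schedule = []
    sampled = []
    start = 0
    while start < num_frames:
        latent = list(range(start, min(start + sample_per_step, num_frames)))
        # Guard the zero case: sampled[-0:] would return the whole list.
        observed = sampled[-condition_on:] if condition_on > 0 else []
        schedule.append((observed, latent))
        sampled.extend(latent)
        start += sample_per_step
    return schedule

for observed, latent in autoregressive_schedule(10, 4, 2):
    print("condition on", observed, "-> sample", latent)
```

A model that can condition on any subset of frames makes it possible to swap this schedule for others, for example ones that also condition on a few much earlier frames, without retraining.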

openreview.net
