Extracting knowledge from large, unstructured text corpora presents a challenge. Recently, authors have utilized unsupervised, static word embeddings to uncover "latent knowledge" contained within domain-specific scientific corpora. Here semantic-similarity measures between representations of concepts, objects or entities were used to predict relationships, which were later verified using physical methods. Static language models have recently been surpassed at most downstream tasks by massively pre-trained, contextual language models like BERT. Some have postulated that contextualized embeddings potentially yield word representations superior to static ones for knowledge-discovery purposes. In an effort to address this question, two biomedically-trained BERT models (BioBERT, SciBERT) were used to encode n = 500, 1000 or 5000 sentences containing words of interest extracted from a biomedical corpus (Coronavirus Open Research Dataset). The n representations for the words of interest were subsequently extracted and then aggregated to yield static-equivalent word representations. These words belonged to the vocabularies of intrinsic benchmarking tools for the biomedical domain (Bio-SimVerb and Bio-SimLex), which assess quality of word representations using semantic-similarity and relatedness measures. Using intrinsic benchmarking tasks, feasibility of using contextualized word representations for knowledge discovery tasks can be assessed: Word representations that better encode described reality are expected to perform better (i.e. closer to domain experts). As postulated, BERT embeddings outperform static counterparts
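
The aggregation and benchmarking steps described above can be illustrated with a minimal sketch. It is not the authors' implementation. The abstract does not state how the n contextual representations are aggregated or which correlation statistic is reported, so mean pooling and Spearman rank correlation are assumptions here. The words, vectors and expert ratings are invented toy values, and all function names (`aggregate`, `benchmark_score`, and so on) are hypothetical. In practice the contextual vectors would come from a model such as BioBERT or SciBERT rather than being written by hand.

```python
import math
from statistics import mean


def aggregate(contextual_vectors):
    """Mean-pool the contextual vectors of one word (assumed aggregation)."""
    return [mean(dim) for dim in zip(*contextual_vectors)]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


def benchmark_score(static_vectors, rated_pairs):
    """Correlate model cosine similarities with expert ratings (assumed metric)."""
    model = [cosine(static_vectors[a], static_vectors[b]) for a, b, _ in rated_pairs]
    human = [rating for _, _, rating in rated_pairs]
    return spearman(model, human)


# Toy data: two contextual vectors per word (n = 2), invented for illustration.
contextual = {
    "virus": [[1.0, 0.0], [0.8, 0.2]],
    "pathogen": [[0.9, 0.1], [0.7, 0.3]],
    "protein": [[0.0, 1.0], [0.2, 0.8]],
}
static_vectors = {word: aggregate(vecs) for word, vecs in contextual.items()}

# Invented expert similarity ratings in the style of a word-pair benchmark.
rated_pairs = [
    ("virus", "pathogen", 9.0),
    ("virus", "protein", 3.0),
    ("pathogen", "protein", 4.0),
]
score = benchmark_score(static_vectors, rated_pairs)
print(f"Spearman correlation with expert ratings: {score:.3f}")
```

A higher correlation means the similarity ordering induced by the aggregated vectors is closer to the ordering given by domain experts, which is the sense in which one set of word representations is said to perform better than another.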