Zero- and Few-Shot Machine Learning for Named Entity Recognition in Biomedical Texts
Authors
Košprdić, Miloš
Prodanović, Nikola
Ljajić, Adela
Bašaragin, Bojana
Milošević, Nikola
Contributors
Morić, Ivana
Đorđević, Valentina
Conference object (Published version)
© 2023 Institute of Molecular Genetics and Genetic Engineering, University of Belgrade
Abstract
Named entity recognition (NER) is an NLP task that involves identifying and classifying named
entities in text. Token classification is a crucial subtask of NER that entails assigning
labels to individual tokens within a text, indicating the named entity category to which
they belong. Fine-tuning large language models (LLMs) on labeled domain datasets has
emerged as a powerful technique for improving NER performance. By training a pretrained
LLM such as BERT on domain-specific labeled data, the model learns to recognize
named entities specific to that domain with high accuracy. This approach has been applied
to a wide range of domains including biomedical and has demonstrated significant
improvements in NER accuracy.
Still, the amount of data needed for fine-tuning pre-trained LLMs is large, and labeling is a
time-consuming and expensive process that requires expert domain knowledge. In addition,
domains with an open set of classes pose difficulties for traditional machine learning
approaches, since the number of classes to predict needs to be pre-defined.
Our solution to these two problems is based on a data transformation that factorizes the
initial multi-class classification problem into a binary one, combined with a
cross-encoder-based BERT architecture for zero- and few-shot learning.
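A minimal sketch of this factorization, using hypothetical function and field names and a toy example (not the authors' actual code): each multi-class example is turned into one binary tagging instance per candidate class, so a cross-encoder can jointly read the class name and the sentence and decide, token by token, whether the token belongs to that class.

```python
# Sketch: factorizing multi-class token classification into binary
# cross-encoder examples. Names and format are illustrative assumptions.

def to_binary_examples(tokens, labels, all_classes):
    """Turn one multi-class example into one binary example per class.

    tokens: list of token strings
    labels: list of class names aligned with tokens ("O" = outside)
    all_classes: the entity classes under consideration (open set)
    """
    examples = []
    for cls in all_classes:
        # Token label is 1 iff the token belongs to the queried class.
        binary = [1 if lab == cls else 0 for lab in labels]
        # A cross-encoder would see class name and sentence jointly,
        # e.g. "[CLS] drug [SEP] The patient received aspirin . [SEP]"
        examples.append({"class": cls, "tokens": tokens, "labels": binary})
    return examples

tokens = ["The", "patient", "received", "aspirin", "."]
labels = ["O", "O", "O", "drug", "O"]
out = to_binary_examples(tokens, labels, ["drug", "disease"])
```

Because the class name is an input rather than a fixed output dimension, unseen classes can be queried at inference time, which is what enables the zero-shot setting.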
To create our dataset, we transformed six widely used biomedical datasets that contain
various biomedical entities such as genes, drugs, diseases, adverse events, chemicals,
etc., into a uniform format. This transformation process enabled us to merge the datasets
into a single cohesive dataset of 26 different named entity classes.
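The merge step can be sketched as a per-dataset label mapping into a shared class vocabulary; the dataset names and mappings below are hypothetical placeholders, not the six datasets' real label schemes.

```python
# Sketch: merging heterogeneous NER datasets into one uniform format.
# LABEL_MAPS entries are invented examples of per-source label schemes.

LABEL_MAPS = {
    "dataset_a": {"CHEM": "chemical", "DIS": "disease"},
    "dataset_b": {"Drug": "drug", "ADE": "adverse event"},
}

def unify(name, examples):
    """Rewrite one dataset's labels into the shared class vocabulary."""
    mapping = LABEL_MAPS[name]
    unified = []
    for tokens, labels in examples:
        # Unknown labels fall back to "O" (outside any entity).
        new_labels = [mapping.get(lab, "O") for lab in labels]
        unified.append({"source": name, "tokens": tokens, "labels": new_labels})
    return unified

merged = unify("dataset_a", [(["aspirin"], ["CHEM"])]) + \
         unify("dataset_b", [(["nausea"], ["ADE"])])
```

Once every source speaks the same label vocabulary, the examples can simply be concatenated into one dataset covering all 26 classes.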
We then fine-tuned two pre-trained language models, BioBERT and PubMedBERT, for the
NER task in zero- and few-shot settings. The results of the experiment on 9 classes in
zero-shot mode are promising for semantically similar classes and improve significantly
for almost all classes after providing only a few supporting examples. The best results
were obtained using a fine-tuned PubMedBERT model, with average F1 scores of
35.44%, 50.10%, 69.94%, and 79.51% for zero-shot, one-shot, 10-shot, and 100-shot NER
respectively.
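The reported averages can be understood as per-class F1 scores averaged over the evaluated classes; the per-class counts below are hypothetical, purely to show the arithmetic.

```python
# Sketch: averaging per-class F1, as in the reported zero-/few-shot
# scores. The (tp, fp, fn) counts are invented for illustration.

def f1(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

counts = {"drug": (8, 2, 2), "disease": (6, 4, 6)}
avg_f1 = sum(f1(*c) for c in counts.values()) / len(counts)
```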
Keywords:
zero-shot learning / machine learning / deep learning / natural language processing / biomedical named entity recognition
Source:
4th Belgrade Bioinformatics Conference, 2023, 4, 38-38
Publisher:
- Belgrade : Institute of molecular genetics and genetic engineering
Note:
- Book of abstract: 4th Belgrade Bioinformatics Conference, June 19-23, 2023
Collections
Institution/Community
Institut za molekularnu genetiku i genetičko inženjerstvo