TY - JOUR
T1 - Increasing metadata coverage of SRA BioSample entries using deep learning-based named entity recognition
AU - Klie, Adam
AU - Tsui, Brian Y.
AU - Mollah, Shamim
AU - Skola, Dylan
AU - Dow, Michelle
AU - Hsu, Chun Nan
AU - Carter, Hannah
N1 - Funding Information:
This work was supported by the National Institutes of Health [grant numbers T32GM8806, DP5OD017937]; infrastructure was funded by the National Institutes of Health [grant number 2P41GM103504-11]; H.C. is supported by the Canadian Institute for Advanced Research [award number FL-000655].
Publisher Copyright:
© 2021 The Author(s). Published by Oxford University Press.
PY - 2021
Y1 - 2021
N2 - High-quality metadata annotations for data hosted in large public repositories are essential for research reproducibility and for conducting fast, powerful and scalable meta-analyses. Currently, a majority of sequencing samples in the National Center for Biotechnology Information's Sequence Read Archive (SRA) are missing metadata across several categories. In an effort to improve the metadata coverage of these samples, we leveraged almost 44 million attribute-value pairs from SRA BioSample to train a scalable, recurrent neural network that predicts missing metadata via named entity recognition (NER). The network was first trained to classify short text phrases according to 11 metadata categories and achieved an overall accuracy and area under the receiver operating characteristic curve of 85.2% and 0.977, respectively. We then applied our classifier to predict 11 metadata categories from the longer TITLE attribute of samples, evaluating performance on a set of samples withheld from model training. Prediction accuracies were high when extracting sample Genus/Species (94.85%), Condition/Disease (95.65%) and Strain (82.03%) from TITLEs, with lower accuracies and lack of predictions for other categories highlighting multiple issues with the current metadata annotations in BioSample. These results indicate the utility of recurrent neural networks for NER-based metadata prediction and the potential for models such as the one presented here to increase metadata coverage in BioSample while minimizing the need for manual curation. Database URL: https://github.com/cartercompbio/PredictMEE
AB - High-quality metadata annotations for data hosted in large public repositories are essential for research reproducibility and for conducting fast, powerful and scalable meta-analyses. Currently, a majority of sequencing samples in the National Center for Biotechnology Information's Sequence Read Archive (SRA) are missing metadata across several categories. In an effort to improve the metadata coverage of these samples, we leveraged almost 44 million attribute-value pairs from SRA BioSample to train a scalable, recurrent neural network that predicts missing metadata via named entity recognition (NER). The network was first trained to classify short text phrases according to 11 metadata categories and achieved an overall accuracy and area under the receiver operating characteristic curve of 85.2% and 0.977, respectively. We then applied our classifier to predict 11 metadata categories from the longer TITLE attribute of samples, evaluating performance on a set of samples withheld from model training. Prediction accuracies were high when extracting sample Genus/Species (94.85%), Condition/Disease (95.65%) and Strain (82.03%) from TITLEs, with lower accuracies and lack of predictions for other categories highlighting multiple issues with the current metadata annotations in BioSample. These results indicate the utility of recurrent neural networks for NER-based metadata prediction and the potential for models such as the one presented here to increase metadata coverage in BioSample while minimizing the need for manual curation. Database URL: https://github.com/cartercompbio/PredictMEE
UR - http://www.scopus.com/inward/record.url?scp=85105090511&partnerID=8YFLogxK
U2 - 10.1093/database/baab021
DO - 10.1093/database/baab021
M3 - Article
C2 - 33914028
AN - SCOPUS:85105090511
SN - 1758-0463
VL - 2021
JO - Database
JF - Database
M1 - baab021
ER -