The bgc-accession model is a Named Entity Recognition (NER) model that identifies and annotates the accession numbers of biosynthetic gene clusters (BGCs) in text.

The model is fine-tuned from BioBERT. The training dataset is available at https://gitlab.com/maaly7/emerald_bgcs_annotations

Testing examples:

1. The genome sequences of Leptolyngbya sp. PCC 7375 (ALVN00000000) and G. sunshinyii YC6258 (NZ_CP007142.1) were obtained previously.36,59
2. K311 was sequenced (NCBI accession number: JN852959) and analyzed with FramePlot and 18 genes were predicted to be involved in echinomycin biosynthesis (Figure 2).
3. The mar cluster was sequenced and annotated and the complete sequence was deposited into Genbank (accession KF711829).
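
The sentences above can be passed to the model through the Hugging Face `transformers` token-classification pipeline. The following is a minimal sketch, not the authors' documented usage: the helper names `load_tagger` and `annotate` are hypothetical, the model id has to be supplied by the caller, and the label name `ACCESSION` in the stand-in prediction is an assumption, since this card does not list the model's label set.

```python
from typing import Callable, Dict, List


def load_tagger(model_id: str) -> Callable[[str], List[Dict]]:
    """Build a token-classification pipeline for the given model id.

    Hypothetical helper: model_id is supplied by the caller and is not
    stated in this card. The import is local so that annotate() below
    can be used without transformers installed.
    """
    from transformers import pipeline

    # aggregation_strategy="simple" merges word pieces into whole spans
    # and returns dicts with entity_group, score, word, start and end.
    return pipeline(
        "token-classification",
        model=model_id,
        aggregation_strategy="simple",
    )


def annotate(text: str, entities: List[Dict]) -> str:
    """Wrap each predicted span as [span](LABEL) using character offsets."""
    pieces = []
    cursor = 0
    for ent in sorted(entities, key=lambda e: e["start"]):
        pieces.append(text[cursor:ent["start"]])
        span = text[ent["start"]:ent["end"]]
        pieces.append(f"[{span}]({ent['entity_group']})")
        cursor = ent["end"]
    pieces.append(text[cursor:])
    return "".join(pieces)


sentence = (
    "The mar cluster was sequenced and annotated and the complete "
    "sequence was deposited into Genbank (accession KF711829)."
)

# Hand-written stand-in for one pipeline prediction, so the example runs
# offline. The label "ACCESSION" is an assumption, not the model's
# confirmed label name.
start = sentence.index("KF711829")
mock_entities = [
    {
        "entity_group": "ACCESSION",
        "score": 1.0,
        "word": "KF711829",
        "start": start,
        "end": start + len("KF711829"),
    }
]

annotated = annotate(sentence, mock_entities)
print(annotated)
```

With a real model id, `annotate(sentence, load_tagger(model_id)(sentence))` would apply the same formatting to the model's actual predictions.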