julien-c HF staff committed on
Commit
7c01e18
1 Parent(s): 4b427c5

Migrate model card from transformers-repo

Read announcement at https://discuss.huggingface.co/t/announcement-all-model-cards-will-be-migrated-to-hf-co-model-repos/2755
Original file history: https://github.com/huggingface/transformers/commits/master/model_cards/bionlp/bluebert_pubmed_uncased_L-12_H-768_A-12/README.md

Files changed (1)
  1. README.md +60 -0
README.md ADDED
@@ -0,0 +1,60 @@
+ ---
+ language:
+ - en
+ tags:
+ - bluebert
+ license:
+ - PUBLIC DOMAIN NOTICE
+ datasets:
+ - pubmed
+
+ ---
+
+ # BlueBERT-Base, Uncased, PubMed
+
+ ## Model description
+
+ A BERT model pre-trained on PubMed abstracts.
+
+ ## Intended uses & limitations
+
+ #### How to use
+
+ For usage instructions, see https://github.com/ncbi-nlp/bluebert
+
+ ## Training data
+
+ We provide [preprocessed PubMed texts](https://ftp.ncbi.nlm.nih.gov/pub/lu/Suppl/NCBI-BERT/pubmed_uncased_sentence_nltk.txt.tar.gz) that were used to pre-train the BlueBERT models.
+ The corpus contains ~4000M words extracted from the [PubMed ASCII code version](https://www.ncbi.nlm.nih.gov/research/bionlp/APIs/BioC-PubMed/).
+
+ Pre-trained model: https://huggingface.co/bert-base-uncased
+
+ ## Training procedure
+
+ * lowercasing the text
+ * removing non-ASCII characters (anything outside `\x00`-`\x7F`)
+ * tokenizing the text using the [NLTK Treebank tokenizer](https://www.nltk.org/_modules/nltk/tokenize/treebank.html)
+
+ The code snippet below shows these steps in detail.
+
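+ ```python
+ import re
+ from nltk.tokenize import TreebankWordTokenizer
+
+ # `value` is assumed to hold the raw input text (a str)
+ value = value.lower()                         # lowercase the text
+ value = re.sub(r'[\r\n]+', ' ', value)        # replace line breaks with a space
+ value = re.sub(r'[^\x00-\x7F]+', ' ', value)  # replace runs of non-ASCII chars with a space
+
+ tokenized = TreebankWordTokenizer().tokenize(value)
+ sentence = ' '.join(tokenized)
+ sentence = re.sub(r"\s's\b", "'s", sentence)  # re-attach the possessive 's split off by the tokenizer
+ ```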
+
+ ### BibTeX entry and citation info
+
+ ```bibtex
+ @InProceedings{peng2019transfer,
+ author = {Yifan Peng and Shankai Yan and Zhiyong Lu},
+ title = {Transfer Learning in Biomedical Natural Language Processing: An Evaluation of BERT and ELMo on Ten Benchmarking Datasets},
+ booktitle = {Proceedings of the 2019 Workshop on Biomedical Natural Language Processing (BioNLP 2019)},
+ year = {2019},
+ pages = {58--65},
+ }
+ ```
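
The preprocessing steps listed under "Training procedure" in the added README can be sketched as a small reusable function. This is a minimal sketch under stated assumptions, not the authors' released pipeline: the names `normalize` and `preprocess` are hypothetical, and the tokenizer is passed in as a callable so the example runs without NLTK installed. Passing `TreebankWordTokenizer().tokenize` from `nltk.tokenize` should mirror the snippet in the model card.

```python
import re
from typing import Callable, List


def normalize(text: str) -> str:
    """Lowercase, flatten line breaks, and blank out non-ASCII characters."""
    text = text.lower()
    text = re.sub(r'[\r\n]+', ' ', text)        # line breaks -> single space
    text = re.sub(r'[^\x00-\x7F]+', ' ', text)  # each non-ASCII run -> single space
    return text


def preprocess(text: str, tokenize: Callable[[str], List[str]]) -> str:
    """Normalize, tokenize, and re-join tokens with single spaces.

    `tokenize` is any callable mapping a string to a list of tokens
    (hypothetical parameter; the model card uses NLTK's Treebank tokenizer).
    """
    tokens = tokenize(normalize(text))
    sentence = ' '.join(tokens)
    # Re-attach a possessive 's that the tokenizer split into its own token.
    return re.sub(r"\s's\b", "'s", sentence)


# Stand-in tokenizer for illustration only: whitespace splitting.
print(preprocess("TNF-\u03b1 Levels\nRose", str.split))  # tnf- levels rose

# Stand-in that splits off 's as a separate token, to show the re-attachment step.
split_possessive = lambda s: s.replace("'s", " 's").split()
print(preprocess("Patient's heart", split_possessive))   # patient's heart
```

Note that the non-ASCII step replaces characters rather than transliterating them, so a Greek letter such as α in the first example is dropped from the output.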