ccasimiro committed on
Commit
f7ae865
1 Parent(s): 5a46ae0

update readme

Files changed (1)
  1. README.md +9 -9
README.md CHANGED
@@ -40,14 +40,14 @@ The training corpus is composed of several biomedical corpora in Spanish, collec
 
  | Name | No. tokens | Description |
  |-----------------------------------------------------------------------------------------|-------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
- | [Medical crawler](https://zenodo.org/record/4561971#.YTtwM32xXbQ) | 745,705,946 | Crawler of more than 3,000 URLs belonging to Spanish health domain |
- | Scielo | 60,007,289 | Collection of biomedical literature in Spanish crawled from the Scielo repository in 2019 |
- | [BARR2_background](https://temu.bsc.es/BARR2/downloads/background_set.raw_text.tar.bz2) | 24,516,442 | Biomedical Abbreviation Recognition and Resolution (BARR2) containing Spanish clinical case study sections from a variety of clinical disciplines |
- | Wikipedia_life_sciences | 13,890,501 | Wikipedia articles beloging to the Life Sciences category crawled on 04/01/2021 |
- | Patents | 13,463,387 | Google Patent in Medical Domain for Spain (Spanish). The accepted codes (Medical Domain) for Json files of patents are: "A61B", "A61C","A61F", "A61H", "A61K", "A61L","A61M", "A61B", "A61P" |
- | [EMEA](http://opus.nlpl.eu/download.php?f=EMEA/v3/moses/en-es.txt.zip) | 5,377,448 | Spanish-side documents extracted from the a parallel corpus made out of PDF documents from the European Medicines Agency. |
- | [mespen_Medline](https://zenodo.org/record/3562536#.YTt1fH2xXbR) | 4,166,077 | Spanish-side documents extracted from a collection of Spanish-English parallel corpora consistiing of biomedical scientific literature. The collection of parallel resources are aggregated from the IBECS, SciELO, Pubmed and MedlinePlus sources. |
- | PubMed | 1,858,966 | Collection of biomedical literature in Spanish crawled from the PubMed repository in 2019 |
+ | [Medical crawler](https://zenodo.org/record/4561971#.YTtwM32xXbQ) | 745,705,946 | Crawler of more than 3,000 URLs belonging to Spanish biomedical and health domains. |
+ | [Scielo](https://github.com/PlanTL-SANIDAD/SciELO-Spain-Crawler) | 60,007,289 | Publications written in Spanish crawled from the Spanish SciELO server in 2017. |
+ | [BARR2_background](https://temu.bsc.es/BARR2/downloads/background_set.raw_text.tar.bz2) | 24,516,442 | Biomedical Abbreviation Recognition and Resolution (BARR2) containing Spanish clinical case study sections from a variety of clinical disciplines. |
+ | Wikipedia_life_sciences | 13,890,501 | Wikipedia articles belonging to the Life Sciences category crawled on 04/01/2021 |
+ | Patents | 13,463,387 | Google Patent in Medical Domain for Spain (Spanish). The accepted codes (Medical Domain) for Json files of patents are: "A61B", "A61C","A61F", "A61H", "A61K", "A61L","A61M", "A61B", "A61P". |
+ | [EMEA](http://opus.nlpl.eu/download.php?f=EMEA/v3/moses/en-es.txt.zip) | 5,377,448 | Spanish-side documents extracted from parallel corpora made out of PDF documents from the European Medicines Agency. |
+ | [mespen_Medline](https://zenodo.org/record/3562536#.YTt1fH2xXbR) | 4,166,077 | Spanish-side articles extracted from a collection of Spanish-English parallel corpus consisting of biomedical scientific literature. The collection of parallel resources are aggregated from the MedlinePlus source. |
+ | PubMed | 1,858,966 | Open-access articles from the PubMed repository crawled in 2017. |
 
  To obtain a high-quality training corpus, a cleaning pipeline with the following operations has been applied:
 
@@ -84,7 +84,7 @@ The evaluation results are compared against the [mBERT](https://huggingface.co/b
 
  The model is ready-to-use only for masked language modelling to perform the Fill Mask task (try the inference API or read the next section)
 
- However, the is intended to be fine-tuned on non-generative downstream tasks such as Question Answering, Text Classification or Named Entity Recognition.
+ However, the is intended to be fine-tuned on downstream tasks such as Named Entity Recognition or Text Classification.
 
  ---
 