---
# For reference on model card metadata, see the spec: https://github.com/huggingface/hub-docs/blob/main/modelcard.md?plain=1
# Doc / guide: https://huggingface.co/docs/hub/model-cards
{}
---

# Model Card for Model ID

Developed in a joint effort between the University of Florida, NVIDIA, and IKIM, GeBERTa is a series of German DeBERTa models ranging between 122M and 750M parameters.

The pre-training dataset consists of documents from different domains:

| Category | Source Data | Data Size | #Docs | #Tokens |
| -------- | ----------- | --------- | ------ | ------- |
| Formal | Wikipedia | 9GB | 2,665,357 | 1.9B |
| Formal | News | 28GB | 12,305,326 | 6.1B |
| Formal | GC4 | 90GB | 31,669,772 | 19.4B |
| Informal | Reddit 2019-2023 (GER) | 5.8GB | 15,036,592 | 1.3B |
| Informal | Holiday Reviews | 2GB | 4,876,405 | 428M |
| Legal | OpenLegalData: German cases and laws | 5.4GB | 308,228 | 1B |
| Medical | Charite doctoral theses abstracts | 28MB | 16,947 | 5M |
| Medical | Flexikon | 106MB | 74,136 | 23M |
| Medical | NTS of Animal Experiments | 24MB | 50,310 | 4M |
| Medical | German Guideline Program in Oncology | 13MB | 4,348 | 3M |
| Medical | Springer Abstract | 79MB | 34,035 | 15M |
| Medical | CC medical texts (GER) | 3.6GB | 2,000,000 | 682M |
| Medical | Medicine Dissertations | 1.4GB | 14,496 | 295M |
| Medical | Pubmed abstracts | 8.5GB | 21,044,382 | 1.7B |
| Medical | MIMIC III | 2.6GB | 24,221,834 | 695M |
| Medical | PMC-Patients-ReCDS | 2.1GB | 1,743,344 | 414M |
| Literature | German Fiction | 1.1GB | 3,219 | 243M |
| Literature | English books | 7.1GB | 11,038 | 1.6B |
| - | Total | 167GB | 116,079,769 | 35.8B |
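
To see how the token budget in the table above is distributed across domains, the rounded `#Tokens` values can be summed per category. The snippet below is a minimal sketch: the token counts are copied by hand from the table (converted to millions), and the names `TOKENS_M`, `REPORTED_TOTAL_M`, and `tokens_by_category` are illustrative and not part of any GeBERTa tooling.

```python
# Token counts in millions, transcribed from the pre-training data table.
# The table values are rounded, so sums are approximate.
TOKENS_M = [
    ("Formal", "Wikipedia", 1900),
    ("Formal", "News", 6100),
    ("Formal", "GC4", 19400),
    ("Informal", "Reddit 2019-2023 (GER)", 1300),
    ("Informal", "Holiday Reviews", 428),
    ("Legal", "OpenLegalData: German cases and laws", 1000),
    ("Medical", "Charite doctoral theses abstracts", 5),
    ("Medical", "Flexikon", 23),
    ("Medical", "NTS of Animal Experiments", 4),
    ("Medical", "German Guideline Program in Oncology", 3),
    ("Medical", "Springer Abstract", 15),
    ("Medical", "CC medical texts (GER)", 682),
    ("Medical", "Medicine Dissertations", 295),
    ("Medical", "Pubmed abstracts", 1700),
    ("Medical", "MIMIC III", 695),
    ("Medical", "PMC-Patients-ReCDS", 414),
    ("Literature", "German Fiction", 243),
    ("Literature", "English books", 1600),
]

# Total reported in the last table row (35.8B tokens), in millions.
REPORTED_TOTAL_M = 35800


def tokens_by_category(rows):
    """Sum token counts (in millions) for each category."""
    totals = {}
    for category, _source, tokens_m in rows:
        totals[category] = totals.get(category, 0) + tokens_m
    return totals


totals = tokens_by_category(TOKENS_M)
grand_total = sum(totals.values())

# Print categories from largest to smallest with their share of all tokens.
for category, tokens_m in sorted(totals.items(), key=lambda kv: -kv[1]):
    share = 100 * tokens_m / grand_total
    print(f"{category:<10} {tokens_m / 1000:6.2f}B  {share:5.1f}%")

print(f"Sum of rows: {grand_total / 1000:.2f}B (reported: {REPORTED_TOTAL_M / 1000:.1f}B)")
```

Because the per-source figures are rounded, the summed total only matches the reported 35.8B up to rounding error.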