Update README.md
README.md CHANGED
@@ -7,27 +7,42 @@ model-index:
 results: []
 ---
 
-<!-- This model card has been generated automatically according to the information the Trainer had access to. You
-should probably proofread and complete it, then remove this comment. -->
-
 # bart-base-spelling-nl-2m
 
-This model is a fine-tuned version of [facebook/bart-base](https://huggingface.co/facebook/bart-base).
+This model is a Dutch fine-tuned version of
+[facebook/bart-base](https://huggingface.co/facebook/bart-base).
+
 It achieves the following results on the evaluation set:
 - Loss: 0.0248
 - Cer: 0.0133
 
 ## Model description
 
-More information needed
+This is a fine-tuned version of
+[facebook/bart-base](https://huggingface.co/facebook/bart-base)
+trained on spelling correction. It leans on the excellent work by
+Oliver Guhr ([github](https://github.com/oliverguhr/spelling),
+[huggingface](https://huggingface.co/oliverguhr/spelling-correction-english-base)). Training
+was performed on an AWS EC2 instance (g5.xlarge) on a single GPU.
 
 ## Intended uses & limitations
 
-More information needed
+The intended use for this model is to be a component of the
+[Valkuil.net](https://valkuil.net) context-sensitive spelling
+checker.
 
 ## Training and evaluation data
 
-More information needed
+The model was trained on a Dutch dataset composed of 4,964,203 lines
+of text from three public Dutch sources, downloaded from the
+[Opus corpus](https://opus.nlpl.eu/):
+
+- nl-europarlv7.1m.txt (2,000,000 lines)
+- nl-opensubtitles2016.1m.txt (2,000,000 lines)
+- nl-wikipedia.txt (964,203 lines)
+
+Together these texts comprise 73,818,804 tokens.
+
 
 ## Training procedure
 
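As context for the "Intended uses" section added above, here is a minimal usage sketch (not part of the commit) of how a seq2seq spelling-correction checkpoint like this one is typically called through the `transformers` pipeline API. The hub id and the misspelled sample sentence are placeholders, not taken from the card.

```python
# Hedged usage sketch: Dutch spelling correction with a fine-tuned BART
# checkpoint via the text2text-generation pipeline.
from transformers import pipeline

# Placeholder hub id; substitute the actual "<owner>/bart-base-spelling-nl-2m" path.
corrector = pipeline("text2text-generation", model="<owner>/bart-base-spelling-nl-2m")

# Invented misspelled Dutch input; the model should emit the corrected form.
text = "Ik hep gisteren een boek gelzen."
print(corrector(text, max_length=128)[0]["generated_text"])
```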
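The `Cer` figure reported in the diff is a character error rate. As a sketch of how such a score is commonly computed (the `evaluate` library and the toy prediction/reference pair below are assumptions, not the card's actual evaluation code):

```python
# Sketch: computing character error rate (CER) with Hugging Face's
# `evaluate` library (requires: pip install evaluate jiwer).
import evaluate

cer_metric = evaluate.load("cer")

# Toy pair; a real evaluation would compare model outputs to gold corrections.
predictions = ["Ik heb gisteren een boek gelesen."]
references = ["Ik heb gisteren een boek gelezen."]

# CER = (substitutions + insertions + deletions) / reference characters
print(cer_metric.compute(predictions=predictions, references=references))
```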