pszemraj committed on
Commit 5c30310
1 Parent(s): 642c619

Update README.md

Files changed (1)
  1. README.md +11 -18
README.md CHANGED
@@ -2,36 +2,29 @@
  license: apache-2.0
  base_model: pszemraj/MiniLMv2-L6-H384_R-fineweb-100k
  tags:
- - generated_from_trainer
+ - data processing
+ - data filter
+ - text quality
  metrics:
  - accuracy
- model-index:
- - name: MiniLMv2-L6-H384_R-fineweb-100k-OCR-quality-classification-cls
- results: []
+ datasets:
+ - pszemraj/OCR-quality-classification
+ language:
+ - en
  ---

- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->
+ # MiniLMv2-L6-H384_R-OCR-quality

- # MiniLMv2-L6-H384_R-fineweb-100k-OCR-quality-classification-cls
-
- This model is a fine-tuned version of [pszemraj/MiniLMv2-L6-H384_R-fineweb-100k](https://huggingface.co/pszemraj/MiniLMv2-L6-H384_R-fineweb-100k) on an unknown dataset.
+ This model is a fine-tuned version of [pszemraj/MiniLMv2-L6-H384_R-fineweb-100k](https://hf.co/pszemraj/MiniLMv2-L6-H384_R-fineweb-100k) on `pszemraj/OCR-quality-classification`
  It achieves the following results on the evaluation set:
  - Loss: 0.0162
  - Accuracy: 0.996
  - Num Input Tokens Seen: 61536256

- ## Model description
-
- More information needed

  ## Intended uses & limitations

- More information needed
-
- ## Training and evaluation data
-
- More information needed
+ predict whether a document is clean or noisy

  ## Training procedure

@@ -67,4 +60,4 @@ The following hyperparameters were used during training:
  - Transformers 4.40.2
  - Pytorch 2.2.0+cu121
  - Datasets 2.19.1
- - Tokenizers 0.19.1
+ - Tokenizers 0.19.1
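
The updated README states the intended use as predicting whether a document is clean or noisy, and the new tags mention data filtering. A minimal sketch of how such predictions might drive a filter is shown here. Several things are assumptions and not taken from the commit: the label names `"clean"` and `"noisy"`, the `min_score` threshold, the helper name `filter_clean`, and the `stub_classifier` stand-in (a toy heuristic, not the model). The classifier is passed in as a callable that returns one `{"label", "score"}` dict per input, which is the shape a `transformers` text-classification pipeline returns for a list of strings.

```python
from typing import Callable, Dict, List, Sequence

Prediction = Dict[str, object]  # e.g. {"label": "clean", "score": 0.98}


def filter_clean(
    docs: Sequence[str],
    classify: Callable[[List[str]], List[Prediction]],
    clean_label: str = "clean",  # assumed label name, not confirmed by the README
    min_score: float = 0.5,      # hypothetical confidence threshold
) -> List[str]:
    """Keep only documents labelled clean with at least min_score confidence."""
    preds = classify(list(docs))
    return [
        doc
        for doc, pred in zip(docs, preds)
        if pred["label"] == clean_label and float(pred["score"]) >= min_score
    ]


def stub_classifier(texts: List[str]) -> List[Prediction]:
    """Toy stand-in for the real model: text with a high share of
    non-alphanumeric, non-space characters is labelled noisy."""
    results: List[Prediction] = []
    for text in texts:
        junk = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
        ratio = junk / max(len(text), 1)
        label = "noisy" if ratio > 0.2 else "clean"
        results.append({"label": label, "score": max(ratio, 1.0 - ratio)})
    return results


# With transformers installed, the stub could be swapped for the real model:
#   from transformers import pipeline
#   classify = pipeline("text-classification", model="<repo id of this model>")
docs = [
    "The quick brown fox jumps over the lazy dog.",
    "Th3 qu!ck br@wn f#x ~~ j%mps ^^ &*( ||",
]
kept = filter_clean(docs, stub_classifier)
```

Passing the classifier in as a callable keeps the filtering logic independent of any particular model, so the threshold and label name can be adjusted once the model's actual label mapping has been checked.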