Kalamazooter committed on
Commit 3874964 (1 parent: 20af765)

Update Readme.md

  1. README.md +46 -0
README.md CHANGED
@@ -1,3 +1,49 @@
  ---
  license: apache-2.0
+ language: nl
+ tags:
+ - BERTje
+ - Filtering
+ - Data Cleaning
  ---
+ ## Model description
+
+ This model is intended to make it easy to filter large synthetic datasets in the Dutch language.
+ It was trained mainly to pick out strings with a lot of repetition, odd grammar, or refusals, and returns one of three labels: "Correct", "Error", or "Refusal".
+
+ THIS IS NOT THE FINAL VERSION, MORE ITERATIONS IN THE NEXT FEW WEEKS
+ ## How to use
+
+ ```python
+ import json
+
+ from transformers import AutoTokenizer, BertForSequenceClassification, pipeline
+
+ model = BertForSequenceClassification.from_pretrained("Kalamazooter/DutchDatasetCleaner_Bertje")
+ tokenizer = AutoTokenizer.from_pretrained("Kalamazooter/DutchDatasetCleaner_Bertje", model_max_length=512)
+ text_classification = pipeline(
+     "text-classification",
+     model=model,
+     tokenizer=tokenizer,
+ )
+
+ # Forwarded to the tokenizer on every call so long texts are truncated to 512 tokens
+ tokenizer_kwargs = {'padding': True, 'truncation': True, 'max_length': 512}
+
+ ErrorThreshold = 0.8  # model is slightly trigger happy on the error class, modify this value to your needs
+ Dataset = "Base_Dataset"
+
+ with open(Dataset + ".jsonl", "r") as DirtyDataset:
+     lines = DirtyDataset.readlines()
+
+ for line in lines:
+     DatasetDict = json.loads(line)
+     output = text_classification(DatasetDict['text'], **tokenizer_kwargs)
+     label = output[0]['label']
+     score = output[0]['score']
+     if label == 'Refusal':
+         with open(Dataset + "_Refused.jsonl", "a") as RefusalDataset:
+             RefusalDataset.writelines([line])
+     if label == 'Error' and score > ErrorThreshold:
+         with open(Dataset + "_Error.jsonl", "a") as ErrorDataset:
+             ErrorDataset.writelines([line])
+     # Low-confidence 'Error' predictions (score at or below the threshold) are kept as clean
+     if label == 'Correct' or (label == 'Error' and score <= ErrorThreshold):
+         with open(Dataset + "_Clean.jsonl", "a") as CorrectDataset:
+             CorrectDataset.writelines([line])
+ ```
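
The three `if` branches in the README's loop amount to one routing rule: refusals go to one file, confident errors to another, and everything else is kept. The sketch below is not part of this commit. It isolates that rule as a pure function so it can be checked without loading the model. The names `route` and `LABELS` are hypothetical, and the returned strings simply mirror the `_Refused`, `_Error`, and `_Clean` file suffixes used in the loop. It assumes a score exactly equal to the threshold should count as clean.

```python
# Hypothetical helper, not part of the committed README.
LABELS = {"Correct", "Error", "Refusal"}


def route(label: str, score: float, error_threshold: float = 0.8) -> str:
    """Return the output-file suffix for one classified record."""
    if label not in LABELS:
        raise ValueError(f"unexpected label: {label!r}")
    if label == "Refusal":
        return "Refused"
    if label == "Error" and score > error_threshold:
        return "Error"
    # "Correct", and "Error" predictions at or below the threshold, are kept.
    return "Clean"
```

With such a helper, the loop body could open `Dataset + "_" + route(label, score, ErrorThreshold) + ".jsonl"` in append mode and write the line there, replacing the three separate branches.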