---
license: apache-2.0
language: nl
tags:
- BERTje
- Filtering
- Data Cleaning
---
## Model description

This model is intended to make it easy to filter large synthetic datasets in Dutch.
It was mostly trained to pick out strings with a lot of repetition, odd grammar, or refusals, and it returns one of three labels: `Correct`, `Error`, or `Refusal`.

This is not the final version; more iterations are planned over the next few weeks.

## How to use

```python
from transformers import AutoTokenizer, BertForSequenceClassification, pipeline
import json
model = BertForSequenceClassification.from_pretrained("Kalamazooter/DutchDatasetCleaner_Bertje")
tokenizer = AutoTokenizer.from_pretrained("Kalamazooter/DutchDatasetCleaner_Bertje", model_max_length=512)
text_classification = pipeline(
    "text-classification",
    model=model,
    tokenizer=tokenizer,
)

tokenizer_kwargs = {'padding':True,'truncation':True,'max_length':512}

ErrorThreshold = 0.8 # The model is slightly trigger-happy on the Error class; adjust this value to your needs
Dataset = "Base_Dataset"

with open(Dataset+".jsonl","r",encoding="utf-8") as DirtyDataset:
    # Iterate over the file lazily instead of reading every line into memory
    for line in DirtyDataset:
        if not line.strip():
            continue  # skip blank lines
        DatasetDict = json.loads(line)
        output = text_classification(DatasetDict['text'],**tokenizer_kwargs)
        label = output[0]['label']
        score = output[0]['score']
        if label == 'Refusal':
            suffix = "_Refused"
        elif label == 'Error' and score > ErrorThreshold:
            suffix = "_Error"
        else:
            # 'Correct', or an 'Error' prediction at or below the threshold
            suffix = "_Clean"
        # Append the original JSON line to the matching output file
        with open(Dataset+suffix+".jsonl","a",encoding="utf-8") as OutputDataset:
            OutputDataset.write(line)
```
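
The routing rule used in the loop above can be pulled out into a small pure function, which makes the threshold behaviour easy to check without loading the model. This is only a minimal sketch: `route_prediction` and the bucket names are hypothetical helpers, not part of the model or of `transformers`, and the example predictions are made up. It assumes the pipeline returns the labels `Correct`, `Error`, and `Refusal`, and it treats a score exactly at the threshold as clean, which is an assumption.

```python
def route_prediction(label: str, score: float, error_threshold: float = 0.8) -> str:
    """Return the output bucket ("Refused", "Error" or "Clean") for one prediction.

    Hypothetical helper: mirrors the if/else rule from the usage example.
    """
    if label == "Refusal":
        return "Refused"
    if label == "Error" and score > error_threshold:
        return "Error"
    # "Correct", or an "Error" prediction that is not confident enough
    return "Clean"


# Hand-written predictions in the shape the pipeline returns (values are made up)
examples = [
    {"label": "Refusal", "score": 0.99},
    {"label": "Error", "score": 0.95},
    {"label": "Error", "score": 0.60},
    {"label": "Correct", "score": 0.90},
]

buckets = [route_prediction(p["label"], p["score"]) for p in examples]
print(buckets)
```

Raising `error_threshold` moves more `Error` predictions into the clean bucket, so it may be worth inspecting a sample of each output file before settling on a value.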