---
license: apache-2.0
base_model: line-corporation/line-distilbert-base-japanese
tags:
- generated_from_trainer
model-index:
- name: fluency-score-classification-ja
  results: []
---


# fluency-score-classification-ja

This model is a fine-tuned version of [line-corporation/line-distilbert-base-japanese](https://huggingface.co/line-corporation/line-distilbert-base-japanese) on the ["日本語文法誤りデータセット"](https://github.com/liwii/ja_perturbed/tree/main).
It achieves the following results on the evaluation set:
- Loss: 0.1912
- ROC AUC: 0.9811

## Model description
This model wraps [line-corporation/line-distilbert-base-japanese](https://huggingface.co/line-corporation/line-distilbert-base-japanese) with [DistilBertForSequenceClassification](https://huggingface.co/docs/transformers/v4.34.0/en/model_doc/distilbert#transformers.DistilBertForSequenceClassification) to make a binary classifier.

## Intended uses & limitations
This model classifies whether a given Japanese text is fluent (i.e., free of grammatical errors).
Example usage:

```python
# Load the tokenizer and the model
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("line-corporation/line-distilbert-base-japanese", trust_remote_code=True)
model = AutoModelForSequenceClassification.from_pretrained("liwii/fluency-score-classification-ja")

# Make predictions
input_tokens = tokenizer([
        '黒い猫が',
        '黒い猫がいます',
        'あっちの方で黒い猫があくびをしています',
        'あっちの方でで黒い猫ががあくびをしています',
        'ある日の暮方の事である。一人の下人が、羅生門の下で雨やみを待っていた。'
    ],
    return_tensors='pt',
    padding=True)

with torch.no_grad():
    # Run the forward pass without tracking gradients
    output = model(**input_tokens)
    # Probabilities of [not_fluent, fluent]
    probs = torch.nn.functional.softmax(output.logits, dim=1)
probs[:, 1] # => tensor([0.1007, 0.2416, 0.5635, 0.0453, 0.7701])
```

Scores can be low for short sentences even if they contain no grammatical errors, because the training dataset consists of long sentences.
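If hard labels are needed rather than scores, one option is a fixed cutoff on the fluent-class probability. The sketch below applies it to the scores from the usage example; the `to_labels` helper and the 0.5 threshold are assumptions made for illustration and are not part of the model.

```python
from typing import List


def to_labels(fluent_probs: List[float], threshold: float = 0.5) -> List[str]:
    """Map fluent-class probabilities to labels with a fixed cutoff (0.5 is an assumed default)."""
    return ["fluent" if p >= threshold else "not_fluent" for p in fluent_probs]


# Scores copied from the usage example
scores = [0.1007, 0.2416, 0.5635, 0.0453, 0.7701]
labels = to_labels(scores)
print(labels)  # ['not_fluent', 'not_fluent', 'fluent', 'not_fluent', 'fluent']
```

Because short sentences tend to score low, a single threshold may misclassify them; choosing the cutoff on held-out data of the target length is likely a better fit.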

## Training and evaluation data
The data comes from the ["日本語文法誤りデータセット"](https://github.com/liwii/ja_perturbed/tree/main). 512 rows were held out as the evaluation dataset, and the rest of the dataset was used for training.
In each split, the "original" rows were used as examples with the "fluent" label, and the "perturbed" rows as examples with the "not fluent" label.
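The split and labeling can be sketched in plain Python. This is a minimal illustration, not the script used for this model: it assumes each row carries an `"original"` and a `"perturbed"` text field, and the shuffle, the seed, and the helper names `label_rows` and `build_splits` are hypothetical.

```python
import random
from typing import Dict, List, Tuple

Example = Tuple[str, int]  # (text, label), where 1 = fluent and 0 = not fluent


def label_rows(rows: List[Dict[str, str]]) -> List[Example]:
    """Turn each row into one fluent example and one not-fluent example."""
    examples: List[Example] = []
    for row in rows:
        examples.append((row["original"], 1))
        examples.append((row["perturbed"], 0))
    return examples


def build_splits(
    rows: List[Dict[str, str]], eval_rows: int = 512, seed: int = 42
) -> Tuple[List[Example], List[Example]]:
    """Hold out `eval_rows` rows for evaluation; shuffling first is an assumption."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    eval_part, train_part = shuffled[:eval_rows], shuffled[eval_rows:]
    return label_rows(train_part), label_rows(eval_part)


# Hypothetical toy rows standing in for the real dataset
rows = [{"original": f"sentence {i}", "perturbed": f"sentence {i} x"} for i in range(1000)]
train_set, eval_set = build_splits(rows)
print(len(train_set), len(eval_set))  # 976 1024
```

Splitting by row before labeling keeps each original sentence and its perturbed variant in the same split, which avoids leaking near-duplicates into the evaluation data.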

## Training procedure
The model was fine-tuned for 5 epochs. The parameters of the original DistilBERT were frozen during fine-tuning.
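Freezing the encoder can be sketched as follows. To stay runnable without downloading weights, the example builds a tiny, randomly initialized DistilBERT; the config sizes are arbitrary assumptions, and treating `model.distilbert` as "the original DistilBERT" (leaving `pre_classifier` and `classifier` trainable) is an interpretation of the procedure above rather than the confirmed training script.

```python
from transformers import DistilBertConfig, DistilBertForSequenceClassification

# Tiny random config (assumed sizes) so the sketch runs offline
config = DistilBertConfig(
    vocab_size=100, dim=32, n_layers=1, n_heads=2, hidden_dim=64, num_labels=2
)
model = DistilBertForSequenceClassification(config)

# Freeze every parameter of the base encoder
for param in model.distilbert.parameters():
    param.requires_grad = False

# Only the classification head should remain trainable
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
print(trainable)
```

With the encoder frozen, the optimizer only updates the small head, which keeps training cheap and leaves the pretrained representations unchanged.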

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 64
- eval_batch_size: 8
- seed: 42
- distributed_type: tpu
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 5
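With `lr_scheduler_type: linear`, the learning rate decays linearly from its initial value to zero over training. The sketch below reproduces that shape in plain Python, using the 3235 total steps from the results table; the `linear_lr` helper is hypothetical, and zero warmup steps is an assumption because no warmup setting is listed.

```python
def linear_lr(
    step: int,
    total_steps: int = 3235,
    base_lr: float = 1e-5,
    warmup_steps: int = 0,  # assumption: no warmup is listed in the hyperparameters
) -> float:
    """Learning rate at `step` under linear warmup followed by linear decay to zero."""
    if warmup_steps > 0 and step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Learning rate at the start, after the first epoch (647 steps), and at the end
for step in (0, 647, 3235):
    print(step, linear_lr(step))
```

After the first of five epochs, four fifths of the steps remain, so the rate is roughly 8e-06 under these assumptions.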

### Training results

| Training Loss | Epoch | Step | Validation Loss | ROC AUC |
|:-------------:|:-----:|:----:|:---------------:|:-------:|
| 0.4582        | 1.0   | 647  | 0.2887          | 0.9679  |
| 0.2664        | 2.0   | 1294 | 0.2224          | 0.9761  |
| 0.2177        | 3.0   | 1941 | 0.2047          | 0.9793  |
| 0.1899        | 4.0   | 2588 | 0.1944          | 0.9807  |
| 0.1865        | 5.0   | 3235 | 0.1912          | 0.9811  |
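For readers unfamiliar with the metric, ROC AUC is the probability that a randomly chosen fluent example receives a higher score than a randomly chosen non-fluent one. The sketch below computes it by direct pair counting on made-up labels and scores; it is an illustration of the definition, not the evaluation code used here, and a library routine such as scikit-learn's `roc_auc_score` is the usual choice in practice.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC AUC needs at least one example of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels (1 = fluent) and fluent-class scores
example_labels = [1, 1, 0, 0, 1]
example_scores = [0.9, 0.6, 0.7, 0.2, 0.8]
print(roc_auc(example_labels, example_scores))  # 5 of 6 pairs are ranked correctly
```

Because the metric depends only on the ranking of scores, it is unaffected by the choice of classification threshold.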


### Framework versions

- Transformers 4.34.0
- Pytorch 2.0.0+cu118
- Datasets 2.14.5
- Tokenizers 0.14.0