---
license: mit
language: 
- ja
tags:
- generated_from_trainer
- ner
- bert
metrics:
- f1
model-index:
- name: xlm-roberta-ner-ja
  results: []
widget:
- text: "鈴木は4月の陽気の良い日に、鈴をつけて熊本県の阿蘇山に登った"
- text: "中国では、中国共産党による一党統治が続く"
---


# xlm-roberta-ner-japanese

(Japanese caption: 日本語の固有表現抽出のモデル)

This model is a fine-tuned version of [xlm-roberta-base](https://huggingface.co/xlm-roberta-base) (a pre-trained cross-lingual `RobertaModel`) for named entity recognition (NER) token classification.

The model is fine-tuned on the NER dataset provided by Stockmark Inc., whose data is collected from Japanese Wikipedia articles.<br>
See [here](https://github.com/stockmarkteam/ner-wikipedia-dataset) for the license of this dataset.
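
The dataset repository distributes the annotations as JSON records that pair a sentence with character-span entities. A minimal loading sketch (the `ner.json` filename, branch, and record layout are assumptions taken from the repository, not from this card):

```python
import json
import urllib.request

# Assumed location of the annotation file in the dataset repository.
URL = "https://raw.githubusercontent.com/stockmarkteam/ner-wikipedia-dataset/main/ner.json"

with urllib.request.urlopen(URL) as f:
    data = json.load(f)

# Each record pairs a sentence with its character-span entity annotations.
print(data[0])
```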

Each token is labeled with one of the following tags:

| Label id | Tag | Tag in Widget | Description |
|---|---|---|---|
| 0 | O | (None) | other or nothing |
| 1 | PER | PER | person |
| 2 | ORG | ORG | general corporation or organization |
| 3 | ORG-P | P | political organization |
| 4 | ORG-O | O | other organization |
| 5 | LOC | LOC | location |
| 6 | INS | INS | institution or facility |
| 7 | PRD | PRD | product |
| 8 | EVT | EVT | event |
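
For reference, this tag set corresponds to a label mapping along the following lines (a minimal sketch; the authoritative `id2label`/`label2id` mapping lives in the model's `config.json`):

```python
# Label ids in the order listed in the table above.
label_list = ["O", "PER", "ORG", "ORG-P", "ORG-O", "LOC", "INS", "PRD", "EVT"]

id2label = dict(enumerate(label_list))              # {0: "O", 1: "PER", ...}
label2id = {tag: i for i, tag in id2label.items()}  # {"O": 0, "PER": 1, ...}
```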

## Intended uses

```python
from transformers import pipeline

model_name = "tsmatz/xlm-roberta-ner-japanese"

# Download the fine-tuned model and its tokenizer from the Hugging Face Hub.
classifier = pipeline("token-classification", model=model_name)

# Each entry of the result carries the predicted tag, its score,
# and the character offsets of the corresponding token.
result = classifier("鈴木は4月の陽気の良い日に、鈴をつけて熊本県の阿蘇山に登った")
print(result)
```
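
If you would rather get whole entity spans than per-token predictions, the pipeline can merge consecutive subword tokens for you (a usage sketch; `aggregation_strategy` is a standard `transformers` pipeline argument, not something specific to this model):

```python
from transformers import pipeline

# aggregation_strategy="simple" groups consecutive tokens with the same
# entity tag into one span and reports an averaged score per entity.
classifier = pipeline(
    "token-classification",
    model="tsmatz/xlm-roberta-ner-japanese",
    aggregation_strategy="simple",
)
print(classifier("鈴木は4月の陽気の良い日に、鈴をつけて熊本県の阿蘇山に登った"))
```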

## Training procedure

You can download the source code for fine-tuning from [here](https://github.com/tsmatz/huggingface-finetune-japanese/blob/master/01-named-entity.ipynb).

### Training hyperparameters

The following hyperparameters were used during training (see the `TrainingArguments` sketch after this list):
- learning_rate: 5e-05
- train_batch_size: 12
- eval_batch_size: 12
- seed: 42
- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 5
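
As a hedged sketch (not the authors' exact script, which lives in the notebook linked above), these settings map onto the `transformers` Trainer API roughly as follows; the `output_dir` name is assumed:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="xlm-roberta-ner-japanese",  # assumed output directory
    learning_rate=5e-5,
    per_device_train_batch_size=12,
    per_device_eval_batch_size=12,
    seed=42,
    lr_scheduler_type="linear",
    num_train_epochs=5,
    # Adam betas=(0.9, 0.999) and epsilon=1e-08 are the Trainer defaults.
)
```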

### Training results

| Training Loss | Epoch | Step | Validation Loss | F1     |
|:-------------:|:-----:|:----:|:---------------:|:------:|
| No log        | 1.0   | 446  | 0.1510          | 0.8457 |
| No log        | 2.0   | 892  | 0.0626          | 0.9261 |
| No log        | 3.0   | 1338 | 0.0366          | 0.9580 |
| No log        | 4.0   | 1784 | 0.0196          | 0.9792 |
| No log        | 5.0   | 2230 | 0.0173          | 0.9864 |
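
The F1 values above are entity-level scores. A minimal sketch of computing such a score with the `evaluate` library's `seqeval` metric (toy IOB2-tagged inputs for illustration; the notebook's exact label scheme may differ):

```python
import evaluate

seqeval = evaluate.load("seqeval")  # pip install evaluate seqeval

# Toy example: gold and predicted tag sequences for one sentence.
references  = [["O", "B-PER", "I-PER", "O", "B-LOC"]]
predictions = [["O", "B-PER", "I-PER", "O", "O"]]

metrics = seqeval.compute(predictions=predictions, references=references)
print(metrics["overall_f1"])
```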


### Framework versions

- Transformers 4.23.1
- Pytorch 1.12.1+cu102
- Datasets 2.6.1
- Tokenizers 0.13.1