KoichiYasuoka committed on
Commit
a2e6374
1 Parent(s): 788240a

initial release

README.md ADDED
@@ -0,0 +1,44 @@
+ ---
+ language:
+ - "ko"
+ tags:
+ - "korean"
+ - "token-classification"
+ - "pos"
+ - "dependency-parsing"
+ datasets:
+ - "universal_dependencies"
+ license: "cc-by-sa-4.0"
+ pipeline_tag: "token-classification"
+ widget:
+ - text: "홍시 맛이 나서 홍시라 생각한다."
+ ---
+
+ # roberta-base-korean-morph-upos
+
+ ## Model Description
+
+ This is a RoBERTa model for Korean POS-tagging and dependency-parsing, derived from [klue/roberta-base](https://huggingface.co/klue/roberta-base) and [morphUD-korean](https://github.com/jungyeul/morphUD-korean). Every morpheme is tagged with [UPOS](https://universaldependencies.org/u/pos/) (Universal Part-Of-Speech).
+
+ ## How to Use
+
+ ```py
+ from transformers import AutoTokenizer,AutoModelForTokenClassification,TokenClassificationPipeline
+ tokenizer=AutoTokenizer.from_pretrained("KoichiYasuoka/roberta-base-korean-morph-upos")
+ model=AutoModelForTokenClassification.from_pretrained("KoichiYasuoka/roberta-base-korean-morph-upos")
+ pipeline=TokenClassificationPipeline(tokenizer=tokenizer,model=model,aggregation_strategy="simple")
+ nlp=lambda x:[(x[t["start"]:t["end"]],t["entity_group"]) for t in pipeline(x)]
+ print(nlp("홍시 맛이 나서 홍시라 생각한다."))
+ ```
+
+ or
+
+ ```py
+ import esupar
+ nlp=esupar.load("KoichiYasuoka/roberta-base-korean-morph-upos")
+ print(nlp("홍시 맛이 나서 홍시라 생각한다."))
+ ```
+
+ ## See Also
+
+ [esupar](https://github.com/KoichiYasuoka/esupar): Tokenizer POS-tagger and Dependency-parser with BERT/RoBERTa/DeBERTa models
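Note on the first README snippet: the `nlp` lambda turns each aggregated pipeline record into a `(surface text, tag)` pair by slicing the input with the record's `start` and `end` offsets. The sketch below isolates that step so it can be run without downloading the model. `fake_pipeline`, `tag_spans`, and the tags it returns are hypothetical stand-ins chosen for illustration, not actual output of this model.

```python
# Sketch of the span-extraction step used by the README's `nlp` lambda.
# `fake_pipeline` and its labels are hypothetical; they only mimic the
# record shape ("start", "end", "entity_group") of an aggregated
# token-classification pipeline.

def fake_pipeline(text):
    """Return hand-written records shaped like aggregated pipeline output."""
    return [
        {"start": 0, "end": 2, "entity_group": "NOUN"},
        {"start": 3, "end": 4, "entity_group": "NOUN"},
        {"start": 4, "end": 5, "entity_group": "ADP"},
    ]

def tag_spans(text, pipeline):
    """Pair each predicted span's surface text with its tag."""
    return [(text[t["start"]:t["end"]], t["entity_group"]) for t in pipeline(text)]

pairs = tag_spans("홍시 맛이", fake_pipeline)
print(pairs)
```

Slicing by character offsets rather than reading the record's decoded text keeps the original spelling and spacing of the input intact.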
config.json ADDED
The diff for this file is too large to render. See raw diff
 
pytorch_model.bin ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:f7d1fc5576bab4b9c9c47791db5edc1b9c0f977645882469dd5bb599b816d656
+ size 442924913
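The three lines above are a Git LFS pointer, not the weights themselves: they record the spec version, the SHA-256 of the real file, and its size in bytes. As a minimal sketch, assuming the pointer text has already been read into a string, the fields could be pulled apart like this. `parse_lfs_pointer` is a hypothetical helper written for this note, not part of any Git LFS tooling.

```python
# Sketch: split a Git LFS pointer into its fields.
# `parse_lfs_pointer` is a hypothetical helper for illustration only.

def parse_lfs_pointer(text):
    """Parse 'key value' lines of an LFS pointer into a small dict."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    algo, _, digest = fields["oid"].partition(":")
    return {
        "version": fields["version"],
        "hash_algo": algo,
        "digest": digest,
        "size": int(fields["size"]),  # size of the real file, in bytes
    }

POINTER = """version https://git-lfs.github.com/spec/v1
oid sha256:f7d1fc5576bab4b9c9c47791db5edc1b9c0f977645882469dd5bb599b816d656
size 442924913
"""

info = parse_lfs_pointer(POINTER)
print(info["hash_algo"], info["size"])
```

A downloaded `pytorch_model.bin` could then be checked against `info["size"]` and, with `hashlib.sha256`, against `info["digest"]`.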
special_tokens_map.json ADDED
@@ -0,0 +1,9 @@
+ {
+   "bos_token": "[CLS]",
+   "cls_token": "[CLS]",
+   "eos_token": "[SEP]",
+   "mask_token": "[MASK]",
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "unk_token": "[UNK]"
+ }
supar.model ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:4c0bf237ca850fab143c201dbc0e710887cb554efc04bc487304df682cf101ee
+ size 490658725
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,16 @@
+ {
+   "bos_token": "[CLS]",
+   "cls_token": "[CLS]",
+   "do_basic_tokenize": true,
+   "do_lower_case": false,
+   "eos_token": "[SEP]",
+   "mask_token": "[MASK]",
+   "model_max_length": 512,
+   "never_split": null,
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "BertTokenizerFast",
+   "unk_token": "[UNK]"
+ }
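The special tokens appear in both `special_tokens_map.json` and `tokenizer_config.json`, so the two files should agree. A quick sanity check, sketched here with the two JSON bodies from this commit pasted in as strings, could look like the following. The variable names are illustrative assumptions; in a checkout one would presumably read the files from disk instead.

```python
import json

# Sketch: confirm that the special tokens declared in special_tokens_map.json
# match those in tokenizer_config.json. Both bodies are copied from this commit.

SPECIAL_TOKENS_MAP = json.loads("""
{
  "bos_token": "[CLS]",
  "cls_token": "[CLS]",
  "eos_token": "[SEP]",
  "mask_token": "[MASK]",
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "unk_token": "[UNK]"
}
""")

TOKENIZER_CONFIG = json.loads("""
{
  "bos_token": "[CLS]",
  "cls_token": "[CLS]",
  "do_basic_tokenize": true,
  "do_lower_case": false,
  "eos_token": "[SEP]",
  "mask_token": "[MASK]",
  "model_max_length": 512,
  "never_split": null,
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "strip_accents": null,
  "tokenize_chinese_chars": true,
  "tokenizer_class": "BertTokenizerFast",
  "unk_token": "[UNK]"
}
""")

# Collect every special token whose value differs between the two files.
mismatches = {
    key: (value, TOKENIZER_CONFIG.get(key))
    for key, value in SPECIAL_TOKENS_MAP.items()
    if TOKENIZER_CONFIG.get(key) != value
}
print(mismatches)
```

Note that the tokens are BERT-style (`[CLS]`, `[SEP]`) and the tokenizer class is `BertTokenizerFast`, even though the model itself is a RoBERTa model.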
vocab.txt ADDED
The diff for this file is too large to render. See raw diff