	---
	license: mit
	base_model: microsoft/mdeberta-v3-base
	tags:
	- generated_from_trainer
	datasets:
	- universal_dependencies
	metrics:
	- accuracy
	- precision
	- recall
	model-index:
	- name: mdeberta-v3-ud-thai-pud-upos
	  results:
	  - task:
	      name: Token Classification
	      type: token-classification
	    dataset:
	      name: universal_dependencies
	      type: universal_dependencies
	      config: th_pud
	      split: test
	      args: th_pud
	    metrics:
	    - name: Accuracy
	      type: accuracy
	      value: 0.9934846474601972
	widget:
	- text: นักวิจัยกล่าวว่าการวิเคราะห์ดีเอ็นเอของเนื้องอกอาจช่วยอธิบายถึงสาเหตุที่แท้จริงของมะเร็งชนิดอื่นๆ ได้
	  example_title: test_example_1
	- text: >-
	    คือผมไม่ได้ชอบกดดันพวกคุณหรอกนะ แต่ชะตากรรมของสาธารณรัฐอยู่ในกำมือคุณ
	  example_title: test_example_2

	language:
	- th
	library_name: transformers
	---

	# mdeberta-v3-ud-thai-pud-upos

	This model is a fine-tuned version of [microsoft/mdeberta-v3-base](https://huggingface.co/microsoft/mdeberta-v3-base) on the `th_pud` (Thai PUD) configuration of the universal_dependencies dataset.
	It achieves the following results on the evaluation set:
	- Loss: 0.0303
	- Macro avg precision: 0.9235
	- Macro avg recall: 0.9228
	- Macro avg f1: 0.9231
	- Weighted avg precision: 0.9935
	- Weighted avg recall: 0.9935
	- Weighted avg f1: 0.9935
	- Accuracy: 0.9935
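
The macro and weighted averages above differ by about seven points, which is easier to interpret with the definitions in hand. The following is a minimal sketch, assuming the standard definitions (macro = unweighted mean over tags, weighted = mean over tags weighted by how often each tag occurs in the reference labels). The labels are toy data invented for illustration, not taken from Thai PUD, and the helper names `per_class_stats` and `averages` are hypothetical.

```python
from collections import Counter


def per_class_stats(y_true, y_pred):
    """Return {label: (precision, recall, f1, support)} for every label seen."""
    labels = sorted(set(y_true) | set(y_pred))
    true_count = Counter(y_true)   # support of each tag in the reference
    pred_count = Counter(y_pred)   # how often each tag was predicted
    tp = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    stats = {}
    for label in labels:
        precision = tp[label] / pred_count[label] if pred_count[label] else 0.0
        recall = tp[label] / true_count[label] if true_count[label] else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        stats[label] = (precision, recall, f1, true_count[label])
    return stats


def averages(y_true, y_pred):
    """Compute macro and support-weighted (precision, recall, f1) plus accuracy."""
    stats = per_class_stats(y_true, y_pred)
    total = len(y_true)
    macro = tuple(sum(s[i] for s in stats.values()) / len(stats) for i in range(3))
    weighted = tuple(sum(s[i] * s[3] for s in stats.values()) / total for i in range(3))
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / total
    return {"macro": macro, "weighted": weighted, "accuracy": accuracy}


# Toy data: one frequent tag handled well, one rare tag (INTJ) always missed.
y_true = ["NOUN"] * 8 + ["VERB"] + ["INTJ"]
y_pred = ["NOUN"] * 8 + ["VERB"] + ["NOUN"]
result = averages(y_true, y_pred)
print(result)
```

In this toy case a single missed rare tag pulls macro recall well below weighted recall, while weighted recall equals accuracy by construction. That matches the pattern in the reported numbers, where weighted recall and accuracy are identical; one plausible reading of the lower macro scores is that some infrequent tags are predicted less reliably than the frequent ones.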

	## Model description

	This model is trained on the UD Thai PUD corpus with `Universal Part-of-speech (UPOS)` tags, to support part-of-speech tagging for Thai.

	## Example
	```python
	from transformers import AutoModelForTokenClassification, AutoTokenizer, TokenClassificationPipeline

	# Load the fine-tuned tagger and its tokenizer from the Hugging Face Hub
	model = AutoModelForTokenClassification.from_pretrained("Pavarissy/mdeberta-v3-ud-thai-pud-upos")
	tokenizer = AutoTokenizer.from_pretrained("Pavarissy/mdeberta-v3-ud-thai-pud-upos")

	# aggregation_strategy="simple" merges adjacent tokens that share a tag;
	# it replaces the deprecated grouped_entities=True
	tagger = TokenClassificationPipeline(model=model, tokenizer=tokenizer, aggregation_strategy="simple")
	outputs = tagger("ประเทศไทย อยู่ใน ทวีป เอเชีย")
	print(outputs)
	# [{'entity_group': 'PROPN', 'score': 0.9946701, 'word': 'ประเทศไทย', 'start': 0, 'end': 9},
	#  {'entity_group': 'VERB', 'score': 0.85809743, 'word': 'อยู่ใน', 'start': 9, 'end': 16},
	#  {'entity_group': 'NOUN', 'score': 0.99632, 'word': 'ทวีป', 'start': 16, 'end': 21},
	#  {'entity_group': 'PROPN', 'score': 0.9961184, 'word': 'เอเชีย', 'start': 21, 'end': 28}]

	```
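
The pipeline returns a list of dictionaries, and the `word` field can carry surrounding whitespace because the character offsets in the example output above include the spaces between words. A minimal post-processing sketch follows; it works on a copy of that example output so it runs without downloading the model. The helper names `to_tagged_tokens` and `low_confidence` and the 0.9 threshold are illustrative assumptions, not part of the model or of `transformers`.

```python
# Copy of the example output shown in the model card
sample_outputs = [
    {"entity_group": "PROPN", "score": 0.9946701, "word": "ประเทศไทย", "start": 0, "end": 9},
    {"entity_group": "VERB", "score": 0.85809743, "word": "อยู่ใน", "start": 9, "end": 16},
    {"entity_group": "NOUN", "score": 0.99632, "word": "ทวีป", "start": 16, "end": 21},
    {"entity_group": "PROPN", "score": 0.9961184, "word": "เอเชีย", "start": 21, "end": 28},
]


def to_tagged_tokens(outputs):
    """Reduce aggregated pipeline output to (word, UPOS tag) pairs."""
    pairs = []
    for item in outputs:
        word = item["word"].strip()  # drop any whitespace attached to the span
        if word:                     # skip spans that were only whitespace
            pairs.append((word, item["entity_group"]))
    return pairs


def low_confidence(outputs, threshold=0.9):
    """List words whose score falls below an arbitrary review threshold."""
    return [item["word"].strip() for item in outputs if item["score"] < threshold]


tagged = to_tagged_tokens(sample_outputs)
uncertain = low_confidence(sample_outputs)
print(tagged)
print(uncertain)
```

Flagging low-scoring spans this way is only a convenience for manual review; the threshold should be tuned to the application.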

	### Training hyperparameters

	The following hyperparameters were used during training:
	- learning_rate: 2e-05
	- train_batch_size: 8
	- eval_batch_size: 8
	- seed: 42
	- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
	- lr_scheduler_type: linear
	- num_epochs: 10
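
To make `lr_scheduler_type: linear` concrete, here is a small sketch of how the learning rate would decay over the run. It assumes zero warmup steps (the card lists none) and takes the total of 1250 optimizer steps (10 epochs of 125 steps) from the training results table. The function `linear_lr` is a hypothetical helper written for illustration, not a `transformers` API.

```python
BASE_LR = 2e-05        # learning_rate from the hyperparameter list
STEPS_PER_EPOCH = 125  # from the training results table
TOTAL_STEPS = 1250     # 10 epochs x 125 steps
WARMUP_STEPS = 0       # assumption: no warmup is listed in the card


def linear_lr(step, base_lr=BASE_LR, total_steps=TOTAL_STEPS, warmup_steps=WARMUP_STEPS):
    """Learning rate at an optimizer step: linear warmup, then linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


# Learning rate at the start, midpoint, and end of training
for epoch in (0, 5, 10):
    step = epoch * STEPS_PER_EPOCH
    print(f"epoch {epoch:>2}  step {step:>4}  lr {linear_lr(step):.2e}")
```

Under these assumptions the rate starts at 2e-05, is halved by the end of epoch 5, and reaches zero at step 1250.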

	### Training results

	| Training Loss | Epoch | Step | Validation Loss | Macro avg precision | Macro avg recall | Macro avg f1 | Weighted avg precision | Weighted avg recall | Weighted avg f1 | Accuracy |
	|:-------------:|:-----:|:----:|:---------------:|:-------------------:|:----------------:|:------------:|:----------------------:|:-------------------:|:---------------:|:--------:|
	| No log | 1.0 | 125 | 0.3898 | 0.8417 | 0.7849 | 0.8078 | 0.9119 | 0.9112 | 0.9101 | 0.9112 |
	| No log | 2.0 | 250 | 0.1768 | 0.8765 | 0.8683 | 0.8720 | 0.9561 | 0.9560 | 0.9559 | 0.9560 |
	| No log | 3.0 | 375 | 0.1217 | 0.8972 | 0.8892 | 0.8929 | 0.9701 | 0.9701 | 0.9699 | 0.9701 |
	| 0.4709 | 4.0 | 500 | 0.0841 | 0.9057 | 0.9064 | 0.9059 | 0.9802 | 0.9800 | 0.9800 | 0.9800 |
	| 0.4709 | 5.0 | 625 | 0.0649 | 0.9128 | 0.9133 | 0.9130 | 0.9854 | 0.9853 | 0.9853 | 0.9853 |
	| 0.4709 | 6.0 | 750 | 0.0513 | 0.9147 | 0.9170 | 0.9158 | 0.9878 | 0.9877 | 0.9877 | 0.9877 |
	| 0.4709 | 7.0 | 875 | 0.0423 | 0.9199 | 0.9180 | 0.9189 | 0.9900 | 0.9900 | 0.9900 | 0.9900 |
	| 0.0857 | 8.0 | 1000 | 0.0350 | 0.9226 | 0.9207 | 0.9216 | 0.9921 | 0.9921 | 0.9921 | 0.9921 |
	| 0.0857 | 9.0 | 1125 | 0.0318 | 0.9237 | 0.9219 | 0.9228 | 0.9932 | 0.9932 | 0.9932 | 0.9932 |
	| 0.0857 | 10.0 | 1250 | 0.0303 | 0.9235 | 0.9228 | 0.9231 | 0.9935 | 0.9935 | 0.9935 | 0.9935 |


	### Framework versions

	- Transformers 4.34.1
	- Pytorch 2.1.0+cu118
	- Datasets 2.14.6
	- Tokenizers 0.14.1