---
license: apache-2.0
pipeline_tag: text-classification
language:
- yue
widget:
- text: 係唔係去食飯?
  example_title: "Cantonese"
- text: 台灣真美!
  example_title: "Traditional Chinese"
---

## Model Description

A BERT-based text classifier, fine-tuned from `bert-base-chinese`, that labels input text as either Cantonese or Traditional Chinese.

## Intended Use

- **Primary Application**: Language identification for Cantonese and Traditional Chinese texts.
- **Users**: NLP researchers and developers working with Chinese-language data.

## Training Data

Fine-tuned on the [raptorkwok/cantonese-traditional-chinese-parallel-corpus](https://huggingface.co/datasets/raptorkwok/cantonese-traditional-chinese-parallel-corpus) dataset, a parallel corpus of Cantonese and Traditional Chinese sentence pairs hosted on Hugging Face Datasets.
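
Because the corpus is parallel, each record pairs a Cantonese sentence with its Traditional Chinese counterpart, so every pair yields one example per class. The card does not document the column layout, so it is worth inspecting the splits and columns before any preprocessing:

```python
from datasets import load_dataset

# Download the corpus and print its splits and column names;
# the column layout is not documented in this card, so check it first.
ds = load_dataset("raptorkwok/cantonese-traditional-chinese-parallel-corpus")
print(ds)
```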

## Training Procedure

- **Base Model**: `bert-base-chinese`
- **Epochs**: 3
- **Learning Rate**: 2e-5 (a fine-tuning sketch follows this list)
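
These hyperparameters map directly onto the Hugging Face `Trainer`. Below is a minimal sketch of the likely setup, not the exact training script: the column names `yue`/`zh`, the batch size, and the 128-token limit are assumptions; only the base model, epoch count, and learning rate come from this card.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

raw = load_dataset("raptorkwok/cantonese-traditional-chinese-parallel-corpus")

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-chinese",
    num_labels=2,  # 0 = Cantonese, 1 = Traditional Chinese
)

def to_examples(batch):
    # Split each parallel pair into two labelled examples: the Cantonese
    # side gets label 0 and the Traditional Chinese side gets label 1.
    # "yue" and "zh" are assumed column names -- verify them via print(ds).
    texts = list(batch["yue"]) + list(batch["zh"])
    labels = [0] * len(batch["yue"]) + [1] * len(batch["zh"])
    enc = tokenizer(texts, truncation=True, max_length=128)  # length assumed
    enc["labels"] = labels
    return enc

train_ds = raw["train"].map(
    to_examples,
    batched=True,
    remove_columns=raw["train"].column_names,  # row count changes, so drop originals
)

args = TrainingArguments(
    output_dir="chinese-langid",
    num_train_epochs=3,              # from this card
    learning_rate=2e-5,              # from this card
    per_device_train_batch_size=32,  # assumed; the card does not state it
)

Trainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    tokenizer=tokenizer,  # the default collator then pads batches dynamically
).train()
```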

## How to Use

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("ming030890/chinese-langid")
model = AutoModelForSequenceClassification.from_pretrained("ming030890/chinese-langid")

text = "係唔係廣東話?"  # "Is it Cantonese or not?"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():  # inference only, so skip gradient tracking
    outputs = model(**inputs)

# Label 0 = Cantonese, label 1 = Traditional Chinese
prediction = outputs.logits.argmax(-1).item()
```
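
For quick experiments, the same model also works through the `pipeline` API. The label strings it returns come from the model's `id2label` config; if that mapping was left at the default, they will be the generic `LABEL_0` (Cantonese) and `LABEL_1` (Traditional Chinese):

```python
from transformers import pipeline

classifier = pipeline("text-classification", model="ming030890/chinese-langid")
print(classifier("係唔係廣東話?"))
# e.g. [{'label': 'LABEL_0', 'score': ...}] -- names depend on id2label
```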