---
license: apache-2.0
pipeline_tag: text-classification
language:
- yue
widget:
- text: 係唔係去食飯?
  example_title: Cantonese
- text: 台灣真美!
  example_title: Traditional Chinese
datasets:
- raptorkwok/cantonese-traditional-chinese-parallel-corpus
---

## Model Description

A BERT-based text classifier, fine-tuned from `bert-base-chinese`, that labels input text as either Cantonese or Traditional Chinese.

## Intended Use

- **Primary Application**: Language identification for Cantonese and Traditional Chinese text.
- **Users**: NLP researchers and developers working with Chinese-language data.

## Training Data

Fine-tuned on the `raptorkwok/cantonese-traditional-chinese-parallel-corpus` dataset from Hugging Face Datasets.

## Training Procedure

- **Base Model**: `bert-base-chinese`
- **Epochs**: 3
- **Learning Rate**: 2e-5

## How to Use

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

tokenizer = AutoTokenizer.from_pretrained("ming030890/chinese-langid")
model = AutoModelForSequenceClassification.from_pretrained("ming030890/chinese-langid")

text = "係唔係廣東話?"  # "Is this Cantonese?"
inputs = tokenizer(text, return_tensors="pt")

# Run inference without tracking gradients
with torch.no_grad():
    outputs = model(**inputs)

# 0 for Cantonese, 1 for Traditional Chinese
prediction = outputs.logits.argmax(-1).item()
print("Cantonese" if prediction == 0 else "Traditional Chinese")
```
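
## Reproducing Training (Sketch)

The hyperparameters above map directly onto the standard `transformers` `Trainer` API. The following is a minimal sketch rather than the exact training script: the `yue` and `zh` column names, the `train` split, the `max_length` of 128, and the batch size of 32 are assumptions that are not documented for this model or dataset.

```python
from datasets import load_dataset, Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

# Assumption: the parallel corpus exposes a "train" split with "yue" (Cantonese)
# and "zh" (Traditional Chinese) columns; adjust to the dataset's real schema.
raw = load_dataset("raptorkwok/cantonese-traditional-chinese-parallel-corpus", split="train")

# Flatten parallel pairs into (text, label) rows: 0 = Cantonese, 1 = Traditional Chinese
records = [{"text": row["yue"], "label": 0} for row in raw]
records += [{"text": row["zh"], "label": 1} for row in raw]
dataset = Dataset.from_list(records).train_test_split(test_size=0.1)

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=2)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="chinese-langid",
    num_train_epochs=3,              # as listed under Training Procedure
    learning_rate=2e-5,              # as listed under Training Procedure
    per_device_train_batch_size=32,  # assumption: batch size is not documented
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    tokenizer=tokenizer,  # enables dynamic padding via the default data collator
)
trainer.train()
```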