---
library_name: transformers
base_model: bert-base-chinese
tags:
- generated_from_trainer
datasets:
- cmrc2018
model-index:
- name: chinese_qa
  results: []
---

# bert-base-chinese-finetuned-cmrc2018

This model is a fine-tuned version of [bert-base-chinese](https://huggingface.co/bert-base-chinese) on the CMRC2018 (Chinese Machine Reading Comprehension) dataset.

## Model Description

This is a BERT-based extractive question answering model for Chinese text. Given a question and a context passage, the model locates and extracts the answer span directly from the passage.

Key features:

- Base model: bert-base-chinese
- Task: extractive question answering
- Language: Chinese
- Training dataset: CMRC2018

## Performance Metrics

Evaluation results on the test set:

- Exact Match: 59.708
- F1 score: 60.0723
- Number of evaluation samples: 6,254
- Evaluation speed: 283.054 samples/second

## Intended Uses & Limitations

### Intended Uses

- Chinese reading comprehension tasks
- Answer extraction from given documents
- Context-based question answering systems

### Limitations

- Only supports extractive QA (cannot generate free-form answers)
- Answers must appear verbatim in the context
- Does not support multi-hop reasoning
- Cannot handle unanswerable questions

## Training Details

### Training Hyperparameters

- Learning rate: 3e-05
- Train batch size: 12
- Eval batch size: 8
- Seed: 42
- Optimizer: AdamW (betas=(0.9, 0.999), epsilon=1e-08)
- LR scheduler: linear
- Number of epochs: 5.0

### Training Results

- Training time: 892.86 seconds
- Training samples: 18,960
- Training speed: 106.175 samples/second
- Training loss: 0.5625

### Framework Versions

- Transformers: 4.47.0.dev0
- PyTorch: 2.5.1+cu124
- Datasets: 3.1.0
- Tokenizers: 0.20.3

## Usage

```python
import torch
from transformers import AutoModelForQuestionAnswering, AutoTokenizer

# Load model and tokenizer
model = AutoModelForQuestionAnswering.from_pretrained("real-jiakai/bert-base-chinese-finetuned-cmrc2018")
tokenizer = AutoTokenizer.from_pretrained("real-jiakai/bert-base-chinese-finetuned-cmrc2018")

# Prepare inputs
question = "长城有多长?"  # "How long is the Great Wall?"
context = "长城是中国古代的伟大建筑工程,全长超过2万公里,横跨中国北部多个省份。"
# "The Great Wall is a great construction project of ancient China, over
# 20,000 kilometers long, spanning several provinces across northern China."

# Tokenize the question-context pair
inputs = tokenizer(
    question,
    context,
    return_tensors="pt",
    max_length=384,
    truncation=True
)

# Pick the highest-scoring start/end token positions and decode the span
with torch.no_grad():
    outputs = model(**inputs)
answer_start = torch.argmax(outputs.start_logits)
answer_end = torch.argmax(outputs.end_logits) + 1
answer = tokenizer.decode(inputs["input_ids"][0][answer_start:answer_end])

print("Answer:", answer)
```

## Citation

If you use this model, please cite the CMRC2018 dataset:

```bibtex
@inproceedings{cui-emnlp2019-cmrc2018,
  title = "A Span-Extraction Dataset for {C}hinese Machine Reading Comprehension",
  author = "Cui, Yiming and
    Liu, Ting and
    Che, Wanxiang and
    Xiao, Li and
    Chen, Zhipeng and
    Ma, Wentao and
    Wang, Shijin and
    Hu, Guoping",
  booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)",
  month = nov,
  year = "2019",
  address = "Hong Kong, China",
  publisher = "Association for Computational Linguistics",
  url = "https://www.aclweb.org/anthology/D19-1600",
  doi = "10.18653/v1/D19-1600",
  pages = "5886--5891",
}
```
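
## Pipeline Usage

The same checkpoint can also be loaded through the `question-answering` pipeline, which handles tokenization, span decoding, and character-offset mapping internally and is less error-prone than the manual argmax decoding in the usage example above. A minimal sketch, reusing the same model ID:

```python
from transformers import pipeline

# Load the fine-tuned checkpoint as a question-answering pipeline
qa = pipeline(
    "question-answering",
    model="real-jiakai/bert-base-chinese-finetuned-cmrc2018",
)

result = qa(
    question="长城有多长?",  # "How long is the Great Wall?"
    context="长城是中国古代的伟大建筑工程,全长超过2万公里,横跨中国北部多个省份。",
)
print(result["answer"], result["score"])  # extracted span and model confidence
```

## Training Configuration Sketch

For reference, the hyperparameters listed under Training Details map onto `transformers.TrainingArguments` roughly as shown below. This is an illustrative sketch, not the actual training script; the `output_dir` value is an assumption taken from the model-index name.

```python
from transformers import TrainingArguments

# Sketch only: reconstructs the reported hyperparameters. AdamW with
# betas=(0.9, 0.999) and epsilon=1e-08 is the Trainer's default optimizer.
args = TrainingArguments(
    output_dir="chinese_qa",        # assumed; matches the model-index name
    learning_rate=3e-5,
    per_device_train_batch_size=12,
    per_device_eval_batch_size=8,
    num_train_epochs=5,
    lr_scheduler_type="linear",
    seed=42,
)
```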