---
license: afl-3.0
datasets:
- issai/kazqad
language:
- kk
library_name: transformers
pipeline_tag: question-answering
base_model: nur-dev/roberta-kaz-large
---

# RoBERTa-Large-KazQAD for Question Answering

## Model Description

**RoBERTa-Large-KazQAD** is a fine-tuned version of [RoBERTa-Kaz-Large](https://huggingface.co/nur-dev/roberta-kaz-large), specifically optimized for the Question Answering (QA) task using the Kazakh Open-Domain Question Answering Dataset (KazQAD). This model is trained to extract precise answers from given contexts in the Kazakh language.

### Fine-Tuning Details

This model was fine-tuned on the KazQAD dataset, which is a Kazakh open-domain question-answering dataset. The fine-tuning process involved adjusting the model's weights to enhance its performance in answering questions based on a given text context. The dataset contains questions and passages from a variety of topics relevant to Kazakh culture, history, geography, and more, making this model highly specialized for understanding and answering questions in the Kazakh language.

## Intended Use

This model is designed for open-domain question-answering tasks in the Kazakh language. It can be used to answer factual questions based on the provided context. It is particularly useful for:

- **Kazakh Natural Language Processing (NLP) tasks**: Enhancing applications involving text comprehension, search engines, chatbots, and virtual assistants in the Kazakh language.
- **Research and Educational Purposes**: Serving as a benchmark or baseline for further research in Kazakh NLP.

### How to Use

You can use this model with the Hugging Face `Transformers` library:

```python
from transformers import AutoModelForQuestionAnswering, AutoTokenizer
import torch

# Load the fine-tuned model and tokenizer
repo_id = 'nur-dev/roberta-large-kazqad'
model = AutoModelForQuestionAnswering.from_pretrained(repo_id)
tokenizer = AutoTokenizer.from_pretrained(repo_id)

# Define the context and question
context = """
Алматы Қазақстанның ең ірі мегаполисі.
Алматы – асқақ Тянь-Шань тауы жотасының көкжасыл бауырайынан, Іле Алатауының бөктерінде, Қазақстан Республикасының оңтүстік-шығысында, Еуразия құрлығының орталығында орналасқан қала.
Бұл қаланы «қала-бақ» деп те атайды.
"""
question = "Алматы қаласы Қазақстанның қай бөлігінде орналасқан?"

# Tokenize the question-context pair; only the context is truncated
# if the pair exceeds the 512-token limit
inputs = tokenizer(
    question,
    context,
    truncation="only_second",
    max_length=512,
    return_tensors="pt"
)
input_ids = inputs["input_ids"]
attention_mask = inputs["attention_mask"]

# Perform inference
with torch.no_grad():
    outputs = model(input_ids=input_ids, attention_mask=attention_mask)
    start_logits = outputs.start_logits
    end_logits = outputs.end_logits

# Find the answer's most likely start and end positions
start_index = torch.argmax(start_logits).item()
end_index = torch.argmax(end_logits).item()

# Decode the answer tokens; the two argmax calls are independent,
# so an end before the start means no valid span was found
if end_index < start_index:
    answer = ""
else:
    answer = tokenizer.decode(
        input_ids[0][start_index:end_index + 1],
        skip_special_tokens=True
    ).strip()

print(f"Question: {question}")
print(f"Answer: {answer}")
```

## Limitations and Biases

- **Language Specificity**: This model is fine-tuned specifically for the Kazakh language and may not perform well in other languages.
- **Context Length**: The model was fine-tuned with inputs of up to 512 tokens, so a question and context that together exceed this length must be truncated or split.
- **Biases**: Like other large pre-trained language models, nur-dev/roberta-large-kazqad may exhibit biases present in its training data. Users should be cautious and critically evaluate the model’s outputs, especially for sensitive applications.

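One common way to work around the 512-token context limit is to split a long context into overlapping windows, score each window separately, and keep the highest-scoring answer span. The sketch below illustrates that idea in plain Python on token lists and logit lists. The function names (`split_into_windows`, `best_span`, `best_span_over_windows`) and the default `max_answer_len` are hypothetical and not part of this model or of `Transformers`. It also assumes that the logits cover context tokens only (in practice the question tokens are prepended to every window and should be excluded), and that raw logit sums are comparable across windows, which is a heuristic rather than a guarantee.

```python
from typing import List, Tuple


def split_into_windows(
    token_ids: List[int], window: int, stride: int
) -> List[Tuple[int, List[int]]]:
    """Split token_ids into windows of at most `window` tokens.

    Consecutive windows overlap by `stride` tokens. Returns
    (offset, chunk) pairs, where offset is the chunk's start
    position in the full sequence. (Hypothetical helper.)
    """
    if not 0 <= stride < window:
        raise ValueError("stride must satisfy 0 <= stride < window")
    step = window - stride
    windows = []
    start = 0
    while True:
        windows.append((start, token_ids[start:start + window]))
        if start + window >= len(token_ids):
            break
        start += step
    return windows


def best_span(
    start_logits: List[float], end_logits: List[float], max_answer_len: int = 30
) -> Tuple[int, int, float]:
    """Return (start, end, score) maximizing start + end logit,
    subject to start <= end and a span of at most max_answer_len tokens.

    Unlike two independent argmax calls, this can never return an
    end position that precedes the start. (Hypothetical helper.)
    """
    best = (0, 0, float("-inf"))
    for s, s_logit in enumerate(start_logits):
        last = min(s + max_answer_len, len(end_logits))
        for e in range(s, last):
            score = s_logit + end_logits[e]
            if score > best[2]:
                best = (s, e, score)
    return best


def best_span_over_windows(
    window_logits: List[Tuple[int, List[float], List[float]]],
    max_answer_len: int = 30,
) -> Tuple[int, int, float]:
    """Pick the best span across windows, in full-sequence positions.

    Each item is (offset, start_logits, end_logits) for one window.
    Comparing raw scores across windows is an assumption of this sketch.
    """
    best = (0, 0, float("-inf"))
    for offset, start_logits, end_logits in window_logits:
        s, e, score = best_span(start_logits, end_logits, max_answer_len)
        if score > best[2]:
            best = (offset + s, offset + e, score)
    return best


# Usage with made-up numbers (not real model outputs)
windows = split_into_windows(list(range(10)), window=4, stride=2)
span = best_span([0.1, 2.0, 0.3, 0.0], [3.0, 0.2, 1.5, 0.1])
```

With a real tokenizer, the windowing step does not have to be written by hand: calling the tokenizer with `truncation="only_second"`, `max_length`, `stride`, and `return_overflowing_tokens=True` produces overlapping question-context windows, and the span selection above can then be applied to each window's logits.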
## Model Authors

**Name:** Kadyrbek Nurgali
- **Email:** nurgaliqadyrbek@gmail.com
- **LinkedIn:** [Kadyrbek Nurgali](https://www.linkedin.com/in/nurgali-kadyrbek-504260231/)