---
license: afl-3.0
datasets:
- issai/kazqad
language:
- kk
library_name: transformers
pipeline_tag: question-answering
base_model: nur-dev/roberta-kaz-large
---
# RoBERTa-Large-KazQAD for Question Answering
## Model Description
**RoBERTa-Large-KazQAD** is a fine-tuned version of [RoBERTa-Kaz-Large](https://huggingface.co/nur-dev/roberta-kaz-large), specifically optimized for the Question Answering (QA) task using the Kazakh Open-Domain Question Answering Dataset (KazQAD). This model is trained to extract precise answers from given contexts in the Kazakh language.
### Fine-Tuning Details
Fine-tuning on KazQAD adapted the base model's weights to answering questions from a given text context. The dataset's questions and passages cover a range of topics, including Kazakh culture, history, and geography, so the resulting model is specialized for question answering in the Kazakh language.
## Intended Use
This model is designed for open-domain question-answering tasks in the Kazakh language. It can be used to answer factual questions based on the provided context. It is particularly useful for:
- **Kazakh Natural Language Processing (NLP) tasks**: Enhancing applications involving text comprehension, search engines, chatbots, and virtual assistants in the Kazakh language.
- **Research and Educational Purposes**: Serving as a benchmark or baseline for further research in Kazakh NLP.
### How to Use
You can easily use this model with the Hugging Face `Transformers` library:
```python
from transformers import AutoModelForQuestionAnswering, AutoTokenizer
import torch
# Load the fine-tuned model and tokenizer
repo_id = 'nur-dev/roberta-large-kazqad'
model = AutoModelForQuestionAnswering.from_pretrained(repo_id)
tokenizer = AutoTokenizer.from_pretrained(repo_id)
# Define the context and question
context = """
Алматы Қазақстанның ең ірі мегаполисі. Алматы – асқақ Тянь-Шань тауы жотасының көкжасыл бауырайынан,
Іле Алатауының бөктерінде, Қазақстан Республикасының оңтүстік-шығысында, Еуразия құрлығының орталығында орналасқан қала.
Бұл қаланы «қала-бақ» деп те атайды.
"""
question = "Алматы қаласы Қазақстанның қай бөлігінде орналасқан?"
# Tokenize the input
inputs = tokenizer(
    question,
    context,
    add_special_tokens=True,
    truncation="only_second",  # if the pair is too long, truncate only the context
    max_length=512,
    return_tensors="pt",
)
input_ids = inputs["input_ids"]
attention_mask = inputs["attention_mask"]
# Perform inference without tracking gradients
with torch.no_grad():
    outputs = model(input_ids=input_ids, attention_mask=attention_mask)
start_logits = outputs.start_logits
end_logits = outputs.end_logits
# Find the most likely start and end token positions
start_index = torch.argmax(start_logits)
end_index = torch.argmax(end_logits)
# Decode the answer tokens back into text
answer = tokenizer.decode(input_ids[0][start_index:end_index + 1], skip_special_tokens=True)
print(f"Question: {question}")
print(f"Answer: {answer}")
```
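The example above takes the argmax of the start and end logits independently, so the predicted end position can fall before the start position and yield an empty answer. A minimal sketch of one common remedy is shown below: score every valid `(start, end)` pair jointly and keep the best one. The helper name `best_span` and the default `max_answer_len` are illustrative assumptions, not part of this model or of the `Transformers` API.

```python
from typing import List, Tuple


def best_span(
    start_logits: List[float],
    end_logits: List[float],
    max_answer_len: int = 30,  # assumed cap on answer length, in tokens
) -> Tuple[int, int, float]:
    """Return (start, end, score) maximizing start_logits[s] + end_logits[e]
    subject to s <= e and a span of at most max_answer_len tokens."""
    if not start_logits or len(start_logits) != len(end_logits):
        raise ValueError("logit lists must be non-empty and of equal length")
    if max_answer_len < 1:
        raise ValueError("max_answer_len must be at least 1")

    n = len(start_logits)
    best_start, best_end, best_score = 0, 0, float("-inf")
    for s in range(n):
        # Only consider end positions at or after the start position.
        for e in range(s, min(s + max_answer_len, n)):
            score = start_logits[s] + end_logits[e]
            if score > best_score:
                best_start, best_end, best_score = s, e, score
    return best_start, best_end, best_score


# Toy logits where independent argmax gives start=2, end=1 (an invalid span).
toy_start = [0.1, 0.2, 5.0, 0.3, 0.1]
toy_end = [0.0, 4.0, 0.1, 3.0, 0.2]
span_start, span_end, span_score = best_span(toy_start, toy_end)
```

With the model outputs from the example above, the inputs would be `start_logits[0].tolist()` and `end_logits[0].tolist()`. This sketch does not exclude question or special tokens from the search, which a fuller implementation would likely want to do.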
## Limitations and Biases
- **Language Specificity**: This model is specifically fine-tuned for the Kazakh language and may not perform well in other languages.
- **Context Length**: The model has limitations with very long contexts, as it is fine-tuned for input lengths up to 512 tokens.
- **Biases**: Like other large pre-trained language models, nur-dev/roberta-large-kazqad may exhibit biases present in its training data. Users should be cautious and critically evaluate the model’s outputs, especially for sensitive applications.
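For contexts longer than the 512-token limit noted above, a common workaround is to split the context into overlapping windows, run the model on each window, and keep the highest-scoring answer. The sketch below illustrates only the windowing step on a plain list of token IDs. The function name `sliding_windows` and the parameter values are illustrative assumptions. In practice the Hugging Face tokenizers can produce such windows directly through the `stride` and `return_overflowing_tokens` arguments, and part of the 512-token budget is taken up by the question and special tokens.

```python
from typing import List


def sliding_windows(token_ids: List[int], max_len: int, stride: int) -> List[List[int]]:
    """Split token_ids into windows of at most max_len tokens.

    Consecutive windows overlap by `stride` tokens, so an answer that
    straddles a window boundary is likely to appear whole in some window.
    """
    if max_len <= 0 or not 0 <= stride < max_len:
        raise ValueError("require max_len > 0 and 0 <= stride < max_len")

    step = max_len - stride  # how far each new window advances
    windows = []
    start = 0
    while True:
        windows.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # this window already reaches the end of the input
        start += step
    return windows


# Toy example: 10 token IDs, windows of 4 tokens overlapping by 2.
toy_windows = sliding_windows(list(range(10)), max_len=4, stride=2)
```

A larger overlap lowers the chance of cutting an answer in half at the cost of more forward passes per context.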
## Model Authors
- **Name:** Kadyrbek Nurgali
- **Email:** nurgaliqadyrbek@gmail.com
- **LinkedIn:** [Kadyrbek Nurgali](https://www.linkedin.com/in/nurgali-kadyrbek-504260231/)