|
---
license: afl-3.0
datasets:
- issai/kazqad
language:
- kk
library_name: transformers
pipeline_tag: question-answering
base_model: nur-dev/roberta-kaz-large
---
|
|
|
# RoBERTa-Large-KazQAD for Question Answering |
|
|
|
## Model Description |
|
**RoBERTa-Large-KazQAD** is a fine-tuned version of [RoBERTa-Kaz-Large](https://huggingface.co/nur-dev/roberta-kaz-large), optimized for extractive question answering (QA) on the Kazakh Open-Domain Question Answering Dataset (KazQAD). Given a question and a Kazakh-language context, the model extracts the answer span from that context.
|
|
|
### Fine-Tuning Details |
|
|
|
The model was fine-tuned on KazQAD, whose questions and passages span topics in Kazakh culture, history, geography, and more. Fine-tuning adjusted the base model's weights for the extractive task of locating an answer span within a given passage, making the model highly specialized for reading comprehension in Kazakh.
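
The exact training recipe is not reproduced in this card. For orientation, a minimal extractive-QA fine-tuning sketch with the `transformers` `Trainer` is shown below; the SQuAD-style field names (`question`, `context`, `answers`) and all hyperparameter values are illustrative assumptions, not the published configuration:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForQuestionAnswering,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

dataset = load_dataset("issai/kazqad")
tokenizer = AutoTokenizer.from_pretrained("nur-dev/roberta-kaz-large")
model = AutoModelForQuestionAnswering.from_pretrained("nur-dev/roberta-kaz-large")

def preprocess(examples):
    # Tokenize question/context pairs, truncating only the context.
    tokenized = tokenizer(
        examples["question"],
        examples["context"],
        truncation="only_second",
        max_length=512,
        padding="max_length",
        return_offsets_mapping=True,
    )
    start_positions, end_positions = [], []
    for i, offsets in enumerate(tokenized["offset_mapping"]):
        # Assumed SQuAD-style schema: {"text": [...], "answer_start": [...]}
        answers = examples["answers"][i]
        if len(answers["answer_start"]) == 0:
            # Unanswerable example: point both labels at the <s> token.
            start_positions.append(0)
            end_positions.append(0)
            continue
        start_char = answers["answer_start"][0]
        end_char = start_char + len(answers["text"][0])
        sequence_ids = tokenized.sequence_ids(i)
        ctx_start = sequence_ids.index(1)
        ctx_end = len(sequence_ids) - 1 - sequence_ids[::-1].index(1)
        # If the answer was truncated away, label the example (0, 0).
        if offsets[ctx_start][0] > start_char or offsets[ctx_end][1] < end_char:
            start_positions.append(0)
            end_positions.append(0)
            continue
        # Otherwise, map character offsets to token positions.
        idx = ctx_start
        while idx <= ctx_end and offsets[idx][0] <= start_char:
            idx += 1
        start_positions.append(idx - 1)
        idx = ctx_end
        while idx >= ctx_start and offsets[idx][1] >= end_char:
            idx -= 1
        end_positions.append(idx + 1)
    tokenized["start_positions"] = start_positions
    tokenized["end_positions"] = end_positions
    tokenized.pop("offset_mapping")
    return tokenized

tokenized_ds = dataset.map(
    preprocess, batched=True, remove_columns=dataset["train"].column_names
)

args = TrainingArguments(
    output_dir="roberta-large-kazqad",
    learning_rate=2e-5,               # illustrative values only
    num_train_epochs=3,
    per_device_train_batch_size=8,
)
Trainer(model=model, args=args, train_dataset=tokenized_ds["train"]).train()
```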
|
|
|
## Intended Use |
|
|
|
This model is designed for open-domain question answering in the Kazakh language: given a context passage, it extracts the answer to a factual question. It is particularly useful for:
|
|
|
- **Kazakh Natural Language Processing (NLP) tasks**: Enhancing applications involving text comprehension, search engines, chatbots, and virtual assistants in the Kazakh language. |
|
- **Research and Educational Purposes**: Serving as a benchmark or baseline for further research in Kazakh NLP. |
|
|
|
### How to Use |
|
|
|
You can use this model with the Hugging Face `transformers` library:
|
```python
from transformers import AutoModelForQuestionAnswering, AutoTokenizer
import torch

# Load the fine-tuned model and tokenizer
repo_id = "nur-dev/roberta-large-kazqad"
model = AutoModelForQuestionAnswering.from_pretrained(repo_id)
tokenizer = AutoTokenizer.from_pretrained(repo_id)

# Define the context and question
context = """
Алматы Қазақстанның ең ірі мегаполисі. Алматы – асқақ Тянь-Шань тауы жотасының көкжасыл бауырайынан,
Іле Алатауының бөктерінде, Қазақстан Республикасының оңтүстік-шығысында, Еуразия құрлығының орталығында орналасқан қала.
Бұл қаланы «қала-бақ» деп те атайды.
"""
question = "Алматы қаласы Қазақстанның қай бөлігінде орналасқан?"

# Tokenize the question/context pair
inputs = tokenizer(question, context, return_tensors="pt")
input_ids = inputs["input_ids"]

# Perform inference
with torch.no_grad():
    outputs = model(**inputs)

# Pick the most likely start and end token positions
start_index = torch.argmax(outputs.start_logits)
end_index = torch.argmax(outputs.end_logits)

# Decode the answer span from the input tokens
answer = tokenizer.decode(input_ids[0][start_index:end_index + 1], skip_special_tokens=True)

print(f"Question: {question}")
print(f"Answer: {answer}")
```
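
For quick experiments, the same model can also be run through the high-level `pipeline` API, which handles tokenization and span decoding for you:

```python
from transformers import pipeline

qa = pipeline("question-answering", model="nur-dev/roberta-large-kazqad")

result = qa(
    question="Алматы қаласы Қазақстанның қай бөлігінде орналасқан?",
    context="Алматы – Іле Алатауының бөктерінде, Қазақстан Республикасының оңтүстік-шығысында орналасқан қала.",
)
print(result["answer"], result["score"])
```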
|
|
|
## Limitations and Biases

- **Language Specificity**: The model is fine-tuned specifically for Kazakh and is unlikely to perform well in other languages.
- **Context Length**: Inputs are limited to 512 tokens; longer contexts must be truncated or split into overlapping windows (see the sketch below).
- **Biases**: Like other large pre-trained language models, nur-dev/roberta-large-kazqad may reproduce biases present in its training data. Users should critically evaluate its outputs, especially in sensitive applications.
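
For contexts beyond the 512-token limit, a common workaround is to let the tokenizer split the text into overlapping windows and keep the best-scoring span across windows. The sketch below illustrates the idea in simplified form; a production implementation would also mask out question and padding positions before taking the argmax:

```python
from transformers import AutoModelForQuestionAnswering, AutoTokenizer
import torch

repo_id = "nur-dev/roberta-large-kazqad"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForQuestionAnswering.from_pretrained(repo_id)

def answer_long_context(question: str, context: str) -> str:
    # Split the context into overlapping 512-token windows.
    inputs = tokenizer(
        question,
        context,
        truncation="only_second",
        max_length=512,
        stride=128,
        return_overflowing_tokens=True,
        padding=True,
        return_tensors="pt",
    )
    inputs.pop("overflow_to_sample_mapping")  # not a model input
    with torch.no_grad():
        outputs = model(**inputs)
    # Keep the highest-scoring start/end pair across all windows.
    best_score, best_answer = float("-inf"), ""
    for i in range(inputs["input_ids"].shape[0]):
        start = torch.argmax(outputs.start_logits[i])
        end = torch.argmax(outputs.end_logits[i])
        score = (outputs.start_logits[i][start] + outputs.end_logits[i][end]).item()
        if start <= end and score > best_score:
            best_score = score
            best_answer = tokenizer.decode(
                inputs["input_ids"][i][start:end + 1], skip_special_tokens=True
            )
    return best_answer
```

The built-in `question-answering` pipeline applies this kind of chunking automatically through its `doc_stride` and `max_seq_len` arguments.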
|
|
|
## Model Authors |
|
|
|
**Name:** Kadyrbek Nurgali |
|
- **Email:** nurgaliqadyrbek@gmail.com |
|
- **LinkedIn:** [Kadyrbek Nurgali](https://www.linkedin.com/in/nurgali-kadyrbek-504260231/) |