How can I feed the "context" with the whole Turkish Wikipedia in JSON format?

#4
by whatnext - opened

Hi,
Thank you for publishing your model.
I am new to AI. Is it possible to load "Turkish Wikipedia" in any format into this model?
If it's not possible, what are other options?
Thank you.
Best regards.

Hey @whatnext

The model itself cannot read more than 512 tokens at a time; see max_position_embeddings in the model config: https://huggingface.co/timpal0l/mdeberta-v3-base-squad2/blob/main/config.json
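If you want to check that limit programmatically, here is a quick sketch using the standard AutoConfig API from transformers (it just reads the same config file linked above):

```python
from transformers import AutoConfig

# Pull the model's configuration straight from the Hugging Face Hub
config = AutoConfig.from_pretrained("timpal0l/mdeberta-v3-base-squad2")

# The maximum number of token positions the model can attend to at once
print(config.max_position_embeddings)  # 512
```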

You can chunk up Turkish Wikipedia, embed those chunks with an embedding model, and store them in a vector database. Then set up a retriever pipeline that fetches the chunks relevant to a question, and use timpal0l/mdeberta-v3-base-squad2 to extract the answer spans from them. Good frameworks for this are e.g. Haystack - https://haystack.deepset.ai/.
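For illustration, here is a minimal retrieve-then-read sketch without any framework. The embedding model name (paraphrase-multilingual-MiniLM-L12-v2) and the toy sentences are just assumptions for the example, not recommendations from this thread; it assumes sentence-transformers is installed alongside transformers:

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from transformers import pipeline

# Toy "corpus": in practice these would be chunks of Turkish Wikipedia articles
chunks = [
    "Ankara, Türkiye'nin başkentidir.",
    "İstanbul, Türkiye'nin en kalabalık şehridir.",
]

# 1) Embed the chunks with a multilingual embedding model (example choice)
embedder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)

# 2) Retrieve the chunk most similar to the question (cosine similarity)
question = "Türkiye'nin başkenti neresidir?"
q_vec = embedder.encode([question], normalize_embeddings=True)[0]
best_idx = int(np.argmax(chunk_vecs @ q_vec))

# 3) Extract the answer span from the retrieved chunk only
qa = pipeline("question-answering", model="timpal0l/mdeberta-v3-base-squad2")
print(qa(question=question, context=chunks[best_idx]))
```

A real setup would replace the in-memory list and argmax with a vector database, which is exactly what frameworks like Haystack manage for you.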

Hey @timpal0l
Thank you for the information and your guidance. It appears to be doable with https://github.com/deepset-ai/haystack, but it's beyond my scope for the time being. The Turkish Wikipedia dump is in XML format and is around 811 MB: https://dumps.wikimedia.org/trwiki/latest/trwiki-latest-pages-articles.xml.bz2
Concepts such as chunking up Wikipedia, storing chunks in a vector database, and embedding them with an embedding model are also new to me.
Thank you for the insight.
I'm grateful for your guidance.
Regards.

@timpal0l I found this: "Developer-friendly, serverless vector database for AI applications." https://github.com/lancedb/lancedb
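As a rough sketch of how a vector database like LanceDB could hold the embedded chunks (the table name, local path, and embedding model below are made-up placeholders, and the calls assume LanceDB's Python client with lancedb.connect / create_table / search):

```python
import lancedb
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # example embedder
chunks = [
    "Ankara, Türkiye'nin başkentidir.",
    "İstanbul, Türkiye'nin en kalabalık şehridir.",
]

# Connect to a local LanceDB database (the directory is created if missing)
db = lancedb.connect("./trwiki-lancedb")

# Store each chunk together with its embedding vector
table = db.create_table(
    "trwiki_chunks",
    data=[{"text": t, "vector": embedder.encode(t).tolist()} for t in chunks],
)

# Vector search: retrieve the chunk closest to the question embedding
query_vec = embedder.encode("Türkiye'nin başkenti neresidir?").tolist()
print(table.search(query_vec).limit(1).to_list())
```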

ChatGPT advised me this:

To fine-tune a Question Answering (QA) model like mDeBERTa on a Turkish Wikipedia dump, you'll need to follow several steps. Keep in mind that fine-tuning requires computational resources, and it's recommended to have access to a GPU for faster training. Here's a general guide:

1. Prepare your Data:

- Obtain a Turkish Wikipedia dump. You can find dumps on Wikimedia Dumps (https://dumps.wikimedia.org/). Choose the version that suits your needs and download the corresponding XML file.
- Extract relevant text from the dump. You may use tools like WikiExtractor (https://github.com/attardi/wikiextractor) to convert the XML dump into plain text (see the sketch below).
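As a rough sketch of what comes after WikiExtractor, assuming it was run with its --json option (one JSON object per line with "title" and "text" fields); the output directory name and chunk size are arbitrary placeholders:

```python
import json
from pathlib import Path

def chunk_text(text, max_words=150):
    """Split an article into roughly fixed-size word chunks."""
    words = text.split()
    for i in range(0, len(words), max_words):
        yield " ".join(words[i:i + max_words])

chunks = []
# WikiExtractor writes files like extracted/AA/wiki_00, one JSON article per line
for path in Path("extracted").rglob("wiki_*"):
    with open(path, encoding="utf-8") as f:
        for line in f:
            article = json.loads(line)
            for chunk in chunk_text(article["text"]):
                chunks.append({"title": article["title"], "text": chunk})

print(f"Built {len(chunks)} chunks")
```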
2. Install Libraries:

- Install the necessary libraries, including Hugging Face Transformers and TensorFlow or PyTorch, depending on the backend used by the model.

```bash
pip install transformers
```
3. Fine-tune the Model:

- Write a script to fine-tune the QA model on your Turkish Wikipedia data. You can use the run_qa.py example script provided by Hugging Face Transformers (under examples/pytorch/question-answering; the flags below match that script).

```bash
python run_qa.py \
    --model_name_or_path timpal0l/mdeberta-v3-base-squad2 \
    --train_file path/to/your/train_dataset.json \
    --validation_file path/to/your/dev_dataset.json \
    --output_dir path/to/save/fine_tuned_model \
    --per_device_train_batch_size 4 \
    --num_train_epochs 3 \
    --save_steps 1000 \
    --overwrite_output_dir \
    --overwrite_cache \
    --do_train \
    --do_eval \
    --version_2_with_negative
```
- Make sure your train and dev datasets are in the SQuAD format (a JSON format with questions, contexts, and answers; see the example below).
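As a rough illustration of that format, here is a sketch that writes a single SQuAD v2-style training example to disk; the article, question, and file name are invented placeholders:

```python
import json

# A minimal SQuAD v2-style dataset: one article, one paragraph, one question
squad_data = {
    "version": "v2.0",
    "data": [
        {
            "title": "Ankara",
            "paragraphs": [
                {
                    "context": "Ankara, Türkiye'nin başkentidir.",
                    "qas": [
                        {
                            "id": "tr-0001",
                            "question": "Türkiye'nin başkenti neresidir?",
                            "is_impossible": False,
                            "answers": [{"text": "Ankara", "answer_start": 0}],
                        }
                    ],
                }
            ],
        }
    ],
}

with open("train_dataset.json", "w", encoding="utf-8") as f:
    json.dump(squad_data, f, ensure_ascii=False, indent=2)
```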
4. Evaluate and Use the Fine-Tuned Model:

- After fine-tuning, you can evaluate the model on a test set:

```bash
# --output_dir is required by the script; path/to/eval_output is a placeholder
python run_qa.py \
    --model_name_or_path path/to/saved_fine_tuned_model \
    --validation_file path/to/your/test_dataset.json \
    --output_dir path/to/eval_output \
    --per_device_eval_batch_size 4 \
    --overwrite_cache \
    --do_eval
```
- Use the model for inference:

```python
from transformers import pipeline, AutoModelForQuestionAnswering, AutoTokenizer

# mDeBERTa is not a BERT architecture, so load it with the Auto* classes
model = AutoModelForQuestionAnswering.from_pretrained("path/to/saved_fine_tuned_model")
tokenizer = AutoTokenizer.from_pretrained("path/to/saved_fine_tuned_model")

qa_pipeline = pipeline("question-answering", model=model, tokenizer=tokenizer)

result = qa_pipeline({
    "question": "Your question here?",
    "context": "The context from which to extract the answer."
})

print(result)
```
Remember to replace placeholders like path/to/your/train_dataset.json, path/to/saved_fine_tuned_model, etc., with the actual paths on your system. Fine-tuning requires careful tuning of hyperparameters and may take some time depending on your hardware resources. Adjust parameters such as per_device_train_batch_size and num_train_epochs based on your specific requirements.
