How can I feed the "context" with the whole Turkish Wikipedia in JSON format?

#4
by whatnext - opened

Hi,
Thank you for publishing your model.
I am new to AI. Is it possible to load "Turkish Wikipedia" in any format into this model?
If it's not possible, what are other options?
Thank you.
Best regards.

Hey @whatnext

The model itself cannot read more than 512 tokens at a time; see max_position_embeddings in the model config: https://huggingface.co/timpal0l/mdeberta-v3-base-squad2/blob/main/config.json
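If you want to check that limit programmatically, here is a quick sketch using the standard AutoConfig API from transformers (it just reads the same config file linked above):

```python
from transformers import AutoConfig

# Pull the model's configuration straight from the Hugging Face Hub
config = AutoConfig.from_pretrained("timpal0l/mdeberta-v3-base-squad2")

# The maximum number of token positions the model can attend to at once
print(config.max_position_embeddings)  # 512
```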

You can chunk up Turkish Wikipedia, embed those chunks with an embedding model, and store them in a vector database. Then set up a retriever pipeline that fetches the chunks relevant to a question, and use timpal0l/mdeberta-v3-base-squad2 to extract the answer spans from them. Good frameworks for this are e.g. Haystack - https://haystack.deepset.ai/.
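For illustration, here is a minimal retrieve-then-read sketch without any framework. The embedding model name (paraphrase-multilingual-MiniLM-L12-v2) and the toy sentences are just assumptions for the example, not recommendations from this thread; it assumes sentence-transformers is installed alongside transformers:

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from transformers import pipeline

# Toy "corpus": in practice these would be chunks of Turkish Wikipedia articles
chunks = [
    "Ankara, Türkiye'nin başkentidir.",
    "İstanbul, Türkiye'nin en kalabalık şehridir.",
]

# 1) Embed the chunks with a multilingual embedding model (example choice)
embedder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
chunk_vecs = embedder.encode(chunks, normalize_embeddings=True)

# 2) Retrieve the chunk most similar to the question (cosine similarity)
question = "Türkiye'nin başkenti neresidir?"
q_vec = embedder.encode([question], normalize_embeddings=True)[0]
best_idx = int(np.argmax(chunk_vecs @ q_vec))

# 3) Extract the answer span from the retrieved chunk only
qa = pipeline("question-answering", model="timpal0l/mdeberta-v3-base-squad2")
print(qa(question=question, context=chunks[best_idx]))
```

A real setup would replace the in-memory list and argmax with a vector database, which is exactly what frameworks like Haystack manage for you.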

Hey @timpal0l
Thank you for the information and your guidance. It appears to be doable with https://github.com/deepset-ai/haystack, but it's beyond my scope for the time being. The Turkish Wikipedia dump is in XML format and is around 811 MB: https://dumps.wikimedia.org/trwiki/latest/trwiki-latest-pages-articles.xml.bz2
Concepts such as chunking up Wikipedia, storing chunks in a vector database, and embedding them with an embedding model are also new to me.
Thank you for the insight.
I'm grateful for your guidance.
Regards.

@timpal0l I found this: "Developer-friendly, serverless vector database for AI applications." https://github.com/lancedb/lancedb
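As a rough sketch of how a vector database like LanceDB could hold the embedded chunks (the table name, local path, and embedding model below are made-up placeholders, and the calls assume LanceDB's Python client with lancedb.connect / create_table / search):

```python
import lancedb
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # example embedder
chunks = [
    "Ankara, Türkiye'nin başkentidir.",
    "İstanbul, Türkiye'nin en kalabalık şehridir.",
]

# Connect to a local LanceDB database (the directory is created if missing)
db = lancedb.connect("./trwiki-lancedb")

# Store each chunk together with its embedding vector
table = db.create_table(
    "trwiki_chunks",
    data=[{"text": t, "vector": embedder.encode(t).tolist()} for t in chunks],
)

# Vector search: retrieve the chunk closest to the question embedding
query_vec = embedder.encode("Türkiye'nin başkenti neresidir?").tolist()
print(table.search(query_vec).limit(1).to_list())
```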

ChatGPT advised me this:

To fine-tune a Question Answering (QA) model like mDeBERTa on a Turkish Wikipedia dump, you'll need to follow several steps. Keep in mind that fine-tuning requires computational resources, and it's recommended to have access to a GPU for faster training. Here's a general guide:

1. Prepare your Data:

- Obtain a Turkish Wikipedia dump. You can find dumps on Wikimedia Dumps (https://dumps.wikimedia.org/). Choose the version that suits your needs and download the corresponding XML file.
- Extract relevant text from the dump. You may use tools like WikiExtractor (https://github.com/attardi/wikiextractor) to convert the XML dump into plain text (see the sketch below).
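As a rough sketch of what comes after WikiExtractor, assuming it was run with its --json option (one JSON object per line with "title" and "text" fields); the output directory name and chunk size are arbitrary placeholders:

```python
import json
from pathlib import Path

def chunk_text(text, max_words=150):
    """Split an article into roughly fixed-size word chunks."""
    words = text.split()
    for i in range(0, len(words), max_words):
        yield " ".join(words[i:i + max_words])

chunks = []
# WikiExtractor writes files like extracted/AA/wiki_00, one JSON article per line
for path in Path("extracted").rglob("wiki_*"):
    with open(path, encoding="utf-8") as f:
        for line in f:
            article = json.loads(line)
            for chunk in chunk_text(article["text"]):
                chunks.append({"title": article["title"], "text": chunk})

print(f"Built {len(chunks)} chunks")
```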
2. Install Libraries:

- Install the necessary libraries, including Hugging Face Transformers and TensorFlow or PyTorch, depending on the backend used by the model.

```bash
pip install transformers
```
3. Fine-tune the Model:

- Write a script to fine-tune the QA model on your Turkish Wikipedia data. You can use the run_qa.py example script provided by Hugging Face Transformers (under examples/pytorch/question-answering; the flags below match that script).

```bash
python run_qa.py \
    --model_name_or_path timpal0l/mdeberta-v3-base-squad2 \
    --train_file path/to/your/train_dataset.json \
    --validation_file path/to/your/dev_dataset.json \
    --output_dir path/to/save/fine_tuned_model \
    --per_device_train_batch_size 4 \
    --num_train_epochs 3 \
    --save_steps 1000 \
    --overwrite_output_dir \
    --overwrite_cache \
    --do_train \
    --do_eval \
    --version_2_with_negative
```
- Make sure your train and dev datasets are in the SQuAD format (a JSON format with questions, contexts, and answers; see the example below).
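As a rough illustration of that format, here is a sketch that writes a single SQuAD v2-style training example to disk; the article, question, and file name are invented placeholders:

```python
import json

# A minimal SQuAD v2-style dataset: one article, one paragraph, one question
squad_data = {
    "version": "v2.0",
    "data": [
        {
            "title": "Ankara",
            "paragraphs": [
                {
                    "context": "Ankara, Türkiye'nin başkentidir.",
                    "qas": [
                        {
                            "id": "tr-0001",
                            "question": "Türkiye'nin başkenti neresidir?",
                            "is_impossible": False,
                            "answers": [{"text": "Ankara", "answer_start": 0}],
                        }
                    ],
                }
            ],
        }
    ],
}

with open("train_dataset.json", "w", encoding="utf-8") as f:
    json.dump(squad_data, f, ensure_ascii=False, indent=2)
```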
4. Evaluate and Use the Fine-Tuned Model:

- After fine-tuning, you can evaluate the model on a test set:

```bash
# --output_dir is required by the script; path/to/eval_output is a placeholder
python run_qa.py \
    --model_name_or_path path/to/saved_fine_tuned_model \
    --validation_file path/to/your/test_dataset.json \
    --output_dir path/to/eval_output \
    --per_device_eval_batch_size 4 \
    --overwrite_cache \
    --do_eval
```
- Use the model for inference:

```python
from transformers import pipeline, AutoModelForQuestionAnswering, AutoTokenizer

# mDeBERTa is not a BERT architecture, so load it with the Auto* classes
model = AutoModelForQuestionAnswering.from_pretrained("path/to/saved_fine_tuned_model")
tokenizer = AutoTokenizer.from_pretrained("path/to/saved_fine_tuned_model")

qa_pipeline = pipeline("question-answering", model=model, tokenizer=tokenizer)

result = qa_pipeline({
    "question": "Your question here?",
    "context": "The context from which to extract the answer."
})

print(result)
```
Remember to replace placeholders like path/to/your/train_dataset.json, path/to/saved_fine_tuned_model, etc., with the actual paths on your system. Fine-tuning requires careful tuning of hyperparameters and may take some time depending on your hardware resources. Adjust parameters such as per_device_train_batch_size and num_train_epochs based on your specific requirements.
