Model Overview

The qqp_kz model is a paraphrasing tool tailored for the Kazakh language. It is built upon the humarin/chatgpt_paraphraser_on_T5_base model, inheriting its robust architecture and adapting it to the nuances of Kazakh.

Key Features:

  • Language: Specifically designed for paraphrasing in Kazakh.
  • Base Model: Derived from chatgpt_paraphraser_on_T5_base, a proven model in paraphrasing tasks.
  • Tokenizer: Utilizes CCRss/tokenizer_t5_kz for optimal Kazakh language processing.

Data Preprocessing

The dataset used for training the qqp_kz model is preprocessed to ensure compatibility with the model and consistent input lengths:

# Importing necessary modules from the transformers library
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Initializing the tokenizer for the specific model. This tokenizer is used to convert
# text input into a format that is understandable by the model.
tokenizer = AutoTokenizer.from_pretrained("CCRss/tokenizer_t5_kz")

# Define a function for preprocessing the data. This function takes an example
# (which includes source and target texts) and tokenizes both texts using the tokenizer.
# The tokenized output is then formatted to a fixed length for consistent model input.
def preprocess_data(example):
    # Extracting the source and target texts from the example
    source = example["src"]
    target = example["trg"]
    
    # Tokenizing the source text with padding and truncation to ensure a fixed length
    source_inputs = tokenizer(source, padding="max_length", truncation=True, max_length=128)
    
    # Tokenizing the target text with padding and truncation to ensure a fixed length
    target_inputs = tokenizer(target, padding="max_length", truncation=True, max_length=128)
    
    # Returning the tokenized source as model inputs, with the target token ids as labels.
    # (Note: pad token ids in the labels are commonly replaced with -100 so the loss ignores them.)
    return {**source_inputs, "labels": target_inputs["input_ids"]}

# Applying the preprocessing function to the dataset, effectively transforming all text data
# into a tokenized format suitable for the Seq2Seq model.
encoded_dataset = dataset.map(preprocess_data)
# Setting the format of the dataset to PyTorch tensors for compatibility with the training framework.
encoded_dataset.set_format("torch")
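
The code above assumes a dataset object with "src" (source) and "trg" (target paraphrase) columns, split into "train" and "valid". The actual training corpus is not reproduced here; purely for illustration, a minimal stand-in could be built as follows (the example sentence pairs are placeholders, not the real data):

# Illustrative stand-in only: two toy Kazakh paraphrase pairs with the
# "src"/"trg" columns that preprocess_data expects.
from datasets import Dataset, DatasetDict

toy_pairs = {
    "src": ["Бүгін ауа райы қандай?", "Сен қайда тұрасың?"],
    "trg": ["Бүгінгі ауа райы қандай?", "Сенің тұратын жерің қайда?"],
}
dataset = DatasetDict({
    "train": Dataset.from_dict(toy_pairs),
    "valid": Dataset.from_dict(toy_pairs),
})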

Model Training

The model is trained with the following configuration:


# Importing the classes needed for Seq2Seq training from the transformers library
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

# Name of the pretrained model to be used for Seq2Seq learning
name_of_model = "humarin/chatgpt_paraphraser_on_T5_base"
# Loading the model from the pretrained weights
model = AutoModelForSeq2SeqLM.from_pretrained(name_of_model)

# Setting up training arguments. This includes batch size, learning rate, number of epochs,
# directories for saving results and logs, and evaluation strategy.
training_args = Seq2SeqTrainingArguments(
    per_device_train_batch_size=21,
    gradient_accumulation_steps=3,
    learning_rate=5e-5,
    save_steps=2000,
    num_train_epochs=3,
    output_dir='./results',
    logging_dir='./logs',
    logging_steps=2000,
    eval_steps=2000,
    evaluation_strategy="steps"
)

# Initializing the trainer with the model, training arguments, and the datasets for training and evaluation.
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=encoded_dataset['train'],
    eval_dataset=encoded_dataset['valid']
)

# Starting the training process of the model using the specified datasets and training arguments.
trainer.train()

Usage

The qqp_kz model is specifically designed for paraphrasing in the Kazakh language. It is highly suitable for a variety of NLP tasks such as content creation, enhancing translations, and linguistic research.

To utilize the model:

  • Install the transformers library.
  • Load the model and tokenizer from the Hugging Face Hub.
  • Pass your Kazakh text to the model for paraphrasing, as sketched below.
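
The following is a minimal inference sketch. It loads the CCRss/qqp_kz checkpoint together with the CCRss/tokenizer_t5_kz tokenizer described above; the example sentence and the decoding settings (beam search, number of candidates) are illustrative choices rather than prescribed values.

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the Kazakh tokenizer and the fine-tuned paraphrasing model from the Hub.
tokenizer = AutoTokenizer.from_pretrained("CCRss/tokenizer_t5_kz")
model = AutoModelForSeq2SeqLM.from_pretrained("CCRss/qqp_kz")

# Example Kazakh input ("What is the weather like today?").
text = "Бүгін ауа райы қандай?"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=128)

# Generate a few candidate paraphrases; the decoding parameters are example values.
outputs = model.generate(
    **inputs,
    max_length=128,
    num_beams=5,
    num_return_sequences=3,
)
for candidate in tokenizer.batch_decode(outputs, skip_special_tokens=True):
    print(candidate)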

Example Deployment

For a practical demonstration of the model in action, please refer to our Google Colab notebook. It provides a complete, step-by-step example of running inference with the qqp_kz model.

Contributions and Feedback

We welcome contributions to the qqp_kz model. If you have suggestions, improvements, or encounter any issues, please feel free to open an issue in the repository.
