
# GPT2-Tamil

This repository was created as part of the Flax/JAX community week organized by Hugging Face. The aim of this project is to pretrain a GPT-2 language model specifically for the Tamil language.

Model card metadata:

```yaml
language: ta
tags:
- text-generation
license: MIT
datasets:
- OSCAR
- IndicNLP
metrics:
- perplexity
widget:
- text: 'ஒரு ஊரிலே ஒரு காக்கைக்கு'
```

## Setup

To set up the project, run the following command:
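The command itself is not reproduced above. A minimal sketch, assuming a standard Python project layout with a `requirements.txt` at the repository root (the file name is an assumption, not confirmed by the card):

```bash
# Install the project's Python dependencies.
# requirements.txt is an assumed file name.
pip install -r requirements.txt
```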


## Datasets Used

The GPT-2 model is trained on the OSCAR (Tamil) and IndicNLP (Tamil) datasets.
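For reference, the Tamil portion of OSCAR can be loaded through the `datasets` library; the sketch below assumes the `unshuffled_deduplicated_ta` configuration of the `oscar` dataset on the Hugging Face Hub:

```python
# Load the deduplicated Tamil split of OSCAR for inspection.
from datasets import load_dataset

oscar_ta = load_dataset("oscar", "unshuffled_deduplicated_ta", split="train")
print(oscar_ta[0]["text"][:200])  # preview the first document
```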

## Train model

To perform training, follow the steps below (hedged sketches of each step follow the list):

- Export the model directory (where you want to store the model artifacts such as the config, tokenizer, etc.).

- Create the `config.json` by running the config-creation script.

- Create the tokenizer by running the tokenizer-training script.

- Once the config and tokenizer are created, run the training script to start training the Flax model.
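The commands referenced above are not reproduced in this copy of the card, so the sketches below are assumptions rather than the project's actual scripts: the script names (`create_config.py`, `train_tokenizer.py`), the model directory, and all hyperparameter values are hypothetical, while `run_clm_flax.py` is the standard causal-LM example script from the `transformers` repository.

A config-creation sketch (hypothetical `create_config.py`):

```python
# Start from the stock GPT-2 small architecture and save the config
# into the exported model directory so the training script can pick it up.
from transformers import GPT2Config

model_dir = "./gpt-2-tamil"  # assumed value of $MODEL_DIR
config = GPT2Config()        # default GPT-2 small architecture
config.save_pretrained(model_dir)
```

A tokenizer-training sketch (hypothetical `train_tokenizer.py`), using the byte-level BPE tokenizer family that GPT-2 uses:

```python
# Train a byte-level BPE tokenizer on the Tamil corpus and save it
# as tokenizer.json next to the config.
from datasets import load_dataset
from tokenizers import ByteLevelBPETokenizer

model_dir = "./gpt-2-tamil"
dataset = load_dataset("oscar", "unshuffled_deduplicated_ta", split="train")

tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(
    (sample["text"] for sample in dataset),
    vocab_size=50257,  # GPT-2's vocabulary size
    min_frequency=2,
)
tokenizer.save(f"{model_dir}/tokenizer.json")
```

Exporting the model directory and launching Flax training are shell commands; all flag values here are illustrative only:

```bash
# Export the model directory, then launch training with the
# run_clm_flax.py example script from the transformers repository.
export MODEL_DIR=./gpt-2-tamil

python run_clm_flax.py \
    --output_dir="${MODEL_DIR}" \
    --model_type="gpt2" \
    --config_name="${MODEL_DIR}" \
    --tokenizer_name="${MODEL_DIR}" \
    --dataset_name="oscar" \
    --dataset_config_name="unshuffled_deduplicated_ta" \
    --do_train \
    --do_eval \
    --block_size="512" \
    --per_device_train_batch_size="64" \
    --learning_rate="3e-4" \
    --num_train_epochs="10"
```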

## Inference

To perform language generation using the model:

- First convert the Flax model to PyTorch (the conversion command is not reproduced here; a hedged sketch follows the snippet below).

- Then use the following snippet to perform language generation:
```python
from transformers import AutoTokenizer, AutoModelWithLMHead, pipeline

# Load the published model and its tokenizer from the Hugging Face Hub.
model_name = 'abinayam/gpt-2-tamil'
model = AutoModelWithLMHead.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Tamil story-opener prompt, roughly: "In a town, a crow..."
input_text = "ஒரு ஊரிலே ஒரு காக்கைக்கு"
max_len = 300

# Build a text-generation pipeline and generate up to max_len tokens.
generator = pipeline('text-generation', model=model, tokenizer=tokenizer)
sequence = generator(input_text, max_length=max_len)
print(sequence)
```
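The Flax-to-PyTorch conversion referenced in the first step above is likewise not reproduced in this card. A minimal sketch, assuming a local directory holding the trained Flax checkpoint: `transformers` can load Flax weights directly with `from_flax=True` and re-save them in PyTorch format.

```python
# Convert the trained Flax checkpoint to PyTorch weights.
# The directory path is an assumption; point it at your trained model dir.
from transformers import GPT2LMHeadModel

model_dir = "./gpt-2-tamil"
pt_model = GPT2LMHeadModel.from_pretrained(model_dir, from_flax=True)
pt_model.save_pretrained(model_dir)  # writes pytorch_model.bin next to the Flax weights
```

Note that recent `transformers` releases deprecate `AutoModelWithLMHead` in favor of `AutoModelForCausalLM`; the generation snippet above keeps the card's original class.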