File size: 2,427 Bytes
39d6806 900235d 39d6806 18f1d33 80d39c5 18f1d33 39d6806 80d39c5 39d6806 18f1d33 80d39c5 18f1d33 80d39c5 0388313 80d39c5 900235d 18f1d33 80d39c5 0388313 752d5c1 18f1d33 80d39c5 18f1d33 900235d 18f1d33 80d39c5 18f1d33 80d39c5 18f1d33 80d39c5 18f1d33 80d39c5 18f1d33 900235d 80d39c5 39d6806 900235d 18f1d33 80d39c5 18f1d33 80d39c5 0388313 18f1d33 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 |
---
language: ta
datasets:
- oscar
- IndicNLP
widget:
- text: 'ஒரு ஊரிலே ஒரு காக்கைக்கு'
---
# GPT2-Tamil
This repository is created as part of the Flax/Jax community week by Huggingface. The aim of this project is to pretrain a language model using GPT-2 specifically for Tamil language.
## Setup:
To setup the project, run the following command,
```python
pip install -r requirements.txt
```
## Model:
Pretrained model on Tamil language using a causal language modeling (CLM) objective.
## Dataset Used:
The GTP-2 model is trained on [oscar dataset - ta](https://huggingface.co/datasets/oscar) and [IndicNLP dataset - ta](https://indicnlp.ai4bharat.org/corpora/)
## Intended uses & limitations:
You can use the raw model for next sentence prediction, but it's mostly intended to be fine-tuned on a downstream task. See the [model hub](https://huggingface.co/models?filter=gpt2) to look for fine-tuned versions on a task that interests you.
## How to pretrain the model:
To perform training, do the following steps,
- Export the model directory (where you want to store the model artifacts like config, tokenizer, etc.)
```python
>>> export MODEL_DIR=<model_dir>
```
- Create the config.json by running the following command,
```python
>>> python src/create_config.py
```
- Create the tokenizer by running the following command,
```python
>>> python src/train_tokenizer.py
```
- Once the config and tokenizer is created, run the following script to start training the flax model
```python
>>> python scripts/train_gpt2-oscar-tamil.sh
```
## How to use:
To perform language generation using the model, pipeline can be used directly.
- First convert the flax model to pytorch using the following command,
```python
python src/convert_flax_to_pytorch.py
```
- Use the following snippet to perform language generation,
```python
>>> from transformers import AutoTokenizer, AutoModelWithLMHead, pipeline
>>> model_name = 'abinayam/gpt-2-tamil'
>>> model = AutoModelWithLMHead.from_pretrained(model_name)
>>> tokenizer = AutoTokenizer.from_pretrained(model_name)
>>> set_seed(42)
>>> input_text = "ஒரு ஊரிலே ஒரு காக்கைக்கு"
>>> max_len = 300
>>> no_seq = 5
>>> generator = pipeline('text-generation', model=model, tokenizer=tokenizer)
>>> sequence = generator(input_text, max_length=max_len, num_return_sequences=no_seq)
```
|