Abinaya Mahendiran committed on
Commit 29823bd
1 Parent(s): 5f07433

Updated README

Files changed (1): README.md (+69 -1)

README.md CHANGED

The previous one-line README ("This repository contains the code/model for training GPT-2 from scratch for Tamil.") was replaced with the following content:

---
language: ta
datasets:
- oscar
- IndicNLP
widget:
- text: 'ஒரு ஊரிலே ஒரு காக்கைக்கு'
---
# GPT2-Tamil

This repository was created as part of the Flax/JAX community week organized by Hugging Face. The aim of the project is to pretrain a GPT-2 language model from scratch for the Tamil language.

## Setup:
To set up the project, run the following command:
```bash
pip install -r requirements.txt
```

## Model:
A GPT-2 model pretrained on Tamil text with a causal language modeling (CLM) objective.
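
Concretely, a CLM objective trains the model to predict each next token from the tokens that precede it. The following minimal illustration is an addition to this README, not part of the repository; it uses the published checkpoint and shows how `transformers` computes the next-token cross-entropy when the inputs are passed as labels:

```python
# Minimal illustration of the causal language modeling (CLM) objective.
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "abinayam/gpt-2-tamil"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("ஒரு ஊரிலே ஒரு காக்கைக்கு", return_tensors="pt")
# With labels == input_ids, the model shifts the labels right by one
# internally and scores each token's prediction from its prefix.
outputs = model(**inputs, labels=inputs["input_ids"])
print(outputs.loss)  # average next-token prediction loss
```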

## Dataset Used:
The GPT-2 model is trained on the Tamil subset of the [OSCAR dataset](https://huggingface.co/datasets/oscar).
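
As an added pointer (not in the original README), the Tamil portion of OSCAR can be pulled with the `datasets` library; `unshuffled_deduplicated_ta` is the standard deduplicated Tamil config of the `oscar` dataset:

```python
# Load the Tamil subset of the OSCAR corpus via the `datasets` library.
from datasets import load_dataset

dataset = load_dataset("oscar", "unshuffled_deduplicated_ta", split="train")
print(dataset[0]["text"][:200])  # preview the first document
```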

## Intended uses & limitations:
You can use the raw model for text generation, but it is mostly intended to be fine-tuned on a downstream task. See the [model hub](https://huggingface.co/models?filter=gpt) to look for fine-tuned versions on a task that interests you.

## How to pretrain the model:
To perform training, follow these steps (a hedged sketch of what the config and tokenizer scripts typically do appears after this list):

- Export the model directory (where you want to store the model artifacts like the config, tokenizer, etc.):
```bash
export MODEL_DIR=<model_dir>
```
- Create the config.json by running the following command:
```bash
python src/create_config.py
```
- Create the tokenizer by running the following command:
```bash
python src/train_tokenizer.py
```
- Once the config and tokenizer are created, run the following script to start training the Flax model:
```bash
bash scripts/train_gpt2-oscar-tamil.sh
```
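
The two helper scripts above are not reproduced in this README. As a rough sketch of what such steps typically amount to, the following is added here for orientation; the corpus path, vocabulary size, and special tokens are assumptions, not values taken from this repository's scripts:

```python
# Hypothetical sketch of src/create_config.py and src/train_tokenizer.py.
import os
from transformers import GPT2Config
from tokenizers import ByteLevelBPETokenizer

model_dir = os.environ["MODEL_DIR"]  # set earlier via `export MODEL_DIR=<model_dir>`

# 1. Create and save a GPT-2 config (starting from the base GPT-2 layout).
config = GPT2Config.from_pretrained("gpt2")
config.save_pretrained(model_dir)

# 2. Train a byte-level BPE tokenizer on a Tamil corpus and save it.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["data/oscar_ta.txt"],  # assumed corpus path
    vocab_size=50257,             # assumed; matches GPT-2's default
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save_model(model_dir)   # writes vocab.json and merges.txt
```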

## How to use:
To perform language generation with the model, the `pipeline` API can be used directly.

- First, convert the Flax model to PyTorch by running the following command (a hedged sketch of what such a conversion involves appears after this list):
```bash
python src/convert_flax_to_pytorch.py
```
- Use the following snippet to perform language generation:
```python
>>> from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, set_seed
>>> model_name = 'abinayam/gpt-2-tamil'
>>> model = AutoModelForCausalLM.from_pretrained(model_name)
>>> tokenizer = AutoTokenizer.from_pretrained(model_name)
>>> set_seed(42)
>>> input_text = "ஒரு ஊரிலே ஒரு காக்கைக்கு"
>>> max_len = 300
>>> no_seq = 5
>>> generator = pipeline('text-generation', model=model, tokenizer=tokenizer)
>>> sequence = generator(input_text, max_length=max_len, num_return_sequences=no_seq)
```
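
The conversion script itself is not shown in this README. As a hedged sketch of what a Flax-to-PyTorch conversion like `src/convert_flax_to_pytorch.py` typically does (the model directory placeholder is an assumption):

```python
# Hypothetical sketch of a Flax -> PyTorch weight conversion; the repo's
# actual src/convert_flax_to_pytorch.py may differ.
from transformers import GPT2LMHeadModel

model_dir = "<model_dir>"  # directory holding the trained Flax checkpoint (assumed)
# from_flax=True loads the Flax weights (flax_model.msgpack) into the PyTorch class.
model = GPT2LMHeadModel.from_pretrained(model_dir, from_flax=True)
model.save_pretrained(model_dir)  # writes pytorch_model.bin next to the Flax weights
```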