Abinaya Mahendiran committed on
Commit 39d6806
1 Parent(s): 18f1d33

Updated README

  1. README.md +26 -23
README.md CHANGED
@@ -1,20 +1,18 @@
- # GPT2-Tamil
-
- This repository is created as part of the Flax/Jax community week by Huggingface. The aim of this project is to train a language model using GPT-2 specifically for Tamil language.
+ ---
 
- language:
- - ta
- tags:
- - text-generation
+ language: ta
  license: MIT
  datasets:
  - OSCAR
  - IndicNLP
- metrics:
- - Preplexity
  widget:
  - text: 'ஒரு ஊரிலே ஒரு காக்கைக்கு'
 
+ ---
+ # GPT2-Tamil
+
+ This repository is created as part of the Flax/Jax community week by Huggingface. The aim of this project is to train a language model using GPT-2 specifically for Tamil language.
+
  ## Setup:
  To setup the project, run the following command,
  ``` pip install -r requirements.txt
@@ -27,32 +25,37 @@ The GTP-2 model is trained using OSCAR (Tamil) and IndicNLP (Tamil) dataset
  To perform training, do the following steps,
 
  - Export the model directory (where you want to store the model artifacts like config, tokenizer, etc.)
- ```export MODEL_DIR=<model_dir>
+ ```
+ export MODEL_DIR=<model_dir>
  ```
  - Create the config.json by running the following command,
- ```python src/create_config.py
+ ```
+ python src/create_config.py
  ```
  - Create the tokenizer by running the following command,
- ```python src/train_tokenizer.py
+ ```
+ python src/train_tokenizer.py
  ```
  - Once the config and tokenizer is created, run the following script to start training the flax model
- ```python scripts/train_gpt2-oscar-tamil.sh
+ ```
+ python scripts/train_gpt2-oscar-tamil.sh
  ```
 
  ## Inference:
- To perform language generation using the model,
+ To perform language generation using the model, pipeline can be used directly.
 
  - First convert the flax model to pytorch using the following command,
- ```python src/convert_flax_to_pytorch.py
+ ```
+ python src/convert_flax_to_pytorch.py
  ```
  - Use the following snippet to perform language generation,
  ```
- from transformers import AutoTokenizer, AutoModelWithLMHead, pipeline
- model_name = 'abinayam/gpt-2-tamil'
- model = AutoModelWithLMHead.from_pretrained(model_name)
- tokenizer = AutoTokenizer.from_pretrained(model_name)
- input_text = "ஒரு ஊரிலே ஒரு காக்கைக்கு"
- max_len = 300
- generator = pipeline('text-generation', model=model, tokenizer=tokenizer)
- sequence = generator(input_text, max_length=max_len)
+ from transformers import AutoTokenizer, AutoModelWithLMHead, pipeline
+ model_name = 'abinayam/gpt-2-tamil'
+ model = AutoModelWithLMHead.from_pretrained(model_name)
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ input_text = "ஒரு ஊரிலே ஒரு காக்கைக்கு"
+ max_len = 300
+ generator = pipeline('text-generation', model=model, tokenizer=tokenizer)
+ sequence = generator(input_text, max_length=max_len)
  ```
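
A note on the first hunk: the change moves the metadata to the very top of README.md and wraps it in `---` delimiters, which is the layout a YAML front-matter block requires. The sketch below illustrates why that placement matters. It is only an illustration under stated assumptions: `split_front_matter` is a hypothetical helper written for this note, the two sample strings are abbreviated stand-ins for the old and new README, and nothing here is meant to reflect how the Hugging Face Hub actually parses model cards.

```python
from typing import Tuple


def split_front_matter(text: str) -> Tuple[str, str]:
    """Split text into (front_matter, body).

    Hypothetical helper: front matter is recognised only when the very
    first line is '---' and a later line closes it with another '---'.
    Otherwise the front matter is '' and the text is returned unchanged.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return "", text
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            return "\n".join(lines[1:i]), "\n".join(lines[i + 1:])
    return "", text  # opening delimiter without a closing one


# Abbreviated stand-ins for the README before and after this commit.
old_readme = "# GPT2-Tamil\n\nlanguage:\n- ta\nlicense: MIT\n"
new_readme = "---\n\nlanguage: ta\nlicense: MIT\n\n---\n# GPT2-Tamil\n"

# Old layout: the heading comes first, so no metadata block is found.
old_meta, _ = split_front_matter(old_readme)

# New layout: the metadata is cleanly separated from the Markdown body.
meta, body = split_front_matter(new_readme)

print(repr(old_meta))
print(meta.strip().splitlines())
print(body)
```

With the old layout the heading precedes the metadata, so a splitter of this kind finds no front-matter block at all; with the new layout the `language`, `license`, and remaining keys are separated from the Markdown body.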