co2_eq_emissions:
  emissions: 0.443
datasets:
- keshi87/mahabharat.txt
---

### Sanskrit GPT

A small pretrained **10.65M**-parameter model, trained from scratch on the raw text of the original Mahabharata, one of the longest epics in the world.

#### Model Details

The vocabulary of the model is composed of the following character set:

```
अआइईउऊऋएऐओऔकखगघङचछजझञटठडढणतथदधनपफबभमयरलवशषसहऽािीुूृॄेैॊौ्ॢ
```

That's 65 characters.
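
The character set can presumably be derived directly from the corpus as the sorted set of its unique characters. A minimal sketch (the inline `sample` string here is a stand-in for the full corpus text):

```
# collect the unique characters of a corpus in sorted order
sample = "कृष्ण कृष्ण"       # stand-in for the full mahabharat.txt text
chars = sorted(set(sample))
vocab_size = len(chars)      # 65 when computed over the full corpus
```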

#### Tokenizer

A very basic character-level tokenizer was used for training: it simply maps each character to an integer index, with no handling of multi-character units (no subword merging).

```
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }

def encode(s):
    return [stoi[c] for c in s]  # encoder: take a string, output a list of integers

def decode(l):
    return ''.join([itos[i] for i in l])  # decoder: take a list of integers, output a string
```
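
A quick round-trip check of this mapping (a self-contained sketch; `chars` is built here from a short sample string rather than the full corpus):

```
# build the vocabulary from a short sample and verify encode/decode round-trips
sample = "धर्मक्षेत्रे कुरुक्षेत्रे"
chars = sorted(set(sample))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

encoded = [stoi[c] for c in "कुरु"]          # string -> list of integers
decoded = ''.join(itos[i] for i in encoded)  # integers -> string
assert decoded == "कुरु"                     # encoding is lossless
```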

Much more advanced tokenization schemes could be explored here, but since this was an experiment in learning LLM pretraining from scratch, things were kept deliberately basic.

#### Training

Some basic parameters used for training the model were as follows:

```
n_layer = 6
n_head = 6
n_embd = 384
dropout = 0.2
learning_rate = 1e-3
```

Training was limited to 5000 epochs on a single T4 GPU, taking around 20 minutes.
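
As a rough sanity check on the stated parameter count, a standard GPT estimate puts the transformer blocks at about 12 · n_layer · n_embd² weights (4·n_embd² for the attention projections plus 8·n_embd² for the MLP), with the token embedding table adding vocab_size · n_embd more; biases, layer norms, and position embeddings are ignored here:

```
# back-of-the-envelope parameter count from the config above
n_layer, n_embd, vocab_size = 6, 384, 65
block_params = 12 * n_layer * n_embd ** 2   # attention + MLP weight matrices
embed_params = vocab_size * n_embd          # token embedding table
total = block_params + embed_params
print(round(total / 1e6, 2))                # ~10.64M, consistent with the stated 10.65M
```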

#### Inferencing

The model can be used for inference by following the steps below:

**Load the model**

```
# init from a model saved in a specific directory
checkpoint = torch.load('./out/ckpt.pt', map_location=device)
gptconf = GPTConfig(**checkpoint['model_args'])
model = GPT(gptconf)
state_dict = checkpoint['model']
unwanted_prefix = '_orig_mod.'
for k, v in list(state_dict.items()):
    if k.startswith(unwanted_prefix):
        state_dict[k[len(unwanted_prefix):]] = state_dict.pop(k)
model.load_state_dict(state_dict)
```

**Sample generation!**

```
start_ids = encode(start)
m_tensor = torch.tensor(start_ids, dtype=torch.long, device=device)[None, ...]
with torch.no_grad():
    with ctx:
        for k in range(num_samples):
            y = model.generate(m_tensor, max_new_tokens, temperature=temperature, top_k=top_k)
            print(decode(y[0].tolist()))
            print('---------------')
```

- `num_samples`, `top_k`, `max_new_tokens`, `temperature`, etc. are sampling hyperparameters.
- `start` is the beginning text or prompt we'd want to provide for text generation.
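
To illustrate what `temperature` and `top_k` do during sampling, here is a minimal, hypothetical sketch over a toy logit vector (not the model's actual `generate` implementation): temperature divides the logits before the softmax, and top-k masks everything below the k-th largest logit.

```
import math

def sample_probs(logits, temperature=1.0, top_k=None):
    # scale logits by temperature: <1 sharpens, >1 flattens the distribution
    scaled = [l / temperature for l in logits]
    if top_k is not None:
        # keep only the top_k largest logits; mask the rest out
        cutoff = sorted(scaled, reverse=True)[top_k - 1]
        scaled = [l if l >= cutoff else float('-inf') for l in scaled]
    # softmax over the (possibly masked) logits
    exps = [math.exp(l) for l in scaled]
    total = sum(exps)
    return [e / total for e in exps]

probs = sample_probs([2.0, 1.0, 0.1], temperature=0.8, top_k=2)
# the smallest logit is masked out, so its probability is exactly 0
```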

#### Samples

- **Start:** "कृष्णद्वैपायन परॊक्ताः सुपुण्या विविधाः कथाः "

**Generation:**

"कृष्णद्वैपायन परॊक्ताः सुपुण्या विविधाः कथाः पुरुषादकः
स गतान पराहुर उदायात
दुर्यॊधन्य आयुष्यतीति
उवाच
अश्नव अपि चित्राङ्गदॊ दिशस तव
धृतराष्ट्रम अभिमन्त्रिणं सर्व एव च
एकचक्राद्य वधॊ दिव्याभिवाद्य च
परिक्षीविद एवं शूराः स चेहं बलम

मुक्तार्थायं तथा यज्ञैर अग्रतः कुलवर्धनः
परत्याच चकुरूणाम इति शरुत्वा वसु वहॊ ऽबरवीत

कामायाः स च नामधेया भभूमिः सुताः
परतिषेत्य उक्त्वा सॊ ऽनवादितः सह

अथ तत्र तथा पुष्पाणा बहुभिर भयात
स भार्या महाबाहुर बरह्मणावन अभिषेचय

कार्ये तथान्ये महीपालाः संविदाः"

**English Translation:**

"There are many pious stories told by Krishna Dvaipayana, the man-eater.

He said, "Duryodhana will live. Ashnav also shows you the picturesque Dhritarashtra, the minister, and all."

"One-wheeled killing and divine salutation. The testers and the heroes are that, and I am the strength."

"He is thus the enhancer of the family by sacrifices for the sake of liberation. Hearing that he had returned to the Chakurus, Vasu Vaho 'barvita."

"The sons of Kama, whose name was Bhabhumi. He turned around and said he was unplayed with."

"Then there were many flowers there. He consecrated his wife, the mighty-armed, as a Brahmin."

"At work and other governors contracts."