co2_eq_emissions:
  emissions: 0.443
datasets:
- keshi87/mahabharat.txt
---

### Sanskrit GPT

A small pretrained **10.65M**-parameter model, trained from scratch on the raw text of the original Mahabharata, one of the longest epics in the world.

#### Model Details

The vocabulary of the model is composed of the following character set:

```
अआइईउऊऋएऐओऔकखगघङचछजझञटठडढणतथदधनपफबभमयरलवशषसहऽािीुूृॄेैॊौ्ॢ
```

That's 65 characters.
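
The character set can presumably be derived directly from the corpus as the sorted set of its unique characters. A minimal sketch (the inline `sample` string here is a stand-in for the full corpus text):

```
# collect the unique characters of a corpus in sorted order
sample = "कृष्ण कृष्ण"       # stand-in for the full mahabharat.txt text
chars = sorted(set(sample))
vocab_size = len(chars)      # 65 when computed over the full corpus
```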

#### Tokenizer

A very basic character-level tokenizer was used for training: it simply maps each character to an integer index, with no handling of multi-character units (no subword merging).

```
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }

def encode(s):
    return [stoi[c] for c in s]  # encoder: take a string, output a list of integers

def decode(l):
    return ''.join([itos[i] for i in l])  # decoder: take a list of integers, output a string
```
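
A quick round-trip check of this mapping (a self-contained sketch; `chars` is built here from a short sample string rather than the full corpus):

```
# build the vocabulary from a short sample and verify encode/decode round-trips
sample = "धर्मक्षेत्रे कुरुक्षेत्रे"
chars = sorted(set(sample))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

encoded = [stoi[c] for c in "कुरु"]          # string -> list of integers
decoded = ''.join(itos[i] for i in encoded)  # integers -> string
assert decoded == "कुरु"                     # encoding is lossless
```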

Much more advanced tokenization schemes could be explored here, but since this was an experiment in learning LLM pretraining from scratch, things were kept deliberately basic.

#### Training

Some basic parameters used for training the model were as follows:

```
n_layer = 6
n_head = 6
n_embd = 384
dropout = 0.2
learning_rate = 1e-3
```

Training was limited to 5000 epochs on a single T4 GPU, taking around 20 minutes.
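
As a rough sanity check on the stated parameter count, a standard GPT estimate puts the transformer blocks at about 12 · n_layer · n_embd² weights (4·n_embd² for the attention projections plus 8·n_embd² for the MLP), with the token embedding table adding vocab_size · n_embd more; biases, layer norms, and position embeddings are ignored here:

```
# back-of-the-envelope parameter count from the config above
n_layer, n_embd, vocab_size = 6, 384, 65
block_params = 12 * n_layer * n_embd ** 2   # attention + MLP weight matrices
embed_params = vocab_size * n_embd          # token embedding table
total = block_params + embed_params
print(round(total / 1e6, 2))                # ~10.64M, consistent with the stated 10.65M
```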

#### Inferencing

The model can be used for inference by following the steps below:

**Load the model**

```
# init from a model saved in a specific directory
checkpoint = torch.load('./out/ckpt.pt', map_location=device)
gptconf = GPTConfig(**checkpoint['model_args'])
model = GPT(gptconf)
state_dict = checkpoint['model']
unwanted_prefix = '_orig_mod.'
for k, v in list(state_dict.items()):
    if k.startswith(unwanted_prefix):
        state_dict[k[len(unwanted_prefix):]] = state_dict.pop(k)
model.load_state_dict(state_dict)
```

**Sample generation!**

```
start_ids = encode(start)
m_tensor = torch.tensor(start_ids, dtype=torch.long, device=device)[None, ...]
with torch.no_grad():
    with ctx:
        for k in range(num_samples):
            y = model.generate(m_tensor, max_new_tokens, temperature=temperature, top_k=top_k)
            print(decode(y[0].tolist()))
            print('---------------')
```

- `num_samples`, `top_k`, `max_new_tokens`, `temperature`, etc. are sampling hyperparameters.
- `start` is the beginning text or prompt we'd want to provide for text generation.
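
To illustrate what `temperature` and `top_k` do during sampling, here is a minimal, hypothetical sketch over a toy logit vector (not the model's actual `generate` implementation): temperature divides the logits before the softmax, and top-k masks everything below the k-th largest logit.

```
import math

def sample_probs(logits, temperature=1.0, top_k=None):
    # scale logits by temperature: <1 sharpens, >1 flattens the distribution
    scaled = [l / temperature for l in logits]
    if top_k is not None:
        # keep only the top_k largest logits; mask the rest out
        cutoff = sorted(scaled, reverse=True)[top_k - 1]
        scaled = [l if l >= cutoff else float('-inf') for l in scaled]
    # softmax over the (possibly masked) logits
    exps = [math.exp(l) for l in scaled]
    total = sum(exps)
    return [e / total for e in exps]

probs = sample_probs([2.0, 1.0, 0.1], temperature=0.8, top_k=2)
# the smallest logit is masked out, so its probability is exactly 0
```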

#### Samples

- **Start:** "कृष्णद्वैपायन परॊक्ताः सुपुण्या विविधाः कथाः "

**Generation:**

"कृष्णद्वैपायन परॊक्ताः सुपुण्या विविधाः कथाः पुरुषादकः
स गतान पराहुर उदायात
दुर्यॊधन्य आयुष्यतीति
उवाच
अश्नव अपि चित्राङ्गदॊ दिशस तव
धृतराष्ट्रम अभिमन्त्रिणं सर्व एव च
एकचक्राद्य वधॊ दिव्याभिवाद्य च
परिक्षीविद एवं शूराः स चेहं बलम

मुक्तार्थायं तथा यज्ञैर अग्रतः कुलवर्धनः
परत्याच चकुरूणाम इति शरुत्वा वसु वहॊ ऽबरवीत

कामायाः स च नामधेया भभूमिः सुताः
परतिषेत्य उक्त्वा सॊ ऽनवादितः सह

अथ तत्र तथा पुष्पाणा बहुभिर भयात
स भार्या महाबाहुर बरह्मणावन अभिषेचय

कार्ये तथान्ये महीपालाः संविदाः"

**English Translation:**

"There are many pious stories told by Krishna Dvaipayana, the man-eater.

He said, "Duryodhana will live. Ashnav also shows you the picturesque Dhritarashtra, the minister, and all."

"One-wheeled killing and divine salutation. The testers and the heroes are that, and I am the strength."

"He is thus the enhancer of the family by sacrifices for the sake of liberation. Hearing that he had returned to the Chakurus, Vasu Vaho 'barvita."

"The sons of Kama, whose name was Bhabhumi. He turned around and said he was unplayed with."

"Then there were many flowers there. He consecrated his wife, the mighty-armed, as a Brahmin."

"At work and other governors contracts."