---
license: cc-by-4.0
language:
- sa
- en
tags:
- language
- sanskrit
- gpt
- llm
co2_eq_emissions:
  emissions: 0.443
datasets:
- keshi87/mahabharat.txt
---

### nano-MahaGPT

A small pretrained **10.65M**-parameter model trained on the raw text of the original Mahabharata, one of the longest epics in the world.

#### Model Details

The model uses a character-level vocabulary composed of the following set:

```
अआइईउऊऋएऐओऔकखगघङचछजझञटठडढणतथदधनपफबभमयरलवशषसहऽािीुूृॄेैॊौ्ॢ

```
That's 65 characters.

#### Tokenizer

For training, a very basic character-level tokenizer was used. It simply maps each character to an integer index in the vocabulary, without merging multi-character sequences into single tokens.

```
# `chars` is assumed to be the list of unique characters in the training text (the vocabulary above)
# create mappings between characters and integers
stoi = { ch:i for i,ch in enumerate(chars) }
itos = { i:ch for i,ch in enumerate(chars) }

def encode(s):
    return [stoi[c] for c in s] # encoder: take a string, output a list of integers

def decode(l):
    return ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string
```

Much more advanced tokenization schemes could be explored here, but since this was an experiment in learning to pretrain an LLM from scratch, things were kept really basic.

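As a rough illustration of the character-level scheme described above, the sketch below builds a vocabulary from a short sample string and round-trips it through `encode` and `decode`. The sample `text` and the `sorted(set(text))` construction of `chars` are assumptions made for this example; the real vocabulary would be built from the full training corpus.

```python
# Minimal sketch of a character-level tokenizer (the sample text is illustrative only)
text = "धृतराष्ट्र उवाच"

# Assumption: the vocabulary is the sorted set of unique characters in the text
chars = sorted(set(text))
vocab_size = len(chars)

stoi = {ch: i for i, ch in enumerate(chars)}  # character -> integer id
itos = {i: ch for i, ch in enumerate(chars)}  # integer id -> character

def encode(s):
    """Turn a string into a list of integer ids, one per Unicode code point."""
    return [stoi[c] for c in s]

def decode(ids):
    """Turn a list of integer ids back into a string."""
    return "".join(itos[i] for i in ids)

ids = encode(text)
print(vocab_size, len(ids))
print(decode(ids) == text)
```

Because the mapping works on individual code points, a Devanagari conjunct such as `ष्ट्र` is split into several tokens (consonants and viramas) rather than being treated as one unit.
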
#### Training

Some of the basic parameters used for training the model were as follows:

```
n_layer = 6
n_head = 6
n_embd = 384
dropout = 0.2
learning_rate = 1e-3
```
Training was limited to 5000 iterations for this model, on a single T4 GPU.
The training time was around 20 minutes.

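The 10.65M figure can be sanity-checked against the configuration above. The sketch below is a back-of-the-envelope count that assumes a nanoGPT-style GPT-2 architecture: no bias terms, a 4x MLP expansion, output weights tied to the token embedding, and position embeddings left out of the reported total. None of these details are stated in this card, and `vocab_size = 65` is taken from the vocabulary section.

```python
# Rough parameter count for the configuration above.
# Assumptions (not confirmed by this card): nanoGPT-style blocks, no biases,
# tied input/output embeddings, position embeddings excluded from the total.
n_layer = 6
n_embd = 384
vocab_size = 65

def count_params(n_layer, n_embd, vocab_size):
    attn = n_embd * 3 * n_embd + n_embd * n_embd     # qkv projection + output projection
    mlp = n_embd * 4 * n_embd + 4 * n_embd * n_embd  # up-projection + down-projection
    norms = 2 * n_embd                               # two LayerNorm weight vectors per block
    per_block = attn + mlp + norms
    final_norm = n_embd                              # LayerNorm before the output head
    token_embedding = vocab_size * n_embd            # shared with the output head
    return n_layer * per_block + final_norm + token_embedding

total = count_params(n_layer, n_embd, vocab_size)
print(f"{total:,} parameters (~{total / 1e6:.2f}M)")
```

Under these assumptions the count comes to 10,646,784, which is consistent with the reported 10.65M. Note that `n_head` does not affect the total, since the heads only partition the same projection matrices.
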
#### Inference

The model can be used for inference by following the steps below:

**Load the model**

```
import torch
# GPT and GPTConfig are assumed to come from a nanoGPT-style model.py; `device` is e.g. 'cuda' or 'cpu'
from model import GPTConfig, GPT

# init from a model saved in a specific directory
checkpoint = torch.load('./out/ckpt.pt', map_location=device)
gptconf = GPTConfig(**checkpoint['model_args'])
model = GPT(gptconf)
state_dict = checkpoint['model']
# strip the prefix that torch.compile adds to parameter names
unwanted_prefix = '_orig_mod.'
for k,v in list(state_dict.items()):
    if k.startswith(unwanted_prefix):
        state_dict[k[len(unwanted_prefix):]] = state_dict.pop(k)
model.load_state_dict(state_dict)
```

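The key-renaming loop above is needed because checkpoints saved from a model wrapped by `torch.compile` store parameter names with an `_orig_mod.` prefix. The sketch below shows the same renaming on a plain dictionary so that it runs without PyTorch; the key names and values are made up purely for illustration.

```python
# Illustrative stand-in for a checkpoint state dict (keys and values are made up)
state_dict = {
    "_orig_mod.transformer.wte.weight": "tensor A",
    "_orig_mod.lm_head.weight": "tensor B",
    "transformer.ln_f.weight": "tensor C",  # already clean, left untouched
}

unwanted_prefix = "_orig_mod."
# Iterate over a snapshot of the keys, since the dict is modified inside the loop
for k in list(state_dict.keys()):
    if k.startswith(unwanted_prefix):
        state_dict[k[len(unwanted_prefix):]] = state_dict.pop(k)

print(sorted(state_dict))
```

After the loop, every key matches the names an uncompiled model expects, so `load_state_dict` can find them.
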
**Sample generation!**

```
# `ctx` is assumed to be a context manager defined elsewhere, e.g. contextlib.nullcontext() or an autocast context
start_ids = encode(start)
m_tensor = (torch.tensor(start_ids, dtype=torch.long, device=device)[None, ...])
with torch.no_grad():
    with ctx:
        for k in range(num_samples):
            y = model.generate(m_tensor, max_new_tokens, temperature=temperature, top_k=top_k)
            print(decode(y[0].tolist()))
            print('---------------')
```

- `num_samples`, `top_k`, `max_new_tokens`, `temperature`, etc. are all sampling hyperparameters.
- `start` is the beginning text, or prompt, to provide for text generation.

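To make the effect of `temperature` and `top_k` more concrete, the following pure-Python sketch applies both to a small list of made-up logits. It illustrates the general technique (scale the logits by the temperature, keep the `top_k` largest, then apply softmax); the function name `sampling_probs` is hypothetical, and this is not the model's actual `generate` implementation.

```python
import math

def sampling_probs(logits, temperature=1.0, top_k=None):
    """Return next-token probabilities after temperature scaling and top-k filtering."""
    scaled = [x / temperature for x in logits]
    if top_k is not None:
        # Keep only logits at or above the k-th largest value; mask out the rest
        kth = sorted(scaled, reverse=True)[min(top_k, len(scaled)) - 1]
        scaled = [x if x >= kth else -math.inf for x in scaled]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a 4-token vocabulary
print(sampling_probs(logits, temperature=1.0, top_k=2))
print(sampling_probs(logits, temperature=0.5, top_k=2))
```

Lower temperatures sharpen the distribution toward the most likely token, while a smaller `top_k` restricts sampling to fewer candidates.
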
#### Samples

- **Start:** "कृष्णद्वैपायन परॊक्ताः सुपुण्या विविधाः कथाः "
- **Generation:**
"कृष्णद्वैपायन परॊक्ताः सुपुण्या विविधाः कथाः पुरुषादकः
स गतान पराहुर उदायात
दुर्यॊधन्य आयुष्यतीति
उवाच
अश्नव अपि चित्राङ्गदॊ दिशस तव
धृतराष्ट्रम अभिमन्त्रिणं सर्व एव च
एकचक्राद्य वधॊ दिव्याभिवाद्य च
परिक्षीविद एवं शूराः स चेहं बलम

मुक्तार्थायं तथा यज्ञैर अग्रतः कुलवर्धनः
परत्याच चकुरूणाम इति शरुत्वा वसु वहॊ ऽबरवीत

कामायाः स च नामधेया भभूमिः सुताः
परतिषेत्य उक्त्वा सॊ ऽनवादितः सह

अथ तत्र तथा पुष्पाणा बहुभिर भयात
स भार्या महाबाहुर बरह्मणावन अभिषेचय

कार्ये तथान्ये महीपालाः संविदाः"

- **English Translation:**
"
There are many pious stories told by Krishna Dvaipayana, *the man-eater*.

He said, "Duryodhana will live. Ashnav also shows you the picturesque Dhritarashtra, the minister, and all."

"One-wheeled killing and divine salutation. The testers and the heroes are that, and I am the strength."

"He is thus the enhancer of the family by sacrifices for the sake of liberation. Hearing that he had returned to the Chakurus, Vasu Vaho 'barvita."

"The sons of Kama, whose name was Bhabhumi. He turned around and said he was unplayed with."

"Then there were many flowers there. He consecrated his wife, the mighty-armed, as a Brahmin."

"At work and other governors contracts."
"