---
license: cc-by-4.0
language:
- sa
- en
tags:
- language
- sanskrit
- gpt
- llm
co2_eq_emissions:
  emissions: 0.443
datasets:
- keshi87/mahabharat.txt
---

### nano-MahaGPT

A small pretrained **10.65M**-parameter model trained from scratch on the raw text of the original Mahabharata, one of the longest epics in the world.


#### Model Details

The vocabulary of the model is composed of the following character set:

```
 अआइईउऊऋएऐओऔकखगघङचछजझञटठडढणतथदधनपफबभमयरलवशषसहऽािीुूृॄेैॊौ्ॢ

```
That's 65 characters. 
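In a nanoGPT-style character-level setup, this set would typically be derived directly from the corpus rather than hard-coded (a minimal sketch; the tiny `text` sample here stands in for the full `mahabharat.txt` file):

```python
# Derive the character vocabulary from raw text. The real script would
# read the full mahabharat.txt corpus; a tiny sample is used here.
text = "धर्मक्षेत्रे कुरुक्षेत्रे समवेता युयुत्सवः"
chars = sorted(set(text))  # unique characters, in a stable sorted order
vocab_size = len(chars)    # 65 for the full corpus
print(vocab_size)
```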

#### Tokenizer

A very basic tokenizer was used for training: each character is simply mapped to its integer index in the vocabulary, with no merging of multi-character units (unlike BPE-style subword tokenizers).

```
# `chars` is the sorted list of unique characters in the corpus
stoi = { ch:i for i,ch in enumerate(chars) }  # character -> integer
itos = { i:ch for i,ch in enumerate(chars) }  # integer -> character
def encode(s):
    return [stoi[c] for c in s] # encoder: take a string, output a list of integers
def decode(l):
    return ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string
```
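As a quick sanity check, `encode` and `decode` should round-trip any string over the vocabulary (a self-contained example using a toy character set in place of the model's actual 65-character one):

```python
# Toy vocabulary standing in for the model's 65-character set.
chars = sorted(set("कुरुक्षेत्रे"))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for i, ch in enumerate(chars)}

def encode(s):
    return [stoi[c] for c in s]

def decode(l):
    return ''.join(itos[i] for i in l)

s = "कुरु"
assert decode(encode(s)) == s  # lossless round-trip
```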

Much more advanced tokenization schemes could of course be explored here, but since this was an experiment in learning LLM pretraining from scratch, things were kept deliberately basic.

#### Training 

The basic hyperparameters used to train the model were as follows:

```
n_layer = 6
n_head = 6
n_embd = 384
dropout = 0.2
learning_rate = 1e-3
```
Training was limited to 5,000 iterations on a single T4 GPU and took around 20 minutes.
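As a sanity check, these dimensions roughly account for the reported 10.65M parameters (a back-of-the-envelope estimate assuming a GPT-2-style block with a 4x MLP expansion and a tied token embedding; biases, layernorms, and position embeddings are ignored):

```python
n_layer, n_embd = 6, 384
vocab_size = 65  # from the character set above

attn = 4 * n_embd * n_embd       # qkv projection (3x) + output projection
mlp = 2 * (n_embd * 4 * n_embd)  # two linear layers with 4x expansion
per_layer = attn + mlp

total = n_layer * per_layer + vocab_size * n_embd  # blocks + tied token embedding
print(f"{total / 1e6:.2f}M")  # → 10.64M, close to the reported 10.65M
```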

#### Inference

The model can be used for inference by following the steps below:

**Load the model**

```
import torch
from model import GPT, GPTConfig  # model.py from the nanoGPT repository

device = 'cuda' if torch.cuda.is_available() else 'cpu'

# init from a checkpoint saved in a specific directory
checkpoint = torch.load('./out/ckpt.pt', map_location=device)
gptconf = GPTConfig(**checkpoint['model_args'])
model = GPT(gptconf)
state_dict = checkpoint['model']
# strip the prefix that torch.compile adds to parameter names
unwanted_prefix = '_orig_mod.'
for k, v in list(state_dict.items()):
    if k.startswith(unwanted_prefix):
        state_dict[k[len(unwanted_prefix):]] = state_dict.pop(k)
model.load_state_dict(state_dict)
model.eval()
model.to(device)
```
**Sample generation!** 

```
from contextlib import nullcontext

ctx = nullcontext()  # or a torch.autocast context for mixed-precision inference
start_ids = encode(start)
m_tensor = torch.tensor(start_ids, dtype=torch.long, device=device)[None, ...]  # add batch dim
with torch.no_grad():
    with ctx:
        for k in range(num_samples):
            y = model.generate(m_tensor, max_new_tokens, temperature=temperature, top_k=top_k)
            print(decode(y[0].tolist()))
            print('---------------')
```

- `num_samples`, `top_k`, `max_new_tokens`, and `temperature` are all sampling hyperparameters.
- `start` is the beginning text or prompt we'd want to provide for text generation.

#### Samples 

- **Start:** "कृष्णद्वैपायन परॊक्ताः सुपुण्या विविधाः कथाः "
- **Generation:** 
  "कृष्णद्वैपायन परॊक्ताः सुपुण्या विविधाः कथाः पुरुषादकः
  स गतान पराहुर उदायात
     दुर्यॊधन्य आयुष्यतीति
  उवाच
  अश्नव अपि चित्राङ्गदॊ दिशस तव
     धृतराष्ट्रम अभिमन्त्रिणं सर्व एव च
  एकचक्राद्य वधॊ दिव्याभिवाद्य च
     परिक्षीविद एवं शूराः स चेहं बलम
  
  मुक्तार्थायं तथा यज्ञैर अग्रतः कुलवर्धनः
      परत्याच चकुरूणाम इति शरुत्वा वसु वहॊ ऽबरवीत
  
  कामायाः स च नामधेया भभूमिः सुताः
      परतिषेत्य उक्त्वा सॊ ऽनवादितः सह
  
  अथ तत्र तथा पुष्पाणा बहुभिर भयात
      स भार्या महाबाहुर बरह्मणावन अभिषेचय
  
  कार्ये तथान्ये महीपालाः संविदाः" 

- **English Translation:** 
  "
    There are many pious stories told by Krishna Dvaipayana, *the man-eater*.
    
    He said, "Duryodhana will live. Ashnav also shows you the picturesque Dhritarashtra, the minister, and all."
    
    "One-wheeled killing and divine salutation. The testers and the heroes are that, and I am the strength."
    
    "He is thus the enhancer of the family by sacrifices for the sake of liberation. Hearing that he had returned to the Chakurus, Vasu Vaho 'barvita."
    
    "The sons of Kama, whose name was Bhabhumi. He turned around and said he was unplayed with."
    
    "Then there were many flowers there. He consecrated his wife, the mighty-armed, as a Brahmin."
    
    "At work and other governors contracts."
  "