keshi87 committed on
Commit
19c9145
1 Parent(s): 5699c1c

Updated Readme. (#1)


- Updated Readme. (8839f077ccabe4c38472a01950f2080c872c0abb)

Files changed (1)
  1. README.md +124 -1
README.md CHANGED
@@ -12,4 +12,127 @@ co2_eq_emissions:
  emissions: 0.443
  datasets:
  - keshi87/mahabharat.txt
- ---
+ ---
+
+ ### Sanskrit GPT
+
+ A small pretrained **10.65M**-parameter model trained on the raw text of the original Mahabharata, one of the longest epics in the world.
+
+ #### Model Details
+
+ The model's vocabulary is composed of the following character set:
+
+ ```
+ अआइईउऊऋएऐओऔकखगघङचछजझञटठडढणतथदधनपफबभमयरलवशषसहऽािीुूृॄेैॊौ्ॢ
+ ```
+
+ That's 65 characters.
+
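+ The character set can be recovered directly from the corpus. A minimal sketch (not part of the original training code; the local file name `mahabharat.txt` is assumed from the dataset card):
+
+ ```
+ # Build the character-level vocabulary from the raw corpus.
+ # 'mahabharat.txt' is an assumed local copy of the keshi87/mahabharat.txt dataset.
+ with open('mahabharat.txt', 'r', encoding='utf-8') as f:
+     text = f.read()
+
+ chars = sorted(set(text))   # unique characters, sorted for a stable ordering
+ vocab_size = len(chars)     # reported as 65 for this corpus
+ print(''.join(chars), vocab_size)
+ ```
+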
+ #### Tokenizer
+
+ A very basic character-level tokenizer was used for training. It simply assigns each character an integer id, without
+ grouping multiple characters into a single token.
+
+ ```
+ # create a mapping from characters to integers
+ stoi = { ch:i for i,ch in enumerate(chars) }
+ itos = { i:ch for i,ch in enumerate(chars) }
+ def encode(s):
+     return [stoi[c] for c in s]            # encoder: take a string, output a list of integers
+ def decode(l):
+     return ''.join([itos[i] for i in l])   # decoder: take a list of integers, output a string
+ ```
+
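+ As a quick sanity check, `encode` and `decode` are exact inverses at the character level (an illustrative snippet, assuming `chars` was built from the corpus as above; the example string is arbitrary):
+
+ ```
+ # Round-trip through the character-level tokenizer.
+ ids = encode('धर्मक्षेत्रे')   # one integer id per character
+ assert decode(ids) == 'धर्मक्षेत्रे'
+ ```
+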
+ Much more advanced tokenization schemes could be explored here, but since this was an experiment in learning LLM pretraining
+ from scratch, things were kept deliberately basic.
+
+ #### Training
+
+ The main hyperparameters used for training the model were as follows:
+
+ ```
+ n_layer = 6
+ n_head = 6
+ n_embd = 384
+ dropout = 0.2
+ learning_rate = 1e-3
+ ```
+
+ Training was limited to 5000 epochs for this model in a single T4 GPU environment, and the training time was around 20 minutes.
+
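+ For reference, the reported **10.65M** parameter count is roughly what a standard GPT of this shape gives. A back-of-the-envelope check (assuming a GPT-2-style block with a 4*n_embd MLP, a 65-entry token embedding tied to the output head, and position embeddings excluded from the count):
+
+ ```
+ n_embd, n_layer, vocab_size = 384, 6, 65
+ per_block = 12 * n_embd ** 2      # attention (4*n_embd^2) + MLP (8*n_embd^2) weights
+ token_emb = vocab_size * n_embd   # embedding table, tied with the output head
+ total = n_layer * per_block + token_emb
+ print(total)                      # 10,641,792 -> ~10.65M once layer norms/biases are added
+ ```
+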
+ #### Inferencing
+
+ The model can be used for inference by following the steps below:
+
+ **Load the model**
+
+ ```
+ import torch
+ # GPTConfig and GPT come from the model definition used for training (not shown here);
+ # `device` is assumed to be defined, e.g. 'cuda' or 'cpu'.
+
+ # init from a model saved in a specific directory
+ checkpoint = torch.load('./out/ckpt.pt', map_location=device)
+ gptconf = GPTConfig(**checkpoint['model_args'])
+ model = GPT(gptconf)
+ state_dict = checkpoint['model']
+ # checkpoints saved from a compiled model carry an '_orig_mod.' key prefix; strip it
+ unwanted_prefix = '_orig_mod.'
+ for k,v in list(state_dict.items()):
+     if k.startswith(unwanted_prefix):
+         state_dict[k[len(unwanted_prefix):]] = state_dict.pop(k)
+ model.load_state_dict(state_dict)
+ ```
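+
+ Before sampling, the model should be switched to eval mode and moved to the target device, and the `ctx` used in the generation snippet below needs to be defined. A minimal sketch (an assumed setup, not taken from the original repo; autocast on GPU, a no-op context on CPU):
+
+ ```
+ from contextlib import nullcontext
+
+ model.eval()
+ model.to(device)
+
+ # `ctx` for the sampling loop: mixed-precision autocast on CUDA, no-op on CPU.
+ ctx = nullcontext() if device == 'cpu' else torch.amp.autocast(device_type='cuda', dtype=torch.float16)
+ ```
+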
+ **Sample generation!**
+
+ ```
+ start_ids = encode(start)
+ m_tensor = (torch.tensor(start_ids, dtype=torch.long, device=device)[None, ...])
+ with torch.no_grad():
+     with ctx:
+         for k in range(num_samples):
+             y = model.generate(m_tensor, max_new_tokens, temperature=temperature, top_k=top_k)
+             print(decode(y[0].tolist()))
+             print('---------------')
+ ```
+
+ - `num_samples`, `top_k`, `max_new_tokens`, `temperature`, etc. are generation hyperparameters; illustrative values are sketched below.
+ - `start` is the beginning text or prompt we'd want to provide for text generation.
+
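+ A possible set of values (assumptions for illustration only, not the settings used to produce the samples below):
+
+ ```
+ start = 'कृष्णद्वैपायन परॊक्ताः'   # prompt to condition the generation on
+ num_samples = 3                      # number of independent completions to print
+ max_new_tokens = 500                 # characters to generate per completion
+ temperature = 0.8                    # < 1.0 makes sampling more conservative
+ top_k = 50                           # sample only from the 50 most likely characters
+ ```
+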
+ #### Samples
+
+ - **Start:** "कृष्णद्वैपायन परॊक्ताः सुपुण्या विविधाः कथाः "
+ **Generation:**
+ "कृष्णद्वैपायन परॊक्ताः सुपुण्या विविधाः कथाः पुरुषादकः
+ स गतान पराहुर उदायात
+ दुर्यॊधन्य आयुष्यतीति
+ उवाच
+ अश्नव अपि चित्राङ्गदॊ दिशस तव
+ धृतराष्ट्रम अभिमन्त्रिणं सर्व एव च
+ एकचक्राद्य वधॊ दिव्याभिवाद्य च
+ परिक्षीविद एवं शूराः स चेहं बलम
+
+ मुक्तार्थायं तथा यज्ञैर अग्रतः कुलवर्धनः
+ परत्याच चकुरूणाम इति शरुत्वा वसु वहॊ ऽबरवीत
+
+ कामायाः स च नामधेया भभूमिः सुताः
+ परतिषेत्य उक्त्वा सॊ ऽनवादितः सह
+
+ अथ तत्र तथा पुष्पाणा बहुभिर भयात
+ स भार्या महाबाहुर बरह्मणावन अभिषेचय
+
+ कार्ये तथान्ये महीपालाः संविदाः"
+ **English Translation:**
+ "
+ There are many pious stories told by Krishna Dvaipayana, the man-eater.
+
+ He said, "Duryodhana will live. Ashnav also shows you the picturesque Dhritarashtra, the minister, and all."
+
+ "One-wheeled killing and divine salutation. The testers and the heroes are that, and I am the strength."
+
+ "He is thus the enhancer of the family by sacrifices for the sake of liberation. Hearing that he had returned to the Chakurus, Vasu Vaho 'barvita."
+
+ "The sons of Kama, whose name was Bhabhumi. He turned around and said he was unplayed with."
+
+ "Then there were many flowers there. He consecrated his wife, the mighty-armed, as a Brahmin."
+
+ "At work and other governors contracts."
+ "
+