Update README.md
README.md
CHANGED
@@ -1,3 +1,54 @@
---
license: apache-2.0
pipeline_tag: text-generation
---

# Bigram Language Model

## Overview

This repository contains a simple bigram language model implemented in PyTorch. The model is trained to predict the next character in a sequence given only the current character. It is a character-level model and can be used for tasks such as text generation.

## Model Details

- **Model Type**: Character-level language model
- **Architecture**: Simple lookup table over character bigrams (see the sketch below)
- **Training Data**: [XL-Sum, Bengali subset](https://huggingface.co/datasets/csebuetnlp/xlsum/viewer/bengali)
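
The `model.py` file itself is not reproduced in this README, so the class internals below are an assumption: a minimal sketch of the standard embedding-table bigram design that is consistent with the loading and generation code later in this document.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

class BigramLanguageModel(nn.Module):
    """Bigram model: a single embedding table that maps each token id
    directly to the logits of the next token."""

    def __init__(self, vocab_size):
        super().__init__()
        # Row i holds the next-token logits for current token i.
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.token_embedding_table(idx)  # (B, T, vocab_size)
        if targets is None:
            return logits, None
        B, T, C = logits.shape
        loss = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))
        return logits, loss

    @torch.no_grad()
    def generate(self, idx, max_new_tokens):
        # Sample one token at a time from the next-token distribution.
        for _ in range(max_new_tokens):
            logits, _ = self(idx)
            probs = F.softmax(logits[:, -1, :], dim=-1)  # last position only
            idx_next = torch.multinomial(probs, num_samples=1)
            idx = torch.cat((idx, idx_next), dim=1)
        return idx
```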

## Requirements

- Python 3.x
- PyTorch
- `json` (Python standard library, for loading the tokenizer mappings)
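
Since `json` ships with the interpreter, PyTorch is the only package to install. Assuming a standard pip setup:

```bash
pip install torch
```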

## Installation

First, clone this repository:
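
The exact repository URL is not given in the original text, so the path below is a placeholder; substitute this model's actual Hugging Face repo path:

```bash
# Placeholder URL: replace <user>/<repo> with the actual repo path.
git clone https://huggingface.co/<user>/<repo>
cd <repo>
```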

## Loading the Model

To load the model, you need to initialize it with the vocabulary size and load the pre-trained weights:

```python
import json

import torch

from model import BigramLanguageModel

# Initialize the model with the same vocabulary size used during training.
vocab_size = 225
model = BigramLanguageModel(vocab_size)

# Load the pre-trained weights; map_location makes this work without a GPU.
model.load_state_dict(torch.load('path_to_your_model.pth', map_location=torch.device('cpu')))
model.eval()

# Load the character <-> index mappings saved alongside the model.
with open('tokenizer_mappings.json', 'r', encoding='utf-8') as f:
    mappings = json.load(f)
stoi = mappings['stoi']
# JSON object keys are always strings, so convert them back to ints
# before indexing with integer token ids.
itos = {int(k): v for k, v in mappings['itos'].items()}

# Encode a string into token ids and decode ids back into text.
encode = lambda s: [stoi[c] for c in s]
decode = lambda l: ''.join([itos[i] for i in l])

# Example usage: every character in the prompt must exist in the
# training vocabulary (stoi), or encode() will raise a KeyError.
context = torch.tensor([encode("Your initial text")], dtype=torch.long)
generated_text_indices = model.generate(context, max_new_tokens=100)
print(decode(generated_text_indices[0].tolist()))
```
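
For reference, `tokenizer_mappings.json` is assumed to look roughly like the snippet below (the actual entries depend on the training vocabulary; note that JSON forces the `itos` keys to be strings, which is why the loading code above converts them back to integers):

```json
{
  "stoi": {"a": 0, "b": 1, "c": 2},
  "itos": {"0": "a", "1": "b", "2": "c"}
}
```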