File size: 1,216 Bytes
4491db9
 
6134d2b
4491db9
 
 
 
 
8f91399
4491db9
6134d2b
e734db6
4491db9
 
26afe6e
4491db9
26afe6e
692d5a8
 
26afe6e
 
 
692d5a8
dd0445f
26afe6e
692d5a8
26afe6e
 
 
 
692d5a8
67354a2
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
---
language:
- bn
license: apache-2.0
tags:
- transformers
- llama
- trl
- sft
base_model: unsloth/llama-3-8b-bnb-4bit
library_name: transformers
pipeline_tag: question-answering
---

How to use it:

# Use a pipeline as a high-level helper

```python
from transformers import pipeline

pipe = pipeline("question-answering", model="asif00/bangla-llama-4bit")
```

# Load model directly
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("asif00/bangla-llama-4bit")
model = AutoModelForCausalLM.from_pretrained("asif00/bangla-llama-4bit")
```


# To get a cleaned up version of the response, you can use:

```python
def generate_response(question, context):
    inputs = tokenizer([
        prompt.format(
            question,
            context, 
            ""
        )
    ], return_tensors="pt").to("cuda")

    outputs = model.generate(**inputs, max_new_tokens=1024, use_cache=True)
    responses = tokenizer.batch_decode(outputs, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
    response_start = responses.find("### Response:") + len("### Response:")
    response = responses[response_start:].strip()
    return response
```