Text Generation
Transformers
Safetensors
English
falcon_mamba
Inference Endpoints
4-bit precision
bitsandbytes
ybelkada committed on
Commit 120bb9e · verified · 1 Parent(s): 02b1347

Update README.md

Files changed (1)
  1. README.md +8 -77
README.md CHANGED
@@ -46,7 +46,7 @@ Find below some example scripts on how to use the model in `transformers` (Make
46
 
47
  ## Using the PyTorch model
48
 
49
- ### Running the model on a CPU
50
 
51
  <details>
52
  <summary> Click to expand </summary>
@@ -54,8 +54,8 @@ Find below some example scripts on how to use the model in `transformers` (Make
54
  ```python
55
  from transformers import AutoTokenizer, AutoModelForCausalLM
56
 
57
- tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-mamba-7b")
58
- model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-mamba-7b")
59
 
60
  input_text = "Question: How many hours in one day? Answer: "
61
  input_ids = tokenizer(input_text, return_tensors="pt").input_ids
@@ -66,89 +66,21 @@ print(tokenizer.decode(outputs[0]))
66
 
67
  </details>
68
 
69
- ### Running the model on a GPU
70
 
71
- <details>
72
- <summary> Click to expand </summary>
73
-
74
- ```python
75
- # pip install accelerate
76
- from transformers import AutoTokenizer, AutoModelForCausalLM
77
-
78
- tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-mamba-7b")
79
- model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-mamba-7b", device_map="auto")
80
-
81
- input_text = "Question: How many hours in one day? Answer: "
82
- input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
83
-
84
- outputs = model.generate(input_ids)
85
- print(tokenizer.decode(outputs[0]))
86
- ```
87
-
88
- </details>
89
-
90
- ### Running the model on a GPU using `torch.compile`
91
 
92
  <details>
93
  <summary> Click to expand </summary>
94
 
95
  ```python
96
- import torch
97
  from transformers import AutoTokenizer, AutoModelForCausalLM
98
 
99
- tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-mamba-7b")
100
- model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-mamba-7b", torch_dtype=torch.bfloat16).to(0)
101
-
102
- model = torch.compile(model)
103
 
104
  input_text = "Question: How many hours in one day? Answer: "
105
- input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
106
-
107
- outputs = model.generate(input_ids)
108
- print(tokenizer.decode(outputs[0]))
109
- ```
110
-
111
- </details>
112
-
113
-
114
- ### Running the model on a GPU using different precisions
115
-
116
- #### FP16
117
-
118
- <details>
119
- <summary> Click to expand </summary>
120
-
121
- ```python
122
- # pip install accelerate
123
- import torch
124
- from transformers import AutoTokenizer, AutoModelForCausalLM
125
-
126
- tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-mamba-7b")
127
- model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-mamba-7b", device_map="auto", torch_dtype=torch.float16)
128
-
129
- input_text = "Question: How many hours in one day? Answer: "
130
- input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
131
-
132
- outputs = model.generate(input_ids)
133
- print(tokenizer.decode(outputs[0]))
134
- ```
135
-
136
- </details>
137
-
138
- #### 4-bit
139
-
140
- <details>
141
- <summary> Click to expand </summary>
142
-
143
- ```python
144
- # pip install bitsandbytes accelerate
145
- from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
146
-
147
- tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-mamba-7b")
148
- model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-mamba-7b", device_map="auto", quantization_config=BitsAndBytesConfig(load_in_4bit=True))
149
-
150
- input_text = "Question: How many hours in one day? Answer: "
151
- input_ids = tokenizer(input_text, return_tensors="pt").input_ids.to("cuda")
152
 
153
  outputs = model.generate(input_ids)
154
  print(tokenizer.decode(outputs[0]))
@@ -157,7 +89,6 @@ print(tokenizer.decode(outputs[0]))
157
  </details>
158
 
159
 
160
-
161
  # Training Details
162
 
163
  ## Training Data
 
46
 
47
  ## Using the PyTorch model
48
 
49
+ This checkpoint runs only on a GPU device with `bitsandbytes` installed. See below for details on how to load it.
50
 
51
  <details>
52
  <summary> Click to expand </summary>
 
54
  ```python
55
  from transformers import AutoTokenizer, AutoModelForCausalLM
56
 
57
+ tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-mamba-7b-4bit")
58
+ model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-mamba-7b-4bit")
59
 
60
  input_text = "Question: How many hours in one day? Answer: "
61
  input_ids = tokenizer(input_text, return_tensors="pt").input_ids
 
66
 
67
  </details>
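
As a minimal sketch of the GPU and `bitsandbytes` requirement stated above, the helper below reports what is missing before a load is attempted. The names `missing_requirements` and `load_and_generate` are hypothetical and not part of this commit. Moving `input_ids` to `model.device` is an assumption about where the quantized weights end up, not something the README specifies.

```python
import importlib.util


def missing_requirements():
    """Return a list of problems; an empty list means a load can be attempted."""
    problems = []
    for package in ("torch", "transformers", "bitsandbytes"):
        if importlib.util.find_spec(package) is None:
            problems.append(f"package not installed: {package}")
    # Only query CUDA if torch itself can be imported.
    if importlib.util.find_spec("torch") is not None:
        import torch

        if not torch.cuda.is_available():
            problems.append("no CUDA GPU visible to torch")
    return problems


def load_and_generate(prompt, repo="tiiuae/falcon-mamba-7b-4bit"):
    """Load the 4-bit checkpoint and generate; downloads the weights on first use."""
    problems = missing_requirements()
    if problems:
        raise RuntimeError("cannot load 4-bit checkpoint: " + "; ".join(problems))

    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(repo)
    model = AutoModelForCausalLM.from_pretrained(repo)
    # Assumption: inputs should sit on the same device as the model weights.
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    outputs = model.generate(input_ids)
    return tokenizer.decode(outputs[0])
```

Calling `load_and_generate("Question: How many hours in one day? Answer: ")` would then either return the decoded text or raise with the list of unmet requirements, instead of failing partway through the load.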
68
 
69
+ You can also dequantize the model with the `model.dequantize()` method:
70
 
71
 
72
  <details>
73
  <summary> Click to expand </summary>
74
 
75
  ```python
 
76
  from transformers import AutoTokenizer, AutoModelForCausalLM
77
 
78
+ tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-mamba-7b-4bit")
79
+ model = AutoModelForCausalLM.from_pretrained("tiiuae/falcon-mamba-7b-4bit")
80
+ model = model.dequantize()
 
81
 
82
  input_text = "Question: How many hours in one day? Answer: "
83
+ input_ids = tokenizer(input_text, return_tensors="pt").input_ids
84
 
85
  outputs = model.generate(input_ids)
86
  print(tokenizer.decode(outputs[0]))
 
89
  </details>
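
To give a rough sense of what dequantizing costs in memory, the sketch below compares weight storage at 4 bits and at 16 bits per parameter. Both inputs are assumptions: the parameter count of about 7e9 is read off the "7b" in the model name, and 16 bits is an illustrative target precision, since the README does not state which dtype `model.dequantize()` produces. The estimate covers weights only and ignores quantization constants, activations, and framework overhead.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Memory needed for the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


# Assumption: roughly 7e9 parameters, inferred from the model name.
NUM_PARAMS = 7e9

four_bit = weight_memory_gib(NUM_PARAMS, 4)
sixteen_bit = weight_memory_gib(NUM_PARAMS, 16)

print(f"4-bit weights:  ~{four_bit:.1f} GiB")
print(f"16-bit weights: ~{sixteen_bit:.1f} GiB")
print(f"ratio: {sixteen_bit / four_bit:.0f}x")
```

Under these assumptions the dequantized weights need about four times the memory of the 4-bit checkpoint, so a GPU that fits the quantized model may not fit the dequantized one.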
90
 
91
 
 
92
  # Training Details
93
 
94
  ## Training Data