Commit 0524afe committed by nielsr (HF staff)
Parent: f32b62a

Update README.md

Files changed (1): README.md (+28 -0)

README.md CHANGED
@@ -55,6 +55,34 @@ output = model.generate(**inputs, max_new_tokens=100)
 print(processor.decode(output[0], skip_special_tokens=True))
 ```
 
+### Model optimization
+
+#### 4-bit quantization through the `bitsandbytes` library
+
+First make sure to install `bitsandbytes` (`pip install bitsandbytes`) and that you have access to a CUDA-compatible GPU. Then change the snippet above as follows:
+
+```diff
+model = LlavaNextForConditionalGeneration.from_pretrained(
+    model_id,
+    torch_dtype=torch.float16,
+    low_cpu_mem_usage=True,
++   load_in_4bit=True
+)
+```
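
For reference, the complete loading call after this change might look like the sketch below. The checkpoint assigned to `model_id` is an assumption here; substitute the one used earlier in this README. Note that a model quantized with `bitsandbytes` should not be moved or cast with `.to()` afterwards.

```python
import torch
from transformers import LlavaNextForConditionalGeneration

# Assumed checkpoint; replace with the `model_id` defined earlier in the README.
model_id = "llava-hf/llava-v1.6-mistral-7b-hf"

model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    load_in_4bit=True,  # quantize the weights to 4-bit via bitsandbytes
)
```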
+
+#### Use Flash Attention 2 to further speed up generation
+
+First make sure to install `flash-attn`; refer to the [original repository of Flash Attention](https://github.com/Dao-AILab/flash-attention) for installation instructions. Then change the snippet above as follows:
+
+```diff
+model = LlavaNextForConditionalGeneration.from_pretrained(
+    model_id,
+    torch_dtype=torch.float16,
+    low_cpu_mem_usage=True,
++   use_flash_attention_2=True
+).to(0)
+```
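
Again for reference, a complete call with Flash Attention 2 enabled might look like the sketch below, under the same assumed `model_id`. Newer `transformers` releases express the same option as `attn_implementation="flash_attention_2"`.

```python
import torch
from transformers import LlavaNextForConditionalGeneration

# Assumed checkpoint; replace with the `model_id` defined earlier in the README.
model_id = "llava-hf/llava-v1.6-mistral-7b-hf"

model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,   # Flash Attention 2 requires fp16 or bf16 weights
    low_cpu_mem_usage=True,
    use_flash_attention_2=True,  # requires the flash-attn package and an Ampere or newer GPU
).to(0)  # move the model to the first CUDA device
```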
+
 ### BibTeX entry and citation info
 
 ```bibtex