nielsr (HF staff) committed
Commit d7f9c48
1 parent: c2e74c6

Update README.md

Files changed (1): README.md (+28, -0)
README.md CHANGED
@@ -56,6 +56,34 @@ output = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output[0], skip_special_tokens=True))
```

+ ### Model optimization
+
+ #### 4-bit quantization through the `bitsandbytes` library
+
+ First make sure to install `bitsandbytes` (`pip install bitsandbytes`) and that you have access to a CUDA-compatible GPU device. Then simply change the snippet above as follows:
+
+ ```diff
+ model = LlavaNextForConditionalGeneration.from_pretrained(
+     model_id,
+     torch_dtype=torch.float16,
+     low_cpu_mem_usage=True,
+ +   load_in_4bit=True
+ )
+ ```
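+
+ For reference, the complete loading call with the 4-bit change applied might look roughly like this (a minimal sketch; the checkpoint name below is only an example, substitute the `model_id` from the snippet above):
+
+ ```python
+ import torch
+ from transformers import LlavaNextForConditionalGeneration
+
+ # Example checkpoint; substitute the model_id used in the snippet above.
+ model_id = "llava-hf/llava-v1.6-mistral-7b-hf"
+
+ # load_in_4bit=True quantizes the weights to 4 bits via bitsandbytes at load
+ # time; the quantized model is placed on the available GPU automatically,
+ # so no explicit .to(0) call is needed (or allowed) afterwards.
+ model = LlavaNextForConditionalGeneration.from_pretrained(
+     model_id,
+     torch_dtype=torch.float16,
+     low_cpu_mem_usage=True,
+     load_in_4bit=True,
+ )
+ ```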
+
+ #### Use Flash-Attention 2 to further speed up generation
+
+ First make sure to install `flash-attn`; refer to the [original repository of Flash Attention](https://github.com/Dao-AILab/flash-attention) for installation instructions. Then simply change the snippet above as follows:
+
+ ```diff
+ model = LlavaNextForConditionalGeneration.from_pretrained(
+     model_id,
+     torch_dtype=torch.float16,
+     low_cpu_mem_usage=True,
+ +   use_flash_attention_2=True
+ ).to(0)
+ ```
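+
+ Similarly, the complete call with Flash Attention 2 enabled might look like this (again a sketch with an example checkpoint name; Flash Attention 2 only runs on a CUDA device with the weights in half precision, hence the `torch.float16` dtype and the explicit move to the GPU):
+
+ ```python
+ import torch
+ from transformers import LlavaNextForConditionalGeneration
+
+ # Example checkpoint; substitute the model_id used in the snippet above.
+ model_id = "llava-hf/llava-v1.6-mistral-7b-hf"
+
+ # Flash Attention 2 requires a CUDA device and fp16/bf16 weights,
+ # so the model is loaded in float16 and moved to GPU 0 explicitly.
+ model = LlavaNextForConditionalGeneration.from_pretrained(
+     model_id,
+     torch_dtype=torch.float16,
+     low_cpu_mem_usage=True,
+     use_flash_attention_2=True,
+ ).to(0)
+ ```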
+
  ### BibTeX entry and citation info
 
  ```bibtex