SanathNarayan committed on
Commit e93fa8a · verified · 1 Parent(s): e611c11

Update README.md

Files changed (1)
  1. README.md +3 -4
README.md CHANGED
@@ -15,7 +15,6 @@ For enhancing the VLM's perception of fine-grained details w.r.t small objects i
 
  🤗 To get started with Falcon-vlm (inference, finetuning, quantization, etc.), we recommend reading [this great blogpost from HF](https://huggingface.co/blog/falcon)!
 
- ⚠️ **This is a raw, pretrained model, which should be further finetuned for most usecases.**
 
  ```python
  from transformers import LlavaNextForConditionalGeneration, LlavaNextProcessor
@@ -49,7 +48,7 @@ print(generated_captions)
 
  For fast inference with Falcon, check-out [Text Generation Inference](https://github.com/huggingface/text-generation-inference)! Read more in this [blogpost](https://huggingface.co/blog/falcon).
 
- # Model Card for Falcon2-11B
+ # Model Card for Falcon2-11B-VLM
 
  ## Model Details
 
@@ -76,13 +75,13 @@ Production use without adequate assessment of risks and mitigation; any use case
 
  ## Bias, Risks, and Limitations
 
- Falcon2-11B-vlm is trained mostly on English, but also German, Spanish, French, Italian, Portuguese, Polish, Dutch, Romanian, Czech, Swedish. It will not generalize appropriately to other languages. Furthermore, as it is trained on a large-scale corpora representative of the web, it will carry the stereotypes and biases commonly encountered online.
+ Falcon2-11B is trained mostly on English, but also German, Spanish, French, Italian, Portuguese, Polish, Dutch, Romanian, Czech, Swedish. It will not generalize appropriately to other languages. Furthermore, as it is trained on a large-scale corpora representative of the web, it will carry the stereotypes and biases commonly encountered online.
 
  ## Training Details
 
  The training is done in two stages: pretraining and finetuning. In both stages, the visual encoder weights are kept frozen. In the pretraining stage, the LLM is kept frozen and only the multimodal projector is trained on 558K image-caption pairs.
  This enables the multimodal projector to learn a mapping from visual to text embedding space. During finetuning, both the projector and LLM weights are trained on a corpus of 1.2M image-text instruction data from public datasets, which also includes multi-round conversations.
- Falcon2-11B- was trained on 16 A100 80GB GPUs with ZeRO and Flash-Attention 2.
+ Falcon2-11B-VLM was trained on 16 A100 80GB GPUs with ZeRO and Flash-Attention 2.
 
  The data was tokenized with the Falcon-[7B](https://huggingface.co/tiiuae/falcon-7b)/[11B](https://huggingface.co/tiiuae/falcon-11B) tokenizer.
 