How to use the "mlp2x_gelu_Norm"?
It seems Yi-VL-6B uses a mlp2x_gelu_Norm as its multimodal projector. Does this differ from the original LLaVA mlp2x_gelu? Is the following implementation correct for loading the model?
```
Sequential(
  (0): Linear(in_features=1280, out_features=4096, bias=True)
  (1): GELU(approximate='none')
  (2): LayerNorm((4096,), eps=1e-05, elementwise_affine=False)
  (3): Linear(in_features=4096, out_features=4096, bias=True)
)
```
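For reference, one way to check what the checkpoint actually ships is to dump the `mm_projector` tensors and compare their names and shapes against the structure above. This is only a sketch: the shard filename is a guess, so point it at whichever file in the Yi-VL-6B download contains the `model.mm_projector.*` keys (use `safetensors.torch.load_file` instead if the weights are in safetensors format).

```python
import torch

# Shard name below is an assumption; use the file that actually holds the projector weights.
state = torch.load("pytorch_model-00001-of-00003.bin", map_location="cpu")
for name, tensor in state.items():
    if "mm_projector" in name:
        print(name, tuple(tensor.shape))
# Every Linear contributes a weight and a bias; a LayerNorm with
# elementwise_affine=False contributes no parameters at all.
```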
Yes, this is quite strange and seems very different from LLaVA's mlp2x_gelu.
Hopefully the Yi team can help by providing some code demos for this.
I fixed it just now.
First, add the code below to /LLaVA/llava/model/multimodal_projector/builder.py:
```python
class MLP2xGELUNorm(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.linear1 = nn.Linear(input_size, hidden_size)
        self.gelu1 = nn.GELU()
        self.linear2 = nn.Linear(hidden_size, hidden_size)
        self.gelu2 = nn.GELU()

    def forward(self, x):
        x = self.linear1(x)
        x = self.gelu1(x)
        x = self.linear2(x)
        x = self.gelu2(x)
        return x
```
Then, add the code below in build_vision_projector():

```python
if projector_type == 'mlp2x_gelu_Norm':
    return MLP2xGELUNorm(config.mm_hidden_size, config.hidden_size)
```
DONE!
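As a quick sanity check (the 1280/4096 sizes are just the ones from the `Sequential` printout at the top of the thread, not read from any config), you can instantiate the module and print it to compare against what the checkpoint expects:

```python
# Hypothetical sizes; in builder.py they come from config.mm_hidden_size / config.hidden_size.
proj = MLP2xGELUNorm(1280, 4096)
print(proj)
# Compare this layout with the mm_projector weight names/shapes in the checkpoint.
```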
But is the structure the same as the one the Yi team used?
If the structure isn't the same, the capability won't be the same either.
Thanks, I fixed it too.
I modified the system prompt and encountered another issue during inference.
Through the CLI inference provided by LLaVA, this model acts like a repeater, sometimes repeating the last token and at other times repeating an entire sentence. The 34B model does the same. Moreover, it produces a multi-round dialogue while generating a single response.
It looks very strange.
I changed the value of conv-mode to llava_v0 and modified the value of conv.system to "This is a chat between an inquisitive human and an AI assistant. Assume the role of the AI assistant. Read all the images carefully, and respond to the human's questions with informative, helpful, detailed and polite answers. 这是一个好奇的人类和一个人工智能助手之间的对话。假设你扮演这个AI助手的角色。仔细阅读所有的图像,并对人类的问题做出信息丰富、有帮助、详细的和礼貌的回答。", and it looks normal now.
Another issue I've encountered: when I test the question "Is anyone smoking?", I get numerous hallucinations. Some other setting might be incorrect. What other adjustments might I need to make?
It doesn't work for me. Can you provide the mlp_projector and CLI code?
Here is my mlp_projector:
```python
mlp_gelu_norm_match = re.match(r'^mlp(\d+)x_gelu_Norm$', projector_type)
if mlp_gelu_norm_match:
    mlp_depth = int(mlp_gelu_norm_match.group(1))
    modules = [nn.Linear(config.mm_hidden_size, config.hidden_size)]
    for _ in range(1, mlp_depth):
        modules.append(nn.LayerNorm(config.hidden_size))
        modules.append(nn.GELU())
        modules.append(nn.Linear(config.hidden_size, config.hidden_size))
        modules.append(nn.LayerNorm(config.hidden_size))
    return nn.Sequential(*modules)
```
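For `mlp2x_gelu_Norm`, the regex gives `mlp_depth = 2`, so the loop runs once and the stack comes out as Linear → LayerNorm → GELU → Linear → LayerNorm. A quick way to see it (1280/4096 are only illustrative stand-ins for `config.mm_hidden_size` / `config.hidden_size`):

```python
import torch.nn as nn

modules = [nn.Linear(1280, 4096)]
for _ in range(1, 2):  # mlp_depth = 2
    modules.append(nn.LayerNorm(4096))
    modules.append(nn.GELU())
    modules.append(nn.Linear(4096, 4096))
    modules.append(nn.LayerNorm(4096))
print(nn.Sequential(*modules))
```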
I don't use the CLI code; I use Yi-VL in two ways: eval/run_llava (another change: since the LLaVA project is bound to the string `llava` in several places, I created a soft link `llava_Yi-VL-6B` pointing to `Yi-VL-6B`) and a test on a Gradio demo:
```python
def eval_model(args):
    # Model
    disable_torch_init()
    device = "5"
    model_name = get_model_name_from_path(args.model_path)
    tokenizer, model, image_processor, context_len = load_pretrained_model(
        args.model_path, args.model_base, model_name,
        device_map=f"cuda:{device}", device=f"cuda:{device}"
    )

    qs = args.query
    image_token_se = DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_TOKEN + DEFAULT_IM_END_TOKEN
    if IMAGE_PLACEHOLDER in qs:
        if model.config.mm_use_im_start_end:
            qs = re.sub(IMAGE_PLACEHOLDER, image_token_se, qs)
        else:
            qs = re.sub(IMAGE_PLACEHOLDER, DEFAULT_IMAGE_TOKEN, qs)
    else:
        if model.config.mm_use_im_start_end:
            qs = image_token_se + "\n" + qs
        else:
            qs = DEFAULT_IMAGE_TOKEN + "\n" + qs

    if "llama-2" in model_name.lower():
        conv_mode = "llava_llama_2"
    elif "v1" in model_name.lower():
        conv_mode = "llava_v1"
    elif "mpt" in model_name.lower():
        conv_mode = "mpt"
    else:
        conv_mode = "llava_v0"

    if args.conv_mode is not None and conv_mode != args.conv_mode:
        print(
            "[WARNING] the auto inferred conversation mode is {}, while `--conv-mode` is {}, using {}".format(
                conv_mode, args.conv_mode, args.conv_mode
            )
        )
    else:
        args.conv_mode = conv_mode

    conv = conv_templates[args.conv_mode].copy()
    conv.system = "This is a chat between an inquisitive human and an AI assistant. Assume the role of the AI assistant. Read all the images carefully, and respond to the human's questions with informative, helpful, detailed and polite answers. 这是一个好奇的人类和一个人工智能助手之间的对话。假设你扮演这个AI助手的角色。仔细阅读所有的图像,并对人类的问题做出信息丰富、有帮助、详细的和礼貌的回答。"
    conv.append_message(conv.roles[0], qs)
    conv.append_message(conv.roles[1], None)
    prompt = conv.get_prompt()

    image_files = image_parser(args)
    images = load_images(image_files)
    images_tensor = process_images(
        images, image_processor, model.config
    ).to(model.device, dtype=torch.float16)

    input_ids = (
        tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt")
        .unsqueeze(0)
        .to(model.device)
    )

    stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
    keywords = [stop_str]
    stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)

    with torch.inference_mode():
        output_ids = model.generate(
            input_ids,
            images=images_tensor,
            do_sample=True if args.temperature > 0 else False,
            temperature=args.temperature,
            top_p=args.top_p,
            num_beams=args.num_beams,
            max_new_tokens=args.max_new_tokens,
            use_cache=True,
            stopping_criteria=[stopping_criteria],
        )

    input_token_len = input_ids.shape[1]
    n_diff_input_output = (input_ids != output_ids[:, :input_token_len]).sum().item()
    if n_diff_input_output > 0:
        print(
            f"[Warning] {n_diff_input_output} output_ids are not the same as the input_ids"
        )
    outputs = tokenizer.batch_decode(
        output_ids[:, input_token_len:], skip_special_tokens=True
    )[0]
    outputs = outputs.strip()
    if outputs.endswith(stop_str):
        outputs = outputs[: -len(stop_str)]
    outputs = outputs.strip()
    print(outputs)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--model-path", type=str, default="/path_of_model/llava_Yi-VL-6B")
    parser.add_argument("--model-base", type=str, default=None)
    parser.add_argument("--image-file", type=str, default="/path_of_image/llava_logo.png")
    parser.add_argument("--query", type=str, default="图中有多少大象?")
    parser.add_argument("--conv-mode", type=str, default="llava_v0")
    parser.add_argument("--sep", type=str, default=",")
    parser.add_argument("--temperature", type=float, default=0.2)
    parser.add_argument("--top_p", type=float, default=None)
    parser.add_argument("--num_beams", type=int, default=1)
    parser.add_argument("--max_new_tokens", type=int, default=512)
    args = parser.parse_args()
    eval_model(args)
```
I put the LayerNorm in the wrong place; I think that may be the point.
Have you noticed that if you use llava_v0, the stopping criteria's key string is set to "###"? Maybe it has no influence, though.
Also, Yi has provided their generation config in generation_config.json.
BTW, thanks for sharing; I'll share my progress when I get it working well.
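Two quick checks related to the points above, assuming the same local checkpoint directory used in the script (both are sketches, not part of any official demo):

```python
from llava.conversation import conv_templates, SeparatorStyle
from transformers import GenerationConfig

# 1) The stop string that llava_v0 feeds into KeywordsStoppingCriteria:
conv = conv_templates["llava_v0"].copy()
stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
print(repr(stop_str))  # "###" for llava_v0

# 2) The generation defaults the Yi team ship in generation_config.json:
print(GenerationConfig.from_pretrained("/path_of_model/llava_Yi-VL-6B"))
```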
It seems an official version has been provided: https://github.com/01-ai/Yi/blob/liuyudong/yi_vl/VL/single_inference.py
I see! Now the work shifts to how to fine-tune it.
As the above link does not work, here is the correct one in the Yi repo:
https://github.com/01-ai/Yi/blob/main/VL/single_inference.py