How to use the "mlp2x_gelu_Norm"?

#1
by teowu - opened

It seems Yi-VL-6B uses mlp2x_gelu_Norm as its multimodal projector. Does this differ from the original LLaVA mlp2x_gelu? Is the following implementation correct for loading the model?

```
Sequential(
  (0): Linear(in_features=1280, out_features=4096, bias=True)
  (1): GELU(approximate='none')
  (2): LayerNorm((4096,), eps=1e-05, elementwise_affine=False)
  (3): Linear(in_features=4096, out_features=4096, bias=True)
)
```

I think the weights are quite different from LLaVA's: LLaVA has 2 weights and 2 biases, while Yi has 4 weights and 4 biases.
[screenshot: 截屏2024-01-22 17.59.13.png]
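A minimal sketch of that parameter-count difference (the Yi-style layout with affine LayerNorms is an assumption at this point in the thread; each affine LayerNorm adds a weight and a bias):

```python
import torch.nn as nn

# LLaVA-style mlp2x_gelu: 2 Linear layers -> 2 weights + 2 biases
llava_proj = nn.Sequential(
    nn.Linear(1280, 4096), nn.GELU(), nn.Linear(4096, 4096)
)
# Assumed Yi-style layout with affine LayerNorms -> 4 weights + 4 biases
yi_proj = nn.Sequential(
    nn.Linear(1280, 4096), nn.LayerNorm(4096), nn.GELU(),
    nn.Linear(4096, 4096), nn.LayerNorm(4096),
)
print(sum(1 for _ in llava_proj.parameters()))  # 4 tensors (2 weight + 2 bias)
print(sum(1 for _ in yi_proj.parameters()))     # 8 tensors (4 weight + 4 bias)
```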

Yes, this is quite strange and seems very different from LLaVA's mlp2x_gelu.
Hope the Yi team can provide some code demos.

I fixed it just now.

First, add the code below to LLaVA/llava/model/multimodal_projector/builder.py:

```python
class MLP2xGELUNorm(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.linear1 = nn.Linear(input_size, hidden_size)
        self.gelu1 = nn.GELU()
        self.linear2 = nn.Linear(hidden_size, hidden_size)
        self.gelu2 = nn.GELU()

    def forward(self, x):
        x = self.linear1(x)
        x = self.gelu1(x)
        x = self.linear2(x)
        x = self.gelu2(x)
        return x
```

Then, add the code below in build_vision_projector():

```python
if projector_type == 'mlp2x_gelu_Norm':
    return MLP2xGELUNorm(config.mm_hidden_size, config.hidden_size)
```

DONE!
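A quick way to inspect what this produces (a minimal check, assuming the class above is importable):

```python
proj = MLP2xGELUNorm(1280, 4096)
print(proj)  # Linear -> GELU -> Linear -> GELU; note: no LayerNorm here
print(sum(1 for _ in proj.parameters()))  # 4 tensors: 2 weights + 2 biases
```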

But is the structure the same as what the Yi team used?
If the structure is not the same, the capability will not be the same either.

I fixed it:
use mlp2x_gelu and add 2 LayerNorm layers.
PyTorch saves the LayerNorm affine parameters, so there will be 4 weight parameters.
[screenshot: 截屏2024-01-22 19.29.17.png]
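This matches PyTorch's default: nn.LayerNorm has elementwise_affine=True unless disabled, so each norm stores a weight and a bias (the elementwise_affine=False variant in the first post stores nothing). A quick check:

```python
import torch.nn as nn

ln = nn.LayerNorm(4096)  # elementwise_affine=True by default
print([name for name, _ in ln.named_parameters()])  # ['weight', 'bias']

ln_plain = nn.LayerNorm(4096, elementwise_affine=False)
print([name for name, _ in ln_plain.named_parameters()])  # []
```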

Found it on the Yi website.


Thanks, I fixed it too.

I modified the system prompt and encountered another issue during inference.
Through the CLI inference provided by LLaVA, this model looks like a repeater, sometimes repeating the last token and at other times repeating an entire sentence; so does the 34B. Moreover, there is a multi-round dialogue when generating one response.
It looks very strange.

> Through the CLI inference provided by LLaVA, this model looks like a repeater [...]

I changed conv-mode to llava_v0 and set conv.system to "This is a chat between an inquisitive human and an AI assistant. Assume the role of the AI assistant. Read all the images carefully, and respond to the human's questions with informative, helpful, detailed and polite answers. 这是一个好奇的人类和一个人工智能助手之间的对话。假设你扮演这个AI助手的角色。仔细阅读所有的图像,并对人类的问题做出信息丰富、有帮助、详细的和礼貌的回答。" (the Chinese repeats the same instruction), and it looks normal now.

Another issue I've encountered: when I test the question "Is anyone smoking?", I get numerous hallucinations. Some other setting might be incorrect. What other adjustments might I need to make?
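For reference, a minimal sketch of that override using LLaVA's conversation templates (the question string is just an example):

```python
from llava.conversation import conv_templates

conv = conv_templates["llava_v0"].copy()
# Replace the default system prompt with Yi-VL's bilingual one
conv.system = (
    "This is a chat between an inquisitive human and an AI assistant. "
    "Assume the role of the AI assistant. Read all the images carefully, "
    "and respond to the human's questions with informative, helpful, "
    "detailed and polite answers. "
    "这是一个好奇的人类和一个人工智能助手之间的对话。假设你扮演这个AI助手的角色。"
    "仔细阅读所有的图像,并对人类的问题做出信息丰富、有帮助、详细的和礼貌的回答。"
)
conv.append_message(conv.roles[0], "<image>\nIs anyone smoking?")  # example question
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()
```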

> Through the CLI inference provided by LLaVA, this model looks like a repeater [...]

Doesn't work for me; can you provide the mlp_projector and CLI code?


> Doesn't work for me; can you provide the mlp_projector and CLI code?
My mlp_projector:
```python
mlp_gelu_norm_match = re.match(r'^mlp(\d+)x_gelu_Norm$', projector_type)
if mlp_gelu_norm_match:
    mlp_depth = int(mlp_gelu_norm_match.group(1))
    modules = [nn.Linear(config.mm_hidden_size, config.hidden_size)]
    for _ in range(1, mlp_depth):
        modules.append(nn.LayerNorm(config.hidden_size))
        modules.append(nn.GELU())
        modules.append(nn.Linear(config.hidden_size, config.hidden_size))
        modules.append(nn.LayerNorm(config.hidden_size))
    return nn.Sequential(*modules)
```
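A quick sanity check of the resulting structure, with a stand-in config (hypothetical sizes matching Yi-VL-6B):

```python
import re
from types import SimpleNamespace
import torch
import torch.nn as nn

def build_norm_projector(config, projector_type="mlp2x_gelu_Norm"):
    # Same logic as the snippet above, wrapped for a standalone test
    m = re.match(r'^mlp(\d+)x_gelu_Norm$', projector_type)
    mlp_depth = int(m.group(1))
    modules = [nn.Linear(config.mm_hidden_size, config.hidden_size)]
    for _ in range(1, mlp_depth):
        modules += [nn.LayerNorm(config.hidden_size), nn.GELU(),
                    nn.Linear(config.hidden_size, config.hidden_size),
                    nn.LayerNorm(config.hidden_size)]
    return nn.Sequential(*modules)

config = SimpleNamespace(mm_hidden_size=1280, hidden_size=4096)  # stand-in config
proj = build_norm_projector(config)
print(proj)                               # Linear -> LayerNorm -> GELU -> Linear -> LayerNorm
print(sum(1 for _ in proj.parameters()))  # 8 tensors: 4 weights + 4 biases
out = proj(torch.randn(1, 576, 1280))     # dummy batch of vision tokens
print(out.shape)                          # torch.Size([1, 576, 4096])
```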

I don't use the CLI code; I use Yi-VL in two ways: eval/run_llava (another change: since the LLaVA project keys off the string "llava" in several places, I set a soft link named llava_Yi-VL-6B pointing to Yi-VL-6B; see the sketch after the code below) and a test on the gradio demo:
```python
# Adapted from llava/eval/run_llava.py (imports and the image_parser /
# load_images helpers are as defined in that file).
def eval_model(args):
    # Model
    disable_torch_init()

    device = "5"  # GPU index
    model_name = get_model_name_from_path(args.model_path)
    tokenizer, model, image_processor, context_len = load_pretrained_model(
        args.model_path, args.model_base, model_name,
        device_map=f"cuda:{device}", device=f"cuda:{device}"
    )

    qs = args.query
    image_token_se = DEFAULT_IM_START_TOKEN + DEFAULT_IMAGE_TOKEN + DEFAULT_IM_END_TOKEN
    if IMAGE_PLACEHOLDER in qs:
        if model.config.mm_use_im_start_end:
            qs = re.sub(IMAGE_PLACEHOLDER, image_token_se, qs)
        else:
            qs = re.sub(IMAGE_PLACEHOLDER, DEFAULT_IMAGE_TOKEN, qs)
    else:
        if model.config.mm_use_im_start_end:
            qs = image_token_se + "\n" + qs
        else:
            qs = DEFAULT_IMAGE_TOKEN + "\n" + qs

    if "llama-2" in model_name.lower():
        conv_mode = "llava_llama_2"
    elif "v1" in model_name.lower():
        conv_mode = "llava_v1"
    elif "mpt" in model_name.lower():
        conv_mode = "mpt"
    else:
        conv_mode = "llava_v0"

    if args.conv_mode is not None and conv_mode != args.conv_mode:
        print(
            "[WARNING] the auto inferred conversation mode is {}, while `--conv-mode` is {}, using {}".format(
                conv_mode, args.conv_mode, args.conv_mode
            )
        )
    else:
        args.conv_mode = conv_mode

    conv = conv_templates[args.conv_mode].copy()
    # Override the system prompt with Yi-VL's bilingual one
    conv.system = "This is a chat between an inquisitive human and an AI assistant. Assume the role of the AI assistant. Read all the images carefully, and respond to the human's questions with informative, helpful, detailed and polite answers. 这是一个好奇的人类和一个人工智能助手之间的对话。假设你扮演这个AI助手的角色。仔细阅读所有的图像,并对人类的问题做出信息丰富、有帮助、详细的和礼貌的回答。"
    conv.append_message(conv.roles[0], qs)
    conv.append_message(conv.roles[1], None)
    prompt = conv.get_prompt()

    image_files = image_parser(args)
    images = load_images(image_files)
    images_tensor = process_images(
        images,
        image_processor,
        model.config
    ).to(model.device, dtype=torch.float16)

    input_ids = (
        tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt")
        .unsqueeze(0).to(model.device)
    )

    stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
    keywords = [stop_str]
    stopping_criteria = KeywordsStoppingCriteria(keywords, tokenizer, input_ids)

    with torch.inference_mode():
        output_ids = model.generate(
            input_ids,
            images=images_tensor,
            do_sample=True if args.temperature > 0 else False,
            temperature=args.temperature,
            top_p=args.top_p,
            num_beams=args.num_beams,
            max_new_tokens=args.max_new_tokens,
            use_cache=True,
            stopping_criteria=[stopping_criteria],
        )

    input_token_len = input_ids.shape[1]
    n_diff_input_output = (input_ids != output_ids[:, :input_token_len]).sum().item()
    if n_diff_input_output > 0:
        print(
            f"[Warning] {n_diff_input_output} output_ids are not the same as the input_ids"
        )
    outputs = tokenizer.batch_decode(
        output_ids[:, input_token_len:], skip_special_tokens=True
    )[0]
    outputs = outputs.strip()
    if outputs.endswith(stop_str):
        outputs = outputs[: -len(stop_str)]
    outputs = outputs.strip()
    print(outputs)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--model-path", type=str, default="/path_of_model/llava_Yi-VL-6B")
    parser.add_argument("--model-base", type=str, default=None)
    parser.add_argument("--image-file", type=str, default="/path_of_image/llava_logo.png")
    parser.add_argument("--query", type=str, default="图中有多少大象?")  # "How many elephants are in the picture?"
    parser.add_argument("--conv-mode", type=str, default="llava_v0")
    parser.add_argument("--sep", type=str, default=",")
    parser.add_argument("--temperature", type=float, default=0.2)
    parser.add_argument("--top_p", type=float, default=None)
    parser.add_argument("--num_beams", type=int, default=1)
    parser.add_argument("--max_new_tokens", type=int, default=512)
    args = parser.parse_args()

    eval_model(args)
```
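Regarding the soft link mentioned above, a minimal sketch (hypothetical paths; adjust to your layout):

```python
import os

# LLaVA's loader dispatches on the substring "llava" in the model path,
# so expose the Yi-VL checkpoint under a name that contains it.
os.symlink("/path_of_model/Yi-VL-6B", "/path_of_model/llava_Yi-VL-6B")
```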

I put the LayerNorm in the wrong place; I think that may be the point.
Have you noticed that if you use llava_v0, the stopping criteria's key string is set to "###"? But maybe it has no influence.
Also, Yi has provided their generation config in generation_config.json.
BTW, thanks for sharing; I'll share my progress when I get it working well.
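A sketch of reading that config with transformers (assuming the weights are the 01-ai/Yi-VL-6B Hub repo, or a local copy of it):

```python
from transformers import GenerationConfig

# generation_config.json ships with the checkpoint; load it directly
gen_cfg = GenerationConfig.from_pretrained("01-ai/Yi-VL-6B")  # or a local path
print(gen_cfg)
# Then pass it through: model.generate(input_ids, generation_config=gen_cfg, ...)
```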

> Doesn't work for me; can you provide the mlp_projector and CLI code?

It seems an official version has been provided: https://github.com/01-ai/Yi/blob/liuyudong/yi_vl/VL/single_inference.py

> It seems an official version has been provided: https://github.com/01-ai/Yi/blob/liuyudong/yi_vl/VL/single_inference.py
I see! Now the work turns to how to fine-tune it.

As the above link does not work, here is a corrected one in the Yi repo:
https://github.com/01-ai/Yi/blob/main/VL/single_inference.py
