README.md · oshizo/japanese-e5-mistral-7b

license: mit
language:
  - ja
pipeline_tag: sentence-similarity

This model was created by merging intfloat/e5-mistral-7b-instruct and stabilityai/japanese-stablelm-base-gamma-7b.
For usage, see the intfloat/e5-mistral-7b-instruct model page or the evaluation notebook in oshizo/JapaneseEmbeddingEval.
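
As a rough illustration of what that usage involves: the e5-mistral-7b-instruct model card formats each query as a one-line task instruction followed by the query text, while documents are embedded without an instruction. The sketch below shows only that formatting step. It assumes this merged model expects the same prompt format as e5-mistral-7b-instruct; the task description and the Japanese query and document are illustrative and do not come from this README.

```python
def get_detailed_instruct(task_description: str, query: str) -> str:
    """Build an e5-mistral style query string: instruction line, then the query."""
    return f"Instruct: {task_description}\nQuery: {query}"


# Illustrative inputs (assumptions, not from this README).
task = "Given a web search query, retrieve relevant passages that answer the query"
queries = [get_detailed_instruct(task, "富士山の高さは?")]

# Documents are passed as plain text, without the instruction prefix.
documents = ["富士山の標高は3776メートルです。"]
```

The formatted strings would then be tokenized and embedded as described on the e5-mistral-7b-instruct page.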

The merge was done in the following steps.

Load intfloat/e5-mistral-7b-instruct with the "MistralForCausalLM" class and save it unchanged with save_pretrained.

Because e5-mistral-7b-instruct was built with the "MistralModel" class, it could not be merged with a "MistralForCausalLM" model as is.
In my environment, I had to load the model on the CPU rather than the GPU to avoid an error.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "intfloat/e5-mistral-7b-instruct"
# Load on the CPU; passing device_map="auto" raised an error in my environment.
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Save the checkpoint in the MistralForCausalLM layout so mergekit can read it.
model.save_pretrained("./e5-mistral-7b-instruct_with_lm_head")
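
One way to picture the difference between the two classes, offered as a sketch and not as a description of what mergekit actually checks: in transformers, MistralForCausalLM wraps a MistralModel under the attribute `model` and adds an `lm_head`, so its parameter names carry a `model.` prefix and include `lm_head.weight`. The helper `to_causal_lm_keys` below is hypothetical and only manipulates names; it loads no weights.

```python
def to_causal_lm_keys(base_keys):
    """Map MistralModel-style parameter names to MistralForCausalLM-style names."""
    # The wrapped base model lives under the "model." prefix.
    keys = [f"model.{k}" for k in base_keys]
    # The causal-LM class adds an output projection on top.
    keys.append("lm_head.weight")
    return keys


# A few representative MistralModel parameter names (not the full list).
base_keys = [
    "embed_tokens.weight",
    "layers.0.self_attn.q_proj.weight",
    "norm.weight",
]
causal_keys = to_causal_lm_keys(base_keys)
```

Since the e5 checkpoint contains no `lm_head`, that tensor is presumably freshly initialized when the model is loaded this way. That should not affect embeddings, which are taken from the hidden states.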

Merge the two models with mergekit, using the following YAML configuration.

merge_config.yaml

models:
  - model: stabilityai/japanese-stablelm-base-gamma-7b
  - model: ./e5-mistral-7b-instruct_with_lm_head
base_model: stabilityai/japanese-stablelm-base-gamma-7b
parameters:
  t:
    - value: [0.5, 0.9]

merge_method: slerp
dtype: float16

I tried the "linear", "slerp", and "task_arithmetic" merge methods, and this configuration seemed to work best.
The "t" values were chosen so that layers closer to the input draw more on japanese-stablelm-base-gamma-7b, to take advantage of its understanding of Japanese words, while layers closer to the output draw more on e5-mistral-7b-instruct, to produce good embeddings.
For the "ties" method, I could not find density and weight values that worked properly.
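
To make the role of "t" concrete, here is a minimal pure-Python sketch of spherical linear interpolation and of a per-layer "t" schedule running from 0.5 to 0.9. It is not mergekit's implementation. It assumes that t=0 returns the base model's weights and t=1 the other model's, and that the two-value list is interpolated linearly across the 32 transformer layers of a Mistral-7B model. The function names `slerp` and `layer_t` are illustrative.

```python
import math


def slerp(t, v0, v1, eps=1e-8):
    """Spherical linear interpolation: t=0 returns v0, t=1 returns v1."""
    norm0 = math.sqrt(sum(x * x for x in v0))
    norm1 = math.sqrt(sum(x * x for x in v1))
    linear = [(1 - t) * a + t * b for a, b in zip(v0, v1)]
    if norm0 < eps or norm1 < eps:
        # A zero vector has no direction; fall back to linear interpolation.
        return linear
    cos_theta = sum(a * b for a, b in zip(v0, v1)) / (norm0 * norm1)
    cos_theta = max(-1.0, min(1.0, cos_theta))
    if abs(cos_theta) > 1 - eps:
        # (Anti-)parallel vectors: sin(theta) is ~0, so use the linear fallback.
        return linear
    theta = math.acos(cos_theta)
    sin_theta = math.sin(theta)
    w0 = math.sin((1 - t) * theta) / sin_theta
    w1 = math.sin(t * theta) / sin_theta
    return [w0 * a + w1 * b for a, b in zip(v0, v1)]


def layer_t(layer_idx, num_layers, t_start=0.5, t_end=0.9):
    """Linearly interpolate t from t_start (first layer) to t_end (last layer)."""
    if num_layers == 1:
        return t_start
    frac = layer_idx / (num_layers - 1)
    return t_start + (t_end - t_start) * frac


NUM_LAYERS = 32  # number of transformer layers in Mistral-7B
ts = [layer_t(i, NUM_LAYERS) for i in range(NUM_LAYERS)]

# Toy 2-D "weights": v0 stands in for the base model, v1 for the other model.
merged_first_layer = slerp(ts[0], [1.0, 0.0], [0.0, 1.0])
```

Under these assumptions the first layer is an even blend of the two models, and the last layer is weighted heavily toward e5-mistral-7b-instruct, which matches the intent described above.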

Copy the settings related to pad_token from the e5-mistral-7b-instruct repository. The affected files are:

config.json
tokenizer.json
tokenizer.model
tokenizer_config.json
special_tokens_map.json
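
This README does not say whether whole files or only individual entries were copied. As one possible approach for the JSON files, the sketch below copies selected top-level keys from one JSON file into another and leaves the remaining entries untouched. The helper `copy_json_keys`, the paths in the comment, and the choice of `pad_token_id` as the key are assumptions for illustration.

```python
import json


def copy_json_keys(src_path, dst_path, keys):
    """Copy the given top-level keys from one JSON file into another.

    Returns the list of keys that were present in the source and copied.
    """
    with open(src_path, encoding="utf-8") as f:
        src = json.load(f)
    with open(dst_path, encoding="utf-8") as f:
        dst = json.load(f)

    copied = []
    for key in keys:
        if key in src:
            dst[key] = src[key]
            copied.append(key)

    # Rewrite the destination file with the updated entries.
    with open(dst_path, "w", encoding="utf-8") as f:
        json.dump(dst, f, ensure_ascii=False, indent=2)
    return copied


# Hypothetical usage (paths are placeholders):
# copy_json_keys("e5-mistral-7b-instruct/config.json",
#                "merged-model/config.json",
#                ["pad_token_id"])
```

Binary files such as tokenizer.model cannot be edited this way and would have to be copied whole.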