Average of vectors

#2 opened by conceptofmind

Thank you for your great work.

I was wondering if you could point to reference code for this:

"Therefore, we decided to use the average of the vectors corresponding to the original tokens of embed_tokensand each vector) as the initial value of the vectors corresponding to the added tokens (e.g., the vector of ). "

I greatly appreciate your help.

ELYZA.inc org (edited Sep 22, 2023)

Thanks for your interest in our model.

We took the average of the vectors as our initial value with the following code:

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf", torch_dtype=torch.float16
)

def replace_emb_by_original_emb_mean(new2old_index_mapping):
    # new2old_index_mapping: {new_id1: [old_id_1_1, old_id_1_2, ...], new_id2: [old_id_2_1, ...], ...}
    # input_emb: model.model.embed_tokens.weight
    # output_emb: model.lm_head.weight
    with torch.no_grad():
        for new_id, old_ids in new2old_index_mapping.items():
            new_input_emb = torch.mean(
                torch.stack(
                    [model.model.embed_tokens.weight[old_id] for old_id in old_ids],
                    dim=0
                ),
                dim=0
            )
            model.model.embed_tokens.weight[new_id] = new_input_emb

            new_output_emb = torch.mean(
                torch.stack(
                    [model.lm_head.weight[old_id] for old_id in old_ids],
                    dim=0
                ),
                dim=0
            )
            model.lm_head.weight[new_id] = new_output_emb

replace_emb_by_original_emb_mean(new2old_index_mapping)
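
One detail worth noting: the embedding matrices must already contain rows for the new token IDs before the function above can overwrite them. Here is a minimal sketch of that preliminary step, assuming the extended tokenizer is available (the variable and repo names below are our assumptions, not from the original code):

from transformers import AutoTokenizer

# Assumed variable and repo names; this step is not shown in the snippet above.
tokenizer_new = AutoTokenizer.from_pretrained("elyza/ELYZA-japanese-Llama-2-7b-fast")

# Grow embed_tokens and lm_head so that rows for the added token IDs exist.
# resize_token_embeddings initializes the new rows, and the averaging
# function above then overwrites them with the mean of the old vectors.
model.resize_token_embeddings(len(tokenizer_new))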

Hello,

Thank you for the additional information.

My follow-up question is: what exactly should be passed in as new2old_index_mapping? I understand the averaging code, but it is still unclear to me what the input to the replace_emb_by_original_emb_mean function would be.

Again, thank you for all your help.

ELYZA.inc org (edited Sep 22, 2023)

new2old_index_mapping is a dictionary that records, for each token added to the new tokenizer, the sequence of token IDs that represented it in the old tokenizer.

example:

>>> print(tokenizer_new.encode("こんにけは", add_special_tokens=False)[1:])
[41737]
>>> print(tokenizer_old.encode("こんにけは", add_special_tokens=False)[1:])
[30589, 30389, 30353, 30644, 30449]

new2old_index_mapping = {
    41737: [30589, 30389, 30353, 30644, 30449],
    ...
}
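
For reference, here is a minimal sketch of how such a mapping could be built for every added token at once. The thread does not show this step, so the tokenizer loading and variable names below are assumptions; the [1:] slice simply mirrors the encode example above:

from transformers import AutoTokenizer

# Assumed repo names; substitute the actual old and extended tokenizers.
tokenizer_old = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
tokenizer_new = AutoTokenizer.from_pretrained("elyza/ELYZA-japanese-Llama-2-7b-fast")

old_vocab = tokenizer_old.get_vocab()
new_vocab = tokenizer_new.get_vocab()

new2old_index_mapping = {}
for token, new_id in new_vocab.items():
    if token in old_vocab:
        continue  # only tokens added by the new tokenizer need initialization
    # Re-encode the surface form of the added token with the old tokenizer;
    # [1:] drops the leading piece, as in the example above.
    text = tokenizer_new.convert_tokens_to_string([token])
    old_ids = tokenizer_old.encode(text, add_special_tokens=False)[1:]
    if old_ids:
        new2old_index_mapping[new_id] = old_ids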

Thank you for the additional clarification.

Is the training corpus encoded twice during training in order to build this mapping?

Is the code for preparing and training the models available anywhere for us to review? That way I do not have to bother you with more questions.

Also, is there a specific way in which you would like to be cited for helping out with this code? I try to thoroughly acknowledge every individual.
