Average of vectors

#2 opened by conceptofmind

Thank you for your great work.

I was wondering if you could point to reference code for this:

"Therefore, we decided to use the average of the vectors corresponding to the original tokens of embed_tokensand each vector) as the initial value of the vectors corresponding to the added tokens (e.g., the vector of ). "

I greatly appreciate your help.

ELYZA.inc org (edited Sep 22, 2023)

Thanks for your interest in our model.

We took the average of the vectors as our initial value with the following code:

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf", torch_dtype=torch.float16
)

def replace_emb_by_original_emb_mean(new2old_index_mapping):
    # new2old_index_mapping: {new_id1: [old_id_1_1, old_id_1_2, ...], new_id2: [old_id_2_1, ...], ...}
    # input_emb: model.model.embed_tokens.weight
    # output_emb: model.lm_head.weight
    with torch.no_grad():
        for new_id, old_ids in new2old_index_mapping.items():
            new_input_emb = torch.mean(
                torch.stack(
                    [model.model.embed_tokens.weight[old_id] for old_id in old_ids],
                    dim=0
                ),
                dim=0
            )
            model.model.embed_tokens.weight[new_id] = new_input_emb

            new_output_emb = torch.mean(
                torch.stack(
                    [model.lm_head.weight[old_id] for old_id in old_ids],
                    dim=0
                ),
                dim=0
            )
            model.lm_head.weight[new_id] = new_output_emb

replace_emb_by_original_emb_mean(new2old_index_mapping)
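
One detail worth noting: the embedding matrices must already contain rows for the new token IDs before the function above can overwrite them. Here is a minimal sketch of that preliminary step, assuming the extended tokenizer is available (the variable and repo names below are our assumptions, not from the original code):

from transformers import AutoTokenizer

# Assumed variable and repo names; this step is not shown in the snippet above.
tokenizer_new = AutoTokenizer.from_pretrained("elyza/ELYZA-japanese-Llama-2-7b-fast")

# Grow embed_tokens and lm_head so that rows for the added token IDs exist.
# resize_token_embeddings initializes the new rows, and the averaging
# function above then overwrites them with the mean of the old vectors.
model.resize_token_embeddings(len(tokenizer_new))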

Hello,

Thank you for the additional information.

My follow-up question is: what exactly should be passed in as new2old_index_mapping? I understand the averaging code, but it is still unclear to me what the input to the replace_emb_by_original_emb_mean function would be.

Again, thank you for all your help.

ELYZA.inc org (edited Sep 22, 2023)

new2old_index_mapping is a dictionary that records, for each token added to the new tokenizer, the sequence of token IDs that represented it in the old tokenizer.

example:

>>> print(tokenizer_new.encode("こんにけは", add_special_tokens=False)[1:])
[41737]
>>> print(tokenizer_old.encode("こんにけは", add_special_tokens=False)[1:])
[30589, 30389, 30353, 30644, 30449]

new2old_index_mapping = {
    41737: [30589, 30389, 30353, 30644, 30449],
    ...
}
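
For reference, here is a minimal sketch of how such a mapping could be built for every added token at once. The thread does not show this step, so the tokenizer loading and variable names below are assumptions; the [1:] slice simply mirrors the encode example above:

from transformers import AutoTokenizer

# Assumed repo names; substitute the actual old and extended tokenizers.
tokenizer_old = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
tokenizer_new = AutoTokenizer.from_pretrained("elyza/ELYZA-japanese-Llama-2-7b-fast")

old_vocab = tokenizer_old.get_vocab()
new_vocab = tokenizer_new.get_vocab()

new2old_index_mapping = {}
for token, new_id in new_vocab.items():
    if token in old_vocab:
        continue  # only tokens added by the new tokenizer need initialization
    # Re-encode the surface form of the added token with the old tokenizer;
    # [1:] drops the leading piece, as in the example above.
    text = tokenizer_new.convert_tokens_to_string([token])
    old_ids = tokenizer_old.encode(text, add_special_tokens=False)[1:]
    if old_ids:
        new2old_index_mapping[new_id] = old_ids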

Thank you for the additional clarification.

Is the training corpus encoded twice during training in order to build this mapping?

Is the code for preparing and training the models available anywhere for us to review? That way I do not have to bother you with more questions.

Also, is there a specific way in which you would like to be cited for helping out with this code? I try to thoroughly acknowledge every individual.
