Can you provide the data processing demo used before training the LLMs?

#7
by scall - opened

Data processing demo:

    import torch

    # `tokenizer` and `SEQ_LEN` are assumed to be defined elsewhere
    def generate_and_tokenize_prompt2(examples, CUTOFF_LEN=SEQ_LEN):
        instruction, input_text = examples["instruction"].strip(), examples["input"].strip()
        if len(input_text) == 0:
            user_prompt = f"User:{instruction}\n\nAssistant: \n"
        else:
            user_prompt = f"User:{instruction}\n###{input_text}\n\nAssistant: \n"
        # prompt token count, minus one (no eos token)
        len_user_prompt_tokens = len(tokenizer(user_prompt, truncation=True, max_length=CUTOFF_LEN + 1)["input_ids"]) - 1
        full_tokens = tokenizer(user_prompt + examples["output"], truncation=True, max_length=CUTOFF_LEN + 1, padding="max_length")["input_ids"][:-1]
        # ignore the prompt tokens and pad tokens (assumed id 0) in the loss
        labels = [-100] * len_user_prompt_tokens + [tok if tok != 0 else -100 for tok in full_tokens[len_user_prompt_tokens:]]
        return {
            "input_ids": full_tokens,
            "labels": torch.LongTensor(labels),
            "attention_mask": torch.LongTensor([1] * len(full_tokens)),
        }
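
For reference, a minimal sketch of how this function might be applied with the Hugging Face `datasets` library (the data file name is hypothetical, and the dataset is assumed to have `instruction`, `input`, and `output` columns):

    from datasets import load_dataset

    # hypothetical instruction dataset with "instruction", "input", "output" columns
    data = load_dataset("json", data_files="train.json")["train"]
    tokenized = data.map(
        generate_and_tokenize_prompt2,
        remove_columns=["instruction", "input", "output"],
    )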

I used my data processing code to instruction-tune manticore-13b, and the training loss is initially very large (around 78.0), but your training loss is very small (https://wandb.ai/wing-lian/manticore-13b/runs/nq3u3uoh/workspace).
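
As a quick sanity check on the pipeline (a hedged sketch, not from the original post), one can decode the tokens left unmasked in `labels` and confirm that only the response text is supervised:

    # toy example; the real tokenizer and prompt format are assumed from the code above
    example = {"instruction": "Say hello", "input": "", "output": "Hello!"}
    features = generate_and_tokenize_prompt2(example)
    kept = [t for t, l in zip(features["input_ids"], features["labels"].tolist()) if l != -100]
    print(tokenizer.decode(kept))  # should print only the expected output text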

Open Access AI Collective org

How high is your learning rate set?

> How high is your learning rate set?

The batch size is 48, running on a single node with 8 GPUs; the learning rate is 1e-5, and warmup steps are 200.
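
For context, a rough sketch of how those settings might map to Hugging Face `TrainingArguments` (treating 48 as the per-device batch size and using a placeholder output directory):

    from transformers import TrainingArguments

    # assumed mapping of the settings above; with 8 GPUs this gives an effective batch of 384
    training_args = TrainingArguments(
        output_dir="./manticore-13b-finetune",  # hypothetical path
        per_device_train_batch_size=48,
        learning_rate=1e-5,
        warmup_steps=200,
    )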
