
Thank you

#3
by anon8231489123 - opened

Thank you so much for being the first to do this, seriously. Can you post the hyperparameters you used for training?

You're welcome, and many thanks for leading the work on the dataset.

The hyperparameters are the same as in the 1.0 snapshot of the FastChat repository, except for the modifications needed to train on 40GB A100s (mainly gradient accumulation steps and such; the accumulation arithmetic is unpacked just after the flags):

    ..... (most setup specific items removed)
    --num_train_epochs 3 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps $((128 * 512 / 2048 / 2 / 4 / 4)) \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 1200 \
    --save_total_limit 10 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer' \
    --tf32 False \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --lazy_preprocess True
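
As a side note, the --gradient_accumulation_steps value above is a shell arithmetic expression, so the trainer just receives a plain integer. My reading of the factors (target global batch, per-device batch, GPUs per node, nodes) is a guess rather than anything stated here, but the arithmetic itself is easy to check:

    # Evaluate the same expression the training command passes in.
    # 128 * 512 / 2048 = 32, then / 2 / 4 / 4 = 1 (bash integer division).
    echo $((128 * 512 / 2048 / 2 / 4 / 4))   # prints 1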

I will probably also try --fsdp "shard_grad_op auto_wrap" for the next run, since there are reports it could make things a bit faster.

Yes, thanks! And thanks for going straight to a ggml version -- long live llama.cpp!!

By the way, any reason for not going with the 1.1 version? I assume you already started before it was released, but maybe there was another reason?

Thanks for training this. Seems good! I re-quantized the f16 bin (thanks for providing it as well!) with https://github.com/ggerganov/llama.cpp/pull/896, as it seems to improve outputs/perplexity. Side note: I suggest editing the pull so that

    bool useNewQuantization = false;

becomes

    bool useNewQuantization = true;

and/or

    case 2: quantized_type = GGML_TYPE_Q4_0; break;

becomes

    case 2: quantized_type = GGML_TYPE_Q4_0; useNewQuantization = true; break;

so you can use 2 with quantize. Otherwise the model gets written with ftype 4 and reports as mostly q4_1/some f16 when it isn't.
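
For anyone reproducing this, the re-quantization step itself is just llama.cpp's quantize tool with ftype 2 (Q4_0) once the PR is applied; a minimal sketch, with placeholder filenames rather than the actual files from this repo:

    # Build the tools (including quantize) after applying the PR, then
    # re-quantize the provided f16 model to Q4_0 (ftype 2).
    make
    ./quantize ./ggml-vicuna-13b-f16.bin ./ggml-vicuna-13b-q4_0.bin 2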

Will this work with llama.cpp?

Also, what do you recommend for best results in the prompt text file? I currently use:

You are an AI language model designed to assist the Human by answering their questions, offering advice, and engaging in casual conversation in a friendly, helpful, and informative manner. You respond clearly, coherently, and you consider the conversation history.

Human: Hey, how's it going?

Assistant: Hey there! I'm doing great, thank you. What can I help you with today? Let's have a fun chat!

With v1.0 it's best to use "### Assistant:" and "### Human:", while what you listed in your message is the format for v1.1 of this model. (Correct me if I'm wrong!) And yes, it's working with llama.cpp for me :)
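
For reference, a rough llama.cpp invocation for the v1.0 format (the model filename and sampling settings here are just placeholders; the important parts are the prompt file and the reverse prompt):

    # Interactive chat; generation stops and hands control back whenever the
    # model starts a new "### Human:" turn. prompt.txt should use the
    # "### Human:" / "### Assistant:" labels for v1.0.
    ./main -m ./ggml-vicuna-13b-q4_0.bin \
      -f prompt.txt -i -r "### Human:" \
      -c 2048 --color --temp 0.7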

Thank you!!

@spanielrassler Yeah, 1.1 dropped while 1.0 was already in training and two epochs in. Bummer, but the next version will be 1.1.

> With v1.0 it's best to use "### Assistant:" and "### Human:", while what you listed in your message is the format for v1.1 of this model. (Correct me if I'm wrong!) And yes, it's working with llama.cpp for me :)

That stopped the repeating, but it now seems to default to censored answers again. Do you think you might be able to do an uncensored version of "vicuna-13B-1.1-GPTQ-4bit-32g.GGML.bin"?

https://huggingface.co/TheBloke/vicuna-13B-1.1-GPTQ-4bit-128g-GGML

Please don't focus only on ggml; some of us still use graphics cards for LLMs, too. :)

> Please don't focus only on ggml; some of us still use graphics cards for LLMs, too. :)

Yes, sorry, and I agree, but the large majority of people will be able to buy more RAM than a 24 GB GPU ;)

> Please don't focus only on ggml; some of us still use graphics cards for LLMs, too. :)

> Yes, sorry, and I agree, but the large majority of people will be able to buy more RAM than a 24 GB GPU ;)

Absolutely correct. Good things come to those who wait. Thanks for the model, btw!
