---
base_model: FreedomIntelligence/AceGPT-7B-chat
inference: false
license: llama2
model_creator: FreedomIntelligence
model_name: AceGPT 7B chat
model_type: llama2
quantized_by: MohamedRashad
datasets:
- FreedomIntelligence/Arabic-Vicuna-80
- FreedomIntelligence/Arabic-AlpacaEval
- FreedomIntelligence/MMLU_Arabic
- FreedomIntelligence/EXAMs
- FreedomIntelligence/ACVA-Arabic-Cultural-Value-Alignment
language:
- en
- ar
library_name: transformers
---
# AceGPT 7B Chat - AWQ

- Model creator: [FreedomIntelligence](https://huggingface.co/FreedomIntelligence)
- Original model: [AceGPT 7B Chat](https://huggingface.co/FreedomIntelligence/AceGPT-7B-chat)

## Description

This repo contains AWQ model files for [FreedomIntelligence's AceGPT 7B Chat](https://huggingface.co/FreedomIntelligence/AceGPT-7B-chat).

In an effort to make Arabic LLMs available to consumers with modest GPUs, I have quantized two important models:

- [AceGPT 13B Chat AWQ](https://huggingface.co/MohamedRashad/AceGPT-13B-chat-AWQ)
- [AceGPT 7B Chat AWQ](https://huggingface.co/MohamedRashad/AceGPT-7B-chat-AWQ) **(We are here)**

### About AWQ

AWQ is an efficient, accurate and blazing-fast low-bit weight quantization method, currently supporting 4-bit quantization. Compared to GPTQ, it offers faster Transformers-based inference with equivalent or better quality compared to the most commonly used GPTQ settings.

It is supported by:

- [Text Generation Webui](https://github.com/oobabooga/text-generation-webui) - using Loader: AutoAWQ
- [vLLM](https://github.com/vllm-project/vllm) - Llama and Mistral models only
- [Hugging Face Text Generation Inference (TGI)](https://github.com/huggingface/text-generation-inference)
- [Transformers](https://huggingface.co/docs/transformers) version 4.35.0 and later, from any code or client that supports Transformers
- [AutoAWQ](https://github.com/casper-hansen/AutoAWQ) - for use from Python code

## Prompt template

```
[INST] <<SYS>>\nأنت مساعد مفيد ومحترم وصادق. أجب دائما بأكبر قدر ممكن من المساعدة بينما تكون آمنا. يجب ألا تتضمن إجاباتك أي محتوى ضار أو غير أخلاقي أو عنصري أو جنسي أو سام أو خطير أو غير قانوني. يرجى التأكد من أن ردودك غير متحيزة اجتماعيا وإيجابية بطبيعتها.\n\nإذا كان السؤال لا معنى له أو لم يكن متماسكا من الناحية الواقعية، اشرح السبب بدلا من الإجابة على شيء غير صحيح. إذا كنت لا تعرف إجابة سؤال ما، فيرجى عدم مشاركة معلومات خاطئة.\n<</SYS>>\n\n [INST] {prompt} [/INST]
```

## Inference from Python code using Transformers

### Install the necessary packages

- Requires: [Transformers](https://huggingface.co/docs/transformers) 4.35.0 or later.
- Requires: [AutoAWQ](https://github.com/casper-hansen/AutoAWQ) 0.1.6 or later.

```shell
pip3 install --upgrade "autoawq>=0.1.6" "transformers>=4.35.0"
```

Note that if you are using PyTorch 2.0.1, the above AutoAWQ command will automatically upgrade you to PyTorch 2.1.0.

If you are using CUDA 11.8 and wish to continue using PyTorch 2.0.1, run this command instead:

```shell
pip3 install https://github.com/casper-hansen/AutoAWQ/releases/download/v0.1.6/autoawq-0.1.6+cu118-cp310-cp310-linux_x86_64.whl
```

If you have problems installing [AutoAWQ](https://github.com/casper-hansen/AutoAWQ) using the pre-built wheels, install it from source instead:

```shell
pip3 uninstall -y autoawq
git clone https://github.com/casper-hansen/AutoAWQ
cd AutoAWQ
pip3 install .
```

### Transformers example code (requires Transformers 4.35.0 and later)

```python
import torch  # needed for torch.float16 below
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

model_name_or_path = "MohamedRashad/AceGPT-7B-chat-AWQ"

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, padding_side="right")
model = AutoModelForCausalLM.from_pretrained(
    model_name_or_path,
    use_flash_attention_2=True,  # disable if you have problems with flash attention 2
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    device_map="auto"
)

# Using the text streamer to stream output one token at a time
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

# English: "What is the most beautiful line of poetry in the Arabic language?"
prompt = "ما أجمل بيت شعر فى اللغة العربية ؟"

# The system message is an Arabic rendering of the default Llama 2 system prompt
prompt_template=f'''[INST] <<SYS>>\nأنت مساعد مفيد ومحترم وصادق. أجب دائما بأكبر قدر ممكن من المساعدة بينما تكون آمنا. يجب ألا تتضمن إجاباتك أي محتوى ضار أو غير أخلاقي أو عنصري أو جنسي أو سام أو خطير أو غير قانوني. يرجى التأكد من أن ردودك غير متحيزة اجتماعيا وإيجابية بطبيعتها.\n\nإذا كان السؤال لا معنى له أو لم يكن متماسكا من الناحية الواقعية، اشرح السبب بدلا من الإجابة على شيء غير صحيح. إذا كنت لا تعرف إجابة سؤال ما، فيرجى عدم مشاركة معلومات خاطئة.\n<</SYS>>\n\n [INST] {prompt} [/INST]
'''

# Convert prompt to tokens
tokens = tokenizer(
    prompt_template,
    return_tensors='pt'
).input_ids.cuda()

generation_params = {
    "do_sample": True,
    "temperature": 0.7,
    "top_p": 0.95,
    "top_k": 40,
    "max_new_tokens": 512,
    "repetition_penalty": 1.1
}

# Generate streamed output, visible one token at a time
generation_output = model.generate(
    tokens,
    streamer=streamer,
    **generation_params
)

# Generation without a streamer, which will include the prompt in the output
generation_output = model.generate(
    tokens,
    **generation_params
)

# Get the tokens from the output, decode them, print them
token_output = generation_output[0]
text_output = tokenizer.decode(token_output)
print("model.generate output: ", text_output)

# Inference is also possible via Transformers' pipeline
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    **generation_params
)

pipe_output = pipe(prompt_template)[0]['generated_text']
print("pipeline output: ", pipe_output)
```

## How the AWQ quantization was done

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer, AutoModelForCausalLM

model_path = "FreedomIntelligence/AceGPT-7B-chat"
quant_path = "AceGPT-7B-chat-AWQ"
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}
load_config = {
    "low_cpu_mem_usage": True,
    "device_map": "auto",
    "trust_remote_code": True,
}

# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path, **load_config)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

# Quantize
model.quantize(tokenizer, quant_config=quant_config)

# Save quantized model
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)

# Load quantized model
model = AutoModelForCausalLM.from_pretrained(quant_path)
tokenizer = AutoTokenizer.from_pretrained(quant_path)

# Push to hub
model.push_to_hub(quant_path)
tokenizer.push_to_hub(quant_path)
```
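
For readers wondering what the `w_bit`, `q_group_size` and `zero_point` entries in `quant_config` control, the following is a minimal pure-Python sketch of group-wise 4-bit rounding with a zero point. It is an illustration under simplifying assumptions, not AutoAWQ's implementation: the helper names (`quantize_group`, `dequantize_group`, `quantize_row`) are hypothetical, the weights are random toy data, and AWQ's activation-aware scaling step is left out entirely.

```python
import random


def quantize_group(weights, w_bit=4):
    """Map one group of floats to integer codes in [0, 2**w_bit - 1]."""
    q_max = 2 ** w_bit - 1
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / q_max
    if scale == 0.0:
        scale = 1.0  # constant group: any non-zero scale works
    # The zero point is the integer code that represents the float value 0.0
    zero_point = min(q_max, max(0, round(-w_min / scale)))
    codes = [min(q_max, max(0, round(w / scale) + zero_point)) for w in weights]
    return codes, scale, zero_point


def dequantize_group(codes, scale, zero_point):
    """Recover approximate floats from the integer codes."""
    return [(c - zero_point) * scale for c in codes]


def quantize_row(row, q_group_size=128, w_bit=4):
    """Quantize a row of weights, one (scale, zero_point) pair per group."""
    return [
        quantize_group(row[start:start + q_group_size], w_bit)
        for start in range(0, len(row), q_group_size)
    ]


# Toy data standing in for one row of a weight matrix (not real model weights)
random.seed(0)
row = [random.gauss(0.0, 0.02) for _ in range(512)]

groups = quantize_row(row, q_group_size=128, w_bit=4)
restored = [w for codes, scale, zp in groups for w in dequantize_group(codes, scale, zp)]
max_error = max(abs(a - b) for a, b in zip(row, restored))
max_scale = max(scale for _, scale, _ in groups)

print(f"groups: {len(groups)}, max abs error: {max_error:.6f}, largest scale: {max_scale:.6f}")
```

Each group of 128 weights shares one scale and one zero point, so a smaller `q_group_size` tracks the local weight range more closely at the cost of storing more per-group metadata. In this sketch the round-trip error of any weight stays within about half of its group's scale.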