
Llama3-70b-Instruct-4bit

This model is a 4-bit quantized version of meta-llama/Meta-Llama-3-70B-Instruct.

Libraries to Install

  • pip install transformers torch
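
Depending on your setup, loading a 4-bit checkpoint with device_map may also require accelerate and bitsandbytes (this is an assumption based on common transformers quantization setups, not a confirmed requirement of this checkpoint):

  • pip install accelerate bitsandbytes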

Authentication needed before running the script

Run one of the following commands, depending on whether you are working in a terminal or a Jupyter notebook:

  • Terminal: huggingface-cli login

  • Jupyter notebook:

    >>> from huggingface_hub import notebook_login
    >>> notebook_login()
    

NOTE: Paste the access token from your Hugging Face account (Settings > Access Tokens); create a new token or copy an existing one.
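
Alternatively, you can authenticate programmatically inside a script. A minimal sketch, assuming your access token is exported in the HF_TOKEN environment variable:

    >>> import os
    >>> from huggingface_hub import login
    >>> # Read the token from the environment instead of pasting it interactively
    >>> login(token=os.environ["HF_TOKEN"])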

Script

>>> from transformers import AutoTokenizer, AutoModelForCausalLM
>>> import torch

>>> # Load model and tokenizer
>>> model_id = "screevoai/llama3-70b-instruct-4bit"
>>> tokenizer = AutoTokenizer.from_pretrained(model_id)

>>> model = AutoModelForCausalLM.from_pretrained(
>>>    model_id,
>>>    torch_dtype=torch.bfloat16,
>>>    device_map="cuda:0"
>>> )

>>> # Chat messages (system prompt + user question)
>>> messages = [
>>>     {"role": "system", "content": "You are a personal assistant chatbot, so respond accordingly"},
>>>     {"role": "user", "content": "What is Machine Learning?"},
>>> ]

>>> input_ids = tokenizer.apply_chat_template(
>>>     messages,
>>>     add_generation_prompt=True,
>>>     return_tensors="pt"
>>> ).to(model.device)

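>>> # Stop generation at either the default EOS token or Llama 3's end-of-turn token <|eot_id|>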
>>> terminators = [
>>>     tokenizer.eos_token_id,
>>>     tokenizer.convert_tokens_to_ids("<|eot_id|>")
>>> ]

>>> # Generate predictions using the model
>>> outputs = model.generate(
>>>    input_ids,
>>>    max_new_tokens=512,
>>>    eos_token_id=terminators,
>>>    do_sample=True,
>>>    temperature=0.6,
>>>    top_p=0.9,
>>> )
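>>> # Keep only the newly generated tokens (drop the prompt portion of the output)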
>>> response = outputs[0][input_ids.shape[-1]:]

>>> print(tokenizer.decode(response, skip_special_tokens=True))
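
Optionally, you can stream the reply token by token instead of waiting for the full generation. A minimal sketch using transformers' TextStreamer, reusing the model, tokenizer, input_ids, and terminators defined above:

>>> from transformers import TextStreamer

>>> streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
>>> _ = model.generate(
>>>    input_ids,
>>>    max_new_tokens=512,
>>>    eos_token_id=terminators,
>>>    do_sample=True,
>>>    temperature=0.6,
>>>    top_p=0.9,
>>>    streamer=streamer,
>>> )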

Model Details

  • Format: Safetensors
  • Model size: 37.4B params
  • Tensor types: F32, BF16, U8
