|
{ |
|
"chunks": [ |
|
{ |
|
"text": "Meta has once again pushed the boundaries of AI with the release of Llama 2, the highly anticipated successor to its groundbreaking Llama 1 language model. Boasting a range of cutting-edge features, Llama 2 has already disrupted the AI landscape and poses a real challenge to ChatGPT’s dominance. In this article, we will dive into the exciting world of Llama 2 and explore what makes it a true game-changer. I. Llama 2: Revolutionizing Commercial Use Unlike its predecessor Llama 1, which was limited to research use, Llama 2 represents a major advancement as an open-source commercial model. Businesses can now integrate Llama 2 into products to create AI-powered applications. Availability on Azure and AWS facilitates fine-tuning and adoption. However, restrictions apply to prevent exploitation. Companies with over 700 million active daily users cannot use Llama 2. Additionally, its output cannot be used to improve other language models.", |
|
"embedding": "" |
|
}, |
|
{ |
|
"text": "II. Llama 2 Model Flavors: Llama 2 is available in four different model sizes: 7 billion, 13 billion, 34 billion, and 70 billion parameters. While 7B, 13B, and 70B have already been released, the 34B model is still awaited. The pretrained variant, trained on a whopping 2 trillion tokens, boasts a context window of 4096 tokens, twice the size of its predecessor Llama 1. Meta also released a Llama 2 fine-tuned model for chat applications that was trained on over 1 million human annotations. Such extensive training comes at a cost, with the 70B model taking a staggering 1720320 GPU hours to train. The context window’s length determines the amount of content the model can process at once, making Llama 2 a powerful language model in terms of scale and efficiency.", |
|
"embedding": "" |
|
}, |
|
{ |
|
"text": "III. Safety Considerations: A Top Priority for Meta: Meta’s commitment to safety and alignment shines through in Llama 2’s design. The model demonstrates exceptionally low AI safety violation percentages, surpassing even ChatGPT in safety benchmarks. Source: Meta Llama 2 paper Finding the right balance between helpfulness and safety when optimizing a model poses significant challenges. While a highly helpful model may be capable of answering any question, including sensitive ones like “How do I build a bomb?”, it also raises concerns about potential misuse. Thus, striking the perfect equilibrium between providing useful information and ensuring safety is paramount.", |
|
"embedding": "" |
|
}, |
|
{ |
|
"text": "However, prioritizing safety to an extreme extent can lead to a model that struggles to effectively address a diverse range of questions. This limitation could hinder the model’s practical applicability and user experience. Thus, achieving an optimum balance that allows the model to be both helpful and safe is of utmost importance. To strike the right balance between helpfulness and safety, Meta employed two reward models — one for helpfulness and another for safety — to optimize the model’s responses. The 34B parameter model has reported higher safety violations than other variants, possibly contributing to the delay in its release.", |
|
"embedding": "" |
|
}, |
|
{ |
|
"text": "IV. Helpfulness Comparison: Llama 2 Outperforms Competitors: Llama 2 emerges as a strong contender in the open-source language model arena, outperforming its competitors in most categories. The 70B parameter model outperforms all other open-source models, while the 7B and 34B models outshine Falcon in all categories and MPT in all categories except coding. Source: Meta Llama 2 paper. Despite being smaller, Llam a2’s performance rivals that of Chat GPT 3.5, a significantly larger closed-source model. While GPT 4 and PalM-2-L, with their larger size, outperform Llama 2, this is expected due to their capacity for handling complex language tasks. Llama 2’s impressive ability to compete with larger models highlights its efficiency and potential in the market. Source: Meta Llama 2 paper. However, Llama 2 does face challenges in coding and math problems, where models like Chat GPT 4 excel, given their significantly larger size. Chat GPT 4 performed significantly better than Llama 2 for coding (HumanEval benchmark)and math problem tasks (GSM8k benchmark). Open-source AI technologies, like Llama 2, continue to advance, offering strong competition to closed-source models.", |
|
"embedding": "" |
|
}, |
|
{ |
|
"text": "V. Ghost Attention: Enhancing Conversational Continuity: One unique feature in Llama 2 is Ghost Attention, which ensures continuity in conversations. This means that even after multiple interactions, the model remembers its initial instructions, ensuring more coherent and consistent responses throughout the conversation. This feature significantly enhances the user experience and makes Llama 2 a more reliable language model for interactive applications. In the example below, on the left, it forgets to use an emoji after a few conversations. On the right, with Ghost Attention, even after having many conversations, it will remember the context and continue to use emojis in its response. Source: Meta Llama 2 paper.", |
|
"embedding": "" |
|
}, |
|
{ |
|
"text": "VI. Temporal Capability: A Leap in Information Organization: Meta reported a groundbreaking temporal capability, where the model organizes information based on time relevance. Each question posed to the model is associated with a date, and it responds accordingly by considering the event date before which the question becomes irrelevant. For example, if you ask the question, “How long ago did Barack Obama become president?”, its only relevant after 2008. This temporal awareness allows Llama 2 to deliver more contextually accurate responses, enriching the user experience further. Source: Meta Llama 2 paper.", |
|
"embedding": "" |
|
}, |
|
{ |
|
"text": "VII. Open Questions and Future Outlook: Meta’s open-sourcing of Llama 2 represents a seismic shift, now offering developers and researchers commercial access to a leading language model. With Llama 2 outperforming MosaicML’s current MPT models, all eyes are on how Databricks will respond. Can MosaicML’s next MPT iteration beat Llama 2? Is it worthwhile to compete with Llama 2 or join hands with the open-source community to make the open-source models better? Meanwhile, Microsoft’s move to host Llama 2 on Azure despite having significant investment in ChatGPT raises interesting questions. Will users prefer the capabilities and transparency of an open-source model like Llama 2 over closed, proprietary options? The stakes are high, as Meta’s bold democratization play stands to reshape preferences and partnerships in the AI space. One thing is certain — the era of open language model competition has begun.", |
|
"embedding": "" |
|
}, |
|
{ |
|
"text": "VIII. Conclusion: With the launch of Llama 2, Meta has achieved a landmark breakthrough in open-source language models, unleashing new potential through its commercial accessibility. Llama 2’s formidable capabilities in natural language processing, along with robust safety protocols and temporal reasoning, set new benchmarks for the field. While select limitations around math and coding exist presently, Llama 2’s strengths far outweigh its weaknesses. As Meta continues honing Llama technology, this latest innovation promises to be truly transformative. By open-sourcing such an advanced model, Meta is propelling democratization and proliferation of AI across industries. From healthcare to education and beyond, Llama 2 stands to shape the landscape by putting groundbreaking language modeling into the hands of all developers and researchers. The possibilities unlocked by this open-source approach signal a shift towards a more collaborative, creative AI future.", |
|
"embedding": "" |
|
}, |
|
{ |
|
"text": "About 2 weeks ago, the world of generative AI was shocked by the company Meta's release of the new Llama-2 AI model. Its predecessor, Llama-1, was a breaking point in the LLM industry, as with the release of its weights along with new finetuning techniques, there was a massive creation of open-source LLM models that led to the emergence of high-performance models such as Vicuna, Koala, … In this article, we will briefly discuss some of this model's relevant points but will focus on showing how we can quickly train the model for a specific task using libraries and tools standard in this world. We will not make an exhaustive analysis of the new model, there are already numerous articles published on the subject.", |
|
"embedding": "" |
|
}, |
|
{ |
|
"text": "New Llama-2 model: In mid-July, Meta released its new family of pre-trained and finetuned models called Llama-2, with an open source and commercial character to facilitate its use and expansion. The base model was released with a chat version and sizes 7B, 13B, and 70B. Together with the models, the corresponding papers were published describing their characteristics and relevant points of the learning process, which provide very interesting information on the subject. An updated version of Llama 1, trained on a new mix of publicly available data. The pretraining corpus size was increased by 40%, the model’s context length was doubled, and grouped-query attention was adopted. Variants with 7B, 13B, and 70B parameters are released, along with 34B variants reported in the paper but not released.[1]", |
|
"embedding": "" |
|
}, |
|
{ |
|
"text": "For pre-training, 40% more tokens were used, reaching 2T, the context length was doubled and the grouped-query attention (GQA) technique was applied to speed up inference on the heavier 70B model. On the standard transformer architecture, RMSNorm normalization, SwiGLU activation, and rotatory positional embedding are used, the context length reaches 4096 tokens, and an Adam optimizer is applied with a cosine learning rate schedule, a weight decay of 0.1 and gradient clipping. The Supervised Fine-Tuning (SFT) stage is characterized by a prioritization of quality examples over quantity, as numerous reports show that the use of high-quality data results in improved final model performance. Finally, a Reinforcement Learning with Human Feedback (RLHF) step is applied to align the model with user preferences. A multitude of examples are collected where annotators select their preferred model output over a binary comparison. This data is used to train a reward model, where the focus is on helpfulness and safety.", |
|
"embedding": "" |
|
}, |
|
{ |
|
"text": "In short: - Trained on 2T Tokens. - Commercial use allowed. - Chat models for dialogue use cases. - 4096 default context window (can be increased). - 7B, 13B & 70B parameter version. - 70B model adopted grouped-query attention (GQA). - Chat models can use tools & plugins. - LLaMA 2-CHAT as good as OpenAI ChatGPT.", |
|
"embedding": "" |
|
}, |
|
{ |
|
"text": "The dataset for tuning: For our tuning process, we will take a dataset containing about 18,000 examples where the model is asked to build a Python code that solves a given task. This is an extraction of the original dataset [2], where only the Python language examples are selected. Each row contains the description of the task to be solved, an example of data input to the task if applicable, and the generated code fragment that solves the task is provided [3]. # Load dataset from the hub\ndataset = load_dataset(dataset_name, split=dataset_split)\n# Show dataset size print(f'dataset size: {len(dataset)}')\n# Show an example\nprint(dataset[randrange(len(dataset))])", |
|
"embedding": "" |
|
}, |
|
{ |
|
"text": "Creating the prompt To carry out an instruction fine-tuning, we must transform each one of our data examples as if it were an instruction, outlining its main sections as follows: def format_instruction(sample):\n return f'''### Instruction:\n Use the Task below and the Input given to write the Response, which is a programming code that can solve the following Task: ### Task:\n{sample['instruction']}\n### Input: \n {sample['input']}\n### Response:\n{sample['output']}\n'''Output:### Instruction: Use the Task below and the Input given to write the Response, which is a programming code that can solve the following Task: ### Task: Develop a Python program that prints 'Hello, World!' whenever it is run. ### Input: ### Response: #Python program to print 'Hello World!' print('Hello, World!')", |
|
"embedding": "" |
|
}, |
|
{ |
|
"text": "Fine-tuning the model\nTo carry out this stage, we have used the Google Colab environment, where we have developed a notebook that allows us to run the training in an interactive way and also a Python script to run the training in unattended mode. For the first test runs, a T4 instance with a high RAM capacity is enough, but when it comes to running the whole dataset and epochs, we have opted to use an A100 instance in order to speed up the training and ensure that its execution time is reasonable.\nIn order to be able to share the model, we will log in to the Huggingface hub using the appropriate token, so that at the end of the whole process, we will upload the model files so that they can be shared with the rest of the users.\nfrom huggingface_hub import login\nfrom dotenv import load_dotenv\nimport os\n# Load the enviroment variables\nload_dotenv()\n# Login to the Hugging Face Hub\nlogin(token=os.getenv('HF_HUB_TOKEN'))", |
|
"embedding": "" |
|
}, |
|
{ |
|
"text": "Fine-tuning techniques: PEFT, Lora, and QLora\nIn recent months, some papers have appeared showing how PEFT techniques can be used to train large language models with a drastic reduction of RAM requirements and consequently allowing fine-tuning of these models on a single GPU of reasonable size.\nThe usual steps to train an LLM consist, first, an intensive pre-training on billions or trillions of tokens to obtain a foundation model, and then a fine-tuning is performed on this model to specialize it on a downstream task. In this fine-tuning phase is where the PEFT technique has its purpose.\nParameter Efficient Fine-Tuning (PEFT) allows us to considerably reduce RAM and storage requirements by only fine-tuning a small number of additional parameters, with virtually all model parameters remaining frozen. PEFT has been found to produce good generalization with relatively low-volume datasets. Furthermore, it enhances the reusability and portability of the model, as the small checkpoints obtained can be easily added to the base model, and the base model can be easily fine-tuned and reused in multiple scenarios by adding the PEFT parameters. Finally, since the base model is not adjusted, all the knowledge acquired in the pre-training phase is preserved, thus avoiding catastrophic forgetting.\n", |
|
"embedding": "" |
|
}, |
|
{ |
|
"text": "Most widely used PEFT techniques aim to keep the pre-trained base model untouched and add new layers or parameters on top of it. These layers are called “Adapters” and the technique of their adjustment “adapter-tuning”, we add these layers to the pre-trained base model and only train the parameters of these new layers. However, a serious problem with this approach is that these layers lead to increased latency in the inference phase, which makes the process inefficient in many scenarios.\nIn the LoRa technique, a Low-Rank Adaptation of Large Language Models, the idea is not to include new layers but to add values to the parameters in a way that avoids this scary problem of latency in the inference phase. LoRa trains and stores the changes of the additional weights while freezing all the weights of the pre-trained model. Therefore, we train a new weights matrix with the changes in the pre-trained model matrix, and this new matrix is decomposed into 2 Low-rank matrices as explained here:", |
|
"embedding": "" |
|
}, |
|
{ |
|
"text": "Let all the parameters of a LLM in the matrix W0 and the additional weight changes in the matrix ∆W, the final weights become W0 + ∆W. The authors of LoRA [1] proposed that the change in weight change matrix ∆W can be decomposed into two low-rank matrices A and B. LoRA does not train the parameters in ∆W directly, but the parameters in A and B. So the number of trainable parameters is much less. Hypothetically suppose the dimension of A is 100 * 1 and that of B is 1 * 100, the number of parameters in ∆W will be 100 * 100 = 10000. There are only 100 + 100 = 200 to train in A and B, instead of 10000 to train in ∆W\n[4]. Explanation by Dr. Dataman in Fine-tuning a GPT — LoRA\nThe size of these low-rank matrices is defined by the r parameter. The smaller this value is, the fewer parameters to train, therefore, less effort and faster, but on the other hand, a potential loss of information and performance.\nIf you want a more detailed explanation, you can refer to the original paper, or there are plenty of articles that explain it in detail, such as [4].\nFinally, QLoRa [6] consists of applying quantization to the LoRa method allowing 4-bit normal quantization, nf4, a type optimized for normally distributed weights; double quantization to reduce the memory footprint and the optimization of the NVIDIA unified memory. These are techniques to optimize memory usage to achieve “lighter” and less expensive training.", |
|
"embedding": "" |
|
}, |
|
{ |
|
"text": "Implementing QLoRa in our experiment requires specifying the BitsAndBytes configuration, downloading the pretrained model in 4-bit quantization, and defining a LoraConfig. Finally, we need to retrieve the tokenizer.\n# Get the type\ncompute_dtype = getattr(torch, bnb_4bit_compute_dtype)\n# BitsAndBytesConfig int-4 config\nbnb_config = BitsAndBytesConfig(\n load_in_4bit=use_4bit,\n bnb_4bit_use_double_quant=use_double_nested_quant,\n bnb_4bit_quant_type=bnb_4bit_quant_type,\n bnb_4bit_compute_dtype=compute_dtype\n)\n# Load model and tokenizer\nmodel = AutoModelForCausalLM.from_pretrained(model_id, \n quantization_config=bnb_config, use_cache = False, device_map=device_map)\nmodel.config.pretraining_tp = 1\n# Load the tokenizer\ntokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)\ntokenizer.pad_token = tokenizer.eos_token\ntokenizer.padding_side = 'right'\nParameters defined,\n# Activate 4-bit precision base model loading\nuse_4bit = True\n# Compute dtype for 4-bit base models\nbnb_4bit_compute_dtype = 'float16'\n# Quantization type (fp4 or nf4)\nbnb_4bit_quant_type = 'nf4'\n# Activate nested quantization for 4-bit base models (double quantization)\nuse_double_nested_quant = False\n# LoRA attention dimension\nlora_r = 64\n# Alpha parameter for LoRA scaling\nlora_alpha = 16\n# Dropout probability for LoRA layers\nlora_dropout = 0.1\nAnd the next steps are well-known for all Hugging Face users, setting up the training arguments, and creating a Trainer. As we are executing an instruction fine-tuning we call to the SFTTrainer method that encapsulates the PEFT model definition and other steps.\n# Define the training arguments\nargs = TrainingArguments(\n output_dir=output_dir,\n num_train_epochs=num_train_epochs,\n per_device_train_batch_size=per_device_train_batch_size, # 6 if use_flash_attention else 4,\n gradient_accumulation_steps=gradient_accumulation_steps,\n gradient_checkpointing=gradient_checkpointing,\n optim=optim,\n logging_steps=logging_steps,\n save_strategy='epoch',\n learning_rate=learning_rate,\n weight_decay=weight_decay,\n fp16=fp16,\n bf16=bf16,\n max_grad_norm=max_grad_norm,\n warmup_ratio=warmup_ratio,\n group_by_length=group_by_length,\n lr_scheduler_type=lr_scheduler_type,\n disable_tqdm=disable_tqdm,\n report_to='tensorboard',\n seed=42\n)\n# Create the trainer\ntrainer = SFTTrainer(\n model=model,\n train_dataset=dataset,\n peft_config=peft_config,\n max_seq_length=max_seq_length,\n tokenizer=tokenizer,\n packing=packing,\n formatting_func=format_instruction,\n args=args,\n)\n# train the model\ntrainer.train() # there will not be a progress bar since tqdm is disabled\n", |
|
"embedding": "" |
|
}, |
|
{ |
|
"text": "# save model in local\ntrainer.save_model()\nThe parameters can be found on my GitHub repository, most of them are commonly used in other fine-tuning scripts on LLMs and are the following ones:\n# Number of training epochs\nnum_train_epochs = 1\n# Enable fp16/bf16 training (set bf16 to True with an A100)\nfp16 = False\nbf16 = True\n# Batch size per GPU for training\nper_device_train_batch_size = 4\n# Number of update steps to accumulate the gradients for\ngradient_accumulation_steps = 1\n# Enable gradient checkpointing\ngradient_checkpointing = True\n# Maximum gradient normal (gradient clipping)\nmax_grad_norm = 0.3\n# Initial learning rate (AdamW optimizer)\nlearning_rate = 2e-4\n# Weight decay to apply to all layers except bias/LayerNorm weights\nweight_decay = 0.001\n# Optimizer to use\noptim = 'paged_adamw_32bit'\n# Learning rate schedule\nlr_scheduler_type = 'cosine' #'constant'\n# Ratio of steps for a linear warmup (from 0 to learning rate)\nwarmup_ratio = 0.03\n# Group sequences into batches with same length\n# Saves memory and speeds up training considerably\ngroup_by_length = False\n# Save checkpoint every X updates steps\nsave_steps = 0\n# Log every X updates steps\nlogging_steps = 25\n# Disable tqdm\ndisable_tqdm= True\nMerge the base model and the adapter weights\nAs we mention, we have trained “modification weights” on the base model, our final model requires merging the pretrained model and the adapters in a single model.\nfrom peft import AutoPeftModelForCausalLM\nmodel = AutoPeftModelForCausalLM.from_pretrained(\n args.output_dir,\n low_cpu_mem_usage=True,\n return_dict=True,\n torch_dtype=torch.float16,\n device_map=device_map, \n)\n# Merge LoRA and base model\nmerged_model = model.merge_and_unload()\n# Save the merged model\nmerged_model.save_pretrained('merged_model',safe_serialization=True)\ntokenizer.save_pretrained('merged_model')\n# push merged model to the hub\nmerged_model.push_to_hub(hf_model_repo)\ntokenizer.push_to_hub(hf_model_repo)\nYou can find and download the model in my Hugging Face account edumunozsala/llama-2–7b-int4-python-code-20k. Give it a try!\n", |
|
"embedding": "" |
|
}, |
|
{ |
|
"text": "Inferencing or generating Python code\nAnd finally, we will show you how you can download the model from the Hugging Face Hub and call the model to generate an accurate result:\nimport torch\nfrom transformers import AutoModelForCausalLM, AutoTokenizer\n# Get the tokenizer\ntokenizer = AutoTokenizer.from_pretrained(hf_model_repo)\n# Load the model\nmodel = AutoModelForCausalLM.from_pretrained(hf_model_repo, load_in_4bit=True, \n torch_dtype=torch.float16,\n device_map=device_map)\n# Create an instruction\ninstruction='Optimize a code snippet written in Python. The code snippet should create a list of numbers from 0 to 10 that are divisible by 2.'\ninput=''\nprompt = f'''### Instruction:\nUse the Task below and the Input given to write the Response, which is a programming code that can solve the Task.\n### Task:\n{instruction}\n### Input:\n{input}\n### Response:\n'''\n# Tokenize the input\ninput_ids = tokenizer(prompt, return_tensors='pt', truncation=True).input_ids.cuda()\n# Run the model to infere an output\noutputs = model.generate(input_ids=input_ids, max_new_tokens=100, do_sample=True, top_p=0.9,temperature=0.5)\n# Print the result\nprint(f'Prompt:\n{prompt}\n')\nprint(f'Generated instruction:\n{tokenizer.batch_decode(outputs.detach().cpu().numpy(), skip_special_tokens=True)[0][len(prompt):]}')\nPrompt:\n### Instruction:\nUse the Task below and the Input given to write the Response, which is a programming code that can solve the Task.\n### Task:\nOptimize a code snippet written in Python. The code snippet should create a list of numbers from 0 to 10 that are divisible by 2.\n### Input:\narr = []\nfor i in range(10):\n if i % 2 == 0:\n arr.append(i)\n### Response:\nGenerated instruction:\narr = [i for i in range(10) if i % 2 == 0]\nGround truth:\narr = [i for i in range(11) if i % 2 == 0]\nThanks to Maxime Labonne for an excellent article [9] and Philipp Schmid who provides an inspiring code [8]. Their articles are a must-read for everyone interested in Llama 2 and model fine-tuning.\nAnd it is all I have to mention, I hope you find useful this article and claps are welcome!! You can Follow me and Subscribe to my articles, or even connect to me via Linkedin. The code is available in my Github Repository.", |
|
"embedding": "" |
|
} |
|
] |
|
} |