
finetune_starcoder2_with_Ruby_Data

This model is a fine-tuned version of bigcode/starcoder2-3b on the bigcode/the-stack-smol dataset.

Model description

This fine-tuned model builds upon the bigcode/starcoder2-3b base model, further specializing it for code completion using the Ruby subset of the bigcode/the-stack-smol dataset. The dataset consists of Ruby code snippets and solutions, allowing the model to suggest relevant completions and generate code from your prompts.
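
As a usage illustration, the sketch below loads the base model, applies this fine-tune as a PEFT adapter, and completes a Ruby prompt. The adapter repository id shown is a hypothetical placeholder, not the actual repository name.

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "bigcode/starcoder2-3b"
adapter_id = "your-username/finetune_starcoder2_with_Ruby_Data"  # hypothetical repo id

tokenizer = AutoTokenizer.from_pretrained(base_id)
base_model = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base_model, adapter_id)

prompt = "# Ruby: return the n-th Fibonacci number\ndef fib(n)"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```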

Intended uses & limitations

Ruby Code Generator is a versatile tool crafted to streamline the interaction between developers and Ruby codebases. Here are some of its primary applications:

  • Novice programmers engaging with Ruby codebases: Individuals new to programming or those unfamiliar with Ruby syntax can harness this tool to express commands or queries in natural language and receive corresponding Ruby code snippets. This facilitates access to and manipulation of data within Ruby environments without necessitating extensive programming knowledge.

  • Code exploration and analysis: Developers, analysts, or researchers can utilize the Ruby Code Generator to swiftly construct code snippets for exploratory analysis or debugging purposes. By automating the generation of basic Ruby code segments, users can dedicate more time to refining their inquiries and comprehending the outcomes.

  • Automation of repetitive coding tasks: Tasks requiring the recurrent execution of similar Ruby code segments with variable parameters can benefit from the automation capabilities of the Ruby Code Generator. This functionality enhances productivity and diminishes the likelihood of errors stemming from manual code generation.

  • Learning Ruby programming: Beginners can employ the Ruby Code Generator to experiment with natural language prompts and observe the corresponding Ruby code outputs. This serves as an invaluable educational tool for grasping the fundamentals of Ruby syntax and its application, facilitating an intuitive understanding of programming concepts.

Training procedure

1. Load Dataset and Model:

  • Load the bigcode/the-stack-smol dataset using the Hugging Face Datasets library.
  • Filter for the specified subset (data/ruby) and split (train).
  • Load the bigcode/starcoder2-3b model from the Hugging Face Hub with 4-bit quantization (see the sketch after this list).
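
A minimal sketch of this step, assuming the Hugging Face Datasets and Transformers APIs with bitsandbytes 4-bit quantization; the specific quantization settings (nf4 quant type, bfloat16 compute dtype) are assumptions, not taken from the card:

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Ruby subset of the-stack-smol, train split
dataset = load_dataset("bigcode/the-stack-smol", data_dir="data/ruby", split="train")

# 4-bit quantization config (nf4 and bfloat16 compute dtype are assumed defaults)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoder2-3b")
model = AutoModelForCausalLM.from_pretrained(
    "bigcode/starcoder2-3b",
    quantization_config=bnb_config,
    device_map="auto",
)
```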

2. Data Preprocessing:

  • Tokenize the code text using the appropriate tokenizer for the chosen model.
  • Apply necessary cleaning or normalization (e.g., removing comments, handling indentation).
  • Create input examples suitable for the model's architecture (e.g., packing or truncating sequences for a causal language modeling objective), as sketched after this list.
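
An illustrative preprocessing sketch, assuming the dataset's code text lives in a "content" column and a 1024-token maximum length (both assumptions):

```python
block_size = 1024  # assumed maximum sequence length

# The tokenizer may not define a pad token; reuse EOS if needed
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

def tokenize_fn(batch):
    # Tokenize raw Ruby source; truncate long files to block_size tokens
    return tokenizer(batch["content"], truncation=True, max_length=block_size)

tokenized = dataset.map(tokenize_fn, batched=True, remove_columns=dataset.column_names)
```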

3. Configure Training:

  • Initialize a Trainer object (e.g., the Hugging Face Transformers Trainer); a configuration sketch follows this list.

  • Set training arguments based on the provided args:

    • Learning rate, optimizer, scheduler
    • Gradient accumulation steps
    • Weight decay
    • Loss function (likely cross-entropy)
    • Evaluation metrics (e.g., accuracy, perplexity)
    • Device placement (GPU/TPU)
    • Number of processes for potential distributed training
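
A hedged configuration sketch that mirrors the hyperparameters listed under "Training hyperparameters" below. The card lists PEFT among the framework versions, so a LoRA adapter is assumed; the LoRA rank, alpha, target modules, output_dir name, and logging cadence are illustrative assumptions.

```python
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import DataCollatorForLanguageModeling, Trainer, TrainingArguments

# Prepare the 4-bit model for parameter-efficient fine-tuning (assumed LoRA setup)
model = prepare_model_for_kbit_training(model)
peft_config = LoraConfig(
    r=16,            # illustrative rank
    lora_alpha=32,   # illustrative scaling
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, peft_config)

training_args = TrainingArguments(
    output_dir="finetune_starcoder2_with_Ruby_Data",  # assumed output_dir
    per_device_train_batch_size=2,
    per_device_eval_batch_size=16,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_steps=100,
    max_steps=1000,
    seed=0,
    bf16=True,          # Native AMP mixed precision (fp16 could be used instead)
    logging_steps=25,   # assumed logging cadence
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
```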

4. Train the Model:

  • Start the training loop for the specified max_steps (see the call sketched after this list).
  • Iterate through batches of preprocessed code examples.
  • Forward pass through the model to generate predictions.
  • Calculate loss based on ground truth and predictions.
  • Backpropagate gradients to update model parameters.
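
With the Trainer configured, the loop described above reduces to a single call; batching, forward/backward passes, gradient accumulation, and optimizer steps are handled internally:

```python
trainer.train()
```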

5. Save the Fine-tuned Model:

  • Save the model's weights and configuration to the output_dir (sketched below).
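
A short sketch of this final step, saving the adapter weights and tokenizer to the configured output_dir:

```python
trainer.save_model(training_args.output_dir)        # writes the adapter weights and config
tokenizer.save_pretrained(training_args.output_dir)
```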

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 0.0002
  • train_batch_size: 2
  • eval_batch_size: 16
  • seed: 0
  • gradient_accumulation_steps: 4
  • total_train_batch_size: 8
  • optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_steps: 100
  • training_steps: 1000
  • mixed_precision_training: Native AMP

Training results

Framework versions

  • PEFT 0.8.2
  • Transformers 4.40.0.dev0
  • Pytorch 2.1.2
  • Datasets 2.16.1
  • Tokenizers 0.15.2