GitHub Tag Generator with T5 + PEFT (LoRA)
Authored by: Zamal Babar
In this notebook, we walk through a complete end-to-end implementation of a lightweight, fast, and open-source GitHub tag generator using T5-small fine-tuned on a custom dataset with PEFT (LoRA). This tool can automatically generate relevant tags from a GitHub repository description or summary, which is useful for improving discoverability and organizing repos more intelligently.
Use Case
Imagine you're building a tool that helps users explore GitHub repositories more effectively. Instead of relying on manually written or sometimes missing tags, we train a model that automatically generates descriptive tags for any GitHub project. This could help:
- Improve search functionality
- Automatically tag new repos
- Build better filters for discovery
Dataset
We use a dataset of GitHub project descriptions and their associated tags. Each training example contains:
"input"
: A natural language description of a GitHub repository"target"
: A comma-separated list of relevant tags
The dataset was initially loaded from a local .jsonl file, but it is now also available on the Hugging Face Hub: zamal/github-meta-data
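To make the format concrete, here is an illustrative record shaped like the dataset entries (the values below are made up for illustration, not taken from the actual dataset):
# Illustrative example record (hypothetical values, not an actual row from the dataset)
example = {
    "input": "a lightweight library for augmenting images without writing any code",
    "target": "image-augmentation, no-code, computer-vision, python",
}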
Model Architecture
We fine-tuned the T5-small model for this task, a lightweight encoder-decoder transformer that's well suited to text-to-text generation tasks.
To make fine-tuning faster and more efficient, we used the Hugging Face peft library with LoRA (Low-Rank Adaptation) to update only a small subset of model parameters.
What This Notebook Covers
This notebook includes:
- Loading and preprocessing a custom dataset
- Setting up a T5-small model with LoRA
- Training the model using the Hugging Face Trainer
- Monitoring progress with Weights & Biases
- Saving and pushing the model to the Hugging Face Hub
- Performing inference and postprocessing for clean, deduplicated tags
Final Outcome
By the end of this notebook, you'll have:
- A fully trained and hosted GitHub tag generator
- A deployable and shareable model on the Hugging Face Hub
- An inference function to use your model anywhere with just a few lines of code
Let's dive in!
We begin by:
- Importing essential libraries for model training (transformers, datasets, peft)
- Loading the T5 tokenizer
- Setting the Hugging Face token (stored securely in Colab's userdata)
Make sure you've stored your HUGGINGFACE_TOKEN in your Colab's secrets before running this cell.
from google.colab import userdata
import os
os.environ["HUGGINGFACE_TOKEN"] = userdata.get("HUGGINGFACE_TOKEN")
from transformers import T5Tokenizer, T5ForConditionalGeneration, Trainer, TrainingArguments
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, PeftConfig
tokenizer = T5Tokenizer.from_pretrained("t5-small")
Load and Prepare the Dataset
We now load our training data from a local JSONL file that contains repository descriptions and their corresponding tags.
Each line in the file is a JSON object with two fields:
- input: a short repository description
- target: the tags (comma-separated)
We split this dataset into training and validation sets using a 90/10 ratio.
Note: When this notebook was initially run, the dataset was loaded locally from a file. However, the same dataset is now also available on the Hugging Face Hub here: zamal/github-meta-data. Feel free to load it directly using load_dataset("zamal/github-meta-data") in your workflow, as shown below.
from datasets import load_dataset, DatasetDict
# Load existing dataset with only a "train" split
dataset = load_dataset("zamal/github-meta-data") # returns DatasetDict
# Split the train set into train and validation
split = dataset["train"].train_test_split(test_size=0.1, seed=42)
# Wrap into a new DatasetDict
dataset_dict = DatasetDict({"train": split["train"], "validation": split["test"]})
>>> print(len(dataset_dict["train"]))
>>> print(len(dataset_dict["validation"]))
552
62
Load the Tokenizer
We load the tokenizer associated with the t5-small
model. T5 expects input and output text to be tokenized in a specific way, and this tokenizer ensures compatibility during training and inference.
from transformers import AutoTokenizer
model_name = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
Preprocessing the Dataset
Next, we define a preprocessing function to tokenize both the inputs and the targets using the T5 tokenizer.
- The inputs are padded and truncated to a maximum length of 128 tokens.
- The target labels (i.e., tags) are also tokenized with a shorter maximum length of 64 tokens.
We then map this preprocessing function across our training and validation datasets and format the output for PyTorch compatibility. This prepares the dataset for training.
def preprocess(batch):
inputs = batch["input"]
targets = batch["target"]
model_inputs = tokenizer(inputs, max_length=128, truncation=True, padding="max_length")
labels = tokenizer(targets, max_length=64, truncation=True, padding="max_length").input_ids
model_inputs["labels"] = labels
return model_inputs
tokenized = dataset_dict.map(preprocess, batched=True, remove_columns=dataset_dict["train"].column_names)
tokenized.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])
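As an optional sanity check (not part of the original notebook), you can confirm that one tokenized example has the expected tensor shapes before training:
# Optional sanity check: each example should be padded to 128 input tokens and 64 label tokens.
sample = tokenized["train"][0]
print(sample["input_ids"].shape)  # expected: torch.Size([128])
print(sample["labels"].shape)     # expected: torch.Size([64])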
Loading the Base T5 Model
We load the base T5 model (t5-small
) for conditional generation. This model serves as the backbone for our tag generation task, where the goal is to generate relevant tags given a description of a GitHub repository.
model = T5ForConditionalGeneration.from_pretrained(model_name)
Preparing the LoRA Configuration
We configure LoRA (Low-Rank Adaptation) to fine-tune the T5 model efficiently. LoRA injects trainable low-rank matrices into attention layers, significantly reducing the number of trainable parameters while maintaining performance.
In this setup:
- r=16 defines the rank of the update matrices.
- lora_alpha=32 scales the updates.
- We apply LoRA to the "q" and "v" attention projection modules.
- The task type is set to "SEQ_2_SEQ_LM" since we're working on a sequence-to-sequence task.
lora_config = LoraConfig(
r=16,
lora_alpha=32,
target_modules=["q", "v"], # Adjust based on model architecture
lora_dropout=0.05,
bias="none",
task_type="SEQ_2_SEQ_LM",
)
Injecting LoRA into the Base T5 Model
Now that we've defined our LoRA configuration, we apply it to the base T5 model using get_peft_model(). This wraps the original model with the LoRA adapters, allowing us to fine-tune only a small number of parameters instead of the entire model, making training faster and more memory-efficient.
model = get_peft_model(model, lora_config)
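As a quick check, PEFT-wrapped models expose print_trainable_parameters(), which reports how small the trainable portion is compared to the full model:
# Show how many parameters LoRA leaves trainable versus the total parameter count.
model.print_trainable_parameters()
# prints something like: trainable params: ... || all params: ... || trainable%: ...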
TrainingArguments Configuration
We use the TrainingArguments
class to define the hyperparameters and training behavior for our model. Here's a breakdown of each parameter:
- output_dir="./t5_tag_generator": directory to save model checkpoints and training logs.
- per_device_train_batch_size=8: number of training samples per GPU/TPU core (or CPU) in each training step.
- per_device_eval_batch_size=8: number of evaluation samples per GPU/TPU core (or CPU) in each evaluation step.
- learning_rate=1e-4: initial learning rate, a good starting point for T5 models with LoRA.
- num_train_epochs=25: total number of training epochs. This is relatively high to ensure convergence for our use case.
- logging_steps=10: how often (in steps) to log training metrics to the console and W&B.
- eval_strategy="steps": run evaluation every eval_steps instead of after every epoch.
- eval_steps=50: evaluate the model every 50 steps to monitor progress during training.
- save_steps=50: save model checkpoints every 50 steps for redundancy and safe restoration.
- save_total_limit=2: keep only the 2 most recent model checkpoints to save disk space.
- fp16=True: enable mixed-precision training (faster and more memory-efficient on supported GPUs).
- push_to_hub=True: automatically push the trained model to the Hugging Face Hub.
- hub_model_id="zamal/github-tag-generatorr": the model repo name on Hugging Face under your username. This is where checkpoints and final model weights will be pushed.
- hub_token=os.environ["HUGGINGFACE_TOKEN"]: token to authenticate your Hugging Face account, retrieved securely from the environment.
This setup ensures a balance between training efficiency, frequent monitoring, and safe saving of model progress.
training_args = TrainingArguments(
output_dir="./t5_tag_generator",
per_device_train_batch_size=8,
per_device_eval_batch_size=8,
learning_rate=1e-4,
num_train_epochs=25,
logging_steps=10,
eval_strategy="steps",
eval_steps=50,
save_steps=50,
save_total_limit=2,
fp16=True,
push_to_hub=True,
hub_model_id="zamal/github-tag-generatorr", # Replace with your Hugging Face username
hub_token=os.environ["HUGGINGFACE_TOKEN"],
)
Initialize the Trainer
We now configure the Trainer
, which abstracts away the training loop, evaluation steps, logging, and saving. It handles all of it for us using the parameters we've defined earlier.
We also pass in the DataCollatorForSeq2Seq
, which ensures proper padding and batching during training and evaluation for sequence-to-sequence tasks like ours.
Warnings Explained:
- FutureWarning: 'tokenizer' is deprecated...: the tokenizer argument in Trainer is deprecated and will be removed in Transformers v5.0.0. Hugging Face recommends using processing_class instead, which refers to a processor that combines tokenization and potentially feature extraction. For now, it's safe to ignore this, but it's good practice to track deprecations for future compatibility (a sketch of the alternative appears right after the Trainer initialization below).
- No label_names provided for model class 'PeftModelForSeq2SeqLM': this warning appears because we're using a PEFT (Parameter-Efficient Fine-Tuning) wrapped model (PeftModelForSeq2SeqLM), and the Trainer cannot automatically determine the label field names in this case. Since we're already formatting our dataset correctly (by explicitly setting labels during preprocessing), this warning can be safely ignored as well; training will still proceed correctly.
Now, we can initialize our Trainer:
from transformers import Trainer
from transformers import DataCollatorForSeq2Seq
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=tokenized["train"],
eval_dataset=tokenized["validation"],
tokenizer=tokenizer,
data_collator=data_collator,
)
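If your installed transformers version already accepts it, you can silence the tokenizer deprecation warning mentioned above by passing the tokenizer through processing_class instead. This is a hedged, equivalent alternative to the cell above (it reuses the same objects), not a required change:
# Alternative Trainer initialization for newer transformers releases that accept processing_class.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    processing_class=tokenizer,  # replaces the deprecated tokenizer=... argument
    data_collator=data_collator,
)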
Start Training the Tag Generator Model
With everything set up (the model, tokenizer, dataset, LoRA configuration, training arguments, and the Trainer), we can now kick off the fine-tuning process by calling trainer.train().
This will:
- Fine-tune our T5 model using the parameter-efficient LoRA strategy.
- Save checkpoints at regular intervals (save_steps=50).
- Evaluate on the validation set every 50 steps (eval_steps=50).
- Log metrics like loss to Weights & Biases or the Hugging Face Hub if integrated.
Training will take some time depending on the size of your dataset and GPU, but you'll start to see metrics printed out step-by-step, such as:
- Training Loss: how well the model is fitting the training data.
- Validation Loss: how well the model performs on unseen data.
Let's begin the fine-tuning!
>>> trainer.train()
Training Summary and Observations
The training process successfully completed over 25 epochs, using a LoRA-fine-tuned T5-small
model to generate tags for GitHub repository descriptions. Here's a quick breakdown of what happened and how to interpret it:
Logging with Weights & Biases (W&B)
We logged all training metrics and artifacts using Weights & Biases, which offers a convenient UI to monitor model performance in real time. You can view the run at: W&B Project Run
Training & Validation Loss
From the logs:
- Training loss began at ~8.9 and steadily declined to ~1.06.
- Validation loss also dropped consistently from 7.9 to 0.95, indicating good generalization and minimal overfitting.
The slight fluctuations (e.g., at steps 850, 1000, 1100) are normal and reflect natural variance in optimization, especially with small batch sizes.
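If you prefer to inspect these curves programmatically rather than in the W&B UI, the Trainer keeps every logged entry in trainer.state.log_history; a minimal sketch:
# Pull the logged training and validation losses straight from the Trainer state.
train_loss = [(log["step"], log["loss"]) for log in trainer.state.log_history if "loss" in log]
eval_loss = [(log["step"], log["eval_loss"]) for log in trainer.state.log_history if "eval_loss" in log]
print(train_loss[-3:])  # last few (step, training loss) pairs
print(eval_loss[-3:])   # last few (step, validation loss) pairs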
Warnings and Notices
- The warning about past_key_values being deprecated is safe to ignore for now and is expected behavior with the current transformers version.
- The UserWarning about tensor creation can be optimized later, but doesn't affect the result.
- The run_name warning suggests you can optionally decouple logging folder names from output directories.
Performance Metrics
The model completed:
- 1725 training steps
- ~1.9 samples/sec processing speed
- Total training time: ~2 hours
This is solid performance given the setup and confirms that your LoRA fine-tuning pipeline is both stable and efficient.
Next, we'll save and push this trained model to the Hugging Face Hub so you (or others!) can load and test it anytime.
Inference: Generate Tags from Repository Descriptions
Now that the model is trained, we define a simple helper function generate_tags
to run inference. It takes a natural language query describing a repository and generates relevant tags using our fine-tuned T5 model.
Below is an example for a query related to image augmentation and no-code tools.
import torch
def generate_tags(query, model, tokenizer, max_length=64, num_beams=5):
model.eval()
inputs = tokenizer(query, return_tensors="pt", truncation=True, padding="max_length", max_length=128).to(
model.device
)
with torch.no_grad():
output = model.generate(
input_ids=inputs["input_ids"],
attention_mask=inputs["attention_mask"],
max_length=max_length,
num_beams=num_beams,
early_stopping=True,
decoder_start_token_id=tokenizer.pad_token_id,  # required for T5
)
return tokenizer.decode(output[0], skip_special_tokens=True)
generate_tags("looking for repositories on image augmentation no code implementations", model, tokenizer)
Save Fine-Tuned Model Locally
Once the training is complete, we save the fine-tuned model and tokenizer to a local directory. This allows us to reuse or share the model later without needing to retrain it.
>>> # Save model, tokenizer, and config to local output directory
>>> model_path = "./t5_tag_generator/final"
>>> model.save_pretrained(model_path)
>>> tokenizer.save_pretrained(model_path)
>>> print("Model and tokenizer saved locally at:", model_path)
Model and tokenizer saved locally at: ./t5_tag_generator/final
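Note that save_pretrained on a PEFT-wrapped model stores the LoRA adapter weights (plus the tokenizer files we saved alongside them). To reuse the model later, you attach the adapter back onto the base model; a minimal sketch assuming the directory above:
from transformers import T5ForConditionalGeneration, AutoTokenizer
from peft import PeftModel

# Reload later: attach the saved LoRA adapter onto a fresh t5-small base model.
base_model = T5ForConditionalGeneration.from_pretrained("t5-small")
reloaded_model = PeftModel.from_pretrained(base_model, "./t5_tag_generator/final")
reloaded_tokenizer = AutoTokenizer.from_pretrained("./t5_tag_generator/final")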
Push Model to Hugging Face Hub
After saving the model locally, we now push it to the Hugging Face Hub so that others can easily access, test, and load it using from_pretrained
.
The model is publicly available at: huggingface.co/zamal/github-tag-generatorr
>>> # Push to Hugging Face Hub under your repo
>>> from huggingface_hub import HfApi
>>> api = HfApi()
>>> api.upload_folder(
... folder_path=model_path,
... repo_id="zamal/github-tag-generatorr", # Your model ID
... repo_type="model",
... path_in_repo="", # Root of the repo
... )
>>> print("Model pushed to Hugging Face Hub: https://huggingface.co/zamal/github-tag-generatorr")
Model pushed to Hugging Face Hub: https://huggingface.co/zamal/github-tag-generatorr
Load Model Directly from Hugging Face Hub
Now that we've pushed our fine-tuned model to the Hugging Face Hub, we can easily load it from anywhere using the pipeline
utility. This allows us to instantly test or integrate the model into other applications without needing local files.
The model is hosted at: zamal/github-tag-generatorr
from transformers import pipeline
# Load the model and tokenizer from Hugging Face Hub
tag_generator = pipeline(
"text2text-generation", model="zamal/github-tag-generatorr", tokenizer="zamal/github-tag-generatorr"
)
Inference Function with Post-Processing
This function wraps the model inference process to generate tags for a given GitHub project description. The input is tokenized the same way as during training (padded and truncated to 128 tokens) before calling model.generate().
After decoding the generated output, we deduplicate the tags using a simple dict.fromkeys()
trick. This ensures that tags like "pytorch, pytorch, pytorch"
only appear once.
We added this logic because the training data included some noisy samples with repeated or inconsistent tags. Since we did not perform extensive data cleaning or multiple training runs to refine the quality, this lightweight fix helps improve the final output. In a production-grade system, we'd recommend:
- more rigorous data preprocessing,
- filtering weak labels,
- and performing iterative fine-tuning with evaluation and human-in-the-loop review.
def generate_tags(text, model, tokenizer, max_length=64, num_beams=5):
input_text = text
inputs = tokenizer(input_text, return_tensors="pt", padding="max_length", truncation=True, max_length=128).to(
model.device
)
with torch.no_grad():
output = model.generate(
input_ids=inputs["input_ids"],
attention_mask=inputs["attention_mask"],
max_length=max_length,
num_beams=num_beams,
early_stopping=True,
decoder_start_token_id=tokenizer.pad_token_id,
)
decoded = tokenizer.decode(output[0], skip_special_tokens=True)
# Deduplicate and clean tags
tags = [t.strip().lower() for t in decoded.split(",")]
unique_tags = list(dict.fromkeys(tags)) # preserve order + remove duplicates
return ", ".join(unique_tags)
Real-World Examples: Testing on Sample Inputs
Now that we've defined our inference function and loaded the model, let's run it on a few example descriptions.
Each input represents a short summary of a hypothetical GitHub repository. Our goal is to generate meaningful and concise tags using the fine-tuned T5 model.
These test cases demonstrate how well the model generalizes to realistic prompts, and thanks to our post-processing, any repetitive or noisy tags are cleaned up before display.
>>> inputs = [
... "Need an AI tool to convert customer voice calls into structured CRM record",
... "How to train a text summarization model using Pegasus or BART",
... "Fine-tuning BERT for spam detection in emails",
... ]
>>> for text in inputs:
... print(f"Input: {text}")
... print(f"Tags: {generate_tags(text, model, tokenizer)}\n")
Input: Need an AI tool to convert customer voice calls into structured CRM record
Tags: voice-calls, crm-recording, voice-recording

Input: How to train a text summarization model using Pegasus or BART
Tags: text summarization, pegasus, bart, et al.

Input: Fine-tuning BERT for spam detection in emails
Tags: bert, spam-detecting, email-tuning
Inference Examples Using Real Hugging Face Projects
To follow the same format as our training data, we rephrase descriptive statements into query-style inputs, just as users would naturally search for repositories. This aligns with our fine-tuning data, which was based on natural language search queries mapped to relevant tags.
Below are some meta and practical examples, including:
- Hugging Face's own popular repositories (e.g., Transformers, Datasets, Diffusers)
- Styled as realistic queries for better inference consistency
>>> from transformers import pipeline
>>> import torch
>>> # Load the model and tokenizer from the Hugging Face Hub
>>> tag_generator = pipeline(
... "text2text-generation", model="zamal/github-tag-generatorr", tokenizer="zamal/github-tag-generatorr"
... )
>>> def clean_and_deduplicate_tags(decoded):
... tags = [tag.strip().lower() for tag in decoded.split(",")]
... # Remove non-informative or overly generic tokens
... ignore_list = {"a", "an", "the", "and", "or", "of", "to", "on", "in", "for", "with", "etc", "from"}
... filtered = [tag for tag in tags if tag not in ignore_list and len(tag) > 1]
... # Deduplicate while preserving order
... return ", ".join(dict.fromkeys(filtered))
>>> def generate_tags_with_pipeline(text):
... output = tag_generator(text, max_length=64, num_beams=5, early_stopping=True)
... decoded = output[0]["generated_text"]
... return clean_and_deduplicate_tags(decoded)
>>> # Realistic repo descriptions for inference (from Hugging Face & this notebook)
>>> hf_repos = [
... "Best GitHub repositories with practical notebooks demonstrating real-world AI applications from Hugging Face.",
... "Best libraries for accessing NLP datasets and evaluation tools in Python.",
... "Searching for Hugging Face Diffusers repositories for generating images, audio, and other media with pre-trained diffusion models.",
... ]
>>> for repo in hf_repos:
... print(f"Input: {repo}")
... print(f"Tags: {generate_tags_with_pipeline(repo)}\n")
Input: Best GitHub repositories with practical notebooks demonstrating real-world AI applications from Hugging Face.
Tags: github, repositories, practical, notebooks, demonstrating real-world, ai, hugging-face

Input: Best libraries for accessing NLP datasets and evaluation tools in Python.
Tags: nlp, datasets, evaluation, python

Input: Searching for Hugging Face Diffusers repositories for generating images, audio, and other media with pre-trained diffusion models.
Tags: images, audio, and other media, with pre-trained, diffusion-models.