
πŸ”– GitHub Tag Generator with T5 + PEFT (LoRA)

Authored by: Zamal Babar

In this notebook, we walk through a complete end-to-end implementation of a lightweight, fast, and open-source GitHub tag generator using T5-small fine-tuned on a custom dataset with PEFT (LoRA). This tool can automatically generate relevant tags from a GitHub repository description or summary β€” useful for improving discoverability and organizing repos more intelligently.


πŸ’‘ Use Case

Imagine you’re building a tool that helps users explore GitHub repositories more effectively. Instead of relying on manually written or sometimes missing tags, we train a model that automatically generates descriptive tags for any GitHub project. This could help:

  • Improve search functionality
  • Automatically tag new repos
  • Build better filters for discovery

πŸ“¦ Dataset

We use a dataset of GitHub project descriptions and their associated tags. Each training example contains:

  • "input": A natural language description of a GitHub repository
  • "target": A comma-separated list of relevant tags

The dataset was initially loaded from a local .jsonl file, but is now also available on the Hugging Face Hub here:
➑️ zamal/github-meta-data
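
To get a feel for the format, here is a minimal sketch (assuming the Hub version exposes the same "input" and "target" fields described above) that loads the dataset and prints a single record:

from datasets import load_dataset

# Peek at one record to see the "input"/"target" structure described above
dataset = load_dataset("zamal/github-meta-data")
example = dataset["train"][0]
print("Description:", example["input"])
print("Tags:", example["target"])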


🧠 Model Architecture

We fine-tune the T5-small model for this task β€” a lightweight encoder-decoder transformer that’s well-suited for text-to-text generation tasks.
To make fine-tuning faster and more efficient, we use the πŸ€— peft library with LoRA (Low-Rank Adaptation) to update only a small subset of model parameters.


βœ… What This Notebook Covers

This notebook includes:

  • βœ… Loading and preprocessing a custom dataset
  • βœ… Setting up a T5-small model with LoRA
  • βœ… Training the model using the Hugging Face Trainer
  • βœ… Monitoring progress with Weights & Biases
  • βœ… Saving and pushing the model to the Hugging Face Hub
  • βœ… Performing inference and postprocessing for clean, deduplicated tags

πŸ” Final Outcome

By the end of this notebook, you’ll have:

  • πŸš€ A fully trained and hosted GitHub tag generator
  • πŸ” A deployable and shareable model on Hugging Face Hub
  • 🧠 An inference function to use your model anywhere with just a few lines of code

Let’s dive in! 🎯

We begin by:

  • Importing essential libraries for model training (transformers, datasets, peft)
  • Loading the T5 tokenizer
  • Setting the Hugging Face token (stored securely in Colab’s userdata)

Make sure you’ve stored your HUGGINGFACE_TOKEN in your Colab’s secrets before running this cell.

from google.colab import userdata
import os

os.environ["HUGGINGFACE_TOKEN"] = userdata.get("HUGGINGFACE_TOKEN")

from transformers import T5Tokenizer, T5ForConditionalGeneration, Trainer, TrainingArguments
from datasets import load_dataset
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, PeftConfig

tokenizer = T5Tokenizer.from_pretrained("t5-small")

πŸ“¦ Load and Prepare the Dataset

We now load our training data, which pairs repository descriptions with their corresponding tags.

Each example is a JSON object with two fields:

  β€’ input: a short repository description
  β€’ target: the tags (comma-separated)

We split this dataset into training and validation sets using a 90/10 ratio.

πŸ” Note: When this notebook was initially run, the dataset was loaded locally from a file. However, the same dataset is now also available on the Hugging Face Hub here: zamal/github-meta-data. Feel free to load it directly using load_dataset("zamal/github-meta-data") in your workflow as shown below.

from datasets import load_dataset, DatasetDict

# Load existing dataset with only a "train" split
dataset = load_dataset("zamal/github-meta-data")  # returns DatasetDict

# Split the train set into train and validation
split = dataset["train"].train_test_split(test_size=0.1, seed=42)

# Wrap into a new DatasetDict
dataset_dict = DatasetDict({"train": split["train"], "validation": split["test"]})
>>> print(len(dataset_dict["train"]))
>>> print(len(dataset_dict["validation"]))
552
62

πŸ”€ Load the Tokenizer

We load the tokenizer associated with the t5-small model. T5 expects input and output text to be tokenized in a specific way, and this tokenizer ensures compatibility during training and inference.

from transformers import AutoTokenizer

model_name = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)

🧹 Preprocessing the Dataset

Next, we define a preprocessing function to tokenize both the inputs and the targets using the T5 tokenizer.

  • The inputs are padded and truncated to a maximum length of 128 tokens.
  • The target labels (i.e., tags) are also tokenized with a shorter maximum length of 64 tokens.

We then map this preprocessing function across our training and validation datasets and format the output for PyTorch compatibility. This prepares the dataset for training.

def preprocess(batch):
    inputs = batch["input"]
    targets = batch["target"]
    model_inputs = tokenizer(inputs, max_length=128, truncation=True, padding="max_length")
    labels = tokenizer(targets, max_length=64, truncation=True, padding="max_length").input_ids
    model_inputs["labels"] = labels
    return model_inputs
tokenized = dataset_dict.map(preprocess, batched=True, remove_columns=dataset_dict["train"].column_names)
tokenized.set_format(type="torch", columns=["input_ids", "attention_mask", "labels"])
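
As a quick, purely illustrative sanity check, we can decode one preprocessed example back to text and confirm that the input and labels still correspond to the original description and tags:

# Decode one tokenized example to verify the preprocessing
sample = tokenized["train"][0]
print(tokenizer.decode(sample["input_ids"], skip_special_tokens=True))
print(tokenizer.decode(sample["labels"], skip_special_tokens=True))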

Loading the Base T5 Model

We load the base T5 model (t5-small) for conditional generation. This model serves as the backbone for our tag generation task, where the goal is to generate relevant tags given a description of a GitHub repository.

model = T5ForConditionalGeneration.from_pretrained(model_name)

πŸ”§ Preparing the LoRA Configuration

We configure LoRA (Low-Rank Adaptation) to fine-tune the T5 model efficiently. LoRA injects trainable low-rank matrices into attention layers, significantly reducing the number of trainable parameters while maintaining performance.

In this setup:

  • r=16 defines the rank of the update matrices.
  • lora_alpha=32 scales the updates.
  • We apply LoRA to the "q" and "v" attention projection modules.
  • The task type is set to "SEQ_2_SEQ_LM" since we’re working on a sequence-to-sequence task.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q", "v"],  # Adjust based on model architecture
    lora_dropout=0.05,
    bias="none",
    task_type="SEQ_2_SEQ_LM",
)

πŸ”Œ Injecting LoRA into the Base T5 Model

Now that we’ve defined our LoRA configuration, we apply it to the base T5 model using get_peft_model(). This wraps the original model with the LoRA adapters, allowing us to fine-tune only a small number of parameters instead of the entire modelβ€”making training faster and more memory-efficient.

model = get_peft_model(model, lora_config)
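
To see how much LoRA shrinks the trainable footprint, PEFT models expose a helper that prints the number of trainable parameters versus the total; only the injected low-rank matrices should show up as trainable:

# Print trainable vs. total parameters (only the LoRA matrices are trainable)
model.print_trainable_parameters()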

πŸ› οΈ TrainingArguments Configuration

We use the TrainingArguments class to define the hyperparameters and training behavior for our model. Here’s a breakdown of each parameter:

  • output_dir="./t5_tag_generator"
    Directory to save model checkpoints and training logs.

  • per_device_train_batch_size=8
    Number of training samples per GPU/TPU core (or CPU) in each training step.

  • per_device_eval_batch_size=8
    Number of evaluation samples per GPU/TPU core (or CPU) in each evaluation step.

  • learning_rate=1e-4
    Initial learning rate. A good starting point for T5 models with LoRA.

  • num_train_epochs=25
    Total number of training epochs. This is relatively high to ensure convergence for our use case.

  • logging_steps=10
    How often (in steps) to log training metrics to the console and W&B.

  • eval_strategy="steps"
    Run evaluation every eval_steps instead of after every epoch.

  • eval_steps=50
    Evaluate the model every 50 steps to monitor progress during training.

  • save_steps=50
    Save model checkpoints every 50 steps for redundancy and safe restoration.

  • save_total_limit=2
    Keep only the 2 most recent model checkpoints to save disk space.

  • fp16=True
    Enable mixed precision training (faster and memory-efficient on supported GPUs).

  • push_to_hub=True
    Automatically push the trained model to the Hugging Face Hub.

  • hub_model_id="zamal/github-tag-generatorr"
    The model repo name on Hugging Face under your username. This is where checkpoints and final model weights will be pushed.

  • hub_token=os.environ['HUGGINGFACE_TOKEN']
    Token to authenticate your Hugging Face account. We securely retrieve this from the environment.

This setup ensures a balance between training efficiency, frequent monitoring, and safe saving of model progress.

training_args = TrainingArguments(
    output_dir="./t5_tag_generator",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    learning_rate=1e-4,
    num_train_epochs=25,
    logging_steps=10,
    eval_strategy="steps",
    eval_steps=50,
    save_steps=50,
    save_total_limit=2,
    fp16=True,
    push_to_hub=True,
    hub_model_id="zamal/github-tag-generatorr",  # Replace with your Hugging Face username
    hub_token=os.environ["HUGGINGFACE_TOKEN"],
)

🧠 Initialize the Trainer

We now configure the Trainer, which abstracts away the training loop, evaluation steps, logging, and saving. It handles all of it for us using the parameters we’ve defined earlier.

We also pass in the DataCollatorForSeq2Seq, which ensures proper padding and batching during training and evaluation for sequence-to-sequence tasks like ours.

⚠️ Warnings Explained:

  • FutureWarning: 'tokenizer' is deprecated...
    As of Transformers v5.0.0, the tokenizer argument in Trainer is deprecated. Instead, Hugging Face recommends using the processing_class, which refers to a processor that combines tokenization and potentially feature extraction. For now, it’s safe to ignore this, but it’s good practice to track deprecations for future compatibility.

  • No label_names provided for model class 'PeftModelForSeq2SeqLM'
    This warning appears because we’re using a PEFT (Parameter-Efficient Fine-Tuning) wrapped model (PeftModelForSeq2SeqLM), and the Trainer cannot automatically determine the label field names in this case.
    Since we’re already formatting our dataset correctly (by explicitly setting labels during preprocessing), this warning can be safely ignored as well β€” training will still proceed correctly.

Now, we can initialize our Trainer:

from transformers import Trainer
from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    tokenizer=tokenizer,
    data_collator=data_collator,
)
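
If you’d rather avoid the tokenizer deprecation warning discussed above, recent transformers releases (roughly v4.46 and newer; check your installed version) accept a processing_class argument instead. A minimal, equivalent sketch:

# Same Trainer setup, passing the tokenizer via `processing_class` (newer transformers only)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    processing_class=tokenizer,
    data_collator=data_collator,
)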

πŸš€ Start Training the Tag Generator Model

With everything set up β€” the model, tokenizer, dataset, LoRA configuration, training arguments, and the Trainer β€” we can now kick off the fine-tuning process by calling trainer.train().

This will:

  • Fine-tune our T5 model using the parameter-efficient LoRA strategy.
  • Save checkpoints at regular intervals (save_steps=50).
  • Evaluate on the validation set every 50 steps (eval_steps=50).
  • Log metrics like loss to Weights & Biases or the Hugging Face Hub if integrated.

Training will take some time depending on the size of your dataset and GPU, but you’ll start to see metrics printed out step-by-step, such as:

  • Training Loss: how well the model is fitting the training data.
  • Validation Loss: how well the model performs on unseen data.

Let’s begin the fine-tuning! πŸ‘‡

>>> trainer.train()

βœ… Training Summary and Observations

The training process successfully completed over 25 epochs, using a LoRA-fine-tuned T5-small model to generate tags for GitHub repository descriptions. Here’s a quick breakdown of what happened and how to interpret it:

πŸ”„ Logging with Weights & Biases (W&B)

We logged all training metrics and artifacts using Weights & Biases, which offers a convenient UI to monitor model performance in real time. You can view the run at: πŸ‘‰ W&B Project Run

πŸ“‰ Training & Validation Loss

From the logs:

  • Training loss began at 8.9 (random init) and steadily declined to ~1.06.
  • Validation loss also dropped consistently from 7.9 to 0.95, indicating good generalization and minimal overfitting.

The slight fluctuations (e.g., at steps 850, 1000, 1100) are normal and reflect natural variance in optimization, especially with small batch sizes.

βš™οΈ Warnings and Notices

  • The warning about past_key_values being deprecated is safe to ignore for now and expected behavior with the current transformers version.
  • UserWarning about tensor creation can be optimized later, but doesn’t affect the result.
  • The run_name warning suggests you can optionally decouple logging folder names from output directories.

πŸ“Š Performance Metrics

The model completed:

  • 1725 training steps
  • ~1.9 samples/sec processing speed
  • Total training time: ~2 hours

This is solid performance given the setup and confirms that the LoRA fine-tuning pipeline is both stable and efficient.


Next, we’ll save and push this trained model to the Hugging Face Hub so you (or others!) can load and test it anytime. πŸš€

πŸ” Inference: Generate Tags from Repository Descriptions

Now that the model is trained, we define a simple helper function generate_tags to run inference. It takes a natural language query describing a repository and generates relevant tags using our fine-tuned T5 model.

Below is an example for a query related to image augmentation and no-code tools.

import torch


def generate_tags(query, model, tokenizer, max_length=64, num_beams=5):
    model.eval()
    inputs = tokenizer(query, return_tensors="pt", truncation=True, padding="max_length", max_length=128).to(
        model.device
    )

    with torch.no_grad():
        output = model.generate(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            max_length=max_length,
            num_beams=num_beams,
            early_stopping=True,
            decoder_start_token_id=tokenizer.pad_token_id,  # T5 uses the pad token as its decoder start token
        )
    return tokenizer.decode(output[0], skip_special_tokens=True)
generate_tags("looking for repositories on image augmentation no code implementations", model, tokenizer)

πŸ’Ύ Save Fine-Tuned Model Locally

Once the training is complete, we save the fine-tuned model and tokenizer to a local directory. This allows us to reuse or share the model later without needing to retrain it.

>>> # Save model, tokenizer, and config to local output directory
>>> model_path = "./t5_tag_generator/final"

>>> model.save_pretrained(model_path)
>>> tokenizer.save_pretrained(model_path)

>>> print("βœ… Model and tokenizer saved locally at:", model_path)
βœ… Model and tokenizer saved locally at: ./t5_tag_generator/final
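
Because save_pretrained on a PEFT-wrapped model typically stores just the LoRA adapter weights and config (not the full base model), here is an illustrative sketch of how the fine-tuned model could be reloaded later from that local directory:

from transformers import T5ForConditionalGeneration, AutoTokenizer
from peft import PeftModel

# Rebuild the fine-tuned model: base t5-small + the saved LoRA adapter
base_model = T5ForConditionalGeneration.from_pretrained("t5-small")
reloaded_model = PeftModel.from_pretrained(base_model, model_path)
reloaded_tokenizer = AutoTokenizer.from_pretrained(model_path)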

πŸš€ Push Model to Hugging Face Hub

After saving the model locally, we now push it to the Hugging Face Hub so that others can easily access, test, and load it using from_pretrained.

➑️ The model is publicly available at: huggingface.co/zamal/github-tag-generatorr

>>> # Push to Hugging Face Hub under your repo
>>> from huggingface_hub import HfApi

>>> api = HfApi()
>>> api.upload_folder(
...     folder_path=model_path,
...     repo_id="zamal/github-tag-generatorr",  # Your model ID
...     repo_type="model",
...     path_in_repo="",  # Root of the repo
... )

>>> print("πŸš€ Model pushed to Hugging Face Hub: https://huggingface.co/zamal/github-tag-generatorr")
πŸš€ Model pushed to Hugging Face Hub: https://huggingface.co/zamal/github-tag-generatorr

πŸ“¦ Load Model Directly from Hugging Face Hub

Now that we’ve pushed our fine-tuned model to the Hugging Face Hub, we can easily load it from anywhere using the pipeline utility. This allows us to instantly test or integrate the model into other applications without needing local files.

The model is hosted at: zamal/github-tag-generatorr

from transformers import pipeline

# Load the model and tokenizer from Hugging Face Hub
tag_generator = pipeline(
    "text2text-generation", model="zamal/github-tag-generatorr", tokenizer="zamal/github-tag-generatorr"
)
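
As a quick smoke test (the query below is just an illustrative example, and the generated tags will vary), you can call the pipeline directly on a short description:

# Illustrative call to the Hub-hosted pipeline
result = tag_generator("repositories for fine-tuning transformers with LoRA adapters", max_length=64, num_beams=5)
print(result[0]["generated_text"])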

🧠 Inference Function with Post-Processing

This function wraps the model inference process to generate tags for a given GitHub project description. The input is tokenized exactly as it was during training (padded and truncated to 128 tokens) before calling model.generate().

After decoding the generated output, we deduplicate the tags using a simple dict.fromkeys() trick. This ensures that tags like "pytorch, pytorch, pytorch" only appear once.

We added this logic because the training data included some noisy samples with repeated or inconsistent tags. Since we did not perform extensive data cleaning or multiple training runs to refine the quality, this lightweight fix helps improve the final output. In a production-grade system, we’d recommend:

  • more rigorous data preprocessing,
  • filtering weak labels,
  • and performing iterative fine-tuning with evaluation and human-in-the-loop review.
def generate_tags(text, model, tokenizer, max_length=64, num_beams=5):
    input_text = text
    inputs = tokenizer(input_text, return_tensors="pt", padding="max_length", truncation=True, max_length=128).to(
        model.device
    )

    with torch.no_grad():
        output = model.generate(
            input_ids=inputs["input_ids"],
            attention_mask=inputs["attention_mask"],
            max_length=max_length,
            num_beams=num_beams,
            early_stopping=True,
            decoder_start_token_id=tokenizer.pad_token_id,
        )
    decoded = tokenizer.decode(output[0], skip_special_tokens=True)

    # Deduplicate and clean tags
    tags = [t.strip().lower() for t in decoded.split(",")]
    unique_tags = list(dict.fromkeys(tags))  # preserve order + remove duplicates
    return ", ".join(unique_tags)
πŸ” Real-world Examples: Testing on Sample Inputs

Now that we’ve defined our inference function and loaded the model, let’s run it on a few example descriptions.

Each input represents a short summary of a hypothetical GitHub repository. Our goal is to generate meaningful and concise tags using the fine-tuned T5 model.

These test cases demonstrate how well the model generalizes to realistic prompts β€” and thanks to our post-processing, any repetitive or noisy tags are cleaned up before display.

>>> inputs = [
...     "Need an AI tool to convert customer voice calls into structured CRM record",
...     "How to train a text summarization model using Pegasus or BART",
...     "Fine-tuning BERT for spam detection in emails",
... ]

>>> for text in inputs:
...     print(f"πŸ“₯ Input: {text}")
...     print(f"🏷️ Tags: {generate_tags(text, model, tokenizer)}\n")
πŸ“₯ Input: Need an AI tool to convert customer voice calls into structured CRM record
🏷️ Tags: voice-calls, crm-recording, voice-recording

πŸ“₯ Input: How to train a text summarization model using Pegasus or BART
🏷️ Tags: text summarization, pegasus, bart, et al.

πŸ“₯ Input: Fine-tuning BERT for spam detection in emails
🏷️ Tags: bert, spam-detecting, email-tuning
πŸ” Inference Examples Using Real Hugging Face Projects

To match the format of our training data, which was based on natural language search queries mapped to relevant tags, we rephrase descriptive statements into query-style inputs, just like users would naturally search for repositories.

Below are some meta and practical examples, including:

  • Hugging Face’s own popular repositories (e.g., Transformers, Datasets, Diffusers)
  • Styled as realistic queries for better inference consistency
>>> from transformers import pipeline
>>> import torch

>>> # Load the model and tokenizer from the Hugging Face Hub
>>> tag_generator = pipeline(
...     "text2text-generation", model="zamal/github-tag-generatorr", tokenizer="zamal/github-tag-generatorr"
... )


>>> def clean_and_deduplicate_tags(decoded):
...     tags = [tag.strip().lower() for tag in decoded.split(",")]

...     # Remove non-informative or overly generic tokens
...     ignore_list = {"a", "an", "the", "and", "or", "of", "to", "on", "in", "for", "with", "etc", "from"}
...     filtered = [tag for tag in tags if tag not in ignore_list and len(tag) > 1]

...     # Deduplicate while preserving order
...     return ", ".join(dict.fromkeys(filtered))


>>> def generate_tags_with_pipeline(text):
...     output = tag_generator(text, max_length=64, num_beams=5, early_stopping=True)
...     decoded = output[0]["generated_text"]
...     return clean_and_deduplicate_tags(decoded)


>>> # πŸ€— Realistic repo descriptions for inference (from Hugging Face & this notebook)
>>> hf_repos = [
...     "Best GitHub repositories with practical notebooks demonstrating real-world AI applications from Hugging Face.",
...     "Best libraries for accessing NLP datasets and evaluation tools in Python.",
...     "Searching for Hugging Face Diffusers repositories for generating images, audio, and other media with pre-trained diffusion models.",
... ]


>>> for repo in hf_repos:
...     print(f"πŸ“₯ Input: {repo}")
...     print(f"🏷️ Tags: {generate_tags_with_pipeline(repo)}\n")
πŸ“₯ Input: Best GitHub repositories with practical notebooks demonstrating real-world AI applications from Hugging Face.
🏷️ Tags: github, repositories, practical, notebooks, demonstrating real-world, ai, hugging-face

πŸ“₯ Input: Best libraries for accessing NLP datasets and evaluation tools in Python.
🏷️ Tags: nlp, datasets, evaluation, python

πŸ“₯ Input: Searching for Hugging Face Diffusers repositories for generating images, audio, and other media with pre-trained diffusion models.
🏷️ Tags: images, audio, and other media, with pre-trained, diffusion-models.