# OpenGPT
OpenGPT is an open-source implementation of a GPT-style language model (decoder-only Transformer) in PyTorch. It is designed to be fully functional and easily extendable, supporting training from scratch on custom data, fine-tuning of pre-trained models, and integration with the Hugging Face Hub for model sharing. The project also supports advanced features such as multi-GPU distributed training with DeepSpeed, mixed-precision training for faster performance, and parameter-efficient fine-tuning (PEFT) methods.
## Features
- GPT Model Implementation: A complete GPT-style Transformer with configurable model size (number of layers, heads, embedding dimensions, etc.).
- Custom Tokenizer: Byte-Pair Encoding (BPE) tokenizer training script to prepare your own vocabulary from a corpus.
- Flexible Training: Single-node multi-GPU training via PyTorch Distributed (`torchrun`) or DeepSpeed, with support for FP16 mixed precision.
- Evaluation: Perplexity computation and other evaluation metrics on validation/test sets.
- Text Generation: Script to generate text from a prompt using the trained model, with adjustable decoding parameters (sampling or greedy).
- Gradio Demo: A simple web demo (Gradio app) for interactive text generation.
- Hugging Face Integration: Save and load models/tokenizers in Hugging Face format, and optionally push models to the Hugging Face Hub.
- Extensibility: Designed to be extended with techniques like LoRA, adapters, or other PEFT approaches for fine-tuning.
## Installation
Clone the repository:
```bash
git clone https://github.com/yourusername/OpenGPT.git
cd OpenGPT
```

Install dependencies:

```bash
pip install -r requirements.txt
```

## Usage
### Train a Tokenizer

Before training the model, prepare a tokenizer on your corpus. For example, if you have a text file `data/corpus.txt`:

```bash
python tokenizer_train.py --input data/corpus.txt --output tokenizer
```

This will train a BPE tokenizer and save it to the `tokenizer/` directory (generating a `tokenizer.json` file).
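If you want to adapt the tokenizer script, the core of BPE training is small. The sketch below is illustrative only, assuming the Hugging Face `tokenizers` library and a byte-level BPE setup; the actual `tokenizer_train.py`, vocabulary size, and special tokens may differ:

```python
# Illustrative sketch only -- the real tokenizer_train.py may differ.
import os
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

def train_bpe(input_file: str, output_dir: str, vocab_size: int = 32000) -> None:
    # Byte-level BPE so any input text can be encoded without unknown tokens.
    tokenizer = Tokenizer(models.BPE())
    tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

    trainer = trainers.BpeTrainer(
        vocab_size=vocab_size,                # assumed default, adjust as needed
        special_tokens=["<|endoftext|>"],     # assumed special token
    )
    tokenizer.train([input_file], trainer)

    os.makedirs(output_dir, exist_ok=True)
    tokenizer.save(f"{output_dir}/tokenizer.json")

if __name__ == "__main__":
    train_bpe("data/corpus.txt", "tokenizer")
```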
### Training the Model

Configure your model and training settings in a config file (e.g. `config/gpt_small.yaml`), then run the training script. For single-GPU or CPU training:

```bash
python train.py --config config/gpt_small.yaml
```

For multi-GPU training using PyTorch's distributed launcher (`torchrun`):

```bash
torchrun --standalone --nproc_per_node=4 train.py --config config/gpt_small.yaml
```

This trains the model on 4 GPUs (adjust `nproc_per_node` as needed). If you have DeepSpeed installed, you can use it for efficient large-scale training:

```bash
deepspeed train.py --deepspeed config/deepspeed_zero2.json --config config/gpt_small.yaml
```

This uses DeepSpeed ZeRO Stage 2 for memory optimization. Checkpoints are saved to the directory specified in the config (e.g. `checkpoints/`).
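For a rough picture of what the training script does under the hood, the sketch below shows a typical DDP plus FP16 mixed-precision training step in PyTorch. The names `model`, `dataloader`, and `optimizer` are placeholders; the real `train.py` builds them from the YAML config and also handles checkpointing:

```python
# Illustrative sketch of DDP + FP16 mixed-precision training (not the exact train.py code).
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_ddp(model: torch.nn.Module) -> torch.nn.Module:
    # torchrun sets RANK / WORLD_SIZE / LOCAL_RANK; we just join the process group.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    return DDP(model.cuda(local_rank), device_ids=[local_rank])

def train_one_epoch(model, dataloader, optimizer):
    scaler = torch.cuda.amp.GradScaler()          # FP16 loss scaling
    for input_ids, labels in dataloader:
        input_ids, labels = input_ids.cuda(), labels.cuda()
        optimizer.zero_grad(set_to_none=True)
        with torch.cuda.amp.autocast():           # forward pass in mixed precision
            logits = model(input_ids)
            loss = torch.nn.functional.cross_entropy(
                logits.view(-1, logits.size(-1)), labels.view(-1)
            )
        scaler.scale(loss).backward()             # scaled backward to avoid FP16 underflow
        scaler.step(optimizer)
        scaler.update()
```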
### Evaluation

After training, evaluate the model on a validation set to compute perplexity:

```bash
python evaluate.py --model checkpoints/epoch1.pt --config config/gpt_small.yaml --tokenizer tokenizer/tokenizer.json
```

This loads the model checkpoint and tokenizer, then prints the perplexity on the validation dataset defined in the config.
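Perplexity is the exponential of the average per-token cross-entropy loss. A minimal sketch of the computation, with `model` and `val_loader` as placeholders for the objects the script builds from the config:

```python
# Illustrative perplexity computation (placeholder model / dataloader names).
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def perplexity(model, val_loader, device="cuda"):
    model.eval()
    total_nll, total_tokens = 0.0, 0
    for input_ids, labels in val_loader:
        input_ids, labels = input_ids.to(device), labels.to(device)
        logits = model(input_ids)
        # Sum (not mean) the negative log-likelihood so batches of different size weigh correctly.
        nll = F.cross_entropy(
            logits.view(-1, logits.size(-1)), labels.view(-1), reduction="sum"
        )
        total_nll += nll.item()
        total_tokens += labels.numel()
    return math.exp(total_nll / total_tokens)
```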
### Text Generation

Use the trained model to generate text from a prompt:

```bash
python generate.py --model checkpoints/epoch1.pt --config config/gpt_small.yaml --tokenizer tokenizer --prompt "Once upon a time"
```

This outputs a generated continuation of the prompt. You can adjust generation parameters such as `--max_length` (number of tokens to generate) and `--temperature` (sampling randomness). Add `--greedy` for deterministic greedy decoding, or use `--top_k` to restrict sampling to the top k tokens.
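These flags correspond to a standard autoregressive decoding loop. The sketch below is illustrative, assuming a `model` that returns next-token logits of shape `(batch, sequence, vocab)`; it is not the exact `generate.py` implementation:

```python
# Illustrative temperature / top-k sampling loop (not the exact generate.py code).
import torch

@torch.no_grad()
def generate(model, input_ids, max_length=100, temperature=1.0, top_k=None, greedy=False):
    for _ in range(max_length):
        logits = model(input_ids)[:, -1, :]          # logits for the last position
        if greedy:
            next_token = logits.argmax(dim=-1, keepdim=True)
        else:
            logits = logits / temperature            # higher temperature = more random
            if top_k is not None:
                # Mask out everything below the k-th most likely token.
                kth = torch.topk(logits, top_k).values[:, -1, None]
                logits = logits.masked_fill(logits < kth, float("-inf"))
            probs = torch.softmax(logits, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1)
        input_ids = torch.cat([input_ids, next_token], dim=1)
    return input_ids
```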
### Gradio Demo

Launch the Gradio web demo to interact with the model in a browser:

```bash
python demo/app.py --model checkpoints/epoch1.pt --config config/gpt_small.yaml --tokenizer tokenizer
```

Open the provided local URL in your browser to use the web interface for text generation. Enter a prompt, adjust the sliders for length and creativity (temperature), and generate text.
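For reference, a minimal Gradio app for this kind of demo could look like the sketch below; the real `demo/app.py` will differ in how it loads the checkpoint, and the `generate_text` helper here is just a placeholder:

```python
# Illustrative Gradio demo sketch (the real demo/app.py may differ).
import gradio as gr

def generate_text(prompt: str, max_length: int, temperature: float) -> str:
    # Placeholder: call the trained model's generation function here.
    return prompt + " ..."

demo = gr.Interface(
    fn=generate_text,
    inputs=[
        gr.Textbox(label="Prompt"),
        gr.Slider(10, 500, value=100, step=10, label="Max length"),
        gr.Slider(0.1, 2.0, value=1.0, step=0.1, label="Temperature"),
    ],
    outputs=gr.Textbox(label="Generated text"),
    title="OpenGPT Demo",
)

if __name__ == "__main__":
    demo.launch()
```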
## Hugging Face Hub

OpenGPT supports integration with the Hugging Face Hub.

Saving for the Hub: after training, you can save your model and tokenizer in Hugging Face format. For example:

```python
import torch
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast(tokenizer_file="tokenizer/tokenizer.json")
tokenizer.save_pretrained("hf_model")

# `model` is your trained OpenGPT model instance.
torch.save(model.state_dict(), "hf_model/pytorch_model.bin")
```
Also save a `config.json` in `hf_model/` describing the model hyperparameters.
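One way to produce that file is a small JSON dump; the hyperparameter names and values below are placeholders and should match whatever your YAML config actually uses:

```python
# Illustrative config.json writer (hyperparameter names/values are placeholders).
import json

config = {
    "model_type": "gpt2",    # assumed; pick whatever matches your architecture
    "n_layer": 12,
    "n_head": 12,
    "n_embd": 768,
    "vocab_size": 32000,
}
with open("hf_model/config.json", "w") as f:
    json.dump(config, f, indent=2)
```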
Pushing to the Hub: use the `huggingface_hub` library or CLI to push the saved model:

```bash
pip install huggingface_hub
huggingface-cli login
huggingface-cli repo create vpu2301/Open_G-P-T-model
git lfs install
git add hf_model/*
git commit -m "Add trained OpenGPT model"
git push origin main
```
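Alternatively, the same upload can be done from Python via the `huggingface_hub` API (this sketch reuses the repo id from the CLI example above):

```python
# Illustrative upload via the huggingface_hub Python API.
from huggingface_hub import HfApi

api = HfApi()
api.create_repo("vpu2301/Open_G-P-T-model", exist_ok=True)   # no-op if the repo already exists
api.upload_folder(
    folder_path="hf_model",
    repo_id="vpu2301/Open_G-P-T-model",
    commit_message="Add trained OpenGPT model",
)
```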