
OpenGPT

OpenGPT is an open-source implementation of a GPT-style language model (decoder-only Transformer) in PyTorch. It is designed to be fully functional and easily extendable, supporting training from scratch on custom data, fine-tuning of pre-trained models, and integration with the Hugging Face Hub for model sharing. The project also supports advanced features such as multi-GPU distributed training with DeepSpeed, mixed-precision training for faster performance, and parameter-efficient fine-tuning (PEFT) methods.
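For orientation, the core building block of such a model is a pre-norm decoder layer combining masked (causal) self-attention with a position-wise MLP. The sketch below illustrates the architecture; it is not the repository's exact implementation:

    import torch
    import torch.nn as nn

    class DecoderBlock(nn.Module):
        """One pre-norm decoder block: causal self-attention + feed-forward MLP."""
        def __init__(self, n_embd: int, n_head: int):
            super().__init__()
            self.ln1 = nn.LayerNorm(n_embd)
            self.attn = nn.MultiheadAttention(n_embd, n_head, batch_first=True)
            self.ln2 = nn.LayerNorm(n_embd)
            self.mlp = nn.Sequential(
                nn.Linear(n_embd, 4 * n_embd),
                nn.GELU(),
                nn.Linear(4 * n_embd, n_embd),
            )

        def forward(self, x):
            # Boolean mask: True above the diagonal blocks attention to future tokens
            t = x.size(1)
            causal = torch.triu(torch.ones(t, t, dtype=torch.bool, device=x.device), 1)
            h = self.ln1(x)
            attn_out, _ = self.attn(h, h, h, attn_mask=causal, need_weights=False)
            x = x + attn_out                  # residual connection around attention
            x = x + self.mlp(self.ln2(x))     # residual connection around the MLP
            return x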

Features

  • GPT Model Implementation: A complete GPT-style Transformer with configurable model size (number of layers, heads, embedding dimensions, etc.).
  • Custom Tokenizer: Byte-Pair Encoding (BPE) tokenizer training script to prepare your own vocabulary from a corpus.
  • Flexible Training: Single-node multi-GPU training via PyTorch Distributed (torchrun) or DeepSpeed, with support for FP16 mixed precision.
  • Evaluation: Perplexity computation and other evaluation metrics on validation/test sets.
  • Text Generation: Script to generate text from a prompt using the trained model, with adjustable decoding parameters (sampling or greedy).
  • Gradio Demo: A simple web demo (Gradio app) for interactive text generation.
  • Hugging Face Integration: Save and load models/tokenizers in Hugging Face format, and optionally push models to the Hugging Face Hub.
  • Extensibility: Designed to be extended with techniques like LoRA, adapters, or other PEFT approaches for fine-tuning (a sketch follows this list).
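As a concrete example of the PEFT hook mentioned above, a LoRA wrapper freezes a pretrained linear layer and learns only a low-rank additive update W + (alpha/r)·BA. This is a generic sketch of the technique, not code shipped in the repository:

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        """Freeze a base nn.Linear and train only a rank-r additive update."""
        def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad = False       # the pretrained weights stay fixed
            self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
            # B starts at zero, so the wrapped layer is initially unchanged
            self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))
            self.scaling = alpha / r

        def forward(self, x):
            return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scaling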

Installation

  1. Clone the repository:

    git clone https://github.com/yourusername/OpenGPT.git
    cd OpenGPT

  2. Install dependencies:

    pip install -r requirements.txt

Usage
    
  1. Train a Tokenizer: Before training the model, prepare a tokenizer on your corpus. For example, if you have a text file data/corpus.txt:

    python tokenizer_train.py --input data/corpus.txt --output tokenizer

This trains a BPE tokenizer and saves it to the tokenizer/ directory (generating a tokenizer.json file).
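A rough sketch of what such a BPE training step amounts to, assuming the Hugging Face tokenizers library is used (an assumption; the vocabulary size and special tokens here are illustrative):

    import os
    from tokenizers import Tokenizer
    from tokenizers.models import BPE
    from tokenizers.pre_tokenizers import ByteLevel
    from tokenizers.trainers import BpeTrainer

    tokenizer = Tokenizer(BPE(unk_token="<unk>"))
    tokenizer.pre_tokenizer = ByteLevel()     # byte-level pre-tokenization, GPT-style
    trainer = BpeTrainer(
        vocab_size=32000,                     # illustrative size
        special_tokens=["<unk>", "<pad>", "<bos>", "<eos>"],
    )
    tokenizer.train(files=["data/corpus.txt"], trainer=trainer)
    os.makedirs("tokenizer", exist_ok=True)
    tokenizer.save("tokenizer/tokenizer.json")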

  2. Training the Model: Configure your model and training settings in a config file (e.g. config/gpt_small.yaml), then run the training script. For single-GPU or CPU training:

    python train.py --config config/gpt_small.yaml

For multi-GPU training using PyTorch's distributed launcher (torchrun):

    torchrun --standalone --nproc_per_node=4 train.py --config config/gpt_small.yaml

This trains the model on 4 GPUs (adjust nproc_per_node as needed). If you have DeepSpeed installed, you can use it for efficient large-scale training:

    deepspeed train.py --deepspeed config/deepspeed_zero2.json --config config/gpt_small.yaml

This uses DeepSpeed ZeRO Stage 2 for memory optimization. Checkpoints are saved to the directory specified in the config (e.g., checkpoints/).
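For reference, the heart of an FP16 run with torch.cuda.amp looks roughly like the function below; model, loader, and optimizer are hypothetical stand-ins for the objects train.py builds from the YAML config:

    import torch
    import torch.nn.functional as F

    def train_epoch_fp16(model, loader, optimizer, device="cuda"):
        """One epoch of mixed-precision training (illustrative sketch)."""
        scaler = torch.cuda.amp.GradScaler()
        model.train()
        for input_ids, targets in loader:
            input_ids, targets = input_ids.to(device), targets.to(device)
            optimizer.zero_grad(set_to_none=True)
            with torch.cuda.amp.autocast():   # run the forward pass in FP16 where safe
                logits = model(input_ids)
                loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
            scaler.scale(loss).backward()     # scale the loss to avoid FP16 underflow
            scaler.step(optimizer)            # unscales gradients, then steps
            scaler.update()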

  3. Evaluation: After training, evaluate the model on a validation set to compute perplexity:

    python evaluate.py --model checkpoints/epoch1.pt --config config/gpt_small.yaml --tokenizer tokenizer/tokenizer.json

This loads the model checkpoint and tokenizer, and prints the perplexity on the validation dataset defined in the config.
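Perplexity here is the exponential of the mean per-token cross-entropy over the validation set; a minimal sketch (model and loader are hypothetical stand-ins for what evaluate.py constructs):

    import math
    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def perplexity(model, loader, device="cuda"):
        """exp(total cross-entropy / total tokens) over the whole loader."""
        model.eval()
        total_loss, total_tokens = 0.0, 0
        for input_ids, targets in loader:
            input_ids, targets = input_ids.to(device), targets.to(device)
            logits = model(input_ids)
            loss = F.cross_entropy(
                logits.view(-1, logits.size(-1)), targets.view(-1), reduction="sum"
            )
            total_loss += loss.item()
            total_tokens += targets.numel()
        return math.exp(total_loss / total_tokens)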

  4. Text Generation: Use the trained model to generate text given a prompt:

    python generate.py --model checkpoints/epoch1.pt --config config/gpt_small.yaml --tokenizer tokenizer --prompt "Once upon a time"

This outputs a generated continuation of the prompt. You can adjust generation parameters such as --max_length (number of tokens to generate) and --temperature (sampling randomness). Add --greedy for deterministic greedy decoding, or use --top_k to restrict sampling to the k most probable tokens.
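These flags map onto a standard autoregressive decoding loop. The sketch below shows the semantics assumed here for --temperature, --top_k, and --greedy; it is not generate.py verbatim:

    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def generate(model, input_ids, max_length=100, temperature=1.0, top_k=None, greedy=False):
        model.eval()
        for _ in range(max_length):
            logits = model(input_ids)[:, -1, :] / temperature   # next-token logits
            if greedy:
                next_token = logits.argmax(dim=-1, keepdim=True)
            else:
                if top_k is not None:
                    # Mask out everything below the k-th largest logit
                    kth = torch.topk(logits, top_k).values[:, -1, None]
                    logits = logits.masked_fill(logits < kth, float("-inf"))
                probs = F.softmax(logits, dim=-1)
                next_token = torch.multinomial(probs, num_samples=1)
            input_ids = torch.cat([input_ids, next_token], dim=1)
        return input_ids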

  5. Gradio Demo: Launch the Gradio web demo to interact with the model in a browser:

    python demo/app.py --model checkpoints/epoch1.pt --config config/gpt_small.yaml --tokenizer tokenizer

Open the provided local URL in your browser to use the web interface for text generation. Enter a prompt, adjust the sliders for length and creativity (temperature), and generate text.
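A minimal app of this shape could look like the following; generate_text is a hypothetical wrapper around the loaded model, not the repository's actual function:

    import gradio as gr

    def generate_text(prompt, max_length, temperature):
        # Hypothetical: call the loaded OpenGPT model and return the generated string
        ...

    demo = gr.Interface(
        fn=generate_text,
        inputs=[
            gr.Textbox(label="Prompt"),
            gr.Slider(10, 500, value=100, step=10, label="Max length"),
            gr.Slider(0.1, 2.0, value=1.0, step=0.1, label="Temperature"),
        ],
        outputs=gr.Textbox(label="Generated text"),
    )
    demo.launch()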

Hugging Face Hub

OpenGPT supports integration with the Hugging Face Hub.

Saving for the Hub: After training, you can save your model and tokenizer in Hugging Face format. For example:

    import torch
    from transformers import PreTrainedTokenizerFast

    tokenizer = PreTrainedTokenizerFast(tokenizer_file="tokenizer/tokenizer.json")
    tokenizer.save_pretrained("hf_model")
    # `model` is your trained OpenGPT instance
    torch.save(model.state_dict(), "hf_model/pytorch_model.bin")

Also save a config.json in hf_model/ describing the model hyperparameters.
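For example, a minimal config.json can be written like this; the field names and values are illustrative and should mirror whatever hyperparameters your YAML config defines:

    import json

    config = {
        "n_layer": 12,        # illustrative values -- use your actual settings
        "n_head": 12,
        "n_embd": 768,
        "vocab_size": 32000,
        "block_size": 1024,
    }
    with open("hf_model/config.json", "w") as f:
        json.dump(config, f, indent=2)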

Pushing to the Hub: Use the huggingface_hub library or CLI to push the saved model. Create the repository, clone it, copy the saved files in, and push:

    pip install huggingface_hub
    huggingface-cli login
    huggingface-cli repo create vpu2301/Open_G-P-T-model
    git clone https://huggingface.co/vpu2301/Open_G-P-T-model
    cp hf_model/* Open_G-P-T-model/
    cd Open_G-P-T-model
    git lfs install
    git add .
    git commit -m "Add trained OpenGPT model"
    git push origin main
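Alternatively, the huggingface_hub Python API can create the repository and upload the folder directly (same repo id as above; requires a prior huggingface-cli login):

    from huggingface_hub import HfApi

    api = HfApi()
    api.create_repo("vpu2301/Open_G-P-T-model", exist_ok=True)
    api.upload_folder(folder_path="hf_model", repo_id="vpu2301/Open_G-P-T-model")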
