---
license: other
license_name: ngen-2-community-license
license_link: https://tnsaai-builds.framer.website/community/licenses/ngen2
library_name: transformers
model-index:
  - name: NGen-3-135M
    results:
      - task:
          type: text-generation
        dataset:
          name: TCorpus5
          type: WebCorpus
        metrics:
          - name: MMLU
            type: accuracy
            value: 51.612
          - name: PIQA
            type: accuracy
            value: 74.284
          - name: Hellaswag
            type: accuracy
            value: 40.132
          - name: Winogrande
            type: accuracy
            value: 56.384
pipeline_tag: text-generation
---

# NGen3: Next-Generation Foundational Model

NGen3 is a production-level foundational language model inspired by state-of-the-art architectures such as GPT-4, Claude-3, and Llama 2. It is designed for both research and production use and comes in model variants ranging from 7M to 1B parameters. The model is built around a modular transformer decoder and ships with a comprehensive command-line interface (CLI) for tokenization, training, sampling, exporting, knowledge distillation, and fine-tuning on conversational data.


## Table of Contents

- [Overview](#overview)
- [Model Architecture](#model-architecture)
- [Evaluation Results](#evaluation-results)
- [Installation](#installation)
- [Usage](#usage)
- [Hyperparameters](#hyperparameters)
- [Acknowledgements](#acknowledgements)

## Overview

NGen3 is a flexible, self-contained implementation of a foundational language model built on a transformer decoder architecture. It enables users to:

- Tokenize text from local files, URLs, or directly from Hugging Face datasets.
- Train the model on tokenized datasets.
- Generate text samples from trained models.
- Export models (with minimal tokenizer configurations) to formats compatible with Hugging Face.
- Distill knowledge from larger teacher models into smaller student models.
- Fine-tune on conversational datasets (using local files or datasets from Hugging Face).

## Model Architecture

NGen3 uses a decoder-only transformer design with the following components (a minimal sketch follows the list):

- Token & Positional Embeddings: Learnable embeddings for tokens and their positions.
- Transformer Blocks: A stack of blocks, each containing:
  - Causal Self-Attention: Multi-head attention with a lower-triangular mask to prevent attention to future tokens.
  - Feed-Forward Network (MLP): With GELU activation.
  - Residual Connections & Layer Normalization: To stabilize training.
- Final Projection Layer: Projects the hidden states to logits over the vocabulary.
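
For orientation, the sketch below shows what one such block looks like in PyTorch. The class and argument names are illustrative stand-ins, not the actual definitions in _model_.py:

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """Illustrative NGen3-style decoder block: pre-LN attention + GELU MLP."""

    def __init__(self, n_embd: int, n_head: int, block_size: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.attn = nn.MultiheadAttention(n_embd, n_head, batch_first=True)
        self.ln2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.GELU(),
            nn.Linear(4 * n_embd, n_embd),
        )
        # Boolean mask where True marks future positions that may not be attended to.
        mask = torch.triu(torch.ones(block_size, block_size, dtype=torch.bool), diagonal=1)
        self.register_buffer("causal_mask", mask)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        T = x.size(1)
        h = self.ln1(x)
        # Causal self-attention with a residual connection.
        attn_out, _ = self.attn(h, h, h, attn_mask=self.causal_mask[:T, :T],
                                need_weights=False)
        x = x + attn_out
        # GELU feed-forward network with a residual connection.
        x = x + self.mlp(self.ln2(x))
        return x
```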

The model comes in several variants:

- 7M Variant: 4 layers, 4 heads, 128-dimensional embeddings.
- 120M Variant: 12 layers, 8 heads, 512-dimensional embeddings.
- 300M, 500M, 700M, and 1B Variants: Increasing in depth and width.

## Evaluation Results

Benchmark accuracies, as reported in the model-index metadata above:

| Benchmark  | Accuracy |
|------------|----------|
| MMLU       | 51.612   |
| PIQA       | 74.284   |
| Hellaswag  | 40.132   |
| Winogrande | 56.384   |

## Installation

Ensure you have Python 3.8+ installed and install the necessary dependencies:

```bash
pip install torch transformers datasets tqdm safetensors
```

## Usage

NGen3 is fully managed via a CLI. Below are examples for each command.

### Tokenization

Local Text File or URL:

```bash
python _model_.py tokenize --dataset tinyshakespeare --txt "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
```

Hugging Face Dataset:

```bash
python _model_.py hf_tokenize --hf_dataset roskoN/dailydialog --hf_split train --hf_text_column utterances --dataset dailydialog_train
```
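
Both commands write the tokenized stream to a data.bin file under a dataset directory (e.g., _data_tinyshakespeare_/). Assuming a nanoGPT-style layout (a flat binary array of uint16 token ids), you could sanity-check the output as below; the dtype and layout here are assumptions, so confirm them against _model_.py:

```python
# Inspect a tokenized dataset. ASSUMPTION: data.bin is a flat binary
# array of uint16 token ids (nanoGPT-style); check _model_.py to confirm.
import numpy as np

tokens = np.fromfile("_data_tinyshakespeare_/data.bin", dtype=np.uint16)
print(f"{tokens.size} tokens, first ten: {tokens[:10].tolist()}")
```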

### Training

Train a model variant (e.g., 7M):

```bash
python _model_.py train --variant 7M --data _data_tinyshakespeare_/data.bin
```

### Sampling

Generate text samples from a trained model:

```bash
python _model_.py sample --variant 7M --model_checkpoint 7M_model.pt --prompt "To be, or not to be" --length 100 --temperature 1.0
```
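
The --temperature flag scales the logits before sampling: values below 1.0 sharpen the next-token distribution, values above 1.0 flatten it. A minimal sketch of the underlying loop, where model, encode, and decode are hypothetical stand-ins for the checkpoint and its tokenizer:

```python
# Sketch of temperature-controlled autoregressive sampling.
# `model`, `encode`, and `decode` are hypothetical stand-ins; the
# real loop lives in _model_.py.
import torch
import torch.nn.functional as F

@torch.no_grad()
def sample(model, encode, decode, prompt: str, length: int = 100,
           temperature: float = 1.0) -> str:
    ids = torch.tensor([encode(prompt)], dtype=torch.long)
    for _ in range(length):
        logits = model(ids)[:, -1, :]              # logits for the next token
        probs = F.softmax(logits / temperature, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        ids = torch.cat([ids, next_id], dim=1)     # append and continue
    return decode(ids[0].tolist())
```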

### Exporting

Export a trained model (and its tokenizer configuration) for Hugging Face:

```bash
python _model_.py export --variant 7M --model_path 7M_model.pt --output_dir exported_7M
```
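
If the exported directory follows the standard Hugging Face layout (config, weights, and tokenizer files), it should load with the transformers auto classes. This is a sketch under that assumption; verify it against what export actually writes:

```python
# Load an exported checkpoint with transformers. ASSUMPTION: the export
# directory uses the standard HF layout that the auto classes can read.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("exported_7M")
model = AutoModelForCausalLM.from_pretrained("exported_7M")

inputs = tokenizer("To be, or not to be", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output[0]))
```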

### Knowledge Distillation

Distill a larger teacher model (e.g., GPT-2 120M from HF) into a smaller student model (e.g., 7M):

```bash
python _model_.py distill --teacher_model_path hf --teacher_variant 120M --student_variant 7M --data _data_tinyshakespeare_/data.bin --temperature 2.0 --alpha 0.5
```
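
Here, --temperature softens both the teacher's and the student's distributions before comparing them, and --alpha balances the distillation term against the ordinary cross-entropy loss. A minimal sketch of one common form of that combined objective (not necessarily the exact loss in _model_.py):

```python
# Combined distillation objective: alpha * soft-target KL + (1 - alpha) * CE.
# Tensor names are illustrative; see _model_.py for the actual loss.
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, targets,
                 temperature: float = 2.0, alpha: float = 0.5):
    # Soft targets: KL divergence between temperature-softened distributions,
    # scaled by T^2 to keep gradient magnitudes comparable across temperatures.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    # Hard targets: standard next-token cross-entropy against the data.
    hard = F.cross_entropy(student_logits, targets)
    return alpha * soft + (1 - alpha) * hard
```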

### Fine-Tuning

#### Local Fine-Tuning on Conversational Data

Fine-tune a distilled model using local conversation data:

```bash
python _model_.py finetune --variant 120M --model_checkpoint distilled_120M_model.pt --data _data_conversations_/data.bin --finetune_iters 1000 --prompt "Hello, how are you?" --sample_length 100 --sample_temperature 1.0
```

#### Hugging Face Fine-Tuning on a Conversational Dataset

Fine-tune on a conversational dataset from Hugging Face (e.g., roskoN/dailydialog):

```bash
python _model_.py hf_finetune --variant 120M --model_checkpoint distilled_120M_model.pt --hf_dataset roskoN/dailydialog --hf_split train --hf_text_column utterances --finetune_iters 1000 --prompt "Hello, how are you?" --sample_length 100 --sample_temperature 1.0
```

### Sampling and Exporting Fine-Tuned Models

After fine-tuning, you can sample from or export the fine-tuned model just as with any other checkpoint. For example, if your fine-tuned model is saved as `finetuned_120M_model.pt`:

Sampling:

```bash
python _model_.py sample --variant 120M --model_checkpoint finetuned_120M_model.pt --prompt "What do you think about AI?" --length 100 --temperature 1.0
```

Exporting:

```bash
python _model_.py export --variant 120M --model_path finetuned_120M_model.pt --output_dir exported_finetuned_120M
```

## Hyperparameters

Each model variant comes with predefined hyperparameters. For example:

| Variant | Layers | Heads | Embedding Dim | Block Size | Batch Size | Learning Rate |
|---------|--------|-------|---------------|------------|------------|---------------|
| 7M      | 4      | 4     | 128           | 128        | 16         | 3e-4          |
| 120M    | 12     | 8     | 512           | 256        | 32         | 3e-4          |

The 300M, 500M, 700M, and 1B variants scale up layers, heads, and embedding dimensions for better performance.
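
As a sketch, the presets in the table above could be organized as a simple lookup; this structure is hypothetical, and the real definitions live in _model_.py:

```python
# Hypothetical organization of the per-variant presets from the table
# above; the actual values and structure live in _model_.py.
from dataclasses import dataclass

@dataclass
class VariantConfig:
    n_layer: int
    n_head: int
    n_embd: int
    block_size: int
    batch_size: int
    learning_rate: float

VARIANTS = {
    "7M":   VariantConfig(n_layer=4,  n_head=4, n_embd=128, block_size=128,
                          batch_size=16, learning_rate=3e-4),
    "120M": VariantConfig(n_layer=12, n_head=8, n_embd=512, block_size=256,
                          batch_size=32, learning_rate=3e-4),
}
```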

Adjust `max_iters`, `log_interval`, and `eval_interval` to suit your dataset size and computational resources.

## Acknowledgements

NGen3 is inspired by leading models including GPT-4, Claude-3, and Llama 2. Special thanks to the open-source community for:

- PyTorch
- Hugging Face Transformers
- Hugging Face Datasets
- safetensors