

NGen3: Next-Generation Foundational Model
NGen3 is a production-level foundational language model inspired by state-of-the-art architectures such as GPT-4, Claude-3, and Llama 2. It is designed to be highly modular, efficient, and accessible via a flexible command-line interface (CLI). NGen3 supports multiple model variants—from 7M parameters to 1B parameters—and offers a comprehensive suite of tools for:
- Tokenization: Process text from local files, URLs, or Hugging Face datasets (a brief sketch follows this overview).
- Training: Train the model on tokenized data.
- Sampling: Generate text from trained models.
- Exporting: Save models and minimal tokenizer configurations in formats compatible with Hugging Face.
- Knowledge Distillation: Train a smaller student model using a larger teacher model.
- Fine-Tuning: Adapt a distilled model on conversational data (from local sources or directly from Hugging Face).
This repository provides a complete implementation of the NGen3 model along with detailed CLI commands to facilitate experimentation and research.
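As a rough sketch of what the tokenization step produces, the snippet below converts raw text (from a local file or a Hugging Face dataset) into a flat binary file of token IDs. The tokenizer (gpt2), file paths, and binary layout are illustrative assumptions; the NGen3 CLI defines its own tokenizer configuration and output format.

```python
# Illustrative sketch only: the NGen3 CLI implements this step itself;
# the tokenizer, paths, and binary layout here are assumptions.
import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder tokenizer

# Read raw text from a local file (path is a placeholder).
with open("corpus.txt", "r", encoding="utf-8") as f:
    text = f.read()

# Or pull text from a Hugging Face dataset instead:
# from datasets import load_dataset
# text = "\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="train")["text"])

ids = tokenizer(text)["input_ids"]                  # encode to token IDs
np.array(ids, dtype=np.uint16).tofile("train.bin")  # flat binary file of token IDs
print(f"wrote {len(ids):,} tokens to train.bin")
```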
Table of Contents
- Model Overview
- Architecture
- Installation
- Usage
- Hyperparameters
- Contributing
- License
- Acknowledgements
Model Overview
NGen3 is designed for rapid development and deployment of foundational language models. Its flexible CLI allows users to:
- Tokenize Text: Convert raw text or datasets into tokenized binary format.
- Train Models: Use various hyperparameter configurations based on the desired model size.
- Generate Samples: Evaluate model performance and generate text samples.
- Export Models: Easily export models in safetensors and JSON configurations for integration with Hugging Face tools.
- Distill Models: Leverage knowledge distillation to compress larger models into efficient student variants (a generic sketch of the distillation objective follows this list).
- Fine-Tune on Conversations: Adapt models to conversational data using both local and Hugging Face datasets.
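Knowledge distillation typically optimizes a blend of a soft-target loss against the teacher's temperature-scaled logits and the standard cross-entropy on ground-truth tokens. The following is a generic sketch of that objective; the temperature, weighting, and exact formulation used by NGen3 may differ.

```python
# Generic distillation-loss sketch; T and alpha are illustrative, not NGen3's exact values.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, T=2.0, alpha=0.5):
    """Blend soft-target KL loss (teacher -> student) with hard-label cross-entropy."""
    # Soft targets: KL divergence between temperature-scaled distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: standard next-token cross-entropy.
    hard = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)), targets.view(-1)
    )
    return alpha * soft + (1 - alpha) * hard
```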
Architecture
NGen3’s architecture is built upon the transformer decoder design. Key components include:
- Token and Positional Embeddings: Learnable embeddings that encode input tokens and their positions.
- Stack of Transformer Blocks: Each block contains:
  - Causal Self-Attention: With multi-head attention and masking to prevent information leakage.
  - MLP (Feed-Forward Network): Utilizes GELU activation for non-linearity.
  - Residual Connections and Layer Normalization: Stabilize training and improve convergence.
- Final Projection Layer: Maps embeddings to logits over the vocabulary.
The model supports variants with parameter counts ranging from 7M to 1B, making it adaptable for various research and production needs.
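The minimal PyTorch sketch below shows how these components fit together: token and positional embeddings, a stack of pre-norm blocks with masked multi-head attention and a GELU MLP, residual connections, and a final projection to vocabulary logits. Module names and default sizes are illustrative assumptions, not NGen3's actual implementation.

```python
# Simplified decoder-only sketch of the components described above (not NGen3's actual code).
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        T = x.size(1)
        # Boolean causal mask: True above the diagonal blocks attention to future tokens.
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + attn_out               # residual connection around attention
        x = x + self.mlp(self.ln2(x))  # residual connection around the MLP
        return x

class DecoderLM(nn.Module):
    def __init__(self, vocab_size, d_model=256, n_heads=8, n_layers=4, max_len=1024):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)   # token embeddings
        self.pos_emb = nn.Embedding(max_len, d_model)      # learned positional embeddings
        self.blocks = nn.ModuleList(Block(d_model, n_heads) for _ in range(n_layers))
        self.ln_f = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size, bias=False)  # final projection to logits

    def forward(self, idx):
        pos = torch.arange(idx.size(1), device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)
        for block in self.blocks:
            x = block(x)
        return self.head(self.ln_f(x))  # logits over the vocabulary
```

Scaling between the smaller and larger variants comes down mainly to the choices of d_model, n_heads, and n_layers; the defaults above are arbitrary placeholders.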
Installation
Ensure you have Python 3.8+ installed along with the following packages:
- PyTorch
- transformers
- datasets
- tqdm
- safetensors (for export functionality)
Install the required packages using pip:

```bash
pip install torch transformers datasets tqdm safetensors
```
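After installing, a quick check that the packages import and that a GPU (if any) is visible can save debugging time later. This is an optional sanity check, not part of the NGen3 CLI:

```python
# Optional environment check after installation (not part of the NGen3 CLI).
import torch
import transformers
import datasets
import safetensors  # just confirm it imports

print("PyTorch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__, "| datasets:", datasets.__version__)
```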