Manas-1

Manas-1 is a text-only language model project for Indian-language and cultural knowledge tasks, with a strong focus on Malayalam and Kerala cultural traditions. The project is aligned around sarvamai/sarvam-30b as its base model and uses OpenManas Apache-2.0 datasets as the current default training source.

Current Status

This repository is a training-ready scaffold, not a finished trained checkpoint. The uploaded model.safetensors file is a placeholder and does not contain real Manas-1 trained weights yet.

Current artifacts include:

  • Base-model metadata for Sarvam-30B
  • OpenManas processed SFT data
  • QLoRA training configuration
  • Data preparation script
  • QLoRA supervised fine-tuning script
  • Model, dataset, architecture, and training documentation

Base Model

Field Value
Base model sarvamai/sarvam-30b
Architecture Sarvam MoE causal language model
Model type sarvam_moe
Base parameter count 32.15B
Pipeline Text generation
License Apache-2.0
Modality Text only

Sarvam-30B is a strong base for Manas-1 because it is an Indian-built open-weight model with Indian-language coverage, multilingual reasoning support, and Malayalam among its supported languages.

Scope

Manas-1 is currently scoped to:

  • Text generation
  • Malayalam and Indian-language assistant behavior
  • Kerala cultural knowledge
  • Indian classical arts knowledge
  • Retrieval-style and explanatory answers after future expansion
  • QLoRA / LoRA fine-tuning from Sarvam-30B

Manas-1 is not currently scoped to:

  • Image generation or image understanding
  • Audio or speech tasks
  • Multimodal training
  • Medical, legal, financial, or other high-stakes advice

Default Training Data

The default training mix uses the OpenManas datasets at https://huggingface.co/openmanas/datasets. All default sources are text datasets with Apache-2.0 metadata.

Dataset Rows Purpose
openmanas/carnatic_music 30 Carnatic music knowledge
openmanas/kalaripayattu 30 Kalaripayattu knowledge
openmanas/kathakali 30 Kathakali knowledge
openmanas/koodiyattam 30 Koodiyattam knowledge
openmanas/kuchipudi 30 Kuchipudi knowledge
openmanas/mohiniyattam 30 Mohiniyattam knowledge
openmanas/ottamthullal 30 Ottamthullal knowledge
openmanas/theyyam 30 Theyyam knowledge

The generated processed split is:

  • data/processed/train.jsonl: 228 rows
  • data/processed/validation.jsonl: 12 rows
  • data/processed/manifest.json: source manifest

Each processed row contains:

  • messages: chat-style user/assistant messages
  • text: flattened training text
  • source_dataset
  • source_config
  • source_split
  • language

Training Method

The first intended training path is QLoRA supervised fine-tuning from sarvamai/sarvam-30b.

Key defaults:

  • 4-bit NF4 quantization
  • bfloat16 compute
  • LoRA rank 16
  • LoRA alpha 32
  • LoRA dropout 0.05
  • Max sequence length 4096
  • Gradient checkpointing enabled
  • W&B logging configured through report_to: wandb

Training configuration lives in:

training/config.yaml

Prepare Data

Install dependencies:

pip install -r training/requirements.txt

Build processed SFT data from the configured OpenManas datasets:

python training/prepare_ai4bharat_data.py --config training/config.yaml --output-dir data/processed

Expected output:

data/processed/train.jsonl
data/processed/validation.jsonl
data/processed/manifest.json

Train

Run QLoRA fine-tuning:

python training/train_qlora.py \
  --config training/config.yaml \
  --train-file data/processed/train.jsonl \
  --validation-file data/processed/validation.jsonl \
  --output-dir outputs/manas-1-qlora

Push a trained adapter or checkpoint only after a successful small run and evaluation:

python training/train_qlora.py \
  --config training/config.yaml \
  --train-file data/processed/train.jsonl \
  --validation-file data/processed/validation.jsonl \
  --output-dir outputs/manas-1-qlora \
  --push-to-hub \
  --hub-model-id openmanas/manas-1

Hardware Notes

Sarvam-30B is a 30B-class model. Even with QLoRA, training requires a suitable CUDA environment and enough GPU memory. The local repository scripts are ready, but actual training should be run on appropriate GPU infrastructure such as a multi-GPU machine, cloud GPU instance, or managed training job.

Evaluation Plan

Before publishing real weights, evaluate:

  • Malayalam factual quality
  • Cultural knowledge accuracy
  • Hallucination rate on Kerala art-form questions
  • Instruction following
  • English/Malayalam code-switching behavior
  • Safety on medical, legal, financial, and other high-stakes prompts

Suggested evaluation candidates:

  • Held-out OpenManas validation rows
  • Human-written Malayalam cultural questions
  • ai4bharat/IndicIFEval for instruction-following evaluation

Repository Structure

manas-1/
|-- README.md
|-- LICENSE
|-- config.json
|-- tokenizer.json
|-- tokenizer_config.json
|-- vocab.json
|-- merges.txt
|-- model.safetensors
|-- data/
|   |-- README.md
|   |-- raw/
|   `-- processed/
|       |-- manifest.json
|       |-- train.jsonl
|       `-- validation.jsonl
|-- training/
|   |-- config.yaml
|   |-- finetune.py
|   |-- prepare_ai4bharat_data.py
|   |-- pretrain.py
|   |-- requirements.txt
|   `-- train_qlora.py
|-- evals/
|   |-- benchmarks/
|   `-- results/
`-- docs/
    |-- architecture.md
    |-- datasets.md
    |-- model_card.md
    `-- training.md

Important Limitations

  • The repository does not yet contain real trained Manas-1 weights.
  • The current processed dataset is small: 240 total examples.
  • Dataset rows are structured cultural entries, not broad general instruction data.
  • Base-model capability does not guarantee final Manas-1 behavior until actual fine-tuning and evaluation are completed.
  • Use in high-stakes domains is out of scope without domain-specific validation and human oversight.

License

Manas-1 is documented for Apache-2.0 compatibility. The base model sarvamai/sarvam-30b is Apache-2.0, and the current default OpenManas datasets are Apache-2.0.

See LICENSE for the full license text.

References

Downloads last month
34
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for openmanas/manas-1

Finetuned
(6)
this model

Datasets used to train openmanas/manas-1