Genos-m

Genos-m is a foundation model for human-associated microbial genomes. It is trained to model microbial DNA sequences at single-nucleotide resolution and supports ultra-long genomic contexts up to one million tokens.
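To make "single-nucleotide resolution" concrete, here is a minimal sketch of one-token-per-base encoding and the input/target pairs that next-token prediction trains on. The vocabulary, token IDs, and function names are hypothetical and chosen only for illustration; the actual Genos-m tokenizer is documented in the GitHub repository and may differ.

```python
# Hypothetical single-nucleotide tokenizer: one token per base.
# These IDs are illustrative assumptions, NOT the real Genos-m vocabulary.
VOCAB = {"<pad>": 0, "<unk>": 1, "A": 2, "C": 3, "G": 4, "T": 5, "N": 6}


def encode(sequence: str) -> list[int]:
    """Map each nucleotide to one token ID; unknown symbols map to <unk>."""
    return [VOCAB.get(base, VOCAB["<unk>"]) for base in sequence.upper()]


def next_token_pairs(token_ids: list[int]) -> list[tuple[int, int]]:
    """Pair each token with the token that follows it (input, target)."""
    return list(zip(token_ids[:-1], token_ids[1:]))


ids = encode("ACGTN")
pairs = next_token_pairs(ids)
print(len(ids), "tokens for", len("ACGTN"), "bases")
```

With one token per base, sequence length in tokens equals sequence length in bases, so a context of one million tokens would span roughly one megabase of DNA under this scheme.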

For instructions, details, benchmarks, and examples, please refer to the Genos-m GitHub repository and paper.

Model Specification

| Specification | Genos-m-4.7B |
| --- | --- |
| Total parameters | 4.7B |
| Activated parameters | 0.33B |
| Architecture type | MoE |
| Number of experts | 32 |
| Selected experts per token | 2 |
| Number of layers | 12 |
| Attention hidden size | 1024 |
| Number of attention heads | 16 |
| Query groups | 8 |
| MoE hidden size per expert | 4096 |
| Vocabulary size | 128 (padded) |
| Context length | Up to 1M tokens |
| Training objective | Next-token prediction |
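The gap between total and activated parameters follows from the MoE routing: only 2 of 32 experts run per token. The back-of-envelope sketch below estimates both counts from the specification above. It assumes gated feed-forward experts with three weight matrices, grouped-query attention without biases, untied input and output embeddings, and it ignores normalization weights. None of these details are confirmed by the model card, so treat the result as a rough consistency check only.

```python
# Back-of-envelope parameter estimate from the specification values.
HIDDEN = 1024          # attention hidden size
HEADS = 16             # attention heads
QUERY_GROUPS = 8       # key/value groups (grouped-query attention)
LAYERS = 12
EXPERTS = 32
TOP_K = 2              # selected experts per token
EXPERT_HIDDEN = 4096   # MoE hidden size per expert
VOCAB = 128            # padded vocabulary size

# Assumption: each expert is a gated FFN with three weight matrices.
FFN_MATRICES = 3

head_dim = HIDDEN // HEADS
kv_dim = QUERY_GROUPS * head_dim

# Q and output projections are HIDDEN x HIDDEN; K and V are HIDDEN x kv_dim.
attn = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * kv_dim
expert = FFN_MATRICES * HIDDEN * EXPERT_HIDDEN
router = HIDDEN * EXPERTS
embeddings = 2 * VOCAB * HIDDEN  # assumption: untied input/output embeddings

total = LAYERS * (attn + EXPERTS * expert + router) + embeddings
active = LAYERS * (attn + TOP_K * expert + router) + embeddings

print(f"total  ~ {total / 1e9:.2f}B")
print(f"active ~ {active / 1e9:.2f}B")
```

Under these assumptions the estimate lands in the same range as the reported 4.7B total and 0.33B activated parameters, though not exactly on them; the difference reflects architectural details this sketch does not know.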

Training Data

Genos-m was pretrained on curated microbial genome resources, including GTDB R220 representative prokaryotic genomes, public human-associated microbial genomes, in-house high-quality human gut MAGs, and UHGV human gut phage genomes. The final pretraining corpus contains approximately 1.2T tokens and covers 186 phyla, 3,448 families, and 69,056 species. Within this corpus, the retained human-associated prokaryotic subset covers 45 phyla, 585 families, and 12,273 species across major human microbial habitats, including the gut, oral cavity, skin, respiratory tract, and female reproductive tract.

Checkpoints

License

The Genos-m model and code are released under the Apache License 2.0.

Contact

For questions and suggestions, please open an issue.
