Genos-m

Genos-m is a foundation model for human-associated microbial genomes. It is trained to model microbial DNA sequences at single-nucleotide resolution and supports ultra-long genomic contexts up to one million tokens.
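To make "single-nucleotide resolution" concrete, the sketch below maps each base to exactly one token, so a context of one million tokens corresponds to roughly one million bases. The `VOCAB` mapping and the `encode` helper are hypothetical and for illustration only; the actual Genos-m token IDs and special tokens are not described here.

```python
# Hypothetical vocabulary for illustration; not the real Genos-m token IDs.
VOCAB = {"A": 0, "C": 1, "G": 2, "T": 3, "N": 4}


def encode(sequence: str) -> list[int]:
    """Map each nucleotide to one token ID; unknown symbols fall back to N."""
    return [VOCAB.get(base, VOCAB["N"]) for base in sequence.upper()]


tokens = encode("ATGCGTNA")
# One token per base, so the token count equals the sequence length.
print(len(tokens), tokens)
```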

For instructions, details, benchmarks, and examples, please refer to the Genos-m GitHub repository and paper.

Model Specification

Specification	Genos-m-4.7B
Total parameters	4.7B
Activated parameters	0.33B
Architecture type	MoE
Number of experts	32
Selected experts per token	2
Number of layers	12
Attention hidden size	1024
Number of attention heads	16
Query groups	8
MoE hidden size per expert	4096
Vocabulary size	128 (padded)
Context length	up to 1M tokens
Training objective	next-token prediction
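As a rough sanity check on the table, the following sketch estimates total and activated weight counts from the listed dimensions. It is a back-of-envelope calculation, not the official accounting: it assumes gated feed-forward experts with three weight matrices each, an MoE block in every layer, untied input and output embeddings, and it ignores biases and normalization weights. None of these assumptions is confirmed by the specification.

```python
# Values copied from the specification table.
HIDDEN = 1024          # attention hidden size
LAYERS = 12
HEADS = 16
QUERY_GROUPS = 8
EXPERTS = 32
TOP_K = 2              # selected experts per token
EXPERT_HIDDEN = 4096   # MoE hidden size per expert
VOCAB = 128

# Assumption: each expert is a gated feed-forward block with three matrices.
FFN_MATRICES = 3


def estimate_params() -> tuple[int, int]:
    """Return (total, activated) weight counts, ignoring biases and norms."""
    head_dim = HIDDEN // HEADS
    kv_dim = QUERY_GROUPS * head_dim
    # Q and output projections are HIDDEN x HIDDEN; K and V are HIDDEN x kv_dim.
    attention = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * kv_dim
    expert = FFN_MATRICES * HIDDEN * EXPERT_HIDDEN
    router = HIDDEN * EXPERTS
    # Assumption: untied input and output embeddings.
    embeddings = 2 * VOCAB * HIDDEN
    total = LAYERS * (attention + EXPERTS * expert + router) + embeddings
    activated = LAYERS * (attention + TOP_K * expert + router) + embeddings
    return total, activated


total, activated = estimate_params()
print(f"total ~ {total / 1e9:.2f}B, activated ~ {activated / 1e9:.2f}B")
```

Under these assumptions the estimate comes out a few percent above the reported 4.7B total and 0.33B activated parameters, so it is best read as a consistency check on the order of magnitude rather than an exact breakdown.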

Training Data

Genos-m was pretrained on curated microbial genome resources, including GTDB R220 representative prokaryotic genomes, public human-associated microbial genomes, in-house high-quality human gut MAGs, and UHGV human gut phage genomes. The final pretraining corpus contains approximately 1.2T tokens and covers 186 phyla, 3,448 families, and 69,056 species. Within this corpus, the retained human-associated prokaryotic subset covers 45 phyla, 585 families, and 12,273 species across major human microbial habitats, including the gut, oral cavity, skin, respiratory tract, and female reproductive tract.

Checkpoints

HF-Transformers checkpoint: BGI-HangzhouAI/Genos-m-4.7B
Megatron-LM checkpoint: BGI-HangzhouAI/Genos-m-Megatron-4.7B
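
A minimal sketch of loading the HF-Transformers checkpoint with the standard `transformers` auto classes is given here. The helper name `load_genos_m` is hypothetical, and both `trust_remote_code=True` and the availability of a tokenizer through `AutoTokenizer` are assumptions; the Genos-m GitHub repository remains the authoritative source for usage.

```python
MODEL_ID = "BGI-HangzhouAI/Genos-m-4.7B"  # repository name from the checkpoint list


def load_genos_m(model_id: str = MODEL_ID):
    """Load tokenizer and model; weights are downloaded on the first call."""
    # Imported lazily so this module can be read without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Assumption: the repository ships custom code, hence trust_remote_code=True.
    tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
    model.eval()  # inference mode: disables dropout
    return tokenizer, model
```

Calling `tokenizer, model = load_genos_m()` would fetch the checkpoint from the Hugging Face Hub, which requires network access and enough memory for the full set of weights.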

License

The Genos-m model and code are released under the Apache License 2.0.

Contact

For questions and suggestions, please open an issue.
