license: apache-2.0
language:
- en
tags:
- biology
- genomics
- llama
- fine-tuned
- plasmid
- gene-function
- genome-assembly
- gene-essentiality
pipeline_tag: text-generation
base_model: meta-llama/Meta-Llama-3.1-8B
GenSyntax
GenSyntax is a fine-tuned large language model for genomic sequence analysis and inference. Built on the Llama 3.1 8B architecture, it is specifically adapted for five core genomic tasks: plasmid host identification, gene function prediction, genome assembly, gene essentiality prediction, and minimal genome derivation.
Model Details
| Property | Value |
|---|---|
| Base Model | Meta-Llama-3.1-8B |
| Architecture | LlamaForCausalLM |
| Parameters | ~8B |
| Hidden Size | 4096 |
| Layers | 32 |
| Attention Heads | 32 (GQA: 8 KV heads) |
| Context Length | 131,072 tokens |
| Precision | bfloat16 |
Intended Use
GenSyntax is designed for computational biology researchers who need to apply LLM-based reasoning to genomic sequences. It supports the following inference tasks:
- Plasmid Host Identification — predict the bacterial host range of a plasmid from its sequence.
- Gene Function Prediction — infer the functional annotation of a gene given its sequence context.
- Genome Assembly — reconstruct genome sequences from contig fragments.
- Gene Essentiality Prediction — classify whether a gene is essential for cell survival.
- Minimal Genome Derivation — determine the minimal gene set required for a viable organism.
Hardware Requirements
A single NVIDIA RTX 4090 (24 GB VRAM) is sufficient for inference in bfloat16. Multi-GPU setups are supported via device_map="auto", which distributes the model's layers across the available devices.
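As a rough check on the hardware figure, memory use can be estimated from the Model Details table. The sketch below is back-of-the-envelope arithmetic, not a measurement. It assumes exactly 8e9 parameters, 2 bytes per bfloat16 value, a head dimension equal to hidden size divided by attention heads, and a bfloat16 key/value cache. The function names are illustrative and not part of GenSyntax.

```python
# Back-of-the-envelope memory estimate from the Model Details table.
# Assumptions (not measurements): 8e9 parameters, 2 bytes per bfloat16 value,
# head_dim = hidden_size / attention_heads, and a bfloat16 KV cache.

GIB = 1024 ** 3


def weight_bytes(n_params: float = 8e9, bytes_per_value: int = 2) -> float:
    """Approximate size of the model weights in bytes."""
    return n_params * bytes_per_value


def kv_cache_bytes_per_token(
    layers: int = 32,
    kv_heads: int = 8,
    hidden_size: int = 4096,
    attention_heads: int = 32,
    bytes_per_value: int = 2,
) -> int:
    """Bytes of key/value cache needed for one token of context."""
    head_dim = hidden_size // attention_heads  # 4096 / 32 = 128
    # Factor 2 covers one key tensor and one value tensor per layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_value


if __name__ == "__main__":
    print(f"weights: {weight_bytes() / GIB:.1f} GiB")
    print(f"KV cache per token: {kv_cache_bytes_per_token() / 1024:.0f} KiB")
    full_context = kv_cache_bytes_per_token() * 131_072
    print(f"KV cache at 131,072 tokens: {full_context / GIB:.0f} GiB")
```

Under these assumptions the weights alone take about 14.9 GiB, which fits on a 24 GB card. The cache costs 128 KiB per token, so a full 131,072-token context would add about 16 GiB and would not fit alongside the weights on a single such card. Very long inputs may therefore need a shorter context or more than one GPU.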
How to Use
Load the Model
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_path = "MoonTideF/GenSyntax"  # or local path
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
Inference Scripts
Clone the GenSyntax repository and use the provided scripts:
git clone https://github.com/nishiwen1214/GenSyntax.git
cd GenSyntax
pip install -r requirements.txt
Plasmid Host Identification
python Plasmid_host_identification.py \
--model /path/to/GenSyntax \
--input-json-paths test_data/gene_task1_test_1000_format.json
Gene Function Prediction
python Gene_function_prediction.py \
--model /path/to/GenSyntax \
--input-json-paths test_data/gene_task2_test_500_opts.json
Genome Assembly
python Genome_assembly.py \
--model /path/to/GenSyntax \
--input-json-paths test_data/gene_task3_test_500_contig3_format.json
Gene Essentiality Prediction
python Gene_essentiality_prediction.py \
--model /path/to/GenSyntax \
--input-json-paths test_data/gene_task4_test_1000_format.json
Minimal Genome Derivation
python minimal_genome_inference.py \
--model /path/to/GenSyntax \
--input-json-paths test_data/bacteria_chromosomes_9-mini.json
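The five commands above share one shape, so they can be driven from a single table. The following is a minimal sketch of such a driver, assuming it is run from the root of the cloned repository. The script names and test files are copied from the commands above, while `TASKS`, `build_command`, and `run_all` are hypothetical helpers that the repository does not provide.

```python
import subprocess

# Script and test-file names are copied from the commands above.
# The task keys are illustrative labels, not identifiers used by GenSyntax.
TASKS = {
    "plasmid_host_identification": (
        "Plasmid_host_identification.py",
        "test_data/gene_task1_test_1000_format.json",
    ),
    "gene_function_prediction": (
        "Gene_function_prediction.py",
        "test_data/gene_task2_test_500_opts.json",
    ),
    "genome_assembly": (
        "Genome_assembly.py",
        "test_data/gene_task3_test_500_contig3_format.json",
    ),
    "gene_essentiality_prediction": (
        "Gene_essentiality_prediction.py",
        "test_data/gene_task4_test_1000_format.json",
    ),
    "minimal_genome_derivation": (
        "minimal_genome_inference.py",
        "test_data/bacteria_chromosomes_9-mini.json",
    ),
}


def build_command(task: str, model_path: str) -> list[str]:
    """Return the argument list for one task, mirroring the commands above."""
    script, input_json = TASKS[task]
    return ["python", script, "--model", model_path, "--input-json-paths", input_json]


def run_all(model_path: str) -> None:
    """Run every task in order, stopping at the first failing script."""
    for task in TASKS:
        print(f"Running {task} ...")
        subprocess.run(build_command(task, model_path), check=True)
```

Calling `run_all("/path/to/GenSyntax")` would then run the five tasks one after another. `check=True` makes a non-zero exit status raise an exception, so a failing task is not silently skipped.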
Training Data
The training and evaluation datasets are available on HuggingFace:
👉 GenSyntax Datasets on HuggingFace
The dataset includes complete test sets for each task, along with training and test data for cell phenotype prediction.
Generation Config
| Parameter | Value |
|---|---|
| temperature | 0.6 |
| top_p | 0.9 |
| do_sample | True |
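These sampling settings can also be passed explicitly when calling `generate` on a model and tokenizer loaded as in "Load the Model". The sketch below is one way to do that. `generate_text` and its `max_new_tokens` default are hypothetical and chosen for illustration. The prompt format each task expects is defined by the repository's inference scripts and is not reproduced here.

```python
# Sampling settings taken from the Generation Config table.
GENERATION_KWARGS = {
    "do_sample": True,
    "temperature": 0.6,
    "top_p": 0.9,
}


def generate_text(model, tokenizer, prompt: str, max_new_tokens: int = 64) -> str:
    """Sample a continuation of `prompt` and return only the newly generated text.

    `model` and `tokenizer` are the objects returned by
    AutoModelForCausalLM.from_pretrained and AutoTokenizer.from_pretrained.
    `max_new_tokens` is an illustrative default, not a value from the model card.
    """
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        **GENERATION_KWARGS,
    )
    # generate() returns prompt tokens followed by new tokens; drop the prompt.
    prompt_length = inputs["input_ids"].shape[1]
    return tokenizer.decode(output_ids[0][prompt_length:], skip_special_tokens=True)
```

Because `do_sample` is enabled, repeated calls with the same prompt can return different outputs. For reproducible runs, fix the random seed before generation, for example with `torch.manual_seed`.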
Citation
If you use GenSyntax in your research, please cite the corresponding paper and link to the GitHub repository (https://github.com/nishiwen1214/GenSyntax).
License
This model is released under the Apache 2.0 License.