LLM-GO

A Go-specialized large language model built with TensorFlow 2 and Python 3.12. Trained on Go code covering every release from 1.0 through 1.24, the Fiber and Cobra ecosystems, real-world project patterns, and Go best practices. Published to Hugging Face as an open-source model under the Apache 2.0 license.



Overview

llm-go is a decoder-only transformer model designed exclusively for Go code generation, completion, and explanation. It understands Go idioms, project layout conventions, the standard library across all major versions, and the most widely used frameworks in the Go ecosystem.

Key goals:

  • Complete coverage of Go 1.0 through 1.24
  • Deep knowledge of Fiber, Cobra, GORM, Gin, Echo, gRPC, and more
  • Enforces canonical Go project layout (cmd/ always at the repo root)
  • Trained on real-world patterns extracted from production Go projects
  • Fully open-source and deployable via the Hugging Face Hub

Architecture

GoLLM is a GPT-style decoder-only transformer with modern improvements from LLaMA/Mistral:

Component             Implementation
Attention             Multi-head causal self-attention
Positional encoding   RoPE (Rotary Position Embedding)
Normalization         RMSNorm (pre-norm, before each sub-layer)
Feed-forward          SwiGLU activation (silu(gate(x)) * up(x))
Embeddings            Tied input/output embeddings
Tokenizer             BPE via Hugging Face tokenizers (Rust-backed)
Training precision    bfloat16 mixed precision
Multi-GPU             TensorFlow MirroredStrategy
Optimizer             AdamW + cosine LR schedule with warmup
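
To make the normalization and feed-forward rows concrete, here is a minimal pure-Python sketch of RMSNorm and a SwiGLU block operating on plain lists. It is an illustration of the two formulas only, not the code in transformer.py; the helper names (rms_norm, swiglu, matvec), the eps value, and the down-projection are assumptions.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Scale x by the reciprocal of its root mean square, then by a learned gain."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def silu(z):
    """SiLU (swish) activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))


def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]


def swiglu(x, w_gate, w_up, w_down):
    """Feed-forward block: down(silu(gate(x)) * up(x))."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)


normed = rms_norm([3.0, 4.0], weight=[1.0, 1.0])
print(normed)
```

After rms_norm with unit gains, the output has a mean square very close to 1, which is the property pre-norm relies on before each sub-layer.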

Special Tokens

The tokenizer uses structural tags so the model understands Go file anatomy:

<go_file>   <go_func>   <go_type>   <go_pkg>   <go_version>
<go_test>   <go_comment>
<task:generate>   <task:complete>   <task:fix>   <task:explain>   <task:optimize>
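
As a rough illustration of how these tags might be combined into a prompt, the sketch below builds the same `<go_version>` / `<go_file>` header used in the Quick Start example. The build_prompt helper is hypothetical, and placing the task tag on the first line is an assumption; check go_tokenizer.py for the ordering the tokenizer actually expects.

```python
# Task names taken from the special-token list above.
TASKS = {"generate", "complete", "fix", "explain", "optimize"}


def build_prompt(source, go_version="1.24", file_path=None, task=None):
    """Prefix Go source with structural tags (hypothetical helper)."""
    parts = []
    if task is not None:
        if task not in TASKS:
            raise ValueError(f"unknown task: {task}")
        # Assumption: the task tag goes before the other tags.
        parts.append(f"<task:{task}>")
    parts.append(f"<go_version>{go_version}</go_version>")
    if file_path is not None:
        parts.append(f"<go_file>{file_path}</go_file>")
    parts.append(source)
    return "\n".join(parts)


print(build_prompt("package main", file_path="cmd/server/main.go", task="complete"))
```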

Model Sizes

Variant   Parameters   d_model   Layers   Heads   Context   Use case
small     ~125M        768       12       12      2048      CPU / fast iteration
medium    ~350M        1024      24       16      2048      Single GPU (default)
large     ~760M        1280      36       20      4096      Multi-GPU
xl        ~1.5B        1600      48       25      4096      Near state-of-the-art

The default training target is medium. Override with MODEL_SIZE=large make train.
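
The parameter counts in the table can be sanity-checked with a back-of-the-envelope estimate. The sketch below assumes about 12·d_model² weights per block (4·d_model² for attention plus 8·d_model² for a SwiGLU feed-forward with a hidden size near 8/3·d_model, as in LLaMA-style models), a 32 000-token vocabulary, and tied embeddings counted once. The feed-forward width is an assumption, so treat the results as approximate.

```python
# (d_model, layers) per variant, copied from the table above.
VARIANTS = {
    "small": (768, 12),
    "medium": (1024, 24),
    "large": (1280, 36),
    "xl": (1600, 48),
}


def estimate_params(d_model, n_layers, vocab_size=32_000):
    """Approximate parameter count; norm gains and biases are ignored."""
    per_block = 12 * d_model * d_model   # assumed: attention + SwiGLU weights
    embeddings = vocab_size * d_model    # tied input/output embeddings
    return n_layers * per_block + embeddings


for name, (d_model, n_layers) in VARIANTS.items():
    print(f"{name:<7}{estimate_params(d_model, n_layers) / 1e6:8.0f} M")
```

Under these assumptions the estimates land within roughly 15% of the nominal sizes, with the largest gap on the small variant.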


Training Data

Real-world corpus

  • Up to 50 000 Go repositories from GitHub (≥10 stars)
  • Go standard library source across all versions (1.0–1.24)
  • Official documentation and release notes

Synthetic patterns (oversampled)

Patterns extracted from real production Go projects and rendered across multiple Go versions, business domains, and application types:

Category            Examples   Source
Fiber controllers   ~36        Struct-based handlers, constructor injection, Swagger
GORM repositories   ~52        UUID PKs, soft delete, repo interface pattern
Service layer       ~32        errgroup, DI container, RabbitMQ consumer
JWT / Auth          ~16        HS256, bcrypt, Bearer middleware, CPF/CNPJ validators
Tests               ~20        go-sqlmock, testify, fiber.App.Test(), table-driven
Docker / CI         ~40        Multi-stage Dockerfile, docker-compose, Jenkinsfile
Total               ~196

Layout examples are oversampled 5× and pattern examples 3× to reinforce correct conventions.

Deduplication

MinHash LSH with 128 permutations, 32 bands, and a 0.80 Jaccard similarity threshold removes near-duplicate files before tokenization.
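
The following standard-library sketch shows how those three numbers fit together: 128 hash values per file are split into 32 bands of 4 rows, files sharing any band become candidate pairs, and candidates are kept as duplicates when their estimated Jaccard similarity reaches 0.80. Shingling on whitespace-separated 5-grams and the salted BLAKE2 hash family are assumptions for illustration; preprocessor.py may differ.

```python
import hashlib
from collections import defaultdict

NUM_PERM = 128
NUM_BANDS = 32
ROWS = NUM_PERM // NUM_BANDS  # 4 rows per band
THRESHOLD = 0.80


def shingles(text, k=5):
    """Set of k-token shingles (assumed granularity)."""
    tokens = text.split()
    if len(tokens) < k:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def minhash(shingle_set):
    """One minimum per seeded hash function gives a 128-value signature."""
    return [
        min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(), "big"
            )
            for s in shingle_set
        )
        for seed in range(NUM_PERM)
    ]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature positions estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / NUM_PERM


def near_duplicates(docs):
    """Return sorted (name_a, name_b) pairs at or above the threshold."""
    sigs = {name: minhash(shingles(text)) for name, text in docs.items()}
    buckets = defaultdict(set)
    for name, sig in sigs.items():
        for band in range(NUM_BANDS):
            key = (band, tuple(sig[band * ROWS:(band + 1) * ROWS]))
            buckets[key].add(name)
    candidates = set()
    for names in buckets.values():
        ordered = sorted(names)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                candidates.add((a, b))
    return sorted(
        pair for pair in candidates
        if estimated_jaccard(sigs[pair[0]], sigs[pair[1]]) >= THRESHOLD
    )
```

Banding only proposes candidates cheaply; the explicit threshold check is what applies the 0.80 cutoff.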

Dataset format

Preprocessed data is stored as sharded TFRecord files in data/processed/{train,val,test}/.


Project Structure

llm-go/
├── cmd/                          # (Go convention: always at root)
├── configs/
│   ├── small.yaml
│   ├── medium.yaml
│   └── large.yaml
├── data/
│   ├── raw/                      # downloaded Go source files
│   ├── processed/                # TFRecord shards
│   └── tokenizer/                # trained BPE tokenizer
├── scripts/
│   ├── setup_env.sh
│   ├── collect_data.sh
│   ├── build_tokenizer.sh
│   ├── preprocess.sh
│   ├── train.sh
│   ├── evaluate.sh
│   ├── generate.sh
│   └── deploy_huggingface.sh
├── src/llm_go/
│   ├── config.py                 # ModelConfig, TrainingConfig, DataConfig
│   ├── model/
│   │   ├── attention.py          # RoPE + MultiHeadAttention
│   │   └── transformer.py        # RMSNorm, SwiGLU, TransformerBlock, GoLLM
│   ├── tokenizer/
│   │   └── go_tokenizer.py       # BPE + structural tag injection
│   ├── data/
│   │   ├── collector.py          # GitHub + stdlib scraper
│   │   ├── preprocessor.py       # filter → dedup → tokenize → TFRecord
│   │   ├── go_best_practices.py  # GoProjectTemplates + GoLayoutValidator
│   │   ├── templates/
│   │   │   ├── loader.py
│   │   │   └── go_project/       # canonical cmd/ layout examples
│   │   └── patterns/
│   │       ├── fiber_patterns.py
│   │       ├── gorm_patterns.py
│   │       ├── service_patterns.py
│   │       ├── auth_patterns.py
│   │       ├── test_patterns.py
│   │       ├── docker_patterns.py
│   │       └── registry.py       # PatternRegistry (~196 examples)
│   ├── training/
│   │   ├── trainer.py            # gradient accumulation, MirroredStrategy
│   │   └── lr_schedule.py        # CosineWithWarmup
│   ├── evaluation/
│   │   └── metrics.py            # perplexity, pass@k, gofmt rate, BLEU, ROUGE-L
│   ├── deployment/
│   │   └── hf_uploader.py        # safetensors + model card → HF Hub
│   └── scripts/                  # CLI entry points
│       ├── collect.py
│       ├── tokenize.py
│       ├── train.py
│       ├── evaluate.py
│       ├── generate.py
│       └── deploy.py
├── tests/
│   ├── conftest.py               # shared pytest fixtures
│   ├── test_model.py
│   ├── test_tokenizer.py
│   └── test_best_practices.py
├── checkpoints/                  # saved during training
├── logs/                         # TensorBoard event files
├── Makefile
├── pyproject.toml
├── requirements.txt
└── requirements-gpu.txt

Requirements

  • Python 3.12
  • TensorFlow 2.17.1 (CPU) or tensorflow[and-cuda] for GPU
  • CUDA 12.x + cuDNN 8.x (optional, GPU only)

Python 3.12 compatibility notes

Package           Version    Note
tensorflow        2.17.1     cp312 wheel confirmed (manylinux)
keras             3.5.0      compatible with TF 2.17.x
numpy             1.26.4     TF 2.17.x requires numpy < 2
tensorboard       2.17.1     must match TF version
tensorflow-text   n/a        skipped 2.17.x release; not used (tokenization via HF tokenizers)
tree-sitter       optional   core pipeline uses regex tagging; see requirements.txt comments

Quick Start

1. Clone and install

git clone https://github.com/your-org/llm-go.git
cd llm-go

# CPU
bash scripts/setup_env.sh

# GPU (NVIDIA CUDA 12)
bash scripts/setup_env.sh --gpu

Or manually:

python3.12 -m venv venv
source venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
pip install -e ".[dev]"

2. Generate code (from a pre-trained checkpoint)

# using the Makefile
make generate

# or directly
llm-go-generate \
  --model-dir checkpoints/final \
  --tok-dir   data/tokenizer \
  --prompt    "package main\n\nimport \"github.com/gofiber/fiber/v2\"\n\nfunc main() {"

3. Generate with a Python script

from llm_go.model.transformer import GoLLM
from llm_go.tokenizer.go_tokenizer import GoTokenizer

tok   = GoTokenizer.load("data/tokenizer")
model = GoLLM.from_pretrained("checkpoints/final")

prompt = """<go_version>1.24</go_version>
<go_file>cmd/server/main.go</go_file>
package main

import "github.com/gofiber/fiber/v2"

func main() {"""

ids    = tok.encode(prompt)
output = model.generate(ids, max_new_tokens=256, temperature=0.8, top_p=0.95)
print(tok.decode(output))

Pipeline

Run each stage individually or all at once with make pipeline.

Stage 1: Collect data

export GITHUB_TOKEN=ghp_...
make collect
# or
bash scripts/collect_data.sh

Downloads Go repositories (≥10 stars, configurable) and the standard library into data/raw/.

Stage 2: Build tokenizer

make tokenize
# or
bash scripts/build_tokenizer.sh

Trains a 32 000-token BPE vocabulary on the raw corpus with Go keywords, builtins, and packages seeded as the initial alphabet.

Stage 3: Preprocess

make preprocess
# or
bash scripts/preprocess.sh

Applies quality filtering → MinHash LSH deduplication → PII scrubbing → tokenization → sequence packing → TFRecord sharding.

Synthetic layout and pattern examples are prepended and oversampled before the real data.
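
Sequence packing is the least self-explanatory step in that chain. A minimal sketch, assuming files are concatenated with an end-of-sequence token and cut into fixed-length windows; the pack_sequences name and the eos_id value are hypothetical, and the real preprocessor may handle the leftover tail differently.

```python
def pack_sequences(token_lists, seq_len, eos_id):
    """Concatenate tokenized files and slice them into seq_len-sized rows.

    Returns the full rows plus the leftover tokens that did not fill a row.
    """
    buffer = []
    packed = []
    for tokens in token_lists:
        buffer.extend(tokens)
        buffer.append(eos_id)  # mark the file boundary
        while len(buffer) >= seq_len:
            packed.append(buffer[:seq_len])
            buffer = buffer[seq_len:]
    return packed, buffer


rows, tail = pack_sequences([[1, 2, 3], [4, 5], [6]], seq_len=4, eos_id=0)
print(rows, tail)
```

Packing avoids spending context on padding: short files share a row instead of each occupying one alone.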

Stage 4: Train

# Default: medium model, bfloat16, all available GPUs
make train

# Choose size
make train-small
make train-large
MODEL_SIZE=xl make train

# Custom
MODEL_SIZE=medium BATCH_SIZE=64 MAX_STEPS=200000 bash scripts/train.sh

Training uses XLA JIT compilation, gradient accumulation (default 4 steps), and TensorFlow MirroredStrategy for multi-GPU.
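
Gradient accumulation can be illustrated without TensorFlow. The toy below fits y ≈ w·x and shows that scaling each micro-batch gradient by 1/4 and summing over 4 equally sized micro-batches gives the same update as one step on the combined batch. It uses plain SGD instead of AdamW for brevity, and the function names are assumptions, not the trainer.py API.

```python
def mse_grad(w, batch):
    """d/dw of the mean squared error of y ≈ w * x over one batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_update(w, micro_batches, lr):
    """Sum scaled micro-batch gradients, then apply a single update."""
    accum = 0.0
    for micro_batch in micro_batches:
        accum += mse_grad(w, micro_batch) / len(micro_batches)
    return w - lr * accum


data = [(float(x), 3.0 * x) for x in range(1, 9)]
micro_batches = [data[i:i + 2] for i in range(0, len(data), 2)]  # 4 micro-batches

w_accum = accumulated_update(0.0, micro_batches, lr=0.01)
w_full = 0.0 - 0.01 * mse_grad(0.0, data)
print(w_accum, w_full)
```

The practical benefit is memory: only one micro-batch of activations is live at a time, while the optimizer sees the gradient of the larger batch.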

Monitor with TensorBoard:

make tb
# opens http://localhost:6006

Stage 5: Evaluate

make evaluate
# or
bash scripts/evaluate.sh

Reports perplexity, pass@k (unbiased estimator), gofmt syntax pass rate, BLEU, and ROUGE-L.

Stage 6: Deploy to Hugging Face

export HF_TOKEN=hf_...
export HF_REPO_ID=your-org/llm-go-350m

make deploy
# or
bash scripts/deploy_huggingface.sh

Converts Keras weights to SafeTensors format, uploads the tokenizer as PreTrainedTokenizerFast, and generates a model card automatically.


Go Layout Rule

One of the core conventions this model learns and enforces:

cmd/ is always at the project root. Each binary lives in its own subdirectory with a main.go.

my-project/              ← project root
├── cmd/
│   ├── server/
│   │   └── main.go      ← binary: server
│   ├── worker/
│   │   └── main.go      ← binary: background worker
│   └── cli/
│       └── main.go      ← binary: CLI tool
├── internal/
│   ├── config/
│   ├── handler/
│   └── service/
├── go.mod
└── go.sum

main.go only wires dependencies. All business logic lives in internal/. The cmd/ directory is never nested inside internal/, pkg/, or any other subdirectory.

The GoLayoutValidator class enforces this during data collection: files from repositories with a nested or missing cmd/ receive a lower training weight.
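
A check of this kind could look roughly like the sketch below, which classifies a repository's file list as having cmd/ at the root, nested, or missing. The classify_cmd_layout and training_weight functions and the 0.5 penalty are hypothetical stand-ins; the actual GoLayoutValidator interface and weights are defined in go_best_practices.py.

```python
from pathlib import PurePosixPath


def classify_cmd_layout(paths):
    """Return 'root', 'nested', or 'missing' for repo-relative file paths."""
    has_root = False
    has_nested = False
    for path in paths:
        dirs = PurePosixPath(path).parts[:-1]  # directory components only
        if "cmd" not in dirs:
            continue
        if dirs[0] == "cmd":
            has_root = True
        else:
            has_nested = True
    if has_nested:
        return "nested"
    return "root" if has_root else "missing"


def training_weight(layout, penalty=0.5):
    """Hypothetical weighting: full weight only for a root-level cmd/."""
    return 1.0 if layout == "root" else penalty


repo = ["go.mod", "cmd/server/main.go", "internal/service/user.go"]
print(classify_cmd_layout(repo), training_weight(classify_cmd_layout(repo)))
```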


Supported Frameworks

GoLLM is trained on idiomatic usage of the following libraries:

Framework                             Purpose
github.com/gofiber/fiber/v2           HTTP server (primary)
github.com/spf13/cobra                CLI applications
github.com/spf13/viper                Configuration
gorm.io/gorm                          ORM + PostgreSQL
github.com/gin-gonic/gin              HTTP server (alternative)
github.com/labstack/echo              HTTP server (alternative)
github.com/go-chi/chi                 Lightweight HTTP router
google.golang.org/grpc                gRPC services
github.com/stretchr/testify           Testing assertions
go.uber.org/zap                       Structured logging
github.com/golang-jwt/jwt             JWT authentication
golang.org/x/crypto/bcrypt            Password hashing
github.com/rabbitmq/amqp091-go        RabbitMQ messaging
github.com/redis/go-redis/v9          Redis client
github.com/prometheus/client_golang   Metrics
github.com/DATA-DOG/go-sqlmock        SQL mocking in tests

Configuration

Training parameters can be set via environment variables, YAML configs, or Makefile overrides.

# Environment variables (all optional; defaults shown)
MODEL_SIZE=medium        # small | medium | large | xl
BATCH_SIZE=32
MAX_STEPS=100000
WARMUP_STEPS=2000
GRAD_ACCUM=4
PRECISION=bfloat16       # float32 | float16 | bfloat16
GPUS=-1                  # -1 = all GPUs, 0 = GPU 0 only
CKPT_DIR=checkpoints
LOG_DIR=logs

YAML configs for each size are in configs/:

# train from a YAML config
llm-go-train --config configs/large.yaml

Evaluation

Metrics computed by GoCodeEvaluator:

Metric            Description
Perplexity        Cross-entropy exponentiated on the validation split
pass@k            Unbiased estimator of functional correctness (k = 1, 10, 100)
gofmt pass rate   % of generated files that parse and format without error
BLEU              n-gram overlap vs. reference completions
ROUGE-L           Longest-common-subsequence F1 vs. references
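
For reference, the first two metrics reduce to short formulas. The sketch below uses the standard unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k) for n samples of which c pass, and perplexity as the exponential of the mean per-token negative log-likelihood. The function names are illustrative and are not taken from GoCodeEvaluator.

```python
import math


def pass_at_k(n, c, k):
    """Probability that at least one of k samples drawn from n is correct."""
    if n - c < k:
        return 1.0  # too few failures to fill k draws with failures only
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def perplexity(token_nlls):
    """exp of the mean negative log-likelihood (natural log) per token."""
    return math.exp(sum(token_nlls) / len(token_nlls))


print(pass_at_k(n=10, c=3, k=1))
print(perplexity([math.log(4)] * 3))
```

Note that n must be at least k, so reporting pass@100 requires at least 100 samples per problem.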

Deploying to Hugging Face

The uploader (HuggingFaceUploader) handles everything:

  1. Converts Keras weights → SafeTensors
  2. Writes config.json in GPT-2-compatible format
  3. Uploads PreTrainedTokenizerFast (usable with transformers)
  4. Generates a model card with usage examples
  5. Optionally creates a Gradio demo space

export HF_TOKEN=hf_...
export HF_REPO_ID=your-org/llm-go-350m

llm-go-deploy \
  --ckpt-dir checkpoints/final \
  --tok-dir  data/tokenizer \
  --repo-id  "$HF_REPO_ID" \
  --token    "$HF_TOKEN" \
  --public

Once uploaded, use the model from any Python environment:

from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("your-org/llm-go-350m")
model     = AutoModelForCausalLM.from_pretrained("your-org/llm-go-350m")

inputs = tokenizer("package main\n\nfunc main() {", return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0]))

Development

Run tests

make test
# or
pytest tests/ -v --cov=llm_go --cov-report=term-missing

Lint and format

make lint    # ruff + mypy
make fmt     # black + ruff --fix

Pre-commit hooks

pre-commit install

GPU setup (NVIDIA)

pip install -r requirements-gpu.txt
# verify
python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

License

Apache 2.0 (see LICENSE).

Patterns derived from real-world Go projects are used for educational and model-training purposes only. All generated code is original output of the model.
