Granite Switch 4.0 350M — CTI Technique Mapping
A Granite Switch model: the ibm-granite/granite-4.0-350m base with a single LoRA adapter, cti_technique_mapping, embedded into one switchable checkpoint and activated by a control token.
It is the small-model analogue of ibm-granite/granite-switch-4.1-3b-preview: it uses the same Granite Switch composition machinery (control tokens, KV-hiding, chat-template integration), but is built on the 350M Granite 4.0 base and carries one CTI adapter instead of the preview's granitelib adapter library.
What the adapter does
cti_technique_mapping maps a piece of cyber threat intelligence (CTI) text, a sentence or short passage describing adversary behavior, to the single best-matching MITRE ATT&CK technique ID (e.g. T1059, T1566.001).
The adapter's I/O contract (io_configs/cti_technique_mapping/io.yaml) constrains the output to a single technique-ID string matching ^T[0-9]{4}(\.[0-9]{3})?$ and specifies greedy decoding with max_completion_tokens: 16.
The underlying LoRA scored 96.67% exact-match (290/300) on the held-out CTI validation set.
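Because the I/O contract pins the output to one regular expression, a caller can cheaply re-check a generation before trusting it. The sketch below is a minimal illustration of that check, not part of the model repository; the helper name parse_technique_id is hypothetical, and only the pattern itself comes from the contract quoted above.

```python
import re
from typing import Optional

# Pattern copied from the adapter's I/O contract described above.
TECHNIQUE_ID = re.compile(r"^T[0-9]{4}(\.[0-9]{3})?$")


def parse_technique_id(raw: str) -> Optional[str]:
    """Return the technique ID if `raw` satisfies the contract, else None.

    Hypothetical helper: surrounding whitespace is stripped first because
    decoded model output often carries a trailing newline or space.
    """
    candidate = raw.strip()
    # fullmatch guarantees the whole string is the ID, not just a prefix.
    return candidate if TECHNIQUE_ID.fullmatch(candidate) else None


if __name__ == "__main__":
    for text in ["T1059", " T1566.001\n", "T1059.1", "t1059", "T1059 PowerShell"]:
        print(repr(text), "->", parse_technique_id(text))
```

A None result signals that the output fell outside the contract, for example because the control token was never inserted and the base model answered in free text.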
Composition summary
| Property | Value |
|---|---|
| Base model | ibm-granite/granite-4.0-350m (granitemoehybrid) |
| Embedded adapters | 1 (cti_technique_mapping, LoRA) |
| Control token | <|cti_technique_mapping|> (id 100352) |
| LoRA rank / alpha | 16 / 32 |
| Target modules | q_proj, k_proj, v_proj, o_proj, input_linear, output_linear |
| Base params | 352,379,904 |
| Composed params | 355,362,816 (+0.85%) |
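The +0.85% figure follows directly from the two parameter counts in the table. As a quick sanity check, the arithmetic can be reproduced as below; the variable names are illustrative only.

```python
# Parameter counts copied from the composition summary table.
base_params = 352_379_904
composed_params = 355_362_816

# Parameters added by embedding the adapter into the checkpoint.
added_params = composed_params - base_params

# Relative overhead of the composed checkpoint over the base model.
overhead_pct = 100 * added_params / base_params

print(f"added: {added_params:,} params ({overhead_pct:.2f}% over base)")
```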
Usage
The control token activates the adapter. With the Granite Switch HF backend:
```python
from granite_switch.hf import GraniteSwitchForCausalLM
from transformers import AutoTokenizer

model_id = "barha/granite-switch-4.0-350m-cti"
tok = AutoTokenizer.from_pretrained(model_id)
model = GraniteSwitchForCausalLM.from_pretrained(model_id, device_map="auto")

cti = "The actor used PowerShell to download and execute a payload from a remote server."
messages = [{"role": "user", "content": cti}]

# The chat template inserts <|cti_technique_mapping|> to fire the adapter.
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Greedy decoding with a 16-token cap, matching the adapter's I/O contract.
out = model.generate(inputs, max_new_tokens=16, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt.
print(tok.decode(out[0, inputs.shape[1]:], skip_special_tokens=True))
# -> e.g. "T1059.001"
```
For fast inference, deploy with vLLM (see the granite-switch docs).
How it was built
Composed with granite_switch.composer.compose_granite_switch:
```shell
python -m granite_switch.composer.compose_granite_switch \
  --base-model ibm-granite/granite-4.0-350m \
  --adapters <path-to>/cti_technique_mapping \
  --technology lora \
  --output <out>
```
Note on granitemoehybrid: all Granite 4.0 / Nano models are granitemoehybrid configs (with num_local_experts=0), whose MLP leaves are the fused input_linear/output_linear rather than dense gate/up/down_proj. The CTI LoRA was therefore trained targeting input_linear/output_linear (plus attention q/k/v/o_proj) so it composes cleanly; a scalar/all-linear target would produce phantom gate/up/down_proj weights the composer rejects.
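The note above amounts to a simple rule: the adapter's target list must name only module leaves that exist in the base model. A minimal sketch of a pre-training guard for that rule is shown here. The dict keys mirror the keyword arguments of peft.LoraConfig (r, lora_alpha, target_modules), but the snippet imports nothing from PEFT, and the helper phantom_targets is hypothetical rather than part of the granite-switch tooling.

```python
# Rank, alpha, and target modules copied from the composition summary.
# Key names follow peft.LoraConfig keyword arguments (assumed usage; the
# actual training script is not shown in this README).
LORA_KWARGS = {
    "r": 16,
    "lora_alpha": 32,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "input_linear", "output_linear",
    ],
}

# Dense MLP leaves that, per the note above, the composer rejects.
DENSE_MLP_LEAVES = {"gate_proj", "up_proj", "down_proj"}


def phantom_targets(target_modules):
    """Return, sorted, any targets that name dense MLP leaves (hypothetical guard)."""
    return sorted(DENSE_MLP_LEAVES.intersection(target_modules))


if __name__ == "__main__":
    bad = phantom_targets(LORA_KWARGS["target_modules"])
    print("phantom targets:", bad or "none")
```

Running a check like this before training catches a mismatched target list early, instead of at composition time.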
Files
- model.safetensors: composed base + embedded LoRA weights
- config.json: sets model_type: granite_switch
- adapter_index.json: adapter-to-control-token mapping
- io_configs/cti_technique_mapping/io.yaml: adapter I/O contract (output schema, decoding params)
- chat_template.jinja: control-token-aware chat template
- compose_report.json, BUILD.md: full composition provenance
License
Apache-2.0.
Model tree for barha/granite-switch-4.0-350m-cti
Base model
ibm-granite/granite-4.0-350m-base