Llama 3 8B: RL-MPQ Quantized (Thesis Release)
Quantized variant of meta-llama/Meta-Llama-3-8B using RL-MPQ (Reinforcement Learning Mixed-Precision Quantization): per-layer bit-width policies trained with PPO, validated on WikiText-2 perplexity.
This repo ships five compression scenarios as subfolders, from near-FP16 fidelity to aggressive survival mode, so you can pick the bits-vs-quality trade-off for your thesis experiments.
| Property | Value |
|---|---|
| Base model | meta-llama/Meta-Llama-3-8B |
| Method | RL-MPQ (PPO per-layer bit policy) |
| Format | Fake-quant FP16 weights + rlmpq_policy.json |
| Recommended start | subfolder="Balanced" |
Scenarios
- High_Fidelity/: load with subfolder="High_Fidelity"
- Conservative/: load with subfolder="Conservative"
- Balanced/: load with subfolder="Balanced"
- Aggressive/: load with subfolder="Aggressive"
- Extreme_Survival/: load with subfolder="Extreme_Survival"
Results (WikiText-2)
| Scenario | Avg bits | Compression vs FP16 | Perplexity |
|---|---|---|---|
| High_Fidelity | 6.875 | 2.3273x | 5.8133 |
| Conservative | 5.25 | 3.0476x | 5.9652 |
| Balanced | 4.5 | 3.5556x | 6.0199 |
| Aggressive | 3.7188 | 4.3025x | 6.6793 |
| Extreme_Survival | 2.875 | 5.5652x | 32.393 |
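The compression column is consistent with dividing the 16-bit FP16 baseline by the average bit width. A minimal sketch that checks this reading against the table follows; the helper name `compression_vs_fp16` is illustrative and not part of the repo.

```python
# Average bits per weight, copied from the results table above.
AVG_BITS = {
    "High_Fidelity": 6.875,
    "Conservative": 5.25,
    "Balanced": 4.5,
    "Aggressive": 3.7188,
    "Extreme_Survival": 2.875,
}

FP16_BITS = 16  # bit width of the uncompressed baseline


def compression_vs_fp16(avg_bits: float) -> float:
    """Nominal compression ratio relative to 16-bit weights (illustrative helper)."""
    return FP16_BITS / avg_bits


for name, bits in AVG_BITS.items():
    print(f"{name:17s} {bits:6.4f} bits -> {compression_vs_fp16(bits):.4f}x")
```

Because the checkpoints are exported as fake-quantized FP16, these ratios presumably describe the nominal bit budget of each policy rather than the size of the files on disk.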
Usage
from transformers import AutoModelForCausalLM, AutoTokenizer
repo = "AvoCahDoe/llama-3-8b-rlmpq"
scenario = "Balanced" # High_Fidelity | Conservative | Aggressive | Extreme_Survival
model = AutoModelForCausalLM.from_pretrained(repo, subfolder=scenario, torch_dtype="float16")
tokenizer = AutoTokenizer.from_pretrained(repo, subfolder=scenario)
Important: Always pass subfolder=<scenario>. The root config.json describes the collection; weights and tokenizer live inside each scenario folder.
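Since a missing or misspelled `subfolder` is the easiest mistake to make here, a small wrapper can validate the scenario name before anything is downloaded. This is only a sketch: `SCENARIOS`, `scenario_kwargs`, and `load_scenario` are hypothetical names, and the load call simply mirrors the usage snippet above.

```python
REPO = "AvoCahDoe/llama-3-8b-rlmpq"
SCENARIOS = (
    "High_Fidelity",
    "Conservative",
    "Balanced",
    "Aggressive",
    "Extreme_Survival",
)


def scenario_kwargs(scenario: str = "Balanced") -> dict:
    """Validate the scenario name and return from_pretrained keyword arguments."""
    if scenario not in SCENARIOS:
        raise ValueError(f"unknown scenario {scenario!r}; expected one of {SCENARIOS}")
    return {"subfolder": scenario}


def load_scenario(scenario: str = "Balanced"):
    """Load model and tokenizer for one scenario (this downloads the weights)."""
    # Imported lazily so the validation helper works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    kwargs = scenario_kwargs(scenario)
    model = AutoModelForCausalLM.from_pretrained(REPO, torch_dtype="float16", **kwargs)
    tokenizer = AutoTokenizer.from_pretrained(REPO, **kwargs)
    return model, tokenizer
```

Calling `load_scenario("Conservative")` would then return the model and tokenizer from the Conservative/ folder, while a typo such as `load_scenario("balanced")` fails immediately with a `ValueError`.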
Method (thesis summary)
- Phase 3: PPO agent selects per-layer bit widths under scenario-specific reward targets.
- Phase 4: Policies replayed on real weights; WikiText-2 PPL measures quality retention.
- Export: Fake-quantized FP16 checkpoints (compatible with Hugging Face Transformers).
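To make "fake-quantized" concrete: each weight is rounded to a low-bit grid and immediately mapped back to floating point, so the tensor keeps its floating-point dtype but only takes a limited set of distinct values. The sketch below assumes symmetric, per-tensor uniform quantization. The source does not state which quantizer RL-MPQ uses, and both `fake_quantize` and the `policy` dict are hypothetical stand-ins.

```python
def fake_quantize(weights: list[float], bits: int) -> list[float]:
    """Quantize-dequantize round trip on a symmetric uniform grid (assumed scheme)."""
    if bits < 2:
        raise ValueError("this sketch needs at least 2 bits for a signed grid")
    qmax = 2 ** (bits - 1) - 1            # e.g. 7 for 4-bit signed levels
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)              # all-zero tensor: nothing to quantize
    scale = max_abs / qmax                # one scale shared by the whole tensor
    out = []
    for w in weights:
        q = round(w / scale)              # nearest integer level
        q = max(-qmax, min(qmax, q))      # clamp to the representable range
        out.append(q * scale)             # map back to floating point
    return out


# Hypothetical per-layer bit widths, standing in for a learned policy.
policy = {"layer_0": 8, "layer_1": 4, "layer_2": 2}
weights = [0.9, -0.31, 0.05, -0.72, 0.44]

for layer, bits in policy.items():
    dequant = fake_quantize(weights, bits)
    err = max(abs(a - b) for a, b in zip(weights, dequant))
    print(f"{layer}: {bits} bits, max abs error {err:.4f}")
```

Under this assumed scheme the rounding error per weight is bounded by half the step size, which is why layers assigned fewer bits lose more precision.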
Citation
@misc{rlmpq2026,
title = {RL-MPQ: Reinforcement Learning Mixed-Precision Quantization},
author = {AvoCahDoe},
year = {2026},
url = {https://huggingface.co/AvoCahDoe/llama-3-8b-rlmpq}
}
Part of the RL-NMP-Model-Quantasation thesis framework.