# FineLlama-3.2-3B-Instruct-ead-GGUF
GGUF quantized versions of the [Geraldine/FineLlama-3.2-3B-Instruct-ead](https://huggingface.co/Geraldine/FineLlama-3.2-3B-Instruct-ead) model, optimized for efficient inference with llama.cpp.
## Model Description
- Base Model: FineLlama-3.2-3B-Instruct-ead
- Quantization: Various GGUF formats
- Purpose: EAD tag generation and archival metadata encoding
- Framework: llama.cpp
## Available Variants
The following quantized versions are available:
| Variant | File size |
|---------|-----------|
| Q2_K    | 1.36 GB   |
| Q3_K_M  | 1.69 GB   |
| Q4_K_M  | 2.02 GB   |
| Q5_K_M  | 2.32 GB   |
| Q6_K    | 2.64 GB   |
| Q8_0    | 3.42 GB   |
| FP16    | 6.43 GB   |
## Installation
1. Install llama.cpp following the official instructions
2. Download the desired GGUF model variant
3. Place the model file in your llama.cpp models directory
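As an alternative to a manual download, a variant can be fetched programmatically with the `huggingface_hub` library. A minimal sketch; the exact filename below is an assumption, so verify it against the repository's file list:

```python
from huggingface_hub import hf_hub_download

# Download one quantized variant from the Hub.
# The filename is an assumption -- check the repo's "Files" tab.
model_path = hf_hub_download(
    repo_id="Geraldine/FineLlama-3.2-3B-Instruct-ead-GGUF",
    filename="FineLlama-3.2-3B-Instruct-ead-Q4_K_M.gguf",
)
print(model_path)  # local path to pass to llama.cpp via -m
```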
## Usage
```bash
# Example using Q4_K_M quantization (-p supplies the prompt)
./main -m models/FineLlama-3.2-3B-Instruct-ead-Q4_K_M.gguf -p "Your prompt here" -n 1024 --repeat_penalty 1.1

# Example using server mode
./server -m models/FineLlama-3.2-3B-Instruct-ead-Q4_K_M.gguf -c 4096
```

Note that recent llama.cpp releases rename these binaries to `llama-cli` and `llama-server`.
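Once running, the server exposes an OpenAI-compatible HTTP API. A minimal client sketch, assuming the default address `http://localhost:8080` and an illustrative prompt:

```python
import requests

# Query the llama.cpp server's OpenAI-compatible chat endpoint.
# The address and prompt are illustrative assumptions.
response = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "messages": [
            {"role": "system", "content": "You are an archivist expert in EAD format."},
            {"role": "user", "content": "Generate an EAD <unittitle> element for a 1920s photograph collection."},
        ],
    },
)
print(response.json()["choices"][0]["message"]["content"])
```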
### Example using the llama-cpp-python library
```python
from llama_cpp import Llama

# Download and load the Q8_0 variant directly from the Hugging Face Hub
llm = Llama.from_pretrained(
    repo_id="Geraldine/FineLlama-3.2-3B-Instruct-ead-GGUF",
    filename="*Q8_0.gguf",
    n_ctx=1024,
    verbose=False,
)

query = "..."  # your EAD-related prompt

output = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are an archivist expert in EAD format."},
        {"role": "user", "content": query},
    ]
)

print(output["choices"][0]["message"]["content"])
```
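The `query` placeholder might, for example, carry an EAD-oriented instruction (an illustrative prompt, not part of the original card):

```python
query = "Generate a valid EAD <did> element for a collection of 19th-century correspondence."
```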
### Example using Ollama
```bash
ollama run hf.co/Geraldine/FineLlama-3.2-3B-Instruct-ead-GGUF:Q4_K_M
```
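Ollama also serves a local REST API (by default on `http://localhost:11434`). A minimal sketch using its chat endpoint, with an illustrative prompt:

```python
import requests

# Call the locally running Ollama daemon's chat endpoint;
# stream=False requests a single JSON reply instead of a stream.
response = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "hf.co/Geraldine/FineLlama-3.2-3B-Instruct-ead-GGUF:Q4_K_M",
        "messages": [
            {"role": "user", "content": "Generate an EAD <origination> element for a university archive's papers."}
        ],
        "stream": False,
    },
)
print(response.json()["message"]["content"])
```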
## Quantization Details
- Q2_K: 2-bit quantization; smallest file, lowest fidelity
- Q3_K_M: 3-bit quantization with medium precision
- Q4_K_M: 4-bit quantization with medium precision; a common size/quality balance
- Q5_K_M: 5-bit quantization with medium precision
- Q6_K: 6-bit quantization
- Q8_0: 8-bit quantization, highest precision among the quantized versions
- FP16: full 16-bit floating point, no quantization
## Performance Considerations
- Lower-bit quantizations (Q2_K, Q3_K_M) give the smallest files but reduce output quality, most noticeably at Q2_K
- Higher-bit quantizations (Q6_K, Q8_0) provide better accuracy but require more storage and memory
- FP16 preserves full precision but requires significantly more resources