# ReaderLM-v2 GGUF Quantized Models for llama.cpp
This repository contains GGUF quantized versions of the ReaderLM-v2 model by Jina AI. The quantizations target llama.cpp, enabling efficient inference on both CPUs and GPUs.
## Model Information
ReaderLM-v2 is a 1.5-billion-parameter model designed for HTML-to-Markdown and HTML-to-JSON conversion. It supports 29 languages and handles up to 512,000 tokens of combined input and output, making it well suited to extracting structured data from web pages and to a range of other NLP applications.
## Available Quantized Models
| Model File | Quantization Type | Size | Description |
|---|---|---|---|
| `ReaderLM-v2-Q4_K_M.gguf` | Q4_K_M | 986 MB | Lower precision, optimized for CPU performance |
| `ReaderLM-v2-Q8_0.gguf` | Q8_0 | 1.6 GB | Higher precision, better quality |
These quantized versions balance performance and accuracy, making them suitable for different hardware setups.
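If you have not downloaded the files yet, one option is `huggingface-cli` from the `huggingface_hub` package. This is a minimal sketch: `<user>/<repo>` is a placeholder for this repository's ID on Hugging Face, not a real identifier.

```bash
pip install -U huggingface_hub

# <user>/<repo> is a placeholder -- substitute this repository's actual ID
huggingface-cli download <user>/<repo> ReaderLM-v2-Q4_K_M.gguf --local-dir ./models
```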
## Usage
### Running the Model with llama.cpp
Clone and build llama.cpp:
```bash
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
mkdir build && cd build
cmake ..
make -j$(nproc)
```
Run the model:
```bash
./llama-cli --model ReaderLM-v2-Q4_K_M.gguf \
  --no-conversation --no-display-prompt --temp 0 \
  --prompt '<|im_start|>system
Convert the HTML to Markdown.<|im_end|>
<|im_start|>user
<html><body><h1>Hello, world!</h1></body></html><|im_end|>
<|im_start|>assistant
' 2>/dev/null
```
Replace `ReaderLM-v2-Q4_K_M.gguf` with `ReaderLM-v2-Q8_0.gguf` for better output quality at the cost of speed and memory use.
### Using the Model in Python with llama-cpp-python
Install the Python bindings:

```bash
pip install llama-cpp-python
```
```python
from llama_cpp import Llama

# Load the quantized model; ReaderLM-v2 uses the ChatML prompt format.
# Pass n_ctx=... to Llama() if you need a longer context window.
model_path = "./models/ReaderLM-v2-Q4_K_M.gguf"
llm = Llama(model_path=model_path, chat_format="chatml")

output = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "Convert the HTML to Markdown."},
        {
            "role": "user",
            "content": "<html><body><h1>Hello, world!</h1><p>This is a test!</p></body></html>",
        },
    ],
    temperature=0.1,
)

print(output["choices"][0]["message"]["content"].strip())
```
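The same API also covers the model's HTML-to-JSON mode. The sketch below passes a schema inline in the user message; the instruction wording and schema here are illustrative, not the official prompt template, so check the Jina AI model card for the exact phrasing ReaderLM-v2 was trained on.

```python
import json
from llama_cpp import Llama

llm = Llama(model_path="./models/ReaderLM-v2-Q4_K_M.gguf", chat_format="chatml")

# Illustrative schema and instruction -- see the Jina AI model card for the
# official HTML-to-JSON prompt template.
schema = json.dumps({
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "body": {"type": "string"},
    },
}, indent=2)

html = "<html><body><h1>Hello, world!</h1><p>This is a test!</p></body></html>"

output = llm.create_chat_completion(
    messages=[{
        "role": "user",
        "content": (
            "Extract the main content from the given HTML and convert it to "
            f"JSON, following this schema:\n{schema}\n\n{html}"
        ),
    }],
    temperature=0,
)
print(output["choices"][0]["message"]["content"].strip())
```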
## Hardware Requirements
- **Q4_K_M (986 MB)**: runs well on CPUs with 8 GB of RAM or more
- **Q8_0 (1.6 GB)**: requires 16 GB of RAM for smooth performance
For GPU acceleration, compile llama.cpp with CUDA support.
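A minimal sketch of a CUDA build follows. Note that the CMake flag has been renamed across llama.cpp releases (`LLAMA_CUBLAS`, then `LLAMA_CUDA`, currently `GGML_CUDA`), so check the build documentation for the version you are using.

```bash
# From the llama.cpp source tree. GGML_CUDA is the current flag name;
# older releases used LLAMA_CUDA or LLAMA_CUBLAS instead.
mkdir build && cd build
cmake .. -DGGML_CUDA=ON
make -j$(nproc)
```

At run time, pass `--n-gpu-layers` (e.g. `-ngl 99`) to `llama-cli` to offload model layers to the GPU.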
## Credits
- **Original Model**: [ReaderLM-v2](https://huggingface.co/jinaai/ReaderLM-v2) by Jina AI
- **Quantization**: performed with [llama.cpp](https://github.com/ggerganov/llama.cpp)
## License
This model is released under the Creative Commons Attribution-NonCommercial 4.0 license (CC BY-NC 4.0). See the LICENSE file for details.
Last updated: January 31, 2025