language:
- en
license: mit
pipeline_tag: feature-extraction
tags:
- llamafile
library_name: llamafile
base_model:
- intfloat/e5-mistral-7b-instruct
- second-state/E5-Mistral-7B-Instruct-Embedding-GGUF
model_creator: intfloat
quantized_by: Second State Inc.
e5-mistral-7b-instruct - llamafile
This repository contains executable weights (which we call llamafiles) that run on Linux, macOS, Windows, FreeBSD, OpenBSD, and NetBSD for AMD64 and ARM64.
- Model creator: intfloat
- Original model: intfloat/e5-mistral-7b-instruct
- GGUF weights: second-state/E5-Mistral-7B-Instruct-Embedding-GGUF
- Built with llamafile 0.8.4
Quickstart
Running the following on a desktop OS launches a server on http://localhost:8080
to which you can send HTTP requests to get embeddings:
chmod +x e5-mistral-7b-instruct-Q5_K_M.llamafile
./e5-mistral-7b-instruct-Q5_K_M.llamafile --server --nobrowser --embedding
Then, you can use your favorite HTTP client to call the server's /embedding
endpoint:
curl \
-X POST \
-H "Content-Type: application/json" \
-d '{"content": "Hello, world!"}' \
http://localhost:8080/embedding
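If you would rather call the endpoint from a script, the same request can be sketched in Python using only the standard library. This is a minimal sketch, not an official client: it assumes the server from the Quickstart is running on localhost:8080 and that the JSON response carries the vector under a top-level "embedding" key (check the llamafile server docs for your version). The helper names build_request, get_embedding, and cosine_similarity are made up for illustration.

```python
import json
import math
import urllib.request

# Assumption: the server started in the Quickstart is listening at this address.
SERVER_URL = "http://localhost:8080/embedding"


def build_request(text, url=SERVER_URL):
    """Build the same POST request that the curl example sends."""
    body = json.dumps({"content": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def get_embedding(text, url=SERVER_URL):
    """Send text to the server and return the embedding as a list of floats.

    Assumption: the response JSON has a top-level "embedding" key.
    """
    with urllib.request.urlopen(build_request(text, url)) as response:
        payload = json.load(response)
    return payload["embedding"]


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

With the server running, cosine_similarity(get_embedding("first text"), get_embedding("second text")) gives a score you can use to compare two texts.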
For further information, please see the llamafile README and the llamafile server docs.
Having trouble? See the "Gotchas" section of the README or contact us on Discord.
About llamafile
llamafile is a new format introduced by Mozilla Ocho on Nov 20th 2023. It uses Cosmopolitan Libc to turn LLM weights into runnable llama.cpp binaries that run on the stock installs of six OSes for both ARM64 and AMD64.
About Quantization Formats
Your choice of quantization format depends on three things:
- Will it fit in RAM or VRAM?
- Is your use case reading (e.g. summarization) or writing (e.g. chatbot)?
- Will the file be bigger than 4.30 GB? llamafiles above that size are hard to run on Windows (see gotchas)
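For the first and third questions, a back-of-the-envelope size estimate is often enough. The sketch below multiplies a parameter count by the bits per weight of a format. It assumes a round 7e9 parameters (taken from the "7B" in the model name, not an exact count) and ignores file metadata, tensors kept at higher precision, and runtime memory beyond the weights, so treat the results as rough lower bounds. Q5_K_M is left out because it mixes block types and has no single fixed figure; the function names are made up for illustration.

```python
# Bits per weight implied by each format's block layout.
# Q4_0 and Q8_0 pack 32 weights plus one 16-bit scale per block.
BITS_PER_WEIGHT = {
    "F16": 16.0,
    "BF16": 16.0,
    "Q8_0": 8.5,  # (32 * 8 + 16) / 32
    "Q4_0": 4.5,  # (32 * 4 + 16) / 32
}

WINDOWS_LIMIT_GB = 4.30  # threshold quoted in the list above


def estimate_size_gb(n_params, quant):
    """Rough weight size in GB (10^9 bytes) for a given quantization format."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9


def fits(n_params, quant, budget_gb):
    """True if the estimated weight size is within the given budget in GB."""
    return estimate_size_gb(n_params, quant) <= budget_gb
```

Under these assumptions, fits(7e9, "Q4_0", WINDOWS_LIMIT_GB) is true while fits(7e9, "Q8_0", WINDOWS_LIMIT_GB) is false. Pass your own RAM or VRAM size as budget_gb to check the first question the same way.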
Good quants for writing (eval speed) are Q5_K_M and Q4_0. Text generation is bounded by memory speed, so smaller quants help, but they also cause the LLM to hallucinate more.
Good quants for reading (prompt eval speed) are BF16, F16, Q4_0, and Q8_0 (ordered from fastest to slowest). Prompt evaluation is bounded by computation speed (FLOPS), so simpler quants help.
Note: BF16 is currently only supported on CPU.
See also: https://huggingface.co/docs/hub/en/gguf#quantization-types