---
language:
- en
license: cc-by-nc-4.0
pipeline_tag: feature-extraction
tags:
- llamafile
library_name: llamafile
base_model:
- Salesforce/SFR-Embedding-Mistral
- dranger003/SFR-Embedding-Mistral-GGUF
model_creator: Salesforce
quantized_by: dranger003
---
|
# SFR-Embedding-Mistral - llamafile
|
|
|
This repository contains executable weights (which we call [llamafiles](https://github.com/Mozilla-Ocho/llamafile)) that run on Linux, macOS, Windows, FreeBSD, OpenBSD, and NetBSD for AMD64 and ARM64.
|
|
|
- Model creator: [Salesforce](https://huggingface.co/Salesforce)
- Original model: [Salesforce/SFR-Embedding-Mistral](https://huggingface.co/Salesforce/SFR-Embedding-Mistral)
- GGUF weights: [dranger003/SFR-Embedding-Mistral-GGUF](https://huggingface.co/dranger003/SFR-Embedding-Mistral-GGUF)
- Built with [llamafile 0.8.4](https://github.com/Mozilla-Ocho/llamafile/releases/tag/0.8.4)
|
|
|
## Quickstart
|
|
|
Running the following on a desktop OS will launch a server on `http://localhost:8080` to which you can send HTTP requests to get embeddings:
|
|
|
```
chmod +x ggml-sfr-embedding-mistral-f16.llamafile
./ggml-sfr-embedding-mistral-f16.llamafile --server --nobrowser --embedding
```
|
|
|
Then, you can use your favorite HTTP client to call the server's `/embedding` endpoint:
|
|
|
```
curl \
-X POST \
-H "Content-Type: application/json" \
-d '{"content": "Hello, world!"}' \
http://localhost:8080/embedding
```
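To do something with the returned vector, here is a minimal Python sketch that embeds two strings and compares them with cosine similarity. It assumes the server launched above is listening on `localhost:8080` and that `/embedding` responds with a JSON object containing an `embedding` array of floats:

```
# Minimal sketch: fetch embeddings from the local llamafile server and
# compare them. Assumes the server started above is on localhost:8080
# and that /embedding responds with {"embedding": [...]}.
import json
import math
import urllib.request

def embed(text):
    req = urllib.request.Request(
        "http://localhost:8080/embedding",
        data=json.dumps({"content": text}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["embedding"]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Similar texts should score closer to 1.0 than unrelated ones.
print(cosine(embed("Hello, world!"), embed("Goodbye, world!")))
```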
|
|
|
For further information, please see the [llamafile README](https://github.com/mozilla-ocho/llamafile/) and the [llamafile server docs](https://github.com/Mozilla-Ocho/llamafile/blob/main/llama.cpp/server/README.md).
|
|
|
Having **trouble?** See the ["Gotchas" section](https://github.com/mozilla-ocho/llamafile/?tab=readme-ov-file#gotchas) of the README or contact us on [Discord](https://discord.com/channels/1089876418936180786/1182689832057716778).
|
|
|
## About llamafile
|
|
|
llamafile is a new format introduced by Mozilla Ocho on Nov 20th, 2023. It uses Cosmopolitan Libc to turn LLM weights into runnable llama.cpp binaries that run on the stock installs of six OSes for both ARM64 and AMD64.
|
|
|
## About Quantization Formats
|
|
|
Your choice of quantization format depends on three things:
|
|
|
1. Will it fit in RAM or VRAM? (see the rough size estimate after this list)
2. Is your use case reading (e.g. summarization) or writing (e.g. chatbot)?
3. llamafiles bigger than 4.30 GB are hard to run on Windows (see [gotchas](https://github.com/mozilla-ocho/llamafile/?tab=readme-ov-file#gotchas))
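For (1), a rough rule of thumb (an approximation, not an exact file size) is `bytes ≈ parameters × bits per weight ÷ 8`, plus some overhead for the context. SFR-Embedding-Mistral is a 7B-parameter model, so F16 (16 bits/weight) needs roughly 14 GB, Q8\_0 (roughly 8.5 bits/weight) about 7.5 GB, and Q4\_0 (roughly 4.5 bits/weight) about 4 GB.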
|
|
|
Good quants for writing (eval speed) are Q5\_K\_M and Q4\_0. Text generation is bounded by memory speed, so smaller quants help, but they also cause the LLM to hallucinate more.
|
|
|
Good quants for reading (prompt eval speed) are BF16, F16, Q4\_0, and Q8\_0 (ordered from fastest to slowest). Prompt evaluation is bounded by computation speed (flops), so simpler quants help.
|
|
|
Note: BF16 is currently only supported on CPU.
|
|
|
See also: https://huggingface.co/docs/hub/en/gguf#quantization-types
|
|
|
---
|
|
|
# Model Card
|
|
|
See [Salesforce/SFR-Embedding-Mistral](https://huggingface.co/Salesforce/SFR-Embedding-Mistral)