---
language:
- en
license: cc-by-nc-4.0
pipeline_tag: feature-extraction
tags:
- llamafile
library_name: llamafile
base_model:
- Salesforce/SFR-Embedding-Mistral
- dranger003/SFR-Embedding-Mistral-GGUF
model_creator: Salesforce
quantized_by: dranger003
---
|
# SFR-Embedding-Mistral - llamafile
|
|
|
This repository contains executable weights (which we call [llamafiles](https://github.com/Mozilla-Ocho/llamafile)) that run on Linux, macOS, Windows, FreeBSD, OpenBSD, and NetBSD for AMD64 and ARM64.
|
|
|
- Model creator: [Salesforce](https://huggingface.co/Salesforce)
- Original model: [Salesforce/SFR-Embedding-Mistral](https://huggingface.co/Salesforce/SFR-Embedding-Mistral)
- GGUF weights: [dranger003/SFR-Embedding-Mistral-GGUF](https://huggingface.co/dranger003/SFR-Embedding-Mistral-GGUF)
- Built with [llamafile 0.8.4](https://github.com/Mozilla-Ocho/llamafile/releases/tag/0.8.4)
|
|
|
## Quickstart
|
|
|
Running the following on a desktop OS will launch a server on `http://localhost:8080` to which you can send HTTP requests to get embeddings:
|
|
|
```
chmod +x ggml-sfr-embedding-mistral-f16.llamafile
./ggml-sfr-embedding-mistral-f16.llamafile --server --nobrowser --embedding
```
|
|
|
Then, you can use your favorite HTTP client to call the server's `/embedding` endpoint:
|
|
|
```
curl \
-X POST \
-H "Content-Type: application/json" \
-d '{"content": "Hello, world!"}' \
http://localhost:8080/embedding
```
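To do something with the returned vector, here is a minimal Python sketch that embeds two strings and compares them with cosine similarity. It assumes the server launched above is listening on `localhost:8080` and that `/embedding` responds with a JSON object containing an `embedding` array of floats:

```
# Minimal sketch: fetch embeddings from the local llamafile server and
# compare them. Assumes the server started above is on localhost:8080
# and that /embedding responds with {"embedding": [...]}.
import json
import math
import urllib.request

def embed(text):
    req = urllib.request.Request(
        "http://localhost:8080/embedding",
        data=json.dumps({"content": text}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["embedding"]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

# Similar texts should score closer to 1.0 than unrelated ones.
print(cosine(embed("Hello, world!"), embed("Goodbye, world!")))
```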
|
|
|
For further information, please see the [llamafile README](https://github.com/mozilla-ocho/llamafile/) and the [llamafile server docs](https://github.com/Mozilla-Ocho/llamafile/blob/main/llama.cpp/server/README.md).
|
|
|
Having **trouble?** See the ["Gotchas" section](https://github.com/mozilla-ocho/llamafile/?tab=readme-ov-file#gotchas) of the README or contact us on [Discord](https://discord.com/channels/1089876418936180786/1182689832057716778).
|
|
|
## About llamafile
|
|
|
llamafile is a new format introduced by Mozilla Ocho on Nov 20th, 2023. It uses Cosmopolitan Libc to turn LLM weights into runnable llama.cpp binaries that run on the stock installs of six OSes for both ARM64 and AMD64.
|
|
|
## About Quantization Formats
|
|
|
Your choice of quantization format depends on three things:
|
|
|
1. Will it fit in RAM or VRAM? (see the rough size estimate after this list)
2. Is your use case reading (e.g. summarization) or writing (e.g. chatbot)?
3. llamafiles bigger than 4.30 GB are hard to run on Windows (see [gotchas](https://github.com/mozilla-ocho/llamafile/?tab=readme-ov-file#gotchas))
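For (1), a rough rule of thumb (an approximation, not an exact file size) is `bytes ≈ parameters × bits per weight ÷ 8`, plus some overhead for the context. SFR-Embedding-Mistral is a 7B-parameter model, so F16 (16 bits/weight) needs roughly 14 GB, Q8\_0 (roughly 8.5 bits/weight) about 7.5 GB, and Q4\_0 (roughly 4.5 bits/weight) about 4 GB.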
|
|
|
Good quants for writing (eval speed) are Q5\_K\_M and Q4\_0. Text generation is bounded by memory speed, so smaller quants help, but they also cause the LLM to hallucinate more.
|
|
|
Good quants for reading (prompt eval speed) are BF16, F16, Q4\_0, and Q8\_0 (ordered from fastest to slowest). Prompt evaluation is bounded by computation speed (flops), so simpler quants help.
|
|
|
Note: BF16 is currently only supported on CPU.
|
|
|
See also: https://huggingface.co/docs/hub/en/gguf#quantization-types
|
|
|
---
|
|
|
# Model Card
|
|
|
See [Salesforce/SFR-Embedding-Mistral](https://huggingface.co/Salesforce/SFR-Embedding-Mistral)