embedfile
Experimental CLI tool for generating and searching text embeddings, built on llamafile, sqlite-vec, sqlite-lembed, the SQLite CLI, and a few other SQLite extensions.
Model | embedfile | Size (f16 quant) |
---|---|---|
sentence-transformers/all-MiniLM-L6-v2 | all-MiniLM-L6-v2.f16.embedfile | 56MB |
mixedbread-ai/mxbai-embed-xsmall-v1 | mxbai-embed-xsmall-v1-f16.embedfile | 61MB |
nomic-ai/nomic-embed-text-v1.5 | nomic-embed-text-v1.5.f16.embedfile | 273MB |
snowflake-arctic-embed-m-v1.5 | snowflake-arctic-embed-m-v1.5-f16.embedfile | 221MB |
- | embedfile (no embedded model) | 12MB |
A single embedfile runs on Linux, macOS, and Windows, thanks to Cosmopolitan.
You can embed data from CSV, JSON, NDJSON, and .txt files from the CLI, or "eject" to the sqlite3 CLI at any time.
Here's an example, using MixedBread's xsmall model:
$ wget https://huggingface.co/asg017/embedfile/resolve/main/mxbai-embed-xsmall-v1-f16.embedfile
$ chmod u+x mxbai-embed-xsmall-v1-f16.embedfile
$ ./mxbai-embed-xsmall-v1-f16.embedfile --version
embedfile 0.0.1-alpha.1, llamafile 0.8.16, SQLite 3.47.0, sqlite-vec=v0.1.6, sqlite-lembed=v0.0.1-alpha.8
This executable file already has sqlite-vec, sqlite-lembed, and the embeddings model pre-configured. Test that embeddings work with:
$ ./mxbai-embed-xsmall-v1-f16.embedfile embed 'hello!'
[-0.058174,0.043776,0.030660,...]
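If you want to call the embed subcommand from a script, a minimal Python sketch might look like the following. It assumes the command prints the full vector to stdout as a single JSON array (the sample output above is abbreviated with `...`). `parse_embedding` and `embed_text` are hypothetical helper names, not part of embedfile.

```python
import json
import subprocess

# Path to the executable downloaded above; adjust to your own location.
EMBEDFILE = "./mxbai-embed-xsmall-v1-f16.embedfile"


def parse_embedding(stdout: str) -> list[float]:
    """Parse a JSON array of numbers into a list of floats."""
    values = json.loads(stdout)
    if not isinstance(values, list):
        raise ValueError("expected a JSON array")
    return [float(v) for v in values]


def embed_text(text: str, executable: str = EMBEDFILE) -> list[float]:
    """Run `<executable> embed <text>` and return the parsed vector.

    Assumption: the subcommand writes one JSON array to stdout.
    """
    result = subprocess.run(
        [executable, "embed", text],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return parse_embedding(result.stdout)
```

Calling `embed_text('hello!')` would then return the vector as a Python list, provided the executable is present and the output format matches the assumption above.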
You can embed data from CSV, JSON, NDJSON, and .txt files and save the results to a SQLite database. Here we are embedding the text column in the dbpedia.min.csv file, outputting to a dbpedia.db database.
$ ./mxbai-embed-xsmall-v1-f16.embedfile import --embed text dbpedia.min.csv dbpedia.db
INSERT INTO vec_items SELECT rowid, lembed("text") FROM temp.source;
100%|████████████████████| 10000/10000 [02:00<00:00, 83/s]
✔ dbpedia.min.csv imported into dbpedia.db, 10000 items
That was 10,000 rows with 820,604 tokens. I got 83 embeddings per second on my older 2019 Intel MacBook. On my M1 Mac Mini I get 173 embeddings/second, and I'm sure it's faster on newer Macs.
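These throughput figures can be checked against the progress bar with a little arithmetic. The Python sketch below assumes the `02:00` elapsed time means roughly 120 seconds; the bar only reports whole seconds, so the derived rates are approximate.

```python
# Figures taken from the import run above. The elapsed time is read off the
# progress bar ("02:00"), so it is only accurate to the second.
rows = 10_000
tokens = 820_604
elapsed_s = 2 * 60

rows_per_s = rows / elapsed_s        # embeddings per second
tokens_per_s = tokens / elapsed_s    # tokens processed per second
tokens_per_row = tokens / rows       # average input length in tokens

# Estimated wall time for the same import at the quoted M1 Mac Mini rate.
m1_rate = 173
m1_elapsed_s = rows / m1_rate

print(f"{rows_per_s:.0f} embeddings/s, {tokens_per_s:.0f} tokens/s")
print(f"{tokens_per_row:.1f} tokens per row on average")
print(f"about {m1_elapsed_s:.0f} s at {m1_rate} embeddings/s")
```

Under that assumption, each row averages about 82 tokens, and the same import would take just under a minute at the M1 rate.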
Once indexed, you can search with the search command:
$ ./mxbai-embed-xsmall-v1-f16.embedfile search dbpedia.db 'global warming'
3240 0.852299 Attribution of recent climate change is the effort to scientifically ascertain mechanisms ...
6697 0.904844 The global warming controversy concerns the public debate over whether global warming is occurring, how ...
...
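Conceptually, a search like this embeds the query and ranks the stored rows by vector distance, smallest first; the second column above appears to be that distance. The sketch below shows the ranking as a brute-force scan in plain Python. It assumes Euclidean (L2) distance, which is the default metric for sqlite-vec `vec0` tables; the text above does not say which metric embedfile uses. The toy vectors and the `nearest` helper are made up for illustration.

```python
import math


def l2_distance(a: list[float], b: list[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(query, items, k=2):
    """Return the k (rowid, distance) pairs closest to `query`."""
    scored = [(rowid, l2_distance(query, vec)) for rowid, vec in items.items()]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]


# Toy 3-dimensional "embeddings" keyed by rowid (real ones are much longer).
items = {
    1: [1.0, 0.0, 0.0],
    2: [0.0, 1.0, 0.0],
    3: [0.6, 0.8, 0.0],
}

for rowid, distance in nearest([0.0, 1.0, 0.0], items):
    print(rowid, f"{distance:.6f}")
```

A real index avoids recomputing anything by hand, but the output has the same shape as the search results: an id followed by a distance, closest match first.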
At any point, if you want to "eject" and run SQL scripts yourself, the sh command will fire up the sqlite3 CLI with all extensions and embeddings models pre-configured.
$ ./mxbai-embed-xsmall-v1-f16.embedfile sh
SQLite version 3.47.0 2024-10-21 16:30:22
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
sqlite> .mode qbox
sqlite> select sqlite_version(), vec_version(), lembed_version();
┌──────────────────┬───────────────┬──────────────────┐
│ sqlite_version() │ vec_version() │ lembed_version() │
├──────────────────┼───────────────┼──────────────────┤
│ '3.47.0'         │ 'v0.1.6'      │ 'v0.0.1-alpha.8' │
└──────────────────┴───────────────┴──────────────────┘
sqlite> select vec_to_json(vec_slice(lembed('hello!'), 0, 8)) as sample;
┌──────────────────────────────────────────────────────────────┐
│                            sample                            │
├──────────────────────────────────────────────────────────────┤
│ '[-0.058174,0.043776,0.030660,0.047412,-0.059377,-0.036267,0 │
│ .038117,0.005184]'                                           │
└──────────────────────────────────────────────────────────────┘