Add model

Files changed (5)

.gitattributes +2 -0
README.md +51 -1
config.json +28 -0
documents +3 -0
embeddings +3 -0

.gitattributes CHANGED

@@ -32,3 +32,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+documents filter=lfs diff=lfs merge=lfs -text
+embeddings filter=lfs diff=lfs merge=lfs -text

README.md CHANGED

@@ -1,3 +1,53 @@
 ---
-license: cc-by-sa-3.0
+inference: false
+language: en
+license:
+- cc-by-sa-3.0
+- gfdl
+library_name: txtai
+tags:
+- sentence-similarity
+datasets:
+- olm/olm-wikipedia-20221220
 ---
+# Wikipedia txtai embeddings index
+This is a [txtai](https://github.com/neuml/txtai) embeddings index for the [English edition of Wikipedia](https://en.wikipedia.org/).
+This index is built from the [OLM Wikipedia December 2022 dataset](https://huggingface.co/datasets/olm/olm-wikipedia-20221220).
+Only the first paragraph of the [lead section](https://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Lead_section) from each article is included in the index.
+This is similar to an abstract of the article.
+It also uses [Wikipedia Page Views](https://dumps.wikimedia.org/other/pageviews/readme.html) data to add a `percentile` field. The `percentile` field can be used
+to restrict matches to commonly visited pages.
+txtai must be [installed](https://neuml.github.io/txtai/install/) to use this model.
+## Example
+txtai 5.4 added support for loading embeddings indexes from the Hugging Face Hub. See the example below.
+```python
+from txtai.embeddings import Embeddings
+# Load the index from the HF Hub
+embeddings = Embeddings()
+embeddings.load(provider="huggingface-hub", container="neuml/txtai-wikipedia")
+# Run a search
+embeddings.search("Roman Empire")
+# Run a search matching only the top 1% of articles by page views
+embeddings.search("""
+   SELECT id, text, score, percentile FROM txtai WHERE similar('Boston') AND
+   percentile >= 0.99
+""")
+```
+## Use Cases
+An embeddings index generated by txtai is a fully encapsulated index format. It doesn't require a database server or dependencies outside of the Python install.
+The Wikipedia index works well as a fact-based context source for conversational search. In other words, search results from this index can be passed to LLM prompts as the
+context in which to answer questions.
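
Note on the Use Cases paragraph above: passing search results to an LLM prompt could look roughly like the sketch below. `build_prompt` is a hypothetical helper, not part of txtai, and the prompt wording is an assumption. It only relies on result rows being dicts with a `text` key, which is the shape a content-enabled index returns from `embeddings.search`. The row shown is a placeholder, not real index output.

```python
def build_prompt(question, results, limit=3):
    """Format search results as context for an LLM prompt (hypothetical helper)."""
    # Keep only the top rows and render each abstract as a bullet
    context = "\n".join(f"- {row['text']}" for row in results[:limit])
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Placeholder row standing in for embeddings.search("Roman Empire", 3)
results = [
    {"id": "Roman Empire", "text": "Placeholder abstract about the Roman Empire.", "score": 0.0},
]

prompt = build_prompt("When did the Roman Empire fall?", results)
print(prompt)
```

With the index loaded as in the README example, `results` would come from `embeddings.search(question, 3)` instead of the placeholder list.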

config.json ADDED

@@ -0,0 +1,28 @@
+{
+  "format": "json",
+  "path": "intfloat/e5-base",
+  "instructions": {
+    "query": "query: ",
+    "data": "passage: "
+  },
+  "batch": 8192,
+  "encodebatch": 128,
+  "faiss": {
+    "quantize": true,
+    "sample": 0.05
+  },
+  "content": true,
+  "dimensions": 768,
+  "backend": "faiss",
+  "offset": 6013092,
+  "build": {
+    "create": "2023-02-20T21:57:46Z",
+    "python": "3.7.16",
+    "settings": {
+      "components": "IVF2193,SQ8"
+    },
+    "system": "Linux (x86_64)",
+    "txtai": "5.4.0"
+  },
+  "update": "2023-02-20T21:57:46Z"
+}
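
Note on the config above: the sketch below reads a few of its fields to show what they imply. The `instructions` prefixes are the ones E5 models expect on queries and passages, and `IVF2193,SQ8` is a Faiss index factory string (an inverted file index with 2193 lists and 8-bit scalar quantization). The `4 * sqrt(sampled rows)` rule for the list count is an assumption about how txtai sized the index; it happens to reproduce 2193 here, but treat that as a consistency check rather than a statement of the implementation. `with_instruction` is a hypothetical helper.

```python
import json
import math

# Subset of the config.json fields shown above
config = json.loads("""
{
  "instructions": {"query": "query: ", "data": "passage: "},
  "faiss": {"quantize": true, "sample": 0.05},
  "offset": 6013092,
  "build": {"settings": {"components": "IVF2193,SQ8"}}
}
""")

def with_instruction(text, kind):
    """Prepend the configured instruction prefix (hypothetical helper)."""
    return config["instructions"][kind] + text

query_text = with_instruction("Roman Empire", "query")
data_text = with_instruction("Example abstract", "data")

# Split the Faiss factory string into its coarse index and storage parts
coarse, storage = config["build"]["settings"]["components"].split(",")
nlist = int(coarse[len("IVF"):])

# Assumption: offset is the row count and lists are sized from the sampled rows
sampled = int(config["offset"] * config["faiss"]["sample"])
estimated_nlist = round(4 * math.sqrt(sampled))

print(query_text, "|", data_text)
print(nlist, storage, estimated_nlist)
```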

documents ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:fecb90d975ac6d48caabd1b7a5a4b94350b1af6c052e28dbc6fab4afa6051708
+size 3138019328

embeddings ADDED

@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:fb7bd5798472eb259459edf5037231198814b639649239b05799253db2df8529
+size 4672920160
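
Note on the LFS pointers above: the `size` field is the byte count of the real file, so it can be sanity-checked against config.json. The sketch below parses the `embeddings` pointer and compares its size with a rough estimate for an `IVF2193,SQ8` index. The estimate rests on assumptions: `offset` (6013092) is the number of stored vectors, SQ8 stores one byte per dimension, each vector carries an 8-byte id, and the centroids are float32. Small headers are ignored, so this is a back-of-the-envelope check, not a description of the Faiss file format. `parse_pointer` is a hypothetical helper.

```python
def parse_pointer(text):
    """Parse a Git LFS pointer file into its oid and size (hypothetical helper)."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    return {"oid": fields["oid"], "size": int(fields["size"])}

pointer = parse_pointer("""
version https://git-lfs.github.com/spec/v1
oid sha256:fb7bd5798472eb259459edf5037231198814b639649239b05799253db2df8529
size 4672920160
""")

# Values taken from config.json; their interpretation is an assumption
vectors, dimensions, nlist = 6013092, 768, 2193

codes = vectors * dimensions        # SQ8: one byte per dimension
ids = vectors * 8                   # 64-bit id per vector
centroids = nlist * dimensions * 4  # float32 coarse centroids
estimate = codes + ids + centroids

ratio = estimate / pointer["size"]
print(estimate, pointer["size"], ratio)
```

Under these assumptions the estimate lands within a fraction of a percent of the recorded size, which suggests the pointer and the config describe the same index.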