NeuML/txtai-wikipedia

Sentence Similarity

English

txtai


davidmezzetti committed on Sep 13, 2024

Commit

99e1eb8

1 Parent(s): 576db1d

September 2024 data update


Files changed (4)

README.md +6 -6
config.json +6 -6
documents +2 -2
embeddings +2 -2

README.md CHANGED

@@ -8,14 +8,14 @@ library_name: txtai
 tags:
 - sentence-similarity
 datasets:
-- NeuML/wikipedia-20240101
 ---
 # Wikipedia txtai embeddings index
 This is a [txtai](https://github.com/neuml/txtai) embeddings index for the [English edition of Wikipedia](https://en.wikipedia.org/).
-This index is built from the [Wikipedia January 2024 dataset](https://huggingface.co/datasets/neuml/wikipedia-20240101). Only the first paragraph of the [lead section](https://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Lead_section) from each article is included in the index. This is similar to an abstract of the article.
 It also uses [Wikipedia Page Views](https://dumps.wikimedia.org/other/pageviews/readme.html) data to add a `percentile` field. The `percentile` field can be used
 to only match commonly visited pages.
@@ -24,7 +24,7 @@ txtai must be [installed](https://neuml.github.io/txtai/install/) to use this mo
 ## Example
-Version 5.4 added support for loading embeddings indexes from the Hugging Face Hub. See the example below.
 ```python
 from txtai.embeddings import Embeddings
@@ -75,7 +75,7 @@ pip install txtchat mwparserfromhell datasets
 - Download and build pageviews database
 ```bash
 mkdir -p pageviews/data
-wget -P pageviews/data https://dumps.wikimedia.org/other/pageview_complete/monthly/2023/2023-12/pageviews-202312-user.bz2
 python -m txtchat.data.wikipedia.views -p en.wikipedia -v pageviews
 ```
@@ -85,7 +85,7 @@ python -m txtchat.data.wikipedia.views -p en.wikipedia -v pageviews
 from datasets import load_dataset
 # Data dump date from https://dumps.wikimedia.org/enwiki/
-date = "20240101"
 # Build and save dataset
 ds = load_dataset("neuml/wikipedia", language="en", date=date)
@@ -95,7 +95,7 @@ ds.save_to_disk(f"wikipedia-{date}")
 - Build txtai-wikipedia index
 ```bash
 python -m txtchat.data.wikipedia.index \
-       -d wikipedia-20240101 \
        -o txtai-wikipedia \
        -v pageviews/pageviews.sqlite
 ```

 tags:
 - sentence-similarity
 datasets:
+- NeuML/wikipedia-20240901
 ---
 # Wikipedia txtai embeddings index
 This is a [txtai](https://github.com/neuml/txtai) embeddings index for the [English edition of Wikipedia](https://en.wikipedia.org/).
+This index is built from the [Wikipedia September 2024 dataset](https://huggingface.co/datasets/neuml/wikipedia-20240901). Only the first paragraph of the [lead section](https://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Lead_section) from each article is included in the index. This is similar to an abstract of the article.
 It also uses [Wikipedia Page Views](https://dumps.wikimedia.org/other/pageviews/readme.html) data to add a `percentile` field. The `percentile` field can be used
 to only match commonly visited pages.
 ## Example
+See the example below. This index requires txtai >= 7.4.
 ```python
 from txtai.embeddings import Embeddings
 - Download and build pageviews database
 ```bash
 mkdir -p pageviews/data
+wget -P pageviews/data https://dumps.wikimedia.org/other/pageview_complete/monthly/2024/2024-08/pageviews-202408-user.bz2
 python -m txtchat.data.wikipedia.views -p en.wikipedia -v pageviews
 ```
 from datasets import load_dataset
 # Data dump date from https://dumps.wikimedia.org/enwiki/
+date = "20240901"
 # Build and save dataset
 ds = load_dataset("neuml/wikipedia", language="en", date=date)
 - Build txtai-wikipedia index
 ```bash
 python -m txtchat.data.wikipedia.index \
+       -d wikipedia-20240901 \
        -o txtai-wikipedia \
        -v pageviews/pageviews.sqlite
 ```
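
The README edits in this commit all follow from one input, the dump date: `20240901` names the dataset directory, and the pageviews download is the monthly file for the preceding month (`202408`; the previous revision paired `20240101` with `202312`). A minimal sketch of that derivation, assuming the one-month offset is the intended rule. The helper name `update_values` is hypothetical and not part of txtchat.

```python
from datetime import date, timedelta

# URL layout copied from the wget line in the README diff
PAGEVIEWS_URL = (
    "https://dumps.wikimedia.org/other/pageview_complete/monthly/"
    "{year}/{year}-{month:02d}/pageviews-{year}{month:02d}-user.bz2"
)

def update_values(dump_date):
    """Hypothetical helper: derive the values edited in this commit from a YYYYMMDD dump date."""
    dump = date(int(dump_date[:4]), int(dump_date[4:6]), int(dump_date[6:8]))
    # Assumption: pageviews come from the month before the dump date
    previous = dump.replace(day=1) - timedelta(days=1)
    return {
        "dataset": f"wikipedia-{dump_date}",
        "pageviews": PAGEVIEWS_URL.format(year=previous.year, month=previous.month),
    }

values = update_values("20240901")
print(values["dataset"])
print(values["pageviews"])
```

Calling `update_values("20240101")` yields the values on the removed lines, which is a quick way to check the rule against both revisions.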

config.json CHANGED

@@ -14,15 +14,15 @@
   "content": true,
   "dimensions": 768,
   "backend": "faiss",
-  "offset": 6172387,
   "build": {
-    "create": "2024-01-10T20:34:13Z",
-    "python": "3.8.18",
     "settings": {
-      "components": "IVF2222,SQ8"
     },
     "system": "Linux (x86_64)",
-    "txtai": "6.4.0"
   },
-  "update": "2024-01-10T20:34:13Z"
 }

   "content": true,
   "dimensions": 768,
   "backend": "faiss",
+  "offset": 6272285,
   "build": {
+    "create": "2024-09-12T17:06:38Z",
+    "python": "3.8.19",
     "settings": {
+      "components": "IVF2240,SQ8"
     },
     "system": "Linux (x86_64)",
+    "txtai": "7.4.0"
   },
+  "update": "2024-09-12T17:06:38Z"
 }
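
To read the config.json change at a glance, the two revisions can be compared field by field. A minimal sketch using only the values shown in this diff; it assumes `offset` tracks the number of indexed rows. The last check notes that both cell counts match `round(4 * sqrt(0.05 * offset))`. That formula and the 5% sampling ratio are inferred from the numbers and are not stated anywhere in this diff.

```python
import math

# Values copied from the old and new sides of the config.json diff
old = {"offset": 6172387, "create": "2024-01-10T20:34:13Z", "python": "3.8.18",
       "components": "IVF2222,SQ8", "txtai": "6.4.0", "update": "2024-01-10T20:34:13Z"}
new = {"offset": 6272285, "create": "2024-09-12T17:06:38Z", "python": "3.8.19",
       "components": "IVF2240,SQ8", "txtai": "7.4.0", "update": "2024-09-12T17:06:38Z"}

# Fields whose values differ between the two revisions
changed = {key: (old[key], new[key]) for key in old if old[key] != new[key]}
for key, (before, after) in changed.items():
    print(f"{key}: {before} -> {after}")

# Assumption: offset counts indexed rows, so the difference is the number of rows added
added = new["offset"] - old["offset"]
print(f"offset grew by {added}")

def ivf_cells(components):
    """Extract the cell count from a factory string such as 'IVF2240,SQ8'."""
    return int(components.split(",")[0][len("IVF"):])

def estimated_cells(offset, sample=0.05):
    # Assumption: inferred relationship, not confirmed by the diff
    return round(4 * math.sqrt(sample * offset))

for config in (old, new):
    print(ivf_cells(config["components"]), estimated_cells(config["offset"]))
```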

documents CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:3074eee183918963965a7cdb7c6371150131bf9d99e04d141b20095cd8183b2c
-size 3237478400

 version https://git-lfs.github.com/spec/v1
+oid sha256:a54c4473c76d15a7bf2d1b8a1b590d3aaeacc0426324c4a5b1d886d729a43b92
+size 3292749824

embeddings CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:6f4dea2a142773ae579125431036c7adb632e597a7cd9ed04fbe5473e5f83201
-size 4796622400

 version https://git-lfs.github.com/spec/v1
+oid sha256:47bcc78a1602223ac3b2b8355b3c244a62fb2e4d0e2f04cdc8a8d0a865692b35
+size 4874198688
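
The `documents` and `embeddings` entries are Git LFS pointer files, so the diff shows only a new hash and size rather than the data itself. A short sketch, using the pointer text from this commit, that parses the `size` field and reports how much each file grew. `parse_pointer` is a hypothetical helper written for this illustration.

```python
def parse_pointer(text):
    """Parse a Git LFS pointer file into a dict of its key/value lines."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    fields["size"] = int(fields["size"])
    return fields

POINTER = "version https://git-lfs.github.com/spec/v1\noid sha256:{oid}\nsize {size}\n"

# (old, new) pointer contents copied from the diff
pointers = {
    "documents": (
        POINTER.format(oid="3074eee183918963965a7cdb7c6371150131bf9d99e04d141b20095cd8183b2c", size=3237478400),
        POINTER.format(oid="a54c4473c76d15a7bf2d1b8a1b590d3aaeacc0426324c4a5b1d886d729a43b92", size=3292749824),
    ),
    "embeddings": (
        POINTER.format(oid="6f4dea2a142773ae579125431036c7adb632e597a7cd9ed04fbe5473e5f83201", size=4796622400),
        POINTER.format(oid="47bcc78a1602223ac3b2b8355b3c244a62fb2e4d0e2f04cdc8a8d0a865692b35", size=4874198688),
    ),
}

growth = {}
for name, (old_text, new_text) in pointers.items():
    before = parse_pointer(old_text)["size"]
    after = parse_pointer(new_text)["size"]
    growth[name] = after - before
    print(f"{name}: +{growth[name]} bytes ({growth[name] / before:.2%})")
```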