davidmezzetti committed
Commit 459d697
Parent: 4379b79

January 2024 data update

Files changed (4)
  1. README.md +40 -6
  2. config.json +6 -6
  3. documents +2 -2
  4. embeddings +2 -2
README.md CHANGED
@@ -8,16 +8,14 @@ library_name: txtai
 tags:
 - sentence-similarity
 datasets:
-- olm/olm-wikipedia-20221220
+- neuml/wikipedia-20240101
 ---
 
 # Wikipedia txtai embeddings index
 
 This is a [txtai](https://github.com/neuml/txtai) embeddings index for the [English edition of Wikipedia](https://en.wikipedia.org/).
 
-This index is built from the [OLM Wikipedia December 2022 dataset](https://huggingface.co/datasets/olm/olm-wikipedia-20221220).
-Only the first paragraph of the [lead section](https://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Lead_section) from each article is included in the index.
-This is similar to an abstract of the article.
+This index is built from the [Wikipedia January 2024 dataset](https://huggingface.co/datasets/neuml/wikipedia-20240101). Only the first paragraph of the [lead section](https://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Lead_section) from each article is included in the index. This is similar to an abstract of the article.
 
 It also uses [Wikipedia Page Views](https://dumps.wikimedia.org/other/pageviews/readme.html) data to add a `percentile` field. The `percentile` field can be used
 to only match commonly visited pages.
@@ -49,7 +47,43 @@ embeddings.search("""
 
 An embeddings index generated by txtai is a fully encapsulated index format. It doesn't require a database server or dependencies outside of the Python install.
 
-The Wikipedia index works well as a fact-based context source for conversational search. In other words, search results from this model can be passed to LLM prompts as the
-context in which to answer questions.
+The Wikipedia index works well as a fact-based context source for retrieval augmented generation (RAG). In other words, search results from this model can be passed to LLM prompts as the context in which to answer questions.
 
 See this [article](https://neuml.hashnode.dev/embeddings-in-the-cloud) for additional examples on how to use this model.
+
+## Build the index
+
+The following steps show how to build this index. These scripts use the latest data available as of 2024-01-01; update as appropriate.
+
+- Install required build dependencies
+```bash
+pip install txtchat mwparserfromhell datasets
+```
+
+- Download and build pageviews database
+```bash
+mkdir -p pageviews/data
+wget -P pageviews/data https://dumps.wikimedia.org/other/pageview_complete/monthly/2023/2023-12/pageviews-202312-user.bz2
+python -m txtchat.data.wikipedia.views -p en.wikipedia -v pageviews
+```
+
+- Build Wikipedia dataset
+```python
+from datasets import load_dataset
+
+# Data dump date from https://dumps.wikimedia.org/enwiki/
+date = "20240101"
+
+# Build and save dataset
+ds = load_dataset("neuml/wikipedia", language="en", date=date)
+ds.save_to_disk(f"wikipedia-{date}")
+```
+
+- Build txtai-wikipedia index
+```bash
+python -m txtchat.data.wikipedia.index \
+       -d wikipedia-20240101 \
+       -o txtai-wikipedia \
+       -v pageviews/pageviews.sqlite
+```
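As a sketch of the RAG usage described in the README above, the snippet below formats search results into an LLM prompt. The query string follows txtai's SQL syntax with the `percentile` filter from the model card; the `results` list is a hypothetical stand-in for what `embeddings.search(query)` would return from this index with content storage enabled, and `build_prompt` is an illustrative helper, not part of txtai.

```python
# txtai SQL query mirroring the model card: semantic match restricted to the
# top percentile of most-visited pages. Shown as a string; running it would
# require loading the actual index.
query = """
SELECT id, text, score FROM txtai
WHERE similar('Roman Empire') AND percentile >= 0.99
LIMIT 3
"""

# Hypothetical results in the shape txtai returns with content storage enabled
results = [
    {"id": "Roman Empire", "text": "The Roman Empire was the post-Republican state of ancient Rome.", "score": 0.82},
    {"id": "Rome", "text": "Rome is the capital city of Italy.", "score": 0.71},
]

def build_prompt(question, results):
    """Packs retrieved passages into a context block for an LLM prompt."""
    context = "\n".join(f"- {r['text']}" for r in results)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

prompt = build_prompt("What was the Roman Empire?", results)
print(prompt)
```

The prompt string can then be sent to any LLM; the model answers from the retrieved Wikipedia abstracts rather than from its parametric memory.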
config.json CHANGED
@@ -14,15 +14,15 @@
 "content": true,
 "dimensions": 768,
 "backend": "faiss",
-"offset": 6013092,
+"offset": 6172387,
 "build": {
-"create": "2023-02-20T21:57:46Z",
-"python": "3.7.16",
+"create": "2024-01-10T20:34:13Z",
+"python": "3.8.18",
 "settings": {
-"components": "IVF2193,SQ8"
+"components": "IVF2222,SQ8"
 },
 "system": "Linux (x86_64)",
-"txtai": "5.4.0"
+"txtai": "6.4.0"
 },
-"update": "2023-02-20T21:57:46Z"
+"update": "2024-01-10T20:34:13Z"
 }
documents CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:fecb90d975ac6d48caabd1b7a5a4b94350b1af6c052e28dbc6fab4afa6051708
-size 3138019328
+oid sha256:3074eee183918963965a7cdb7c6371150131bf9d99e04d141b20095cd8183b2c
+size 3237478400
embeddings CHANGED
@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:fb7bd5798472eb259459edf5037231198814b639649239b05799253db2df8529
-size 4672920160
+oid sha256:6f4dea2a142773ae579125431036c7adb632e597a7cd9ed04fbe5473e5f83201
+size 4796622400