burgerbee committed on
Commit
2250d2b
1 Parent(s): 98e696b

Update README.md

Files changed (1)
  1. README.md +4 -4
README.md CHANGED
@@ -8,7 +8,7 @@ library_name: txtai
8
  tags:
9
  - sentence-similarity
10
  datasets:
11
- - burgerbee/wikipedia-sv-20240220
12
  ---
13
  # Wikipedia txtai embeddings index
14
  This is a [txtai](https://github.com/neuml/txtai) embeddings index for the [Swedish edition of Wikipedia](https://sv.wikipedia.org/).
@@ -16,8 +16,8 @@ This is a [txtai](https://github.com/neuml/txtai) embeddings index for the [Swed
16
  Embeddings is the engine that delivers semantic search. Data is transformed into embeddings vectors where similar concepts will produce similar vectors.
17
  An embeddings index generated by txtai is a fully encapsulated index format. It doesn't require a database server.
18
 
19
- This index is built from the [Wikipedia February 2024 dataset](https://huggingface.co/datasets/burgerbee/wikipedia-sv-20240220).
20
- Only the first two paragraphs from each article are included. The Wikipedia index works well as a fact-based context source for retrieval augmented generation (RAG). It also uses [Wikipedia Page Views](https://dumps.wikimedia.org/other/pageviews/readme.html) data to add a `percentile` field. The `percentile` field can be used
21
  to only match commonly visited pages.
22
 
23
  txtai must be (pip) [installed](https://neuml.github.io/txtai/install/) to use this model.
@@ -57,4 +57,4 @@ https://dumps.wikimedia.org/svwiki/
57
 
58
  https://dumps.wikimedia.org/other/pageview_complete/
59
 
60
- https://huggingface.co/datasets/burgerbee/wikipedia-sv-20240220
 
8
  tags:
9
  - sentence-similarity
10
  datasets:
11
+ - burgerbee/wikipedia-sv-20241020
12
  ---
13
  # Wikipedia txtai embeddings index
14
  This is a [txtai](https://github.com/neuml/txtai) embeddings index for the [Swedish edition of Wikipedia](https://sv.wikipedia.org/).
 
16
  Embeddings is the engine that delivers semantic search. Data is transformed into embeddings vectors where similar concepts will produce similar vectors.
17
  An embeddings index generated by txtai is a fully encapsulated index format. It doesn't require a database server.
18
 
19
+ This index is built from the [Wikipedia October 2024 dataset](https://huggingface.co/datasets/burgerbee/wikipedia-sv-20241020).
20
+ The Wikipedia index works well as a fact-based context source for retrieval augmented generation (RAG). It also uses [Wikipedia Page Views](https://dumps.wikimedia.org/other/pageviews/readme.html) data to add a `percentile` field. The `percentile` field can be used
21
  to only match commonly visited pages.
22
 
23
  txtai must be (pip) [installed](https://neuml.github.io/txtai/install/) to use this model.
 
57
 
58
  https://dumps.wikimedia.org/other/pageview_complete/
59
 
60
+ https://huggingface.co/datasets/burgerbee/wikipedia-sv-20241020
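
The README text in this diff says the `percentile` field can be used to match only commonly visited pages. As a rough illustration, that could look like the sketch below, which is not taken from the README itself. It assumes the index exposes `id`, `text`, `score` and `percentile` to txtai's SQL query syntax and that it is loaded from the Hugging Face Hub. The helper names `build_percentile_query` and `search_popular`, the `container` parameter and the `0.99` threshold are hypothetical choices made for this example.

```python
def build_percentile_query(text: str, min_percentile: float = 0.99) -> str:
    """Build a txtai SQL query that keeps only frequently viewed pages.

    Assumption: the index exposes id, text, score and percentile fields.
    """
    if not 0.0 <= min_percentile <= 1.0:
        raise ValueError("min_percentile must be between 0 and 1")
    # The text is interpolated into a quoted SQL string as-is, so this
    # sketch simply rejects input that contains a single quote.
    if "'" in text:
        raise ValueError("query text must not contain single quotes")
    return (
        "SELECT id, text, score, percentile FROM txtai "
        f"WHERE similar('{text}') AND percentile >= {min_percentile}"
    )


def search_popular(container: str, text: str,
                   min_percentile: float = 0.99, limit: int = 5):
    """Load the index from the Hugging Face Hub and run the filtered search.

    `container` is the Hub repository id of the index (a hypothetical
    parameter here; the diff does not state the repository id).
    """
    # Imported lazily so the query builder works without txtai installed.
    from txtai.embeddings import Embeddings

    embeddings = Embeddings()
    embeddings.load(provider="huggingface-hub", container=container)
    return embeddings.search(build_percentile_query(text, min_percentile), limit)
```

Calling `search_popular` with the repository id of this index as `container` and a query such as `"Stockholm"` should return the best semantic matches among the most visited pages. Lowering `min_percentile` widens the search to less visited articles.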