Update README.md
README.md CHANGED
@@ -8,7 +8,7 @@ library_name: txtai
 tags:
 - sentence-similarity
 datasets:
-- burgerbee/wikipedia-sv-
+- burgerbee/wikipedia-sv-20241020
 ---
 # Wikipedia txtai embeddings index
 This is a [txtai](https://github.com/neuml/txtai) embeddings index for the [Swedish edition of Wikipedia](https://sv.wikipedia.org/).
@@ -16,8 +16,8 @@ This is a [txtai](https://github.com/neuml/txtai) embeddings index for the [Swed
 Embeddings is the engine that delivers semantic search. Data is transformed into embeddings vectors where similar concepts will produce similar vectors.
 An embeddings index generated by txtai is a fully encapsulated index format. It doesn't require a database server.
 
-This index is built from the [Wikipedia
-
+This index is built from the [Wikipedia October 2024 dataset](https://huggingface.co/datasets/burgerbee/wikipedia-sv-20241020).
+The Wikipedia index works well as a fact-based context source for retrieval augmented generation (RAG). It also uses [Wikipedia Page Views](https://dumps.wikimedia.org/other/pageviews/readme.html) data to add a `percentile` field. The `percentile` field can be used
 to only match commonly visited pages.
 
 txtai must be (pip) [installed](https://neuml.github.io/txtai/install/) to use this model.
@@ -57,4 +57,4 @@ https://dumps.wikimedia.org/svwiki/
 
 https://dumps.wikimedia.org/other/pageview_complete/
 
-https://huggingface.co/datasets/burgerbee/wikipedia-sv-
+https://huggingface.co/datasets/burgerbee/wikipedia-sv-20241020
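
The updated card notes that txtai must be (pip) installed to use this index. A minimal usage sketch of that flow follows; it is not the card's own code, and the `container` repo id is an assumed placeholder for this model's actual Hugging Face Hub path. The query string is illustrative.

```python
# pip install txtai
from txtai.embeddings import Embeddings

# Load the fully encapsulated index straight from the Hugging Face Hub;
# no database server is required.
# NOTE: "burgerbee/wikipedia-sv" is an assumed repo id; use this model's real Hub path.
embeddings = Embeddings()
embeddings.load(provider="huggingface-hub", container="burgerbee/wikipedia-sv")

# Semantic search: similar concepts produce similar vectors, so the query
# matches articles by meaning rather than exact keywords.
for result in embeddings.search("Vem var Alfred Nobel?", 3):
    print(result["score"], result["text"])
```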
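The `percentile` field the card describes can be combined with similarity search through txtai's SQL query support. This is a sketch assuming the field is stored on each row as the card states; the `0.99` cutoff is an arbitrary illustration meaning roughly "the top 1% most-visited pages".

```python
# Filter semantic matches to commonly visited pages using the percentile field.
results = embeddings.search("""
    SELECT id, text, score FROM txtai
    WHERE similar('Vem var Alfred Nobel?') AND percentile >= 0.99
""")
```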
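Since the card positions the index as a fact-based context source for RAG, one possible wiring is to retrieve the top passages and hand them to whatever LLM client you use. `llm()` below is a hypothetical generation function, not part of txtai or this card.

```python
# Minimal RAG sketch: retrieved Wikipedia text becomes grounding context.
question = "Vem var Alfred Nobel?"
context = "\n".join(r["text"] for r in embeddings.search(question, 3))

prompt = f"Answer the question using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
# answer = llm(prompt)  # llm() is a placeholder for your model call
```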