Update README.md
README.md CHANGED
@@ -8,7 +8,7 @@ library_name: txtai
 tags:
 - sentence-similarity
 datasets:
-- burgerbee/wikipedia-sv-
+- burgerbee/wikipedia-sv-20241020
 ---
 # Wikipedia txtai embeddings index
 This is a [txtai](https://github.com/neuml/txtai) embeddings index for the [Swedish edition of Wikipedia](https://sv.wikipedia.org/).
@@ -16,8 +16,8 @@ This is a [txtai](https://github.com/neuml/txtai) embeddings index for the [Swed
 Embeddings is the engine that delivers semantic search. Data is transformed into embeddings vectors where similar concepts will produce similar vectors.
 An embeddings index generated by txtai is a fully encapsulated index format. It doesn't require a database server.
 
-This index is built from the [Wikipedia
-
+This index is built from the [Wikipedia October 2024 dataset](https://huggingface.co/datasets/burgerbee/wikipedia-sv-20241020).
+The Wikipedia index works well as a fact-based context source for retrieval augmented generation (RAG). It also uses [Wikipedia Page Views](https://dumps.wikimedia.org/other/pageviews/readme.html) data to add a `percentile` field. The `percentile` field can be used
 to only match commonly visited pages.
 
 txtai must be (pip) [installed](https://neuml.github.io/txtai/install/) to use this model.
@@ -57,4 +57,4 @@ https://dumps.wikimedia.org/svwiki/
 
 https://dumps.wikimedia.org/other/pageview_complete/
 
-https://huggingface.co/datasets/burgerbee/wikipedia-sv-
+https://huggingface.co/datasets/burgerbee/wikipedia-sv-20241020
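
The updated card notes that txtai must be (pip) installed to use this index. A minimal usage sketch of that flow follows; it is not the card's own code, and the `container` repo id is an assumed placeholder for this model's actual Hugging Face Hub path. The query string is illustrative.

```python
# pip install txtai
from txtai.embeddings import Embeddings

# Load the fully encapsulated index straight from the Hugging Face Hub;
# no database server is required.
# NOTE: "burgerbee/wikipedia-sv" is an assumed repo id; use this model's real Hub path.
embeddings = Embeddings()
embeddings.load(provider="huggingface-hub", container="burgerbee/wikipedia-sv")

# Semantic search: similar concepts produce similar vectors, so the query
# matches articles by meaning rather than exact keywords.
for result in embeddings.search("Vem var Alfred Nobel?", 3):
    print(result["score"], result["text"])
```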
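The `percentile` field the card describes can be combined with similarity search through txtai's SQL query support. This is a sketch assuming the field is stored on each row as the card states; the `0.99` cutoff is an arbitrary illustration meaning roughly "the top 1% most-visited pages".

```python
# Filter semantic matches to commonly visited pages using the percentile field.
results = embeddings.search("""
    SELECT id, text, score FROM txtai
    WHERE similar('Vem var Alfred Nobel?') AND percentile >= 0.99
""")
```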
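Since the card positions the index as a fact-based context source for RAG, one possible wiring is to retrieve the top passages and hand them to whatever LLM client you use. `llm()` below is a hypothetical generation function, not part of txtai or this card.

```python
# Minimal RAG sketch: retrieved Wikipedia text becomes grounding context.
question = "Vem var Alfred Nobel?"
context = "\n".join(r["text"] for r in embeddings.search(question, 3))

prompt = f"Answer the question using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
# answer = llm(prompt)  # llm() is a placeholder for your model call
```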