Wikipedia txtai embeddings index
This is a txtai embeddings index (5GB embeddings + 25GB documents) for the English edition of Wikipedia.
Embeddings is the engine that delivers semantic search. Data is transformed into embeddings vectors, where similar concepts produce similar vectors. An embeddings index generated by txtai is a fully encapsulated index format. It doesn't require a database server.
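To make "similar concepts produce similar vectors" concrete, here is a minimal sketch of cosine similarity, a common way to compare embedding vectors. The three vectors are hand-written toy values rather than output from the model behind this index, and the names are purely illustrative.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors: two related concepts and one unrelated concept
album = [0.9, 0.1, 0.0]
record = [0.8, 0.2, 0.1]
volcano = [0.0, 0.1, 0.9]

# Related concepts should score higher than unrelated ones
print(cosine_similarity(album, record) > cosine_similarity(album, volcano))
```

Real embedding vectors have hundreds of dimensions, but the comparison works the same way.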
This index is built from the Wikipedia March 2024 dataset.
The Wikipedia index works well as a fact-based context source for retrieval augmented generation (RAG). It also uses Wikipedia Page Views data to add a percentile field. The percentile field can be used to match only commonly visited pages.
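As a rough sketch of the percentile filter, the helper below builds a txtai SQL query with a configurable threshold. The function name `popular_pages_query` is hypothetical. The sketch assumes percentile values fall between 0 and 1, and that the installed txtai version accepts the `parameters` argument for bind parameters.

```python
def popular_pages_query(min_percentile=0.99):
    """Build a txtai SQL query that keeps only pages at or above min_percentile."""
    # Assumption: percentile values are stored in the range 0 to 1
    if not 0.0 <= min_percentile <= 1.0:
        raise ValueError("min_percentile must be between 0 and 1")
    return (
        "SELECT id, text, score, percentile FROM txtai "
        f"WHERE similar(:q) AND percentile >= {min_percentile}"
    )

query = popular_pages_query(0.99)
print(query)

# With a loaded index, the search text would be passed as a bind parameter:
# results = embeddings.search(query, 1, parameters={"q": "Where in the World Is Carmen Sandiego?"})
```

Passing the search text as a bind parameter avoids having to escape quotes inside the SQL string. Lowering the threshold widens the match to less visited pages.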
txtai must be installed (for example, with pip install txtai) to use this index.
Example code
import json

from txtai.embeddings import Embeddings

# Load the index from the HF Hub
embeddings = Embeddings()
embeddings.load(provider="huggingface-hub", container="burgerbee/txtai-en-wikipedia")

# Run a search and print the text of the top result
for x in embeddings.search("Bob Dylans second album", 1):
    print(x["text"])

# Run a search and filter on popular results (page views).
for x in embeddings.search("SELECT id, text, score, percentile FROM txtai WHERE similar('Where in the World Is Carmen Sandiego?') AND percentile >= 0.99", 1):
    print(json.dumps(x, indent=2))
Example output
The Freewheelin' Bob Dylan is the second studio album by American singer-songwriter Bob Dylan, released on May 27, 1963 by Columbia Records... (full article)
{
  "id": "Where in the World Is Carmen Sandiego? (game show)",
  "text": "Where in the World Is Carmen Sandiego? is an American half-hour children's television game show based on... (full article)",
  "score": 0.8537465929985046,
  "percentile": 0.996002961084341
}
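Since the index is intended as a context source for RAG, the following is a minimal sketch of turning search results into a prompt. The `build_prompt` helper, the `max_chars` limit, and the prompt wording are all assumptions for illustration. The stand-in result mimics the dictionaries shown in the output above, and the language model call itself is left out.

```python
def build_prompt(question, results, max_chars=2000):
    """Assemble a prompt from search results that each carry a "text" field."""
    # Truncate each article, since full Wikipedia pages can be very long
    context = "\n\n".join(r["text"][:max_chars] for r in results)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Stand-in result; with a loaded index this would come from embeddings.search(...)
results = [
    {
        "text": "The Freewheelin' Bob Dylan is the second studio album by "
                "American singer-songwriter Bob Dylan, released on May 27, 1963 "
                "by Columbia Records"
    }
]

prompt = build_prompt("What was Bob Dylan's second album?", results)
print(prompt)
```

The resulting string can be passed to whichever language model is in use. Truncating each article keeps the prompt within a model's context window.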
Data source
https://dumps.wikimedia.org/enwiki/
https://dumps.wikimedia.org/other/pageview_complete/
https://huggingface.co/datasets/burgerbee/wikipedia-en-20240320