---
inference: false
language: en
license:
  - cc-by-sa-3.0
  - gfdl
library_name: txtai
tags:
  - sentence-similarity
datasets:
  - olm/olm-wikipedia-20221220
---

# Wikipedia txtai embeddings index

This is a [txtai](https://github.com/neuml/txtai) embeddings index for the [English edition of Wikipedia](https://en.wikipedia.org/).

This index is built from the [OLM Wikipedia December 2022 dataset](https://huggingface.co/datasets/olm/olm-wikipedia-20221220). Only the first paragraph of the [lead section](https://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Lead_section) from each article is included in the index. This is similar to an abstract of the article.

It also uses [Wikipedia Page Views](https://dumps.wikimedia.org/other/pageviews/readme.html) data to add a `percentile` field. The `percentile` field can be used to match only commonly visited pages.

txtai must be [installed](https://neuml.github.io/txtai/install/) to use this model.

## Example

Version 5.4 added support for loading embeddings indexes from the Hugging Face Hub. See the example below.

```python
from txtai.embeddings import Embeddings

# Load the index from the HF Hub
embeddings = Embeddings()
embeddings.load(provider="huggingface-hub", container="neuml/txtai-wikipedia")

# Run a search
embeddings.search("Roman Empire")

# Run a search matching only the top 1% of articles
embeddings.search("""
  SELECT id, text, score, percentile FROM txtai
  WHERE similar('Boston') AND percentile >= 0.99
""")
```

## Use Cases

An embeddings index generated by txtai is a fully encapsulated index format. It doesn't require a database server or dependencies outside of the Python install.

The Wikipedia index works well as a fact-based context source for conversational search. In other words, search results from this model can be passed to LLM prompts as the context in which to answer questions.
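As a sketch of that retrieval-augmented pattern, the snippet below assembles an LLM prompt from search results. The `results` list mirrors the shape of the dictionaries returned by `embeddings.search`; the `build_prompt` helper and prompt wording are illustrative assumptions, not part of txtai.

```python
# Sketch: turn txtai search results into an LLM prompt (RAG-style).
# build_prompt is a hypothetical helper, not a txtai API.

def build_prompt(question, results):
    """Assemble a question-answering prompt with retrieved context."""
    context = "\n".join(r["text"] for r in results)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

# Example results, shaped like embeddings.search output
results = [
    {
        "id": "Roman Empire",
        "text": "The Roman Empire was the post-Republican period of ancient Rome.",
        "score": 0.9,
    },
]

prompt = build_prompt("What was the Roman Empire?", results)
print(prompt)
```

The resulting string can then be sent to any LLM completion API; filtering with `percentile` first helps keep the context limited to well-known articles.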