---
inference: false
language: en
license:
  - cc-by-sa-3.0
  - gfdl
library_name: txtai
tags:
  - sentence-similarity
datasets:
  - olm/olm-wikipedia-20221220
---

# Wikipedia txtai embeddings index

This is a [txtai](https://github.com/neuml/txtai) embeddings index for the [English edition of Wikipedia](https://en.wikipedia.org/).

This index is built from the [OLM Wikipedia December 2022 dataset](https://huggingface.co/datasets/olm/olm-wikipedia-20221220). Only the first paragraph of the [lead section](https://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Lead_section) from each article is included in the index. This is similar to an abstract of the article.

It also uses [Wikipedia Page Views](https://dumps.wikimedia.org/other/pageviews/readme.html) data to add a `percentile` field. The `percentile` field can be used to match only commonly visited pages.

txtai must be [installed](https://neuml.github.io/txtai/install/) to use this model.

## Example

Version 5.4 added support for loading embeddings indexes from the Hugging Face Hub. See the example below.

```python
from txtai.embeddings import Embeddings

# Load the index from the HF Hub
embeddings = Embeddings()
embeddings.load(provider="huggingface-hub", container="neuml/txtai-wikipedia")

# Run a search
embeddings.search("Roman Empire")

# Run a search matching only the top 1% of articles
embeddings.search("""
  SELECT id, text, score, percentile FROM txtai
  WHERE similar('Boston') AND percentile >= 0.99
""")
```

## Use Cases

An embeddings index generated by txtai is a fully encapsulated index format. It doesn't require a database server or dependencies outside of the Python install.

The Wikipedia index works well as a fact-based context source for conversational search. In other words, search results from this model can be passed to LLM prompts as the context in which to answer questions.
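As a sketch of that retrieval-augmented pattern, the snippet below assembles an LLM prompt from search results. The `results` list mirrors the shape of the dictionaries returned by `embeddings.search`; the `build_prompt` helper and prompt wording are illustrative assumptions, not part of txtai.

```python
# Sketch: turn txtai search results into an LLM prompt (RAG-style).
# build_prompt is a hypothetical helper, not a txtai API.

def build_prompt(question, results):
    """Assemble a question-answering prompt with retrieved context."""
    context = "\n".join(r["text"] for r in results)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

# Example results, shaped like embeddings.search output
results = [
    {
        "id": "Roman Empire",
        "text": "The Roman Empire was the post-Republican period of ancient Rome.",
        "score": 0.9,
    },
]

prompt = build_prompt("What was the Roman Empire?", results)
print(prompt)
```

The resulting string can then be sent to any LLM completion API; filtering with `percentile` first helps keep the context limited to well-known articles.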