File size: 3,932 Bytes

---
inference: false
language: en
license:
- cc-by-sa-3.0
- gfdl
library_name: txtai
tags:
- sentence-similarity
datasets:
- NeuML/wikipedia-20240901
---

# Wikipedia txtai embeddings index

This is a [txtai](https://github.com/neuml/txtai) embeddings index for the [English edition of Wikipedia](https://en.wikipedia.org/).

This index is built from the [Wikipedia September 2024 dataset](https://huggingface.co/datasets/neuml/wikipedia-20240901). Only the first paragraph of the [lead section](https://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Lead_section) from each article is included in the index. This is similar to an abstract of the article.

It also uses [Wikipedia Page Views](https://dumps.wikimedia.org/other/pageviews/readme.html) data to add a `percentile` field. The `percentile` field can be used
to only match commonly visited pages.

txtai must be [installed](https://neuml.github.io/txtai/install/) to use this model.

## Example

See the example below. This index requires txtai >= 7.4.

```python
from txtai.embeddings import Embeddings

# Load the index from the HF Hub
embeddings = Embeddings()
embeddings.load(provider="huggingface-hub", container="neuml/txtai-wikipedia")

# Run a search
embeddings.search("Roman Empire")

# Run a search matching only the Top 1% of articles
embeddings.search("""
   SELECT id, text, score, percentile FROM txtai WHERE similar('Boston') AND
   percentile >= 0.99
""")
```

## Use Cases

An embeddings index generated by txtai is a fully encapsulated index format. It doesn't require a database server or dependencies outside of the Python install.

The Wikipedia index works well as a fact-based context source for retrieval augmented generation (RAG). In other words, search results from this model can be passed to LLM prompts as the context in which to answer questions.

See this [article](https://neuml.hashnode.dev/embeddings-in-the-cloud) for additional examples on how to use this model.

## Evaluation Results

Performance was evaluated using the [NDCG@10](https://en.wikipedia.org/wiki/Discounted_cumulative_gain) score with a [custom question-answer evaluation set](https://github.com/neuml/txtchat/tree/master/datasets/wikipedia). Results are shown below.

| Model                                                      | NDCG@10    | MAP@10     |
| ---------------------------------------------------------- | ---------- | ---------  |
| [bge-base-en-v1.5](https://hf.co/BAAI/bge-large-en-v1.5)   | 0.6320     | 0.5485     |
| [**e5-base**](https://hf.co/intfloat/e5-base)              | **0.7021** | **0.6517** |
| [gte-base](https://hf.co/thenlper/gte-base)                | 0.6775     | 0.6350     |

`e5-base` is the best performing model for the evaluation set. This highlights the importance of testing models as `e5-base` is far from the leading model on the [MTEB leaderboard](https://huggingface.co/spaces/mteb/leaderboard). Benchmark datasets are only a guide.

## Build the index

The following steps show how to build this index. These scripts are using the latest data available as of 2024-09-01, update as appropriate.

- Install required build dependencies
```bash
pip install txtchat mwparserfromhell datasets
```

- Download and build pageviews database
```bash
mkdir -p pageviews/data
wget -P pageviews/data https://dumps.wikimedia.org/other/pageview_complete/monthly/2024/2024-08/pageviews-202408-user.bz2
python -m txtchat.data.wikipedia.views -p en.wikipedia -v pageviews
```

- Build Wikipedia dataset

```python
from datasets import load_dataset

# Data dump date from https://dumps.wikimedia.org/enwiki/
date = "20240901"

# Build and save dataset
ds = load_dataset("neuml/wikipedia", language="en", date=date)
ds.save_to_disk(f"wikipedia-{date}")
```

- Build txtai-wikipedia index
```bash
python -m txtchat.data.wikipedia.index \
       -d wikipedia-20240901 \
       -o txtai-wikipedia \
       -v pageviews/pageviews.sqlite
```