---
inference: false
language: en
license:
- cc0-1.0
library_name: txtai
tags:
- sentence-similarity
datasets:
- arxiv_dataset
---

# arXiv txtai embeddings index

This is a [txtai](https://github.com/neuml/txtai) embeddings index for the [arXiv dataset](https://hf.co/datasets/arxiv_dataset) [metadata](https://info.arxiv.org/help/prep.html).

txtai must be [installed](https://neuml.github.io/txtai/install/) to use this index.

## Example

This index can be loaded from the Hugging Face Hub with txtai as shown below.

```python
from txtai.embeddings import Embeddings

# Load the index from the HF Hub
embeddings = Embeddings()
embeddings.load(provider="huggingface-hub", container="neuml/txtai-arxiv")

# Search for papers matching a query
embeddings.search("Survey of vector databases")

# Search for papers matching an abstract
embeddings.search("""
Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral
has the same architecture as Mistral 7B, with the difference that each
layer is composed of 8 feedforward blocks (i.e. experts). For every
token, at each layer, a router network selects two experts to process
the current state and combine their outputs.
""")

embeddings.search("""
Humanity has wondered whether we are alone for millennia. The discovery
of life elsewhere in the Universe, particularly intelligent life, would
have profound effects, comparable to those of recognizing that the Earth
is not the center of the Universe and that humans evolved from previous
species.
""")

embeddings.search("""
The main objective of this paper is to investigate the extent to which
the margin of victory can be predicted solely by the rankings of the
opposing teams in NCAA Division I men's basketball games.
""")
```

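The searches above return results without displaying them. As a minimal sketch, the helpers below turn a result list into one printable line per hit. It relies on general txtai behavior: `search` returns dicts with `id`, `text` and `score` when content storage is enabled, and `(id, score)` tuples otherwise. Which of the two this index returns is not stated here, so both are handled. The function names and the sample data are hypothetical.

```python
def normalize(results):
    """Convert txtai search results into dicts with id, score and text keys."""
    rows = []
    for result in results:
        if isinstance(result, dict):
            # Shape returned when the index stores content
            rows.append({"id": result.get("id"), "score": result.get("score"), "text": result.get("text")})
        else:
            # Shape returned when the index stores vectors only: (id, score)
            uid, score = result
            rows.append({"id": uid, "score": score, "text": None})
    return rows


def summarize(results, width=80):
    """Build one line per result: score, id and a truncated single-line text."""
    lines = []
    for row in normalize(results):
        text = " ".join((row["text"] or "").split())
        if len(text) > width:
            text = text[: width - 3] + "..."
        lines.append(f"{row['score']:.4f}  {row['id']}  {text}".rstrip())
    return lines


# Hypothetical results, standing in for embeddings.search(query, 2)
sample = [
    {"id": "0000.00001", "text": "A survey of vector databases\nand their indexes", "score": 0.8123},
    ("0000.00002", 0.7),
]

for line in summarize(sample):
    print(line)
```

With the index loaded, `summarize(embeddings.search("Survey of vector databases", 5))` would print the top five hits, where the second argument to `search` is the result limit.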

## Use Cases

An embeddings index generated by txtai is fully self-contained. It requires no database server and no dependencies beyond the Python install.

The arXiv index works well as a fact-based context source for retrieval augmented generation (RAG). In other words, search results from this index can be passed to LLM prompts as the context in which to answer questions.

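As a rough sketch of that RAG flow, the function below assembles a prompt from a question and a list of search results. It assumes each result is a dict with a `text` key, which is what txtai returns when the index stores content. The prompt wording, the function name `build_prompt` and the sample results are illustrative assumptions, not part of txtai or of this index.

```python
def build_prompt(question, results, limit=3):
    """Combine a question and the top search results into an LLM prompt."""
    # One bullet per result, with whitespace collapsed to keep the prompt compact
    context = "\n".join(f"- {' '.join(row['text'].split())}" for row in results[:limit])
    return (
        "Answer the following question using only the context below.\n\n"
        f"Question: {question}\n\n"
        f"Context:\n{context}\n"
    )


# Hypothetical results, standing in for embeddings.search(question, 3)
sample = [
    {"id": "0000.00001", "text": "Hypothetical abstract about sparse mixture of experts models.", "score": 0.81},
    {"id": "0000.00002", "text": "Hypothetical abstract about dense transformer models.", "score": 0.64},
]

prompt = build_prompt("What is a sparse mixture of experts?", sample, limit=1)
print(prompt)
```

The resulting string can then be sent to whichever LLM is in use. Restricting the answer to the supplied context is one common design choice for keeping responses grounded in the retrieved papers.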

Additionally, this index can help identify articles to cite in research. Passing a title and abstract as the query returns similar existing articles.

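A minimal sketch of that citation workflow follows. It builds a single query string from a title and abstract, then filters the search results. Joining the two with a newline, the `0.5` score threshold, the function names and the sample data are all assumptions chosen for illustration. How this index combined titles and abstracts at build time is not stated here.

```python
def citation_query(title, abstract):
    """Join a title and abstract into one whitespace-normalized query string."""
    title = " ".join(title.split())
    abstract = " ".join(abstract.split())
    return f"{title}\n{abstract}"


def citation_candidates(results, minscore=0.5, exclude=()):
    """Keep results scoring at least minscore, skipping ids listed in exclude."""
    return [row for row in results if row["score"] >= minscore and row["id"] not in exclude]


# Hypothetical title paired with the first abstract sentence from the example above
query = citation_query(
    "  A sparse mixture of\nexperts language model ",
    "Mixtral 8x7B, a Sparse Mixture of Experts (SMoE)\n language model.",
)

# Hypothetical results, standing in for embeddings.search(query, 10)
sample = [
    {"id": "0000.00001", "text": "Hypothetical paper A", "score": 0.92},
    {"id": "0000.00002", "text": "Hypothetical paper B", "score": 0.41},
    {"id": "0000.00003", "text": "Hypothetical paper C", "score": 0.77},
]

# Exclude the paper being written, in case it is already in the index
candidates = citation_candidates(sample, minscore=0.5, exclude={"0000.00001"})
print([row["id"] for row in candidates])
```

The `exclude` argument covers the case where the draft is already on arXiv and would otherwise match itself.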

## Build the index

The following steps show how to build this index.

- Install required build dependencies
```bash
pip install txtchat datasets
```

- Follow these [instructions](https://huggingface.co/datasets/arxiv_dataset/blob/main/arxiv_dataset.py#L67) to download the dataset

- Build the txtai-arxiv index
```bash
python -m txtchat.data.arxiv.index \
    -d <path to directory with file downloaded in previous step> \
    -o txtai-arxiv
```

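The txtchat build script is not reproduced here. To give a rough idea of what an indexing step over this dataset involves, the sketch below converts lines of the arXiv metadata file into `(id, text, tags)` tuples, a row format accepted by txtai's `Embeddings.index`. It assumes the file holds one JSON object per line with `id`, `title` and `abstract` fields. The function name `stream` and the choice to join title and abstract with a newline are assumptions and may differ from how this index was actually built.

```python
import json


def stream(lines):
    """Yield (id, text, tags) tuples from JSON lines of arXiv metadata."""
    for line in lines:
        # Skip blank lines between records
        if not line.strip():
            continue

        record = json.loads(line)

        # Collapse the hard-wrapped whitespace found in titles and abstracts
        title = " ".join(record["title"].split())
        abstract = " ".join(record["abstract"].split())

        yield record["id"], f"{title}\n{abstract}", None


# Hypothetical record, standing in for lines read from the downloaded file
sample = [
    json.dumps({"id": "0000.00001", "title": "An example\n  title", "abstract": "  An example\nabstract.\n"}),
    "",
]

rows = list(stream(sample))
print(rows)
```

With txtai installed, a generator like this could be passed to `Embeddings.index` with the open metadata file as `lines`, and the result written to disk with `Embeddings.save`. A generator is used so the full metadata file never has to be held in memory.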

## More information

See the following links for more information on the arXiv metadata dataset.

- [Dataset on Hugging Face](https://huggingface.co/datasets/arxiv_dataset)
- [Dataset on Kaggle](https://www.kaggle.com/datasets/Cornell-University/arxiv)
- [Metadata description](https://info.arxiv.org/help/prep.html)