
	---
	inference: false
	language: en
	license:
	- cc0-1.0
	library_name: txtai
	tags:
	- sentence-similarity
	datasets:
	- arxiv_dataset
	---

	# arXiv txtai embeddings index

	This is a [txtai](https://github.com/neuml/txtai) embeddings index for the [arXiv dataset](https://hf.co/datasets/arxiv_dataset) [metadata](https://info.arxiv.org/help/prep.html).

	txtai must be [installed](https://neuml.github.io/txtai/install/) to use this model.

	## Example

	This index can be loaded from the Hugging Face Hub with txtai as shown below.

	```python
	from txtai.embeddings import Embeddings

	# Load the index from the HF Hub
	embeddings = Embeddings()
	embeddings.load(provider="huggingface-hub", container="neuml/txtai-arxiv")

	# Search for papers matching a query
	embeddings.search("Survey of vector databases")

	# Search for papers matching an abstract
	embeddings.search("""
	Mixtral 8x7B, a Sparse Mixture of Experts (SMoE) language model. Mixtral
	has the same architecture as Mistral 7B, with the difference that each
	layer is composed of 8 feedforward blocks (i.e. experts). For every
	token, at each layer, a router network selects two experts to process
	the current state and combine their outputs.
	""")

	embeddings.search("""
	Humanity has wondered whether we are alone for millennia. The discovery
	of life elsewhere in the Universe, particularly intelligent life, would
	have profound effects, comparable to those of recognizing that the Earth
	is not the center of the Universe and that humans evolved from previous
	species.
	""")

	embeddings.search("""
	The main objective of this paper is to investigate the extent to which
	the margin of victory can be predicted solely by the rankings of the
	opposing teams in NCAA Division I men's basketball games.
	""")
	```
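The calls above return their results without displaying them. The following is a minimal sketch of one way to print them, not part of txtai itself. `format_results` and `show` are hypothetical helper names. The sketch assumes the usual txtai result shapes: dictionaries with `id`, `text` and `score` when the index stores content, and `(id, score)` tuples otherwise.

```python
def format_results(results):
    """Return one printable line per search result."""
    lines = []
    for result in results:
        if isinstance(result, dict):
            # Content-enabled index: dicts with "id", "text" and "score"
            uid, score = result["id"], result["score"]
            text = result.get("text") or ""
        else:
            # Index without stored content: (id, score) tuples
            uid, score = result
            text = ""

        # Score, id and the first 80 characters of the text
        lines.append(f"{score:.4f}  {uid}  {text[:80]}".rstrip())

    return lines


def show(embeddings, query, limit=5):
    """Run a search on a loaded Embeddings instance and print each result."""
    for line in format_results(embeddings.search(query, limit)):
        print(line)
```

With the index loaded as in the example, `show(embeddings, "Survey of vector databases")` would print up to five matches.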

	## Use Cases

	An embeddings index generated by txtai is fully self-contained. It doesn't require a database server or any dependencies beyond the Python install.

	The arXiv index works well as a fact-based context source for retrieval-augmented generation (RAG). In other words, search results from this index can be inserted into LLM prompts as the context for answering questions.
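As a rough illustration of the RAG use case, the sketch below turns search results into a prompt. It is an assumption-laden example rather than a prescribed method: `build_prompt`, the prompt wording and the `max_chars` cutoff are all hypothetical, and the code assumes results are dictionaries with a `text` field (the shape txtai returns when content is stored).

```python
def build_prompt(question, results, max_chars=2000):
    """Build an LLM prompt that uses search results as context."""
    # Assumption: content-enabled results, i.e. dicts with a "text" field
    passages = [
        result["text"].strip()
        for result in results
        if isinstance(result, dict) and result.get("text")
    ]

    # Join passages and truncate to keep the prompt within a size budget
    context = "\n\n".join(passages)[:max_chars]

    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )
```

A call such as `build_prompt(question, embeddings.search(question, 5))` yields a string that can be passed to any LLM of choice.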

	Additionally, this index can help identify articles to cite in research. Searching with a title and abstract pair returns similar existing articles.
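A citation search along these lines might look like the sketch below. How the title and abstract are combined is an assumption made here for illustration (joined into one whitespace-normalized string). `citation_query` and `find_citations` are hypothetical helpers, and `embeddings` is an already loaded txtai `Embeddings` instance.

```python
def citation_query(title, abstract):
    """Combine a title and abstract into a single query string."""
    # Assumption: join both fields and collapse runs of whitespace
    return " ".join(f"{title}\n{abstract}".split())


def find_citations(embeddings, title, abstract, limit=10):
    """Search a loaded index for articles similar to a draft paper."""
    return embeddings.search(citation_query(title, abstract), limit)
```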

	## Build the index

	The following steps show how to build this index.

	- Install the required build dependencies
	```bash
	pip install ragdata
	```

	- Follow these [instructions](https://huggingface.co/datasets/arxiv_dataset/blob/main/arxiv_dataset.py#L67) to download the dataset

	- Build the txtai-arxiv index
	```bash
	python -m ragdata.arxiv.index \
	-d <path to directory with file downloaded in previous step> \
	-o txtai-arxiv
	```
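Once the build finishes, the output can be loaded from disk instead of the Hub. The sketch below assumes the build command wrote the index to a local `txtai-arxiv` directory, matching the `-o` argument above. `load_local_index` is a hypothetical helper, and the directory check is an added convenience rather than something txtai requires.

```python
import os


def load_local_index(path="txtai-arxiv"):
    """Load a locally built txtai index from a directory."""
    # Fail early with a clear message if the build output is missing
    if not os.path.isdir(path):
        raise FileNotFoundError(f"No index directory found at: {path}")

    # Imported here so the path check works even without txtai installed
    from txtai.embeddings import Embeddings

    embeddings = Embeddings()
    embeddings.load(path)
    return embeddings
```

The returned object supports the same `search` calls shown in the example section.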

	## More information

	See the following links for more information on the arXiv metadata dataset.

	- [Dataset on Hugging Face](https://huggingface.co/datasets/arxiv_dataset)
	- [Dataset on Kaggle](https://www.kaggle.com/datasets/Cornell-University/arxiv)
	- [Metadata description](https://info.arxiv.org/help/prep.html)