Description

We use the MS MARCO bi-encoder msmarco-MiniLM-L-6-v3 from the sentence-transformers library to encode the text of the abokbot/wikipedia-first-paragraph dataset.

The dataset contains the first paragraph of each article in the English "20220301.en" version of the Wikipedia dataset.

The output is an embedding tensor of shape [6458670, 384], i.e. one 384-dimensional vector per first paragraph.
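
If you only want to reuse the precomputed embeddings, the tensor can be downloaded from this repository and loaded with torch. A minimal sketch, assuming the tensor was saved with torch.save under the file name wikipedia_embedding.pt (an assumption; check the repository's file listing for the actual name):

import torch
from huggingface_hub import hf_hub_download

# Download the precomputed tensor from this model repository.
# "wikipedia_embedding.pt" is an assumed file name.
path = hf_hub_download(
    repo_id="abokbot/wikipedia-embedding",
    filename="wikipedia_embedding.pt",
)
wikipedia_embedding = torch.load(path)  # shape: [6458670, 384]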

Code

It was obtained by running the following code.

from datasets import load_dataset
from sentence_transformers import SentenceTransformer

# load_dataset returns a DatasetDict, so select the "train" split
# before reading the "text" column.
dataset = load_dataset("abokbot/wikipedia-first-paragraph")
bi_encoder = SentenceTransformer("msmarco-MiniLM-L-6-v3")
bi_encoder.max_seq_length = 256  # truncate inputs to 256 tokens
wikipedia_embedding = bi_encoder.encode(
    dataset["train"]["text"],
    convert_to_tensor=True,
    show_progress_bar=True,
)

This operation took 35 minutes on a Google Colab notebook with a GPU.
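
Once the tensor is in memory, it can serve as the corpus for semantic search: encode a query with the same bi-encoder and rank paragraphs by cosine similarity. A minimal sketch using sentence_transformers.util.semantic_search; the query string is only an example, and the file name is the same assumption as above:

import torch
from datasets import load_dataset
from sentence_transformers import SentenceTransformer, util

dataset = load_dataset("abokbot/wikipedia-first-paragraph")
bi_encoder = SentenceTransformer("msmarco-MiniLM-L-6-v3")

# Precomputed corpus embeddings, produced by the code above
# (or downloaded from this repository; assumed file name).
wikipedia_embedding = torch.load("wikipedia_embedding.pt")

query_embedding = bi_encoder.encode(
    "Who wrote the novel The Master and Margarita?",
    convert_to_tensor=True,
)

# Rank the full corpus against the query and keep the top 5 hits.
hits = util.semantic_search(query_embedding, wikipedia_embedding, top_k=5)[0]
for hit in hits:
    print(round(hit["score"], 3), dataset["train"][hit["corpus_id"]]["text"][:100])

semantic_search scores the corpus in chunks, so the full tensor can be scanned without materializing the entire query-corpus similarity matrix at once.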

Reference

More information on MS MARCO encoders is available at https://www.sbert.net/docs/pretrained-models/ce-msmarco.html
