---
pipeline_tag: sentence-similarity
tags:
- ctranslate2
- int8
- float16
- sentence-transformers
- feature-extraction
- sentence-similarity
language: en
license: apache-2.0
datasets:
- s2orc
- flax-sentence-embeddings/stackexchange_xml
- MS Marco
- gooaq
- yahoo_answers_topics
- code_search_net
- search_qa
- eli5
- snli
- multi_nli
- wikihow
- natural_questions
- trivia_qa
- embedding-data/sentence-compression
- embedding-data/flickr30k-captions
- embedding-data/altlex
- embedding-data/simple-wiki
- embedding-data/QQP
- embedding-data/SPECTER
- embedding-data/PAQ_pairs
- embedding-data/WikiAnswers
---

# Fast-Inference with Ctranslate2

Speed up inference while reducing memory use by 2x-4x using int8 inference in C++ on CPU or GPU.
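The 2x-4x range follows from storage width: an int8 weight takes 1 byte, a float16 weight 2 bytes, and a float32 weight 4 bytes. The sketch below is only back-of-the-envelope arithmetic. `NUM_WEIGHTS` is a hypothetical round number, not the parameter count of this model, and the estimate ignores quantization scales, activations, and any layers kept in higher precision.

```python
# Bytes needed to store one weight in each numeric format.
BYTES_PER_WEIGHT = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_mb(num_weights: int, dtype: str) -> float:
    """Approximate weight storage in megabytes (10**6 bytes)."""
    return num_weights * BYTES_PER_WEIGHT[dtype] / 1e6


# Hypothetical parameter count, chosen only for illustration.
NUM_WEIGHTS = 30_000_000

for dtype in ("float32", "float16", "int8"):
    print(f"{dtype:>8}: {weight_memory_mb(NUM_WEIGHTS, dtype):6.1f} MB")

# Reduction factors of int8 relative to the two floating-point formats.
ratio_vs_fp32 = weight_memory_mb(NUM_WEIGHTS, "float32") / weight_memory_mb(NUM_WEIGHTS, "int8")
ratio_vs_fp16 = weight_memory_mb(NUM_WEIGHTS, "float16") / weight_memory_mb(NUM_WEIGHTS, "int8")
print(f"int8 vs float32: {ratio_vs_fp32:.0f}x, int8 vs float16: {ratio_vs_fp16:.0f}x")
```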

This is a quantized version of sentence-transformers/all-MiniLM-L12-v2.

pip install "hf-hub-ctranslate2>=2.12.0" "ctranslate2>=3.17.1"

# from transformers import AutoTokenizer
model_name = "michaelfeil/ct2fast-all-MiniLM-L12-v2"
model_name_orig="sentence-transformers/all-MiniLM-L12-v2"

from hf_hub_ctranslate2 import EncoderCT2fromHfHub
model = EncoderCT2fromHfHub(
        # load in int8 on CUDA
        model_name_or_path=model_name,
        device="cuda",
        compute_type="int8_float16"
)
outputs = model.generate(
    text=["I like soccer", "I like tennis", "The eiffel tower is in Paris"],
    max_length=64,
) # perform downstream tasks on outputs
outputs["pooler_output"]
outputs["last_hidden_state"]
outputs["attention_mask"]

# alternative, use SentenceTransformer Mix-In
# for end-to-end Sentence embeddings generation
# (not pulling from this CT2fast-HF repo)

from hf_hub_ctranslate2 import CT2SentenceTransformer
model = CT2SentenceTransformer(
    model_name_orig, compute_type="int8_float16", device="cuda"
)
embeddings = model.encode(
    ["I like soccer", "I like tennis", "The eiffel tower is in Paris"],
    batch_size=32,
    convert_to_numpy=True,
    normalize_embeddings=True,
)
print(embeddings.shape, embeddings)
scores = (embeddings @ embeddings.T) * 100

# Hint: you can also host this code behind a REST API
# using github.com/michaelfeil/infinity
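The last step of the example, `(embeddings @ embeddings.T) * 100`, works because `normalize_embeddings=True` yields unit-length vectors, so each dot product is a cosine similarity. The sketch below reproduces that arithmetic in plain Python, without a model or GPU. The helper names and the toy 3-dimensional vectors are illustrative assumptions, not part of `hf_hub_ctranslate2`.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def similarity_scores(embeddings):
    """Pairwise dot products of unit vectors, scaled by 100."""
    unit = [l2_normalize(v) for v in embeddings]
    return [[100 * sum(a * b for a, b in zip(u, v)) for v in unit] for u in unit]


# Toy 3-dimensional vectors standing in for real sentence embeddings.
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.8, 0.6, 0.0],
    [0.0, 0.0, 1.0],
]

scores = similarity_scores(toy_embeddings)
for row in scores:
    print([round(value, 1) for value in row])
```

Each diagonal entry is the similarity of a vector with itself, and orthogonal vectors score zero; with real embeddings, higher off-diagonal values indicate more similar sentences.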

Checkpoint compatible with ctranslate2>=3.17.1 and hf-hub-ctranslate2>=2.12.0

- `compute_type=int8_float16` for `device="cuda"`
- `compute_type=int8` for `device="cpu"`
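The device-to-`compute_type` pairing above can be captured in a small lookup so that calling code does not hard-code either value. This is a minimal sketch: `default_compute_type` is a hypothetical helper written for this example, not a function provided by `ctranslate2` or `hf_hub_ctranslate2`.

```python
def default_compute_type(device: str) -> str:
    """Return the compute_type this model card recommends for a device."""
    recommended = {"cuda": "int8_float16", "cpu": "int8"}
    try:
        return recommended[device]
    except KeyError:
        # Only the two devices named in the model card are covered here.
        raise ValueError(f"no recommendation for device {device!r}") from None


for device in ("cuda", "cpu"):
    print(device, "->", default_compute_type(device))
```

The returned string can then be passed as the `compute_type` argument alongside the matching `device` when constructing `EncoderCT2fromHfHub` or `CT2SentenceTransformer`.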

Converted on 2023-10-13 using

LLama-2 -> removed <pad> token.

# Licence and other remarks:

This is just a quantized version. Licence conditions are intended to be identical to those of the original Hugging Face repo.
