
---
pipeline_tag: sentence-similarity
tags:
- ctranslate2
- int8
- float16
- sentence-transformers
- feature-extraction
- sentence-similarity
language: en
license: apache-2.0
datasets:
- s2orc
- flax-sentence-embeddings/stackexchange_xml
- MS Marco
- gooaq
- yahoo_answers_topics
- code_search_net
- search_qa
- eli5
- snli
- multi_nli
- wikihow
- natural_questions
- trivia_qa
- embedding-data/sentence-compression
- embedding-data/flickr30k-captions
- embedding-data/altlex
- embedding-data/simple-wiki
- embedding-data/QQP
- embedding-data/SPECTER
- embedding-data/PAQ_pairs
- embedding-data/WikiAnswers
---

# Fast Inference with CTranslate2

Speed up inference and reduce memory use by 2x-4x with int8 inference in C++ on CPU or GPU.

This is a quantized version of sentence-transformers/all-MiniLM-L12-v2.
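
To make the 2x-4x figure concrete: an int8 weight takes one byte, versus four bytes for float32 and two for float16. The sketch below is a minimal NumPy illustration of symmetric per-row int8 quantization on a random matrix. It is not CTranslate2's actual implementation, and the helper names `quantize_int8` and `dequantize_int8` are hypothetical.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-row quantization: one float scale per row, int8 values."""
    max_abs = np.abs(weights).max(axis=1, keepdims=True)
    max_abs = np.where(max_abs == 0, 1.0, max_abs)  # avoid division by zero
    scale = 127.0 / max_abs
    q = np.clip(np.round(weights * scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)

def dequantize_int8(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Map int8 values back to approximate float32 weights."""
    return q.astype(np.float32) / scale

# Random stand-in for a weight matrix (not taken from the real model).
rng = np.random.default_rng(0)
w32 = rng.standard_normal((256, 384)).astype(np.float32)
w16 = w32.astype(np.float16)
q, scale = quantize_int8(w32)

# Storage ratios for the weights alone; the per-row scales add a small overhead.
ratio_vs_fp32 = w32.nbytes / q.nbytes
ratio_vs_fp16 = w16.nbytes / q.nbytes
max_error = float(np.abs(dequantize_int8(q, scale) - w32).max())
print(ratio_vs_fp32, ratio_vs_fp16, max_error)
```

The rounding error per weight is bounded by half a quantization step, which is why int8 inference usually costs only a small amount of accuracy.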

pip install "hf-hub-ctranslate2>=2.12.0" "ctranslate2>=3.17.1"
# from transformers import AutoTokenizer
model_name = "michaelfeil/ct2fast-all-MiniLM-L12-v2"
model_name_orig = "sentence-transformers/all-MiniLM-L12-v2"

from hf_hub_ctranslate2 import EncoderCT2fromHfHub
model = EncoderCT2fromHfHub(
    # load in int8 on CUDA
    model_name_or_path=model_name,
    device="cuda",
    compute_type="int8_float16",
)
outputs = model.generate(
    text=["I like soccer", "I like tennis", "The eiffel tower is in Paris"],
    max_length=64,
) # perform downstream tasks on outputs
outputs["pooler_output"]
outputs["last_hidden_state"]
outputs["attention_mask"]

# Alternatively, use the SentenceTransformer mix-in
# for end-to-end sentence embedding generation
# (this does not pull from this CT2fast-HF repo)

from hf_hub_ctranslate2 import CT2SentenceTransformer
model = CT2SentenceTransformer(
    model_name_orig, compute_type="int8_float16", device="cuda"
)
embeddings = model.encode(
    ["I like soccer", "I like tennis", "The eiffel tower is in Paris"],
    batch_size=32,
    convert_to_numpy=True,
    normalize_embeddings=True,
)
print(embeddings.shape, embeddings)
scores = (embeddings @ embeddings.T) * 100

# Hint: you can also serve this code over a REST API
# using github.com/michaelfeil/infinity
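
With the first API, `last_hidden_state` and `attention_mask` still need pooling before they become sentence embeddings. The sketch below shows attention-masked mean pooling followed by L2 normalization, the pooling scheme the all-MiniLM models use in sentence-transformers, and then the same scaled cosine score as `scores` above. It runs on random NumPy arrays as stand-ins: the tensor type returned by `model.generate` is not assumed here, and `mean_pool` and `l2_normalize` are hypothetical helpers.

```python
import numpy as np

def mean_pool(last_hidden_state: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Average token vectors per sentence, ignoring padded positions."""
    mask = attention_mask[..., None].astype(last_hidden_state.dtype)  # (batch, seq, 1)
    summed = (last_hidden_state * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # guard against all-padding rows
    return summed / counts

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length."""
    return x / np.clip(np.linalg.norm(x, axis=1, keepdims=True), 1e-12, None)

# Stand-in data: 3 sentences, 5 token positions, 8 hidden dimensions.
rng = np.random.default_rng(0)
hidden = rng.standard_normal((3, 5, 8)).astype(np.float32)
mask = np.array([[1, 1, 1, 0, 0],
                 [1, 1, 1, 1, 1],
                 [1, 1, 0, 0, 0]])

sentence_embeddings = l2_normalize(mean_pool(hidden, mask))
# With unit-length rows, the dot product equals cosine similarity.
cosine_scores = (sentence_embeddings @ sentence_embeddings.T) * 100
print(cosine_scores.round(1))
```

Because the rows are normalized, the diagonal of `cosine_scores` is 100 and every other entry lies between -100 and 100.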

Checkpoint compatible with ctranslate2>=3.17.1 and hf-hub-ctranslate2>=2.12.0:

  • compute_type=int8_float16 for device="cuda"
  • compute_type=int8 for device="cpu"
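
The device-to-compute-type pairing above can be captured in a small lookup. This is only a sketch: `pick_compute_type` is a hypothetical helper, not part of hf-hub-ctranslate2 or ctranslate2, and it encodes just the two combinations listed here.

```python
# Recommended pairs from this model card; other combinations are not covered here.
RECOMMENDED_COMPUTE_TYPE = {
    "cuda": "int8_float16",
    "cpu": "int8",
}

def pick_compute_type(device: str) -> str:
    """Return the compute_type this card recommends for the given device."""
    try:
        return RECOMMENDED_COMPUTE_TYPE[device]
    except KeyError:
        raise ValueError(f"no recommendation for device {device!r}") from None

print(pick_compute_type("cuda"))
print(pick_compute_type("cpu"))
```

The returned string can be passed as `compute_type=` together with the matching `device=` to either constructor shown earlier.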

Converted on 2023-10-13 using

Llama-2 -> removed <pad> token.

Licence and other remarks:

This is just a quantized version. Licence conditions are intended to be identical to those of the original Hugging Face repo.

Original description
