ct2fast-all-MiniLM-L12-v2 / README.md

nitsuai

Duplicate from michaelfeil/ct2fast-all-MiniLM-L12-v2

2094c87 verified 5 months ago

	---
	pipeline_tag: sentence-similarity
	tags:
	- ctranslate2
	- int8
	- float16
	- sentence-transformers
	- feature-extraction
	- sentence-similarity
	language: en
	license: apache-2.0
	datasets:
	- s2orc
	- flax-sentence-embeddings/stackexchange_xml
	- MS Marco
	- gooaq
	- yahoo_answers_topics
	- code_search_net
	- search_qa
	- eli5
	- snli
	- multi_nli
	- wikihow
	- natural_questions
	- trivia_qa
	- embedding-data/sentence-compression
	- embedding-data/flickr30k-captions
	- embedding-data/altlex
	- embedding-data/simple-wiki
	- embedding-data/QQP
	- embedding-data/SPECTER
	- embedding-data/PAQ_pairs
	- embedding-data/WikiAnswers

	---
	# Fast-Inference with CTranslate2
	Speed up inference while reducing memory by 2x-4x using int8 inference in C++ on CPU or GPU.

	Quantized version of [sentence-transformers/all-MiniLM-L12-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L12-v2)
	```bash
	# quote the specifiers so the shell does not treat ">=" as a redirection
	pip install "hf-hub-ctranslate2>=2.12.0" "ctranslate2>=3.17.1"
	```

	```python
	# from transformers import AutoTokenizer
	model_name = "michaelfeil/ct2fast-all-MiniLM-L12-v2"
	model_name_orig="sentence-transformers/all-MiniLM-L12-v2"

	from hf_hub_ctranslate2 import EncoderCT2fromHfHub

	# load in int8 on CUDA
	model = EncoderCT2fromHfHub(
	    model_name_or_path=model_name,
	    device="cuda",
	    compute_type="int8_float16",
	)
	outputs = model.generate(
	    text=["I like soccer", "I like tennis", "The eiffel tower is in Paris"],
	    max_length=64,
	)
	# perform downstream tasks on outputs
	outputs["pooler_output"]
	outputs["last_hidden_state"]
	outputs["attention_mask"]

	# alternative, use SentenceTransformer Mix-In
	# for end-to-end Sentence embeddings generation
	# (not pulling from this CT2fast-HF repo)

	from hf_hub_ctranslate2 import CT2SentenceTransformer

	model = CT2SentenceTransformer(
	    model_name_orig, compute_type="int8_float16", device="cuda"
	)
	embeddings = model.encode(
	    ["I like soccer", "I like tennis", "The eiffel tower is in Paris"],
	    batch_size=32,
	    convert_to_numpy=True,
	    normalize_embeddings=True,
	)
	print(embeddings.shape, embeddings)
	scores = (embeddings @ embeddings.T) * 100

	# Hint: you can also host this code behind a REST API
	# via github.com/michaelfeil/infinity


	```
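
Because the embeddings above are requested with `normalize_embeddings=True`, the product `embeddings @ embeddings.T` is a matrix of cosine similarities. The following is a minimal sketch of one way to read that matrix, assuming NumPy is installed; the array values are made-up stand-ins for real model output, and `most_similar` is a hypothetical helper, not part of `hf_hub_ctranslate2`.

```python
import numpy as np

# Stand-in for the array returned by model.encode(..., normalize_embeddings=True).
# The values are invented for illustration; each row has unit length.
embeddings = np.array([
    [0.6, 0.8, 0.0],
    [0.8, 0.6, 0.0],
    [0.0, 0.0, 1.0],
])


def most_similar(embeddings: np.ndarray) -> np.ndarray:
    """Return, for each row, the index of the most similar other row."""
    scores = embeddings @ embeddings.T  # cosine similarity for unit-length rows
    np.fill_diagonal(scores, -np.inf)   # ignore each row's similarity to itself
    return scores.argmax(axis=1)


nearest = most_similar(embeddings)
print(nearest)
```

With real embeddings, row `i` of the result points at the sentence closest to sentence `i` in the batch.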

	Checkpoint compatible with [ctranslate2>=3.17.1](https://github.com/OpenNMT/CTranslate2)
	and [hf-hub-ctranslate2>=2.12.0](https://github.com/michaelfeil/hf-hub-ctranslate2)
	- `compute_type=int8_float16` for `device="cuda"`
	- `compute_type=int8` for `device="cpu"`
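
The device-to-compute-type pairing listed above can be captured in a small helper so that one script runs on either device. This is only a sketch: `pick_compute_type` is a hypothetical name introduced here, and it covers just the two combinations this README lists.

```python
def pick_compute_type(device: str) -> str:
    """Map a device string to the compute type this README recommends."""
    if device.startswith("cuda"):
        return "int8_float16"
    if device == "cpu":
        return "int8"
    # Other devices are not covered by this README, so fail loudly.
    raise ValueError(f"unsupported device: {device!r}")


device = "cpu"
compute_type = pick_compute_type(device)
print(device, compute_type)
```

The returned string can then be passed as `compute_type=` alongside `device=` when constructing `EncoderCT2fromHfHub` or `CT2SentenceTransformer`, as in the example earlier in this README.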

	Converted on 2023-10-13 using
	```
	LLama-2 -> removed <pad> token.
	```

	# Licence and other remarks:
	This is just a quantized version. Licence conditions are intended to be identical to those of the original Hugging Face repo.

	# Original description