	---
	pipeline_tag: sentence-similarity
	license: apache-2.0
	language: "en"
	tags:
	- sentence-transformers
	- feature-extraction
	- sentence-similarity
	- transformers
	datasets:
	- ms_marco
	---

	# sentence-transformers/msmarco-distilbert-base-tas-b

	This is a port of the [DistilBert TAS-B Model](https://huggingface.co/sebastian-hofstaetter/distilbert-dot-tas_b-b256-msmarco) to the [sentence-transformers](https://www.SBERT.net) framework. It maps sentences and paragraphs to a 768-dimensional dense vector space and is optimized for semantic search.



	## Usage (Sentence-Transformers)

	Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:

	```
	pip install -U sentence-transformers
	```

	Then you can use the model like this:

	```python
	from sentence_transformers import SentenceTransformer, util

	query = "How many people live in London?"
	docs = ["Around 9 Million people live in London", "London is known for its financial district"]

	#Load the model
	model = SentenceTransformer('sentence-transformers/msmarco-distilbert-base-tas-b')

	#Encode query and documents
	query_emb = model.encode(query)
	doc_emb = model.encode(docs)

	#Compute dot score between query and all document embeddings
	scores = util.dot_score(query_emb, doc_emb)[0].cpu().tolist()

	#Combine docs & scores
	doc_score_pairs = list(zip(docs, scores))

	#Sort by decreasing score
	doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)

	#Output passages & scores
	for doc, score in doc_score_pairs:
	    print(score, doc)
	```
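
To make the scoring step concrete, here is a minimal pure-Python sketch of what the dot-score ranking above computes. The 3-dimensional vectors are made-up stand-ins for the real 768-dimensional embeddings, and `dot` and `rank_by_dot_score` are hypothetical helper names used only for illustration, not part of sentence-transformers.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def rank_by_dot_score(query_emb, doc_embs, docs):
    """Return (doc, score) pairs sorted by decreasing dot-product score."""
    scores = [dot(query_emb, d) for d in doc_embs]
    return sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)


# Toy embeddings (assumed values, for illustration only)
query_emb = [1.0, 0.0, 2.0]
doc_embs = [[0.5, 1.0, 0.0], [2.0, 0.0, 1.0]]
docs = ["doc A", "doc B"]

ranked = rank_by_dot_score(query_emb, doc_embs, docs)
for doc, score in ranked:
    print(score, doc)
```

Unlike cosine similarity, the dot product is not normalized by vector length, so the magnitude of an embedding affects its score as well as its direction.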



	## Usage (HuggingFace Transformers)
	Without [sentence-transformers](https://www.SBERT.net), you can use the model like this: first pass your input through the transformer model, then apply the right pooling operation (CLS pooling for this model) on top of the contextualized word embeddings.

	```python
	from transformers import AutoTokenizer, AutoModel
	import torch

	#CLS Pooling - Take output from first token
	def cls_pooling(model_output):
	    return model_output.last_hidden_state[:,0]

	#Encode text
	def encode(texts):
	    # Tokenize sentences
	    encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')

	    # Compute token embeddings
	    with torch.no_grad():
	        model_output = model(**encoded_input, return_dict=True)

	    # Perform pooling
	    embeddings = cls_pooling(model_output)

	    return embeddings


	# Sentences we want sentence embeddings for
	query = "How many people live in London?"
	docs = ["Around 9 Million people live in London", "London is known for its financial district"]

	# Load model from HuggingFace Hub
	tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/msmarco-distilbert-base-tas-b")
	model = AutoModel.from_pretrained("sentence-transformers/msmarco-distilbert-base-tas-b")

	#Encode query and docs
	query_emb = encode(query)
	doc_emb = encode(docs)

	#Compute dot score between query and all document embeddings
	scores = torch.mm(query_emb, doc_emb.transpose(0, 1))[0].cpu().tolist()

	#Combine docs & scores
	doc_score_pairs = list(zip(docs, scores))

	#Sort by decreasing score
	doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)

	#Output passages & scores
	for doc, score in doc_score_pairs:
	    print(score, doc)
	```
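
As a rough illustration of why the choice of pooling matters, the sketch below contrasts CLS pooling with mean pooling on a tiny hand-written batch, using plain Python lists in place of tensors. The values, the `attention_mask`, and the helper names `cls_pool` and `mean_pool` are assumptions made up for this example. Mean pooling is shown only for contrast.

```python
def cls_pool(token_embeddings):
    """CLS pooling: keep only the first token's vector of each sequence."""
    return [seq[0] for seq in token_embeddings]


def mean_pool(token_embeddings, attention_mask):
    """Mean pooling over non-padding tokens, shown only for contrast."""
    pooled = []
    for seq, mask in zip(token_embeddings, attention_mask):
        kept = [vec for vec, m in zip(seq, mask) if m == 1]
        dim = len(seq[0])
        pooled.append([sum(v[i] for v in kept) / len(kept) for i in range(dim)])
    return pooled


# Toy batch: 1 sequence, 3 token positions (the last is padding), hidden size 2
token_embeddings = [[[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]]
attention_mask = [[1, 1, 0]]

cls_vectors = cls_pool(token_embeddings)
mean_vectors = mean_pool(token_embeddings, attention_mask)
print(cls_vectors)
print(mean_vectors)
```

The two strategies return different vectors for the same input, so embeddings produced with a pooling mode other than the one the model is configured for are unlikely to be comparable with the ones shown in this README.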



	## Evaluation Results



	For an automated evaluation of this model, see the Sentence Embeddings Benchmark: [https://seb.sbert.net](https://seb.sbert.net?model_name=sentence-transformers/msmarco-distilbert-base-tas-b)



	## Full Model Architecture
	```
	SentenceTransformer(
	  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: DistilBertModel
	  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
	)
	```

	## Citing & Authors

	Have a look at: [DistilBert TAS-B Model](https://huggingface.co/sebastian-hofstaetter/distilbert-dot-tas_b-b256-msmarco)