nekoboost committed on
Commit cb1dd4d · 1 Parent(s): 83e473d

Create README.md

Files changed (1)
  1. README.md +39 -0
README.md ADDED
@@ -0,0 +1,39 @@
+ ---
+ license: cc-by-nc-sa-4.0
+ language:
+ - cs
+ - en
+ pipeline_tag: sentence-similarity
+ ---
+
+ ## SimCSE
+
+ SimCSE-RetroMAE-Small is the [Seznam/dist-mpnet-czeng-cs-en](https://huggingface.co/Seznam/dist-mpnet-czeng-cs-en) model fine-tuned with the [SimCSE](https://arxiv.org/abs/2104.08821) objective.
+
+ This model was created at Seznam.cz as part of a project to build high-quality small Czech semantic embedding models. These models perform well across a range of natural language processing tasks, including similarity search, retrieval, clustering, and classification. For further details and evaluation results, see the associated [paper]() or the [GitHub repository](https://github.com/seznam/czech-semantic-embedding-models).
+
+ ## How to Use
+
+ You can load and use the model like this:
+
+ ```python
+ import torch
+ from transformers import AutoModel, AutoTokenizer
+
+ model_name = "Seznam/retromae-small-cs"  # Hugging Face model ID
+ tokenizer = AutoTokenizer.from_pretrained(model_name)
+ model = AutoModel.from_pretrained(model_name)
+
+ input_texts = [
+     "Dnes je výborné počasí na procházku po parku.",
+     "Večer si oblíbím dobrý film a uvařím si čaj."
+ ]
+
+ # Tokenize the input texts
+ batch_dict = tokenizer(input_texts, max_length=512, padding=True, truncation=True, return_tensors='pt')
+
+ # Run the model without tracking gradients (inference only)
+ with torch.no_grad():
+     outputs = model(**batch_dict)
+ embeddings = outputs.last_hidden_state[:, 0]  # Extract CLS token embeddings
+
+ # Cosine similarity between the two sentence embeddings
+ similarity = torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0)
+ ```
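
The last step of the README snippet reduces two embeddings to a single cosine score. As a rough, dependency-free illustration of what that score is and how it could be used to rank several candidates against a query, here is a minimal sketch. The helper names `cosine_similarity` and `rank_by_similarity` and the toy vectors are assumptions made for this example; they are not part of the model card or of any library, and the arithmetic only approximates `torch.nn.functional.cosine_similarity` for 1-D inputs (the epsilon handling may differ slightly).

```python
import math


def cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity of two equal-length vectors (sequences of floats)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # eps guards against division by zero when a vector is all zeros
    return dot / max(norm_a * norm_b, eps)


def rank_by_similarity(query, candidates):
    """Return candidate indices ordered from most to least similar to the query."""
    scores = [cosine_similarity(query, c) for c in candidates]
    return sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)


# Toy 3-dimensional vectors standing in for real embeddings (made-up values)
query = [1.0, 0.0, 1.0]
candidates = [[0.0, 1.0, 0.0], [2.0, 0.0, 2.0], [1.0, 1.0, 0.0]]
order = rank_by_similarity(query, candidates)
```

With real model output, each row of `embeddings` could be converted with something like `embeddings.detach().tolist()` before being passed to these helpers. For larger batches, staying in PyTorch and operating on the tensor directly would likely be the better choice.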