Update README.md
README.md
CHANGED
@@ -40,10 +40,19 @@ from transformers import AutoModel, AutoTokenizer
 model_name="jgrosjean-mathesis/swissbert-for-sentence-embeddings"
 model = AutoModel.from_pretrained(model_name)
 tokenizer = AutoTokenizer.from_pretrained(model_name)
-model.set_default_language("de_CH")
-
-def generate_sentence_embedding(sentence, ):
 
+def generate_sentence_embedding(sentence, language):
+
+    # Set adapter to specified language
+    if "de" in language:
+        model.set_default_language("de_CH")
+    if "fr" in language:
+        model.set_default_language("fr_CH")
+    if "it" in language:
+        model.set_default_language("it_CH")
+    if "rm" in language:
+        model.set_default_language("rm_CH")
+
     # Tokenize input sentence
     inputs = tokenizer(sentence, padding=True, truncation=True, return_tensors="pt", max_length=512)
 
@@ -56,7 +65,7 @@ def generate_sentence_embedding(sentence, ):
 
     return embedding
 
-sentence_embedding = generate_sentence_embedding("Wir feiern am 1. August den Schweizer Nationalfeiertag.")
+sentence_embedding = generate_sentence_embedding("Wir feiern am 1. August den Schweizer Nationalfeiertag.", language="de")
 print(sentence_embedding)
 ```
 Output:
@@ -67,6 +76,26 @@ tensor([[ 5.6306e-02, -2.8375e-01, -4.1495e-02, 7.4393e-02, -3.1552e-01,
 ...]])
 ```
 
+### Semantic Textual Similarity
+
+```python
+from sklearn.metrics.pairwise import cosine_similarity
+
+# Define two sentences
+sentence_1 = ["Der Zug kommt um 9 Uhr in Zürich an."]
+sentence_2 = ["Le train arrive à Lausanne à 9h."]
+
+# Compute embeddings for both
+embedding_1 = generate_sentence_embedding(sentence_1, language="de")
+embedding_2 = generate_sentence_embedding(sentence_2, language="fr")
+
+# Compute cosine similarity
+cosine_score = cosine_similarity(embedding_1, embedding_2)
+
+# Output the score
+print("The cosine score for", sentence_1, "and", sentence_2, "is", cosine_score)
+```
+
 ## Bias, Risks, and Limitations
 
 <!-- This section is meant to convey both technical and sociotechnical limitations. -->
@@ -162,18 +191,6 @@ Carbon emissions can be estimated using the [Machine Learning Impact calculator]
 
 [More Information Needed]
 
-### Compute Infrastructure
-
-[More Information Needed]
-
-#### Hardware
-
-[More Information Needed]
-
-#### Software
-
-[More Information Needed]
-
 ## Citation [optional]
 
 <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->