prudant committed
Commit cafbc1a
Parent(s): 411fed8

Update README.md

Files changed (1): README.md (+54, -0)
README.md CHANGED
@@ -31,6 +31,60 @@ This LSG variant has been adapted from our original model, ["hiiamsid/sentence_s
  The LSG-enhanced model is particularly adept at tasks involving longer documents, where capturing the essence of extended context is crucial.
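
The example below shows how to load the model, derive sentence embeddings via mean pooling, and compare sentences with cosine similarity: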

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('prudant/lsg_4096_sentence_similarity_spanish')
model = AutoModel.from_pretrained('prudant/lsg_4096_sentence_similarity_spanish', trust_remote_code=True)

def mean_pooling(model_output, attention_mask):
    # First element of model_output contains all token embeddings
    token_embeddings = model_output[0]
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)

# Sentences (in Spanish: "That is a happy person", "That is a happy dog",
# "That is a very happy person", "Today is a sunny day", "That is a cheerful person")
sentences = [
    'Esa es una persona feliz',
    'Ese es un perro feliz',
    'Esa es una persona muy feliz',
    'Hoy es un día soleado',
    'Esa es una persona alegre',
]

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)

# L2-normalize the embeddings
normalized_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)

# Cosine similarity between the first sentence and the rest
cosine_similarities = F.cosine_similarity(normalized_embeddings[0].unsqueeze(0), normalized_embeddings[1:], dim=1)

print(cosine_similarities)
```

Expected output:

```
Sentence embeddings:
tensor([[-0.1691, -0.2517, -1.3000,  ...,  0.1557,  0.3824,  0.2048],
        [ 0.1872, -0.7604, -0.4863,  ..., -0.4922, -0.1511, -0.8539],
        [-0.2467, -0.2373, -1.1708,  ...,  0.4637,  0.0616,  0.2841],
        [-0.2384,  0.1681, -0.3498,  ..., -0.2744, -0.1722, -1.2513],
        [ 0.2273, -0.2393, -1.6124,  ...,  0.6065,  0.2784, -0.3354]])

tensor([0.5132, 0.9346, 0.3471, 0.8543])
```
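
The similarity scores compare the first sentence against the other four: "Esa es una persona muy feliz" (0.9346) and "Esa es una persona alegre" (0.8543) score highest, while the unrelated "Hoy es un día soleado" (0.3471) scores lowest.

Because the LSG attention mechanism extends the context window to 4096 tokens, the same pipeline applies to long documents without chunking. Below is a minimal sketch reusing `tokenizer`, `model`, and `mean_pooling` from the example above; the document strings and query are placeholders, not from the original card:

```python
# Compare a query against long documents; sequences up to 4096 tokens are supported.
documents = [
    "Primer documento largo...",    # placeholder for a full-length document
    "Segundo documento largo...",   # placeholder for a full-length document
]
query = "persona feliz"

encoded = tokenizer([query] + documents, padding=True, truncation=True,
                    max_length=4096, return_tensors='pt')

with torch.no_grad():
    output = model(**encoded)

# Mean-pool and L2-normalize, as above
embeddings = F.normalize(mean_pooling(output, encoded['attention_mask']), p=2, dim=1)

# Cosine similarity between the query and each document
scores = F.cosine_similarity(embeddings[0].unsqueeze(0), embeddings[1:], dim=1)
print(scores)
```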

#### Acknowledgments

This model was adapted by Darío Muñoz Prudant. Thanks to the Hugging Face community and the contributors to the LSG attention mechanism for their resources and support.