---
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- transformers
datasets:
- shunk031/jsnli
license: cc-by-sa-4.0
language:
- ja
metrics:
- spearmanr
library_name: sentence-transformers
---

# sup-simcse-ja-base

This is a Japanese sentence embedding model: [cl-tohoku/bert-base-japanese-v3](https://huggingface.co/cl-tohoku/bert-base-japanese-v3) fine-tuned with supervised SimCSE on the JSNLI dataset (see the Model Summary below for details).

## Usage (Sentence-Transformers)

Using this model is straightforward once you have [sentence-transformers](https://www.SBERT.net) installed:

```
pip install -U fugashi[unidic-lite] sentence-transformers
```

Then you can use the model like this:

```python
from sentence_transformers import SentenceTransformer

# Japanese example sentences (roughly: "Hello, world!", "Sentence embeddings are
# the best! Shout that sentence embeddings are the best", "Dry extremely")
sentences = ["こんにちは、世界!", "文埋め込み最高!文埋め込み最高と叫びなさい", "極度乾燥しなさい"]

model = SentenceTransformer("sup-simcse-ja-base")
embeddings = model.encode(sentences)
print(embeddings)
```
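
The embeddings can then be compared for semantic similarity. A minimal follow-up sketch, continuing from the snippet above and assuming a recent sentence-transformers version that provides `util.cos_sim`:

```python
from sentence_transformers import util

# Cosine similarity of the first sentence against all three sentences
scores = util.cos_sim(embeddings[0], embeddings)
print(scores)
```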

## Usage (HuggingFace Transformers)

Without [sentence-transformers](https://www.SBERT.net), you can use the model like this: first, pass your input through the transformer model, then apply the right pooling operation on top of the contextualized word embeddings.

```python
from transformers import AutoTokenizer, AutoModel
import torch


def cls_pooling(model_output, attention_mask):
    # CLS pooling: use the embedding of the first token ([CLS])
    return model_output[0][:, 0]


# Sentences we want sentence embeddings for
sentences = ['This is an example sentence', 'Each sentence is converted']

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("sup-simcse-ja-base")
model = AutoModel.from_pretrained("sup-simcse-ja-base")

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, CLS pooling.
sentence_embeddings = cls_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
```
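
The resulting `sentence_embeddings` can be compared with plain PyTorch as well; a minimal sketch continuing from the snippet above (L2-normalizing first so the dot product equals cosine similarity):

```python
import torch.nn.functional as F

# Pairwise cosine similarities between all sentence embeddings
normalized = F.normalize(sentence_embeddings, p=2, dim=1)
similarity_matrix = normalized @ normalized.T
print(similarity_matrix)
```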

## Full Model Architecture
```
SentenceTransformer(
  (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
  (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)
```
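
For reference, roughly the same architecture can be assembled from individual sentence-transformers modules. This is an illustrative sketch only: it builds an untrained CLS-pooling model on top of the base checkpoint listed in the Model Summary, not the fine-tuned weights of this repository.

```python
from sentence_transformers import SentenceTransformer, models

# Transformer backbone (base checkpoint from the Model Summary)
word_embedding_model = models.Transformer("cl-tohoku/bert-base-japanese-v3", max_seq_length=512)

# CLS pooling, matching the Pooling module printed above
pooling_model = models.Pooling(
    word_embedding_model.get_word_embedding_dimension(),
    pooling_mode_cls_token=True,
    pooling_mode_mean_tokens=False,
    pooling_mode_max_tokens=False,
)

model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
print(model)  # prints an architecture like the one shown above
```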

## Model Summary

- Fine-tuning method: Supervised SimCSE
- Base model: [cl-tohoku/bert-base-japanese-v3](https://huggingface.co/cl-tohoku/bert-base-japanese-v3)
- Training dataset: [JSNLI](https://nlp.ist.i.kyoto-u.ac.jp/?%E6%97%A5%E6%9C%AC%E8%AA%9ESNLI%28JSNLI%29%E3%83%87%E3%83%BC%E3%82%BF%E3%82%BB%E3%83%83%E3%83%88)
- Pooling strategy: cls (with an extra MLP layer only during training)
- Hidden size: 768
- Learning rate: 5e-5
- Batch size: 512
- Temperature: 0.05
- Max sequence length: 64
- Number of training examples: 2^20
- Validation interval (steps): 2^6
- Warmup ratio: 0.1
- Dtype: BFloat16

See the [GitHub repository](https://github.com/hppRC/simple-simcse-ja) for a detailed experimental setup.
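
For orientation, a minimal sketch of the supervised SimCSE objective implied by the hyperparameters above: an InfoNCE-style cross-entropy over cosine similarities with temperature 0.05, using entailment hypotheses as positives and contradiction hypotheses as hard negatives. This is an illustrative reimplementation, not the actual training code, which lives in the linked repository.

```python
import torch
import torch.nn.functional as F


def sup_simcse_loss(anchors, positives, negatives, temperature=0.05):
    """Supervised SimCSE loss over (premise, entailment, contradiction) embeddings.

    Each argument is a [batch_size, hidden_size] tensor of sentence embeddings.
    """
    # Similarity of every anchor to every positive and every hard negative
    sim_pos = F.cosine_similarity(anchors.unsqueeze(1), positives.unsqueeze(0), dim=-1)
    sim_neg = F.cosine_similarity(anchors.unsqueeze(1), negatives.unsqueeze(0), dim=-1)
    logits = torch.cat([sim_pos, sim_neg], dim=1) / temperature

    # The matching entailment hypothesis of anchor i sits in column i
    labels = torch.arange(anchors.size(0), device=anchors.device)
    return F.cross_entropy(logits, labels)
```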

## Citing & Authors

```
@misc{hayato-tsukagoshi-2023-simple-simcse-ja,
  author       = {Hayato Tsukagoshi},
  title        = {Japanese Simple-SimCSE},
  year         = {2023},
  publisher    = {GitHub},
  journal      = {GitHub repository},
  howpublished = {\url{https://github.com/hppRC/simple-simcse-ja}}
}
```