yuji96 committed
Commit 6bad1b4
1 Parent(s): f1941ef

update readme

Files changed (1)
  1. README.md +11 -99
README.md CHANGED
@@ -1,111 +1,23 @@
  ---
  tags:
- - sentence-transformers
- - feature-extraction
- - sentence-similarity
- - transformers
  datasets:
- - shunk031/jsnli
  license: cc-by-sa-4.0
  language:
- - ja
  metrics:
- - spearmanr
  pipeline_tag: sentence-similarity
  library_name: generic
  ---

- # sup-simcse-ja-base
-
-
- ## Usage (Sentence-Transformers)
-
- Using this model is easy once you have [sentence-transformers](https://www.SBERT.net) installed:
-
- ```
- pip install -U fugashi[unidic-lite] sentence-transformers
- ```
-
- Then you can use the model like this:
-
- ```python
- from sentence_transformers import SentenceTransformer
- sentences = ["こんにちは、世界!", "文埋め込み最高!文埋め込み最高と叫びなさい", "極度乾燥しなさい"]
-
- model = SentenceTransformer("cl-nagoya/sup-simcse-ja-base")
- embeddings = model.encode(sentences)
- print(embeddings)
- ```
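-
- Since this is a sentence-similarity model, the embeddings are usually compared rather than printed. A minimal sketch of scoring two of the sentences above with sentence-transformers' `util.cos_sim` helper:
-
- ```python
- from sentence_transformers import SentenceTransformer, util
-
- model = SentenceTransformer("cl-nagoya/sup-simcse-ja-base")
- embeddings = model.encode(["こんにちは、世界!", "文埋め込み最高!文埋め込み最高と叫びなさい"])
-
- # Cosine similarity between the two sentence embeddings
- print(util.cos_sim(embeddings[0], embeddings[1]))
- ```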
-
- ## Usage (HuggingFace Transformers)
-
- Without [sentence-transformers](https://www.SBERT.net), you can use the model as follows: first pass your input through the transformer model, then apply the right pooling operation on top of the contextualized word embeddings.
-
- ```python
- import torch
- from transformers import AutoTokenizer, AutoModel
-
-
- def cls_pooling(model_output):
-     # Take the embedding of the first ([CLS]) token as the sentence embedding
-     return model_output[0][:, 0]
-
-
- # Sentences we want sentence embeddings for
- sentences = ["こんにちは、世界!", "文埋め込み最高!文埋め込み最高と叫びなさい", "極度乾燥しなさい"]
-
- # Load model from HuggingFace Hub
- tokenizer = AutoTokenizer.from_pretrained("cl-nagoya/sup-simcse-ja-base")
- model = AutoModel.from_pretrained("cl-nagoya/sup-simcse-ja-base")
-
- # Tokenize sentences
- encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
-
- # Compute token embeddings
- with torch.no_grad():
-     model_output = model(**encoded_input)
-
- # Perform pooling. In this case, CLS pooling.
- sentence_embeddings = cls_pooling(model_output)
-
- print("Sentence embeddings:")
- print(sentence_embeddings)
- ```
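-
- To turn these embeddings into similarity scores, one option is to L2-normalize them and take the matrix product, which yields pairwise cosine similarities. A small sketch continuing from the snippet above (`sentence_embeddings` is the tensor computed there):
-
- ```python
- import torch.nn.functional as F
-
- # Normalize each embedding to unit length; the matrix product is then
- # the pairwise cosine-similarity matrix of the three sentences.
- normalized = F.normalize(sentence_embeddings, p=2, dim=1)
- print(normalized @ normalized.T)
- ```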
-
- ## Full Model Architecture
- ```
- SentenceTransformer(
-   (0): Transformer({'max_seq_length': 512, 'do_lower_case': False}) with Transformer model: BertModel
-   (1): Pooling({'word_embedding_dimension': 768, 'pooling_mode_cls_token': True, 'pooling_mode_mean_tokens': False, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
- )
- ```
-
- ## Model Summary
-
- - Fine-tuning method: Supervised SimCSE
- - Base model: [cl-tohoku/bert-base-japanese-v3](https://huggingface.co/cl-tohoku/bert-base-japanese-v3)
- - Training dataset: [JSNLI](https://nlp.ist.i.kyoto-u.ac.jp/?%E6%97%A5%E6%9C%AC%E8%AA%9ESNLI%28JSNLI%29%E3%83%87%E3%83%BC%E3%82%BF%E3%82%BB%E3%83%83%E3%83%88)
- - Pooling strategy: cls (with an extra MLP layer only during training)
- - Hidden size: 768
- - Learning rate: 5e-5
- - Batch size: 512
- - Temperature: 0.05
- - Max sequence length: 64
- - Number of training examples: 2^20
- - Validation interval (steps): 2^6
- - Warmup ratio: 0.1
- - Dtype: BFloat16
-
- See the [GitHub repository](https://github.com/hppRC/simple-simcse-ja) for a detailed experimental setup.
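-
- For orientation, supervised SimCSE trains with an InfoNCE-style contrastive loss over (premise, entailment hypothesis, contradiction hypothesis) triples from the NLI data, using the temperature above. A schematic sketch of that objective (an illustration, not the repository's training code):
-
- ```python
- import torch
- import torch.nn.functional as F
-
-
- def sup_simcse_loss(anchors, positives, negatives, temperature=0.05):
-     # anchors, positives, negatives: (batch, hidden) sentence embeddings
-     pos_sim = F.cosine_similarity(anchors.unsqueeze(1), positives.unsqueeze(0), dim=-1)
-     neg_sim = F.cosine_similarity(anchors.unsqueeze(1), negatives.unsqueeze(0), dim=-1)
-     # Row i must pick out column i (its own entailment hypothesis) among
-     # all in-batch positives and hard negatives.
-     logits = torch.cat([pos_sim, neg_sim], dim=1) / temperature
-     labels = torch.arange(anchors.size(0), device=anchors.device)
-     return F.cross_entropy(logits, labels)
- ```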
-
- ## Citing & Authors
-
- ```
- @misc{hayato-tsukagoshi-2023-simple-simcse-ja,
-   author = {Hayato Tsukagoshi},
-   title = {Japanese Simple-SimCSE},
-   year = {2023},
-   publisher = {GitHub},
-   journal = {GitHub repository},
-   howpublished = {\url{https://github.com/hppRC/simple-simcse-ja}}
- }
- ```

  ---
  tags:
+ - sentence-transformers
+ - feature-extraction
+ - sentence-similarity
+ - transformers
  datasets:
+ - shunk031/jsnli
  license: cc-by-sa-4.0
  language:
+ - ja
  metrics:
+ - spearmanr
  pipeline_tag: sentence-similarity
  library_name: generic
  ---

+ This repository is an experiment in getting the sentence-transformers widget to support Japanese.
+ It is public so that the generic library can run.

+ Every file other than pipeline.py, README.md, and requirements.txt is a copy of [cl-nagoya/sup-simcse-ja-base](https://huggingface.co/cl-nagoya/sup-simcse-ja-base) (CC BY-SA 4.0).
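
+ For context, as I understand the generic inference setup, pipeline.py exposes a `PreTrainedPipeline` class that the widget backend instantiates and calls with the sentence-similarity payload; treat the exact class name and payload shape here as assumptions. A rough sketch of such a file, reusing the CLS pooling from the model card:

+ ```python
+ from typing import Dict, List, Union
+
+ import torch
+ from transformers import AutoModel, AutoTokenizer
+
+
+ class PreTrainedPipeline:
+     def __init__(self, path=""):
+         # Load the copied weights and tokenizer from this repository
+         self.tokenizer = AutoTokenizer.from_pretrained(path)
+         self.model = AutoModel.from_pretrained(path)
+         self.model.eval()
+
+     def _encode(self, sentences: List[str]) -> torch.Tensor:
+         batch = self.tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
+         with torch.no_grad():
+             output = self.model(**batch)
+         return output[0][:, 0]  # CLS pooling, as in the model card
+
+     def __call__(self, inputs: Dict[str, Union[str, List[str]]]) -> List[float]:
+         # Assumed widget payload: {"source_sentence": "...", "sentences": ["...", ...]}
+         source = self._encode([inputs["source_sentence"]])
+         others = self._encode(inputs["sentences"])
+         return torch.nn.functional.cosine_similarity(source, others).tolist()
+ ```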

+ (If the language tag is Japanese, running `pip install transformers[ja]` behind the scenes feels like the best approach, but I could not find a repository where I could contribute that.)