Commit 9196686 (verified) by VirtualInsight, parent 6fc4506: Update README.md (+86 -3)
---
license: mit
---
# Cosine-Embed

Cosine-Embed is a PyTorch sentence embedding model trained to place similar texts close together in an embedding space. The model outputs L2-normalized vectors, so cosine similarity reduces to a dot product.

## What it produces

- Input: tokenized text (`input_ids`, `attention_mask`)
- Output: an embedding vector of size `hidden_dim`, L2-normalized
- Cosine similarity: `cos(a, b) = embedding(a) · embedding(b)`
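A quick standalone sanity check (plain tensors, no model involved) shows why normalization matters: for L2-normalized vectors, the dot product and cosine similarity coincide.

```python
import torch
import torch.nn.functional as F

a = torch.tensor([1.0, 2.0, 3.0])
b = torch.tensor([4.0, 5.0, 6.0])

# Normalize to unit length, as the model does for its outputs
a_n = F.normalize(a, dim=0)
b_n = F.normalize(b, dim=0)

dot = float((a_n * b_n).sum())
cos = float(F.cosine_similarity(a.unsqueeze(0), b.unsqueeze(0)))
```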

## Model details

- Transformer blocks (custom implementation using RMSNorm, RoPE positional encoding, and a SwiGLU feed-forward)
- Masked mean pooling over token embeddings
- Final L2 normalization
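The pooling step can be sketched as follows. This is a minimal illustration of masked mean pooling plus L2 normalization; `masked_mean_pool` is a hypothetical name, not the repository's actual function (which lives in `Architecture`):

```python
import torch
import torch.nn.functional as F

def masked_mean_pool(token_embeddings, attention_mask):
    # token_embeddings: (batch, seq_len, hidden_dim)
    # attention_mask:   (batch, seq_len), 1 for real tokens, 0 for padding
    mask = attention_mask.unsqueeze(-1).float()      # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(dim=1)    # (batch, hidden_dim)
    counts = mask.sum(dim=1).clamp(min=1e-9)         # avoid divide-by-zero on empty masks
    pooled = summed / counts                         # mean over non-padding tokens only
    return F.normalize(pooled, p=2, dim=-1)          # final L2 normalization

emb = masked_mean_pool(
    torch.randn(2, 4, 8),
    torch.tensor([[1, 1, 0, 0], [1, 1, 1, 1]]),
)
```

Averaging only over non-padding positions keeps short sentences from being diluted by pad tokens.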

## Default configuration

These parameters are used in `Notebooks/Training.ipynb`:

- `vocab_size`: 30522
- `seq_len`: 128
- `hidden_dim`: 512
- `n_heads`: 8
- `n_layer`: 3
- `ff_dim`: 2048
- `eps`: 1e-5
- `dropout`: 0.1

## Training objective

The model is trained with a triplet loss on cosine similarity:

`loss = max(0, sim(anchor, negative) - sim(anchor, positive) + margin)`
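Because the embeddings are unit-length, `sim` is just a dot product, so the objective can be sketched directly. The `margin` value below is illustrative only; the value actually used in training is set in `Notebooks/Training.ipynb`:

```python
import torch

def triplet_cosine_loss(anchor, positive, negative, margin=0.2):
    # All inputs: (batch, dim) L2-normalized embeddings
    sim_pos = (anchor * positive).sum(dim=-1)  # cosine similarity via dot product
    sim_neg = (anchor * negative).sum(dim=-1)
    # Hinge: zero loss once the positive beats the negative by at least `margin`
    return torch.clamp(sim_neg - sim_pos + margin, min=0.0).mean()

anchor = torch.tensor([[1.0, 0.0]])
positive = torch.tensor([[1.0, 0.0]])   # identical to anchor: sim = 1
negative = torch.tensor([[0.0, 1.0]])   # orthogonal to anchor: sim = 0
loss = triplet_cosine_loss(anchor, positive, negative)
```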

## Checkpoints

- `checkpoints/checkpoint.pt`: full training checkpoint (model, optimizer, losses, and configs)
- `checkpoints/model.safetensors`: weights-only export for inference

## Minimal inference

```python
import torch
from transformers import AutoTokenizer
from safetensors.torch import load_file

from Architecture import EmbeddingModel, ModelConfig

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the weights-only export
state_dict = load_file("checkpoints/model.safetensors")

cfg = ModelConfig(
    vocab_size=30522,
    seq_len=128,
    hidden_dim=512,
    n_heads=8,
    n_layer=3,
    eps=1e-5,
    ff_dim=2048,
    dropout=0.1,
)

model = EmbeddingModel(cfg).to(device)
model.load_state_dict(state_dict)
model.eval()

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def embed(texts):
    enc = tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=128,
        return_tensors="pt",
    )
    enc = {k: v.to(device) for k, v in enc.items()}
    with torch.no_grad():
        # Output embeddings are already L2-normalized
        return model(enc["input_ids"], enc["attention_mask"])

def cosine_similarity(a, b):
    ea = embed([a])[0]
    eb = embed([b])[0]
    # Dot product of unit vectors equals cosine similarity
    return float((ea * eb).sum().item())
```

## Notes

- Use the same tokenizer (`bert-base-uncased`) and the same `max_length=128` at inference that were used during training, or otherwise keep `seq_len` and preprocessing consistent between the two.