Update README.md
README.md
CHANGED
---
license: mit
---
# Cosine-Embed

Cosine-Embed is a PyTorch sentence embedding model trained to place similar texts close together in an embedding space. The model outputs L2-normalized vectors, so cosine similarity can be computed as a plain dot product.

## What it produces
- Input: tokenized text (`input_ids`, `attention_mask`)
- Output: an embedding vector of size `hidden_dim` with L2 normalization
- Cosine similarity: `cos(a, b) = embedding(a) · embedding(b)` (see the quick check below)
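
Because the outputs are unit vectors, the dot product of two embeddings is already their cosine similarity. A quick check of this contract, assuming `model`, `tokenizer`, and `device` are set up as in the "Minimal inference" section below:

```python
# Sketch: assumes `model`, `tokenizer`, and `device` from "Minimal inference" below.
enc = tokenizer(["hello world"], return_tensors="pt").to(device)
with torch.no_grad():
    emb = model(enc["input_ids"], enc["attention_mask"])  # shape: (1, hidden_dim)
print(emb.norm(dim=-1))  # ~1.0 per row: outputs are L2-normalized unit vectors
```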

## Model details
- Transformer blocks (custom implementation using RMSNorm, RoPE positional encoding, and SwiGLU feed-forward)
- Masked mean pooling over token embeddings (sketched below)
- Final L2 normalization
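
The pooling head is easy to get subtly wrong, so here is a minimal sketch of masked mean pooling followed by L2 normalization; the actual code in `Architecture` may differ in details:

```python
import torch
import torch.nn.functional as F

def pool_and_normalize(token_embeddings, attention_mask):
    # token_embeddings: (batch, seq_len, hidden_dim)
    # attention_mask:   (batch, seq_len), 1 for real tokens, 0 for padding
    mask = attention_mask.unsqueeze(-1).float()    # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(dim=1)  # sum over real tokens only
    counts = mask.sum(dim=1).clamp(min=1e-9)       # number of real tokens per row
    mean = summed / counts                         # masked mean pooling
    return F.normalize(mean, p=2, dim=-1)          # final L2 normalization
```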

## Default configuration
These parameters are used in `Notebooks/Training.ipynb`:
- `vocab_size`: 30522
- `seq_len`: 128
- `hidden_dim`: 512
- `n_heads`: 8
- `n_layer`: 3
- `ff_dim`: 2048
- `eps`: 1e-5
- `dropout`: 0.1
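
For reference, `ModelConfig` (imported from `Architecture` in the snippet below) plausibly collects these fields as a simple dataclass; this sketch is an assumption, not the actual definition:

```python
from dataclasses import dataclass

@dataclass
class ModelConfig:  # hypothetical sketch; use the real class from Architecture
    vocab_size: int = 30522
    seq_len: int = 128
    hidden_dim: int = 512
    n_heads: int = 8
    n_layer: int = 3
    ff_dim: int = 2048
    eps: float = 1e-5
    dropout: float = 0.1
```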

## Training objective
The model is trained with triplet loss on cosine similarity:

`loss = max(0, sim(anchor, negative) - sim(anchor, positive) + margin)`
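
In PyTorch this amounts to a hinge on the similarity gap. A minimal sketch, assuming the three inputs are already L2-normalized embedding batches and that `margin` is a tunable hyperparameter (the value used in `Notebooks/Training.ipynb` may differ):

```python
import torch

def triplet_cosine_loss(anchor, positive, negative, margin=0.2):
    # Inputs are L2-normalized, so row-wise dot products are cosine similarities.
    sim_pos = (anchor * positive).sum(dim=-1)  # sim(anchor, positive)
    sim_neg = (anchor * negative).sum(dim=-1)  # sim(anchor, negative)
    # Hinge: penalize unless sim_pos exceeds sim_neg by at least `margin`.
    return torch.clamp(sim_neg - sim_pos + margin, min=0.0).mean()
```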

## Checkpoints
- `checkpoints/checkpoint.pt`: training checkpoint (model, optimizer, losses, and configs)
- `checkpoints/model.safetensors`: weights-only export for inference
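
To resume training from the full checkpoint rather than the weights-only export, loading looks roughly like this; the key names (`"model"`, `"optimizer"`) are assumptions about how the notebook saved the state:

```python
import torch

ckpt = torch.load("checkpoints/checkpoint.pt", map_location="cpu")
model.load_state_dict(ckpt["model"])          # assumed key name
optimizer.load_state_dict(ckpt["optimizer"])  # assumed key; your torch.optim instance
```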

## Minimal inference
```python
import torch
from transformers import AutoTokenizer
from safetensors.torch import load_file

from Architecture import EmbeddingModel, ModelConfig

device = "cuda" if torch.cuda.is_available() else "cpu"

state_dict = load_file("checkpoints/model.safetensors")

# Must match the configuration the checkpoint was trained with.
cfg = ModelConfig(
    vocab_size=30522,
    seq_len=128,
    hidden_dim=512,
    n_heads=8,
    n_layer=3,
    eps=1e-5,
    ff_dim=2048,
    dropout=0.1,
)

model = EmbeddingModel(cfg).to(device)
model.load_state_dict(state_dict)
model.eval()

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def embed(texts):
    enc = tokenizer(
        texts,
        padding=True,
        truncation=True,
        max_length=128,
        return_tensors="pt",
    )
    enc = {k: v.to(device) for k, v in enc.items()}
    with torch.no_grad():
        return model(enc["input_ids"], enc["attention_mask"])  # already L2-normalized

def cosine_similarity(a, b):
    ea = embed([a])[0]
    eb = embed([b])[0]
    return (ea * eb).sum().item()  # dot product of unit vectors = cosine
```
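
For example (sentences are illustrative; scores depend on the trained weights):

```python
print(cosine_similarity("A cat sits on the mat.", "A kitten rests on a rug."))
print(cosine_similarity("A cat sits on the mat.", "Quarterly revenue grew 8%."))
```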

## Notes
- At inference, use the same tokenizer (`bert-base-uncased`) and the same `max_length=128` as in training, or keep `seq_len` and the preprocessing consistent if you change them.