constructai/taylor-m1
This model is a custom BERT-like encoder (hidden_size=384, 6 layers) fine-tuned on the MS MARCO passage ranking dataset.
Training uses triplets with hard negatives from sentence-transformers/msmarco-msmarco-distilbert-base-v3 (Apache-2.0 license). The underlying MS MARCO data is subject to the Microsoft Research License.
The model produces 384-dimensional embeddings (CLS token) optimized for cosine similarity.
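Since the embeddings are meant to be compared with cosine similarity, it may help to see what that comparison computes. The following is a minimal sketch in plain Python. The short vectors are made-up stand-ins for real 384-dimensional embeddings, and `cosine_similarity` is an illustrative helper, not part of the `taylor_model` package.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 4-dimensional vectors standing in for 384-dimensional embeddings.
query = [0.2, 0.1, 0.7, 0.0]
relevant = [0.4, 0.2, 1.4, 0.0]   # same direction as the query, twice the length
unrelated = [0.0, 0.9, 0.0, 0.3]  # mostly orthogonal to the query

print(cosine_similarity(query, relevant))
print(cosine_similarity(query, unrelated))
```

Cosine similarity ignores vector length, so the `relevant` vector scores as a perfect match even though it is twice as long as the query.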
Training details
Parameters: 22 417 920 (~22.4M)
Vocab size: 30 522
Max sequence length: 128 tokens
Loss: MultipleNegativesSymmetricRankingLoss (InfoNCE)
Batch size: 128 effective (32 per step with gradient accumulation)
Learning rate: 2e-5
Data: ~500k triplets from MS MARCO
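The loss named above can be illustrated with a small sketch. This is a plain-Python approximation of a symmetric InfoNCE objective over in-batch negatives, not the actual training code: the `scale` of 20 is an assumption (the card does not state the value used), the similarity matrices are made up, and explicit hard negatives are left out for brevity.

```python
import math


def cross_entropy_diagonal(scores):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(scores):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(s - row_max) for s in row))
        total += log_sum_exp - row[i]
    return total / len(scores)


def symmetric_info_nce(sim, scale=20.0):
    """sim[i][j] is the cosine similarity of query i and passage j (square matrix)."""
    scaled = [[scale * s for s in row] for row in sim]
    transposed = [list(col) for col in zip(*scaled)]
    forward = cross_entropy_diagonal(scaled)       # each query picks its passage
    backward = cross_entropy_diagonal(transposed)  # each passage picks its query
    return (forward + backward) / 2.0


# Made-up similarity matrices for a batch of three (query, passage) pairs.
well_separated = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1], [0.0, 0.1, 0.9]]
confused = [[0.1, 0.9, 0.0], [0.8, 0.2, 0.1], [0.0, 0.1, 0.2]]

print(symmetric_info_nce(well_separated))
print(symmetric_info_nce(confused))
```

The loss is low when each query is most similar to its own passage, and the reverse also holds, which is what the "symmetric" part adds over a one-directional ranking loss.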
Evaluation
On a small test set, the model achieves:
Positive pair similarity: 0.58
Negative pair similarity: 0.14
Margin (positive minus negative): 0.44
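The margin appears to be the positive-pair similarity minus the negative-pair similarity (0.58 - 0.14 = 0.44). A minimal sketch of how such figures could be computed is shown below, assuming they are averages over pairs. The similarity values are illustrative only and are not the actual test set; `similarity_margin` is a hypothetical helper.

```python
from statistics import mean


def similarity_margin(positive_sims, negative_sims):
    """Return (mean positive, mean negative, margin) for two lists of similarities."""
    pos = mean(positive_sims)
    neg = mean(negative_sims)
    return pos, neg, pos - neg


# Illustrative values only, not the real evaluation data.
positive_sims = [0.62, 0.55, 0.57]
negative_sims = [0.10, 0.18, 0.14]

pos, neg, margin = similarity_margin(positive_sims, negative_sims)
print(f"positive={pos:.2f} negative={neg:.2f} margin={margin:.2f}")
```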
Usage
Option 1: Via custom Python package (recommended)
Install the package directly from GitHub:
pip install git+https://github.com/PSYCHOxSPEED/constructai-taylor-model
Then load the model and get embeddings:
from taylor_model import load_taylor_model, embed_texts
import torch

# Use the GPU when one is available, otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

# load_taylor_model returns three values; the third is not needed here.
model, tokenizer, _ = load_taylor_model("constructai/taylor-m1", device=device)

texts = ["What is a neural network?", "How to make pizza at home?"]
embeddings = embed_texts(model, tokenizer, texts, device=device)
print(embeddings.shape)  # (2, 384)
Compute similarity between queries and documents:
from taylor_model import load_taylor_model, embed_texts
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, tokenizer, _ = load_taylor_model("constructai/taylor-m1", device=device)

queries = [
    "What is a neural network?",
    "How to make pizza at home?",
]
documents = [
    "A neural network is a computing system inspired by biological neural networks.",
    "For pizza you need dough, tomato sauce, mozzarella cheese and toppings.",
]

# Embed queries and documents separately: one 384-dimensional row per text.
q_emb = embed_texts(model, tokenizer, queries, device=device)
d_emb = embed_texts(model, tokenizer, documents, device=device)

# Row i, column j holds the dot product of query i and document j.
# This equals cosine similarity only when the embeddings are L2-normalized.
similarities = q_emb @ d_emb.T
print("Similarity matrix:")
print(similarities)
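To turn a similarity matrix like the one above into search results, each query row can be sorted by score. The sketch below is a hypothetical helper, not part of the `taylor_model` package. It assumes the matrix has been converted to nested lists (for example with `.tolist()`, which both NumPy arrays and PyTorch tensors provide), and the scores shown are made up.

```python
def rank_documents(similarities, top_k=1):
    """For each query row, return the top_k (doc_index, score) pairs, best first."""
    ranked = []
    for row in similarities:
        order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        ranked.append([(j, row[j]) for j in order[:top_k]])
    return ranked


# Made-up scores: rows are queries, columns are documents.
example = [
    [0.61, 0.08],
    [0.05, 0.57],
]

for query_index, hits in enumerate(rank_documents(example, top_k=1)):
    for doc_index, score in hits:
        print(f"query {query_index} -> document {doc_index} (score {score:.2f})")
```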
Requirements:
- Python ≥ 3.9
- transformers ≥ 4.30.0
- torch ≥ 2.0.0
- huggingface_hub ≥ 0.20.0
- numpy
Model details
I (Construct AI) built this model from scratch. I designed the architecture, trained a custom WordPiece tokenizer, pre-trained the model with MLM, and fine-tuned it on the MS MARCO passage ranking dataset for semantic search. No weights were taken from other pre-trained checkpoints.
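The parameter count reported under Training details can be cross-checked against the architecture with a short calculation. The sketch below assumes a standard BERT-style layout: a feed-forward size of 1536 (4 × hidden), two token-type embeddings, position embeddings for 128 tokens, and no pooler or MLM head. None of these values are stated in this card apart from the hidden size, layer count, vocabulary size, and maximum sequence length; they are assumptions that happen to reproduce the reported figure.

```python
def bert_like_param_count(vocab_size, hidden, layers, intermediate,
                          max_positions, type_vocab=2):
    """Count weights and biases of a BERT-style encoder without pooler or MLM head."""
    layer_norm = 2 * hidden  # scale and bias
    embeddings = (
        vocab_size * hidden       # token embeddings
        + max_positions * hidden  # position embeddings
        + type_vocab * hidden     # token-type embeddings
        + layer_norm
    )
    attention = 4 * (hidden * hidden + hidden)  # Q, K, V and output projections
    feed_forward = (hidden * intermediate + intermediate) + (intermediate * hidden + hidden)
    per_layer = attention + layer_norm + feed_forward + layer_norm
    return embeddings + layers * per_layer


total = bert_like_param_count(
    vocab_size=30522,
    hidden=384,
    layers=6,
    intermediate=1536,  # assumed: 4 x hidden
    max_positions=128,  # assumed equal to the maximum sequence length
)
print(f"{total:,}")  # 22,417,920
```

Roughly half of the parameters (11 720 448) sit in the token embedding table, which is typical for small encoders with a full-size vocabulary.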
License
This model is released under the Apache 2.0 License.
I apologize if this model does not show the best quality or if you are unhappy with its maximum sequence length.
This is my first custom model, so I tried not to do everything at once.