qwen3_glove_512

GloVe embeddings trained on English Wikipedia, where each "word" is a token id produced by the Qwen/Qwen3-Embedding-8B tokenizer.

Training

  • Corpus: jsanzolac/ga_wikipedia (English Wikipedia dump 2023-11-01)
  • Tokenizer: Qwen/Qwen3-Embedding-8B (no special tokens)
  • Implementation: stanfordnlp/GloVe
  • Vector size: 512
  • Min vocab count: 1
  • Window size: 15
  • Iterations: 15
  • x_max: 10
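
For context, x_max caps the weight that GloVe assigns to each co-occurrence count in its loss. Below is a minimal sketch of that weighting function. It assumes the exponent alpha = 0.75, which is the GloVe default; this card does not state the value actually used. The function name glove_weight is illustrative only.

```python
def glove_weight(x, x_max=10.0, alpha=0.75):
    """GloVe weighting: f(x) = (x / x_max) ** alpha for x < x_max, else 1.

    alpha=0.75 is an assumption (the GloVe default), not a value stated
    in this card.
    """
    if x < x_max:
        return (x / x_max) ** alpha
    return 1.0
```

With x_max = 10, any pair of token ids that co-occurs ten or more times receives full weight, while rarer pairs are down-weighted.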

Files

File                      Purpose
vectors.txt               GloVe text format: <qwen3_id> v1 v2 ... v512
vectors.bin               Binary format (-binary 2)
vocab.txt                 Qwen3 id and its corpus count
token_id_to_string.json   Mapping from Qwen3 id → decoded string
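
The two auxiliary files can be read with a few lines of Python. This is a sketch under two assumptions: vocab.txt holds one whitespace-separated "<qwen3_id> <count>" pair per line, and token_id_to_string.json is a flat JSON object whose keys are ids written as strings. The helper names and the toy data are hypothetical.

```python
import json

def parse_vocab(lines):
    """Parse '<qwen3_id> <count>' lines into {id: count}, skipping malformed rows."""
    counts = {}
    for line in lines:
        parts = line.split()
        if len(parts) == 2 and parts[0].isdigit() and parts[1].isdigit():
            counts[int(parts[0])] = int(parts[1])
    return counts

def parse_id_to_string(json_text):
    """Parse the id -> string mapping; JSON object keys are strings, so cast to int."""
    return {int(k): v for k, v in json.loads(json_text).items()}

# Toy stand-ins for the real files (the ids, counts, and strings are made up).
counts = parse_vocab(["11 5000", "279 4200", "<unk> 3"])
strings = parse_id_to_string('{"11": ",", "279": " the"}')
```

With the real files, pass the open file object for vocab.txt to parse_vocab and the text of token_id_to_string.json to parse_id_to_string.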

Quick start

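import numpy as np
from transformers import AutoTokenizer
from huggingface_hub import hf_hub_download

# Download the text-format vectors and load the matching tokenizer.
vec_path = hf_hub_download("jsanzolac/qwen3_glove_512", "vectors.txt")
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-Embedding-8B")

# Map each Qwen3 token id to its 512-dimensional GloVe vector.
embeddings = {}
with open(vec_path, encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().split(" ")
        if not parts[0].isdigit():
            continue  # skip non-id rows (e.g. an <unk> row, if present)
        embeddings[int(parts[0])] = np.asarray(parts[1:], dtype=np.float32)

def embed(text):
    """Return the mean GloVe vector of the text's tokens (zeros if none are known)."""
    ids = tok.encode(text, add_special_tokens=False)
    vecs = [embeddings[i] for i in ids if i in embeddings]
    return np.mean(vecs, axis=0) if vecs else np.zeros(512, dtype=np.float32)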
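
One common way to use such a table is a nearest-neighbour lookup by cosine similarity. The sketch below is illustrative rather than part of this repository: cosine and nearest are hypothetical helpers, and a toy 3-dimensional table stands in for the 512-dimensional embeddings dict so that the example runs without any downloads.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / denom) if denom > 0 else 0.0

def nearest(query_id, table, k=3):
    """Return the k ids in `table` most similar to `query_id`, best first."""
    q = table[query_id]
    scored = [(cosine(q, v), i) for i, v in table.items() if i != query_id]
    return [i for _, i in sorted(scored, reverse=True)[:k]]

# Toy table standing in for the real `embeddings` dict (values are made up).
toy = {
    1: np.array([1.0, 0.0, 0.0], dtype=np.float32),
    2: np.array([0.9, 0.1, 0.0], dtype=np.float32),
    3: np.array([0.0, 1.0, 0.0], dtype=np.float32),
}
neighbours = nearest(1, toy, k=2)
```

Passing the real embeddings dict in place of toy works the same way, although a linear scan over the full vocabulary is slow; stacking the vectors into a single matrix would be the usual optimisation.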