GloVe word embeddings trained on English Wikipedia where each "word" is a
Qwen/Qwen3-Embedding-8B token id.
- Training corpus: jsanzolac/ga_wikipedia (English Wikipedia dump 2023-11-01)
- Tokenizer: Qwen/Qwen3-Embedding-8B (no special tokens)

| File | Purpose |
|---|---|
| `vectors.txt` | GloVe text format: `<qwen3_id> v1 v2 ... v512` |
| `vectors.bin` | Binary format (`-binary 2`) |
| `vocab.txt` | Qwen3 id and its corpus count |
| `token_id_to_string.json` | Mapping from Qwen3 id → decoded string |
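The two text files can be parsed one line at a time. The following is a minimal sketch, assuming `vocab.txt` uses the usual GloVe layout of one space-separated `<id> <count>` pair per line; the helper names `parse_vector_line` and `parse_vocab_line` are illustrative and not part of the release.

```python
import numpy as np

def parse_vector_line(line):
    """Split one vectors.txt row into (token_id, vector)."""
    parts = line.rstrip().split(" ")
    return int(parts[0]), np.asarray(parts[1:], dtype=np.float32)

def parse_vocab_line(line):
    """Split one vocab.txt row into (token_id, count).

    Assumes the row looks like '<id> <count>' (an assumption, see above).
    """
    token_id, count = line.split()
    return int(token_id), int(count)

# Toy rows with 3 dimensions instead of 512, for illustration only.
token_id, vec = parse_vector_line("1820 0.1 -0.2 0.3\n")
print(token_id, vec.shape)
print(parse_vocab_line("1820 104233\n"))
```

The ids and counts in the toy rows are made up; real rows carry 512 values per id.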
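```python
import numpy as np
from transformers import AutoTokenizer
from huggingface_hub import hf_hub_download

vec_path = hf_hub_download("jsanzolac/qwen3_glove_512", "vectors.txt")
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-Embedding-8B")

# Load the GloVe text file into a {qwen3_id: vector} lookup table.
embeddings = {}
with open(vec_path, encoding="utf-8") as f:
    for line in f:
        parts = line.rstrip().split(" ")
        if not parts[0].isdigit():
            continue  # skip any non-id row, e.g. an <unk> row GloVe may append
        embeddings[int(parts[0])] = np.asarray(parts[1:], dtype=np.float32)

def embed(text):
    """Mean-pool the GloVe vectors of the text's Qwen3 token ids."""
    ids = tok.encode(text, add_special_tokens=False)
    vecs = [embeddings[i] for i in ids if i in embeddings]
    if not vecs:
        return np.zeros(512, dtype=np.float32)  # no known tokens: zero vector
    return np.mean(vecs, axis=0)
```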
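Once the vectors are loaded, a common next step is comparing them by cosine similarity. The sketch below is one possible way to do that, not something shipped with the model; `cosine` and `nearest_tokens` are hypothetical helper names, and the toy table stands in for the real 512-dimensional `embeddings` dict.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def nearest_tokens(query, table, k=5):
    """Return the k (token_id, similarity) pairs most similar to `query`."""
    scored = [(tid, cosine(query, vec)) for tid, vec in table.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Toy 2-d table standing in for the real {qwen3_id: vector} dict.
toy = {
    1: np.array([1.0, 0.0], dtype=np.float32),
    2: np.array([0.9, 0.1], dtype=np.float32),
    3: np.array([0.0, 1.0], dtype=np.float32),
}
print(nearest_tokens(toy[1], toy, k=2))
```

With the real data, something like `nearest_tokens(embed("some text"), embeddings)` should work the same way, and the returned ids can be turned back into strings with `tok.decode([token_id])` or the `token_id_to_string.json` mapping. This brute-force scan visits every row, so for repeated queries it may be worth stacking the vectors into a single matrix first.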