WordLLama - Indic

Inspired by WordLlama, this model is trained on the word embeddings of the Sarvam-1 model, which supports most Indic languages. We used a translated subset of https://huggingface.co/datasets/sentence-transformers/all-nli to train it.

The weights and tokenizer are derived from sarvam-1. For license terms, refer to https://huggingface.co/sarvamai/sarvam-1.

How to use

Install the fork of WordLlama: pip install git+https://github.com/tinisoft/WordLlama.git

Download the weights and tokenizer: git clone https://huggingface.co/tinisoft/wordllama-indic && cd wordllama-indic

From inside the cloned directory, the model can be loaded and used as follows:

from wordllama import WordLlamaInference, WordLlamaConfig
from safetensors import safe_open
import toml
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer.json")
f = safe_open("sarvam1_2b_128.safetensors", framework="pt", device="cpu")
embedding = f.get_tensor('embedding.weight').numpy()

config_file = "sarvam1_2b.toml"
config_data = toml.load(config_file)
config_data["config_name"] = "sarvam1_2b"
config = WordLlamaConfig(**config_data)

wl = WordLlamaInference(
        embedding=embedding,
        tokenizer=tokenizer,
        config=config,
        binary=False,
)

# Calculate similarity between two sentences
similarity_score = wl.similarity("I went to the car", "I went to the pawn shop")
print(similarity_score)  # Output: e.g., 0.0664

# Rank documents based on their similarity to a query
query = "I went to the car"
candidates = ["I went to the park", "I went to the shop", "I went to the truck", "I went to the vehicle"]
ranked_docs = wl.rank(query, candidates)
print(ranked_docs)


# Calculate similarity between two sentences in Tamil
similarity_score = wl.similarity("நான் கார் சென்றேன்", "நான் கடைக்கு சென்றேன்")
print(similarity_score)  # Output: e.g., 0.075

# Rank documents based on their similarity to a Tamil query
query = "நான் கார் சென்றேன்"
candidates = [
    "நான் பூங்காவிற்கு சென்றேன்", 
    "நான் கடைக்கு சென்றேன்", 
    "நான் லாரி சென்றேன்", 
    "நான் வாகனத்தில் சென்றேன்"
]
ranked_docs = wl.rank(query, candidates)
print(ranked_docs)

# Rank documents based on their similarity to a Telugu query
query = "నేను కారులో వెళ్లాను"
candidates = [
    "నేను పార్క్‌కి వెళ్లాను",
    "నేను మార్కెట్‌కి వెళ్లాను",
    "నేను లారీలో వెళ్లాను",
    "నేను వాహనంలో వెళ్లాను"
]
ranked_docs = wl.rank(query, candidates)
print(ranked_docs)
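
For intuition about what the similarity and ranking calls above compute, here is a minimal sketch. It assumes the approach described by the upstream WordLlama project, where token embeddings are average-pooled into one sentence vector and sentences are compared by cosine similarity. The 3-dimensional table, the whitespace tokenization, and the names TOY_EMBEDDINGS, embed, cosine and rank are all hypothetical and exist only for illustration; they are not part of the wordllama API, and the real model uses the Sarvam-1 tokenizer and the trained embedding matrix loaded above.

```python
import math

# Made-up 3-dimensional vectors, for illustration only.
TOY_EMBEDDINGS = {
    "i": [0.1, 0.3, 0.0],
    "went": [0.2, 0.1, 0.4],
    "to": [0.0, 0.2, 0.1],
    "the": [0.1, 0.0, 0.1],
    "car": [0.9, 0.1, 0.2],
    "truck": [0.8, 0.2, 0.3],
    "park": [0.1, 0.9, 0.2],
}


def embed(sentence):
    """Average the vectors of the known tokens in a whitespace-split sentence."""
    vectors = [
        TOY_EMBEDDINGS[token]
        for token in sentence.lower().split()
        if token in TOY_EMBEDDINGS
    ]
    if not vectors:
        raise ValueError("no known tokens in sentence")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank(query, candidates):
    """Return (candidate, score) pairs sorted from most to least similar."""
    query_vector = embed(query)
    scored = [(c, cosine(query_vector, embed(c))) for c in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranked = rank("I went to the car", ["I went to the park", "I went to the truck"])
print(ranked)
```

With these toy vectors the "truck" sentence ranks above the "park" sentence, because its pooled vector points in nearly the same direction as the query's. Scores from the real model will differ, since they depend on the trained embeddings.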