---
language:
- en
- ta
- ml
- as
- bn
- gu
- hi
- kn
- mr
- or
- te
base_model:
- sarvamai/sarvam-1
pipeline_tag: sentence-similarity
---

# WordLLama - Indic

Inspired by WordLlama, this model is trained on the word embeddings of the Sarvam-1 model, which supports most Indic languages.

We trained it on a translated subset of https://huggingface.co/datasets/sentence-transformers/all-nli.

The weights and tokenizer are derived from sarvam-1. For license terms, refer to https://huggingface.co/sarvamai/sarvam-1.

## How to use

Install the fork of WordLlama:

```bash
pip install git+https://github.com/tinisoft/WordLlama.git
```

Download the weights and tokenizer:

```bash
git clone https://huggingface.co/tinisoft/wordllama-indic && cd wordllama-indic
```

The model can then be used like this:

```python
from wordllama import WordLlamaInference, WordLlamaConfig
from safetensors import safe_open
import toml
from tokenizers import Tokenizer

# Load the tokenizer shipped with the model repository
tokenizer = Tokenizer.from_file("tokenizer.json")

# Load the embedding matrix (framework="pt" requires PyTorch to be installed)
f = safe_open("sarvam1_2b_128.safetensors", framework="pt", device="cpu")
embedding = f.get_tensor('embedding.weight').numpy()

# Build the model configuration from the TOML file
config_file = "sarvam1_2b.toml"
config_data = toml.load(config_file)
config_data["config_name"] = "sarvam1_2b"
config = WordLlamaConfig(**config_data)

wl = WordLlamaInference(
    embedding=embedding,
    tokenizer=tokenizer,
    config=config,
    binary=False,
)

# Calculate similarity between two sentences
similarity_score = wl.similarity("I went to the car", "I went to the pawn shop")
print(similarity_score)  # Output: e.g., 0.0664

# Rank documents based on their similarity to a query
query = "I went to the car"
candidates = ["I went to the park", "I went to the shop", "I went to the truck", "I went to the vehicle"]
ranked_docs = wl.rank(query, candidates)
print(ranked_docs)

# Calculate similarity between two sentences in Tamil
similarity_score = wl.similarity("நான் கார் சென்றேன்", "நான் கடைக்கு சென்றேன்")
print(similarity_score)  # Output: e.g., 0.075

# Rank documents based on their similarity to a Tamil query
query = "நான் கார் சென்றேன்"
candidates = [
    "நான் பூங்காவிற்கு சென்றேன்",
    "நான் கடைக்கு சென்றேன்",
    "நான் லாரி சென்றேன்",
    "நான் வாகனத்தில் சென்றேன்"
]
ranked_docs = wl.rank(query, candidates)
print(ranked_docs)

# Rank documents based on their similarity to a Telugu query
query = "నేను కారులో వెళ్లాను"
candidates = [
    "నేను పార్క్‌కి వెళ్లాను",
    "నేను మార్కెట్‌కి వెళ్లాను",
    "నేను లారీలో వెళ్లాను",
    "నేను వాహనంలో వెళ్లాను"
]
ranked_docs = wl.rank(query, candidates)
print(ranked_docs)
```
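
For intuition about what the `similarity` and `rank` calls compute, here is a minimal pure-Python sketch. It assumes the fork behaves like upstream WordLlama in its non-binary mode: token embeddings are mean-pooled into one sentence vector and sentences are compared by cosine similarity. The helper names (`mean_pool`, `cosine_similarity`, `rank_by_similarity`) and the toy 3-dimensional vectors are hypothetical and are not part of the library API or real model output.

```python
import math


def mean_pool(token_vectors):
    """Average equal-length token vectors into a single sentence vector."""
    dim = len(token_vectors[0])
    count = len(token_vectors)
    return [sum(vec[i] for vec in token_vectors) / count for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_vec, candidates):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, cosine_similarity(query_vec, vec)) for name, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy token vectors, made up for illustration only
query_vec = mean_pool([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
candidates = {
    "close": mean_pool([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]),
    "far": mean_pool([[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]),
}

ranked = rank_by_similarity(query_vec, candidates)
print(ranked)
```

In this toy setup the candidate built from the same token vectors as the query ranks first, and the candidate with no overlapping dimensions scores zero. The real model works the same way in spirit, but over the learned embedding matrix loaded from the safetensors file.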