What pipeline are you using to show all probabilities for one text?

#2
by AMKimia - opened

I'm trying to find out all possible languages for a mixed-language text (just as in your example), but I'm not able to show more than one of them, the most probable.

If you'd like to see the probabilities for all possible languages, you might need to manipulate the raw output logits from the model.

# Install packages
!pip install transformers --quiet

# Import libraries
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("ERCDiDip/langdetect")
model = AutoModelForSequenceClassification.from_pretrained("ERCDiDip/langdetect")

def classify_with_probabilities(text):
    # Tokenize and run the model to get the output logits
    inputs = tokenizer(text, return_tensors='pt', truncation=True, max_length=512)
    with torch.no_grad():  # inference only, no gradients needed
        outputs = model(**inputs)
    logits = outputs.logits
    # Convert logits to probabilities
    probs = F.softmax(logits, dim=1).squeeze().tolist()
    # Pair each probability with its label and sort, most probable first
    label_prob_pairs = list(zip(model.config.id2label.values(), probs))
    label_prob_pairs.sort(key=lambda x: x[1], reverse=True)
    return label_prob_pairs

# Use the function
result = classify_with_probabilities("clemens etc dilecto filio scolastico ecclesie wetflari ensi treveren dioc salutem etc significarunt nobis dilecti filii commendator et fratres hospitalis beate marie theotonicorum")
print(result)
Out: [('la', 0.9999949932098389), ('mhd', 2.3282084384845803e-06), ('fnhd', 4.401515241170273e-07), ('mt', 1.912285796379365e-07), ('it', 1.8400885437586112e-07), ('gml', 1.6747328857036337e-07), ('sq', 9.396003264328101e-08), ('sl', 9.104244469426703e-08), ('ar', 8.333452683473297e-08), ('es', 7.961761383512567e-08), ('tr', 7.74629711486341e-08), ('fr', 7.255131606598297e-08), ('bg', 7.096467413703067e-08), ('de', 6.920742379179501e-08), ('pt', 6.821716880267559e-08), ('sk', 6.656565432194839e-08), ('he', 6.295641696851817e-08), ('ru', 5.417512127792179e-08), ('da', 5.0310834609490485e-08), ('fro', 5.0152941355463554e-08), ('sv', 4.9287699255273765e-08), ('se', 4.7595925423138397e-08), ('en', 4.286414423404494e-08), ('lv', 3.890071198497935e-08), ('uk', 3.8869192309221035e-08), ('lt', 3.5666676723167257e-08), ('ro', 3.513213542305493e-08), ('hr', 3.3360986861907804e-08), ('ca', 3.2240908609537655e-08), ('no', 3.004357651548162e-08), ('et', 2.9618950847520864e-08), ('pl', 2.9189688888209275e-08), ('grc', 2.3447864094805482e-08), ('fi', 2.2075809624766407e-08), ('el', 2.155294964722998e-08), ('hu', 1.8249098232558936e-08), ('cs', 1.665422466601285e-08), ('chu', 1.46433540848534e-08), ('nl', 1.0680254902695197e-08), ('eu', 7.244770117154076e-09), ('zh', 5.538413283545651e-09)]
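A side note: depending on your transformers version, the text-classification pipeline can also return the full score list directly, so you may not need to touch the logits at all. A minimal sketch, assuming a version that supports the top_k argument (older releases used return_all_scores=True instead):

from transformers import pipeline

classificator = pipeline("text-classification", model="ERCDiDip/langdetect", top_k=None)
# One {'label': ..., 'score': ...} entry per language, highest first;
# the exact nesting of the output can vary between transformers versions
all_scores = classificator("clemens etc dilecto filio scolastico ecclesie wetflari ensi treveren dioc salutem etc")
print(all_scores)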

ERCDiDip changed discussion status to closed

Thanks so much, I will do some testing and let you know!

You're very welcome :) But a sliding window would probably be a better choice. The following code breaks the input text into overlapping segments (sliding windows) and uses our model to detect the language of each segment. It then counts and returns the N (n_most_common=2) most frequently detected languages across these segments.

!pip install transformers --quiet

from collections import Counter
from transformers import pipeline

classificator = pipeline("text-classification", model="ERCDiDip/langdetect")

def sliding_window(text, window_size=30, stride=15):
    # Character windows of window_size characters, shifted by stride
    return [text[i:i+window_size] for i in range(0, len(text) - window_size + 1, stride)]

def detect_languages(text, n_most_common=2):
    windows = sliding_window(text)
    detected_languages = []
    for window in windows:
        result = classificator(window)
        detected_languages.append(result[0]['label'])
    # Count how often each language was detected
    language_counts = Counter(detected_languages)
    # Get the N most common languages
    common_languages = language_counts.most_common(n_most_common)
    return common_languages

result = detect_languages("clemens etc dilecto filio scolastico ecclesie wetflari ensi treveren dioc salutem etc significarunt nobis dilecti filii commendator et fratres hospitalis beate marie theotonicorum. Anerkennung des Fürstenstandes der Gräfin Gertrude von Schaumburg und ihrer 9 Kinder aus der Ehe mit dem Kurfürsten und Landgrafen Friedrich Wilhelm von Hessen unter dem Titel Fürstinnen und Fürsten von Hanau im Kaiserreich Österreich unter Franz Joseph I. ")
print(result)
Out: [('la', 9), ('de', 9)]
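If you'd rather see proportions than raw counts, a small variant (just a sketch built on the sliding_window and classificator definitions above, not part of the original snippet) batches all windows through the pipeline in one call and normalizes the counts:

from collections import Counter

def detect_language_shares(text, window_size=30, stride=15):
    # Hypothetical helper: fraction of windows attributed to each language
    windows = sliding_window(text, window_size, stride)
    # Pipelines accept a list of strings, which avoids a Python-level loop
    results = classificator(windows)
    # Depending on your transformers version you may need r[0]['label'] here
    counts = Counter(r['label'] for r in results)
    total = sum(counts.values())
    return {lang: n / total for lang, n in counts.most_common()}

On the mixed Latin/German example above, 'la' and 'de' would come out with equal shares, matching the counts of 9 and 9.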

All the best!

ERCDiDip pinned discussion

Let us give it a try and we will let you know!
