---
pipeline_tag: text-classification
tags:
- sentence-transformers
- transformers
- SetFit
- News
---

# IPTC topic classifier (multilingual)

A SetFit model fit on 166 downsampled multilingual IPTC Subject labels (the lowest hierarchy level is concatenated into artificial sentences of keywords) to predict the mid-level news categories. The purpose of this classifier is to support exploring corpora as a weak labeler, since the representations of these descriptions are only approximations of real documents from those topics.

Accuracy on highest-level labels in eval: 0.9779412

Accuracy/F1/MCC on mid-level labels in eval: 0.6992481/0.6666667/0.6992617

More interestingly, I used the Kaggle dataset of Huffington Post headlines and manually selected 15 overlapping high-level categories to evaluate the performance.

https://www.kaggle.com/datasets/rmisra/news-category-dataset

While an MCC of 0.1968043 on this dataset does not sound as good as the eval results above, the mistakes can usually also be seen as a re-interpretation. For example, news on arrests that were categorized as entertainment in the Huffington Post dataset were put into the crime category by the classifier. My current impression is that this system is useful for its intended purpose.

The numeric categories can be joined with the labels by using this table:

https://huggingface.co/datasets/KnutJaegersberg/IPTC-topic-classifier-labels

It looks like the Hugging Face inference widget to the right does not yet handle SetFit models; I can't do anything about that.

Use it like any other SetFit model:

    from setfit import SetFitModel

    # Download the model from the Hub
    model = SetFitModel.from_pretrained("KnutJaegersberg/IPTC-classifier-ml")

    # Run inference on a list of headlines
    preds = model([
        "Rachel Dolezal Faces Felony Charges For Welfare Fraud",
        "Elon Musk just got lucky",
        "The hype on AI is different from the hype on other tech topics",
    ])
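Joining the numeric predictions with the label table could look roughly like the sketch below. It assumes the table has already been loaded into a list of rows, and the column names `id` and `label` are hypothetical, so check the actual column names in the labels dataset. The rows shown are placeholders for illustration, not the real table contents.

```python
def join_labels(pred_ids, label_rows, id_key="id", name_key="label"):
    """Map numeric class ids to label names; ids missing from the table map to None.

    id_key and name_key are assumed column names, not confirmed by the dataset.
    """
    lookup = {int(row[id_key]): row[name_key] for row in label_rows}
    # int() also accepts NumPy integers and zero-dimensional tensors
    return [lookup.get(int(p)) for p in pred_ids]


# Placeholder rows for illustration only; not the real label table.
label_rows = [
    {"id": 0, "label": "placeholder category A"},
    {"id": 1, "label": "placeholder category B"},
    {"id": 2, "label": "placeholder category C"},
]

# Stand-in for the numeric predictions returned by the model.
pred_ids = [2, 0, 2]

names = join_labels(pred_ids, label_rows)
print(names)
```

Returning `None` for unknown ids, rather than raising, keeps a mismatch between the model output and the table visible without stopping a corpus exploration run.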