## Acknowledgement

Google supported this work by providing Google Cloud credit. Thank you, Google, for supporting open source! 🎉

## What is this?

This model is a finetuned version of [dbmdz/bert-base-turkish-cased](https://huggingface.co/dbmdz/bert-base-turkish-cased) intended for zero-shot tasks in Turkish. It was finetuned on an NLI task with `sentence-transformers` and uses the `mean` of the token embeddings as the aggregation function. I also converted it to TensorFlow, with the aggregation function rewritten in TF, so that it can be used in [my `ai-aas` repo on GitHub](https://github.com/monatis/ai-aas) for production-grade deployment. A simple usage example follows.

## Usage

```python
import time

import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModel

texts = ["Galatasaray, bu akşamki maçın ardından şampiyonluğunu ilan etmeye hazırlanıyor."]
labels = ["spor", "siyaset", "kültür"]

model_name = 'mys/bert-base-turkish-cased-nli-mean'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = TFAutoModel.from_pretrained(model_name)

def label_text(model, tokenizer, texts, labels):
    texts_length = len(texts)
    tokens = tokenizer(texts + labels, padding=True, return_tensors='tf')
    embs = model(**tokens)[0]  # per-token hidden states

    # Mean-pool the token embeddings, ignoring padding tokens.
    attention_masks = tf.cast(tokens['attention_mask'], tf.float32)
    sample_length = tf.reduce_sum(attention_masks, axis=-1, keepdims=True)
    masked_embs = embs * tf.expand_dims(attention_masks, axis=-1)
    masked_embs = tf.reduce_sum(masked_embs, axis=1) / tf.cast(sample_length, tf.float32)

    # Inner product between text embeddings and label embeddings, then softmax.
    dists = tf.experimental.numpy.inner(masked_embs[:texts_length], masked_embs[texts_length:])
    scores = tf.nn.softmax(dists)
    results = list(zip(labels, scores.numpy().squeeze().tolist()))
    sorted_results = sorted(results, key=lambda x: x[1], reverse=True)
    sorted_results = [{"label": label, "score": f"{score:.4f}"} for label, score in sorted_results]
    return sorted_results

start = time.time()
sorted_results = label_text(model, tokenizer, texts, labels)
elapsed = time.time() - start

print(sorted_results)
print(f"Processed in {elapsed:.2f} secs")
```

Output:

```shell
[{'label': 'spor', 'score': '1.0000'}, {'label': 'siyaset', 'score': '0.0000'}, {'label': 'kültür', 'score': '0.0000'}]
Processed in 0.22 secs
```

## How it works

The `label_text()` function runs the BERT model on the concatenation of `texts` and `labels`, and aggregates the per-token hidden states output by the model into a single vector per sequence via mean pooling. The inner product of the text embeddings and the label embeddings is then used as the similarity metric, and `softmax` converts these similarity values into probabilities.

## Dataset

>[Emrah Budur](https://scholar.google.com/citations?user=zSNd03UAAAAJ), [Rıza Özçelik](https://www.cmpe.boun.edu.tr/~riza.ozcelik), [Tunga Güngör](https://www.cmpe.boun.edu.tr/~gungort/) and [Christopher Potts](https://web.stanford.edu/~cgpotts). 2020. Data and Representation for Turkish Natural Language Inference. To appear in Proceedings of EMNLP. [[pdf]](https://arxiv.org/abs/2004.14963) [[bib]](https://tabilab.cmpe.boun.edu.tr/datasets/nli_datasets/nli-tr.bib)

```
@inproceedings{budur-etal-2020-data,
  title = "Data and Representation for Turkish Natural Language Inference",
  author = "Budur, Emrah and \"{O}z\c{c}elik, R{\i}za and G\"{u}ng\"{o}r, Tunga",
  booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing",
  month = nov,
  year = "2020",
  address = "Online",
  publisher = "Association for Computational Linguistics"
}
```
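
## Alternative usage with `sentence-transformers` (sketch)

Since the model was finetuned with `sentence-transformers` using mean pooling, the same embeddings can in principle be computed through that library as well. The sketch below is an illustrative assumption rather than an officially supported path: it relies on `SentenceTransformer` falling back to a Transformer + mean-pooling module when the checkpoint ships no `sentence-transformers` config, which has not been verified for this model.

```python
# Hedged sketch: zero-shot labeling via sentence-transformers mean pooling.
# Assumption: SentenceTransformer(model_name) builds a Transformer + mean-pooling
# model for a plain transformers checkpoint; verify with your installed version.
import numpy as np
from sentence_transformers import SentenceTransformer

model_name = 'mys/bert-base-turkish-cased-nli-mean'
st_model = SentenceTransformer(model_name)

texts = ["Galatasaray, bu akşamki maçın ardından şampiyonluğunu ilan etmeye hazırlanıyor."]
labels = ["spor", "siyaset", "kültür"]

# Encode texts and labels separately, then score by inner product + softmax,
# mirroring the label_text() function above.
text_embs = st_model.encode(texts)
label_embs = st_model.encode(labels)

dists = np.inner(text_embs, label_embs)  # shape: (num_texts, num_labels)
scores = np.exp(dists - dists.max(axis=-1, keepdims=True))
scores /= scores.sum(axis=-1, keepdims=True)  # softmax over labels

for text, row in zip(texts, scores):
    ranked = sorted(zip(labels, row.tolist()), key=lambda x: x[1], reverse=True)
    print(text, ranked)
```

If the scores differ noticeably from the TensorFlow example above, the fallback assumption probably does not hold for this checkpoint, and the explicit mean-pooling code in the Usage section should be preferred.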