
Overview

This model is designed primarily for language understanding between Chinese texts, and between Chinese and 9 non-Chinese languages. It was produced by fine-tuning bert-base-multilingual-cased with the CoSENT training framework from text2vec.
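As a rough illustration of what a CoSENT-style objective optimizes, the sketch below computes a ranking loss over sentence-pair cosine similarities: every pair with a higher label should score above every pair with a lower label. This is a minimal pure-Python sketch of the loss as it is commonly described, not the training code used for this model. The function name `cosent_loss` and the default `scale=20.0` are assumptions made for illustration.

```python
import math

def cosent_loss(cos_sims, labels, scale=20.0):
    """Sketch of a CoSENT-style ranking loss (illustrative, not this model's training code).

    cos_sims[i] is the cosine similarity predicted for sentence pair i,
    labels[i] is its gold similarity label (higher = more similar).
    `scale` is an assumed temperature-like constant.
    """
    # One term per ordered pair (i, j) where pair i should rank above pair j.
    terms = [
        scale * (cos_sims[j] - cos_sims[i])
        for i in range(len(labels))
        for j in range(len(labels))
        if labels[i] > labels[j]
    ]
    # log(1 + sum(exp(t))) computed stably: the extra 0.0 term supplies the "1".
    terms.append(0.0)
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))

# A similar pair (label 1) scored above a dissimilar pair (label 0): low loss.
well_ordered = cosent_loss([0.9, 0.1], [1, 0])
# The same labels with the scores swapped: high loss.
mis_ordered = cosent_loss([0.1, 0.9], [1, 0])
```

The loss depends only on the relative order of the cosine scores, not on their absolute values, which is why such an objective suits similarity ranking.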

Download the model

from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("Mike0307/text2vec-base-chinese-crosslingual")
model = AutoModel.from_pretrained("Mike0307/text2vec-base-chinese-crosslingual")

Example of similarity comparison

import torch

def mean_pooling(model_output, attention_mask):
    # The first element of the model output holds the per-token embeddings.
    token_embeddings = model_output[0]
    # Expand the mask to the embedding shape so padding tokens contribute zero.
    input_mask_expanded = (
        attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    )
    # Average over real tokens only; clamp avoids division by zero.
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(
        input_mask_expanded.sum(1), min=1e-9
    )

sentences = ["五隻小猴子在床上跳著", "Five little monkeys are jumping on the bed"]

encode_output = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt", max_length=512)
# Inference only, so gradient tracking is not needed.
with torch.no_grad():
    model_output = model(**encode_output)
embeddings = mean_pooling(model_output, encode_output['attention_mask'])

torch.cosine_similarity(embeddings[0], embeddings[1], dim=0)
# tensor(0.9518)
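To see why the attention mask matters in the pooling step above, here is a small worked example on toy numbers. It is a pure-Python sketch that runs without the model. The helper names `masked_mean` and `cosine` and all of the vectors are made up for illustration, and the logic mirrors the tensor version only in spirit.

```python
import math

def masked_mean(token_vectors, mask):
    """Average token vectors, counting only positions where mask == 1."""
    dim = len(token_vectors[0])
    total = [0.0] * dim
    for vec, m in zip(token_vectors, mask):
        for d in range(dim):
            total[d] += vec[d] * m
    # Guard against an all-zero mask, like torch.clamp(..., min=1e-9).
    count = max(sum(mask), 1e-9)
    return [t / count for t in total]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Sentence A: two real tokens followed by one padding token (mask 0).
tokens_a = [[1.0, 2.0], [3.0, 4.0], [100.0, -100.0]]
mask_a = [1, 1, 0]
# Sentence B: a single real token, no padding.
tokens_b = [[4.0, 6.0]]
mask_b = [1]

pooled_a = masked_mean(tokens_a, mask_a)      # padding ignored
pooled_b = masked_mean(tokens_b, mask_b)
unmasked_a = masked_mean(tokens_a, [1, 1, 1])  # padding wrongly included

similarity = cosine(pooled_a, pooled_b)
```

With the mask applied, the padding vector has no effect on `pooled_a`, and the two toy sentences point in the same direction. Without the mask, the padding token dominates the average and distorts the similarity.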

ISO-639 code    Supported language
zh-TW           Chinese (Traditional)
zh-CN           Chinese (Simplified)
nl              Dutch
en              English
fr              French
de              German
it              Italian
pl              Polish
pt              Portuguese (Portugal, Brazil)
ru              Russian
es              Spanish