Overview
This model is primarily designed for semantic similarity between Chinese texts,
as well as between Chinese texts and texts in 9 non-Chinese languages.
It uses the CoSENT training framework from text2vec
to fine-tune bert-base-multilingual-cased for this purpose.
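For readers unfamiliar with CoSENT, the following is a minimal pure-Python sketch of a CoSENT-style ranking loss. It is an illustration only, not the training code used for this model: the function name `cosent_loss` is hypothetical, and the scale of 20 is an assumed, commonly used value that this model card does not state. The idea is that whenever one sentence pair is labelled more similar than another, its predicted cosine similarity should also be higher.

```python
import math

def cosent_loss(cosines, labels, scale=20.0):
    """CoSENT-style loss: log(1 + sum(exp(scale * (cos_j - cos_i))))
    over all pairs (i, j) where pair i is labelled more similar than pair j.

    cosines[i] is the predicted cosine similarity of sentence pair i,
    labels[i] is its gold similarity score. `scale` is an assumed value.
    """
    terms = [0.0]  # exp(0) = 1 supplies the "1 +" inside the logarithm
    for cos_i, y_i in zip(cosines, labels):
        for cos_j, y_j in zip(cosines, labels):
            if y_i > y_j:
                # Penalise pair j scoring higher than the more similar pair i.
                terms.append(scale * (cos_j - cos_i))
    # Numerically stable log-sum-exp over the collected terms.
    m = max(terms)
    return m + math.log(sum(math.exp(t - m) for t in terms))

# Correctly ordered predictions give a loss near zero; swapped ones are penalised.
well_ordered = cosent_loss([0.9, 0.1], [1, 0])
mis_ordered = cosent_loss([0.1, 0.9], [1, 0])
```

Because the loss only depends on the relative order of the cosine values, it optimises directly for the ranking that the cosine-similarity comparison relies on at inference time.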
Download the model
from transformers import AutoTokenizer, AutoModel
tokenizer = AutoTokenizer.from_pretrained("Mike0307/text2vec-base-chinese-crosslingual")
model = AutoModel.from_pretrained("Mike0307/text2vec-base-chinese-crosslingual")
Example of similarity comparison
import torch

def mean_pooling(model_output, attention_mask):
    # The first element of model_output holds the per-token embeddings
    token_embeddings = model_output[0]
    # Expand the mask to the embedding shape so padding tokens contribute zero
    input_mask_expanded = (
        attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    )
    # Sum the unmasked token embeddings and divide by the number of real tokens
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(
        input_mask_expanded.sum(1), min=1e-9
    )

sentences = ["五隻小猴子在床上跳著", "Five little monkeys are jumping on the bed"]
encode_output = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt", max_length=512)
# Gradients are not needed for inference
with torch.no_grad():
    model_output = model(**encode_output)
embeddings = mean_pooling(model_output, encode_output['attention_mask'])
torch.cosine_similarity(embeddings[0], embeddings[1], dim=0)
# tensor(0.9518)
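To make the pooling step easier to follow without downloading the model, here is a small pure-Python sketch of the same idea on toy vectors. The helper names `mean_pool` and `cosine` and the toy numbers are made up for illustration; they are not part of the model or of any library. It shows that a padding position (mask 0) does not change the pooled sentence vector.

```python
import math

def mean_pool(token_vectors, attention_mask):
    """Average only the token vectors whose mask entry is 1."""
    dim = len(token_vectors[0])
    summed = [0.0] * dim
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            for k in range(dim):
                summed[k] += vec[k]
    # Guard against division by zero, mirroring the clamp in the torch version
    count = max(sum(attention_mask), 1e-9)
    return [s / count for s in summed]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Two real tokens followed by one padding token (mask 0)
padded = mean_pool([[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]], [1, 1, 0])
# The same two tokens without any padding
unpadded = mean_pool([[1.0, 2.0], [3.0, 4.0]], [1, 1])
similarity = cosine(padded, unpadded)
```

Both calls produce the same pooled vector, which is why sentences of different lengths can be batched with padding and still be compared fairly.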
ISO-639 code | Supported Language |
---|---|
zh-TW | Chinese (Traditional) |
zh-CN | Chinese (Simplified) |
nl | Dutch |
en | English |
fr | French |
de | German |
it | Italian |
pl | Polish |
pt | Portuguese (Portugal, Brazil) |
ru | Russian |
es | Spanish |