Request for an embedding model in Traditional Chinese

#1
by phamvantoan - opened

Hi,

Thank you very much for your contribution on Traditional Chinese LLM.

However, could you consider contributing an embedding model like "text2vec-large-chinese" or "bge-large-zh-v1.5" for Traditional Chinese?

I would greatly appreciate it!

Thanks for your kind words.

I did consider an embedding model, given that most of our pretraining corpus has structural information (e.g., title/context pairs) that is well suited for training retrieval/embedding models. I also used bge-zh on twllm.com for reranking search results, and it works quite well.
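For reference, that kind of reranking takes only a few lines. A minimal sketch, assuming the sentence-transformers library and the BAAI/bge-large-zh-v1.5 checkpoint; the query and passages are made-up examples:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("BAAI/bge-large-zh-v1.5")

query = "台北有什麼好吃的小吃？"  # "What are some good snacks in Taipei?"
passages = [
    "台北夜市以蚵仔煎和滷肉飯聞名。",    # night-market food (relevant)
    "今天的天氣預報顯示午後有雷陣雨。",  # weather report (irrelevant)
]

# The bge model card recommends prefixing retrieval queries with this instruction.
q_emb = model.encode("为这个句子生成表示以用于检索相关文章：" + query, normalize_embeddings=True)
p_embs = model.encode(passages, normalize_embeddings=True)

# Rerank passages by cosine similarity to the query.
scores = util.cos_sim(q_emb, p_embs)[0]
for passage, score in sorted(zip(passages, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.3f}  {passage}")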

The current blockers are my limited bandwidth and my low expectation of the impact.

If the open-source community could contribute some cases where current embedding models, including the OpenAI API, "text2vec-large-chinese", or "bge-large-zh-v1.5", fail in our language, I would have a better estimate of how I can contribute.

Hi,

Thank you for the explanation. I will run some tests with your model.

However, I just want to share one thing that I learned from the authors of "bge-large-zh-v1.5": their embedding model was trained only on Simplified Chinese, not Traditional Chinese (https://huggingface.co/BAAI/bge-large-zh-v1.5/discussions/3#654b7ac2c5fd9382862542d4).

Therefore, I suspect "text2vec-large-chinese" does not support Traditional Chinese either.

Moreover, the OpenAI embedding models are a different matter, because my goal is to run everything locally on my PC.
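For instance, one quick way to produce the kind of failure case requested above is to embed the same sentence in Traditional and Simplified script and compare similarities. A minimal sketch, again assuming sentence-transformers and bge-large-zh-v1.5, with made-up example sentences; if the Traditional/Simplified pair scores barely above the unrelated pair, that is evidence of weak Traditional Chinese support:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("BAAI/bge-large-zh-v1.5")

traditional = "這個模型對繁體中文的支援如何？"  # Traditional script
simplified = "这个模型对繁体中文的支持如何？"   # same sentence, Simplified script
unrelated = "明天的足球比賽幾點開始？"          # unrelated sentence

embs = model.encode([traditional, simplified, unrelated], normalize_embeddings=True)
print("traditional vs simplified:", util.cos_sim(embs[0], embs[1]).item())
print("traditional vs unrelated: ", util.cos_sim(embs[0], embs[2]).item())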

It seems like good motivation for training Traditional Mandarin embedding models.

Hi,

Absolutely, yes. It is still a fairly new idea for now.

However, I believe many engineering labs in Taiwan will pay attention to this model, because it is really needed.

One more thing,

Is there any required prompt template for Taiwan-LLM-13B-v2.0-chat?

In my application, can I use the template as follows:

template = """You are a chatbot having a conversation with a human.

Given the following extracted parts of a long document and a question, create a final answer.

{context}

{chat_history}
Human: {human_input}
Chatbot:"""

from transformers import AutoTokenizer

# The tokenizer ships with the model's chat template, so apply_chat_template
# renders the turns into the exact prompt format the model was trained on.
tokenizer = AutoTokenizer.from_pretrained("yentinglin/Taiwan-LLM-7B-v2.0.1-chat")

chat = [
  # {"role": "system", "content": "你講中文"},  # optional system turn ("speak Chinese")
  {"role": "user", "content": "Hello, how are you?"},
  {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
  {"role": "user", "content": "I'd like to show off how chat templating works!"},
]

# add_generation_prompt=True appends the assistant prefix so the model
# continues with a fresh reply.
prompt_for_generation = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

Try this; it should work for all my models.
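If it helps, here is a sketch of running generation on the rendered prompt. It assumes transformers plus accelerate and enough GPU memory, and the sampling settings are arbitrary choices, not official recommendations:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "yentinglin/Taiwan-LLM-7B-v2.0.1-chat"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" requires the accelerate package.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "你好，請自我介紹。"}],  # "Hello, please introduce yourself."
    tokenize=False, add_generation_prompt=True)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens after the prompt.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))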

Regarding the embedding model, it would be great if industry labs could sponsor me :)

Hi,

However, in my case I also need "context" and "chat_history", which hold my local data and the conversation history, respectively.

Thus, could you please tell me how I should put "context" and "chat_history" into your prompt template?

Thank you for your help!

Your template looks fine.
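For example, one way to fold them in is to put the instructions and the retrieved context into the system turn, then replay the history as alternating user/assistant turns. A minimal sketch, assuming chat_history is a list of (human, assistant) string pairs; build_prompt is a made-up helper name:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("yentinglin/Taiwan-LLM-7B-v2.0.1-chat")

def build_prompt(context, chat_history, human_input):
    # Put the instructions and retrieved context in the system turn.
    chat = [{"role": "system", "content":
             "You are a chatbot having a conversation with a human. "
             "Given the following extracted parts of a long document and a "
             "question, create a final answer.\n\n" + context}]
    # Replay the previous turns as alternating user/assistant messages.
    for human, assistant in chat_history:
        chat.append({"role": "user", "content": human})
        chat.append({"role": "assistant", "content": assistant})
    # End with the new question so the model generates the next reply.
    chat.append({"role": "user", "content": human_input})
    return tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)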

Thank you for your answer!
