Request for an embedding model in Traditional Chinese

#1
by phamvantoan - opened

Hi,

Thank you very much for your contribution on Traditional Chinese LLM.

However, could you consider contributing an embedding model like "text2vec-large-chinese" or "bge-large-zh-v1.5" for Traditional Chinese?

I would greatly appreciate it!

Thanks for your kind words.

I did consider an embedding model, given that most of our pretraining corpus has structural information (e.g., title/context pairs) that is well suited for training retrieval/embedding models. I also used bge-zh on twllm.com for reranking search results, and it works quite well.
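For reference, that kind of reranking takes only a few lines. A minimal sketch, assuming the sentence-transformers library and the BAAI/bge-large-zh-v1.5 checkpoint; the query and passages are made-up examples:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("BAAI/bge-large-zh-v1.5")

query = "台北有什麼好吃的小吃？"  # "What are some good snacks in Taipei?"
passages = [
    "台北夜市以蚵仔煎和滷肉飯聞名。",    # night-market food (relevant)
    "今天的天氣預報顯示午後有雷陣雨。",  # weather report (irrelevant)
]

# The bge model card recommends prefixing retrieval queries with this instruction.
q_emb = model.encode("为这个句子生成表示以用于检索相关文章：" + query, normalize_embeddings=True)
p_embs = model.encode(passages, normalize_embeddings=True)

# Rerank passages by cosine similarity to the query.
scores = util.cos_sim(q_emb, p_embs)[0]
for passage, score in sorted(zip(passages, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.3f}  {passage}")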

The current blockers are my limited bandwidth and my low expectation of the impact.

If the open-source community could contribute some cases where current embedding models, including the OpenAI API, "text2vec-large-chinese", or "bge-large-zh-v1.5", fail in our language, I would have a better estimate of how I can contribute.

Hi,

Thank you for the explanation. I will run some tests with your model.

However, I just want to share one thing that I learned from the authors of "bge-large-zh-v1.5": their embedding model was trained only on Simplified Chinese, not Traditional Chinese (https://huggingface.co/BAAI/bge-large-zh-v1.5/discussions/3#654b7ac2c5fd9382862542d4).

Therefore, I suspect "text2vec-large-chinese" does not support Traditional Chinese either.

Moreover, the OpenAI embedding models are a different matter, because my goal is to run everything locally on my PC.
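For instance, one quick way to produce the kind of failure case requested above is to embed the same sentence in Traditional and Simplified script and compare similarities. A minimal sketch, again assuming sentence-transformers and bge-large-zh-v1.5, with made-up example sentences; if the Traditional/Simplified pair scores barely above the unrelated pair, that is evidence of weak Traditional Chinese support:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("BAAI/bge-large-zh-v1.5")

traditional = "這個模型對繁體中文的支援如何？"  # Traditional script
simplified = "这个模型对繁体中文的支持如何？"   # same sentence, Simplified script
unrelated = "明天的足球比賽幾點開始？"          # unrelated sentence

embs = model.encode([traditional, simplified, unrelated], normalize_embeddings=True)
print("traditional vs simplified:", util.cos_sim(embs[0], embs[1]).item())
print("traditional vs unrelated: ", util.cos_sim(embs[0], embs[2]).item())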

It seems like good motivation for training Traditional Mandarin embedding models.

Hi,

Absolutely, yes. It is still a fairly new idea for now.

However, I believe many engineering labs in Taiwan will pay attention to this model, because it is really needed.

One more thing,

Is there any required prompt template for Taiwan-LLM-13B-v2.0-chat?

In my application, can I use the template as follows:

template = """You are a chatbot having a conversation with a human.

Given the following extracted parts of a long document and a question, create a final answer.

{context}

{chat_history}
Human: {human_input}
Chatbot:"""

from transformers import AutoTokenizer

# The tokenizer ships with the model's chat template, so apply_chat_template
# renders the turns into the exact prompt format the model was trained on.
tokenizer = AutoTokenizer.from_pretrained("yentinglin/Taiwan-LLM-7B-v2.0.1-chat")

chat = [
  # {"role": "system", "content": "你講中文"},  # optional system turn ("speak Chinese")
  {"role": "user", "content": "Hello, how are you?"},
  {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
  {"role": "user", "content": "I'd like to show off how chat templating works!"},
]

# add_generation_prompt=True appends the assistant prefix so the model
# continues with a fresh reply.
prompt_for_generation = tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)

Try this; it should work for all my models.
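If it helps, here is a sketch of running generation on the rendered prompt. It assumes transformers plus accelerate and enough GPU memory, and the sampling settings are arbitrary choices, not official recommendations:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "yentinglin/Taiwan-LLM-7B-v2.0.1-chat"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" requires the accelerate package.
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "你好，請自我介紹。"}],  # "Hello, please introduce yourself."
    tokenize=False, add_generation_prompt=True)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens after the prompt.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))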

Regarding the embedding model, it would be great if industry labs could sponsor me :)

Hi,

However, in my case I also need "context" and "chat_history", which hold my local data and the conversation history, respectively.

Thus, could you please tell me how I should put "context" and "chat_history" into your prompt template?

Thank you for your help!

Your template looks fine.
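For example, one way to fold them in is to put the instructions and the retrieved context into the system turn, then replay the history as alternating user/assistant turns. A minimal sketch, assuming chat_history is a list of (human, assistant) string pairs; build_prompt is a made-up helper name:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("yentinglin/Taiwan-LLM-7B-v2.0.1-chat")

def build_prompt(context, chat_history, human_input):
    # Put the instructions and retrieved context in the system turn.
    chat = [{"role": "system", "content":
             "You are a chatbot having a conversation with a human. "
             "Given the following extracted parts of a long document and a "
             "question, create a final answer.\n\n" + context}]
    # Replay the previous turns as alternating user/assistant messages.
    for human, assistant in chat_history:
        chat.append({"role": "user", "content": human})
        chat.append({"role": "assistant", "content": assistant})
    # End with the new question so the model generates the next reply.
    chat.append({"role": "user", "content": human_input})
    return tokenizer.apply_chat_template(chat, tokenize=False, add_generation_prompt=True)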

Thank you for your answer!
