license: apache-2.0
language:
  - th
  - en

Model Card for WangChanLion 7B - The Multilingual Instruction-Following Model

WangChanLion is a multilingual, instruction-following model finetuned from SEA-LION 7B, a model pretrained on Southeast Asian languages. It is finetuned on open-source, commercially permissible datasets sampled from LAION OIG chip2 and infill_dbpedia, DataBricks Dolly v2, OpenAI TL;DR, Hello-SimpleAI HC3, dolphin, iapp_wiki_qa_squad, thaisum, xlsum, scb_mt_enth_2020, han dataset, xp3x, and Open-Platypus, for a total of ~500k samples. Non-commercial datasets were filtered out. The model is released under the Apache 2.0 license. It is trained to perform a subset of instruction-following tasks we found most relevant: reading comprehension, brainstorming, and creative writing. In this model, we focus on Thai and English datasets. We perform Vicuna-style evaluation using human evaluation. In a similar manner to Dolly v2, we use only open-source, commercially permissive pretrained models and datasets. Our models are restricted neither by non-commercial clauses, as LLaMA-based models are, nor by non-compete clauses, as models that use self-instruct datasets from ChatGPT are.

  • Developers: PyThaiNLP and VISTEC-depa AI Research Institute of Thailand
  • Model type: SEA-LION 7B (MPT architecture)

Model Sources

Direct Use

Intended to be used as an instruction-following model for reading comprehension, brainstorming, and creative writing.

Downstream Use

The model can be finetuned for any typical instruction-following use cases.

Out-of-Scope Use

We do not expect the models to perform well on math problems, reasoning, or factuality.

Bias, Risks, and Limitations

We noticed limitations similar to those of other finetuned instruction followers, namely in math problems, reasoning, and factuality. Although the models do not perform at a level where we would expect them to be abused, they do contain undesirable biases and toxicity and should be further optimized for your particular use cases.

Recommendations

Users (both direct and downstream) should be made aware of the risks, biases, and limitations of the model. More information is needed for further recommendations.

How to Get Started with the Model

Use the code below to get started with the model.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# The repository ships custom MPT modeling code, so trust_remote_code=True is required.
tokenizer = AutoTokenizer.from_pretrained("airesearch/WangchanLion7B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "airesearch/WangchanLion7B",
    trust_remote_code=True,
    return_dict=True,
    load_in_8bit=True,  # 8-bit weights; requires the bitsandbytes package
    device_map="auto",
    torch_dtype=torch.float16,
    offload_folder="./",
    low_cpu_mem_usage=True,
)


def get_prompt(question: str, context: str = None) -> str:
    # Thai prompt template: พื้นหลัง = background, คำถาม = question, ตอบ = answer.
    if context is not None:
        return """พื้นหลัง:\n\n{context}\n\nคำถาม:{question}\n\nตอบ:""".format(context=context, question=question)
    return """คำถาม:{question}\n\nตอบ:""".format(question=question)


# English: "What happened at Tiananmen in 1989?"
question = "เกิดอะไรขึ้นที่เทียนอันเหมินตอนปี 1989"
full_prompt = get_prompt(question=question)
tokens = tokenizer(full_prompt, return_tensors="pt").to("cuda")
output = model.generate(
    input_ids=tokens["input_ids"],
    attention_mask=tokens["attention_mask"],
    max_new_tokens=256,
    early_stopping=True,
    top_k=50,
    top_p=0.95,
    do_sample=True,
    temperature=0.3,
    repetition_penalty=1.2,
    eos_token_id=tokenizer.eos_token_id,
)
# The decoded text contains the prompt followed by the generated answer.
print(tokenizer.decode(output[0], skip_special_tokens=True))
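
The `get_prompt` helper above also accepts a `context` argument, which fits the reading-comprehension use case. The following is a minimal sketch of what the resulting prompt looks like. The passage and question are made up for illustration, and the helper is redefined here only so the snippet runs on its own without loading the model.

```python
from typing import Optional


def get_prompt(question: str, context: Optional[str] = None) -> str:
    """Same template as above: พื้นหลัง = background, คำถาม = question, ตอบ = answer."""
    if context is not None:
        return "พื้นหลัง:\n\n{context}\n\nคำถาม:{question}\n\nตอบ:".format(
            context=context, question=question
        )
    return "คำถาม:{question}\n\nตอบ:".format(question=question)


# Hypothetical passage and question, for illustration only.
context = "WangchanLion is finetuned from SEA-LION 7B."
question = "WangchanLion is finetuned from which model?"

# With a context, the prompt starts with the background block and ends with
# the answer cue, so the model continues from "ตอบ:".
prompt = get_prompt(question=question, context=context)
print(prompt)
```

The returned string can be passed to the tokenizer in place of `full_prompt`.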

Training Details

Training Data

Finetuning datasets are sourced from LAION OIG chip2 and infill_dbpedia (Apache-2.0), DataBricks Dolly v2 (Apache-2.0), OpenAI TL;DR (MIT), Hello-SimpleAI HC3 (CC-BY SA), dolphin, iapp_wiki_qa_squad, thaisum, xlsum, scb_mt_enth_2020, han dataset, xp3x, and Open-Platypus.

Training regime

  • QLoRA on 4 A100 (40GB) GPUs

Evaluation

We performed human and machine evaluations in zero-shot and one-shot settings on XQuAD and iAPP Wiki QA:

XQuAD

| Model | Exact Match (Zero-shot) | F1 (Zero-shot) | Exact Match (One-shot) | F1 (One-shot) |
|---|---|---|---|---|
| openthaigpt7B | 18.5714 | 28.4002 | 30.4202 | 39.7556 |
| SeaLLM7B | - | - | - | 44.43 |
| Typhoon-7b | 23.8655 | 36.27 | 46.7227 | 57.898 |
| WangchanLion7B | 37.563 | 49.8432 | 39.2437 | 51.0627 |

iAPP Wiki QA

| Model | Exact Match (Zero-shot) | F1 (Zero-shot) | Exact Match (One-shot) | F1 (One-shot) |
|---|---|---|---|---|
| openthaigpt7B | 22.0568 | 40.0696 | 31.3938 | 47.9775 |
| SeaLLM7B | 8.2544 | 34.4038 | 40.0541 | 58.2673 |
| Typhoon-7b | 27.3342 | 46.2938 | 43.3018 | 59.9434 |
| WangchanLion7B | 55.4804 | 67.9262 | 56.4276 | 68.8471 |
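
This card does not specify how Exact Match and F1 were computed. As a rough sketch, assuming SQuAD-style definitions, Exact Match checks whether the predicted answer equals the reference after whitespace normalization, and F1 is the harmonic mean of token-level precision and recall. The function names and the character-level default tokenizer below are assumptions for illustration; an actual evaluation of Thai text would more likely plug in a word tokenizer.

```python
from collections import Counter
from typing import Callable, List


def char_tokens(text: str) -> List[str]:
    """Assumed default tokenizer: every non-space character is a token."""
    return [ch for ch in text if not ch.isspace()]


def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the strings are equal after trimming and collapsing whitespace."""
    def normalize(s: str) -> str:
        return " ".join(s.split())
    return float(normalize(prediction) == normalize(reference))


def f1_score(
    prediction: str,
    reference: str,
    tokenize: Callable[[str], List[str]] = char_tokens,
) -> float:
    """Token-overlap F1 between a predicted and a reference answer."""
    pred = tokenize(prediction)
    ref = tokenize(reference)
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    # Multiset intersection counts each shared token at most as often
    # as it appears on both sides.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


# Whitespace tokenization on an English pair: 2 shared tokens,
# precision 2/3, recall 1.
print(f1_score("the cat sat", "the cat", tokenize=str.split))
print(exact_match(" Bangkok", "Bangkok"))
```

Per-example scores in the range 0 to 1 would then be averaged over the dataset and multiplied by 100 to reach the scale used in the tables above.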

What WangchanLion offers:

  • Transparent pretrained model: The development of SEA-LION is community-driven, with different ASEAN collaborators contributing pretraining datasets. The SEA-LION developers ensure that all datasets are safe and can be utilized without commercial restrictions. This transparency extends to the provision of pretraining code, ensuring anyone can replicate SEA-LION using the provided datasets.
  • Transparent finetuning data: In the spirit of open science, we make the finetuning data for WangchanLion accessible to all. This commitment to openness empowers the community by providing complete visibility into the instruction finetuning data that shapes WangchanLion.
  • Transparent finetuning code: The finetuning code for WangchanLion is readily available for distribution. By sharing our methods and processes, we invite others to learn from, build upon, and innovate alongside us.