1. KoRnDAlpaca-Polyglot-12.8B (v1.3)

  • KoRnDAlpaca is a Korean language model fine-tuned on 1 million instruction examples (R&D Instruction Dataset v1.3) generated from Korean national research reports.
  • The base model of KoRnDAlpaca is EleutherAI/polyglot-ko-12.8b.
  • For more information about the training procedure and model, please contact gsjang@kisti.re.kr.

2. How to use the model

from transformers import pipeline, AutoModelForCausalLM
import torch

LLM_MODEL = "NTIS/KoRnDAlpaca-Polyglot-12.8B"
# "Which are the representative domestic companies in intelligent video surveillance technology?"
query = "์ง€๋Šฅํ˜• ์˜์ƒ๊ฐ์‹œ ๊ธฐ์ˆ ์˜ ๋Œ€ํ‘œ์ ์ธ ๊ตญ๋‚ด ๊ธฐ์—…์€?"

# Load the model in half precision; device_map="auto" spreads the weights
# across the available GPUs.
llm_model = AutoModelForCausalLM.from_pretrained(
    LLM_MODEL,
    device_map="auto",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    # load_in_8bit=True,   # optional: 8-bit loading (see the sketch below)
    # revision="8bit",
)
pipe = pipeline(
    "text-generation",
    model=llm_model,
    tokenizer=LLM_MODEL,  # the tokenizer is loaded from the same repository
    # device=2,           # optional: pin the pipeline to a specific GPU
)

# The model expects the "### ์งˆ๋ฌธ: ... ### ๋‹ต๋ณ€:" ("### Question: ... ### Answer:")
# prompt format it was trained on.
ans = pipe(
    f"### ์งˆ๋ฌธ: {query}\n\n### ๋‹ต๋ณ€:",
    do_sample=True,
    max_new_tokens=512,
    temperature=0.1,   # low temperature keeps answers close to the training data
    top_p=0.9,
    return_full_text=False,
    eos_token_id=2,
)
msg = ans[0]["generated_text"]

# Keep only the text before the next "###" marker; if nothing remains, fall
# back to an apology ("Sorry, we could not provide an answer.").
output = msg.split('###')[0]
if len(output) == 0:
    output = '๋‹ต๋ณ€์„ ๋“œ๋ฆฌ์ง€ ๋ชปํ•˜์—ฌ ์ฃ„์†กํ•ฉ๋‹ˆ๋‹ค.'

print(output)
# ๊ตญ๋‚ด ์ง€๋Šฅํ˜• ์˜์ƒ๊ฐ์‹œ ๊ธฐ์ˆ ์˜ ๋Œ€ํ‘œ์ ์ธ ๊ธฐ์—…์œผ๋กœ๋Š” ํ•œํ™” ํ…Œํฌ์œˆ์ด ์žˆ๋‹ค.
# ("A representative domestic company in intelligent video surveillance technology is Hanwha Techwin.")
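
The commented-out load_in_8bit / revision="8bit" arguments above hint at an 8-bit variant of the checkpoint. If such a revision exists (an assumption here, as is the exact branch name), loading it might look like the following sketch, which reuses LLM_MODEL and the imports from the snippet above; it requires the bitsandbytes package and roughly halves GPU memory relative to float16.

# Sketch: 8-bit loading; assumes bitsandbytes is installed and that the
# repository actually provides an "8bit" revision.
llm_model_8bit = AutoModelForCausalLM.from_pretrained(
    LLM_MODEL,
    device_map="auto",
    load_in_8bit=True,   # quantize the weights to int8 via bitsandbytes
    revision="8bit",
    low_cpu_mem_usage=True,
)
pipe_8bit = pipeline(
    "text-generation",
    model=llm_model_8bit,
    tokenizer=LLM_MODEL,
)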

3. R&D Instruction Dataset v1.3

  • The dataset was built from 30,000 original research reports from the last five years, provided by KISTI (curation.kisti.re.kr).
  • The dataset cannot be released at this time due to licensing issues (a public release will be discussed in the future).
  • The process of building the dataset is as follows:
    • A. Extract important technology-related texts, such as technology trends and technology definitions, from the research reports.
    • B. Preprocess the extracted text.
    • C. Generate question-answer pairs (1.5 million in total) from the extracted text using the ChatGPT API (a temporary measure, scheduled to be replaced with our own question-and-answer generation model in '23.11).
    • D. Reformat the dataset into (Instruction, Output, Source) triples, where 'Instruction' is the user's question, 'Output' is the answer, and 'Source' is the identification code of the research report the answer is based on (see the sketch after this list).
    • E. Remove low-quality data with a data quality evaluation module; only the high-quality Q&A pairs (1 million) are used for training.
      • โ€ป In KoRnDAlpaca v2 (planned for '23.10), instruction data for generating long-form technology trend reports will be added alongside the Q&A data.
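
As an illustration of the (Instruction, Output, Source) format from step D, a single record might look like the sketch below; the field values and the report-code format are hypothetical, since the dataset itself is not released.

# Hypothetical example of one (Instruction, Output, Source) record; the field
# values and the report-code format are illustrative only.
record = {
    "instruction": "์ง€๋Šฅํ˜• ์˜์ƒ๊ฐ์‹œ ๊ธฐ์ˆ ์ด๋ž€ ๋ฌด์—‡์ธ๊ฐ€?",  # user's question ("What is intelligent video surveillance technology?")
    "output": "์ง€๋Šฅํ˜• ์˜์ƒ๊ฐ์‹œ ๊ธฐ์ˆ ์€ ์˜์ƒ์„ ๋ถ„์„ํ•˜์—ฌ ...",  # answer derived from the report
    "source": "TRKO200000000001",  # research report identification code (hypothetical)
}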

4. Future plans

  • 23.10: Release KoRnDAlpaca v2 (adds the ability to generate long-form technology trend information in Markdown format)
  • 23.12: Release NTIS-searchGPT module v1 (Retriever + KoRnDAlpaca v3)
    • โ€ป An R&D-specific open-domain question-answering module with a "Retriever + Generator" structure (see the sketch after this list)
    • โ€ป NTIS-searchGPT v1 is an early edition; performance improvements are planned for 2024.
  • 23.12: KoRnDAlpaca v2 will be applied to the chatbot of NTIS (www.ntis.go.kr)
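
NTIS-searchGPT's internal design is not public, so the following is only a generic retriever-plus-generator loop; the retriever interface, the "### ์ฐธ๊ณ :" ("### Reference:") context tag, and all component names are assumptions, not the actual system.

# Illustrative retriever + generator loop; every component name here is an
# assumption, not the actual NTIS-searchGPT design.
def answer(query: str, retriever, generator_pipe, top_k: int = 3) -> str:
    # 1) Retrieve the report passages most relevant to the query.
    passages = retriever.search(query, top_k=top_k)
    context = "\n".join(p["text"] for p in passages)
    # 2) Condition the generator (e.g. KoRnDAlpaca) on the retrieved context.
    prompt = f"### ์ฐธ๊ณ : {context}\n\n### ์งˆ๋ฌธ: {query}\n\n### ๋‹ต๋ณ€:"
    out = generator_pipe(prompt, max_new_tokens=512, return_full_text=False)
    return out[0]["generated_text"].split("###")[0]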

5. Date of last update

  • 2023.08.31
