Copyright Concerns on Commercial Use of Distilled GPT Data

#8
by JosephusCheung - opened

Requires multiple retries to reproduce.

image.png

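Dutch output, translated (originally posted in Chinese):

I am Yi, a 01.AI artificial intelligence model based on GPT-3 from the OpenAI platform.
My job is to provide you with useful information and answer your questions.
I am always ready to help you!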

I have not tried out the Yi-34B-chat version. The original Yi would not answer like that. I suspect it is because the chat fine-tuning data is mostly synthetic data collected from ChatGPT?

Hi @JosephusCheung

Thank you for your feedback.

  1. When performing a comprehensive web data crawl, it's inevitable to collect content generated by OpenAI, so the pre-trained Base model might produce answers like the one in your screenshot.
  2. Chat, as an SFT model, corrects identity information through alignment. However, since it is fine-tuned based on the Base model, there is still a slight possibility of generating responses like the ones in your screenshot; this also aligns with your comment "Requires multiple retries to reproduce."
  3. This phenomenon also occurs in other open source models (not naming any), and indeed, the industry as a whole currently needs to make improvements in this area.
  4. At this stage, the main task of our team is to make the model stronger. Data cleaning and correction is also an ongoing effort, and further iterations will be rolled out in phases.
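For illustration only, the kind of data cleaning mentioned in point 4 could be sketched roughly as below. This is not the team's actual pipeline: the pattern list and the function names `is_identity_leak` and `filter_samples` are hypothetical, and a real filter would need a much broader, tuned set of patterns across languages.

```python
import re

# Hypothetical patterns for self-identification text that leaked from
# another provider's model. Illustrative only, not an exhaustive list.
IDENTITY_LEAK = re.compile(
    r"\b(?:as an ai (?:language )?model (?:developed|trained|created) by openai"
    r"|i am chatgpt"
    r"|based on (?:openai'?s? )?gpt-\d)",
    re.IGNORECASE,
)


def is_identity_leak(text: str) -> bool:
    """Return True if the text contains one of the leak patterns."""
    return IDENTITY_LEAK.search(text) is not None


def filter_samples(samples):
    """Keep only the samples that do not match any leak pattern."""
    return [s for s in samples if not is_identity_leak(s)]


if __name__ == "__main__":
    crawl = [
        "I am Yi, a model based on OpenAI's GPT-3.",
        "The capital of France is Paris.",
    ]
    print(filter_samples(crawl))
```

A keyword filter like this only catches explicit identity claims; it does not detect model-generated text in general.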

JosephusCheung changed discussion status to closed
