Will the training data, or information about the data preparation, be released?
I wonder what the 600B code pretraining data consists of.
I guess their pretraining data format is also based on ChatGLM and LLaMA. At least, that is what I take from the relevant passage in the CodeGeeX2 introduction.
Here is an example of a LLaMA-based data format:
[
{
"instruction": "That sounds great. In what areas may artificial intelligence face challenges?",
"input": "",
"output": "The challenges faced by artificial intelligence include data privacy, security, and ethical issues, as well as automation issues that affect job opportunities.",
"history": [
["Hello, can you help me answer a question?", "Of course, what's the problem?"],
["I want to understand the future development direction of artificial intelligence. Do you have any ideas?", "The future development direction of artificial intelligence may include more powerful machine learning algorithms, more advanced natural language processing technologies, and more intelligent robots."]
]
}