Article: An Analysis of Chinese LLM Censorship and Bias with Qwen 2 Instruct, by leonardlin (Jun 11)
Post: One of the biggest changes in Llama 3 was the training dataset, which grew 7x over Llama 2 (2T to 15T tokens). While Meta did not open-source the dataset, it sparked a thought: what would happen if everyone had access to a big, high-quality dataset? To address that, in April this year @huggingface released FineWeb, a 15T-token open-source dataset. And now they are releasing the FineWeb technical report and FineWeb-Edu:
- 15T tokens in FineWeb, outperforming other open datasets
- 1.3T tokens in FineWeb-Edu, the highest-quality educational subset
- 5.4T high-quality educational tokens in FineWeb-Edu-score-2
- FineWeb-Edu outperforms other datasets on MMLU, ARC, and OpenBookQA
- ODC-By 1.0 license
Report: HuggingFaceFW/blogpost-fineweb-v1
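For anyone who wants to poke at the data without downloading 15T tokens, here is a minimal sketch of streaming FineWeb with the `datasets` library. The `sample-10BT` config name comes from the FineWeb dataset card and is an assumption here; it may change, and the same call should work for the educational subset by swapping in the `HuggingFaceFW/fineweb-edu` repo id.

```python
# Minimal sketch: stream a small FineWeb sample instead of materializing it.
# Assumes the "sample-10BT" config listed on the HuggingFaceFW/fineweb card;
# swap the repo id for "HuggingFaceFW/fineweb-edu" to explore the edu subset.
from datasets import load_dataset

fineweb = load_dataset(
    "HuggingFaceFW/fineweb",
    name="sample-10BT",   # small sample config; full data is split per CC crawl
    split="train",
    streaming=True,       # iterate lazily over shards hosted on the Hub
)

# Inspect a few records: each row carries the raw text plus crawl metadata.
for i, row in enumerate(fineweb):
    print(row["text"][:200].replace("\n", " "), "...")
    if i == 2:
        break
```

Note that the dataset ships under ODC-By 1.0, so downstream use needs attribution per the license terms.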